A Terminological Inventory for Biodiversity
Description
To construct the inventory, we first compiled a species name dictionary by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). The terms in this dictionary were then located within the text of English-language Biodiversity Heritage Library (BHL) documents (about 24 million pages of text) using a string matching method. We then learned vector representations of the identified terms using three different approaches, namely count-based, prediction-based and compositional distributional semantic models (DSMs). These approaches compute vector representations for both single-word and multi-word terms. The cosine similarity between two vectors serves as an indicator of the semantic relatedness between the corresponding terms: the higher the cosine similarity, the greater the relatedness of the two terms. Finally, for a given term, we selected the top-k candidates ranked by cosine similarity as its most semantically related terms.
The inventory contains 288,562 species names, each of which occurs at least five times in BHL documents. For each term in the inventory, the 20 most semantically similar terms are provided, together with their corresponding similarity scores. To facilitate further digital biodiversity processes, each term is also linked to its URI, UUID and LSID indexed by Global Names.
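The cosine-similarity ranking described above can be illustrated with a minimal sketch. It is not the code used to build the inventory: the function names, the three-dimensional toy vectors and the choice of k are assumptions made purely for illustration, and real DSM vectors would have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_related(term, vectors, k=20):
    """Rank all other terms by cosine similarity to `term` and keep the top k."""
    query = vectors[term]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (invented for illustration, not taken from the inventory).
toy_vectors = {
    "Quercus robur": [0.9, 0.1, 0.0],
    "Quercus petraea": [0.8, 0.2, 0.1],
    "Salmo salar": [0.0, 0.1, 0.9],
}

related = top_k_related("Quercus robur", toy_vectors, k=2)
for name, score in related:
    print(f"{name}\t{score:.3f}")
```

In this toy example the two oak names point in nearly the same direction and so rank above the salmon name, which mirrors how higher cosine similarity is read as greater relatedness.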
A search interface that uses the inventory as metadata for query expansion is available at http://nactem.ac.uk/BHLQueryExpansion/.
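As a rough illustration of how related terms might be used for query expansion, the sketch below adds the highest-scoring related names to a query term. The in-memory layout of the inventory (a dictionary mapping each term to a list of name and score pairs, sorted by descending score), the `expand_query` function, its thresholds and the example scores are all hypothetical; the actual search interface may work differently.

```python
def expand_query(term, inventory, min_score=0.5, max_terms=5):
    """Return the query term followed by its most related terms.

    `inventory` is assumed to map a term to a list of (name, score)
    pairs already sorted from most to least similar.
    """
    related = inventory.get(term, [])
    kept = [name for name, score in related if score >= min_score]
    return [term] + kept[:max_terms]


# Toy inventory entry; the scores are invented for illustration.
toy_inventory = {
    "Quercus robur": [
        ("Quercus pedunculata", 0.91),
        ("Quercus petraea", 0.84),
        ("Fagus sylvatica", 0.42),
    ],
}

expanded = expand_query("Quercus robur", toy_inventory)
print(" OR ".join(f'"{name}"' for name in expanded))
```

A term that is absent from the inventory is returned unchanged, so the expansion step never removes anything from the original query.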
Availability
The inventory is available to download. Please observe the terms and conditions of the licence (see below).
Licence
The Terminological Inventory for Biodiversity was created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK.
It is licensed under a Creative Commons Attribution 4.0 International License.
Please attribute NaCTeM when using the inventory and cite the following paper:
Nguyen, N. T. H., Soto, A., Kontonatsios, G., Batista-Navarro, R. and Ananiadou, S. (2017). Constructing a Biodiversity Terminological Inventory. PLOS ONE, 12(4), e0175277.