A Terminological Inventory for Biodiversity


To construct the inventory, we first compiled a species name dictionary by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). The terms in this dictionary were then located within the text of English-language Biodiversity Heritage Library (BHL) documents (about 24 million pages) using a string matching method. We then learned vector representations of the identified terms using three different types of distributional semantic model (DSM): count-based, prediction-based and compositional. These approaches compute vector representations for both single-word and multi-word terms. The cosine similarity between two vectors serves as an indicator of the semantic relatedness between the corresponding terms: the higher the cosine similarity, the more closely related the two terms. Finally, for a given term, we selected the top-k candidates as its most semantically related terms.
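The ranking step described above can be illustrated with a minimal sketch. It is not the code used to build the inventory: the function names (cosine_similarity, top_k_related) and the three-dimensional toy vectors are hypothetical and chosen only for illustration, whereas the real vectors were learned from BHL text by the DSMs.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_related(query, vectors, k=20):
    """Rank every other term by cosine similarity to `query`, keep the best k."""
    query_vec = vectors[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, not taken from the inventory.
toy_vectors = {
    "Quercus robur": [0.9, 0.1, 0.0],
    "Quercus petraea": [0.8, 0.2, 0.1],
    "Panthera leo": [0.0, 0.1, 0.9],
}

for term, score in top_k_related("Quercus robur", toy_vectors, k=2):
    print(f"{term}\t{score:.3f}")
```

With these toy vectors, the two oak species rank as most closely related to each other, while the unrelated term receives a score close to zero. The default k=20 mirrors the number of similar terms stored per entry in the inventory.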

The inventory contains 288,562 species names, each of which occurs at least five times in BHL documents. For each term in the inventory, the 20 most semantically similar terms are provided, together with their similarity scores. To facilitate further digital biodiversity processes, each term is also linked to its URI, UUID and LSID as indexed by Global Names.

A search interface that uses the inventory as metadata for query expansion is available at


The inventory is available to download. Please observe the terms and conditions of the licence (see below).


Creative Commons License
The Terminological Inventory for Biodiversity was created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the inventory and cite the following paper:

Nguyen, N. T. H., Soto, A., Kontonatsios, G., Batista-Navarro, R. and Ananiadou, S. (2017). Constructing a Biodiversity Terminological Inventory. PLOS ONE, 12(4), e0175277.