COnserving Philippine bIOdiversity by UnderStanding big data (COPIOUS): Integration and analysis of heterogeneous information on Philippine biodiversity


The collaborative project aims to advance the means by which Philippine biodiversity information is being collected and published, through the construction of an online knowledge repository whose content will be semi-automatically curated by text mining-based analytics. Currently, information on Philippine biodiversity is largely fragmented due to the siloed formats in which different local institutions store their data. Scientific literature offers invaluable information that can fill in the knowledge gaps, but because of its overwhelming volume and lack of structure, its thorough manual examination has become impossible. Consequently, a comprehensive body of knowledge on Philippine biodiversity remains unavailable, hampering the timely formulation of environmental policies, and the discovery of new natural products that can potentially provide medicinal benefits.

Project Aims

The project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will bring together different types of information (e.g., taxonomic, occurrence, ecological, biomolecular and biochemical), giving users a comprehensive view of species of interest that will allow them to (1) carry out predictive analysis of species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.

Project Framework

In order to construct such a repository, several advanced text mining technologies will be applied to biodiversity documents. Most of these documents, e.g., legacy literature in the Biodiversity Heritage Library, have undergone optical character recognition (OCR) and thus contain a significant amount of noise. We will therefore apply a rule-based approach to clean up the text. We will then incorporate active learning methods into Argo, a Web-based text mining workbench, to extract targeted named entities and relations from the documents. All extracted information will be combined with structured information sourced from various Philippine biodiversity research groups, and stored in a database over which a search engine will be built to facilitate knowledge discovery.
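As a rough illustration of what rule-based clean-up of OCR output can look like, the Python sketch below applies four simple rules. The rules and the function name clean_ocr_text are assumptions made for this example; they are not the project's actual rule set.

```python
import re

# Illustrative rules only; the project's actual rule set is not described here.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def clean_ocr_text(text: str) -> str:
    """Apply a few simple rules to reduce common OCR noise."""
    # Rule 1: expand typographic ligatures that OCR engines often emit.
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    # Rule 2: rejoin words hyphenated across a line break ("diptero-\ncarp").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Rule 3: turn remaining single line breaks into spaces, keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Rule 4: collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(clean_ocr_text("The diptero-\ncarp  forest\nof Luzon"))
```

Note that Rule 2 would also merge a genuine hyphenated compound that happens to break across lines, so a real rule set would probably need a lexicon check before joining.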

It is intended that the repository will support the Philippine government's efforts to conserve the country's natural resources, which in turn can translate into benefits for the Philippine population in terms of ecosystem resilience and access to alternative medicines.

Our resources will be made publicly available wherever possible.

Award Nomination

COPIOUS features in the Better World Showcase 2018 and has been shortlisted as a potential recipient of the People's Vote award. The showcase celebrates the contribution that the Faculty of Science and Engineering makes to social and environmental impact. It highlights the efforts of staff and students who are 'making a difference', in the hope of inspiring others to do the same. COPIOUS features as a project that is bringing outstanding benefit to society through research. You can find and vote for us at the virtual showcase:

Related Work

The project is closely related to the Mining Biodiversity project, which aims to enrich the Biodiversity Heritage Library with automatically generated semantic metadata (e.g., terms, entities and events). Unlike Mining Biodiversity, however, COPIOUS focusses on Philippine species and attempts to directly exploit the extracted information in use cases relevant to the discovery of alternative medicines and the preservation of natural resources.

As in Mining Biodiversity, we will develop our system using Argo, which allows text mining pipelines to be built and evaluated with minimal effort.


Our work on COPIOUS has been presented in the following:

A talk entitled "Developing a knowledge base on the habitats and reproductive conditions of Dipterocarps through information extraction" at the Annual Conference of Biodiversity Information Standards (TDWG) 2017 in Ottawa, Canada.

A talk entitled "Argo as a platform for integrating distinct biodiversity analytics tools into workflows for building graph databases" at the Annual Conference of Biodiversity Information Standards (TDWG) 2017 in Ottawa, Canada.

Text mining tools and infrastructure for biomedical applications: cancer biology, history of medicine, monitoring biodiversity. Lecture delivered by Sophia Ananiadou, 5th April 2016, Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece.

A talk entitled "Enhancing Semantic Search through the Automatic Construction of a Biodiversity Terminological Inventory" at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.

A talk entitled "Understanding mass flowering of dipterocarps through semantic occurrence information extraction" at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.

Re-usable text mining workflows for advanced search. Invited talk delivered by Sophia Ananiadou at the First Workshop on Text Mining in Natural Sciences (TMINS-1): Exploring Text Mining in Marine, Climate and Environmental Science, 12-13th November 2015, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.


Nhung T.H. Nguyen, Rosalyn Gabud, Sophia Ananiadou (2019). COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal, 7:e29626.

Sabado, A.J., Solano, G., Nicolas, M., Batista-Navarro, R., Gabud, R., Hilomen, V. Modelling the Coverage of Dipterocarp Trees in Central Visayas, Philippines. IEEExplore 2017 IEEE 13th International Conference on e-Science, pp. 561-565.

Gabud, R.S., Batista-Navarro, R., Mariano, V., Mendoza, R., Yap, S. Literature Mining on Dipterocarps: Towards Better Informed Regeneration and Reforestation in Luzon, Philippines. 1st International Conference on Integrated Natural Resources and Environment Management. 21-23 February, 2017, Manila, Philippines.

Maolin Li, Nhung Nguyen, Sophia Ananiadou (2017). Proactive Learning for Named Entity Recognition. In Proceedings of the BioNLP Workshop 2017, pp. 117-125.

Nguyen NTH, Soto AJ, Kontonatsios G, Batista-Navarro R, Ananiadou S (2017) Constructing a biodiversity terminological inventory. PLoS ONE 12(4): e0175277.

Batista-Navarro R., Zerva C., Nguyen N.T.H., Ananiadou S. (2017) A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura J., Alatrista-Salas H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer


Two models, one for Taxon detection and one for Habitat detection, have been incorporated into the NERSuite Custom Tagger component of Argo, in which users can select the model they would like to apply to their text. We have also compiled two dictionaries that can be used to automatically ground a detected Taxon or Habitat, i.e., to assign an identifier to it. The dictionary for grounding Taxon entities was created by collecting available names from the Catalogue of Life (CoL). For Habitat entities, we constructed the dictionary by extracting all terms provided by the Environment Ontology (ENVO). Given an ID provided by CoL or ENVO, we can link the entity back to its original source. The two dictionaries are available in an Argo component named Concept Normaliser. A demonstration workflow, named COPIOUS-Taxon and Habitat, has been made publicly available at
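To make the idea of grounding concrete, the Python sketch below looks up a detected mention in a name-to-identifier dictionary. It is not the Concept Normaliser implementation: the function names are hypothetical, and the identifiers are placeholders rather than real CoL IDs.

```python
from typing import Dict, Optional


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so lookups tolerate small variations."""
    return " ".join(name.lower().split())


def build_dictionary(entries: Dict[str, str]) -> Dict[str, str]:
    """Index a name-to-identifier mapping by normalised name."""
    return {normalise(name): identifier for name, identifier in entries.items()}


def ground(mention: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the identifier for a detected mention, or None if it is unknown."""
    return dictionary.get(normalise(mention))


# Placeholder identifiers for illustration; these are not real CoL IDs.
taxon_dictionary = build_dictionary({
    "Shorea contorta": "COL:EXAMPLE-1",
    "Dipterocarpus grandiflorus": "COL:EXAMPLE-2",
})

print(ground("shorea  CONTORTA", taxon_dictionary))
print(ground("Unknown species", taxon_dictionary))
```

Exact matching on a normalised string is the simplest possible strategy. Handling abbreviated genus names or OCR-damaged spellings would need approximate matching, which this sketch does not attempt.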


Terminological inventory

We have compiled a terminological inventory for biodiversity by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). More information is available at

A visual text analytics search system for biodiversity that uses the terminological inventory as metadata for query expansion is available at
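Query expansion with a terminological inventory can be pictured as follows. This Python sketch is only an assumption about the general technique, not the search system's code: the function expand_query is hypothetical and the inventory entry is illustrative rather than taken from the project's inventory.

```python
from typing import Dict, List, Set


def expand_query(query: str, inventory: Dict[str, Set[str]]) -> List[str]:
    """Return the query followed by any variant names the inventory lists for it."""
    original = query.strip()
    key = original.lower()
    variants = inventory.get(key, set())
    # Sort the variants so the output order is deterministic.
    return [original] + sorted(v for v in variants if v.lower() != key)


# Illustrative entry only; keys are lower-cased names, values are variant names.
inventory = {
    "white lauan": {"Shorea contorta", "Pentacme contorta"},
}

print(expand_query("White Lauan", inventory))
print(expand_query("Apitong", inventory))
```

A search engine could then retrieve documents matching any of the returned terms, so that a query by common name also finds literature that uses a scientific name.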

COPIOUS corpus

To support further tasks in biodiversity text mining, e.g., named entity recognition and species distribution extraction, we have constructed a corpus with five categories of entities: taxon, geographical location, habitat, temporal expression, and person. The complete corpus, available in brat format, can be downloaded here.
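For readers unfamiliar with brat, annotations are stored as standoff lines in .ann files, with entity lines of the form ID, tab, label and character offsets, tab, covered text. The Python sketch below reads such entity lines. The sample annotations and the label spellings "Taxon" and "Habitat" are assumptions for illustration and may differ from the labels used in the released corpus.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]
    text: str


def parse_brat_entities(ann_content: str) -> List[Entity]:
    """Extract text-bound (entity) annotations from the content of a .ann file."""
    entities = []
    for line in ann_content.splitlines():
        # Entity annotations have IDs starting with "T"; skip notes, relations, etc.
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        # Discontinuous entities list several "start end" pairs separated by ";".
        for pair in offsets.split(";"):
            start, end = pair.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Hypothetical annotations for "Shorea contorta grows in lowland rainforest".
sample = (
    "T1\tTaxon 0 15\tShorea contorta\n"
    "T2\tHabitat 25 43\tlowland rainforest\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)

for entity in parse_brat_entities(sample):
    print(entity.label, entity.spans, entity.text)
```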

If you use the corpus, please cite the following publication:

Nhung T.H. Nguyen, Rosalyn Gabud, Sophia Ananiadou (2019). COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal, 7:e29626.

DipteroMine corpus

This corpus was constructed to support the extraction of information from documents related to Philippine dipterocarps. Specifically, we target two main case studies: (1) analysing species distribution and (2) detecting phenological patterns of Philippine dipterocarps. One of the most important resources needed for these case studies is a corpus labelled with ten types of biodiversity entities: taxon, geographic location, habitat, habitat attribute, habitat attribute value, temporal expression, person, reproductive condition, specimen location, and specimen number. A sample of the corpus in brat format can be downloaded here.

The Philippine Atlas of Biodiversity

The Philippine Atlas of Biodiversity (PhAB) is a biodiversity information platform built upon the Atlas of Living Australia (ALA) e-infrastructure. PhAB aims to support comprehensive collaborative research, environmental monitoring, information dissemination, biosecurity activities, and long-term conservation planning. It also helps to educate Filipinos, both in and out of school, on the state of Philippine biodiversity and on the importance of each individual's contribution to conservation efforts. PhAB provides online access to information about the Philippines' mega-diverse resources, enabling high-quality research and innovation outcomes that address national and global challenges.

Project Team

United Kingdom

School of Computer Science, University of Manchester
Principal Investigator: Prof. Sophia Ananiadou
Researchers: Dr. Nhung Nguyen, Dr. Axel Soto, Mr. Paul Thompson


Philippines

Department of Physical Sciences and Mathematics, University of the Philippines Manila
Principal Investigator: Prof. Marilou Nicolas

Department of Computer Science, University of the Philippines Diliman
Co-Investigator: Dr. Prospero Naval

Biodiversity Management Bureau, Department of Environment and Natural Resources
Co-Investigator: Dr. Vincent Hilomen


This project is being funded by the British Council for two years from March 31, 2015.

Further information