
COnserving Philippine bIOdiversity by UnderStanding big data (COPIOUS): Integration and analysis of heterogeneous information on Philippine biodiversity

Background

The collaborative project aims to advance the means by which Philippine biodiversity information is being collected and published, through the construction of an online knowledge repository whose content will be semi-automatically curated by text mining-based analytics. Currently, information on Philippine biodiversity is largely fragmented due to the siloed formats in which different local institutions store their data. Scientific literature offers invaluable information that can fill in the knowledge gaps, but because of its overwhelming volume and lack of structure, its thorough manual examination has become impossible. Consequently, a comprehensive body of knowledge on Philippine biodiversity remains unavailable, hampering the timely formulation of environmental policies, and the discovery of new natural products that can potentially provide medicinal benefits.

Project Aims

The project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will be a synergy of different types of information, e.g., taxonomic, occurrence, ecological, biomolecular, biochemical, thus providing users with a comprehensive view on species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.

Project Framework

In order to construct such a repository, several advanced text mining technologies will be applied to biodiversity documents. Most of these documents, e.g., legacy literature in the Biodiversity Heritage Library, have undergone optical character recognition (OCR) and thus contain a significant amount of noise. We will therefore apply a rule-based approach to clean up the text. We will then incorporate active learning methods into Argo, a Web-based text mining workbench, to extract targeted named entities and relations from the documents. All extracted information will be combined with structured information sourced from various Philippine biodiversity research groups, and stored in a database over which a search engine will be built to facilitate knowledge discovery.
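As an illustration of what rule-based OCR clean-up can look like, the following is a minimal sketch in Python. The rules shown (rejoining words hyphenated across line breaks, expanding ligatures, collapsing whitespace) and the name clean_ocr_text are assumptions chosen for the example; the project's actual rule set is not described on this page.

```python
import re

# Hypothetical clean-up rules, applied in order. These are illustrative
# assumptions, not the rules used in the COPIOUS project.
RULES = [
    # Rejoin words that were hyphenated across a line break ("spe-\ncies").
    (re.compile(r"(\w)-\s*\n\s*(\w)"), r"\1\2"),
    # Expand the "fi" and "fl" typographic ligatures often left by OCR.
    (re.compile("\ufb01"), "fi"),
    (re.compile("\ufb02"), "fl"),
    # Collapse runs of whitespace (including line breaks) into one space.
    (re.compile(r"\s+"), " "),
]


def clean_ocr_text(text: str) -> str:
    """Apply each rule in turn and trim leading/trailing whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    noisy = "The spe-\ncies was  \ufb01rst\ncollected"
    print(clean_ocr_text(noisy))
```

Note that the hyphen rule would also merge genuine hyphenated compounds that happen to break across lines, so a real rule set would likely need a lexicon check before joining.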

It is intended that the repository will support the Philippine government's efforts to conserve the country's natural resources, which in turn can translate to benefits for the Philippine population, in terms of ecosystem resilience and access to alternative medicines.

Our resources will be made publicly available wherever possible.

Related Work

It is worth noting that the project is closely related to the Mining Biodiversity project, which aims to enrich the Biodiversity Heritage Library with automatically generated semantic metadata (e.g., terms, entities and events). Unlike Mining Biodiversity, however, COPIOUS focusses on Philippine species and attempts to directly exploit the extracted information in use cases relevant to the discovery of alternative medicines and the preservation of natural resources.

As in Mining Biodiversity, we will develop our system using Argo, which allows text mining processing pipelines to be built and evaluated with minimal effort.

Presentations

Our work on COPIOUS has been presented in the following:

Text mining tools and infrastructure for biomedical applications: cancer biology, history of medicine, monitoring biodiversity. Lecture delivered by Sophia Ananiadou, 5th April 2016, Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece.

Re-usable text mining workflows for advanced search. Invited talk delivered by Sophia Ananiadou at the First Workshop on Text Mining in Natural Sciences (TMINS-1): Exploring Text Mining in Marine, Climate and Environmental Science, 12-13th November 2015, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.

A talk entitled Enhancing Semantic Search through the Automatic Construction of a Biodiversity Terminological Inventory at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.

Publications

Maolin Li, Nhung Nguyen, Sophia Ananiadou (In Press) Proactive Learning for Named Entity Recognition. In Proceedings of the BioNLP Workshop 2017.

Nguyen NTH, Soto AJ, Kontonatsios G, Batista-Navarro R, Ananiadou S (2017) Constructing a biodiversity terminological inventory. PLoS ONE 12(4): e0175277.

Batista-Navarro R., Zerva C., Nguyen N.T.H., Ananiadou S. (2017) A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura J., Alatrista-Salas H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer.

Tools

Two models, for Taxon and Habitat detection respectively, have been incorporated into the Argo component NERSuite Custom Tagger, in which users can select the model they would like to apply to their text. We also compiled two dictionaries that can be used to automatically ground a detected Taxon or Habitat, i.e., to assign an identifier to it. The dictionary for grounding Taxon entities was created by collecting available names from the Catalogue of Life (CoL). For Habitat entities, we constructed the dictionary by extracting all terms provided by the Environment Ontology (ENVO). Given an ID provided by CoL or ENVO, we can link the entity back to its original source. The two dictionaries are available in an Argo component named Concept Normaliser. A demonstration workflow, named COPIOUS-Taxon and Habitat, has been made publicly available at http://argo.nactem.ac.uk/test/.
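The grounding step can be pictured as a dictionary look-up. The sketch below is only an illustration under stated assumptions: the function names, the exact-match strategy and the dictionary entries are hypothetical, the identifiers are made-up placeholders rather than real CoL or ENVO IDs, and it does not reflect how the Concept Normaliser component is actually implemented.

```python
from typing import Optional

# Placeholder dictionaries. The identifiers are invented for illustration
# and are NOT real Catalogue of Life or Environment Ontology IDs.
TAXON_DICTIONARY = {"pithecophaga jefferyi": "COL:EXAMPLE-1"}
HABITAT_DICTIONARY = {"mangrove forest": "ENVO:EXAMPLE-1"}

DICTIONARIES = {"Taxon": TAXON_DICTIONARY, "Habitat": HABITAT_DICTIONARY}


def normalise(mention: str) -> str:
    """Lower-case the mention and collapse internal whitespace."""
    return " ".join(mention.lower().split())


def ground(mention: str, entity_type: str) -> Optional[str]:
    """Return the identifier for a detected mention, or None if unknown."""
    dictionary = DICTIONARIES.get(entity_type)
    if dictionary is None:
        raise ValueError(f"Unsupported entity type: {entity_type}")
    return dictionary.get(normalise(mention))


if __name__ == "__main__":
    print(ground("Pithecophaga  jefferyi", "Taxon"))
    print(ground("cloud forest", "Habitat"))
```

Exact matching after light normalisation is the simplest possible strategy; handling spelling variants or abbreviated genus names would require approximate matching on top of this.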

Resources

We have compiled a terminological inventory for biodiversity by combining all of the names available in Catalogue of Life (CoL), Encyclopedia of Life (EoL) and Global Biodiversity Information Facility (GBIF). More information is available at http://www.nactem.ac.uk/bhl_inventory/.

A search interface that uses the inventory as metadata for query expansion is available at http://nactem.ac.uk/BHLQueryExpansion/.
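To show the general idea of inventory-based query expansion, here is a minimal sketch. The inventory entry, the function name expand_query and the OR-joined quoted-phrase output are assumptions made for this example; the actual search interface may structure its inventory and queries differently.

```python
# Hypothetical inventory mapping a name (lower-cased) to alternative names.
# The single entry is an illustrative sample, not taken from the inventory.
INVENTORY = {
    "pithecophaga jefferyi": ["Philippine eagle", "monkey-eating eagle"],
}


def expand_query(query: str) -> str:
    """Build an OR query from the original term and its known variants."""
    term = query.strip()
    variants = INVENTORY.get(term.lower(), [])
    # Keep the original term first and skip variants identical to it.
    terms = [term] + [v for v in variants if v.lower() != term.lower()]
    return " OR ".join(f'"{t}"' for t in terms)


if __name__ == "__main__":
    print(expand_query("Pithecophaga jefferyi"))
```

A query for a name that is not in the inventory is returned unchanged apart from quoting, so the expansion step never narrows a search.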

Project Team

United Kingdom

School of Computer Science, University of Manchester
Principal Investigator: Prof. Sophia Ananiadou
Co-Investigator: Dr. Riza Batista-Navarro
Researchers: Dr. Nhung Nguyen, Dr. Axel Soto, Mr. Paul Thompson

Philippines

Department of Physical Sciences and Mathematics, University of the Philippines Manila
Principal Investigator: Prof. Marilou Nicolas

Department of Computer Science, University of the Philippines Diliman
Co-Investigator: Dr. Prospero Naval

Biodiversity Management Bureau, Department of Environment and Natural Resources
Co-Investigator: Dr. Vincent Hilomen

Funding

This project is being funded by the British Council for two years from March 31, 2015.