COnserving Philippine bIOdiversity by UnderStanding big data (COPIOUS): Integration and analysis of heterogeneous information on Philippine biodiversity
Background
The collaborative project aims to advance the means by which Philippine biodiversity information is being collected and published, through the construction of an online knowledge repository whose content will be semi-automatically curated by text mining-based analytics. Currently, information on Philippine biodiversity is largely fragmented due to the siloed formats in which different local institutions store their data. Scientific literature offers invaluable information that can fill in the knowledge gaps, but because of its overwhelming volume and lack of structure, its thorough manual examination has become impossible. Consequently, a comprehensive body of knowledge on Philippine biodiversity remains unavailable, hampering the timely formulation of environmental policies, and the discovery of new natural products that can potentially provide medicinal benefits.
Project Aims
The project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will bring together different types of information (e.g., taxonomic, occurrence, ecological, biomolecular and biochemical), providing users with a comprehensive view of species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.
Project Framework
In order to construct such a repository, several advanced text mining technologies will be applied to biodiversity documents. Most of these documents, e.g., legacy literature in the Biodiversity Heritage Library, have undergone optical character recognition (OCR) and thus contain a significant amount of noise. We will therefore apply a rule-based approach to clean up the text. We will then incorporate active learning methods into Argo, a Web-based text mining workbench, to extract targeted named entities and relations from the documents. All extracted information will be combined with structured information sourced from various Philippine biodiversity research groups, and stored in a database over which a search engine will be built to facilitate knowledge discovery.
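As an illustration of what rule-based cleanup of OCR output can look like, the following is a minimal sketch. The function name `clean_ocr_text` and the four rules are assumptions chosen for illustration; they are not the project's actual rule set, which is not described on this page.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few illustrative (hypothetical) cleanup rules to OCR output."""
    # Rule 1: rejoin words split by a hyphen at a line break,
    # e.g. "diptero-\ncarp" -> "dipterocarp".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Rule 2: turn single line breaks into spaces, keeping blank lines
    # (paragraph boundaries) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Rule 3: collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Rule 4: remove a stray space before common punctuation marks.
    text = re.sub(r" ([,.;:])", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    noisy = "Shorea  con-\ntorta grows\nin lowland forest .\n\nNext paragraph"
    print(clean_ocr_text(noisy))
```

Note that Rule 1 would also merge genuinely hyphenated compounds that happen to fall at a line break, so a real rule set would likely need a dictionary check or similar safeguard.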
It is intended that the repository will support the Philippine government's efforts on conserving the country's natural resources, which in turn can translate to benefits for the Philippine population, in terms of ecosystem resilience and access to alternative medicines.
Our resources will be made publicly available wherever possible.
Award Nomination
COPIOUS features in the Better World Showcase 2018, and has been shortlisted as a potential recipient of the People's Vote award. The showcase celebrates the important contribution that the Faculty of Science and Engineering makes to social and environmental impact, and highlights the efforts of staff and students who are 'making a difference', and will hopefully inspire others to do the same. COPIOUS features as a project that is bringing outstanding benefit to society through research. You can find and vote for us at the virtual showcase: http://www.manchester.ac.uk/betterworldshowcase.
Related Work
The project is closely related to the Mining Biodiversity project, which aims to enrich the Biodiversity Heritage Library with automatically generated semantic metadata (e.g., terms, entities and events). Unlike Mining Biodiversity, however, COPIOUS focusses on Philippine species and attempts to directly exploit the extracted information in use cases relevant to the discovery of alternative medicines and the preservation of natural resources.
As in Mining Biodiversity, we will develop our system using Argo, which allows text mining processing pipelines to be built and evaluated with minimal effort.
Presentations
Our work on COPIOUS has been presented in the following:
A talk entitled "Developing a knowledge base on the habitats and reproductive conditions of Dipterocarps through information extraction" at the Annual Conference of Biodiversity Information Standards (TDWG) 2017 in Ottawa, Canada.
A talk entitled "Argo as a platform for integrating distinct biodiversity analytics tools into workflows for building graph databases" at the Annual Conference of Biodiversity Information Standards (TDWG) 2017 in Ottawa, Canada.
Text mining tools and infrastructure for biomedical applications: cancer biology, history of medicine, monitoring biodiversity. Lecture delivered by Sophia Ananiadou, 5th April 2016, Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece.
A talk entitled "Enhancing Semantic Search through the Automatic Construction of a Biodiversity Terminological Inventory" at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.
A talk entitled "Understanding mass flowering of dipterocarps through semantic occurrence information extraction" at the Annual Conference of Biodiversity Information Standards (TDWG) 2016 in Costa Rica.
Re-usable text mining workflows for advanced search. Invited talk delivered by Sophia Ananiadou at the First Workshop on Text Mining in Natural Sciences (TMINS-1): Exploring Text Mining in Marine, Climate and Environmental Science, 12-13th November 2015, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.
Publications
Nhung T.H. Nguyen, Rosalyn Gabud, Sophia Ananiadou (2019). COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal, 7:e29626.
Sabado, A.J., Solano, G., Nicolas, M., Batista-Navarro, R., Gabud, R., Hilomen, V. Modelling the Coverage of Dipterocarp Trees in Central Visayas, Philippines. IEEExplore 2017 IEEE 13th International Conference on e-Science, pp. 561-565.
Gabud, R.S., Batista-Navarro, R., Mariano, V., Mendoza, R., Yap, S. Literature Mining on Dipterocarps: Towards Better Informed Regeneration and Reforestation in Luzon, Philippines. 1st International Conference on Integrated Natural Resources and Environment Management. 21-23 February, 2017, Manila, Philippines.
Maolin Li, Nhung Nguyen, Sophia Ananiadou (2017). Proactive Learning for Named Entity Recognition. In Proceedings of the BioNLP Workshop 2017, pp. 117-125.
Nguyen NTH, Soto AJ, Kontonatsios G, Batista-Navarro R, Ananiadou S (2017) Constructing a biodiversity terminological inventory. PLoS ONE 12(4): e0175277.
Batista-Navarro R., Zerva C., Nguyen N.T.H., Ananiadou S. (2017) A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura J., Alatrista-Salas H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer
Tools
Two models, one for Taxon detection and one for Habitat detection, have been incorporated into the Argo component NERSuite Custom Tagger, in which users can select the model they would like to apply to their text. We also compiled two dictionaries that can be used to automatically ground a detected Taxon or Habitat, i.e., to assign an identifier to it. The dictionary for grounding Taxon entities was created by collecting available names from the Catalogue of Life (CoL). For Habitat entities, we constructed the dictionary by extracting all terms provided by the Environment Ontology (ENVO). Given an ID provided by CoL or ENVO, we can link the entity back to the original resource. The two dictionaries are available in an Argo component named Concept Normaliser. A demonstration workflow, named COPIOUS-Taxon and Habitat, has been made publicly available at http://argo.nactem.ac.uk/test/.
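The idea of dictionary-based grounding can be sketched as follows. This is a simplified illustration and not the Concept Normaliser component itself: the function names, the exact-match lookup after case and whitespace normalisation, and the dictionary entries are all assumptions, and the identifiers shown are placeholders rather than real CoL or ENVO IDs.

```python
from typing import Dict, Optional

# Toy dictionaries with placeholder identifiers (NOT real CoL/ENVO IDs).
# In practice the entries would be compiled from CoL names and ENVO terms.
TAXON_DICT: Dict[str, str] = {"shorea contorta": "COL:EXAMPLE-0001"}
HABITAT_DICT: Dict[str, str] = {"mangrove swamp": "ENVO:EXAMPLE-0001"}

DICTIONARIES: Dict[str, Dict[str, str]] = {
    "Taxon": TAXON_DICT,
    "Habitat": HABITAT_DICT,
}


def normalise(mention: str) -> str:
    """Lower-case the mention and collapse internal whitespace."""
    return " ".join(mention.lower().split())


def ground(mention: str, entity_type: str) -> Optional[str]:
    """Return the identifier for a detected mention, or None if not found."""
    dictionary = DICTIONARIES.get(entity_type)
    if dictionary is None:
        return None  # no dictionary for this entity type
    return dictionary.get(normalise(mention))


if __name__ == "__main__":
    print(ground("Shorea  Contorta", "Taxon"))
    print(ground("unknown species", "Taxon"))
```

Exact matching is the simplest possible strategy; handling abbreviated genus names or spelling variants would need fuzzier matching than this sketch provides.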
Resources
Terminological inventory
We have compiled a terminological inventory for biodiversity by combining all of the names available in the Catalogue of Life (CoL), the Encyclopedia of Life (EoL) and the Global Biodiversity Information Facility (GBIF). More information is available at http://www.nactem.ac.uk/bhl_inventory/.
A visual text analytics search system for biodiversity that uses the terminological inventory as metadata for query expansion is available at http://www.nactem.ac.uk/BHLVisualSearch/.
COPIOUS corpus
To support further tasks in biodiversity text mining, e.g., named entity recognition and species distribution extraction, we have constructed a corpus with five categories of entities: taxon, geographical location, habitat, temporal expression, and person. The complete corpus, available in brat format, can be downloaded here.
If you use the corpus, please cite the following publication:
Nhung T.H. Nguyen, Rosalyn Gabud, Sophia Ananiadou (2019). COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal, 7:e29626.
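For readers unfamiliar with brat, entity annotations are stored in standoff `.ann` files, where each text-bound annotation is a tab-separated line containing an ID, a type with character offsets, and the covered text. The following is a minimal sketch of reading such entities. The `Entity` tuple and `parse_brat_entities` function are hypothetical helpers, and the labels in the sample (`Taxon`, `Habitat`) are assumed for illustration; the label names actually used in the corpus files may differ.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_content: str) -> List[Entity]:
    """Extract text-bound annotations (IDs starting with 'T') from .ann content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        ann_id, type_and_span, text = parts
        label, span = type_and_span.split(" ", 1)
        # Discontinuous annotations look like "0 5;10 15"; this sketch keeps
        # the first start offset and the last end offset.
        fragments = [fragment.split() for fragment in span.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


if __name__ == "__main__":
    sample = (
        "T1\tTaxon 0 15\tShorea contorta\n"
        "T2\tHabitat 26 40\tlowland forest\n"
    )
    for entity in parse_brat_entities(sample):
        print(entity)
```

Each `.ann` file is paired with a `.txt` file of the same base name, and the offsets index into that text, so `text[start:end]` recovers the mention for contiguous annotations.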
DipteroMine corpus
This corpus was constructed to support the extraction of information from documents related to Philippine dipterocarps. Specifically, we target two main case studies: (1) analysing species distribution and (2) detecting phenological patterns of Philippine dipterocarps. One of the most important resources needed for these case studies is a corpus labelled with ten types of biodiversity entities: taxon, geographic location, habitat, habitat attribute, habitat attribute value, temporal expression, person, reproductive condition, specimen location, and specimen number. A sample of the corpus in brat format can be downloaded here.
The Philippine Atlas of Biodiversity
The Philippine Atlas of Biodiversity (PhAB) is a biodiversity information platform built upon the Atlas of Living Australia (ALA) e-infrastructure. PhAB aims to support comprehensive collaborative research, environmental monitoring, information dissemination, biosecurity activities, and long-term conservation planning. This aids in educating Filipinos, both in school and out of school, on the state of Philippine biodiversity and on the importance of an individual's contribution to conservation efforts. PhAB provides online access to information about the Philippines' megadiverse resources, enabling high-quality research and innovation outcomes that address national and global challenges.
Project Team
United Kingdom
School of Computer Science, University of Manchester
Principal Investigator: Prof. Sophia Ananiadou
Researchers: Dr. Nhung Nguyen, Dr. Axel Soto, Mr. Paul Thompson.
Philippines
Department of Physical Sciences and Mathematics, University of the Philippines Manila
Principal Investigator: Prof. Marilou Nicolas
Department of Computer Science, University of the Philippines Diliman
Co-Investigator: Dr. Prospero Naval
Biodiversity Management Bureau, Department of Environment and Natural Resources
Co-Investigator: Dr. Vincent Hilomen
Funding
This project is being funded by the British Council for two years from March 31, 2015.
Further information
- Contact us: Prof. Sophia Ananiadou.