NaCTeM Metabolite and Enzyme Corpus
NOTE: Please observe the terms of the NaCTeM Metabolite and Enzyme Corpus licence when downloading the corpus.
UPDATE 09/01/12: A number of inconsistencies in the corpus have been corrected, and EC numbers have been tagged as enzymes.
Background
Text mining methods have added considerably to our capacity to extract biological knowledge from the literature. Recently, the field of systems biology has begun to model and simulate metabolic networks, requiring knowledge of the set of molecules involved. While genomics and proteomics technologies are able to supply the macromolecular parts list, the metabolites are less easily assembled. Most metabolites are known and reported through the scientific literature, rather than through large-scale experimental surveys. Thus, it is important to recover them from the literature.
To allow text mining systems to be trained to recognise metabolites automatically, a corpus has been created in which metabolite names, as well as enzyme names, have been manually annotated by two domain experts. The documents correspond to 296 MEDLINE abstracts from 2007, which were originally included in version 1 of the yeast metabolic network reconstruction (Herrgård et al., 2008). Annotations of metabolites and enzymes were restricted to those names that appear in the context of metabolic pathways. For example, in the sentence "glucose is an economically important chemical in the food industry", glucose does not play the role of a metabolite, so it is not annotated.
The gold-standard (consensus) corpus was created by integrating the manual annotations of the two annotators. Both annotators discussed and checked the gold-standard data. Annotator A is senior to Annotator B in terms of annotation experience and years spent working in biochemistry, and therefore made the final decision. The two sets of manual annotations were then compared to the gold-standard data. The F-scores are 88.49 for Annotator A and 78.35 for Annotator B.
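As an illustration of how one annotator's output can be scored against gold-standard data, the following minimal sketch computes precision, recall and the balanced F-score over sets of annotations. The matching criterion (exact agreement on document, offsets and type), the function name f_score and the sample annotations are all assumptions made for this example; the text above does not state how matches were counted for the reported figures.

```python
def f_score(annotated, gold):
    """Return (precision, recall, F1) as percentages.

    Assumption: an annotation counts as correct only if the identical
    tuple also appears in the gold standard (exact match).
    """
    annotated, gold = set(annotated), set(gold)
    true_positives = len(annotated & gold)
    precision = true_positives / len(annotated) if annotated else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical annotations: (document id, start offset, end offset, type).
gold = {
    ("doc1", 0, 7, "METABOLITE"),
    ("doc1", 20, 30, "ENZYME"),
    ("doc2", 5, 12, "METABOLITE"),
}
annotator = {
    ("doc1", 0, 7, "METABOLITE"),
    ("doc1", 20, 30, "METABOLITE"),  # same span, wrong type: not a match
    ("doc2", 5, 12, "METABOLITE"),
}

precision, recall, f1 = f_score(annotator, gold)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

In this toy case two of the three annotations match, so precision, recall and F-score are all two thirds.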
Corpus format
The corpus is provided in XML format. The original XML markup provided on the MEDLINE abstracts is retained, and METABOLITE and ENZYME elements have been added. METABOLITE annotations may be embedded inside ENZYME annotations.
The corpus download distribution contains two directories:
- customize - DTDs and a CSS file, which allow the metabolite and enzyme annotations to be viewed in a web browser.
- xmls - the XML files containing the annotations. There is one file per abstract.
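The annotations can be read with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, not an official reader for the corpus. The sample fragment, the AbstractText wrapper and the function names are assumptions made for illustration; only the METABOLITE and ENZYME element names and the xmls directory come from the description above. Real files may also declare a DTD or entities that need additional handling.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

ANNOTATION_TAGS = ("METABOLITE", "ENZYME")


def extract_annotations(root):
    """Return (tag, text) pairs for every annotation under root.

    itertext() joins the text of nested elements, so an ENZYME that
    embeds a METABOLITE yields the full enzyme name, and the embedded
    METABOLITE is reported separately.
    """
    return [
        (element.tag, "".join(element.itertext()).strip())
        for element in root.iter()
        if element.tag in ANNOTATION_TAGS
    ]


def annotations_in_directory(directory):
    """Map each XML file name in directory to its list of annotations."""
    return {
        path.name: extract_annotations(ET.parse(path).getroot())
        for path in sorted(Path(directory).glob("*.xml"))
    }


# Hypothetical fragment; the surrounding MEDLINE markup is omitted.
sample = (
    "<AbstractText>The enzyme "
    "<ENZYME><METABOLITE>glucose</METABOLITE> oxidase</ENZYME> converts "
    "<METABOLITE>glucose</METABOLITE> to gluconolactone.</AbstractText>"
)
annotations = extract_annotations(ET.fromstring(sample))
for tag, text in annotations:
    print(tag, text)
```

With the distribution unpacked, calling annotations_in_directory("xmls") would apply the same extraction to each abstract, assuming the files parse without the DTDs from the customize directory.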
References
Herrgård, M. J., Swainston, N., Dobson, P., Dunn, W. B., Arga, K. Y., Arvas, M., Blüthgen, N., Borger, S., Costenoble, R., Heinemann, M., Hucka, M., Le Novère, N., Li, P., Liebermeister, W., Mo, M. L., Oliveira, A. P., Petranovic, D., Pettifer, S., Simeonidis, E., Smallbone, K., Spasić, I., Weichart, D., Brent, R., Broomhead, D. S., Westerhoff, H. V., Kırdar, B., Penttilä, M., Klipp, E., Palsson, B. Ø., Sauer, U., Oliver, S. G., Mendes, P., Nielsen, J. and Kell, D. B. (2008). A consensus yeast metabolic reconstruction obtained from a community approach to systems biology. Nature Biotechnology, 26, 1155–1160.
Nobata, C., Dobson, P., Iqbal, S. A., Mendes, P., Tsujii, J., Kell, D. B. and Ananiadou, S. (2011). Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 7(1), 94-101.
NaCTeM Metabolite and Enzyme Corpus Licence
1. Copyright of abstracts
Any abstracts contained in this corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).
NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.
NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.
2. Copyright of Metabolite and Enzyme Annotations
The metabolite and enzyme annotations in the NaCTeM Metabolite and Enzyme Corpus are licensed by NaCTeM under a Creative Commons Attribution 3.0 Unported License.
Please attribute the corpus by citing the following paper:
Nobata, C., Dobson, P., Iqbal, S. A., Mendes, P., Tsujii, J., Kell, D. B. and Ananiadou, S. (2011). Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 7(1), 94-101.
Contact
For any queries relating to the corpus, please contact: sophia.ananiadou at manchester.ac.uk