The GREC Corpus
Download
Updates
- 01/08/2011: A file in the Human part of the corpus (ID 8205615) was found to contain one character offset problem for a named entity. This problem has now been resolved, and the appropriate standoff annotation file (8205615.a1) has been replaced. The U-Compare corpus reader has also been updated accordingly.
- 02/12/2010: A corpus reader component for the GREC corpus is now available for use in the U-Compare text mining/natural language processing system. The component should be saved and imported into U-Compare, following the steps outlined here.
- 12/11/2010: Three files in the E. coli part of the corpus (IDs 9852003, 14996803 and 15995204) were found to contain minor errors. Abstract 14996803 contained an event that was not centred on a verb or nominalised verb, so the event was deleted. Abstracts 9852003 and 15995204 each contained one event in which a type that should have been assigned to one of the event arguments had been assigned to the event itself. This has now been corrected in both files.
- 10/12/2009: Two files in the Human part of the corpus (IDs 8205615 and 9778250) were found to have character offset problems due to foreign characters. Offsets in these files were previously based on bytes rather than characters. The corpus has now been updated so that all offsets are based on characters.
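The difference between byte-based and character-based offsets can be illustrated with a minimal sketch. The sentence and entity below are invented for illustration and are not taken from the corpus, and UTF-8 is assumed as the encoding; the function name is likewise hypothetical.

```python
def char_and_byte_offsets(text: str, entity: str) -> tuple[int, int]:
    """Return the start of `entity` in characters and in UTF-8 bytes."""
    char_start = text.index(entity)
    byte_start = text.encode("utf-8").index(entity.encode("utf-8"))
    return char_start, byte_start


# Invented example: the Greek letter occupies one character but two UTF-8 bytes,
# so every byte offset after it is shifted relative to the character offset.
text = "Studies of β-galactosidase and the lacZ gene"
char_start, byte_start = char_and_byte_offsets(text, "lacZ")

# A character offset slices the decoded string correctly.
assert text[char_start:char_start + 4] == "lacZ"
# Applying the byte offset to the decoded string selects the wrong span.
assert text[byte_start:byte_start + 4] != "lacZ"
```

With character-based offsets, an annotation span can be applied directly to the decoded text regardless of how the file is encoded on disk.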
The corpus is available for download in two formats:
- A standoff format, based on the BioNLP'09 Shared Task format
- An XML format, based on the GENIA event annotation format
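As a rough sketch of how a text-bound line in a BioNLP'09-style standoff file might be read, assuming the usual tab-separated layout of ID, then type with start and end offsets, then surface text. The sample line and the type label "Gene" are hypothetical and may not match the labels or layout actually used in the GREC files.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str    # e.g. "T1"
    ann_type: str  # annotation type label (hypothetical here)
    start: int     # character offset of the first character
    end: int       # character offset one past the last character
    text: str      # surface string covered by the span


def parse_text_bound(line: str) -> TextBound:
    """Parse one 'ID<TAB>TYPE START END<TAB>TEXT' standoff line."""
    ann_id, type_and_span, surface = line.rstrip("\n").split("\t")
    ann_type, start, end = type_and_span.split(" ")
    return TextBound(ann_id, ann_type, int(start), int(end), surface)


# Hypothetical annotation line over an example sentence.
abstract = "The narL gene product activates the nitrate reductase operon"
ann = parse_text_bound("T1\tGene 4 8\tnarL")

# The offsets should select exactly the annotated surface string.
assert abstract[ann.start:ann.end] == ann.text
```

Checking each span against the source text in this way is a simple guard against the kind of offset problems described in the updates above.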
Background
Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. Effective IE systems require training data or annotated corpora, in which instances of biomedical events are explicitly identified in texts. The trained IE systems can then recognise instances of new events in texts, facilitating a number of text mining applications, such as pathway maintenance and semantic searching.
The Corpus
The GREC corpus is a semantically annotated corpus of 240 MEDLINE abstracts (167 on the subject of E. coli species and 73 on the subject of the Human species) which is intended for training IE systems and/or resources which are used to extract events from biomedical literature.
The corpus has been manually annotated by biologists with events relating to gene regulation. Each event is centred on either a verb (e.g. transcribe) or a nominalised verb (e.g. transcription), and annotation consists of identifying, as exhaustively as possible, the structurally-related arguments of the verb or nominalised verb within the same sentence. Each event argument is then assigned the following information:
- A semantic role from a fixed set of 13 roles which are tailored to the biomedical domain.
- A biomedical concept type (where appropriate).
As a simple example, consider the following sentence:
The narL gene product activates the nitrate reductase operon
The sentence contains a single event, centred on the verb activates, with 2 arguments, i.e.:
- The narL gene product
- the nitrate reductase operon
Other types of argument include:
- LOCATION, e.g. In Escherichia coli, glnAP2 may be activated by NifA
- MANNER, e.g. cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
- CONDITION, e.g. Strains carrying a mutation in the crp structural gene fail to repress ODC and ADC activities in response to increased cAMP
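One possible in-memory representation of the narL example given earlier is sketched below. The class and field names are hypothetical and are not part of the corpus distribution. The role and concept type fields are left unset because the example above does not name them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    text: str
    start: int  # character offset within the sentence
    end: int    # one past the last character


@dataclass
class Argument:
    span: Span
    role: Optional[str] = None          # one of the 13 semantic roles
    concept_type: Optional[str] = None  # biomedical concept type, where appropriate


@dataclass
class Event:
    trigger: Span  # the verb or nominalised verb the event is centred on
    arguments: List[Argument] = field(default_factory=list)


def locate(sentence: str, phrase: str) -> Span:
    """Find the first occurrence of `phrase` and return it as a Span."""
    start = sentence.index(phrase)
    return Span(phrase, start, start + len(phrase))


SENTENCE = "The narL gene product activates the nitrate reductase operon"

event = Event(
    trigger=locate(SENTENCE, "activates"),
    arguments=[
        Argument(locate(SENTENCE, "The narL gene product")),
        Argument(locate(SENTENCE, "the nitrate reductase operon")),
    ],
)
```

Keeping the trigger and each argument as separate spans mirrors the annotation scheme, in which every argument belongs to the same sentence as the verb or nominalised verb it relates to.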
Full details of the annotation scheme can be found in the annotation guidelines.
GREC Licence
1. Copyright of abstracts
The abstracts contained in the GREC corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).
NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.
NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.
2. Copyright of annotations
The annotations within the abstracts of the GREC corpus are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. The annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Please attribute the corpus by citing the following paper:
Thompson, P., Iqbal, S. A., McNaught, J. and Ananiadou, S. (2009). Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 10:349
Contact
For any queries relating to the corpus, please contact: paul.thompson at manchester.ac.uk