BioCause Corpus


Biomedical corpora annotated with event-level information are an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge in areas such as diagnosis, pathology and systems biology. Automatic causality recognition can therefore greatly reduce human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is hence crucial for developing and evaluating biomedical text mining systems.

BioCause annotation scheme

We have defined an annotation scheme that aims to enrich events in the ID event corpus (as well as other corpora annotated with biomedical events) with causality relations. This scheme has subsequently been used to annotate 851 causal relations to form BioCause, a collection of open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents were pre-annotated with named entity and event information in the context of a previous shared task, BioNLP 2011 ST ID.

The BioNLP 2011 ST ID corpus consists of 19 full-text documents that have been manually annotated with biomedical entities and events. The annotations provide classified, structured representations of relationships between biomedical terms, and as such, the corpus constitutes a valuable resource for the training of IE systems.

The original version of the ID corpus concentrated on the following:

  • Identification of the event trigger (the word or phrase around which the event is organised)
  • Assignment of the event type
  • Identification of the event participants, usually:
    • THEME: the entity or event affected by the current event
    • CAUSE: the entity or event that causes the current event to occur
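
The event structure listed above can be sketched as a small data model. This is only an illustration, not the corpus file format: the class names, field names, the `nesting_depth` helper and the example sentence content are all hypothetical, and the type labels are illustrative only. The point it shows is that a participant may itself be an event, so events can nest.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    """A named entity mention (hypothetical representation)."""
    text: str
    entity_type: str


@dataclass
class Event:
    """An event: a trigger, an event type and role-labelled participants."""
    trigger: str
    event_type: str
    # Roles such as THEME and CAUSE map to an entity or to another event.
    participants: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)


def nesting_depth(event: Event) -> int:
    """Return 1 for a flat event, more when participants are events."""
    child_depths = [
        nesting_depth(p) for p in event.participants.values()
        if isinstance(p, Event)
    ]
    return 1 + max(child_depths, default=0)


# Invented example: an expression event that is the THEME of a regulation event.
expression = Event("expression", "Gene_expression",
                   {"THEME": Entity("katG", "Protein")})
regulation = Event("induces", "Positive_regulation",
                   {"THEME": expression, "CAUSE": Entity("OxyR", "Protein")})
```

Here `regulation` has an entity as its CAUSE and another event as its THEME, which is the kind of nesting the THEME and CAUSE definitions allow.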

On top of these events, we have added causality annotations. The annotation structure is similar to that of an event:

  • Identification of the causal trigger (the word or phrase around which the relation is organised). The trigger can also be empty, in which case a zero-length span is placed between the arguments.
  • Identification of the relation arguments, usually:
    • CAUSE: the span of text describing the situation that causes the current event to occur
    • EFFECT: the span of text describing the situation occurring because of the current event
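
A causality annotation of this shape could be represented roughly as follows. This is a minimal sketch under stated assumptions, not the format in which the corpus is distributed: spans are assumed to be character offsets with an exclusive end, the names `CausalRelation`, `find_span` and `empty_trigger_between` are hypothetical, and the example sentences are invented. The scheme only says that an empty trigger is placed between the arguments, so putting it directly after the first argument is an assumption made here.

```python
from dataclasses import dataclass
from typing import Tuple

# (start, end) character offsets, end exclusive (an assumption of this sketch).
Span = Tuple[int, int]


@dataclass(frozen=True)
class CausalRelation:
    trigger: Span  # may be zero-length when no explicit trigger exists
    cause: Span
    effect: Span

    def has_empty_trigger(self) -> bool:
        return self.trigger[0] == self.trigger[1]


def find_span(text: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase and return its offsets."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def span_text(text: str, span: Span) -> str:
    return text[span[0]:span[1]]


def empty_trigger_between(first: Span, second: Span) -> Span:
    """Zero-length span placed right after whichever argument comes first."""
    earlier = min(first, second)
    return (earlier[1], earlier[1])


# Invented example with an explicit trigger ("Thus").
text = "Iron is depleted. Thus, the regulon is induced."
explicit = CausalRelation(
    trigger=find_span(text, "Thus"),
    cause=find_span(text, "Iron is depleted"),
    effect=find_span(text, "the regulon is induced"),
)

# Invented example with no trigger word: the trigger is a zero-length span.
text2 = "Iron is depleted; the regulon is induced."
cause2 = find_span(text2, "Iron is depleted")
effect2 = find_span(text2, "the regulon is induced")
implicit = CausalRelation(empty_trigger_between(cause2, effect2), cause2, effect2)
```

Keeping the empty trigger as an ordinary span, rather than a missing value, means every relation has the same three fields and consumers can test `has_empty_trigger()` when they need to distinguish the two cases.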

The complete version of the ID corpus enriched with causality annotation is available for download. NOTE: Please observe the terms of the BioCause corpus licence when downloading the corpus.

BioCause corpus licence

1. Copyright of abstracts

Any abstracts contained in this corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).

NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.

NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.

2. Copyright of full texts

Any full texts contained in this corpus are from the PMC Open Access Subset of PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature.

Articles in the PMC Open Access Subset are protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the license of each article for specific license terms.

3. Copyright of Named Entity and Event Annotations

See the GENIA Project License for Annotated Corpora.

4. Copyright of BioCause Annotations

The causality annotations within BioCause are the result of work carried out at the National Centre for Text Mining (NaCTeM), School of Computer Science, the University of Manchester, UK. These annotations are copyrighted and licensed by NaCTeM under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Please attribute the corpus by citing the following paper:

Mihăilă, C., Ohta, T., Pyysalo, S. and Ananiadou, S. (2013) BioCause: Annotating and analysing causality in the biomedical domain. In BMC Bioinformatics, 14(1):2.