BioCause Corpus
Background
Biomedical corpora annotated with event-level information represent an important resource for domain-specific information extraction (IE) systems. However, bio-event annotation alone cannot cater for all the needs of biologists. Unlike work on relation and event extraction, most of which focusses on specific events and named entities, we aim to build a comprehensive resource covering all statements of causal association present in discourse. Causality lies at the heart of biomedical knowledge in areas such as diagnosis, pathology and systems biology. Automatic causality recognition can therefore greatly reduce human workload by suggesting possible causal connections and aiding the curation of pathway models. A biomedical text corpus annotated with such relations is hence crucial for developing and evaluating biomedical text mining systems.
BioCause annotation scheme
We have defined an annotation scheme that aims to enrich events in the ID event corpus (as well as other corpora annotated with biomedical events) with causality relations. This scheme has subsequently been used to annotate 851 causal relations to form BioCause, a collection of open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents had been pre-annotated with named entity and event information in the context of a previous shared task, BioNLP 2011 ST ID.
The BioNLP 2011 ST ID corpus consists of 19 full-text documents that have been manually annotated with biomedical entities and events. The annotations provide classified, structured representations of relationships between biomedical terms, and as such, the corpus constitutes a valuable resource for the training of IE systems.
The original version of the ID corpus concentrated on the following:
- Identification of the event trigger (the word or phrase around which the event is organised)
- Assignment of the event type
- Identification of the event participants, usually:
  - THEME: the entity or event affected by the current event
  - CAUSE: the entity or event that causes the current event to occur
On top of these events, we have added causality annotations. The annotation structure is similar to that of an event:
- Identification of the causal trigger (the word or phrase around which the relation is organised). The trigger can also be empty, in which case a zero-length span is placed between the arguments.
- Identification of the relation arguments, usually:
  - CAUSE: the span of text describing the situation that causes the current event to occur
  - EFFECT: the span of text describing the situation occurring because of the current event
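As an illustration only, the causality annotation structure described above can be sketched in Python. The class names (`Span`, `CausalRelation`), the helper `span_of`, the character-offset convention and the example sentences are all assumptions made for this sketch; they do not reflect the file format in which the corpus is distributed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character offsets [start, end) into a document (assumed convention)."""
    start: int
    end: int

    def __post_init__(self):
        if self.start < 0 or self.end < self.start:
            raise ValueError(f"invalid span: {self.start}-{self.end}")

    @property
    def is_empty(self) -> bool:
        # A zero-length span is how an empty (implicit) trigger is represented.
        return self.start == self.end

    def text(self, document: str) -> str:
        return document[self.start:self.end]


@dataclass(frozen=True)
class CausalRelation:
    """Hypothetical container: one causal trigger plus CAUSE and EFFECT arguments."""
    trigger: Span
    cause: Span
    effect: Span

    @property
    def has_explicit_trigger(self) -> bool:
        return not self.trigger.is_empty


def span_of(document: str, phrase: str) -> Span:
    """Locate the first occurrence of `phrase`; raises ValueError if absent."""
    start = document.index(phrase)
    return Span(start, start + len(phrase))


# Invented example with an explicit trigger ("therefore").
doc = "Iron is depleted, therefore growth is inhibited."
explicit = CausalRelation(
    trigger=span_of(doc, "therefore"),
    cause=span_of(doc, "Iron is depleted"),
    effect=span_of(doc, "growth is inhibited"),
)

# Invented example with an empty trigger: a zero-length span between the arguments.
doc2 = "Iron is depleted. Growth is inhibited."
cause2 = span_of(doc2, "Iron is depleted")
effect2 = span_of(doc2, "Growth is inhibited")
implicit = CausalRelation(
    trigger=Span(cause2.end, cause2.end),
    cause=cause2,
    effect=effect2,
)
```

Representing the empty trigger as a zero-length span, rather than as a missing value, keeps every relation in the same shape, so code that reads trigger offsets does not need a special case.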
The complete version of the ID corpus enriched with causality annotation is available for download. NOTE: Please observe the terms of the BioCause corpus licence when downloading the corpus.
BioCause corpus licence
1. Copyright of abstracts
Any abstracts contained in this corpus are from PubMed(R), a database of the U.S. National Library of Medicine (NLM).
NLM data are produced by a U.S. Government agency and include works of the United States Government that are not protected by U.S. copyright law but may be protected by non-US copyright law, as well as abstracts originating from publications that may be protected by U.S. copyright law.
NLM assumes no responsibility or liability associated with use of copyrighted material, including transmitting, reproducing, redistributing, or making commercial use of the data. NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. Persons contemplating any type of transmission or reproduction of copyrighted material such as abstracts are advised to consult legal counsel.
2. Copyright of full texts
Any full texts contained in this corpus are from the PMC Open Access Subset of PubMed Central (PMC), the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature.
Articles in the PMC Open Access Subset are protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the license of each article for specific license terms.
3. Copyright of Named Entity and Event Annotations
See the GENIA Project License for Annotated Corpora
4. Copyright of BioCause Annotations
Please attribute the corpus by citing the following paper:
Mihăilă, C., Ohta, T., Pyysalo, S. and Ananiadou, S. (2013) BioCause: Annotating and analysing causality in the biomedical domain. In BMC Bioinformatics, 14(1):2.