PhenoCHF Download
The annotations may be downloaded for research purposes (please observe the terms of the licence below).
NOTES:
- Information about annotations is provided in separate files from the text that has been annotated. The format of these annotation files is described in detail on the annotation format page.
- The associated text files for each of the two document types in the corpus are obtained in different ways, as detailed below.
- Full text literature articles - These are open access papers, and we provide the plain text files that were used as a basis for the annotation as part of the corpus download. The basename of each file name is the PMID of the associated article.
- Narrative EHR reports - These form part of the dataset of de-identified clinical records released as part of the i2b2 2008 Obesity Challenge (NLP Dataset #2). The dataset must be obtained individually from Partners Healthcare by signing a Data Use Agreement.
- IMPORTANT NOTE: The i2b2 2008 Obesity Challenge Dataset is obtained as a single XML file containing all clinical records. Within the XML file, each document is contained within a <doc> element, whose id attribute assigns a unique id to each clinical record. Within each <doc> element, there is a <text> element, which contains the text of the clinical record.
- Annotation files are provided separately for each clinical record, in the format described on the annotation format page. The basename of each annotation file corresponds to the id of the clinical record, as specified in the id attribute of the corresponding <doc> element in the original dataset file.
- The annotation files assume that the text for each clinical record corresponds to the text that occurs between the <text> and </text> tags for the record in the original dataset file.
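The layout described above can be turned into one text file per clinical record with a short script. The following is a minimal sketch in Python, not an official tool: the function names, the output directory and the `.txt` extension are assumptions made for illustration, and only the <doc> element, its id attribute and the <text> element come from the description above. The text is written out unmodified (no trimming of whitespace), since the annotation files assume the text exactly as it appears between the <text> and </text> tags. Note that the XML parser decodes any character entities, which may or may not match what the annotations expect.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_records(xml_path):
    """Return a dict mapping each record id to the content of its <text> element."""
    root = ET.parse(xml_path).getroot()
    records = {}
    # iter() finds <doc> elements at any depth, because the nesting above
    # <doc> is not specified in the description (assumption).
    for doc in root.iter("doc"):
        text_el = doc.find("text")
        if text_el is None:
            continue  # skip a <doc> that has no <text> child
        # Keep the text exactly as parsed: no stripping or normalisation.
        records[doc.get("id")] = text_el.text or ""
    return records


def write_records(records, out_dir):
    """Write each record to <out_dir>/<id>.txt (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for rec_id, text in records.items():
        # Writing bytes avoids platform-specific newline translation.
        (out / f"{rec_id}.txt").write_bytes(text.encode("utf-8"))
```

With these helpers, `write_records(extract_records("obesity_records.xml"), "ehr_text")` would produce files whose basenames match those of the annotation files; both the input file name and the output directory here are placeholders.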
PhenoCHF corpus licence
1. Copyright of Literature Articles
The full text literature articles in the PhenoCHF corpus are drawn from the PMC Open Access Subset. These articles are protected by copyright, but are made available under a Creative Commons or similar licence that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the licence of each article for its specific terms.
2. Copyright of PhenoCHF annotations
The entity mention, relation and normalisation annotations in the PhenoCHF corpus were created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. They are licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus and cite one or more of the following papers, depending on which annotations are used:
Entity Annotations
Alnazzawi, N., Thompson, P., Batista-Navarro, R. and Ananiadou, S. (2015). Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. BMC Medical Informatics and Decision Making, 15(Suppl. 2): S3
Normalisation Annotations
Alnazzawi, N., Thompson, P. and Ananiadou, S. (2016). Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLOS ONE, 11(9): e0162287
Relation Annotations
Alnazzawi, N., Thompson, P. and Ananiadou, S. (2014). Building a semantically annotated corpus for congestive heart and renal failure from clinical records and the literature. In Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), pp. 69-74.