PhenoCHF is an annotated corpus consisting of documents belonging to two different text types (i.e., narrative reports from electronic health records (EHRs) and literature articles). It is manually annotated by medical doctors with detailed information relating to mentions of phenotype concepts and disease-phenotype relations.

The composition and annotations in PhenoCHF are aimed at allowing the development of robust text mining (TM) systems that can extract comprehensive phenotypic information from multiple textual sources with differing characteristics. For example, narrative EHRs typically exhibit non-standard grammatical structure and high levels of lexical and semantic variability, coupled with many domain-specific abbreviations, complex sentences and spelling errors (around 10% of words).

The documents in PhenoCHF focus on a specific medical condition, i.e., congestive heart failure (CHF). This focus is motivated by CHF's standing as one of the world's most deadly diseases. However, our experiments using the corpus have demonstrated that it can be used to develop systems that recognise information relating to a wider range of diseases, in a broader variety of text types, than those included in PhenoCHF.


The annotations may be downloaded for research purposes from the download page (please observe the terms of the licence below).

Annotation Levels

Three levels of information are annotated in PhenoCHF:

  • Entity mention annotation - Each word or phrase that describes a concept belonging to one of a pre-defined set of concept types is annotated and assigned a category label corresponding to the concept type. We refer to these annotations as entity mentions. All documents in PhenoCHF are annotated with entity mentions. The same types of concepts are annotated in both narrative EHRs and literature articles.
  • Normalisation annotation - Each entity mention is annotated with an associated UMLS CUI (Concept Unique Identifier). The CUI provides a link to the entry in the UMLS Metathesaurus (a large terminological resource that contains an inventory of biomedical concepts) that corresponds to the unique concept described by the entity mention. Since a given concept can be expressed in text in many different ways, these links between entity mentions and unique concepts provide the means to develop methods to automatically determine the exact concept that is referred to by each entity mention. This is especially challenging for narrative EHRs, in which concepts can be expressed in very diverse ways.
  • Relation annotation - In text, it is usual for entity mentions to be linked together to describe more complex pieces of information. In the narrative EHR reports in PhenoCHF, all instances of a number of different types of relations between entity mentions are annotated, wherever the text provides evidence that a specific type of link holds between the entity mentions. By developing systems that can identify relations automatically, it is possible to build sophisticated semantic search systems that allow documents to be filtered based on the presence of specific types of relations within them.
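
The three annotation levels can be pictured as a small in-memory data model. The Python sketch below is illustrative only: it is not the distribution format of the corpus, and the class names, the entity and relation labels, and the CUI values are hypothetical placeholders rather than anything taken from PhenoCHF. The only external fact it relies on is that a UMLS CUI is the letter C followed by seven digits.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# A UMLS CUI is the letter "C" followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


@dataclass(frozen=True)
class EntityMention:
    """Level 1 (entity mention) plus level 2 (normalisation) for one text span."""
    mention_id: str
    start: int                  # character offset where the span begins
    end: int                    # character offset just past the span
    label: str                  # concept type (hypothetical label names below)
    cui: Optional[str] = None   # link to a UMLS Metathesaurus concept


@dataclass(frozen=True)
class Relation:
    """Level 3: a typed link between two entity mentions."""
    label: str
    source_id: str
    target_id: str


@dataclass
class Document:
    doc_id: str
    text: str
    mentions: List[EntityMention] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def make_mention(doc_text, phrase, mention_id, label, cui=None):
    """Build a mention by locating the first occurrence of `phrase` in the text."""
    start = doc_text.index(phrase)
    return EntityMention(mention_id, start, start + len(phrase), label, cui)


def has_valid_cui(mention):
    """True if the mention carries a syntactically well-formed CUI."""
    return mention.cui is not None and CUI_PATTERN.fullmatch(mention.cui) is not None


def filter_by_relation(documents, label):
    """Keep only the documents that contain at least one relation of this type."""
    return [d for d in documents if any(r.label == label for r in d.relations)]


# Invented example sentence; labels and CUIs are placeholders, not real entries.
text = "Patient reports shortness of breath due to heart failure."
symptom = make_mention(text, "shortness of breath", "T1", "SignOrSymptom", "C0000001")
condition = make_mention(text, "heart failure", "T2", "Condition", "C0000002")
example = Document(
    doc_id="example-1",
    text=text,
    mentions=[symptom, condition],
    relations=[Relation("caused_by", "T1", "T2")],
)
```

Calling `filter_by_relation([example], "caused_by")` returns the example document, which is the kind of relation-based filtering that a semantic search system could build on.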


The study of disease-phenotype relationships has been hampered by the scarcity of suitable large-scale, machine-readable knowledge bases. Existing resources, such as the Online Mendelian Inheritance in Man (OMIM) and the Human Phenotype Ontology (HPO), are manually constructed, making them difficult to update and maintain. They could be enriched by exploiting the vast amounts of phenotypic information available in various textual sources, including the ever-growing volumes of published biomedical literature, and narrative patient EHRs, which provide a large amount of detail about patient conditions, such as diagnoses, findings, signs and symptoms, procedures, family history, etc. There is thus an urgent need to develop TM methods that can automate the extraction and integration of vital phenotypic information hidden in narrative text, to help derive information about disease correlations and thus support clinical decisions.

Developing TM tools for use in new domains is reliant upon textual corpora, in which pertinent information has been explicitly marked up by experts. Such annotated corpora serve both as training data for machine learning (ML) techniques and as a gold standard for systematic evaluation of new methodologies. Whilst TM techniques have been widely applied in the extraction of relationships involving genes and proteins from the biomedical literature, there has been little research into the extraction of disease-phenotype relationships, either from the literature or from narrative EHRs. This is largely due to the lack of suitably annotated EHR corpora, owing both to their sensitive data and the difficulty of applying de-identification techniques.

The PhenoCHF corpus is intended to provide the means to develop and evaluate novel methods that aim to extract phenotype-related information from text. The detail of annotation in the corpus, and its composition of heterogeneous document types, make it particularly suitable for developing robust, wide-coverage TM systems that are able to extract and link together complex information about phenotypes that may be dispersed across many documents of different types.


This work has been supported by the Medical Research Council (Supporting Evidence-based Public Health Interventions using Text Mining [Grant MR/L01078X/1]) and by the Defense Advanced Research Projects Agency (Big Mechanism [Grant DARPA-BAA-14-14]).

The deidentified clinical records used in this research (i.e., the narrative EHR reports) were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY.

PhenoCHF corpus licence

1. Copyright of Literature Articles

The full-text literature articles in the PhenoCHF corpus are drawn from the PMC Open Access Subset. These articles are protected by copyright, but are made available under a Creative Commons or similar licence that generally allows more liberal redistribution and reuse than a traditional copyrighted work. Please refer to the licence of each article for its specific terms.

2. Copyright of PhenoCHF annotations

The entity mention, relation and normalisation annotations in the PhenoCHF corpus were created at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. They are licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus and cite one or more of the following papers, depending on which annotations are used:

Entity Annotations

Alnazzawi, N., Thompson, P., Batista-Navarro, R. and Ananiadou, S. (2015). Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. BMC Medical Informatics and Decision Making, 15(Suppl. 2): S3

Normalisation Annotations

Alnazzawi, N., Thompson, P. and Ananiadou, S. (2016). Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLOS ONE, 11(9): e0162287

Relation Annotations

Alnazzawi, N., Thompson, P. and Ananiadou, S. (2014). Building a semantically annotated corpus for congestive heart and renal failure from clinical records and the literature. In Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi), pp. 69-74.