PhenoCHF Annotation Format
The downloadable corpus consists of:
- A set of annotation files, containing the manually-added annotations associated with each document file.
- A set of text files corresponding to the literature articles only.
NOTE: The text files for the narrative EHR reports are not included in the download. They form part of the corpus of de-identified clinical records released for the i2b2 2008 Obesity Challenge (NLP Dataset #2). The dataset must be obtained individually from Partners Healthcare by signing a Data Use Agreement.
Literature Articles
For literature articles, the text file and associated annotation files have the same base name, i.e., the PMIDs of the articles.
Narrative EHR reports
The i2b2 2008 Obesity Challenge Dataset is obtained as a single XML file, containing all clinical records. Within the XML file, each document is contained within a <doc> element, and the doc element has an id attribute, which assigns a unique id to each clinical record. Within each doc element, there is a <text> element, which contains the text of the clinical record.
- Annotation files are provided separately for each clinical record. The base name of each annotation file corresponds to the id of the clinical record, as specified in the id attribute of the corresponding doc element in the original dataset file.
- The annotation files assume that the text for each clinical record corresponds to the text that occurs between the <text> and </text> tags for the record in the original dataset file.
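The lookup described above can be sketched in Python as follows. This is a minimal sketch, not an official loader: the function name `load_record_texts` is hypothetical, and it assumes only what is stated above (doc elements with an id attribute, each containing a text element). It keeps the text exactly as it appears between the tags, without stripping whitespace, so that character offsets are not shifted. Note that the XML parser decodes character entities such as `&amp;`; whether the annotation offsets refer to the decoded or the raw text is an assumption worth checking against a few annotations.

```python
import xml.etree.ElementTree as ET


def load_record_texts(xml_source):
    """Map each clinical record id to the text inside its <text> element.

    xml_source is the content of the dataset file as a string.
    """
    root = ET.fromstring(xml_source)
    texts = {}
    for doc in root.iter("doc"):
        text_el = doc.find("text")
        if text_el is None:
            continue  # skip records without a <text> element
        # Do not strip: offsets are assumed to count from the first
        # character after the opening <text> tag.
        texts[doc.get("id")] = text_el.text or ""
    return texts
```

The annotations for a record can then be matched to `texts[record_id]`, where `record_id` is the base name of the annotation file.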
Annotation file formats
Annotations are encoded in the BioNLP Shared Task 2013 format, with some custom additions to allow normalisation annotations to be encoded. Based on this format, there are two annotation files associated with each text file:
- a1 files - encode information about entity annotations, polarity cues and normalisation annotations
- a2 files - encode information about relation annotations.
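As an illustration, the two annotation files for a given text file could be located as sketched below. The helper name `annotation_paths` is hypothetical, and the `.a1` and `.a2` file extensions are an assumption based on the usual BioNLP Shared Task convention; adjust them if the downloaded files are named differently.

```python
from pathlib import Path


def annotation_paths(text_file):
    """Return the (a1, a2) paths that share a base name with text_file.

    Assumes the annotation files sit next to the text file and use the
    extensions .a1 and .a2.
    """
    path = Path(text_file)
    return path.with_suffix(".a1"), path.with_suffix(".a2")
```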
a1 Files
In a1 files, each line corresponds to an annotation. There are two formats of lines, depending on whether they encode an entity mention annotation or a normalisation. The format of each type of line is described below:
Entity mention annotations
Entity mention annotations encode the text spans corresponding to phenotype concept mentions (or polarity cues for Negate relations, see below), and assign a semantic label, according to the type of concept being mentioned.
A sample of lines encoding entity annotations is shown below:
T1	Cause 128 151	coronary artery disease
T5	NontradRF 285 291	anemia
T6	SignOrSymptom 393 412	shortness of breath
T2	RiskFactor 211 233	deep venous thrombosis
T8	SignOrSymptom 451 469	bilateral crackles
T9	Organ 440 445	Lungs
T10	RiskFactor 6272 6281;6282 6290
Each line that encodes an entity mention consists of the following information:
- A unique id for the entity. By convention, this starts with T, followed by a numerical value.
- A TAB character.
- The concept type label assigned to the annotation (or PolCue for words or phrases that denote negation, i.e., polarity cues). The labels corresponding to each concept type are shown in Table 1.
- The character-based offsets of the entity annotation in the corresponding text file. There are two formats for the offsets, depending on whether the annotated span consists of a single, continuous span or a discontinuous span, consisting of multiple, connected spans. A discontinuous span may occur, for example, when an entity mention is broken over two lines.
- For continuous spans (as in the first 6 lines in the sample above), there are two offsets, corresponding to the start and end offsets of the span. The first offset is separated by a space from the entity type label, and there is a space between the start and end offsets.
- For discontinuous spans (as in the final line of the sample above), there are two or more pairs of start and end offsets, each separated by a semi-colon. Each pair of offsets corresponds to a part of the complete annotated span.
- Another TAB character.
- The text covered by the annotated span in the corresponding text file.
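The line format described above could be parsed along the following lines. This is a sketch under the stated format only; the names `Entity`, `parse_entity_line` and `span_fragments` are hypothetical and not part of any released tooling.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    entity_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_entity_line(line):
    """Parse one entity mention line from an a1 file."""
    fields = line.rstrip("\n").split("\t")
    entity_id = fields[0]
    # The label is separated from the offsets by a single space.
    label, offsets = fields[1].split(" ", 1)
    spans = []
    for pair in offsets.split(";"):  # ';' separates discontinuous fragments
        start, end = pair.split()
        spans.append((int(start), int(end)))
    text = fields[2] if len(fields) > 2 else ""
    return Entity(entity_id, label, spans, text)


def span_fragments(doc_text, entity):
    """Return the document text covered by each fragment of the entity."""
    return [doc_text[start:end] for start, end in entity.spans]
```

In the sample lines, the end offset minus the start offset equals the length of the covered text (for example, 151 - 128 = 23 characters for "coronary artery disease"), which suggests that the end offset is exclusive and matches Python slicing. Comparing `span_fragments` against the text field of each line is a simple way to confirm that the offsets line up with the document text being used.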
Table 1 provides the labels used for each concept type.
Concept type | Label used in annotation file |
---|---|
Cause | Cause |
Risk Factor | RiskFactor |
Sign & Symptom | SignOrSymptom |
Non-traditional risk factor | NonTradRF |
Organ | Organ |
Polarity Cue | PolCue |
Chief Complaint | ChiefComplaint |
Normalisation annotations
The normalisation annotations provide a mapping between each entity mention annotation and the identifier for a concept in the UMLS Metathesaurus (i.e., a UMLS CUI).
A sample of lines encoding normalisation annotations is shown below:
#1	UMLS_CUI T1	C1956346
#2	UMLS_CUI T5	C0002871
#3	UMLS_CUI T6	C0013404
#4	UMLS_CUI T2	C0149871
#5	UMLS_CUI T8	C2071429
The format of these lines is as follows:
- A unique numeric identifier for the normalisation annotation. This is preceded by a hash character (#).
- A TAB character.
- The string "UMLS_CUI"
- The identifier of the entity mention annotation to which the UMLS CUI has been assigned.
- A TAB character.
- The UMLS CUI that represents the concept described by the entity mention.
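A possible way to read these lines into a lookup from entity id to UMLS CUI is sketched below. The function names are hypothetical. The sketch splits on any whitespace rather than on TAB characters specifically, which is a deliberate simplification: none of the four fields contains a space, so this works whether the separators are tabs or spaces.

```python
def parse_normalisation_line(line):
    """Return (entity_id, cui) for one normalisation line of an a1 file."""
    note_id, kind, entity_id, cui = line.split()
    if not note_id.startswith("#") or kind != "UMLS_CUI":
        raise ValueError(f"not a normalisation line: {line!r}")
    return entity_id, cui


def cui_by_entity(lines):
    """Build a dict from entity id to CUI, ignoring non-normalisation lines."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            entity_id, cui = parse_normalisation_line(line)
            mapping[entity_id] = cui
    return mapping
```

Since entity lines start with T and normalisation lines start with #, the same a1 file can be passed to `cui_by_entity` unfiltered.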
a2 Files
In a2 files, each line corresponds to a relation annotation.
Relation annotations have the following format:
R12	Causality Arg1:T18 Arg2:T17
R25	Finding Arg1:T64 Arg2:T66
R13	Negate Arg1:T41 Arg2:T37
Each line consists of:
- A unique id for the relation annotation. By convention, this starts with R, followed by a numerical value.
- A TAB character.
- The relation type label assigned to the annotation. This is either Causality, Finding or Negate.
- Details of the two text spans that are linked in the relation.
- In the case of Causality and Finding relations, both text spans correspond to entity mentions.
- In the case of Negate relations, the first of the text spans is a polarity cue for negation, while the second is an entity mention.
- Each text span that is linked in a relation annotation is referred to as an argument. The first argument is denoted by the label Arg1 and the second argument is denoted by the label Arg2. In each case, the argument label is followed by a colon, and then by the ID of the corresponding text span (which corresponds to one of the T annotations introduced above).
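The relation line format above could be parsed as in the following sketch. The names `Relation` and `parse_relation_line` are hypothetical. The sketch assumes that the relation type and the two arguments are separated by spaces after the TAB character, as in the sample lines.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    relation_id: str
    rel_type: str  # Causality, Finding or Negate
    arg1: str      # id of the first linked T annotation
    arg2: str      # id of the second linked T annotation


def parse_relation_line(line):
    """Parse one relation line from an a2 file."""
    relation_id, rest = line.rstrip("\n").split("\t", 1)
    rel_type, *args = rest.split()
    # Each argument has the form "ArgN:Tm"; key the ids by argument label.
    roles = dict(arg.split(":", 1) for arg in args)
    return Relation(relation_id, rel_type, roles["Arg1"], roles["Arg2"])
```

For a Negate relation, `arg1` is then the id of the polarity cue and `arg2` the id of the negated entity mention; both ids can be resolved against the entity lines of the corresponding a1 file.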