COPD Corpus Annotation Format
The downloadable corpus consists of:
- The configuration files needed to display the annotations in brat: annotation.conf, visual.conf and tools.conf (see here for more details)
- Three directories ("train", "dev" and "test") which contain the annotated data, split into the training, development and test sets that were used in the experiments described in the associated article. Each of the directories contains the following two types of files:
  - A set of text files (.txt), each corresponding to a paragraph in a full-text article.
  - A set of annotation files (.ann), containing the manually-added annotations associated with each paragraph file.
Each text file and its associated annotation file share the same base name, which combines the article's PMCID with the paragraph number.
The naming convention is as follows:
[PMCID]_[paragraphNumber]
The paragraphs are numbered consecutively, starting from 0. For example, the file PMC2528206_0.txt contains the text of the first paragraph of the article with the PMCID PMC2528206, while the file PMC2740954_25.ann contains the annotations associated with the 26th paragraph of the article with the PMCID PMC2740954.
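The naming convention can be unpacked programmatically. The following is a minimal sketch, not part of the corpus distribution: the function name parse_paragraph_name and the ParagraphName type are hypothetical, and the pattern assumes that every PMCID has the form "PMC" followed by digits, as in the two file names quoted above.

```python
import re
from typing import NamedTuple

# "[PMCID]_[paragraphNumber]" with an optional .txt or .ann extension.
# Assumption: a PMCID is "PMC" followed by one or more digits.
_NAME_RE = re.compile(r"^(PMC\d+)_(\d+)(?:\.(?:txt|ann))?$")


class ParagraphName(NamedTuple):
    pmcid: str
    paragraph_index: int  # zero-based, exactly as written in the file name


def parse_paragraph_name(file_name: str) -> ParagraphName:
    """Split a corpus file name into its PMCID and zero-based paragraph index."""
    match = _NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    return ParagraphName(match.group(1), int(match.group(2)))


name = parse_paragraph_name("PMC2740954_25.ann")
# Add 1 to the index to obtain the human-readable paragraph position.
print(name.pmcid, name.paragraph_index + 1)
```

Because numbering starts from 0, the paragraph position in the article is always the index plus one.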
Annotation file format
Annotations in the ".ann" files are encoded in the format used by the brat annotation software.
Within each ".ann" file, each line corresponds to one of the following:
- An NE (named entity) mention.
- A link to a concept identifier (CUI) in the UMLS Metathesaurus.
A sample of lines encoding entity annotations and their links to concept identifiers is shown below:
T1	AnatomicalConcept 33 42	pulmonary
N8000	Reference T1 UMLSCUI:C0024109	pulmonary
T2	Drug 191 206	corticosteroids
N11000	Reference T2 UMLSCUI:C0001617	corticosteroids
T3	MedicalCondition 92 96	COPD
N1	Reference T3 UMLSCUI:C0024117	COPD
T4	SignOrSymptom 33 55	pulmonary inflammation
N3000	Reference T4 UMLSCUI:C0032285	pulmonary inflammation
N3001	Reference T4 UMLSCUI:C3714636	pulmonary inflammation
T5	Treatment 183 206	inhaled corticosteroids
N6000	Reference T5 UMLSCUI:C0001617	inhaled corticosteroids
There are two types of lines, beginning either with "T" or with "N".
Lines beginning with "T" (NE annotations) consist of the following information:
- A unique id for the annotation. By convention, this starts with T, followed by a numerical value.
- A TAB character.
- The NE type assigned to the annotation.
- The character-based offsets of the annotated span in the corresponding text file. There are two offsets, corresponding to the start and end offsets of the span. The first offset is separated by a space from the entity type label, and there is a space between the start and end offsets.
- Another TAB character.
- The text covered by the annotated span in the corresponding text file.
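The field layout of a "T" line can be read with two splits: one on TAB, then one on spaces. The sketch below is illustrative only: the names EntityMention, parse_entity_line and span_matches are hypothetical, and it assumes that each annotation covers a single contiguous span and that the end offset is exclusive, which is what the sample lines suggest (for example, 96 - 92 equals the length of "COPD").

```python
from typing import NamedTuple


class EntityMention(NamedTuple):
    ann_id: str   # e.g. "T3"
    ne_type: str  # e.g. "MedicalCondition"
    start: int    # start offset in the paragraph text
    end: int      # end offset (assumed exclusive)
    text: str     # text covered by the span


def parse_entity_line(line: str) -> EntityMention:
    """Parse one 'T' line: id TAB type SPACE start SPACE end TAB text."""
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t")
    # Assumption: exactly one contiguous span, so three space-separated fields.
    ne_type, start, end = type_and_offsets.split(" ")
    return EntityMention(ann_id, ne_type, int(start), int(end), text)


def span_matches(mention: EntityMention, paragraph_text: str) -> bool:
    """Check that the offsets select the recorded text in the .txt file content."""
    return paragraph_text[mention.start:mention.end] == mention.text


mention = parse_entity_line("T3\tMedicalCondition 92 96\tCOPD")
print(mention.ne_type, mention.start, mention.end, mention.text)
```

Running span_matches against the content of the matching .txt file is a cheap sanity check that the text and annotation files have not drifted apart, for instance after a change of character encoding or line endings.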
Lines beginning with "N" provide information about normalisations, i.e. links to CUIs in the UMLS Metathesaurus. They consist of the following information:
- A unique id for the annotation. By convention, this starts with N, followed by a numerical value.
- A TAB character.
- The word Reference, followed by a space.
- The id of the NE annotation to which the normalisation applies, followed by a space.
- Information about the concept to which the NE has been normalised. This consists of:
  - The string "UMLSCUI"
  - A colon
  - The unique concept identifier assigned to the NE within the specified resource
- Another TAB character.
- The text covered by the NE to which the concept ID has been assigned.
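An "N" line can be read in the same way as a "T" line, with one extra split on the colon. This is a minimal sketch under the assumption that the fields between the two TAB characters are separated by single spaces, as in the sample lines; the names Normalisation and parse_normalisation_line are hypothetical.

```python
from typing import NamedTuple


class Normalisation(NamedTuple):
    ann_id: str      # e.g. "N3001"
    target_id: str   # id of the NE annotation, e.g. "T4"
    resource: str    # "UMLSCUI" in this corpus
    concept_id: str  # e.g. "C3714636"
    text: str        # text covered by the NE


def parse_normalisation_line(line: str) -> Normalisation:
    """Parse one 'N' line: id TAB Reference SPACE target SPACE resource:cui TAB text."""
    ann_id, body, text = line.rstrip("\n").split("\t")
    keyword, target_id, reference = body.split(" ")
    if keyword != "Reference":
        raise ValueError(f"Expected 'Reference', found {keyword!r}")
    # Split on the first colon only: resource name before it, identifier after it.
    resource, concept_id = reference.split(":", 1)
    return Normalisation(ann_id, target_id, resource, concept_id, text)


norm = parse_normalisation_line(
    "N3001\tReference T4 UMLSCUI:C3714636\tpulmonary inflammation"
)
print(norm.target_id, norm.resource, norm.concept_id)
```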
Note that, in the example above, the NE with the ID T4, i.e., pulmonary inflammation, has been normalised to two separate concepts in the UMLS Metathesaurus, as denoted by the normalisation lines with the IDs N3000 and N3001.
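Since one NE may carry several normalisations, it can be convenient to collect the CUIs per NE id. The sketch below does this for a list of ".ann" lines; the function name cuis_by_entity is hypothetical, and the code assumes the TAB- and space-separated layout described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cuis_by_entity(ann_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each NE id (e.g. 'T4') to the CUIs of all its normalisation lines."""
    cuis: Dict[str, List[str]] = defaultdict(list)
    for line in ann_lines:
        if not line.startswith("N"):
            continue  # skip "T" lines and anything else
        _ann_id, body, _text = line.rstrip("\n").split("\t")
        _keyword, target_id, reference = body.split(" ")
        # Keep only the identifier after "UMLSCUI:".
        cuis[target_id].append(reference.split(":", 1)[1])
    return dict(cuis)


sample = [
    "T1\tAnatomicalConcept 33 42\tpulmonary",
    "N8000\tReference T1 UMLSCUI:C0024109\tpulmonary",
    "T4\tSignOrSymptom 33 55\tpulmonary inflammation",
    "N3000\tReference T4 UMLSCUI:C0032285\tpulmonary inflammation",
    "N3001\tReference T4 UMLSCUI:C3714636\tpulmonary inflammation",
]
print(cuis_by_entity(sample)["T4"])
```

For the lines taken from the example, T4 maps to two CUIs and T1 to one, in the order in which the normalisation lines appear.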