Named Entity Annotation
Annotation scheme
The PHAEDRA corpus is annotated with three named entity (NE) types, as a basis for linking the effects of drugs with information about the medical subject in which they occur. The scope of each NE type is outlined in Table 1.
Table 1. Types of NEs annotated
NE Type | Description | Examples |
Pharmacological_substance | Pharmacological substance that may or may not be approved for human use. | Genes/gene products used as therapeutic agents: echistatin; Generic drug names: didanosine; IUPAC and IUPAC-like chemical names of drugs: 5-hydroxy-L-tryptophan; Endogenous substances administered as exogenous drugs: insulin; Toxins: 1-methyl-4-phenyl-1,2,3,4-tetrahydropyridine; Excipients: isopropyl myristate; Generic or chemical names of metabolites: threohydrobupropion; Drug brand names: DIAMOX; Names of groups of drugs: fluoroquinolones; Expressions characterising general classes of drugs: dopamine D1 receptor antagonist |
Disorder | Observation about a medical subject's body or mind that is considered to be abnormal or caused by a disease, pharmacological substance or DDI. | Medical conditions: pulmonary embolism; Abnormality in physiological function: hyperlocomotion; Pathological process: fibrosis; Neoplastic process: intestinal adenocarcinomas; Damage caused by disease or drugs: cerebellar damage; Mental or behavioural issue: drug abuse; Injury or poisoning: clinical toxicities; Viruses/bacteria: Micrococcus luteus; Sign or symptom: nausea; Abnormality in clinical attributes or measurements: increased urine sodium |
Subject | An organism, cell line, bacterium or group thereof, whose characteristics are under discussion. The organism may be human or otherwise. | General references to groups of subjects: children; Names of specific species under discussion: mice; Names of bacteria under discussion: Klebsiella oxytoca; Expressions that specify a number of subjects: 16 patients; Descriptions of subject characteristics: 50-year old male patient |
NE mention statistics
Every mention of a concept belonging to one of the types in Table 1 was annotated in every abstract in the corpus. Table 2 shows, for each NE type, the total number of annotated mentions and the number of unique annotated spans.
Table 2. Statistics of NE Mentions
NE Type | Total number of annotated mentions | Number of unique annotated spans |
Pharmacological_substance | 8099 | 1853 |
Disorder | 4075 | 1998 |
Subject | 1552 | 712 |
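The difference between the two counts in Table 2 can be illustrated with a minimal Python sketch. It assumes that each annotation is available as a (type, text) pair and that a "unique span" is a distinct text string within an NE type. The source does not say whether uniqueness is case-sensitive, so the case folding below is an assumption, and the function name and toy data are invented for illustration.

```python
from collections import Counter

# Hypothetical toy annotations as (NE type, annotated text); not taken from the corpus.
mentions = [
    ("Pharmacological_substance", "insulin"),
    ("Pharmacological_substance", "Insulin"),
    ("Pharmacological_substance", "didanosine"),
    ("Disorder", "nausea"),
    ("Disorder", "nausea"),
    ("Subject", "mice"),
]


def mention_statistics(mentions, case_sensitive=False):
    """Return {ne_type: (total_mentions, unique_spans)}."""
    totals = Counter()
    unique = {}
    for ne_type, text in mentions:
        # Assumption: spans differing only in case count as the same span.
        key = text if case_sensitive else text.casefold()
        totals[ne_type] += 1
        unique.setdefault(ne_type, set()).add(key)
    return {t: (totals[t], len(unique[t])) for t in totals}


stats = mention_statistics(mentions)
for ne_type, (total, distinct) in stats.items():
    print(f"{ne_type} | {total} | {distinct} |")
```

A frequently repeated drug name therefore adds to the total every time it occurs but only once to the unique count, which is consistent with the gap between the two columns being largest for Pharmacological_substance.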
Agreement
The entity annotations were carried out by annotators with domain expertise. To verify the quality and consistency of the annotations, we calculated inter-annotator agreement (IAA) on one quarter of the complete corpus (i.e., 150 abstracts). IAA was measured as F-score under two matching criteria: exact span matching, where the start and end of the text spans chosen by the two annotators must coincide, and relaxed span matching, where it is sufficient for the two spans to overlap to any degree. The resulting F-scores are shown in Table 3.
Table 3. Inter-annotator agreement rates (F-score)
NE Type | Relaxed Match | Exact Match |
Pharmacological_substance | 96.0 | 92.8 |
Disorder | 91.9 | 80.7 |
Subject | 81.1 | 81.1 |
TOTAL | 92.6 | 86.0 |
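The two matching criteria can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used for the corpus. It assumes each annotation is a (start, end, type) triple with an exclusive end offset, that a match requires the same NE type under both criteria, and that one annotator is treated as the reference when computing precision and recall. All names and the toy data are hypothetical.

```python
def overlaps(a, b):
    """True if half-open character spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def spans_match(a, b, relaxed):
    # Assumption: spans of different NE types never match.
    if a[2] != b[2]:
        return False
    if relaxed:
        return overlaps(a, b)
    return (a[0], a[1]) == (b[0], b[1])


def f_score(ann_a, ann_b, relaxed=False):
    """F-score between two annotators, treating annotator A as the reference."""
    if not ann_a or not ann_b:
        return 0.0
    # Annotator B's spans that have a counterpart in A (precision numerator).
    matched_b = sum(any(spans_match(a, b, relaxed) for a in ann_a) for b in ann_b)
    # Annotator A's spans that have a counterpart in B (recall numerator).
    matched_a = sum(any(spans_match(a, b, relaxed) for b in ann_b) for a in ann_a)
    precision = matched_b / len(ann_b)
    recall = matched_a / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy annotations for one abstract; not drawn from the corpus.
annotator_a = [(0, 10, "Disorder"), (15, 22, "Subject"), (30, 40, "Pharmacological_substance")]
annotator_b = [(0, 10, "Disorder"), (12, 22, "Subject"), (50, 60, "Pharmacological_substance")]

exact = f_score(annotator_a, annotator_b, relaxed=False)
relaxed = f_score(annotator_a, annotator_b, relaxed=True)
print(f"exact: {100 * exact:.1f}  relaxed: {100 * relaxed:.1f}")
```

In the toy data the two Subject spans differ only in their start offset, so they count as a match under the relaxed criterion but not under the exact one. Boundary disagreements of this kind are the sort of difference that separates the two columns of Table 3.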