In literature, information is mainly written in a natural language, which means the information is not directly accessible to computers. It is a general expectation that the better a computer can decode a language, the better it can access the information which is encoded in the language. With this background, application of natural language processing (NLP) technology is becoming more and more popular for text mining (TM) from biomedical literature.
In recent decades, text corpora have been at the center of NLP research. A corpus is a large collection of texts which is often intended to represent a certain domain or style of texts or language expressions. In NLP, text annotation often means attaching a computable interpretation to texts, providing computers with direct access to desired information which is aligned to relevant text pieces. It has often been observed that a well-created corpus with annotations establishes or promotes research in NLP, information retrieval (IR) or information extraction (IE), providing reference materials for the development of computational systems (Penn Treebank)(MUC data sets)(TREC data sets).
The GENIA corpus is a collection of Medline abstracts which is intended to represent the literature of molecular biology. The main value of the corpus comes from annotations made to the corpus at various levels. These GENIA annotations have contributed to the community by providing reference materials, and many works based on the resource have been reported.
GENIA annotation has been done from two perspectives: to make the biomedical knowledge encoded in the text transparent (semantic annotation), and to reveal the syntactic structure behind the text (linguistic annotation).
Although the linguistic structure of text, such as the phrasal or dependency structure, may not be the main interest of text mining practitioners, it is often studied by text mining researchers to improve their systems. It is generally accepted that information about the linguistic structure of text is helpful in accessing the information encoded in the text. Knowing about the linguistic structure of text is like having a map, perhaps not complete, of a mine’s topography, showing paths and suggesting potential places to dig for pieces of knowledge.
Tokenization and POS labeling are often regarded as the first steps of NLP processing. These steps determine the basic units of a sentence and their properties, e.g., grammatical or syntactic identity. Figure 1 shows an example of a sentence which has been tokenized and POS-labeled. Note that punctuation and parentheses are usually split from adjoining words to make separate tokens.
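The tokenization step described above can be illustrated with a minimal sketch. It is not the tokenizer used for GENIA; the regular expression, the function name `tokenize` and the example sentence are assumptions made for illustration. It only shows how punctuation and parentheses become separate tokens while hyphenated names stay whole.

```python
import re

# Hypothetical pattern: a word (optionally with internal hyphens),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into tokens, separating punctuation and parentheses."""
    return TOKEN_PATTERN.findall(sentence)


# Illustrative sentence, not taken from the corpus.
tokens = tokenize("IL-2 (interleukin-2) activates T cells.")
print(tokens)
```

A POS tagger would then assign one label from the tag set to each of these tokens.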
We have annotated 1,999 Medline abstracts with 42 part-of-speech labels according to the Penn Treebank POS tagging scheme, which is a de facto standard. Tsuruoka et al. (2005) reported that they could improve part-of-speech tagging accuracy on Medline text from 91.6% to 98.5% by using the part-of-speech-annotated GENIA corpus as training material.
Training corpus | Test corpus: WSJ | Test corpus: GENIA |
---|---|---|
WSJ | 97.2% | 91.6% |
WSJ + GENIA | 97.2% | 98.5% |
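To put the accuracy gain on GENIA text in perspective, it can be expressed as a relative reduction in tagging errors. The short calculation below is only a sketch based on the two accuracy figures reported above; the function name `relative_error_reduction` is an assumption introduced for illustration.

```python
def relative_error_reduction(acc_before, acc_after):
    """Return the fraction of errors removed, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracy on GENIA text: trained on WSJ only vs. WSJ + GENIA.
reduction = relative_error_reduction(91.6, 98.5)
print(f"Relative error reduction: {reduction:.1%}")
```

Under this reading, the error rate falls from 8.4% to 1.5%, so roughly four out of five tagging errors are removed.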
Syntactic analysis reveals how words in a sentence are organized to form the meaning of the sentence. In a sentence, words may be grouped together into phrases. Similarly, phrases together with or without other words may also be grouped to form larger phrases. This process may continue until eventually the whole sentence is grouped together, yielding a tree structure of phrases with the root element covering the whole sentence, internal elements corresponding to phrases and leaf elements corresponding to words.
Figure 2 shows a sentence annotated for syntactic structures.
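The tree structure described above is commonly serialized in a bracketed notation in Penn Treebank style resources. The following is a minimal sketch of reading such a string into a nested structure and recovering the words at the leaves. The example sentence, its labels and the function names are assumptions for illustration, not an excerpt from the GENIA annotation.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested phrase
            else:
                children.append(tokens[pos])   # word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read_node()


def leaves(tree):
    """Collect the words at the leaves, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example: root S covers the sentence, NP and VP are phrases.
tree = parse_bracketed(
    "(S (NP (NN IL-2)) (VP (VBZ activates) (NP (NN T) (NNS cells))))"
)
print(tree[0], leaves(tree))
```

The root element covers the whole sentence, internal tuples correspond to phrases, and the strings at the bottom are the words.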
We have annotated 1,200 Medline abstracts with their parse trees. In general, we tried to follow the Penn Treebank II (PTB) bracketing guidelines (Bies et al., 1995), which are a de facto standard. Hara et al. (2005) reported that they could improve the parsing performance of an HPSG parser from an F-score of 85.1% to 86.9% by using the syntactically annotated GENIA corpus as training material.
Biological entities like proteins or genes are the most fundamental structures of interest in biomedical research, and detecting mentions of such entities in texts is considered crucial to accessing useful information. In GENIA, biological entities are annotated during the process of term annotation, which covers technical terms from biology including entity names. The definition and classification of such terms come from the GENIA ontology.
Figure 3 shows a sentence annotated for terms. For example, the text span “Mice” is annotated as a multi-cell organism (“Multi_cell”). Term annotations may be recursive. For example, the three text spans “human T cell leukemia virus”, “HTLV-1” and “Tax” are annotated as terms inside a larger text span, “human T cell leukemia virus (HTLV-1) Tax gene”, which is itself annotated as a term.
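One simple way to picture recursive term annotation is as character offset spans, where an inner term's span lies inside the span of the enclosing term. The sketch below uses the example text from above. The span representation and the helper names `span_of` and `contains` are assumptions for illustration and do not reflect the actual GENIA file format; ontology labels are omitted because they are not given here for these spans.

```python
outer_text = "human T cell leukemia virus (HTLV-1) Tax gene"
inner_texts = ["human T cell leukemia virus", "HTLV-1", "Tax"]


def span_of(text, term):
    """Return the (start, end) character offsets of the first match of term."""
    start = text.find(term)
    if start < 0:
        raise ValueError(f"{term!r} not found")
    return (start, start + len(term))


def contains(outer, inner):
    """True if the inner span lies entirely within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


outer_span = (0, len(outer_text))
inner_spans = [span_of(outer_text, term) for term in inner_texts]

for term, span in zip(inner_texts, inner_spans):
    print(term, span, contains(outer_span, span))
```

Each of the three inner spans is contained in the outer span, which is what makes the annotation recursive rather than flat.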
The GENIA term annotation is grounded in the GENIA ontology, which defines biomedically meaningful nominal concepts. Figure 4 shows the GENIA ontology, where concepts are classified in a hierarchy. Note that the terminal concepts are presented in bold boxes. They define the terms that need to be identified from the literature and thus become the target of annotation. The numbers appearing next to the labels of terminal concepts indicate their frequency in the GENIA corpus version 3.01.
A total of 1,999 Medline abstracts have been annotated with term labels defined in the GENIA ontology. The GENIA term-annotated corpus is now widely used as one of the de facto standards for bio-entity annotation. Table 2 lists some state-of-the-art bio-entity recognition systems which are trained using the GENIA term-annotated corpus.
Bio-entity recognition system | Recall | Precision | F-score |
---|---|---|---|
SVM+HMM (Zhou et al., 2004) | 76.0 | 69.4 | 72.6 |
Semi-Markov CRFs (in prep.) | 72.7 | 70.4 | 71.5 |
Two-Phase (Kim et al., 2005) | 72.8 | 69.7 | 71.2 |
Sliding Window (in prep.) | 71.5 | 70.2 | 70.8 |
CRF (Settles, 2005) | 72.0 | 69.1 | 70.5 |
MEMM (Finkel et al., 2004) | 71.6 | 68.6 | 70.1 |
... | ... | ... | ... |
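The F-scores listed above are consistent with the balanced F-score, that is, the harmonic mean of precision and recall. The sketch below assumes this definition, which is not stated explicitly here, and recomputes the value for each listed system; the function name `f_score` is chosen for illustration.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (system, recall, precision) as listed in the table.
systems = [
    ("SVM+HMM", 76.0, 69.4),
    ("Semi-Markov CRFs", 72.7, 70.4),
    ("Two-Phase", 72.8, 69.7),
    ("Sliding Window", 71.5, 70.2),
    ("CRF", 72.0, 69.1),
    ("MEMM", 71.6, 68.6),
]

computed = [round(f_score(r, p), 1) for _, r, p in systems]
for (name, _, _), f in zip(systems, computed):
    print(f"{name}: {f}")
```

Rounded to one decimal place, the recomputed values match the F-score column.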
The GENIA corpus has been annotated for parts of speech, syntactic trees and biomedical terms. The corpus together with the rich annotation is used by many researchers and practitioners in the bio-text mining community, and many state-of-the-art IE systems have been developed by making use of it as reference material.
Future work includes extending the scope of the corpus to include full papers, as we believe that richer and more detailed knowledge can be found in full papers than in abstracts.