The GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. If you need to extract information from biomedical documents, this tagger might be a useful preprocessing tool.
You need gcc to build the tagger.
> tar xvzf geniatagger.tar.gz
> cd geniatagger/
> make
Prepare a text file containing one sentence per line, then run the tagger on it:
> ./geniatagger < RAWTEXT > TAGGEDTEXT
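If you would rather drive the tagger from a script than from the shell, the same pipe can be sketched in Python. This is only an illustration, not part of the distribution: the function name `run_tagger` and its `command` and `cwd` parameters are made up here, and the sketch assumes the `geniatagger` binary built above reads sentences from standard input exactly as in the command line shown.

```python
import subprocess


def run_tagger(sentences, command=("./geniatagger",), cwd=None):
    """Pipe one sentence per line to the tagger and return its raw output.

    `command` and `cwd` are hypothetical conveniences of this sketch:
    `cwd` would normally be the directory where the tagger was built.
    """
    # A newline inside a sentence would split it into several input lines,
    # so collapse all internal whitespace to single spaces first.
    lines = [" ".join(s.split()) for s in sentences]
    result = subprocess.run(
        list(command),
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        cwd=cwd,
        check=True,  # raise CalledProcessError if the tagger exits with an error
    )
    return result.stdout
```

Calling `run_tagger(["First sentence.", "Second sentence."], cwd="geniatagger")` should then be roughly equivalent to the shell redirection above, with the tagged text returned as a string instead of written to TAGGEDTEXT.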
The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.
word1 base1 POStag1 chunktag1 NEtag1
word2 base2 POStag2 chunktag2 NEtag2
: : : : :
Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
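The five-column format is easy to load into a program. The following is a minimal sketch, assuming the fields are separated by single tab characters as described above and that a blank line marks the end of a sentence (an assumption of this sketch; check it against your own output). The names `Token` and `parse_tagged` are invented for the example.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    base: str
    pos: str
    chunk: str
    ne: str


def parse_tagged(text: str) -> List[List[Token]]:
    """Group tagger output into sentences, each a list of Token records."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # Assumed sentence boundary: a blank line.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"expected 5 tab-separated fields, got {len(fields)}: {line!r}"
            )
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences
```

Each token can then be accessed by field name, for example `token.base` or `token.ne`.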
> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | ./geniatagger
Inhibition Inhibition NN B-NP O
of of IN B-PP O
NF-kappaB NF-kappaB NN B-NP B-protein
activation activation NN I-NP O
reversed reverse VBD B-VP O
the the DT B-NP O
anti-apoptotic anti-apoptotic JJ I-NP O
effect effect NN I-NP O
of of IN B-PP O
isochamaejasmin isochamaejasmin NN B-NP O
. . . O O
You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
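Reading phrases off the IOB2 tags can be sketched as follows. The helper `iob2_spans` is a name made up for this example; it collects each run that starts with a B- tag and continues with matching I- tags. The words and tags are copied from the sample output above.

```python
def iob2_spans(words, tags):
    """Return (label, phrase) pairs for each B-/I- run in an IOB2 tag sequence."""
    spans, label, buf = [], None, []
    for word, tag in zip(words, tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and kind == label:
            buf.append(word)  # continue the currently open span
            continue
        if buf:
            spans.append((label, " ".join(buf)))  # close the open span
        if prefix in ("B", "I"):
            # A stray I- tag with no open span is not valid IOB2;
            # this sketch leniently treats it as the start of a span.
            label, buf = kind, [word]
        else:
            label, buf = None, []
    if buf:
        spans.append((label, " ".join(buf)))
    return spans


words = ["Inhibition", "of", "NF-kappaB", "activation", "reversed", "the",
         "anti-apoptotic", "effect", "of", "isochamaejasmin", "."]
chunk_tags = ["B-NP", "B-PP", "B-NP", "I-NP", "B-VP", "B-NP",
              "I-NP", "I-NP", "B-PP", "B-NP", "O"]
ne_tags = ["O", "O", "B-protein", "O", "O", "O", "O", "O", "O", "O", "O"]

# Keep only the noun-phrase chunks.
noun_phrases = [phrase for label, phrase in iob2_spans(words, chunk_tags)
                if label == "NP"]
# The same routine works on the named entity column.
entities = iob2_spans(words, ne_tags)

print(noun_phrases)
print(entities)
```

For this sentence, `noun_phrases` holds the four noun phrases listed above and `entities` holds the single pair `("protein", "NF-kappaB")`.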
General-purpose part-of-speech taggers do not usually perform well on biomedical text because the lexical characteristics of biomedical documents differ considerably from those of newspaper articles, which are often used as training data for general-purpose taggers. The GENIA tagger is trained not only on the Wall Street Journal corpus but also on the GENIA corpus and the PennBioIE corpus [1], so it works well on various types of biomedical documents. The table below shows the tagging accuracies of taggers trained on different sets of documents. For details of the performance, see [2]; the latest version uses a different tagging algorithm [3] and gives slightly better performance than reported in the paper.
Tagger | Wall Street Journal | GENIA corpus |
---|---|---|
A tagger trained on the WSJ corpus | 97.05% | 85.19% |
A tagger trained on the GENIA corpus | 78.57% | 98.49% |
GENIA tagger | 96.94% | 98.26% |
Entity Type | Recall | Precision | F-score |
---|---|---|---|
Protein | 81.41 | 65.82 | 72.79 |
DNA | 66.76 | 65.64 | 66.20 |
RNA | 68.64 | 60.45 | 64.29 |
Cell Line | 59.60 | 56.12 | 57.81 |
Cell Type | 70.54 | 78.51 | 74.31 |
Overall | 75.78 | 67.45 | 71.37 |
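The F-score column is consistent with the usual balanced F-score, the harmonic mean of recall and precision. That reading is an assumption checked against the figures, not something the table states. A short sketch that recomputes the column from the other two (`f_score` and `rows` are names chosen for this example):

```python
def f_score(recall, precision):
    """Balanced F-score: the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the table above.
rows = {
    "Protein": (81.41, 65.82),
    "DNA": (66.76, 65.64),
    "RNA": (68.64, 60.45),
    "Cell Line": (59.60, 56.12),
    "Cell Type": (70.54, 78.51),
    "Overall": (75.78, 67.45),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: {f_score(recall, precision):.2f}")
```

Each recomputed value agrees with the F-score listed in the table to two decimal places.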
This page is maintained by Yoshimasa Tsuruoka. Questions and suggestions are welcome.
Department of Information Science, Faculty of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.