DECA: A Species Disambiguation System for Biological Named Entities

Overview

The DECA (Disease Extraction with Concept Association) project, a one-year project funded by Pfizer, concerns automatically associating concepts with entity mentions in biomedical text (e.g., MEDLINE abstracts). Much of the research went into lexical disambiguation of biomedical names, because the same string of words can carry different meanings depending on the context. A more sensible way to organise information is by concepts, where each concept has an unambiguous meaning and can be associated with a unique identifier. For text mining to be useful to the biological sciences community, one crucial step is to link the hidden and ambiguous mentions of named entities in text to unique concepts in knowledge bases.

In particular, DECA tackled one major source of ambiguity in entity mentions: model organisms. Model organisms are species studied to understand particular biological phenomena. Biological experiments are often conducted on one species, with the expectation that the discoveries will provide insight into the workings of others, including humans, which are more difficult to study directly. Dozens of organisms, ranging from viruses and prokaryotes to plants and animals, are commonly used in biological studies (e.g., E. coli, C. elegans, Drosophila, Homo sapiens), and hundreds more are frequently mentioned in biological research papers. Given an article, readers often need to know which organisms the biomedical entities (e.g., proteins) belong to, and on which organisms the experiments were carried out.

DECA approaches organism disambiguation by automatically identifying species-indicating words (e.g., human) and biomedical named entities (e.g., protein P53) in text, and then judging whether each species-entity relation is positive, where a positive relation means that the entity belongs to the organism indicated by the species-indicating word. Natural language syntactic parsers and machine learning techniques were applied to classify species-entity relations.
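To make the notion of a species-entity relation concrete, the following is a minimal Python sketch of how candidate relations might be enumerated for one sentence before classification. It is not DECA's implementation: the Mention class, the candidate_relations function, the two features, and the example sentence are all hypothetical, and the sketch leaves out the syntactic parser features and the trained classifier described above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    """A marked-up span in a tokenised sentence (hypothetical structure)."""
    text: str
    start: int  # index of the first token
    end: int    # index one past the last token
    kind: str   # "species" or "entity"


def candidate_relations(mentions):
    """Pair every species word with every entity mention in one sentence."""
    species = [m for m in mentions if m.kind == "species"]
    entities = [m for m in mentions if m.kind == "entity"]
    pairs = []
    for s, e in product(species, entities):
        # Number of tokens between the two mentions (0 when adjacent).
        if s.end <= e.start:
            gap = e.start - s.end
        else:
            gap = max(0, s.start - e.end)
        pairs.append({
            "species": s.text,
            "entity": e.text,
            "token_gap": gap,
            "species_first": s.start < e.start,
        })
    return pairs


# Made-up example sentence, already tokenised and annotated by hand.
tokens = "Mutations in human p53 were compared with mouse Mdm2 .".split()
mentions = [
    Mention("human", 2, 3, "species"),
    Mention("p53", 3, 4, "entity"),
    Mention("mouse", 7, 8, "species"),
    Mention("Mdm2", 8, 9, "entity"),
]
pairs = candidate_relations(mentions)
```

Each resulting pair would then be labelled positive or negative by a classifier. Surface features such as token distance are only an assumption here; the features actually used are described in the papers listed under References.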

System

DECA is a pipeline system consisting of the following Natural Language Processing software components:

Genia Tagger: Performs linguistic pre-processing, including tokenisation, lemmatisation, and part-of-speech tagging. It also identifies a number of biomedical named entities: protein, DNA, RNA, Cell Line, and Cell Type.
Species Word Detector: Marks up words that indicate organisms (e.g., human, homo sapiens, mouse).
Abbreviation Detection: Identifies abbreviations and their corresponding long forms.
Syntactic Parser: Generates dependency or predicate-argument relations between words. The demo version uses the Minipar parser.
Species Disambiguation System: Using the information gathered from the aforementioned applications, it disambiguates the entity mentions with respect to their species and assigns a unique NCBI Taxonomy ID to each entity.
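The component list above can be read as a chain of stages that each add annotations to a shared document. The Python sketch below illustrates that pipeline shape only. Every function name is hypothetical, the entity tagger is a fixed-name stand-in rather than the Genia Tagger, the abbreviation and parsing stages are omitted, and the final stage uses a simple nearest-species-word heuristic as a placeholder for DECA's trained relation classifier. The two NCBI Taxonomy IDs in the lexicon (9606 for Homo sapiens, 10090 for Mus musculus) are real.

```python
import re

# Tiny illustrative lexicon: species-indicating word -> NCBI Taxonomy ID.
SPECIES_LEXICON = {
    "human": 9606,
    "homo sapiens": 9606,
    "mouse": 10090,
}


def tag_entities(doc):
    """Stand-in for a named entity tagger: matches a fixed list of names."""
    doc["entities"] = [
        {"start": m.start(), "end": m.end(), "text": m.group()}
        for m in re.finditer(r"\b(?:p53|Mdm2)\b", doc["text"])
    ]
    return doc


def detect_species_words(doc):
    """Mark up species-indicating words with character offsets."""
    words = sorted(SPECIES_LEXICON, key=len, reverse=True)
    pattern = "|".join(re.escape(w) for w in words)
    doc["species"] = [
        {"start": m.start(), "end": m.end(),
         "tax_id": SPECIES_LEXICON[m.group().lower()]}
        for m in re.finditer(rf"\b(?:{pattern})\b", doc["text"], flags=re.IGNORECASE)
    ]
    return doc


def assign_species(doc):
    """Placeholder heuristic: use the tax ID of the closest species word."""
    for entity in doc["entities"]:
        nearest = min(
            doc["species"],
            key=lambda s: abs(s["start"] - entity["start"]),
            default=None,
        )
        entity["tax_id"] = nearest["tax_id"] if nearest else None
    return doc


def run_pipeline(text, stages):
    """Run each stage in order over a shared annotation dictionary."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    "Human p53 was compared with mouse Mdm2.",
    [tag_entities, detect_species_words, assign_species],
)
result = {e["text"]: e["tax_id"] for e in doc["entities"]}
```

Keeping each stage as a function over the same dictionary means a component, such as the parser, could be swapped or added without changing the others, which is one plausible reading of the pipeline design described here.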

Demo

A web-based demonstration of DECA is available here.

References

Xinglong Wang and Michael Matthews. (2008). Distinguishing the Species of Biomedical Named Entities for Term Identification. BMC Bioinformatics, 9(Suppl 11):S6.
Xinglong Wang and Claire Grover. (2008). Learning the Species of Biomedical Named Entities from Annotated Corpora. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco.