The DECA (Disease Extraction with Concept Association) project, a one-year project funded by Pfizer, concerns automatically associating concepts with entity mentions in biomedical text (e.g., MEDLINE abstracts). A considerable amount of research was put into the lexical disambiguation of biomedical names, because the same string of words often carries different meanings depending on the context, which causes ambiguity. A more sensible way to organise information is by concepts, where a concept has an unambiguous meaning and can be associated with a unique identifier. To make text mining useful for the biological sciences community, one crucial step is to link the hidden and ambiguous mentions of named entities in text to unique concepts in knowledge bases.
In particular, DECA tackled one major source of ambiguity in entity mentions: model organisms. Model organisms are species studied to understand particular biological phenomena. Biological experiments are often conducted on one species, with the expectation that the discoveries will provide insight into the workings of others, including humans, which are more difficult to study directly. From viruses and prokaryotes to plants and animals, dozens of organisms are commonly used in biological studies, such as E. coli, C. elegans, Drosophila and Homo sapiens, and hundreds more are frequently mentioned in biological research papers. Given an article, readers often need to know which organisms the biomedical entities (e.g., proteins) belong to, and on which organisms the experiments were carried out.
DECA approaches organism disambiguation by automatically identifying species-indicating words (e.g., human) and biomedical named entities (e.g., protein P53) in text, and then judging whether each species-entity relation is positive, where a positive relation means that the entity belongs to the organism indicated by the species-indicating word. Natural language syntactic parsers and machine learning techniques were applied to classify species-entity relations.
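The idea of pairing species-indicating words with entity mentions can be illustrated with a minimal Python sketch. It is not DECA's implementation: the lexicon, the function names and the example sentence are all hypothetical, and a simple nearest-species heuristic stands in for the syntactic parsers and machine-learned classifier described above. It only shows how candidate species-entity pairs might be formed and labelled positive or negative.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon of species-indicating words (not DECA's resources).
SPECIES_WORDS = {"human", "mouse", "murine", "yeast"}


@dataclass(frozen=True)
class Mention:
    text: str
    start: int  # token index within the sentence
    kind: str   # "species" or "entity"


def tokenize(sentence):
    """Split a sentence into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def find_mentions(tokens, entity_names):
    """Tag tokens as species-indicating words or (pre-supplied) entity names."""
    mentions = []
    for i, tok in enumerate(tokens):
        if tok.lower() in SPECIES_WORDS:
            mentions.append(Mention(tok, i, "species"))
        elif tok in entity_names:
            mentions.append(Mention(tok, i, "entity"))
    return mentions


def candidate_pairs(mentions):
    """Every (species word, entity) combination in the sentence is a candidate relation."""
    species = [m for m in mentions if m.kind == "species"]
    entities = [m for m in mentions if m.kind == "entity"]
    return [(s, e) for e in entities for s in species]


def nearest_species_baseline(mentions):
    """Stand-in for a trained classifier: a pair is positive only if the
    species word is the one closest (in tokens) to the entity."""
    species = [m for m in mentions if m.kind == "species"]
    labels = {}
    for e in (m for m in mentions if m.kind == "entity"):
        if not species:
            continue
        nearest = min(species, key=lambda s: abs(s.start - e.start))
        for s in species:
            labels[(s.text, e.text)] = s is nearest
    return labels


# Invented example sentence, used only to exercise the sketch.
tokens = tokenize("Human P53 binds mouse MDM2 in vitro.")
mentions = find_mentions(tokens, entity_names={"P53", "MDM2"})
for pair, positive in nearest_species_baseline(mentions).items():
    print(pair, "positive" if positive else "negative")
```

A proximity heuristic of this kind fails whenever the relevant species word is far from the entity or sits in a different clause, which is the sort of case where features derived from a syntactic parse and a learned classifier would be expected to help.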
DECA is a pipeline system consisting of the following Natural Language Processing software components:
A web-based demonstration of DECA is available here.