NaCTeM

DECA

Overview

DECA (Disease Extraction with Concept Association) was a one-year project funded by Pfizer. It concerned the automatic extraction of associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). A considerable amount of the research went into lexical disambiguation of biomedical names.

Motivation

Manually searching for pieces of information in the ocean of research papers can be very difficult and time-consuming. The task becomes even more demanding when searching biomedical texts, which tend to involve large numbers of named entities in specialised domains. In addition, biologists are often interested in particular types of associations between entities, such as interactions between proteins and relations between symptoms and diseases. Traditional information retrieval techniques offer little help in speeding up such searches, because they neither recognise and index biomedical named entities (e.g., proteins, genes and diseases) nor capture the links among them.

Therefore, it is desirable to build a software tool that uses natural language processing and text mining technologies to automatically recognise and disambiguate biomedical named entities and find their associations. A search engine built on such a tool could then make searches more efficient and enjoyable.

Challenges

The challenges of this project include, among others:
  • Recognising different types of biomedical named entities. Named entity recognition systems can achieve relatively good performance in terms of precision and recall, provided they have sufficient resources for development, such as human-annotated training data and comprehensive dictionaries. However, such resources are scarce for many types of entities, and recognising those entities remains challenging.
  • Resolving lexical ambiguity in biomedical named entities. For example, the same text string can refer to different types of entities and/or to the same type of entity but different species (e.g., human or mouse). Distinguishing their meanings according to the context that they occur in can be very tricky.
  • Given the sheer amount of text to be processed, making the software tool efficient is not a trivial task.
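
To make the species ambiguity in the second challenge concrete, here is a minimal Python sketch. It is a toy illustration only, not the method used in DECA (the publication listed below describes an approach based on natural language parsers). The lexicon, the cue-word list and the function name guess_species are all hypothetical; the only external facts assumed are the NCBI Taxonomy IDs 9606 (human) and 10090 (mouse). The sketch assigns a gene mention the species of the nearest species cue word in the same sentence.

```python
import re

# Hypothetical cue words mapped to NCBI Taxonomy species IDs
# (9606 = Homo sapiens, 10090 = Mus musculus).
SPECIES_CUES = {
    "human": 9606, "humans": 9606,
    "mouse": 10090, "mice": 10090, "murine": 10090,
}

# Hypothetical toy lexicon: an ambiguous name and the species it may belong to.
CANDIDATES = {"p53": {9606, 10090}}


def guess_species(text, mention):
    """Return the species ID of the cue word closest to the mention, or None."""
    tokens = re.findall(r"\w+", text.lower())
    mention = mention.lower()
    if mention not in tokens:
        return None
    pos = tokens.index(mention)          # position of the first occurrence
    allowed = CANDIDATES.get(mention, set())
    best = None                          # (token distance, species ID)
    for i, tok in enumerate(tokens):
        tax_id = SPECIES_CUES.get(tok)
        if tax_id in allowed:            # skips non-cue tokens (tax_id is None)
            dist = abs(i - pos)
            if best is None or dist < best[0]:
                best = (dist, tax_id)
    return best[1] if best else None


if __name__ == "__main__":
    print(guess_species("Expression of p53 was reduced in mouse liver.", "p53"))
    print(guess_species("In human cells, p53 is activated, unlike in mice.", "p53"))
    print(guess_species("p53 is a tumour suppressor.", "p53"))
```

A nearest-cue heuristic like this breaks down quickly on real text: cue words may be absent, may sit in a different clause, or may refer to another entity altogether, which is part of why context-based disambiguation is hard.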

More Details

Please click here for more details and a Web demonstration of DECA. Please click here for a corpus in which mentions of gene/gene products are disambiguated with respect to model organisms and assigned NCBI Taxonomy species IDs.

Project Team

Principal Investigator: Prof. Sophia Ananiadou
Co-investigator: Prof. Jun'ichi Tsujii

Publications

Xinglong Wang, Jun'ichi Tsujii and Sophia Ananiadou. (2010). Disambiguating the Species of Biomedical Named Entities Using Natural Language Parsers. Bioinformatics, 26(5):661-667; doi: 10.1093/bioinformatics/btq002