Species disambiguation of biomedical named entities: release of software, corpus and article


Text mining technologies have been shown to reduce the laborious work involved in organising the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts.

The DECA project has released a corpus for organism disambiguation in which every occurrence of a protein/gene entity is manually tagged with a species ID.

Software trained on the corpus has also been released, both as a web-based demo and as U-Compare components.

The creation of the corpus and the training of the software are described more fully in a newly released article:

Xinglong Wang, Jun'ichi Tsujii and Sophia Ananiadou (2010). Disambiguating the Species of Biomedical Named Entities Using Natural Language Parsers. Bioinformatics.

