DECA
Overview
DECA (Disease Extraction with Concept Association) was a one-year project funded by Pfizer. It concerned the automatic extraction of associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). Considerable research effort went into the lexical disambiguation of biomedical names.
Motivation
Manually searching for pieces of information in the ocean of research papers can be very difficult and time-consuming. The task becomes even more demanding when searching biomedical texts, which tend to involve large numbers of named entities in specialised domains. In addition, biologists are often interested in particular types of associations between entities, such as interactions between proteins and relations between symptoms and diseases. Traditional information retrieval techniques do little to speed up such searches, because they neither recognise and index biomedical named entities (e.g., proteins, genes and diseases) nor capture the links among them.
Therefore, it is desirable to build a software tool that uses natural language processing and text mining technologies to automatically recognise and disambiguate biomedical named entities and find their associations. A search engine built on such a tool should make searches more efficient and enjoyable.
Challenges
The challenges of this project include, among others:
- Recognising different types of biomedical named entities. Named entity recognition systems can achieve relatively good performance in terms of precision and recall, provided they have sufficient resources for development, such as human-annotated training data and comprehensive dictionaries. However, such resources are scarce for many types of entities, and recognising those entities remains challenging.
- Resolving lexical ambiguity in biomedical named entities. For example, the same text string can refer to different types of entities, or to the same type of entity in different species (e.g., human or mouse). Distinguishing these meanings according to the context in which they occur can be very tricky.
- Processing text efficiently. Given the sheer amount of text to be processed, making the software tool efficient is not a trivial task.
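To make the species ambiguity described above concrete, the following is a minimal sketch of a nearest-cue-word heuristic. It is not DECA's method, which the publication below describes as relying on natural language parsers. The cue-word table and the function name `guess_species` are hypothetical and chosen for illustration only; the NCBI Taxonomy IDs used are 9606 for human and 10090 for mouse.

```python
import re

# Hypothetical cue words mapped to NCBI Taxonomy IDs
# (9606 = Homo sapiens, 10090 = Mus musculus). Illustrative only.
SPECIES_CUES = {
    "human": 9606,
    "patient": 9606,
    "patients": 9606,
    "mouse": 10090,
    "mice": 10090,
    "murine": 10090,
}


def guess_species(sentence, mention):
    """Return the taxonomy ID of the cue word nearest to `mention`.

    Returns None if the mention is not found as a single token or if the
    sentence contains no cue word. Ties go to the earlier cue word.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    mention_token = mention.lower()
    if mention_token not in tokens:
        return None
    mention_pos = tokens.index(mention_token)

    best_id, best_distance = None, None
    for pos, token in enumerate(tokens):
        if token in SPECIES_CUES:
            distance = abs(pos - mention_pos)
            if best_distance is None or distance < best_distance:
                best_id, best_distance = SPECIES_CUES[token], distance
    return best_id


if __name__ == "__main__":
    text = "Expression of p53 was reduced in mice but not in human cells."
    print(guess_species(text, "p53"))
```

A heuristic of this kind only handles single-token mentions and fails whenever the nearest cue word belongs to a different clause, which is one reason the problem is described as tricky and why syntactic context is useful.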
More Details
Please click here for more details and a Web demonstration of DECA. Please click here for a corpus in which mentions of gene/gene products are disambiguated with respect to model organisms and assigned NCBI Taxonomy species IDs.
Project Team
Principal Investigator: Prof. Sophia Ananiadou
Co-investigator: Prof. Jun'ichi Tsujii
Publications
Xinglong Wang, Jun'ichi Tsujii and Sophia Ananiadou. (2010). Disambiguating the Species of Biomedical Named Entities Using Natural Language Parsers. Bioinformatics, 26(5):661-667; doi: 10.1093/bioinformatics/btq002