Automated Biological Event Extraction from the Literature for Drug Discovery
Overview
This is a three-year collaborative project between NaCTeM and AstraZeneca, which started on 1 September 2009. The aim is to enhance our ability to extract information from the growing body of literature, making the synthesis of that information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will allow decision making in a drug discovery project to draw on as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.
To achieve this aim, the objectives and research novelties of this project are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocess associations across the vast amount of literature.
NaCTeM will carry out research on automatic event and biological process recognition from texts, with help from AstraZeneca's domain expertise.
Background
Over the last decade, despite a doubling in industrial and public funding for biomedical research, approval of new medical entities by regulatory agencies has halved. Only 11% of molecules that enter pre-clinical development reach the market, putting the average R&D cost of a new medicine at an estimated $454m. Among the most commonly cited reasons for this high clinical attrition rate, especially in later phase III clinical trials, are idiosyncratic drug-induced toxicity and the lack of drug efficacy over and above placebos, especially if the compound has a novel mechanism of action.
To reduce this high drug attrition rate in late-phase clinical trials, we crucially need to improve 'Confidence in Rationale' of the candidate drug target(s). Such confidence comes from clear scientific evidence of how, when modulated, a target affects critical pathophysiological processes, leading to the cure or prevention of the disease or the amelioration of its symptoms in the clinical setting. Typically a bank of pre-clinical evidence is developed using cell lines, model organisms and clinical samples associating a target with key bioprocesses (and so disease phenotype). However, the primary starting point for target choice, and the context for interpretation of all pre-clinical observations, is the literature. Manual techniques and conventional information retrieval techniques are unable to deliver timely, reliable, exhaustive and specific results given the vastness of the literature and its speed of growth. There is thus an immediate and urgent need for advanced automated means to support drug target identification by flagging protein involvement in key bioprocesses and tracking the accumulation of evidence over time (hypothesis generation).
Text mining (TM) is increasingly used to support knowledge discovery and hypothesis generation, and to manage the mass of biological literature. However, the TM systems commonly used to help researchers discover direct associations between biomedical terms typically rely on co-occurrence approaches, which look at how frequently entities co-occur in the same articles or sentences. Such approaches often fail to recognise the underlying mechanisms, that is, how biological entities are involved in biological processes, because surface words are often ambiguous (i.e., have different meanings depending on the context). In such cases, deep semantic analysis of the text is required.
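The limitation of co-occurrence approaches can be illustrated with a minimal sketch. It is not the project's software: the entity dictionary, the function name sentence_cooccurrence and the three toy sentences are all assumptions made up for illustration, and a real system would use a named entity recogniser rather than exact string matching.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary of entity names (an assumption for this sketch).
ENTITIES = {"p53", "apoptosis", "MDM2"}


def sentence_cooccurrence(text, entities=ENTITIES):
    """Count how often each pair of known entities shares a sentence."""
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        # Sort so that each pair is always stored in the same order.
        found = sorted(e for e in entities if e in tokens)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


# Toy sentences invented for illustration; one of them is negated.
text = (
    "p53 induces apoptosis. "
    "MDM2 does not induce apoptosis. "
    "MDM2 inhibits p53."
)
counts = sentence_cooccurrence(text)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Every pair receives the same count, so the negated sentence and the two positive statements are indistinguishable, and nothing records which entity acts on which. Recovering that information is what event extraction based on deeper semantic analysis aims to do.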
Project Team
- NaCTeM Team
  - Principal Investigators: Prof. Sophia Ananiadou and Prof. Jun-ichi Tsujii
  - Researcher: Dr. Xinglong Wang
- AstraZeneca Team
  - Advisory Group: Dr. Ian Dix, Dr. Tim French, Dr. Mark Pearson and Dr. Darren Cross
  - Delivery Team: Mr. Iain McKendrick and Dr. Ian Barrett






