Automated Biological Event Extraction from the Literature for Drug Discovery


This is a three-year collaborative project between NaCTeM and AstraZeneca that started on 1 September 2009. The aim is to enhance our ability to extract information from the growing body of literature, making the synthesis of that information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will allow decision making in a drug discovery project to draw on as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.

To achieve this aim, the objectives of this project and its research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocess associations in the vast amount of literature.

NaCTeM will carry out research on automatic event and biological process recognition from texts, with help from AstraZeneca's domain expertise.

Demonstration Systems

FACTA+ for Cancer Research

Angiogenesis is the physiological process involving the growth of new blood vessels from pre-existing vessels. Identifying genes and other molecules that regulate this bioprocess has become an important line of research in cancer treatment. A new version of FACTA+ is available for finding angiogenesis-associated genes and other biological entities in text.

Angiogenesis Event Extraction

We built a text mining pipeline that extracts and highlights terms and events describing the angiogenesis bioprocess, as well as biological entities such as genes, gene products, tissues and cells. A web-based demonstration system is available.


Multi-Level Event Extraction (MLEE) corpus

To be able to present a comprehensive picture of the workings of biological systems, information extraction approaches must take into account not only the molecular-level reactions but also the cellular, tissue, and organ-level processes that produce the organism-level effects that are of primary interest in much of biomedical domain research. To extend the coverage of the event extraction approach to domain information extraction, we have created the Multi-Level Event Extraction (MLEE) corpus, consisting of manually annotated abstracts of publications on angiogenesis, the development of new blood vessels from existing ones, an area of high interest in cancer research. The corpus annotation was created with reference to a previously introduced annotation in which subdomain experts identified spans of text expressing statements relevant to their interests. To create the MLEE corpus, we established ontological foundations for the annotation with reference to community-standard OBO Foundry resources such as the Gene Ontology (GO) and the Common Anatomy Reference Ontology (CARO), revising the existing span annotations accordingly to identify over 8,000 entities with fine-grained types and introducing structured annotation for over 6,000 events.
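
As a rough illustration of what "structured annotation" of events across levels can look like, the following Python sketch represents an event as a typed trigger with role-labelled arguments, where an argument may itself be another event. It is only a sketch under stated assumptions: the class names, the helper `mark`, the example sentence and all type and role labels are invented for illustration and are not taken from the MLEE corpus or its annotation scheme.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Span:
    """A stretch of text with a type label (an entity mention or an event trigger)."""
    start: int
    end: int
    text: str
    label: str


@dataclass
class Event:
    """A typed trigger plus role-labelled arguments; an argument may itself be an event."""
    trigger: Span
    label: str
    arguments: Dict[str, Union[Span, "Event"]] = field(default_factory=dict)


def mark(sentence: str, surface: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `surface` and label it."""
    start = sentence.index(surface)
    return Span(start, start + len(surface), surface, label)


# Invented example sentence; every label below is illustrative only.
sentence = "VEGF induces vessel growth."
vegf = mark(sentence, "VEGF", "Gene_or_gene_product")
vessel = mark(sentence, "vessel", "Tissue")

# Tissue-level process: growth of the vessel.
growth = Event(mark(sentence, "growth", "Trigger"), "Growth", {"Theme": vessel})

# Molecular-level cause acting on the tissue-level event (a nested event).
regulation = Event(
    mark(sentence, "induces", "Trigger"),
    "Positive_regulation",
    {"Cause": vegf, "Theme": growth},
)
```

Allowing an event to appear as an argument of another event is what lets a single structure connect a molecular-level entity to a tissue-level process, which is the kind of cross-level statement the corpus is designed to capture.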

AnEM corpus

To advance automatic anatomical entity mention detection, we have created the AnEM corpus, a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers with the aim of making the corpus representative of the entire available biomedical scientific literature. The corpus annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions.


Success in the BioCreAtIvE III Challenges

NaCTeM took part in the 2010 BioCreAtIvE (Critical Assessment of Information Extraction in Biology) challenge. The team participated in the protein-protein interaction (PPI) challenge and achieved the best performance in the Interaction Method Task (IMT). This task involves automatically detecting the experimental techniques used in research articles to support given PPIs. Such detection is crucial not only for the correct annotation of experimentally determined protein interactions but also for other annotations, such as evidence codes in the Gene Ontology, and for assigning other controlled vocabulary terms to an article. Among the systems submitted by 8 international teams, NaCTeM's yielded the best overall performance as measured by a range of evaluation metrics. The NaCTeM BioCreAtIvE team consisted of S. Ananiadou, R.T. Batista-Navarro, R. Nawaz, C. Nobata, R. Rak, A. Restificar, C.J. Rupp and X. Wang.


Over the last decade, despite a doubling in industrial and public funding for biomedical research, approval of new medical entities by regulatory agencies has halved. Only 11% of molecules that enter pre-clinical development reach the market, putting the average R&D cost of a new medicine at an estimated $454m. Among the most commonly cited reasons for this high clinical attrition rate, especially in later phase III clinical trials, are idiosyncratic drug-induced toxicity and the lack of drug efficacy over and above placebos, especially if the compound has a novel mechanism of action.

To reduce this high drug attrition rate in late-phase clinical trials, we crucially need to improve 'Confidence in Rationale' of the candidate drug target(s). Such confidence comes from clear scientific evidence of how, when modulated, a target affects critical pathophysiological processes, leading to cure or prevention of the disease or amelioration of symptoms in the clinical setting. Typically a bank of pre-clinical evidence is developed using cell lines, model organisms and clinical samples, associating a target with key bioprocesses (and so disease phenotype). The primary starting point for target choice, and the context for interpretation of all pre-clinical observations, is nevertheless the literature. Manual techniques and conventional information retrieval techniques are unable to deliver timely, reliable, exhaustive and specific results given the vastness of the literature and its speed of growth. There is thus an immediate and urgent need for advanced automated means to support drug target identification by flagging protein involvement in key bioprocesses and tracking accumulation of evidence over time (hypothesis generation).

Text mining (TM) is increasingly used to support knowledge discovery and hypothesis generation, and to manage the mass of biological literature. However, the TM systems commonly used to help researchers discover direct associations between biomedical terms typically rely on co-occurrence approaches, which look at how frequently entities co-occur in the same articles or sentences. Such approaches often fail to recognise the underlying mechanisms, that is, how biological entities are involved in biological processes, because surface words are often ambiguous (i.e., have different meanings depending on the context), in which case deep semantic analysis of the text is required.
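
To make the limitation concrete, here is a minimal Python sketch of a sentence-level co-occurrence counter of the kind described above. It is an assumption-laden illustration, not the method of any particular system: the function name `sentence_cooccurrence`, the naive sentence splitter, the substring matching and the example sentences are all invented for this purpose.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    lowered_terms = sorted({t.lower() for t in terms})
    # Naive sentence split: whitespace that follows '.', '!' or '?'.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Naive matching: a term is "present" if it occurs as a substring.
        present = [t for t in lowered_terms if t in lowered]
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented example sentences, used only to illustrate the behaviour.
text = (
    "VEGF promotes angiogenesis in tumours. "
    "Loss of VEGF does not affect angiogenesis in this model. "
    "TP53 is mutated in many tumours."
)
counts = sentence_cooccurrence(text, ["VEGF", "angiogenesis", "TP53"])
```

In this toy input the pair ("angiogenesis", "vegf") is counted twice, once for a sentence asserting an effect and once for a sentence denying one. The counter cannot tell the two apart, nor say which entity acts on which, and that is the gap that deeper semantic analysis of the text is meant to close.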

Project Team

NaCTeM Team
AstraZeneca Team
  • Advisory Group: Dr. Ian Dix, Dr. Tim French, Dr. Mark Pearson and Dr. Darren Cross
  • Delivery Team: Mr. Iain McKendrick and Dr. Ian Barrett


Kano, Y., Björne, J., Ginter, F., Salakoski, T., Buyko, E., Hahn, U., Cohen, K. B., Verspoor, K., Roeder, C., Hunter, L., Kilicoglu, H., Bergler, S., Van Landeghem, S., Van Parys, T., Van de Peer, Y., Miwa, M., Ananiadou, S., Neves, M., Pascual-Montano, A., Ozgur, A., Radev, D. R., Riedel, S., Sætre, R., Chun, H.-W., Kim, J.-D., Pyysalo, S., Ohta, T. and Tsujii, J. (2011). U-Compare bio-event meta-service: compatible BioNLP event extraction services. BMC Bioinformatics, 12:481

Miwa, M., Miyao, Y., Sætre, R. and Tsujii, J. (2010). Entity-Focused Sentence Simplification for Relation Extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pp. 788-796.

Miwa, M., Pyysalo, S., Hara, T. and Tsujii, J. (2010). Evaluating Dependency Representation for Event Extraction. In Proceedings of COLING 2010, pp. 779-787.

Miwa, M., Pyysalo, S., Hara, T. and Tsujii, J. (2010). A Comparative Study of Syntactic Parsers for Event Extraction. In Proceedings of the BioNLP 2010 Workshop, pp. 37-45.

Miwa, M., Sætre, R., Kim, J.-D. and Tsujii, J. (2010). Event Extraction with Complex Event Classification Using Rich Features. Journal of Bioinformatics and Computational Biology 8(1), 131-146

Miwa, M., Thompson, P. and Ananiadou, S. (2012). Boosting automatic event extraction from the literature using domain adaptation and coreference resolution. Bioinformatics

Miwa, M., Thompson, P., McNaught, J., Kell, D. B. and Ananiadou, S. (2012). Extracting semantically enriched events from biomedical literature. BMC Bioinformatics, 13:108.

Mu, T. and Ananiadou, S. (2010). Proximity-based graph embeddings for multi-label classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2010), Valencia, Spain.

Mu, T., Goulermas, J. Y., Tsujii, J. and Ananiadou, S. (2012). Proximity-based Frameworks for Generating Embeddings from Multi-Output Data. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Mu, T., Miwa, M., Tsujii, J. and Ananiadou, S. (2012). Discovering Robust Embeddings in (Dis)Similarity Space for High-Dimensional Linguistic Features. Computational Intelligence.

Mu, T., Wang, X., Tsujii, J. and Ananiadou, S. (2010). Imbalanced classification using dictionary-based Prototypes and Hierarchical Decision Rules for Entity Sense Disambiguation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, pp. 851-859.

Ohta, T., Matsuzaki, T., Okazaki, N., Miwa, M., Sætre, R., Pyysalo, S. and Tsujii, J. (2010). Medie and Info-pubmed: 2010 update. BMC Bioinformatics, 11(Suppl 5), P7

Ohta, T., Pyysalo, S., Ananiadou, S. and Tsujii, J. (2011). Pathway Curation Support as an Information Extraction Task. In Proceedings of the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011).

Ohta, T., Pyysalo, S., Kim, J.-D. and Tsujii, J. (2010). A re-evaluation of biomedical named entity-term relations. Journal of Bioinformatics and Computational Biology (JBCB), 8(5), 917-928.

Ohta, T., Pyysalo, S., Miwa, M. and Tsujii, J. (2010). Event Extraction for DNA Methylation. In Proceedings of the Fourth International Symposium for Semantic Mining in Biomedicine (SMBM 2010)

Ohta, T., Pyysalo, S. and Tsujii, J. (2011). From Pathways to Biomolecular Events: Opportunities and Challenges. In Proceedings of the BioNLP 2011 Workshop, pp. 105-113.

Ohta, T., Pyysalo, S., Tsujii, J. and Ananiadou, S. (2012). Open-domain Anatomical Entity Mention Detection. In Proceedings of the ACL Workshop on Detecting Structure in Scholarly Discourse (DSSD), pp. 27-36.

Pyysalo, S., Ohta, T. and Ananiadou, S. (2011). Anatomical Entity Recognition with Open Biomedical Ontologies. In Proceedings of the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011).

Pyysalo, S., Ohta, T. and Tsujii, J. (2010). An Analysis of Gene/Protein Associations at PubMed Scale. In Proceedings of the Fourth International Symposium for Semantic Mining in Biomedicine (SMBM 2010).

Pyysalo, S., Ohta, T., Cho, H.-C., Sullivan, D., Mao, C., Sobral, B., Tsujii, J. and Ananiadou, S. (2010). Towards Event Extraction from Full Texts on Infectious Diseases. In Proceedings of BioNLP 2010, pp. 132-140.

Pyysalo, S., Ohta, T., Miwa, M., Cho, H.-C., Tsujii, J. and Ananiadou, S. (2012). Event extraction across multiple levels of biological organization. Bioinformatics, 28(18), i575-i581.

Pyysalo, S., Ohta, T., Miwa, M. and Tsujii, J. (2011). Towards Exhaustive Event Extraction for Protein Modifications. In Proceedings of the BioNLP 2011 Workshop, pp. 114-123.

Pyysalo, S., Stenetorp, P., Ohta, T., Kim, J.-D. and Ananiadou, S. (2012). New Resources and Perspectives for Biomedical Event Extraction. In Proceedings of the BioNLP 2012 Workshop, pp. 100-108.

Tsuruoka, Y., Miwa, M., Hamamoto, K., Tsujii, J. and Ananiadou, S. (2011). Discovering and visualising indirect associations between biomedical concepts. Bioinformatics, 27(13), i111-i119.

Wang, X., McKendrick, I., Barrett, I., Dix, I., French, T., Tsujii, J. and Ananiadou, S. (2011). Automatic Extraction of Angiogenesis Bio-Process from Text. Bioinformatics, 27(19), 2730-2737.

Wang, X., Rak, R., Restificar, A., Nobata, C., Rupp, C. J., Batista-Navarro, R. T., Nawaz, R. and Ananiadou, S. (2010). NaCTeM Systems for BioCreative III PPI Tasks. In Proceedings of the BioCreative III Workshop, Bethesda, MD, USA.

Wang, X., Rak, R., Restificar, A., Nobata, C., Rupp, C., Batista-Navarro, R., Nawaz, R. and Ananiadou, S. (2011). Detecting Experimental Techniques and Selecting Relevant Documents for Protein-Protein Interactions from Biomedical Literature. BMC Bioinformatics, 12(Suppl 8):S11. (Best performing system in BioCreative III's Interaction Method Task.)