Text Mining Research Projects

In addition to the development of text mining services and software tools, members of the National Centre for Text Mining are involved in a variety of other projects which influence and contribute to the work of the centre:

Current Projects


Prof. Sophia Ananiadou, director of NaCTeM, and Prof. Angelo Cangelosi's Cognitive Robotics Lab at Manchester are collaborating with the Artificial Intelligence Research Center (AIRC) in Tokyo, Japan, to carry out fundamental NLP and text mining research. AIRC is directed by Prof. Jun'ichi Tsujii, who also holds the position of Professor of Text Mining at the University of Manchester and is NaCTeM's scientific advisor. This research was also linked with Prof. Sophia Ananiadou's Alan Turing fellowship. AIRC is exploring possible directions of NLP/machine learning research with ATI fellows, and funding was allocated to facilitate this exploration.


The Exposome Project for Health and Occupational Research (EPHOR) will lay the groundwork for evidence-based and cost-effective prevention for improving health at work, by developing a working life exposome toolbox. The project consortium consists of 19 exposure, health, and data scientists and technology partners from 12 different countries, who will work together to advance occupational health science in a unique way to reduce the burden of disease. NaCTeM's text mining expertise will feed into the development of novel methods to model and assess complex individual exposures and exposure interactions, and to link these to health outcomes via biological pathway analysis.

Mental Health

Over recent years, there has been an increased focus on how advances in NLP can be used to aid mental health research. Our work focuses predominantly on the detection of depression, suicide notes and suicide ideation, using deep learning and knowledge-based approaches.


NaCTeM is collaborating with the New Energy and Industrial Technology Development Organization (NEDO), Japan, and the Artificial Intelligence Research Center (AIRC), Japan, to increase the accessibility of AI technology.

Past Projects


Prof. Sophia Ananiadou, director of NaCTeM, and Mr. John Kwan, director at 10BE5 Ltd., have joined forces in this Knowledge Transfer Partnership (KTP) programme to automate a critical, time-consuming fact-checking and verification process for companies listing, or looking to list, securities on public markets. It will apply UK-based cutting-edge research in natural language processing (NLP). The project will deliver novel AI-based software capabilities for public companies, private companies preparing for securities offerings, and their advisors for automated claim detection, fact-checking, and verification that addresses consistency.


The ADVISES project will create a new way for scientists to communicate with computers. At present they have to use difficult tools which require them to speak the computers' language rather than express what they want in English. Worse still, the computer tools don't talk to each other, so scientists have to use separate tools for statistics, for visually displaying results on a map, etc. We will deal with these problems by analysing the way scientists express their requests in English to create a 'sub-language' - that is, a restricted set of English for asking scientific questions and saying how results should be displayed. A video explaining the implemented demo system is available.


The JISC-funded ASSERT (Automatic Summarisation for Systematic Reviews using Text Mining) project is a continuation of the work of the National Centre for Text Mining into the area of social sciences. The overall aim of ASSERT is to encourage greater participation by the social sciences community in e-Research by developing a summarisation service to facilitate the production of systematic reviews and to support a number of community projects related to text mining applications.


The ASSIST project investigates the benefits of text mining in two case studies within the social science disciplines. This includes a review of the requirements gathering stage in order to advise future projects in this area and the development of high profile exemplars demonstrating how text mining solutions can solve, in part at least, major challenges facing e-Researchers across all domains.

AstraZeneca Project (Automated Biological Event Extraction from the Literature for Drug Discovery)

This is a collaborative project between NaCTeM and AstraZeneca, which started on 1 September 2009 and runs for 3 years. The aim is to enhance our abilities to extract information from the growing corpus of literature, to make the process of synthesising the information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will help enable the decision-making processes in a drug discovery project to take place using as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.

To achieve this aim, the objectives of this project and the research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocesses from the vast amount of literature.

Arabic WordNet

Arabic WordNet involves the construction of an Arabic WordNet, following the development process of Princeton WordNet and Euro WordNet. It utilizes the Suggested Upper Merged Ontology as an interlingua to link Arabic WordNet to previously developed wordnets.

Automated screening for systematic reviews

This project aims to develop new text mining methods to assist with screening in systematic reviews and health technology assessments. The new methods developed will include automated screening prioritisation, such that studies at the top of the list are those that are most likely to be relevant for manual screening, and automatic classification of documents, according to whether they should be included or excluded from the screening process. The new methods aim to reduce the burden of screening in reviews, to allow reviews to be completed more quickly, to minimise the impact of publication bias and to reduce the chances that relevant research will be missed.
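
As an illustration only, screening prioritisation and include/exclude classification can be sketched with a small word-count model. The sketch below is not the project's method: it substitutes a simple smoothed log-odds (naive Bayes style) score, and every function name, the toy abstracts and the threshold are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Count tokens separately for relevant (True) and irrelevant (False) texts."""
    counts = {True: Counter(), False: Counter()}
    for text, is_relevant in labelled:
        counts[is_relevant].update(tokenize(text))
    return counts


def relevance_score(text, counts):
    """Sum the add-one smoothed log-odds (relevant vs irrelevant) of known tokens."""
    vocab = set(counts[True]) | set(counts[False])
    total = {label: sum(c.values()) + len(vocab) for label, c in counts.items()}
    score = 0.0
    for tok in tokenize(text):
        if tok not in vocab:
            continue  # tokens never seen in training carry no evidence
        p_rel = (counts[True][tok] + 1) / total[True]
        p_irr = (counts[False][tok] + 1) / total[False]
        score += math.log(p_rel / p_irr)
    return score


def prioritise(unscreened, counts):
    """Order unscreened texts so the most likely relevant ones come first."""
    return sorted(unscreened, key=lambda t: relevance_score(t, counts), reverse=True)


def classify(text, counts, threshold=0.0):
    """Include a text when its score exceeds the (assumed) threshold."""
    return relevance_score(text, counts) > threshold


# Toy, made-up screening decisions standing in for human-assessed examples.
labelled = [
    ("randomised trial of statin therapy in adults", True),
    ("statin therapy reduces cardiovascular events", True),
    ("survey of hospital parking charges", False),
    ("hospital catering costs survey", False),
]
unscreened = [
    "parking survey at a regional hospital",
    "trial of statin therapy in older adults",
]

counts = train(labelled)
ranked = prioritise(unscreened, counts)
for text in ranked:
    print(round(relevance_score(text, counts), 2), classify(text, counts), text)
```

In this sketch the same score serves both purposes: sorting by it gives the prioritised list, and comparing it with a threshold gives the include/exclude decision. A real system would use far richer features and a properly evaluated model.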

Big Mechanism

Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. Whilst the collection of big data is increasingly automated, the creation of big mechanisms remains a largely human effort, which is becoming increasingly challenging owing to the fragmentation and distribution of knowledge. The ability to automate the construction of big mechanisms could have a major impact on scientific research. As one of a number of different projects that make up the big mechanism programme, our aim is to assemble an overarching big mechanism from the literature and prior experiments and to utilise this for the probabilistic interpretation of new patient panomics data. We will integrate machine reading of the cancer literature with probabilistic reasoning across cancer claims using specially-designed ontologies, computational modeling of cancer mechanisms (pathways), automated hypothesis generation to extend knowledge of the mechanisms and a 'Robot Scientist' that performs experiments to test the hypotheses. A repetitive cycle of text mining, modelling, experimental testing, and worldview updating is intended to lead to increased knowledge about cancer mechanisms.


The BBC News Browser Pilot Project aims to analyse, structure and visualise BBC news available on the Web according to a user's query using advanced text mining techniques. The outcomes include a web demonstrator of two concept clustering tools and presentations to identified sets of potential users within New Media & Technology, News and BBC Monitoring, and to BBC Research/Technology Group.


BOOTStrep (Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project) is an international joint EU project (Ref. FP6 - 028099), which aims at building reusable wide-coverage lexical, conceptual and factual knowledge resources for the biology domain, involving the exploitation and combination of existing terminological resources (thesauri, classification systems, etc.) within a common, standardized representation framework.

Bott and Co.

Bott and Co (B&C) is a leading consumer law firm specialising in personal injury, flight delay and holiday illness compensation claims working on a no-win no-fee basis. The firm is the UK's most experienced and trusted authority on flight delay compensation regulation EU261. The firm handles over 50,000 flight delay compensation claims each year. The project will embed process innovation to automate, improve decision making, reduce wasted costs and improve service delivery. Some basic level automation has been introduced but claims that pass the initial rule-based check are then manually triaged. Given the low margin of the work, maximising automation is fundamental to ensure a profitable return. Introducing predictive analysis will empower staff to take more complex decisions without referral to senior staff.


CheTA integrated Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration has added chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and the wider chemistry community. After a baseline study (UCC and RSC) and the integration were accomplished, the project used the CheTA tools to index a corpus of documents of different types and provenance. CheTA developed a rigorous evaluation framework with annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly' - RSC/NaCTeM), user requirements studies for the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC) and comparing extracted metadata against the usefulness (all project partners). Finally, the economic cost of metadata generation by both human indexers and robots was quantified.

Clinical Trials

The aim of the Clinical Trials project is to develop an efficient search application customised to clinical trials, which addresses the information overload problem and assists in the creation of new protocols. Text and data mining methods will be applied to large clinical trial collections in order to enrich clinical trial documents with metadata, which in turn serves as an effective tool to narrow down searches.


This project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will be a synergy of different types of information, e.g., taxonomic, occurrence, ecological, biomolecular, biochemical, thus providing users with a comprehensive view on species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.


The DECA (Disease Extraction with Concept Association) project concerned automatically extracting associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). The aim of this project was to combine the strengths of the NaCTeM text mining software tools, Kleio and FACTA, and to create an efficient search facility for associations between biomedical concepts. Also, a considerable amount of research was put into lexical disambiguation of the biomedical names.


The EMPATHY project aims to support metabolic pathway model curation through the integration of text mining methodologies into a pathway reconstruction platform. Specifically, we set out to accomplish the following: creation of a web-based platform that will allow users to develop their reconstructions using a graphical, user-interactive interface; development of advanced text mining (TM) methods for extracting information on metabolic reactions from literature; integration of TM methods into the reconstruction platform to facilitate the automatic provision of literature-based evidence and revision suggestions to the user; development of an active learning-like mechanism that iteratively captures a user's feedback on text-mined evidence/suggestions and recalibrates the underlying tools in order to produce improved results.

eScholar project

The University of Manchester's eScholar is a search facility that gives researchers access to scholarly work produced by individuals associated with the university. The project involves enriching the current faceted search capabilities of eScholar by customising, adapting and combining existing text mining tools and algorithms, such as keyword extraction, named entity recognition and topic clustering, to foster the discovery of interdisciplinary links. This project will impact on the advancement of new interdisciplinary research, which is reliant on identifying potential synergies between the work of different groups within the university.

Europe PMC

This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and MIMAS, forming a work package in the Europe PMC project (formerly UKPMC) hosted and coordinated by the British Library. Europe PMC, as a whole, forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of key biomedical research funders. Our contribution to this major project is in the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, this is an application of technology developed in other NaCTeM projects on a large scale and in a prominent resource for the Biomedicine community.


This joint project from UKOLN, NaCTeM and Knowledge Integration brought together the experience of each partner in text analysis and information extraction techniques in order to complete a practical evaluation of formal metadata generation methods within real world workflows. These included the well-known problem of metadata deposit, and workflows from later in the metadata lifecycle: triage - incremental improvement of metadata through error identification and correction - and normalisation, the increase of consistency for a specific purpose, such as republishing of the record as part of an overlay journal. The suitability of extracted formal metadata for purposes such as creation of metadata records, input into existing services for external subject classification or geographical localisation, and for reviewing resource accessibility and preservation was evaluated.


The FLaReNet project (Fostering Language Resources Network) is an eContentPlus project funded by the European Commission. The purpose of the project is to develop a common vision of the area of Language Resources and Language Technologies for the coming years, and to foster a European strategy for consolidating the sector and enhancing competitiveness at EU level and worldwide. FLaReNet will analyse the sector along various dimensions: technical, scientific but also organisational, economic, political and legal. Once the more pressing issues have been selected, the mission of FLaReNet is to identify priorities as well as long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry.

HSE Lloyds

The Lloyds HSE project is under the umbrella of the £10 million Discovering Safety programme funded by Lloyd's Register Foundation. Central to the programme is the development of new technologies to analyse and aggregate data from sources worldwide, the key output being new learning to help prevent future accidents occurring. This ambitious programme is a collaboration between the Health and Safety Executive (HSE) and the University of Manchester, resulting in the Thomas Ashton Institute. As part of the programme, we are using state-of-the-art text mining and natural language processing to extract health and safety insights from free-text sources.

Infectious Diseases

In September 2009, the National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), awarded a 5-year contract to support the biomedical research community's work on infectious diseases.

As part of the contract, NaCTeM is collaborating with Virginia Bioinformatics Institute (VBI) to integrate vital information on pathogens, provide key resources and tools to scientists, and help researchers to analyze genomic, proteomic and other data arising from infectious disease research.

Integrated Social History Environment for Research (ISHER) - Digging into Social Unrest

ISHER aims to enhance search over digitised resources for social history. Enhancement comes through text mining-based rich semantic metadata extraction for collection indexing, clustering and classification. This then allows semantic search while reducing the manual costs currently involved in such activities.

Interoperability of text mining tools is a key objective and an organizing principle for the software architecture of our project. IBM's Unstructured Information Management Architecture (UIMA) forms the basis of our interoperable text mining platform U-Compare, which has over 50 text mining components in its library and is extensible, so it can accommodate ISHER's requirements by also including text mining tools from third parties.


The INTUTE Project aimed to develop an intelligent semantic search service using NaCTeM's text mining tools, which would grant users the benefit of searching within an enhanced subset of the Intute repository, a collection of academic/technical reports under the domain-heading of Bio-medical Science or Social Science.

Japan Science and Technology Agency Project

The aim of the project was to investigate the acquisition of lexical and terminological information for a machine translation environment. The main aspects of the JST project were to investigate the use of machine learning techniques for the development of efficient clustering and classification algorithms to be used for text mining applications, and in particular machine translation.

KISTI Pathway Project

NaCTeM is collaborating with the Korea Institute of Science and Technology Information (KISTI) to develop the next generation of information extraction and text mining systems for supporting and automating various aspects of biomolecular pathway model curation.

Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions.

Mining Biodiversity

The Mining Biodiversity project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project integrates novel text mining methods, visualisation, crowdsourcing and social media into the BHL. The resulting digital resource will provide fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.

The project will apply text mining methods to add semantic metadata to two digitised medical textual resources with archives dating back to the 1840s, i.e. the British Medical Journal (BMJ) and London-area Medical Officer of Health (MOH) reports. Major outcomes of the project will be a novel temporal terminological resource, which will identify and record terminological variation and semantic shift over time, and a new semantic search system over the enriched archives, which will help historians in broadening and deepening their work to ask 'big' questions that cover long periods, without losing sensitivity to changes in terminology and meaning.

Mining for Public Health

This project aims to conduct novel research in text mining and machine learning to transform the way in which evidence-based public health (EBPH) reviews are conducted. The aims of the project are to develop new unsupervised text mining methods for deriving term similarities, to support screening while searching in EBPH reviews and to develop new algorithms for ranking and visualising meaningful associations of multiple types in a dynamic and iterative manner. These newly developed methods will be evaluated in EBPH reviews, based on implementation of a pilot, to ascertain the level of transformation in EBPH reviewing.

Mining the History of Medicine

This project, a cross-disciplinary collaboration between the National Centre for Text Mining (NaCTeM) and the Centre for the History of Science, Technology and Medicine (CHSTM) at the University of Manchester, seeks to demonstrate the potential of text mining in medical history.


META-NET aims to build the technological foundations of a multilingual European information society. Through the Multilingual Europe Technology Alliance (META), META-NET aims to bring together researchers, commercial technology providers, private and corporate language technology users, language professionals and other information society stakeholders. META will prepare the necessary ambitious joint effort towards furthering language technologies as a means towards realising the vision of a Europe united as one single digital market and information space.


The aim of this project is to create an environment which enables new biomarker tests, based on molecular pathology techniques, to be developed. These can then be used to stratify patients, to allow more accurate diagnosis or prediction of the best treatments to use. The initial focus will be on people who suffer from inflammatory disease (psoriasis, rheumatoid arthritis and lupus), given the availability of a large number of patient samples for these diseases. Text mining (TM) will be employed to carry out automated semantic analysis of various "unstructured" textual information sources that may contain information that is relevant to the development of biomarker tests, including biomedical literature and electronic health records. Given that each of these sources constitutes vast numbers of documents, information contained within them may be hidden and easily overlooked. TM techniques will be used in a number of ways to enhance the ease and efficiency with which unstructured textual information sources can be exploited to support the development of biomarker tests.


This project aims to establish the toxicological profile of every plant extract based on its composition. The idea is to gather, within the same predictive database, all the knowledge published on molecular groups and the toxicological data collected by the industry, extracts manufacturers and cosmetic companies. Aiming in the long-term to provide a "Predictive database to determine the toxicological profile of Natural Complex Substances - Plants extracts", the project is very important for all stakeholders involved in the manufacture and use of plants and plant extracts, and especially for cosmetic and nutraceutical industries. The innovative feature of this project lies in its creation of a predictive database containing toxicological profiles of Natural Complex Substances (NCSs), which are obtained from safety data and information on their constituents.


The ONDEX project addresses the problem that a prerequisite to a systems approach to biological research is the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications. Its aims are to produce a robust, fully featured, extensible, easy to use and professionally-supported data integration framework for systems biology projects to use. A more detailed overview is available in this poster presentation video, given at ISMB 2009 and presented by Chris Rawlings.


The Open Mining Infrastructure for Text and Data (OpenMinTeD) project seeks to develop an interoperable text mining infrastructure that will unite the efforts of several key players in the text mining world. Crucially, this project involves the communities at the heart of using text mining with partners in the life sciences, the social sciences and scholarly communication. The project will develop an infrastructure which combines the power of several established text mining systems (including our platform, Argo). We will publish interoperability guidelines that will allow other systems to integrate with the OpenMinTeD platform. The broad aim of this project is to unite the efforts of text miners across Europe and the world, simultaneously promoting reusability and community uptake.


OSSMETER aims to extend the state-of-the-art in the field of automated analysis and measurement of Open Source Software, and develop a platform that will support decision makers in the process of discovering, comparing, assessing and monitoring the health, quality, impact and activity of open-source software.

To achieve this, OSSMETER will compute trustworthy quality indicators by performing advanced analysis and integration of information from diverse sources including the project metadata, source code repositories, communication channels and bug tracking systems of Open Source Software projects.

Pacific Life Re

The role of reliable data in medicine cannot be overestimated. This applies not only to information describing general population-level phenomena covered in scientific publications, but also to health service records describing individuals. Although text mining methods have been widely applied to the former category, the latter has attracted much less attention. One of the main reasons is that these data were previously stored in a format that made them less accessible for digital processing, i.e., as paper documents, which were frequently handwritten. However, increasing adoption of digital solutions both by health service institutions and individual medical practitioners has started to change the picture. This new situation poses both new challenges and opportunities for text mining methods, since there is potentially valuable knowledge contained in individual medical records. In this project, we aim to analyse medical reports using text mining techniques, with the specific goal of quantifying the risk associated with the evidence described.


The ParTeM project (Massively Parallel Processing of Full Text Articles using DEISA) combined expertise in text mining and high performance computing to enable and run massively parallel text mining applications that scale beyond thousands of processors, since there is an urgent need to find solutions to the problem of data deluge for large-scale text mining applications. The motivation is to process large text datasets from multiple scientific domains within reasonable time. Processing full text articles instead of abstracts will allow researchers/scientists across the world to find more relationships within text that were not known before. This will only be possible with a system that exploits storage capabilities and the parallel nature of high performance computing platforms by porting a number of advanced text mining techniques to the DEISA platform.


Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a pathway are of a very different nature to their corresponding descriptions in English text, we have developed PathText to serve as a bridge between these two systems.


There is now more research published than ever before. The primary bibliographic database for biomedical research, PubMed, adds around 3,500 new references every day. Our random sample of 2,000 publications in PubMed suggests that in 2013 there were 98,000 publications describing in vivo experiments, of which 21,000 were in pharmacology and 14,500 in neuroscience. No one individual can read, let alone critically appraise or use, even a small fraction of this new information, which is the product of months of investigator effort and substantial investment of research funds. This mismatch between the amount of research produced and the amount that can be effectively used is a major challenge to biomedical research.

In the SLiM project, we propose to exploit recent developments in text mining and machine learning, and to evaluate their potential to assist with the challenges of systematic reviews of in vivo data outlined above.


Thalia (Text mining for Highlighting, Aggregating and Linking Information in Articles) is a semantic search engine that can recognise concepts occurring in biomedical abstracts indexed on PubMed. It currently recognises eight types of concepts, namely: chemicals, diseases, drugs, genes, metabolites, proteins, species and anatomical entities.

Turing Project

In this 6-month project, we will develop neural machine reading methods to support the automation and effective identification of knowledge for systematic review (SR) development. The project will leverage RobotAnalyst [1], which was developed by NaCTeM in cooperation with the National Institute of Clinical Excellence (NICE). The system automates screening in SRs by reading the titles and abstracts of a document collection, and ranking the relevance of each reference using a model trained on human-assessed relevant and irrelevant examples.