Related Text Mining Research Projects
In addition to the development of text mining services and software tools, members of the National Centre for Text Mining are involved in a variety of other projects which influence and contribute to the work of the centre:
Current Projects
AstraZeneca Project (Automated Biological Event Extraction from the Literature for Drug Discovery)
This is a collaborative project between NaCTeM and AstraZeneca, which started on 1 September 2009 and runs for 3 years. The aim is to enhance our ability to extract information from the growing corpus of literature, and to make the process of synthesising that information more efficient and manageable, and as comprehensive and precise as possible. The hypothesis is that the outcome of the project will help the decision-making processes in a drug discovery project to take place using as much pertinent and up-to-date information as possible, and thus maximise the quality of pre-clinical decision making.
To achieve this aim, the objectives of this project and the research novelties are: a) to customise deep semantic text mining techniques to extract protein-bioprocess associations automatically; b) to extract biological events pertaining to protein-disease associations automatically from the literature; c) to support the semi-automatic production of annotated texts pertaining to biological information for text mining applications; d) to identify automatically bioprocesses linked with protein-disease events; e) to produce a text mining service supporting biologists researching protein-bioprocesses from the vast amount of literature.
CheTA
CheTA will integrate Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration adds chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and the wider chemistry community. After a baseline study (UCC and RSC) and the integration have been completed, the project will use the CheTA tools to index a corpus of documents of different types and provenance. CheTA will develop a rigorous evaluation framework comprising annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly?' - RSC/NaCTeM), user requirements studies for the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC) and a comparison of the extracted metadata against its usefulness (all project partners). Furthermore, the performance of the CheTA system will be compared with that of the Thomson Reuters OpenCalais service enhanced with a chemistry lexicon. Finally, the economic cost of metadata generation by both human indexers and robots will be quantified.
DECA
The DECA (Disease Extraction with Concept Association) project concerns automatically extracting associations between concepts in the biomedical domain, such as diseases and symptoms, from collections of biomedical texts (e.g., MEDLINE). The aim of this project is to combine the strengths of the NaCTeM text mining software tools, Kleio and FACTA, and to create an efficient search facility for associations between biomedical concepts. A considerable amount of research will also go into lexical disambiguation of biomedical names.
FixRep
This joint project from UKOLN, NaCTeM and Knowledge Integration brings together the experience of each partner in text analysis and information extraction techniques in order to complete a practical evaluation of formal metadata generation methods within real world workflows. These include the well-known problem of metadata deposit, and workflows from later in the metadata lifecycle: triage (incremental improvement of metadata through error identification and correction) and normalisation (the increase of consistency for a specific purpose, such as republishing of the record as part of an overlay journal). The suitability of extracted formal metadata for purposes such as creation of metadata records, input into existing services for external subject classification or geographical localisation, and reviewing resource accessibility and preservation is evaluated.
FLaReNet
The FLaReNet project (Fostering Language Resources Network) is an eContentPlus project funded by the European Commission. The purpose of the project is to develop a common vision of the area of Language Resources and Language Technologies for the coming years, and to foster a European strategy for consolidating the sector and enhancing competitiveness at EU level and worldwide. FLaReNet will analyse the sector along various dimensions: technical, scientific but also organisational, economic, political and legal. Once the more pressing issues have been selected, the mission of FLaReNet is to identify priorities as well as long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry.
ONDEX
The ONDEX project addresses the problem that a prerequisite to a systems approach to biological research is the integration and analysis of heterogeneous experimental data, which are stored in hundreds of life-science databases and millions of scientific publications. Its aims are to produce a robust, fully featured, extensible, easy to use and professionally-supported data integration framework for systems biology projects to use. A more detailed overview is available in this poster presentation video, given at ISMB 2009 and presented by Chris Rawlings.
PathText/Refine
Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a pathway are of a very different nature to their corresponding descriptions in English text, we have developed PathText to serve as a bridge between these two systems.
UIMA
One of the core challenges facing text mining and natural language processing (NLP) researchers and tool developers is the general lack of interoperability between different tools and resources. At NaCTeM we have sought to address this by adapting our tools to function within the UIMA framework, enabling direct interaction with tools provided by other groups around the world.
Our UIMA work is widely recognised, and Dr. Sophia Ananiadou, the director of NaCTeM, has received IBM UIMA Innovation Awards in three successive years: 2006, 2007 and 2008.
UKPMC
This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and MIMAS, forming a work package in the UKPMC project hosted and coordinated by the British Library. UKPMC, as a whole, forms a UK-based version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. UKPMC is funded by a consortium of key biomedical research funders. Our contribution to this major project is in the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, this is an application of technology developed in other NaCTeM projects on a large scale and in a prominent resource for the biomedicine community.
Past Projects
ADVISES
The ADVISES project will create a new way for scientists to communicate with computers. At present they have to use difficult tools which require them to speak the computers' language rather than express what they want in English. Worse still, the computer tools don't talk to each other, so scientists have to use separate tools for statistics, for visually displaying results on a map, etc. We will deal with these problems by analysing the way scientists express their requests in English to create a 'sub-language' - that is, a restricted set of English for asking scientific questions and saying how results should be displayed.
ASSERT
The JISC-funded ASSERT (Automatic Summarisation for Systematic Reviews using Text Mining) project extends the work of the National Centre for Text Mining into the area of social sciences. The overall aim of ASSERT is to encourage greater participation by the social sciences community in e-Research by developing a summarisation service to facilitate the production of systematic reviews and to support a number of community projects related to text mining applications.
ASSIST
The ASSIST project investigates the benefits of text mining in two case studies within the social science disciplines. This includes a review of the requirements gathering stage in order to advise future projects in this area and the development of high profile exemplars demonstrating how text mining solutions can solve, in part at least, major challenges facing e-Researchers across all domains.
Arabic WordNet
Arabic WordNet involves the construction of an Arabic WordNet, following the development process of Princeton WordNet and Euro WordNet. It utilizes the Suggested Upper Merged Ontology as an interlingua to link Arabic WordNet to previously developed wordnets.
BBC
The BBC News Browser Pilot Project aims to analyse, structure and visualise BBC news available on the Web according to a user's query using advanced text mining techniques. The outcomes include a web demonstrator of two concept clustering tools and presentations to identified sets of potential users within New Media & Technology, News and BBC Monitoring, and to BBC Research/Technology Group.
BOOTStrep
BOOTStrep (Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project) is an international joint EU project (Ref. FP6 - 028099), which aims at building reusable wide-coverage lexical, conceptual and factual knowledge resources for the biology domain, involving the exploitation and combination of existing terminological resources (thesauri, classification systems, etc.) within a common, standardized representation framework.
INTUTE
The INTUTE Project aims to develop an intelligent semantic search service using NaCTeM's text mining tools, which will grant users the benefit of searching within an enhanced subset of the Intute repository, a collection of academic/technical reports under the domain-heading of Bio-medical Science or Social Science.
Japan Science and Technology Agency Project
The aim of the project was to investigate the acquisition of lexical and terminological information for a machine translation environment. The main aspects of the JST project were to investigate the use of machine learning techniques for the development of efficient clustering and classification algorithms to be used for text mining applications, and in particular machine translation.
ParTeM
The ParTeM project (Massively Parallel Processing of Full Text Articles using DEISA) combines expertise in text mining and high performance computing to enable massively parallel text mining applications to scale beyond thousands of processors, since there is an urgent need for solutions to the problem of data deluge in large-scale text mining applications. The motivation is to process large text datasets from multiple scientific domains within a reasonable time. Processing full text articles instead of abstracts will allow researchers and scientists across the world to find more relationships within text that were not known before. This will only be possible with a system that exploits the storage capabilities and the parallel nature of high performance computing platforms by porting a number of advanced text mining techniques to the DEISA platform.