NaCTeM

BBC News Browser Pilot Project

The aim of this pilot project is to use advanced text mining techniques to analyse, structure and visualise BBC news available on the Web according to a user's query. The major outcomes will include a web demonstrator of two concept clustering tools, as well as presentations to identified sets of potential users within New Media & Technology, News and BBC Monitoring, and to BBC Research/Technology Group.

Duration: July - December 2007
Principal Investigator: Sophia Ananiadou
Research Associate: Brian Rea

Due to restrictions on the use of the news data, we unfortunately cannot make the tool available for general use. Instead, we have created a video demonstration to show the key functionality and benefits of the project output.

Application 1: Concept Discovery and Retrieval

The proposed system will linguistically process and analyse the terminology within all of the news articles provided by the BBC, in order to discover the most important concepts and the relations between them. The interface allows a user to query the document collection; the system then automatically calculates a list of concepts specific to the query, ranked by perceived importance. For example, in a biomedical document collection, a query for documents relating to "myocardial infarction" returns a ranked set of results that includes 'myocardial infarction', 'coronary artery', 'risk factor', 'artery disease', 'acute coronary syndrome', 'heart disease', 'heart failure', 'ventricular tachycardia', 'blood pressure' and 'unstable angina'.
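The query-specific ranking described above can be illustrated with a minimal sketch. It is not the project's actual scoring method, which is not specified here: it assumes the candidate terms have already been extracted for each document, and it uses a simple TF-IDF-style weight as a stand-in for "perceived importance". The function name rank_concepts and the toy collection are hypothetical.

```python
import math
from collections import Counter


def rank_concepts(docs, query, top_n=10):
    """Rank terms that co-occur with `query` (illustrative scoring only).

    docs: mapping of document id -> list of already-extracted terms.
    Score = share of matching documents containing the term, weighted by
    how rare the term is in the whole collection.
    """
    query = query.lower()
    matched = [set(terms) for terms in docs.values() if query in terms]
    if not matched:
        return []

    # Document frequency over the whole collection.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    # Document frequency within the documents that match the query.
    hit_freq = Counter()
    for terms in matched:
        hit_freq.update(terms)

    n_docs = len(docs)
    scores = {
        term: (hits / len(matched)) * math.log(n_docs / doc_freq[term])
        for term, hits in hit_freq.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy collection (hypothetical data, not taken from the BBC corpus).
collection = {
    "d1": ["myocardial infarction", "coronary artery", "risk factor"],
    "d2": ["myocardial infarction", "coronary artery", "heart failure"],
    "d3": ["risk factor", "lung cancer"],
    "d4": ["lung cancer", "smoking cessation"],
}

for term, score in rank_concepts(collection, "myocardial infarction"):
    print(f"{score:.3f}  {term}")
```

Terms that never appear alongside the query, such as 'lung cancer' in this toy collection, are excluded from the list, while terms spread evenly across the collection are ranked lower than terms concentrated in the matching documents.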

The basic method combines advanced indexing of these concepts with the standard keyword-based approaches of more common search engines. This allows more complete retrieval from document collections without the user having to know the key terminology and its variants ahead of time. It also enables the user to drill down into the results, with each step becoming more focussed on a particular goal and irrelevant documents being discarded. Finally, as the articles are all stored within the system during processing, it is possible to offer multiple visualisations of the documents, ranging from raw text or styled HTML to annotated and enhanced versions that highlight key concepts and provide links to related material.
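A rough sketch of how concept indexing and keyword indexing might sit side by side, and how drill-down can be treated as repeated narrowing of a result set, is given below. It assumes that each article arrives with its concepts already recognised and normalised (for instance, "heart attack" mapped to 'myocardial infarction'); the functions build_index and drill_down and the sample articles are hypothetical and do not describe the project's actual implementation.

```python
from collections import defaultdict


def build_index(docs):
    """Build one inverted index holding both keyword and concept entries.

    docs: mapping of document id -> (text, list of normalised concepts).
    Keys are ("word", token) or ("concept", term) so the two kinds of
    entry never collide.
    """
    index = defaultdict(set)
    for doc_id, (text, concepts) in docs.items():
        for word in text.lower().split():
            index[("word", word)].add(doc_id)
        for concept in concepts:
            index[("concept", concept.lower())].add(doc_id)
    return index


def drill_down(index, keys, candidates=None):
    """Narrow `candidates` by intersecting the postings of each key."""
    for key in keys:
        postings = index.get(key, set())
        if candidates is None:
            candidates = set(postings)
        else:
            candidates = candidates & postings
    return candidates if candidates is not None else set()


# Hypothetical articles: (headline text, concepts assumed to be pre-extracted).
articles = {
    "a": ("Heart attack rates fall", ["myocardial infarction", "risk factor"]),
    "b": ("Study links diet to heart attack", ["myocardial infarction", "diet"]),
    "c": ("Diet advice updated", ["diet"]),
}

index = build_index(articles)
step1 = drill_down(index, [("concept", "myocardial infarction")])
step2 = drill_down(index, [("word", "diet")], step1)
print(sorted(step1), sorted(step2))
```

Here the first step retrieves both heart-attack articles through the concept entry, even though neither headline contains the words "myocardial infarction", and the second step discards the article that does not mention diet.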

Application 2: Concept Visualisation

This application takes the results of the concept discovery process and visualises them, with the aim of creating user-oriented knowledge maps. The knowledge maps are generated by recognising clusters of articles and categorising them automatically on the basis of concept (terminological) processing. The user selects a collection of online news and specifies a set of query terms, and topic maps are created automatically. The figure below shows an example of a topic map generated from news articles. The target information is extracted from a small number of articles concerning terrorism, and the map indicates the documents (yellow dots) that relate the topics.

Example visualisation of news topics

The basic method includes categorisation and mapping of concepts in order to enhance information presentation. The system integrates automatic term recognition, concept clustering, information retrieval and visualisation. Its main objective is to facilitate knowledge presentation and discovery from documents by identifying concept similarities in news stories and visualising them automatically. Additionally, in order to accelerate information discovery, we propose a visualisation method for generating similarity-based knowledge maps. This method is based on real-time, terminology-based knowledge clustering and categorisation, and it allows users to observe the generated knowledge maps graphically and in real time. The technique can be applied to compare news stories from current or past news articles and/or different channels, showing differences in perspective.
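As a simplified illustration of similarity-based clustering, the sketch below groups articles whose concept sets overlap. It is a stand-in rather than the project's clustering method: it assumes concepts have already been extracted per article, uses Jaccard similarity with an arbitrary threshold, and merges articles by simple single-link grouping. The function names and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of concepts."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_concepts(doc_concepts, threshold=0.3):
    """Group documents whose concept overlap reaches `threshold`.

    Single-link grouping via union-find: two documents end up in the
    same cluster if a chain of sufficiently similar pairs connects them.
    """
    ids = sorted(doc_concepts)
    parent = {doc_id: doc_id for doc_id in ids}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if jaccard(doc_concepts[first], doc_concepts[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for doc_id in ids:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(groups.values())


# Hypothetical news items with pre-extracted concepts.
news = {
    "n1": ["terrorism", "airport security"],
    "n2": ["terrorism", "airport security", "police"],
    "n3": ["interest rates", "inflation"],
    "n4": ["inflation", "housing market"],
}

print(cluster_by_concepts(news))
```

Each resulting group could then be drawn as one region of a knowledge map, with the shared concepts serving as its label. Pairwise comparison is quadratic in the number of articles, so a real-time system would presumably need something more efficient than this sketch.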

Deliverables:

  • An exploration of how the two applications can be used effectively with news articles, with an interim report on each.
  • A web-based demonstrator of both applications and results.