TerMine

Terms of Use

By using the TerMine service, you agree to the general Terms and Conditions of Use for the NaCTeM Website, in addition to the following Terms of Use for TerMine:

Please let us know by email that you are using TerMine.
Please cite the following when publishing work that uses TerMine:
Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word terms. International Journal of Digital Libraries 3(2), pp.117-132.
Please credit and link to the NaCTeM website (http://www.nactem.ac.uk/) in any electronic services based on the TerMine service or resulting data.
Please contact us in advance if you plan to use the service for bulk processing. TerMine is a freely available service from the academic domain. This means that it is necessary to limit server load and give preference to individual users. Excessive server load may result in IP addresses or institutions being blocked from using the TerMine service. There is a limit enforced on how many times unregistered users may use this service per day.

Web Demonstration: integrated system for lightweight uses
Batch Service: for processing documents larger than 2MB (request access)

Examples of TerMine Analyses

Web Demonstration

After using this service, let us know what you think by completing our online feedback form.

Usage

Quick start:

Texts may be submitted for analysis in any of the following ways:
Select the appropriate radio button (the default is the text window) to indicate the data entry method required and then
- enter the text you would like to analyze into the topmost text window... or
- specify a text file (*.txt or *.pdf) from your computer's hard drive... or
- enter a URL of the Web resource (*.html or *.pdf)
Select the POS tagger to be used. 'Tree Tagger version 3.1' is best suited to generic text, whereas 'GENIA Tagger version 2.1' has been customised for texts from the bio-medical sciences.
Press the [Analyse] button.

This demonstration system annotates the input text with candidate multiword terms recognised by the C-value method and acronyms recognised by AcroMine. Please note that C-value term extraction requires a fair amount of text to produce reasonable termhood scores, as these rely on the key terms occurring multiple times. If you would like to try this system quickly, press one of the [Try] buttons.

This demonstration system has the following limitations:

A source document larger than 2MB will be rejected to protect the server. Please use the TerMine Processing Service if you need to process documents exceeding this limit.
The text content must be written in ASCII encoding.
Due to the file conversions required, the system may not accurately reproduce the layout of the original PDF/HTML contents.
This system may not extract texts from some PDF/HTML contents.
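
The size and encoding limits above can be checked locally before a document is submitted. The following is a minimal sketch in Python and is not part of TerMine: the function name check_input is hypothetical, and reading "2MB" as 2 x 1024 x 1024 bytes is an assumption.

```python
MAX_BYTES = 2 * 1024 * 1024  # assumption: "2MB" is read as 2 MiB


def check_input(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    # Documents greater than the limit are rejected; exactly at the limit is allowed.
    if len(data) > MAX_BYTES:
        problems.append(f"too large: {len(data)} bytes (limit {MAX_BYTES})")
    # bytes.isascii() is True only if every byte is in the range 0-127.
    if not data.isascii():
        problems.append("contains non-ASCII bytes")
    return problems
```

For a local file, the function could be called as check_input(open(path, "rb").read()). Note that this only inspects raw bytes, so it is meaningful for plain-text input; it says nothing about whether text can be extracted from a PDF or HTML document.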

Feedback

If you have used TerMine, please complete our feedback form to tell us how useful you found the service.

Background

Technical terms are important for knowledge mining, especially in the bio-medical area, where a vast number of documents are available. The number of terms (e.g., names of genes, proteins, chemical compounds, drugs and organisms) in the bio-medical literature is increasing at an astounding rate. Existing terminological resources and scientific databases cannot keep up with the growth of neologisms. A domain-independent method for term recognition is therefore very useful for automatically recognizing terms in documents. The TerMine demonstrator integrates C-value multiword term extraction and AcroMine acronym recognition.

C-value is a domain-independent method for automatic term recognition (ATR) which combines linguistic and statistical analyses, with the emphasis placed on the statistical part. The linguistic analysis enumerates all candidate terms in a given text by applying part-of-speech tagging, extracting word sequences based on adjective/noun patterns, and filtering them with a stop-list. The statistical analysis assigns a termhood score to each candidate term using the following four characteristics:

the occurrence frequency of the candidate term
the frequency of the candidate term as part of other longer candidate terms
the number of these longer candidate terms
the length of the candidate term
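
The four characteristics above are combined in the C-value measure published by Frantzi et al. (2000): a candidate that is not nested in any longer candidate scores log2(length) x frequency, while a nested candidate has the average frequency of the longer candidates containing it subtracted from its own frequency first. The following Python sketch transcribes that formula directly. It is an illustration only, not the optimized NaCTeM implementation; the function names and the example counts are hypothetical, and candidates are assumed to be given as tuples of words with their frequencies already counted.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence in a strictly longer term."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` maps candidate terms to their occurrence frequency.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidate terms that contain `a`.
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            # Subtract the average frequency of the longer candidates.
            adjusted = f_a - sum(nesting) / len(nesting)
        else:
            adjusted = f_a
        # Weight by the logarithm of the term length in words.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Hypothetical counts, for illustration only.
freq = {("soft", "contact", "lens"): 4, ("contact", "lens"): 10}
scores = c_values(freq)
print(scores[("contact", "lens")])  # 6.0, i.e. log2(2) * (10 - 4/1)
```

Because log2(1) is 0, single-word candidates always score 0 under this formula, which is consistent with the method targeting multiword terms. The pairwise containment test is quadratic in the number of candidates and is chosen here for readability, not speed.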

Our implementation of the C-value method is optimized for scalability and processing speed: given a set of 1.3 million MEDLINE abstracts (2GB text), the implementation extracts 9.8 million term candidates and their termhood scores in about ten minutes. This demonstration system highlights multi-word terms found in the text presented by a user.

Acronyms result from a highly productive type of term variation which substitutes fully expanded terms (e.g., retinoic acid receptor alpha) with shortened term-forms (e.g., RARA). No generic rules or exact patterns have been established for acronym creation, and acronyms often appear in documents without the expanded form explicitly stated. Thus, an acronym dictionary is necessary for advanced text-mining tasks to establish associations between acronyms and their expanded forms.

AcroMine is an acronym dictionary automatically constructed from the whole of MEDLINE. Assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form, AcroMine identifies acronym definitions in a similar manner to the C-value method. Applied to the whole of MEDLINE (7,811,582 abstracts), the implemented system extracted 886,755 acronym candidates and recognized 300,954 expanded forms in a reasonable time (ca. 48 hours). The current AcroMine achieves 99% precision and 82-95% recall on our evaluation corpus, which roughly emulates the whole of MEDLINE.
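
The idea of treating word sequences that co-occur with a parenthetical expression as potential expanded forms can be sketched as follows. This Python example only covers the candidate-counting step under simplifying assumptions and is not the AcroMine implementation: the regular expression for short forms, the six-word window, and the function name candidate_expansions are all hypothetical choices made for illustration. A real system would go on to score the counted candidates, in a manner similar to the C-value method, to decide which word sequence is the actual expanded form.

```python
import re
from collections import Counter

# Assumption: a short form is 2-10 letters, digits or hyphens inside parentheses.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def candidate_expansions(sentences, max_words=6):
    """Count (short form, preceding word sequence) co-occurrences."""
    counts = Counter()
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            short = match.group(1)
            # Words appearing before the opening parenthesis.
            words = sentence[:match.start()].split()
            # Every suffix of up to `max_words` words is a candidate expansion.
            for n in range(1, min(max_words, len(words)) + 1):
                counts[(short, " ".join(words[-n:]).lower())] += 1
    return counts


sentences = [
    "The retinoic acid receptor alpha (RARA) gene was studied.",
    "Expression of retinoic acid receptor alpha (RARA) increased.",
]
counts = candidate_expansions(sentences)
print(counts[("RARA", "retinoic acid receptor alpha")])  # 2
```

In this toy input the true expanded form is the longest word sequence that occurs with the short form in both sentences, whereas longer sequences such as "of retinoic acid receptor alpha" occur only once.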

References

Frantzi, K., Ananiadou, S. and Mima, H. (2000) Automatic recognition of multi-word terms. International Journal of Digital Libraries 3(2), pp.117-132.
Okazaki, N. and Ananiadou, S. (2006) Building an Abbreviation Dictionary using a Term Recognition Approach, in Bioinformatics.
GENIA Tagger: part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text. (Department of Information Science, University of Tokyo)
TreeTagger: a language independent part-of-speech tagger. (Institute for Computational Linguistics of the University of Stuttgart)
Multivalent: digital documents research and development. (University of Liverpool)

Contact

If you need more information about TerMine, please contact Prof. Sophia Ananiadou.