NaCTeM

BOOTStrep Bio-Lexicon

Overview

Biological terminology is a frequent cause of analysis errors when processing biology literature. For example, "retro-regulate" is a terminological verb often used in molecular biology, but it is not included in conventional dictionaries.

The BioLexicon is a linguistic resource tailored for the biology domain to cope with these problems. It contains the following types of entries:

  • a set of terminological verbs
  • a set of derived forms of the terminological verbs
  • general English words frequently used in the biology domain
  • domain terms

This comprehensive coverage of biological terms makes the lexicon a unique linguistic resource within the domain.
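
As a rough illustration of the four entry types listed above, the following Python sketch models a lexicon entry and groups entries by type. The names `EntryKind`, `Entry`, and `by_kind` are hypothetical and do not reflect the BioLexicon's actual schema or distribution format. The sample entries and their assignment to types are also illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EntryKind(Enum):
    """The four entry types named in the overview (hypothetical encoding)."""
    TERMINOLOGICAL_VERB = "terminological verb"
    DERIVED_FORM = "derived form of a terminological verb"
    GENERAL_WORD = "general English word frequent in biology"
    DOMAIN_TERM = "domain term"


@dataclass(frozen=True)
class Entry:
    lemma: str       # citation form of the word or term
    pos: str         # part of speech, e.g. "verb" or "noun"
    kind: EntryKind  # which of the four entry types this is


# Toy entries for illustration only; not taken from the real lexicon.
ENTRIES = [
    Entry("retro-regulate", "verb", EntryKind.TERMINOLOGICAL_VERB),
    Entry("retro-regulation", "noun", EntryKind.DERIVED_FORM),
    Entry("express", "verb", EntryKind.GENERAL_WORD),
    Entry("p53", "noun", EntryKind.DOMAIN_TERM),
]


def by_kind(entries):
    """Group entries by their entry type."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.kind].append(entry)
    return dict(groups)


if __name__ == "__main__":
    for kind, members in by_kind(ENTRIES).items():
        print(kind.value, "->", [e.lemma for e in members])
```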

Background

Over the past twenty years, there have been remarkable advances in natural language processing (NLP) and text mining (TM) technologies. Various practical NLP/TM tools, such as part-of-speech taggers, chunkers, syntactic parsers and named entity recognizers, are now widely available. However, text in biology exhibits different characteristics from general-language documents such as newspaper articles. Demand for NLP/TM results is strong in the biology domain, yet it is also one of the most challenging domains for text processing. Lack of coverage of the following types of terminological information makes NLP/TM tasks in this domain difficult:

  • Large-scale domain-specific terminologies
  • Domain-specific word usage
  • Domain-specific relations between words

Technical terms are a major barrier to bio-text processing. A huge number of biological, chemical and medical terms appear in the literature, and new terms are coined every day. Furthermore, many of these terms have spelling and semantic variants, so the same biomedical entity appears in different written forms. For example, the BioThesaurus contains more than 15 million gene/protein names, yet it still does not cover the wide variety of gene/protein name variants that actually appear in the literature.

Word usage can also be idiosyncratic to the bio-domain. For example, "express" often indicates a specific biological process, gene expression, and takes specific types of named entities, such as gene and protein names, as arguments.

In addition, there are many cases where words are related in a biology-specific manner. For example, the verb "retroregulate" has "retroregulation" as its nominal form and "retroregulatory" as its adjectival form. Derivational relations of this kind cannot be fully covered by general English dictionaries and thesauri such as WordNet. To the best of our knowledge, there is no biology-specific lexicon that addresses the above linguistic issues.
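
To make the derivational relations and spelling variants discussed above concrete, here is a minimal Python sketch that maps a surface form back to its base verb. The table `DERIVATIONS` and the functions `normalize`, `build_index`, and `base_verb` are hypothetical names introduced for illustration. The normalization rule (lowercasing and dropping hyphens and spaces) is a deliberately crude assumption, not the method used to build the BioLexicon.

```python
import re

# (base verb, derived form, part of speech of the derived form).
# The only rows are the example given in the text above.
DERIVATIONS = [
    ("retroregulate", "retroregulation", "noun"),
    ("retroregulate", "retroregulatory", "adjective"),
]


def normalize(term):
    """Crude spelling-variant key: lowercase, then drop hyphens and whitespace."""
    return re.sub(r"[\s\-]+", "", term.lower())


def build_index(derivations):
    """Map every normalized form (base or derived) to its base verb."""
    index = {}
    for verb, derived, _pos in derivations:
        index[normalize(verb)] = verb
        index[normalize(derived)] = verb
    return index


def base_verb(word, index):
    """Return the base verb for a surface form, or None if it is unknown."""
    return index.get(normalize(word))


if __name__ == "__main__":
    index = build_index(DERIVATIONS)
    for form in ["Retro-regulation", "retroregulatory", "walk"]:
        print(form, "->", base_verb(form, index))
```

A general-purpose dictionary lookup would return nothing for any of these forms, which is the coverage gap the paragraph above describes. Real variant handling for gene and protein names needs far more than this normalization step.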

Availability

The BioLexicon is available from the ELRA catalogue (ref T0373). It was developed jointly by EMBL-EBI, CNR-ILC, and the University of Manchester within the EC BOOTStrep project.

References

2008

  • Sasaki, Yutaka, Simonetta Montemagni, Piotr Pezik, Dietrich Rebholz-Schuhmann, John McNaught and Sophia Ananiadou. BioLexicon: A Lexical Resource for the Biology Domain. In Proc. of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), 2008.
  • Rebholz-Schuhmann, Dietrich, Piotr Pezik, Vivian Lee, Jung-Jae Kim, Riccardo del Gratta, Yutaka Sasaki, Jock McNaught, Simonetta Montemagni, Monica Monachini, Nicoletta Calzolari and Sophia Ananiadou. Towards a Reference Terminological Resource in the Biomedical Domain. In Proc. of 16th Ann. Int. Conf. on Intelligent Systems for Molecular Biology (ISMB-2008), Toronto, Canada, 2008.