NaCTeM

Cluster Parsing and Indexing with University of Tokyo's GXP Make

2011-01-19

NaCTeM's deployed semantic search applications are built by running natural language processing tools, such as the ENJU parser and trained named entity recognition tools, over huge volumes of text. Full-text parsing is very compute-intensive and memory-intensive, although some of the other steps in a typical analysis pipeline are relatively quick to run. The undertaking is complicated by problems such as text encoding and formatting, which could invalidate the analysis of whole batches of documents if not handled robustly during processing.

The whole analysis pipeline applied to a typical large corpus of 20 million document abstracts or 2 million full-text documents can take up to 100,000 processor hours. Building a search application on top of such an analysis therefore needs either a very long lead time or the use of parallel computing power. The availability of HPC clusters is one prerequisite, but even then, managing the distribution of processing resources and data over a cluster requires considerable organization. Where the job is big enough to warrant the use of multiple clusters, file sharing is possible within each cluster but not across clusters.
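To put the 100,000 processor hours in perspective, the short Python sketch below converts that figure into elapsed time for a few cluster sizes. The core counts and the perfect-scaling default are illustrative assumptions, not figures from NaCTeM. On a single processor the job would run for roughly 11 years.

```python
def wall_clock_days(processor_hours: float, cores: int, efficiency: float = 1.0) -> float:
    """Estimated elapsed days when the work is spread evenly across `cores`.

    `efficiency` is an assumed parallel efficiency in (0, 1]; 1.0 means
    perfect scaling, which real clusters rarely achieve.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return processor_hours / (cores * efficiency) / 24


TOTAL_PROCESSOR_HOURS = 100_000  # upper bound quoted in the article

if __name__ == "__main__":
    for cores in (1, 100, 1000):  # hypothetical cluster sizes
        days = wall_clock_days(TOTAL_PROCESSOR_HOURS, cores)
        print(f"{cores:>5} cores: {days:,.1f} days")
```

Lowering `efficiency` below 1.0 gives a rough allowance for scheduling and data-transfer overhead.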

Large-scale document collection analysis is a multi-disciplinary activity combining computational linguistics and other text mining expertise with high-performance computing know-how. Benefitting from the support and expertise of Professor Kenjiro Taura's group at the University of Tokyo, NaCTeM is able to undertake collection-scale analysis in order to build the comprehensive indexes of scientific article repositories that underpin its semantic search applications such as KLEIO, MEDIE and AcroMine.

Prof Taura's GXP make tool executes workflows, expressed declaratively as Makefiles, in a distributed environment. This is combined with the open-source user-level file system SSHFS. Input, intermediate, and output data are kept on a single cluster and shared by all nodes.
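The idea behind a Makefile-style workflow can be sketched in a few lines of Python: each target declares its prerequisites, and any targets whose prerequisites are already built can run in parallel. This is a toy illustration of dependency-driven scheduling, not GXP make's implementation or interface. The file names, the `RULES` table, and the `build` stand-in are all hypothetical, and local threads take the place of cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pipeline: target -> prerequisites, as a Makefile rule would declare.
RULES = {
    "doc1.parsed": ["doc1.txt"],
    "doc2.parsed": ["doc2.txt"],
    "doc1.ner": ["doc1.parsed"],
    "doc2.ner": ["doc2.parsed"],
    "index": ["doc1.ner", "doc2.ner"],
}


def schedule(rules, existing):
    """Group targets into waves; each wave depends only on earlier waves."""
    done = set(existing)
    pending = dict(rules)
    waves = []
    while pending:
        ready = sorted(t for t, deps in pending.items()
                       if all(d in done for d in deps))
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for target in ready:
            del pending[target]
    return waves


def build(target):
    # Stand-in for invoking a real tool (parser, NER tagger, indexer).
    return f"built {target}"


def run(rules, existing, workers=4):
    """Build every target, running each wave's targets concurrently."""
    log = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wave in schedule(rules, existing):
            log.extend(pool.map(build, wave))
    return log


if __name__ == "__main__":
    for line in run(RULES, existing=["doc1.txt", "doc2.txt"]):
        print(line)
```

Here the two parse steps form the first wave, the two named entity steps the second, and the index is built last, once everything it depends on exists.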
