Cluster Parsing and Indexing with University of Tokyo's GXP Make
2011-01-19
NaCTeM's deployed semantic search applications are built by running natural language processing tools, such as the ENJU parser and trained named entity recognition tools, over huge volumes of text. Full-text parsing is very compute-intensive and memory-intensive, although some of the other steps in a typical analysis pipeline are relatively quick to run. The undertaking is complicated by problems such as text encoding and formatting, which can invalidate the analysis of whole batches of documents if not handled robustly during processing.
The whole analysis pipeline applied to a typical large corpus of 20 million document abstracts or 2 million full-text documents can take up to 100,000 processor hours. Building a search application at this scale therefore needs either a very long lead time or parallel computing power. Access to HPC clusters is one prerequisite, but even with access, distributing processing resources and data over a cluster requires considerable organization. Jobs big enough to warrant multiple clusters add a further complication: files can be shared within a cluster but not across clusters.
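To put the 100,000 processor-hour figure in perspective, the following is a rough back-of-the-envelope sketch. It assumes perfectly even work distribution with no scheduling or I/O overhead, which a real cluster run will not achieve; the function names and the core counts are illustrative choices, not figures from NaCTeM.

```python
def wall_clock_hours(processor_hours: float, cores: int) -> float:
    """Idealised elapsed time if the work divides evenly over `cores`."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return processor_hours / cores


def seconds_per_document(processor_hours: float, documents: int) -> float:
    """Average processor time per document implied by the totals."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return processor_hours * 3600 / documents


TOTAL_PROCESSOR_HOURS = 100_000  # upper bound quoted in the article

if __name__ == "__main__":
    # Hypothetical core counts, chosen only to show the scaling.
    for cores in (1, 100, 1000):
        hours = wall_clock_hours(TOTAL_PROCESSOR_HOURS, cores)
        print(f"{cores:>5} cores: {hours:>9.1f} h ({hours / 24:.1f} days)")

    # Implied average cost per document at the quoted upper bound.
    print(seconds_per_document(TOTAL_PROCESSOR_HOURS, 20_000_000), "s per abstract")
    print(seconds_per_document(TOTAL_PROCESSOR_HOURS, 2_000_000), "s per full-text document")
```

On these idealised assumptions a single processor would need over eleven years, which is why the choice is between a very long lead time and parallel computing.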
Large-scale document collection analysis is a multi-disciplinary activity that combines computational linguistics and other text mining expertise with high-performance computing know-how. With the support and expertise of Professor Kenjiro Taura's group at the University of Tokyo, NaCTeM is able to undertake collection-scale analysis and build the comprehensive indexes of scientific article repositories that underpin its semantic search applications, such as KLEIO, MEDIE and AcroMine.
Prof Taura's