Cluster Parsing and Indexing with University of Tokyo's GXP Make
2011-01-19
NaCTeM's deployed semantic search applications are built by running natural language processing tools, such as the ENJU parser and trained named entity recognition tools, over huge volumes of text. Full-text parsing is very compute-intensive and memory-intensive, although some of the other steps in a typical analysis pipeline are relatively quick to run. The undertaking is complicated by problems such as text encoding and formatting, which can invalidate the analysis of whole batches of documents if not handled robustly during processing.
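The point about robustness can be illustrated with a minimal sketch. It is not NaCTeM's pipeline: the function names `decode_text` and `process_batch`, the choice of UTF-8 with a Latin-1 fallback, and the `analyse` callback are all assumptions made for illustration. The idea shown is simply that a badly encoded or malformed document is recorded as a failure rather than being allowed to abort the whole batch.

```python
from typing import Callable, Iterable


def decode_text(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 (an assumed policy).

    Latin-1 maps every byte to a character, so the fallback never raises.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def process_batch(
    docs: Iterable[tuple[str, bytes]],
    analyse: Callable[[str], object],
) -> tuple[dict, dict]:
    """Run `analyse` on each (doc_id, raw_bytes) pair.

    A failure on one document is logged in `failures` and the loop
    carries on, so one bad input cannot invalidate the rest of the batch.
    """
    results, failures = {}, {}
    for doc_id, raw in docs:
        try:
            results[doc_id] = analyse(decode_text(raw))
        except Exception as exc:  # isolate the failure to this document
            failures[doc_id] = repr(exc)
    return results, failures
```

Here `analyse` stands in for whatever parser or tagger is being run; the failed document identifiers can be retried or inspected once the batch has finished.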
The whole analysis pipeline, applied to a typical large corpus of 20 million document abstracts or 2 million full-text documents, can take up to 100,000 processor hours. Building a search application on this scale therefore needs either a very long lead time or the use of parallel computing power. The availability of HPC clusters is one prerequisite, but even then, managing the distribution of processing resources and data over a cluster requires considerable organization. Where the job is big enough to warrant the use of multiple clusters, there is a further complication: files can be shared within a cluster but not across clusters.
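To give a feel for the lead times involved, the following back-of-the-envelope sketch converts the 100,000 processor hour figure into elapsed time for different core counts. It assumes perfect linear scaling with no scheduling, I/O or data-transfer overhead, which is an idealisation; the function name `wall_clock_days` and the core counts are illustrative choices, not figures from NaCTeM.

```python
TOTAL_CPU_HOURS = 100_000  # upper estimate quoted in the text


def wall_clock_days(cores: int, total_cpu_hours: float = TOTAL_CPU_HOURS) -> float:
    """Elapsed days if the work divides evenly over `cores` processors.

    Assumes ideal scaling; real runs will take longer.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_cpu_hours / cores / 24


if __name__ == "__main__":
    # Illustrative core counts only.
    for cores in (1, 100, 1000):
        print(f"{cores:>5} cores: {wall_clock_days(cores):8.1f} days")
```

Under these assumptions a single processor would need more than eleven years, whereas a thousand cores would bring the job down to a few days, which is why cluster access matters for this kind of work.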
Large-scale document collection analysis is a multi-disciplinary activity combining computational linguistics and other text mining expertise with high-performance computing know-how. Benefiting from the support and expertise of Professor Kenjiro Taura's group at the University of Tokyo, NaCTeM is able to undertake collection-scale analysis in order to build the comprehensive indexes of scientific article repositories that underpin its semantic search applications such as KLEIO, MEDIE and AcroMine.
Prof Taura's