Cluster Parsing and Indexing with University of Tokyo's GXP Make
2011-01-19
NaCTeM's deployed semantic search applications are built by running natural language processing tools, such as the ENJU parser and trained named entity recognition tools, over huge volumes of text. Full-text parsing is very compute-intensive and memory-intensive, although some of the other steps in a typical analysis pipeline are relatively quick to run. The undertaking is complicated by problems such as text encoding and formatting, which can invalidate the analysis of whole batches of documents if not handled robustly during processing.
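As an illustration of what handling encoding problems robustly might look like, here is a minimal Python sketch that isolates failures per document so that one bad file does not invalidate its batch. It is not NaCTeM's actual pipeline code: the function names `decode_document` and `process_batch`, the fallback encoding list, and the `analyse` callback are all assumptions made for this example.

```python
from typing import Callable, Dict, Tuple


def decode_document(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in turn (the list is an assumption)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the document, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def process_batch(
    docs: Dict[str, bytes],
    analyse: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run `analyse` on every document, recording failures individually.

    `analyse` is a hypothetical stand-in for a parsing or tagging step.
    """
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for doc_id, raw in docs.items():
        try:
            results[doc_id] = analyse(decode_document(raw))
        except Exception as exc:  # isolate the failure to this document
            failures[doc_id] = repr(exc)
    return results, failures
```

Failed documents are recorded with their error rather than aborting the run, so they can be inspected and reprocessed separately from the rest of the batch.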
Running the whole analysis pipeline over a typical large corpus of 20 million document abstracts or 2 million full-text documents can take up to 100,000 processor hours. Doing this in order to create a search application needs either a very long lead time or the use of parallel computing power. The availability of HPC clusters is one prerequisite, but even with that, managing the distribution of processing resources and data over a cluster requires considerable organization. Where a job is big enough to warrant the use of multiple clusters, there is a further complication: files can be shared within a cluster but not across clusters.
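To put the 100,000 processor-hour figure in perspective, the following back-of-the-envelope Python sketch converts processor hours into elapsed days for a given number of cores. The function name `wall_clock_days`, the `efficiency` parameter, and the core counts in the usage lines are assumptions for illustration; they are not figures from NaCTeM's actual runs.

```python
def wall_clock_days(cpu_hours: float, cores: int, efficiency: float = 1.0) -> float:
    """Estimate elapsed days for a job of `cpu_hours` spread over `cores`.

    `efficiency` (0 < efficiency <= 1) is an assumed fraction of ideal
    speed-up, allowing for scheduling and data-transfer overhead.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return cpu_hours / (cores * efficiency) / 24


# Illustrative core counts only.
for cores in (1, 100, 1000):
    print(f"{cores:>5} cores: {wall_clock_days(100_000, cores):,.1f} days")
```

On a single processor the job would take more than eleven years, while a thousand cores at ideal efficiency would bring it down to a little over four days, which is why parallel computing is the practical route.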
Large-scale document collection analysis is a multi-disciplinary activity that combines computational linguistics and other text mining expertise with high-performance computing know-how. With the support and expertise of Professor Kenjiro Taura's group at the University of Tokyo, NaCTeM is able to undertake collection-scale analysis in order to build the comprehensive indexes of scientific article repositories that underpin its semantic search applications, such as KLEIO, MEDIE and AcroMine.
Prof Taura's