Cluster Parsing and Indexing with University of Tokyo's GXP Make
2011-01-19
NaCTeM's deployed semantic search applications are built by running natural language processing tools, such as the ENJU parser and trained named entity recognition tools, over huge volumes of text. Full-text parsing is very compute-intensive and memory-intensive, although some of the other steps in a typical analysis pipeline are relatively quick to run. The undertaking is complicated by problems such as text encoding and formatting, which can invalidate the analysis of whole batches of documents if they are not handled robustly during processing.
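To illustrate what handling encoding problems robustly can mean in practice, here is a minimal Python sketch. It is not NaCTeM's pipeline: the names `decode_document` and `process_batch`, and the choice of UTF-8 with a Latin-1 fallback, are assumptions made for illustration. The point is only that failures are isolated per document, so one bad input does not invalidate the rest of its batch.

```python
from typing import Callable, Dict, Tuple


def decode_document(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 (an assumed policy)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail,
        # though the result may be wrong for other encodings.
        return raw.decode("latin-1")


def process_batch(
    docs: Dict[str, bytes],
    analyse: Callable[[str], object],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run `analyse` on each document, recording failures instead of aborting."""
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    for doc_id, raw in docs.items():
        try:
            results[doc_id] = analyse(decode_document(raw))
        except Exception as exc:  # keep going; report the document later
            failures[doc_id] = repr(exc)
    return results, failures
```

The failures dictionary can then be inspected or re-queued after the batch finishes, rather than discovering the problem only when the whole batch has been lost.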
Applying the whole analysis pipeline to a typical large corpus of 20 million document abstracts or 2 million full-text documents can take up to 100,000 processor hours. Doing this in order to create a search application needs either a very long lead time or parallel computing power. Access to HPC clusters is one prerequisite, but even with it, distributing processing resources and data over a cluster requires considerable organization. Where the job is big enough to warrant multiple clusters, there is a further complication: files can be shared within a cluster but not across clusters.
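To put the 100,000 processor-hour figure in perspective, the following Python sketch converts it to elapsed time for different core counts. The function name `wall_clock_days`, the `efficiency` parameter and the core counts are illustrative assumptions; real runs also incur scheduling and data-transfer overheads that this arithmetic ignores.

```python
def wall_clock_days(cpu_hours: float, cores: int, efficiency: float = 1.0) -> float:
    """Estimate elapsed days for a job, assuming it parallelises evenly.

    `efficiency` (0 < efficiency <= 1) is a hypothetical fudge factor for
    time lost to scheduling, I/O and stragglers.
    """
    if cores < 1 or not 0 < efficiency <= 1:
        raise ValueError("cores must be >= 1 and efficiency in (0, 1]")
    return cpu_hours / (cores * efficiency) / 24


if __name__ == "__main__":
    CPU_HOURS = 100_000  # upper figure quoted in the text
    for cores in (1, 100, 1000):  # assumed core counts, for illustration only
        days = wall_clock_days(CPU_HOURS, cores)
        print(f"{cores:>5} cores: {days:8.1f} days")
```

On a single processor the job would run for more than eleven years, while perfectly even distribution over 1,000 cores would bring it down to a little over four days.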
Large-scale document collection analysis is a multi-disciplinary activity combining computational linguistics and other text mining expertise with high-performance computing know-how. Benefitting from the support and expertise of Professor Kenjiro Taura's group at the University of Tokyo, NaCTeM is able to undertake collection-scale analysis in order to build the comprehensive indexes of scientific article repositories that underpin its semantic search applications such as KLEIO, MEDIE and AcroMine.
Prof Taura's