New paper on learning robust embeddings from high-dimensional linguistic feature spaces
2012-08-09
We are pleased to announce the publication of a new journal article that describes a framework for learning a small number of robust embeddings from extremely high-dimensional linguistic feature spaces (millions of features) to support classification problems with relatively few (thousands of) training instances. This type of classification problem arises in many recent high-level NLP tasks, where annotated training instances are expensive to obtain but rich features are easy to generate using sophisticated text analysis tools. Evaluated on two NLP tasks with six data sets, the proposed framework provides better classification performance than a support vector machine applied without any dimensionality reduction, and it generates embeddings with better class discriminability than many existing embedding algorithms.
Mu, T., Miwa, M., Tsujii, J. and Ananiadou, S. (2012). Discovering Robust Embeddings in (Dis)Similarity Space for High-Dimensional Linguistic Features. Computational Intelligence.
Full abstract: Recent research has shown the effectiveness of rich feature representation for tasks in natural language processing (NLP). However, an exceedingly large number of features does not always improve classification performance. The features may contain redundant information, lead to noisy feature representations, and render the learning algorithms intractable. In this paper, we propose a supervised embedding framework that modifies the relative positions between instances to increase the compatibility between the input features and the output labels, while preserving the local distribution of the original data in the embedded space. The proposed framework supports a flexible balance between the preservation of intrinsic geometry and the enhancement of class separability for both interclass and intraclass instances. It takes into account characteristics of linguistic features by using an inner product-based optimization template. (Dis)similarity features, also known as empirical kernel mapping, are employed to enable computationally tractable processing of extremely high-dimensional input, and also to handle nonlinearities in embedding generation when necessary. Evaluated on two NLP tasks with six data sets, the proposed framework provides better classification performance than the support vector machine without using any dimensionality reduction technique. It also generates embeddings with better class discriminability as compared to many existing embedding algorithms.
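To give a flavour of the (dis)similarity features mentioned in the abstract, below is a minimal, illustrative sketch of empirical kernel mapping, not the authors' implementation: each instance is re-represented by its similarity to every training instance, so downstream learning operates in an n-dimensional similarity space rather than the original d-dimensional feature space (d in the millions, n in the thousands). The data, similarity choice (cosine), and downstream classifier here are all assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

# Synthetic stand-in for the setting in the paper: n training instances
# with d sparse linguistic features, where d (millions) >> n (thousands).
rng = np.random.default_rng(0)
n, d = 500, 1_000_000
X_train = sparse_random(n, d, density=1e-4, format="csr", random_state=0)
y_train = rng.integers(0, 2, size=n)  # synthetic binary labels, illustration only

def empirical_kernel_map(X, X_ref):
    """(Dis)similarity features: represent each instance by its similarity
    to every reference (training) instance, so the new representation has
    n columns regardless of the original dimensionality d."""
    return cosine_similarity(X, X_ref)

Phi_train = empirical_kernel_map(X_train, X_train)  # shape (n, n)

# Any subsequent embedding learning or classification now works in the
# n-dimensional similarity space instead of the d-dimensional one.
clf = LinearSVC().fit(Phi_train, y_train)

X_test = sparse_random(20, d, density=1e-4, format="csr", random_state=1)
Phi_test = empirical_kernel_map(X_test, X_train)  # similarities to training set
print(clf.predict(Phi_test))
```

The supervised embedding step described in the paper would then be learned on top of these similarity features; the plain linear SVM above merely stands in for that downstream learner.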