Paragraph Vector-based Topic Detection
This is the home page for PV-based topic detection, an open-source method that identifies latent topics in a collection of documents and selects a set of meaningful and comprehensive words to describe the content of each topic. The method is presented in:
- Hashimoto, K., Kontonatsios, G., Miwa, M. and Ananiadou, S. (2016). Topic Detection Using Paragraph Vectors to Support Active Learning in Systematic Reviews. In: Journal of Biomedical Informatics [link].
If you use the topic detection method available from this page, please cite this paper.
Overview
The topic detection method uses a distributed representation model, the paragraph vector (PV), to first generate a feature representation of the documents and then cluster the documents into a predefined number of clusters. By treating the centroids of the clusters as latent topics, the method represents each document as a mixture of latent topics. Because the paragraph vector model is trained by solving word prediction tasks, cluster labels are selected according to the conditional probability that a word w is generated given a cluster centroid (i.e., a topic). The method is shown to substantially accelerate the citation screening task of systematic reviews, outperforming existing topic models (e.g., LDA).
Usage
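The pipeline outlined in the Overview (cluster document vectors, treat the centroids as topics, label each topic with its most probable words) can be illustrated with a short Python sketch. This is not the released tool's code: the toy 2-D vectors stand in for trained paragraph and word vectors, every function name is hypothetical, and the mixture weights (a softmax over negative squared distances to the centroids) are an assumption, since this page does not say how the mixture is computed. The word scores follow the usual softmax form of a word prediction task, which is likewise assumed here.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assignments[i] = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

def topic_mixture(doc_vec, centroids):
    """Assumed mixture: softmax over negative squared distances to centroids."""
    return softmax([-sq_dist(doc_vec, c) for c in centroids])

def top_words(centroid, word_vectors, n=3):
    """Rank words by p(w | topic), modelled as softmax(word_vec . centroid)."""
    words = list(word_vectors)
    probs = softmax([dot(word_vectors[w], centroid) for w in words])
    ranked = sorted(zip(words, probs), key=lambda wp: wp[1], reverse=True)
    return [w for w, _ in ranked[:n]]

# Toy stand-ins for trained paragraph vectors (two obvious groups).
docs = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0],
        [0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]
# Toy stand-ins for the model's output word vectors.
word_vectors = {"gene": [2.0, 0.0], "protein": [1.5, 0.2],
                "market": [0.0, 2.0], "stock": [0.1, 1.6]}

centroids, assignments = kmeans(docs, k=2)
for topic_id, centroid in enumerate(centroids):
    print(topic_id, top_words(centroid, word_vectors, n=2))
print(topic_mixture(docs[0], centroids))
```

In this sketch each document gets both a hard cluster assignment and a soft topic mixture, and the same centroid that defines a topic is reused to score candidate label words.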
The code uses Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page), a template library for linear algebra; version 2.8 is currently included in this package. To install the tool run:
sample_data/wikipedia.sample.doc is a collection of documents (one line per document), k is the number of clusters in k-means and itr is the number of training epochs. For help with setting the different parameters of the model run:
Download
- PV-based topic detection (1.5MB) (LICENCES)