NaCTeM

Paragraph Vector-based Topic Detection

This is the home page for PV-based topic detection, an open-source method that identifies latent topics in a collection of documents and selects a set of meaningful and comprehensive words to describe the content of each topic. The method is presented in:

  • Hashimoto, K., Kontonatsios, G., Miwa, M. and Ananiadou, S. (2016). Topic Detection Using Paragraph Vectors to Support Active Learning in Systematic Reviews. In: Journal of Biomedical Informatics [link].

If you use the topic detection method available from this page, please cite this paper.

Overview

The topic detection method uses a distributed representation model, the paragraph vector (PV), first to generate a feature representation of the documents and then to cluster the documents into a predefined number of clusters. By treating the centroids of the clusters as latent topics, the method represents each document as a mixture of latent topics. Because the paragraph vector model is trained by solving word prediction tasks, cluster labels are selected according to the conditional probability that a word w is generated given a cluster centroid (i.e., a topic). The method is shown to substantially accelerate the citation screening task of systematic reviews, outperforming existing topic models (e.g., LDA).
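The pipeline described above can be illustrated with a minimal sketch. It assumes that document and word vectors have already been trained and uses tiny hand-made vectors instead. The function names (kmeans, topic_mixture, top_words) are hypothetical, and the softmax over dot products used for both the topic mixture and the word probabilities is an assumption made for illustration; this page does not state the exact formulas used by the tool.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kmeans(vectors, k, iterations=10, seed=0):
    """Plain k-means; the returned centroids play the role of latent topics."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def topic_mixture(doc_vec, centroids):
    """Assumed form: softmax over document-centroid dot products."""
    return softmax([dot(doc_vec, c) for c in centroids])


def top_words(centroid, word_vectors, n=2):
    """Rank words by an assumed p(w | centroid): softmax over dot products."""
    words = list(word_vectors)
    probs = softmax([dot(centroid, word_vectors[w]) for w in words])
    ranked = sorted(zip(words, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Toy vectors, invented for illustration only.
doc_vectors = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
word_vectors = {
    "gene": [1.0, 0.0],
    "protein": [0.8, 0.1],
    "trial": [0.0, 1.0],
    "patient": [0.1, 0.8],
}

topics = kmeans(doc_vectors, k=2)
for topic in topics:
    print("topic label candidates:", top_words(topic, word_vectors))
for doc in doc_vectors:
    print("topic mixture:", topic_mixture(doc, topics))
```

Sharing one scoring function between documents and words mirrors the idea that words and paragraphs live in the same vector space, which is what allows a centroid of document vectors to be labelled with words.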

Usage

The code uses Eigen (http://eigen.tuxfamily.org/index.php?title=Main_Page), a template library for linear algebra; version 2.8 is currently included in this package.

To install the tool, run:

    ./install

You can train a PV-based topic detection model using the following command:

    ./pv_topic -input sample_data/wikipedia.sample.doc -k 3 -itr 10

where sample_data/wikipedia.sample.doc is a collection of documents (one line per document), k is the number of clusters in k-means and itr is the number of training epochs.

For help with setting the different parameters of the model, run:

    ./pv_topic -help

After training, the method produces the following files:

  • result.pv: vector representations for documents
  • result.wv: vector representations for words
  • result.topic_dist: PV-based topic representation for documents
  • result.word_dist: word distribution for each topic
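Since the input is expected to hold one document per line, documents that contain line breaks need to be flattened before training. The following is a minimal sketch of such a preparation step. The helper name write_corpus and the file name my_corpus.doc are hypothetical, and whether the tool expects any further preprocessing (tokenisation, lowercasing) is not stated on this page.

```python
import os
import tempfile


def write_corpus(documents, path):
    """Write one document per line and return the number of lines written.

    Internal whitespace (including newlines) is collapsed to single spaces.
    Empty documents are skipped, so line numbers may not match the indices
    of the original list.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for doc in documents:
            line = " ".join(doc.split())
            if line:
                handle.write(line + "\n")
                written += 1
    return written


# Example with invented documents, written to a temporary directory.
docs = [
    "Topic models describe\ndocument collections.",
    "Paragraph vectors embed   whole documents.",
]
corpus_path = os.path.join(tempfile.mkdtemp(), "my_corpus.doc")
print(write_corpus(docs, corpus_path), "documents written to", corpus_path)
```

The resulting file could then be passed to the training command shown above in place of the sample data, for example: ./pv_topic -input my_corpus.doc -k 3 -itr 10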

Download