Improving Topic Models with Latent Feature Word Representations
Author(s) -
Dat Quoc Nguyen,
Richard Billingsley,
Lan Du,
Mark Johnson
Publication year - 2015
Publication title -
Transactions of the Association for Computational Linguistics
Language(s) - English
Resource type - Journals
ISSN - 2307-387X
DOI - 10.1162/tacl_a_00140
Subject(s) - computer science , latent dirichlet allocation , topic model , artificial intelligence , natural language processing , probabilistic latent semantic analysis , cluster analysis , feature vector , probabilistic logic , document clustering , feature engineering , topic coherence , information retrieval , linguistics , deep learning
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
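The abstract describes mixing a Dirichlet multinomial topic-word distribution with a latent-feature component built from pre-trained word vectors. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the variable names (`lf_weight`, `topic_vec`) and the toy counts are assumptions made for the example.

```python
import numpy as np

# Sketch of the mixture word distribution described in the abstract:
# a Dirichlet-multinomial component learnt from the small corpus is
# combined with a latent-feature component derived from word vectors
# trained on a large external corpus.

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4

word_vecs = rng.normal(size=(vocab_size, dim))  # pre-trained embeddings (assumed given)
topic_vec = rng.normal(size=dim)                # per-topic latent feature vector

# Dirichlet-multinomial component: smoothed topic-word counts
counts = np.array([3.0, 1.0, 0.0, 2.0, 0.0])    # toy topic-word counts
beta = 0.01                                     # symmetric Dirichlet prior
mult_component = (counts + beta) / (counts.sum() + vocab_size * beta)

# Latent-feature component: softmax over topic-word dot products
scores = word_vecs @ topic_vec
lf_component = np.exp(scores - scores.max())
lf_component /= lf_component.sum()

# Mix the two distributions; lf_weight plays the role of the paper's
# mixture weight between the feature and multinomial components.
lf_weight = 0.6
word_given_topic = lf_weight * lf_component + (1 - lf_weight) * mult_component

print(word_given_topic)
```

Both components are proper distributions over the vocabulary, so their convex combination is as well; the embedding term lets topics assign probability to words that are rare in the small training corpus but close in vector space to the topic's frequent words.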