An Extension of Standard Latent Dirichlet Allocation to Multiple Corpora
Author(s) - Adam Foster
Publication year - 2016
Publication title - SIAM Undergraduate Research Online
Language(s) - English
Resource type - Journals
ISSN - 2327-7807
DOI - 10.1137/15s014599
Subject(s) - latent dirichlet allocation, extension (predicate logic), computer science, mathematics, natural language processing, statistics, topic model, programming language
Latent Dirichlet Allocation (LDA) is a highly successful topic modeling framework. We describe a new extension to LDA which supports multiple subcorpora, each containing a different type of document. As in LDA, this multiple-corpora LDA (mLDA) model assumes document topic proportions follow a symmetric Dirichlet distribution. However, in mLDA, the Dirichlet parameter is subcorpus dependent. An online algorithm for training mLDA models is derived. The algorithm is applied to data from the USC Shoah Foundation’s Visual History Archive. Results show mLDA produced a better language model than standard LDA for this data. Using the same data, the mLDA topic model is used to construct an information retrieval system. Search results from this system outperform those obtained from traditional string-based search systems. A novel approach to the visualization of topics is outlined and visualizations are presented. As a novel development in natural language processing, mLDA will allow the power of topic modeling to be applied to a huge range of fields with diverse data by incorporating more information into a single topic model. It also enhances the applicability of topic modeling to information retrieval.
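
The modeling assumption stated in the abstract (a symmetric Dirichlet prior on document topic proportions whose parameter depends on the subcorpus) can be illustrated with a minimal generative sketch. This is not the paper's online training algorithm, only a forward sampler. The function name `sample_mlda_corpus`, the topic-word prior `eta`, the fixed document length, and the choice to share one set of topics across all subcorpora are assumptions made for illustration and are not taken from the abstract.

```python
import numpy as np

def sample_mlda_corpus(alphas, docs_per_subcorpus, n_topics, vocab_size,
                       doc_length, eta=0.1, seed=0):
    """Sample synthetic documents under an mLDA-style generative assumption.

    alphas[s] is the symmetric Dirichlet parameter for subcorpus s.
    All other parameters are hypothetical choices for this sketch.
    """
    rng = np.random.default_rng(seed)
    # Assumption: one set of topic-word distributions shared by every subcorpus.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    corpus = []
    for s, alpha in enumerate(alphas):
        for _ in range(docs_per_subcorpus):
            # Topic proportions: symmetric Dirichlet, parameter set by subcorpus.
            theta = rng.dirichlet(np.full(n_topics, alpha))
            # One topic assignment per word position, then one word per topic.
            z = rng.choice(n_topics, size=doc_length, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            corpus.append((s, theta, words))
    return topics, corpus

# Two subcorpora: a small parameter (documents concentrated on few topics)
# and a large one (documents mixing many topics).
topics, corpus = sample_mlda_corpus(
    alphas=[0.2, 5.0], docs_per_subcorpus=50,
    n_topics=5, vocab_size=30, doc_length=40,
)
```

In standard LDA a single Dirichlet parameter would be used for every document; here each subcorpus gets its own, so one type of document can be concentrated on a few topics while another spreads across many.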