A practical guide to text mining with topic extraction
Author(s) - Karl Andrew, Wisnowski James, Rushing W. Heath
Publication year - 2015
Publication title - Wiley Interdisciplinary Reviews: Computational Statistics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.693
H-Index - 38
eISSN - 1939-0068
pISSN - 1939-5108
DOI - 10.1002/wics.1361
Subject(s) - computer science, latent dirichlet allocation, topic model, cluster analysis, exploratory data analysis, information retrieval, unstructured data, document clustering, text mining, analytics, artificial intelligence, data mining, data science, natural language processing, big data
Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data, in which documents are represented by rows and words by columns, provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages.

WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361

This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition
Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining
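The vector space representation named in the abstract, with documents as rows and words as columns, can be illustrated with a minimal sketch. The code below is not taken from the article, which references R packages. It is a standard-library Python illustration under stated assumptions: tokens are obtained by lowercasing and splitting on whitespace, the helper names `build_dtm` and `tfidf` are hypothetical, and the weighting shown is one common tf-idf variant (raw count times log(N / document frequency)), which may differ from the transformations the authors discuss.

```python
from collections import Counter
import math


def build_dtm(docs):
    """Return (vocab, dtm): one row per document, one column per term.

    Assumption: naive tokenization (lowercase + whitespace split).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            dtm[i][index[term]] = count
    return vocab, dtm


def tfidf(dtm):
    """Weight raw counts by log(N / document frequency).

    Assumption: this is one common tf-idf variant, chosen for illustration.
    Every column has document frequency >= 1 because the vocabulary is
    built from the documents themselves.
    """
    n_docs = len(dtm)
    n_terms = len(dtm[0]) if dtm else 0
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in dtm
    ]


# Toy corpus, invented purely for illustration.
docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
vocab, dtm = build_dtm(docs)
weighted = tfidf(dtm)

for term, column in zip(vocab, zip(*dtm)):
    print(term, column)
```

A term that occurs in every document (here, "the") receives weight zero under this scheme, which is the usual motivation for reweighting a DTM before applying a decomposition such as the SVD.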