
Short Text Document Clustering using Distributed Word Representation and Document Distance
Author(s) - Supavit Kongwudhikunakorn, Kitsana Waiyamai
Publication year - 2018
Publication title - Walailak Journal of Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.146
H-Index - 15
eISSN - 2228-835X
pISSN - 1686-3933
DOI - 10.48048/wjst.2019.4133
Subject(s) - document clustering, cluster analysis, computer science, word (group theory), rand index, representation (politics), information retrieval, natural language processing, artificial intelligence, metric (unit), n gram, index (typography), tf–idf, similarity (geometry), data mining, language model, mathematics, world wide web, term (time), political science, law, economics, physics, image (mathematics), quantum mechanics, operations management, geometry, politics
This paper presents a method for clustering short text documents, such as instant messages, SMS, or news headlines. The vocabulary of each text is expanded using external knowledge sources and represented with distributed word representations. Clustering is performed with the K-means algorithm, using Word Mover's Distance as the distance metric. Experiments compare the clustering quality of this method against several leading methods on large datasets drawn from BBC headlines, SearchSnippets, StackExchange, and Twitter. On all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that a cluster's description can be inferred from the keywords represented in that cluster.
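
The clustering step described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The 2-D word vectors and toy documents are invented. The exact Word Mover's Distance, which needs an optimal-transport solver, is replaced here by the relaxed WMD lower bound. Because a mean of documents is not directly defined under WMD, the update step picks a medoid (the member document closest to the rest of its cluster) in place of a K-means centroid. The names `EMBEDDINGS`, `nbow`, `relaxed_wmd`, and `cluster` are hypothetical, and the vocabulary-expansion step is omitted.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors, invented for illustration only. A real
# system would load pretrained distributed word representations instead.
EMBEDDINGS = {
    "goal": (1.0, 0.1), "match": (0.9, 0.2),
    "striker": (1.1, 0.0), "league": (0.8, 0.1),
    "stocks": (0.0, 1.0), "market": (0.1, 0.9),
    "shares": (0.2, 1.1), "bank": (0.1, 1.2),
}


def nbow(tokens):
    """Normalized bag-of-words: word -> relative frequency (known words only)."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no in-vocabulary words")
    return {word: count / total for word, count in counts.items()}


def one_sided(a, b):
    """Cost of moving each word of `a` entirely to its nearest word in `b`."""
    return sum(
        weight * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in b)
        for w, weight in a.items()
    )


def relaxed_wmd(a, b):
    """Relaxed WMD (assumption: used here instead of exact WMD).

    Each one-sided cost is a lower bound on the exact distance, so the
    larger of the two is the tighter bound.
    """
    return max(one_sided(a, b), one_sided(b, a))


def cluster(docs, k, iterations=10):
    """K-means-style loop with medoids standing in for centroids."""
    bows = [nbow(d) for d in docs]
    medoids = list(range(k))  # naive seeding: the first k documents
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: relaxed_wmd(bow, bows[medoids[c]]))
            for bow in bows
        ]
        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(relaxed_wmd(bows[i], bows[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


# Toy short texts (invented): sports and finance headlines, interleaved.
DOCS = [
    ["goal", "match"],
    ["stocks", "market"],
    ["striker", "league", "goal"],
    ["shares", "bank"],
    ["match", "league"],
    ["market", "shares", "stocks"],
]

if __name__ == "__main__":
    print(cluster(DOCS, k=2))
```

Seeding with the first `k` documents keeps the sketch deterministic; a practical run would more likely use random or k-means++-style seeding, and clustering quality would be scored against reference labels with a measure such as the Adjusted Rand Index.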