Document representation methods for clustering bilingual documents
Author(s) - Ma Shutian, Zhang Chengzhi, He Daqing
Publication year - 2016
Publication title - Proceedings of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.193
H-Index - 14
ISSN - 2373-9231
DOI - 10.1002/pra2.2016.14505301065
Subject(s) - latent dirichlet allocation, computer science, cluster analysis, document clustering, information retrieval, topic model, representation (politics), natural language processing, vector space model, search engine indexing, text corpus, artificial intelligence, politics, political science, law
Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. To deliver relevant information in different languages, efficient management of multilingual documents is worth studying. Classification and clustering are the two typical methods for document management. However, the lack of training data and the high effort required for corpus annotation raise the cost of classifying multilingual documents, a task that must also bridge language gaps. Clustering is therefore more suitable for such practical applications. Two main factors are involved in document clustering: the document representation method and the clustering algorithm. In this paper, we focus on the document representation method and demonstrate that the choice of representation method affects the quality of clustering results. In our experiments, we use parallel corpora (English‐Chinese documents on the topic of technology information) and comparable corpora (English and Chinese documents on the topics of mobile technology and wind energy) as datasets. We compare four document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, and Doc2Vec. Experimental results show that the accuracy of the Vector Space Model was not competitive with the other methods across the clustering tasks. Latent Semantic Indexing is overly sensitive to the corpus itself, since it behaved differently when clustering the two topics of the comparable corpora. Latent Dirichlet Allocation performs best when clustering documents in small comparable corpora, while Doc2Vec performs best on the large document set of the parallel corpora. Accordingly, the characteristics of a corpus should be taken into account when choosing a document representation method in order to achieve better performance.
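To make the role of the representation step concrete, the following is a minimal sketch of the simplest of the four methods, a Vector Space Model, feeding a clustering algorithm. It is not the authors' implementation: the TF-IDF weighting, the cosine-based k-means, the function names (`tfidf_vectors`, `cosine`, `kmeans`), the fixed seed documents, and the four toy English documents are all assumptions chosen for illustration, and the bilingual setting of the paper is not reproduced.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, unit-length TF-IDF vectors) for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed weighting scheme, not taken from the paper).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity; a plain dot product because inputs are unit length."""
    return sum(a * b for a, b in zip(u, v))


def kmeans(vectors, seeds, iterations=10):
    """Cosine-based k-means; `seeds` are indices of the initial centroids."""
    centroids = [vectors[i][:] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        # Recompute each centroid as the normalized mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if not members:
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(x * x for x in mean)) or 1.0
            centroids[k] = [x / norm for x in mean]
    return labels


# Hypothetical toy documents echoing the two comparable-corpus topics.
docs = [
    "mobile phone battery screen".split(),
    "phone screen mobile network".split(),
    "wind turbine energy power".split(),
    "wind energy turbine blade".split(),
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, seeds=[0, 2])
print(labels)
```

Swapping `tfidf_vectors` for another representation (for example topic proportions or learned document embeddings) while keeping `kmeans` fixed is one way to isolate the effect of the representation method, which is the comparison the abstract describes.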