Document representation methods for clustering bilingual documents

Ma Shutian; Zhang Chengzhi; He Daqing

Author(s) - Ma Shutian, Zhang Chengzhi, He Daqing

Publication year - 2016

Publication title - Proceedings of the Association for Information Science and Technology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.193

H-Index - 14

ISSN - 2373-9231

DOI - 10.1002/pra2.2016.14505301065

Subject(s) - latent dirichlet allocation, computer science, cluster analysis, document clustering, information retrieval, topic model, representation (politics), natural language processing, vector space model, search engine indexing, text corpus, artificial intelligence, politics, political science, law

Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. Delivering relevant information across languages therefore calls for efficient management of multilingual documents. Classification and clustering are the two typical methods for document management. However, classifying multilingual documents is costly: training data is scarce, corpus annotation takes considerable effort, and the language gap must also be bridged. Clustering is therefore better suited to such practical applications. Two main factors are involved in document clustering: the document representation method and the clustering algorithm. In this paper, we focus on the document representation method and demonstrate that the choice of representation affects the quality of clustering results. In our experiments, we use parallel corpora (English-Chinese documents on technology information) and comparable corpora (English and Chinese documents on mobile technology and wind energy) as datasets. We compare four types of document representation methods: Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, and Doc2Vec. Experimental results show that the accuracy of the Vector Space Model was not competitive with that of the other methods in all clustering tasks. Latent Semantic Indexing is overly sensitive to the corpus itself: it behaved differently when clustering the two topics of the comparable corpora. Latent Dirichlet Allocation performs best when clustering the small comparable corpora, while Doc2Vec performs best on the large document set of the parallel corpora. Accordingly, the characteristics of a corpus should be taken into account when choosing a document representation method in order to achieve better performance.
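
To make the "representation, then clustering" pipeline described in the abstract concrete, the following is a minimal sketch of only the Vector Space Model baseline, using the Python standard library. Everything in it is an assumption for illustration and is not taken from the paper: the toy documents (each a hypothetical English-Chinese pair merged into one token list), the smoothed TF-IDF weighting, the choice of spherical k-means with fixed seeds, and purity as a stand-in for the accuracy measure the abstract mentions. The abstract does not state which clustering algorithm or tokenization the authors used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: pre-tokenised English-Chinese pairs, merged per document.
DOCS = [
    ["mobile", "phone", "app", "手机", "应用"],
    ["wind", "turbine", "energy", "风能", "涡轮"],
    ["mobile", "app", "network", "手机", "网络"],
    ["wind", "energy", "farm", "风能", "发电"],
    ["phone", "network", "mobile", "手机"],
    ["turbine", "farm", "wind", "风能"],
]
GOLD = ["mobile", "wind", "mobile", "wind", "mobile", "wind"]  # assumed topic labels


def normalise(vec):
    """Scale a sparse vector (term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Vector Space Model: unit-length TF-IDF vectors with smoothed IDF."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalise(weights))
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def spherical_kmeans(vectors, seeds, max_iter=20):
    """Cluster unit vectors by cosine similarity, starting from the given seed documents."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            total = defaultdict(float)
            for v, lab in zip(vectors, labels):
                if lab == c:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the old centroid if a cluster is empty
                centroids[c] = normalise(total)
    return labels


def purity(labels, gold):
    """Fraction of documents that fall in their cluster's majority topic."""
    by_cluster = defaultdict(Counter)
    for lab, g in zip(labels, gold):
        by_cluster[lab][g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(gold)


labels = spherical_kmeans(tfidf_vectors(DOCS), seeds=[0, 1])
score = purity(labels, GOLD)
print(labels, score)
```

Under this structure, comparing representations amounts to swapping `tfidf_vectors` for another function that maps documents to vectors (for example one based on Latent Semantic Indexing, Latent Dirichlet Allocation, or Doc2Vec) while keeping the clustering and evaluation steps fixed, which is the kind of controlled comparison the abstract describes.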
