Document representation and clustering models for bilingual documents clustering
Author(s) - Ma Shutian, Zhang Chengzhi
Publication year - 2017
Publication title - Proceedings of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.193
H-Index - 14
ISSN - 2373-9231
DOI - 10.1002/pra2.2017.14505401056
Subject(s) - cluster analysis, computer science, latent Dirichlet allocation, DBSCAN, document clustering, representation (politics), information retrieval, artificial intelligence, search engine indexing, topic model, natural language processing, fuzzy clustering, machine learning, canopy clustering algorithm, politics, political science, law
The internet now hosts many documents in languages other than English, and people face challenges when seeking and using this information; for example, non‐native English‐speaking students tend to have problems when using libraries in North American universities. Bilingual document clustering can help people organize information efficiently and has practical advantages: it divides documents into groups that share a topic, and it needs no training dataset. Document representation and the clustering model are the two key components of clustering. This paper compares four popular representation methods, the vector space model (VSM), latent semantic indexing (LSI), latent Dirichlet allocation (LDA) and doc2vec (D2V), together with four different types of clustering algorithms, K‐means++, BIRCH, DBSCAN and affinity propagation (AP), to identify appropriate combinations for bilingual document clustering. Both parallel and comparable corpora are used as bilingual datasets. Experimental results show that clustering performance varies when different representation methods are combined with different clustering algorithms, so choosing a good combination of models is important for better document organization.
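
To make one of the compared combinations concrete, the following is a minimal sketch of a VSM representation (TF-IDF weights) clustered with K-means++ seeding followed by standard Lloyd iterations, written in plain Python. It is not the authors' implementation. The toy corpus, the function names (`tfidf_vectors`, `kmeans_pp_init`, `lloyd`, `kmeans`) and the choice to concatenate the English and Spanish sides of each document into one token list are all assumptions made for illustration; the abstract does not say how the two languages are brought into a shared space.

```python
import math
import random
from collections import Counter

# Hypothetical toy "parallel" corpus: each document joins an English side
# and an invented Spanish side so both languages share one vector space.
DOCS = [
    "library books search biblioteca libros buscar".split(),
    "library books reader biblioteca libros lector".split(),
    "team goal match equipo gol partido".split(),
    "team goal league equipo gol liga".split(),
]

def tfidf_vectors(docs):
    """VSM: map each tokenized document to an L2-normalized TF-IDF vector."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, so that no weight is ever zero.
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: draw each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(list(p))
                break
    return centers

def lloyd(points, centers, max_iter=100):
    """Lloyd iterations: assign points to the nearest center, then move
    each center to the mean of its members, until labels stop changing."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def kmeans(points, k, seed=0):
    """K-means with K-means++ seeding; `seed` fixes the random draws."""
    rng = random.Random(seed)
    return lloyd(points, kmeans_pp_init(points, k, rng))

if __name__ == "__main__":
    print(kmeans(tfidf_vectors(DOCS), k=2))
```

Swapping `tfidf_vectors` for another representation (LSI, LDA or doc2vec vectors) or `kmeans` for another clustering algorithm would give the other combinations the paper evaluates; in practice a library such as scikit-learn or gensim would likely be used for those rather than hand-written code.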
