Exploiting named entities for bilingual news clustering
Author(s) - Montalvo Soto, Martínez Raquel, Fresno Víctor, Delgado Agustín
Publication year - 2015
Publication title - Journal of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.903
H-Index - 145
eISSN - 2330-1643
pISSN - 2330-1635
DOI - 10.1002/asi.23175
Subject(s) - computer science, cluster analysis, heuristic, information retrieval, document clustering, artificial intelligence, natural language processing, data mining
In this article, we present a new algorithm for clustering a bilingual collection of comparable news items into groups of specific topics. Our hypothesis is that named entities (NEs) are more informative than other features in the news when clustering fine-grained topics. The algorithm does not need any information about the number of clusters as input, and it carries out the clustering based only on the named entities shared by the news items. The proposal is evaluated on different data sets and outperforms other state-of-the-art algorithms, which supports the plausibility of the approach. In addition, because the applicability of our approach depends on being able to identify equivalent named entities among the news items, we propose a heuristic system to identify equivalent named entities in the same and in different languages, which obtains good performance.
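The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: grouping news items by the named entities they share, without fixing the number of clusters in advance. Everything in it is an assumption made for illustration, not the authors' method: the function name `cluster_by_shared_entities`, the `min_shared` threshold, the use of connected components via union-find, and the premise that equivalent entities in both languages have already been mapped to one normalized string.

```python
from itertools import combinations


def cluster_by_shared_entities(docs, min_shared=2):
    """Group documents that share at least `min_shared` named entities.

    docs: dict mapping a document id to a set of normalized named entities.
    Hypothetical helper: links are merged transitively (connected components),
    so the number of clusters emerges from the data instead of being an input.
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every pair of documents with enough entities in common.
    for a, b in combinations(docs, 2):
        if len(docs[a] & docs[b]) >= min_shared:
            union(a, b)

    clusters = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return list(clusters.values())


# Toy bilingual collection; entity strings are assumed to be already
# normalized so that equivalent English and Spanish mentions match.
news = {
    "en-1": {"angela merkel", "berlin", "european union"},
    "es-1": {"angela merkel", "berlin", "bundestag"},
    "en-2": {"rafael nadal", "roland garros"},
    "es-2": {"rafael nadal", "roland garros", "paris"},
    "en-3": {"nasa", "mars"},
}

clusters = cluster_by_shared_entities(news)
```

In this toy input the two Merkel items and the two Nadal items end up in separate bilingual clusters, and the NASA item stays alone. The sketch compares all pairs of documents, so it is meant for small collections, and its quality depends entirely on how well equivalent entities were matched beforehand, which is the role the abstract assigns to the heuristic system.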