Exploiting Near‐Duplicate Relations in Organizing News Archives
Author(s) - Wang JenqHaur, Chang HungChi
Publication year - 2014
Publication title - International Journal of Intelligent Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.291
H-Index - 87
eISSN - 1098-111X
pISSN - 0884-8173
DOI - 10.1002/int.21647
Subject(s) - computer science, information retrieval, cluster analysis, task (project management), document clustering, sentence, graph, data mining, world wide web, natural language processing, artificial intelligence, theoretical computer science, management, economics
Huge numbers of documents are generated on the Web, especially news articles and social media content. Organizing these evolving documents so that readers can easily browse or search them is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, each of which provides only a partial view of the relations among news articles. To better utilize cross‐document relations, we propose a novel approach that organizes news archives by exploiting their near‐duplicate relations. First, we use a sentence‐level, statistics‐based approach to near‐duplicate copy detection, which is language independent and simple but effective. Because content‐based approaches are usually time consuming and not robust to term substitutions, a near‐duplicate detection approach can be used instead. Second, by extracting the cross‐document relations into a block‐sharing graph, we derive a near‐duplicate clustering in which users can easily browse documents and spot unnecessary repetitions among them. In our experiments, the proposed approach showed high efficiency and good accuracy in detecting and clustering near‐duplicate documents in news archives.
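
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a "block" is a whitespace-normalized sentence obtained by a naive punctuation split, that two documents are linked when they share at least `min_shared` blocks, and that clusters are the connected components of the resulting block-sharing graph. All function names and the `min_shared` parameter are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def sentence_blocks(text):
    """Split text into normalized sentence-level blocks.

    Assumption: a naive split on Western and CJK end-of-sentence
    punctuation stands in for the paper's sentence-level segmentation.
    """
    parts = re.split(r"[.!?。！？]+", text)
    return {" ".join(p.lower().split()) for p in parts if p.strip()}


def block_sharing_graph(docs, min_shared=1):
    """Link documents that share at least `min_shared` blocks.

    `docs` maps a document id to its text. Returns an adjacency map
    {doc_id: set of neighbouring doc_ids}.
    """
    index = defaultdict(set)  # block -> ids of documents containing it
    for doc_id, text in docs.items():
        for block in sentence_blocks(text):
            index[block].add(doc_id)

    shared = defaultdict(int)  # (id_a, id_b) -> number of shared blocks
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    graph = {doc_id: set() for doc_id in docs}
    for (a, b), count in shared.items():
        if count >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Return the connected components of the graph as sorted lists."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


if __name__ == "__main__":
    # Hypothetical toy archive, for illustration only.
    archive = {
        "a": "Storm hits coast. Thousands evacuated.",
        "b": "Thousands evacuated. Officials respond.",
        "c": "Markets rally today.",
    }
    print(clusters(block_sharing_graph(archive)))
```

In this toy archive, documents "a" and "b" share one sentence and fall into the same cluster, while "c" stays on its own. Exact sentence matching is the simplest possible choice here; it is fast but, unlike a statistics-based detector, it will miss sentences that were lightly edited.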