Open Access
Learning a concept‐based document similarity measure
Author(s) - Lan Huang, David Milne, Eibe Frank, Ian H. Witten
Publication year - 2012
Publication title - Journal of the American Society for Information Science and Technology
Language(s) - English
Resource type - Journals
eISSN - 1532-2890
pISSN - 1532-2882
DOI - 10.1002/asi.22689
Subject(s) - computer science, document clustering, information retrieval, similarity (geometry), cluster analysis, measure (data warehouse), natural language processing, artificial intelligence, semantic similarity, similarity measure, document classification, data mining, image (mathematics)
Document similarity measures are crucial components of many text‐analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: They estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine‐learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
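
The general idea in the abstract, scoring a document pair at a lexical level and a semantic level and then learning from human ratings how to weight the two, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the lexical feature is a plain term-frequency cosine, the semantic feature is a Jaccard overlap of concept identifiers (assumed to be supplied by some external concept annotator), and the learner is a linear model fitted by gradient descent. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Lexical feature (assumed): cosine between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(concepts_a, concepts_b):
    """Semantic feature (assumed): overlap of concept identifiers."""
    a, b = set(concepts_a), set(concepts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fit_linear_combination(features, ratings, steps=5000, lr=0.1):
    """Fit rating ~ w0 + w1 * lexical + w2 * semantic.

    Uses batch gradient descent on mean squared error. `features` is a
    list of (lexical, semantic) pairs; `ratings` are human similarity
    judgments for the same document pairs.
    """
    w = [0.0, 0.0, 0.0]
    n = len(features)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (lex, sem), y in zip(features, ratings):
            err = w[0] + w[1] * lex + w[2] * sem - y
            grad[0] += err
            grad[1] += err * lex
            grad[2] += err * sem
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def predict(w, lex, sem):
    """Combined similarity score under the fitted weights."""
    return w[0] + w[1] * lex + w[2] * sem


if __name__ == "__main__":
    # Toy, made-up inputs purely to show the call pattern.
    doc_a = "the cat sat on the mat".split()
    doc_b = "a feline rested on the rug".split()
    lex = cosine_similarity(doc_a, doc_b)
    sem = jaccard_similarity({"Cat", "Mat"}, {"Cat", "Carpet"})
    weights = fit_linear_combination(
        [(0.0, 0.0), (1.0, 1.0), (0.2, 0.8), (0.8, 0.2)],
        [0.0, 1.0, 0.6, 0.5],
    )
    print(predict(weights, lex, sem))
```

In this toy setup the two documents share few words but some concepts, so the semantic feature can raise the combined score where word overlap alone would not; how much it does so depends entirely on the weights learned from the ratings.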
