Document similarity based on concept tree distance
Author(s) - Praveen Lakkaraju, Susan Gauch, Mirco Speretta
Publication year - 2008
Publication title - KU ScholarWorks (The University of Kansas)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/1379092.1379118
Subject(s) - computer science, information retrieval, similarity (geometry), search engine, tree (set theory), similarity measure, data mining, vector space model, measure (data warehouse), classifier (uml), similitude, artificial intelligence, mathematics, mathematical analysis, image (mathematics)
The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
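The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes each document has already been assigned weighted concepts by some classifier, that concepts are written as slash-separated paths in a shared hierarchy, and that a simplified edit cost is acceptable: inserting or deleting a node costs its weight, and changing a node's weight costs the absolute difference. The concept names, the weights, and every function name here are hypothetical. A flat cosine similarity is included for contrast with the vector space model.

```python
from math import sqrt


def build_concept_tree(concept_weights):
    """Expand {concept path: weight} into a tree stored as {node path: weight}.

    Every ancestor accumulates the weight of its descendants, so two documents
    that share a parent concept also share weight at that parent node.
    """
    tree = {}
    for path, weight in concept_weights.items():
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            node = "/".join(parts[:depth])
            tree[node] = tree.get(node, 0.0) + weight
    return tree


def tree_distance(tree_a, tree_b):
    """Simplified edit cost (an assumption, not a general tree edit distance):
    a node missing from one tree costs its full weight, and a node present in
    both costs the absolute difference of its two weights."""
    nodes = set(tree_a) | set(tree_b)
    return sum(abs(tree_a.get(n, 0.0) - tree_b.get(n, 0.0)) for n in nodes)


def tree_similarity(weights_a, weights_b):
    """Normalize the edit cost into [0, 1], where 1 means identical trees."""
    tree_a = build_concept_tree(weights_a)
    tree_b = build_concept_tree(weights_b)
    total = sum(tree_a.values()) + sum(tree_b.values())
    if total == 0:
        return 1.0  # two empty trees are treated as identical
    return 1.0 - tree_distance(tree_a, tree_b) / total


def cosine_similarity(weights_a, weights_b):
    """Flat vector space baseline: concepts are independent dimensions."""
    dot = sum(w * weights_b.get(c, 0.0) for c, w in weights_a.items())
    norm_a = sqrt(sum(w * w for w in weights_a.values()))
    norm_b = sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical classifier output for three documents.
doc_a = {"IR/Search/Ranking": 1.0}
doc_b = {"IR/Search/Indexing": 1.0}
doc_c = {"Hardware/Memory/Cache": 1.0}

print(tree_similarity(doc_a, doc_b))    # sibling concepts share two ancestors
print(tree_similarity(doc_a, doc_c))    # no shared ancestors
print(cosine_similarity(doc_a, doc_b))  # flat vectors share no dimension
```

In this toy setup the flat baseline treats `doc_a` and `doc_b` as unrelated because they share no concept, while the tree-based score gives them credit for their common ancestors. That is the kind of distinction a hierarchy-aware measure is meant to capture, although the measure in the paper may define its costs quite differently.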