Rediscovering missing web pages using link neighborhood lexical signatures
Author(s) - Martin Klein, Jeb Ware, Michael L. Nelson
Publication year - 2011
Publication title - CiteSeerX (The Pennsylvania State University)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/1998076.1998101
Subject(s) - computer science, web page, signature (topology), point (geometry), information retrieval, natural language processing, lexical database, word (group theory), artificial intelligence, world wide web, linguistics, mathematics, philosophy, geometry, wordnet
Lexical signatures, which consist of a small number of words chosen to represent the "aboutness" of a page, have previously been proposed for discovering the new URI of a missing web page. However, prior methods relied on computing the lexical signature before the page was lost, or on using cached or archived versions of the page to calculate it. We demonstrate a system that constructs a lexical signature for a page from its link neighborhood, that is, its "backlinks": the pages that link to the missing page. After testing various methods, we show that one can construct a lexical signature for a missing web page using only ten backlink pages. Further, we show that only the first level of backlinks is useful in this effort. The text that the backlinks use to point to the missing page serves as input for creating a four-word lexical signature. That lexical signature successfully finds the target URI in more than half of the test cases.
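As a rough illustration of the idea in the abstract, the sketch below builds a four-word signature from the anchor text of at most ten backlinks. The abstract does not say how terms are weighted, so the ranking here (the number of backlinks whose anchor text contains a term) is an assumption, as are the function name `lexical_signature`, the tiny stop-word list, and the sample anchor texts. It is not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on",
              "here", "click", "this", "page"}


def lexical_signature(anchor_texts, size=4, max_backlinks=10):
    """Return up to `size` terms, ranked by how many backlinks use them.

    Assumption: a term's weight is the number of anchor texts containing it.
    """
    counts = Counter()
    for text in anchor_texts[:max_backlinks]:
        # Count each term once per backlink so one verbose link cannot dominate.
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(terms - STOP_WORDS)
    # Highest count first; ties are broken alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


# Made-up anchor texts taken from pages that link to a missing page.
anchors = [
    "Hubble telescope image gallery",
    "gallery of Hubble images",
    "NASA Hubble gallery",
    "click here for the Hubble telescope",
    "telescope images from NASA",
]
signature = lexical_signature(anchors)
print(" ".join(signature))
```

The resulting terms would then be submitted as a search engine query, and the result list checked for the page's new URI.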