Using XPaths of inbound links to cluster template-generated web pages

Open Access

Author(s) - Tomas Grigalis, Antanas Čenys

Publication year - 2014

Publication title - Computer Science and Information Systems

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.244

H-Index - 24

eISSN - 2406-1018

pISSN - 1820-0214

DOI - 10.2298/csis130416020g

Subject(s) - computer science, xpath, web page, cluster analysis, information retrieval, hits algorithm, scalability, task (project management), static web page, world wide web, document object model, data mining, database, web navigation, xml, artificial intelligence, xml validation, management, economics

Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages generated from a particular template. Selecting the pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing the XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving >90% accuracy.
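
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that inner-site links have already been extracted as (source URL, XPath of the link, target URL) triples, that positional indices such as `[3]` can be dropped from the XPath, and that pages whose inbound links share exactly the same set of normalized XPaths belong to one template cluster. The function names `normalize_xpath` and `cluster_by_inbound_xpaths` and the sample URLs are hypothetical.

```python
import re
from collections import defaultdict


def normalize_xpath(xpath):
    """Drop positional predicates such as [3] (an assumption of this sketch),
    so that sibling links in the same list share one XPath address."""
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_by_inbound_xpaths(links):
    """Group target URLs by the set of normalized XPaths of their inbound links.

    `links` is an iterable of (source_url, xpath, target_url) triples.
    Exact-match grouping is a simplification chosen for illustration.
    """
    # Collect, for every target page, the XPaths of all links pointing at it.
    inbound = defaultdict(set)
    for _source, xpath, target in links:
        inbound[target].add(normalize_xpath(xpath))

    # Pages with an identical inbound-XPath signature fall into one cluster.
    clusters = defaultdict(list)
    for target, xpaths in inbound.items():
        clusters[frozenset(xpaths)].append(target)
    return [sorted(urls) for urls in clusters.values()]


# Hypothetical link triples from a small shop-like site.
links = [
    ("/", "/html/body/div[2]/ul/li[1]/a", "/product/1"),
    ("/", "/html/body/div[2]/ul/li[2]/a", "/product/2"),
    ("/", "/html/body/div[1]/nav/a[1]", "/category/a"),
    ("/", "/html/body/div[1]/nav/a[2]", "/category/b"),
    ("/category/a", "/html/body/div[2]/ul/li[1]/a", "/product/3"),
]
clusters = cluster_by_inbound_xpaths(links)
```

Because the grouping is a single pass over the link triples with hash lookups, its cost grows roughly linearly with the number of links and no page-to-page structural comparison is needed, which is consistent with the scalability argument in the abstract. A real system would likely need a more tolerant similarity measure than exact set equality.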
