Tunneling enhanced by web page content block partition for focused crawling

Peng Tao; Zhang Changli; Zuo Wanli



Author(s) - Peng Tao, Zhang Changli, Zuo Wanli

Publication year - 2008

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.1211

Subject(s) - crawling, computer science, web page, web crawler, page view, information retrieval, world wide web, partition (number theory), focused crawler, static web page, hits algorithm, context (archaeology), dynamic web page, web development, mathematics, medicine, paleontology, combinatorics, anatomy, biology

The complexity of web information environments and the prevalence of multiple‐topic web pages significantly degrade the performance of focused crawling. A highly relevant region in a web page may be obscured by the low overall relevance of that page, so segmenting web pages into smaller units can significantly improve performance. Traversing irrelevant pages to reach a relevant one (tunneling) can also improve the effectiveness of focused crawling by expanding its reach. This paper presents a heuristic‐based method to enhance focused crawling performance. The method uses a Document Object Model (DOM)‐based page partition algorithm to segment a web page into content blocks with a hierarchical structure, and investigates how block‐level evidence can be used to enhance focused crawling by tunneling. Page segmentation can transform an uninteresting multi‐topic web page into several single‐topic context blocks, some of which may be interesting. Accordingly, a focused crawler can pursue the interesting content blocks to retrieve the relevant pages. Experimental results indicate that this approach outperforms the Breadth‐First, Best‐First, and Link‐context algorithms in harvest rate, target recall, and target length. Copyright © 2007 John Wiley & Sons, Ltd.
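
The idea of block-level evidence can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out. It is a minimal illustration that assumes a flat partition at a few block-level tags (the set `BLOCK_TAGS`), a crude term-overlap relevance measure, and hypothetical names (`BlockPartitioner`, `relevance`, `score_links`, `PAGE`, `TOPIC`) chosen only for this example. Each link inherits the relevance of its enclosing block rather than the relevance of the whole page.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit content blocks. The paper's DOM-based
# partition is hierarchical; this sketch flattens it for brevity.
BLOCK_TAGS = {"div", "td", "section", "article"}


class BlockPartitioner(HTMLParser):
    """Collects the words and links of each block (innermost block wins)."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks
        self._stack = []   # currently open blocks, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"words": [], "links": []})
        elif tag == "a" and self._stack:
            href = dict(attrs).get("href")
            if href:
                self._stack[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            block = self._stack.pop()
            if block["words"] or block["links"]:
                self.blocks.append(block)

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["words"].extend(re.findall(r"[a-z0-9]+", data.lower()))


def relevance(words, topic_terms):
    """Fraction of words that are topic terms (stand-in for a real classifier)."""
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def score_links(html, topic_terms):
    """Map each href to the relevance of the block that contains it."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    scores = {}
    for block in parser.blocks:
        block_score = relevance(block["words"], topic_terms)
        for href in block["links"]:
            scores[href] = max(block_score, scores.get(href, 0.0))
    return scores


# Hypothetical two-topic page: one off-topic block, one on-topic block.
PAGE = """
<div>Cheap flights and hotel deals for your holiday <a href="/travel">book now</a></div>
<div>Focused crawler tunneling survey <a href="/crawling">read more</a></div>
"""
TOPIC = {"focused", "crawler", "tunneling"}

link_scores = score_links(PAGE, TOPIC)
all_words = [w for b in re.findall(r">([^<]+)<", PAGE) for w in re.findall(r"[a-z0-9]+", b.lower())]
page_score = relevance(all_words, TOPIC)
```

In this toy page the whole-page score is diluted by the travel block, while the link inside the on-topic block keeps a high score, so a crawler ordering its frontier by `link_scores` would follow `/crawling` first even though the page as a whole looks weakly relevant.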
