Open Access
Mining and Harvesting High Quality Topical Resources from the Web
Author(s) - Zhao Wei, Guan Ziyu, Cao Zhengwen, Liu Zheng
Publication year - 2016
Publication title - Chinese Journal of Electronics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.267
H-Index - 25
eISSN - 2075-5597
pISSN - 1022-4653
DOI - 10.1049/cje.2016.01.008
Subject(s) - web crawler, crawling, computer science, focused crawler, world wide web, scalability, quality (philosophy), web resource, web page, information retrieval, database, static web page, web development, medicine, philosophy, epistemology, anatomy
Focused crawlers aim to prioritize uncrawled URLs effectively so that they harvest relevant pages while avoiding irrelevant ones. In practice, because of the explosion of Web information, harvesting high-quality topical Web resources matters more than harvesting pages that are merely relevant. Our study shows that the popular focused crawling strategy cannot achieve this goal. In this paper we develop a new focused crawler, On‐line topical quality estimation (OTQE), which evaluates the topical quality of uncrawled pages from the observed link and content evidence and prioritizes their URLs accordingly. The new crawler is scalable and requires fewer additional resources to perform link‐based analysis. Experimental results from crawling 3.6 million Web pages demonstrate the advantages of the proposed method over traditional focused crawlers.
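
The general idea of prioritizing uncrawled URLs by combined content and link evidence can be illustrated with a small sketch. This is not the OTQE algorithm, whose details are not given in the abstract; it is a generic best-first frontier under stated assumptions. The names `Frontier` and `estimate_score`, the anchor-text overlap measure, the `parent_relevance` input, and the blending weight `alpha` are all hypothetical choices made for illustration.

```python
import heapq
from itertools import count


class Frontier:
    """Best-first queue of uncrawled URLs (illustrative, not OTQE itself)."""

    def __init__(self):
        self._heap = []        # entries are (-score, tie_breaker, url)
        self._best = {}        # url -> best score currently queued
        self._crawled = set()  # URLs already handed out by pop()
        self._tie = count()    # keeps ordering stable for equal scores

    def push(self, url, score):
        # Ignore URLs already crawled; otherwise keep only the best score.
        if url in self._crawled:
            return
        if score > self._best.get(url, float("-inf")):
            self._best[url] = score
            heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        # Skip stale heap entries left behind by later, better pushes.
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if self._best.get(url) == -neg_score:
                del self._best[url]
                self._crawled.add(url)
                return url, -neg_score
        raise IndexError("frontier is empty")

    def __len__(self):
        return len(self._best)


def estimate_score(anchor_text, parent_relevance, topic_terms, alpha=0.5):
    """Hypothetical score: blend content evidence with link evidence.

    Content evidence is the fraction of anchor-text words that are topic
    terms; link evidence is the relevance of the page holding the link.
    """
    words = anchor_text.lower().split()
    content = sum(w in topic_terms for w in words) / len(words) if words else 0.0
    return alpha * content + (1 - alpha) * parent_relevance


if __name__ == "__main__":
    topic = {"web", "crawler"}
    frontier = Frontier()
    frontier.push("http://example.com/a",
                  estimate_score("web crawler tutorial", 0.9, topic))
    frontier.push("http://example.com/b",
                  estimate_score("cooking recipes", 0.2, topic))
    while len(frontier):
        print(frontier.pop())
```

In a sketch like this, a URL's score can be raised as more in-links to it are observed during the crawl, which is one simple way to fold link evidence in on-line without a separate global link analysis pass.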
