Open Access
Probabilistic models for focused web crawling
Author(s) - Hongyu Liu, Evangelos Milios, Jeannette Janssen
Publication year - 2004
Publication title - National Research Council Canada (Government of Canada)
Language(s) - English
Resource type - Conference proceedings
ISBN - 1-58113-978-0
DOI - 10.1145/1031453.1031458
Subject(s) - crfs , crawling , computer science , conditional random field , hidden markov model , web crawler , probabilistic logic , context (archaeology) , relevance (law) , artificial intelligence , web page , machine learning , information retrieval , data mining , world wide web , medicine , paleontology , biology , political science , law , anatomy
A focused crawler must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modelling of context as well as of the current observations. Probabilistic models such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) can potentially capture both formatting and context. In this paper, we present the use of HMMs for focused web crawling and compare it with the Best-First strategy. Furthermore, we discuss the concept of using CRFs to overcome the difficulties with HMMs and to support the use of many arbitrary, overlapping features. Finally, we describe the design of a system applying CRFs to focused web crawling, which is currently being implemented.
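For readers unfamiliar with the Best-First baseline named in the abstract, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a common variant in which each newly discovered URL inherits the relevance score of the page that linked to it, and the crawler always expands the highest-priority URL next. The toy link graph `TOY_WEB`, the keyword-overlap `relevance` function, and all other names are hypothetical and chosen only for illustration.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links). Not from the paper.
TOY_WEB = {
    "a": ("sports news football", ["b", "c"]),
    "b": ("hidden markov model tutorial", ["d"]),
    "c": ("cooking recipes", ["e"]),
    "d": ("conditional random field model", []),
    "e": ("garden tips", []),
}


def relevance(text, topic_words):
    """Assumed scoring: fraction of topic words that appear in the page text."""
    words = set(text.split())
    return len(words & topic_words) / len(topic_words)


def best_first_crawl(seed, topic_words, max_pages=10):
    """Visit pages in order of the score inherited from the linking page."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, seed)]  # the seed gets the highest possible priority
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]  # stands in for fetching the page
        score = relevance(text, topic_words)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Each new URL is queued with its parent's relevance score.
                heapq.heappush(frontier, (-score, link))
    return visited


if __name__ == "__main__":
    for url, score in best_first_crawl("a", {"markov", "model", "random"}):
        print(url, round(score, 2))
```

In this sketch the priority of a URL depends only on the single page that linked to it. The abstract's argument is that a probabilistic model over sequences of crawled pages can use more of that context than such a one-step heuristic.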
