Improvised Architecture for Distributed Web Crawling

Open Access

Author(s) - Tilak Patidar, Aditya Ambasth

Publication year - 2016

Publication title - International Journal of Computer Applications

Language(s) - English

Resource type - Journals

ISSN - 0975-8887

DOI - 10.5120/ijca2016911857

Subject(s) - computer science, crawling, architecture, world wide web, web crawler, biology, anatomy, art, visual arts

Web crawlers are programs designed to fetch web pages for information retrieval systems. Crawlers do this by following hyperlinks in web pages to automatically download new pages or update existing pages in the repository. A web crawler interacts with millions of hosts, fetches millions of pages per second, and writes these pages to a database. To achieve high performance at a reasonable cost, it must therefore sustain I/O performance and keep its use of network resources within OS limits. This paper showcases efficient techniques for developing a scalable web crawling system, addressing challenges related to the structure of the web, distributed computing, job scheduling, spider traps, URL canonicalization, and inconsistent data formats on the web. It also briefly discusses a new web crawler architecture.
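As an illustration of one challenge the abstract names, URL canonicalization, the following is a minimal sketch using only the Python standard library. It is not the authors' method: the function names `canonicalize` and `should_fetch`, the in-memory `seen` set, and the particular normalization rules (lowercasing scheme and host, dropping default ports and fragments, sorting query parameters) are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` (hypothetical helper).

    Assumed rules: lowercase scheme and host, drop default ports,
    use "/" for an empty path, sort query parameters, drop fragments.
    Userinfo (user:password@) is discarded in this sketch.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Sorting makes "?b=2&a=1" and "?a=1&b=2" compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))


# In-memory set of canonical URLs already scheduled (assumption: a real
# distributed crawler would keep this in shared or partitioned storage).
seen = set()


def should_fetch(url: str) -> bool:
    """Return True only the first time a canonical URL is encountered."""
    key = canonicalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With these rules, `HTTP://Example.COM:80/a?b=2&a=1#frag` and `http://example.com/a?a=1&b=2` map to the same key, so the second one is skipped. Sorting query parameters is a heuristic, since some servers treat parameter order as significant.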
