RCrawler: An R package for parallel web crawling and scraping
Author(s) - Salim Khalil, Mohamed Fakir
Publication year - 2017
Publication title - SoftwareX
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.528
H-Index - 21
ISSN - 2352-7110
DOI - 10.1016/j.softx.2017.04.004
Subject(s) - web crawler, crawling, computer science, parsing, web page, web content, domain (mathematical analysis), web application, world wide web, information retrieval, artificial intelligence, medicine, mathematical analysis, mathematics, anatomy
RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, and store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and can be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth-level control, and a robots.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.
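The features listed in the abstract (parallel crawling, depth control, robots.txt handling, XPath-based content extraction) map onto arguments of the package's main function. The sketch below assumes the CRAN package name `Rcrawler` and the argument names `no_cores`, `no_conn`, `MaxDepth`, `Obeyrobots`, `ExtractXpathPat`, and `PatternsNames` as given in the package documentation; the target site is a placeholder.

```r
# A minimal sketch of a domain-restricted crawl with RCrawler.
# install.packages("Rcrawler")   # CRAN package name (assumption)
library(Rcrawler)

Rcrawler(
  Website = "http://www.example.com",          # placeholder domain
  no_cores = 4,                                # parallel worker processes
  no_conn  = 4,                                # simultaneous HTTP connections
  MaxDepth = 2,                                # depth-level control
  Obeyrobots = TRUE,                           # honour robots.txt rules
  ExtractXpathPat = c("//title", "//h1"),      # nodes to scrape per page
  PatternsNames   = c("title", "heading")      # labels for extracted fields
)

# Crawled pages are stored in a local repository; the extracted fields
# and the crawl index are exposed as objects in the R session.
```

This illustrates the workflow the abstract describes rather than a definitive configuration; consult the package manual for the full argument list and defaults.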
