Open Access
Design and Implementation of a Web Crawler System based on an Adaptive Page-Rank algorithm
Author(s) -
Xin Zhang,
Zhi Feng Cheng,
Chen Zhang
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1634/1/012021
Subject(s) - web crawler , computer science , web page , focused crawler , information retrieval , crawling , search engine , static web page , precision and recall , python (programming language) , dynamic web page , rank (graph theory) , world wide web , algorithm , web navigation , programming language , mathematics , medicine , combinatorics , anatomy
Web crawlers can automatically extract web page information, but some pages repeat keywords to inflate their search rankings. To address this issue, we propose an adaptive Page-rank algorithm and build a crawler system around it. Specifically, we generate a relationship matrix from the access (link) relationships among the crawled web pages, then iteratively generate a probability matrix based on the number of web pages, and finally display the crawled pages in descending order of their calculated weights. In addition, we propose controlling the iterative process in Page-rank using the coherence of anchor texts. The system is implemented in Python to realize the web crawling functions. Experimental results demonstrate that the system collects data at high speed. Compared with the Hints and classical Page-rank crawler systems, the proposed method achieves higher precision and recall.
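
The ranking pipeline summarized in the abstract (link relationships, an iteratively updated probability matrix, then a descending sort by weight) can be illustrated with a minimal sketch of the classical Page-rank power iteration. This is not the authors' adaptive variant: the abstract does not specify how anchor-text coherence controls the iteration, so that part is left out. The function names, the toy link graph, the damping factor of 0.85, and the convergence tolerance are all assumptions made for illustration.

```python
def build_transition_matrix(links, pages):
    """Build a column-stochastic matrix from link relationships.

    Entry [i][j] is the probability of moving from page j to page i.
    `links` maps each page to the list of pages it links to (assumed format).
    """
    n = len(pages)
    index = {page: k for k, page in enumerate(pages)}
    matrix = [[0.0] * n for _ in range(n)]
    for src in pages:
        j = index[src]
        targets = [t for t in links.get(src, []) if t in index]
        if targets:
            share = 1.0 / len(targets)
            for t in targets:
                matrix[index[t]][j] += share
        else:
            # Dangling page (no out-links): spread its weight uniformly.
            for i in range(n):
                matrix[i][j] = 1.0 / n
    return matrix


def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Classical Page-rank by power iteration (damping value is an assumption)."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    matrix = build_transition_matrix(links, pages)
    rank = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(matrix[i][j] * rank[j] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # stop once the weights have stabilized
            break
    return dict(zip(pages, rank))


def ranked_pages(links, **kwargs):
    """Return (page, weight) pairs in descending order of weight."""
    scores = pagerank(links, **kwargs)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical crawl result: page -> pages it links to.
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, weight in ranked_pages(toy_links):
        print(f"{page}: {weight:.4f}")
```

In this sketch the stopping rule is a plain numerical tolerance; an adaptive scheme such as the one the abstract describes would replace or augment that check with a criterion derived from anchor-text coherence.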
