
Extraction System Web Content Sports New Based On Web Crawler Multi Thread
Author(s) -
Yoga Dwitya Pramudita,
Devie Rosa Anamisa,
Sigit Susanto Putro,
M A Rahmawanto
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1569/2/022077
Subject(s) - web crawler , crawling , computer science , thread (computing) , world wide web , web page , web server , focused crawler , the internet , static web page , information retrieval , operating system , medicine , anatomy
Web crawlers are programs used by search engines to collect information from the internet automatically, according to rules set by the user. Because so much sports news is published on the internet, crawling it requires very fast web crawlers. Several previous studies have discussed extracting information from web documents, which must take two aspects into account: the structure of the web page and the time required. Therefore, in this research a web crawler application was developed using a multi-thread approach. The multi-thread approach is used to produce a web crawler that crawls sports news faster by fetching from more than one news source address at a time. In addition to the multi-thread approach, the crawler is adapted to the structure of the website pages to ensure that the intended information is extracted. In tests of the crawling process, the multi-thread implementation was faster than the single-thread method by 122.95 seconds. With web update detection, however, the crawling process slowed by 6.27 seconds on unequal data, and by 24.76 seconds on server 1 and 23.92 seconds on server 2.
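The contrast between single-thread and multi-thread crawling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source addresses, the `fetch` stand-in (which sleeps instead of making a real HTTP request), and the worker count are all assumptions chosen only to show why fetching several addresses at once shortens the total crawl time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical news source addresses (placeholders, not from the paper).
SOURCES = [
    "https://news-a.example/sports",
    "https://news-b.example/sports",
    "https://news-c.example/sports",
    "https://news-d.example/sports",
]


def fetch(url, delay=0.2):
    """Stand-in for an HTTP request: sleeps to mimic network latency."""
    time.sleep(delay)
    return f"<html><h1>Headline from {url}</h1></html>"


def crawl_single(urls):
    """Fetch each address one after another in the calling thread."""
    return [fetch(u) for u in urls]


def crawl_multi(urls, workers=4):
    """Fetch several addresses at once using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


def timed(crawl, urls):
    """Run a crawl function and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = crawl(urls)
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    _, single_t = timed(crawl_single, SOURCES)
    _, multi_t = timed(crawl_multi, SOURCES)
    print(f"single-thread: {single_t:.2f} s, multi-thread: {multi_t:.2f} s")
```

Because the simulated work is waiting on I/O, the threads overlap their waits and the multi-thread run finishes in roughly the time of one request rather than the sum of all of them. Real crawls would also be bounded by server limits and parsing cost, which this sketch does not model.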