Open Access
Web Page Extraction and Classification Using JSOUP and Naïve Bayes
Author(s) - Sugiarto Cokrowibowo, Nahya Nur, Irmayanti
Publication year - 2020
Publication title - IOP Conference Series: Materials Science and Engineering
Language(s) - English
Resource type - Journals
eISSN - 1757-899X
pISSN - 1757-8981
DOI - 10.1088/1757-899x/875/1/012089
Subject(s) - naive bayes classifier , computer science , extraction (chemistry) , world wide web , information retrieval , artificial intelligence , chromatography , support vector machine , chemistry
Manually classifying web pages is time-consuming because most available web pages are unstructured, so a fast and accurate classification method is needed. The Naïve Bayes algorithm takes a probabilistic approach that suits web page classification: it is a classical method built on a simple probability concept, yet it performs quite well in many modern cases involving large data. To extract information from web pages, we propose using JSOUP, a Java library that provides a convenient API for extracting and manipulating data and for performing initial data cleaning using the best of the DOM and CSS methods. With JSOUP, web pages can be analyzed without saving the web documents to computer storage, so storage requirements remain constant even as the amount of training data increases. In this study, we implement JSOUP as a tool for extracting information from web pages and the Naïve Bayes algorithm for classifying them.
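The classification step described in the abstract can be illustrated with a minimal sketch of a multinomial Naïve Bayes text classifier in plain Java. This is not the authors' implementation: the class name `NaiveBayesSketch`, the tokenizer, the add-one (Laplace) smoothing, and the toy training sentences are all assumptions made for illustration. In a pipeline like the one the abstract outlines, the input text would presumably come from JSOUP (for example via `Jsoup.connect(url).get().body().text()`); that call is left out here so the sketch has no external dependency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes over plain text (hypothetical sketch). */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{Nd}]+");
    }

    /** Adds one labelled document to the word-count model. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue; // split() may yield a leading empty token
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the label with the highest log-posterior, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: share of training documents carrying this label
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // log likelihood with add-one smoothing (an assumption of this sketch)
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Toy training data, invented for this example only.
        nb.train("sports", "football match goal team score");
        nb.train("sports", "team wins league match");
        nb.train("tech", "java library api code");
        nb.train("tech", "software code compiler api");
        System.out.println(nb.classify("the team scored a goal in the match"));
    }
}
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and smoothing keeps a single unseen word from driving a class probability to zero. Because the model stores only word counts, each page's text can be discarded after training, which is in line with the abstract's point about not saving web documents to storage.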
