Web Page Extraction and Classification Using JSOUP and Naïve Bayes


Open Access


Author(s) - Sugiarto Cokrowibowo, Nahya Nur, Irmayanti

Publication year - 2020

Publication title - IOP Conference Series: Materials Science and Engineering

Language(s) - English

Resource type - Journals

eISSN - 1757-899X

pISSN - 1757-8981

DOI - 10.1088/1757-899x/875/1/012089

Subject(s) - naive bayes classifier, computer science, extraction (chemistry), world wide web, information retrieval, artificial intelligence, chromatography, support vector machine, chemistry

Manually classifying web pages takes a long time because most available web pages are unstructured, so a fast and accurate classification method is needed. The Naïve Bayes algorithm takes a probabilistic approach that suits web page classification: it is a classical method built on a simple probability concept, yet it performs quite well in many modern cases with large data. For extracting information from web pages, this study proposes JSOUP, a Java library that provides a convenient API for extracting and manipulating data and for performing initial data cleaning using DOM and CSS methods. With JSOUP, web pages can be analyzed without saving the web documents to computer storage, so storage use stays constant even as the amount of training data grows. This study implements JSOUP as the tool for extracting information from web pages and the Naïve Bayes algorithm for classifying them.
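To illustrate the classification step, here is a minimal sketch of a multinomial Naïve Bayes text classifier in plain Java. It is not the authors' implementation. The abstract does not state which features or smoothing the study uses, so the bag-of-words features, the add-one (Laplace) smoothing, the class name `NaiveBayesSketch`, and the toy training sentences are all assumptions made for illustration. The input strings stand in for text that would already have been extracted from web pages.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: multinomial Naive Bayes over word counts
// with add-one (Laplace) smoothing. Not taken from the paper.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}]+");
    }

    // Record one labeled document (already-extracted page text).
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue; // split can yield an empty token
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Return the label with the highest log posterior, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: share of training documents carrying this label
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // smoothed log likelihood of the word given the label
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "football match score goal team");
        nb.train("sports", "team wins the league match");
        nb.train("tech", "new phone processor and software update");
        nb.train("tech", "software release fixes security bug");
        System.out.println(nb.classify("the team scored a late goal")); // prints "sports"
    }
}
```

In a pipeline like the one the abstract describes, the text passed to `train` and `classify` would come from the extraction step, for example from jsoup's `Jsoup.connect(url).get().text()`, which fetches and parses a page in memory without writing it to disk. Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long pages.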
