Development of Web Crawler to Build Indonesian Text Corpus

Open Access

Author(s) - Janson Hendryli, Viny Christanti Mawardi

Publication year - 2020

Publication title - IOP Conference Series: Materials Science and Engineering

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.179

H-Index - 26

eISSN - 1757-899X

pISSN - 1757-8981

DOI - 10.1088/1757-899x/1007/1/012043

Subject(s) - web crawler, Indonesian, computer science, World Wide Web, Python (programming language), waterfall, information retrieval, focused crawler, natural language processing, artificial intelligence, web page, web development, geography, linguistics, static web page, programming language, cartography, philosophy

Recent improvements in natural language understanding research can be attributed to the availability of large-scale datasets, most of which are in English. In this work, we develop a web crawler that extracts Indonesian news content from the DetikNews website in order to build a large text dataset. The crawler is developed following the waterfall model, using Python, Scrapy, and BeautifulSoup4. It collects more than 790k news articles from DetikNews, published between 2011 and 2020, which together contain more than 190 million words, almost 2 million unique words, and more than 14 million sentences.
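The corpus figures quoted in the abstract (total words, unique words, sentences) depend on how text is tokenized, which the abstract does not specify. The following is a minimal sketch of how such counts could be computed over crawled article text. The function name `corpus_stats`, the regular expressions, and the sample sentences are all illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter

# Assumed tokenization rules (not taken from the paper):
# a word is a run of letters, optionally joined by hyphens or apostrophes,
# which keeps Indonesian reduplications such as "anak-anak" as one token.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")
# A sentence is assumed to end at ., ! or ? followed by whitespace or end of text.
# Abbreviations such as "Dr." would be miscounted by this naive rule.
SENTENCE_END_RE = re.compile(r"[.!?]+(?:\s+|$)")


def corpus_stats(articles):
    """Count total words, unique words, and sentences in an iterable of strings."""
    vocab = Counter()
    n_sentences = 0
    for text in articles:
        # Lowercase so that "Warga" and "warga" count as the same word type.
        vocab.update(word.lower() for word in WORD_RE.findall(text))
        # Drop empty fragments left over after the final sentence terminator.
        n_sentences += sum(1 for part in SENTENCE_END_RE.split(text) if part.strip())
    return {
        "words": sum(vocab.values()),
        "unique_words": len(vocab),
        "sentences": n_sentences,
    }


if __name__ == "__main__":
    # Hypothetical sample texts standing in for crawled article bodies.
    sample = [
        "Presiden tiba di Jakarta. Ia disambut warga!",
        "Harga beras naik. Warga mengeluh.",
    ]
    print(corpus_stats(sample))
```

Because the function accepts any iterable, article bodies can be streamed from disk one at a time, which matters at the scale of hundreds of thousands of articles. Different tokenization choices (for example, keeping numbers as words or not lowercasing) would produce different totals.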
