Open Access
Development of Web Crawler to Build Indonesian Text Corpus
Author(s) - Janson Hendryli, Viny Christanti Mawardi
Publication year - 2020
Publication title - IOP Conference Series: Materials Science and Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.179
H-Index - 26
eISSN - 1757-899X
pISSN - 1757-8981
DOI - 10.1088/1757-899x/1007/1/012043
Subject(s) - web crawler, Indonesian, computer science, world wide web, Python (programming language), waterfall, information retrieval, focused crawler, natural language processing, artificial intelligence, web page, web development, geography, linguistics, static web page, programming language, cartography, philosophy
Recent improvements in natural language understanding research can be attributed to the availability of large-scale datasets, most of which are in English. In this work, we develop a web crawler that extracts Indonesian news content from the DetikNews website in order to build a large text dataset. The crawler is developed following the waterfall model, using Python, Scrapy, and BeautifulSoup4. It collects more than 790k news articles from DetikNews, spanning 2011 to 2020, which together contain more than 190 million words, almost 2 million unique words, and more than 14 million sentences.
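
The corpus statistics reported above (total words, unique words, and sentences) can be illustrated with a minimal sketch. The abstract does not say how the authors tokenized the text, so the rules below are assumptions made only for illustration: a word is a run of letters or digits, optionally joined by hyphens (as in Indonesian reduplication such as "anak-anak"), words are lowercased before counting unique forms, and a sentence ends at ".", "!" or "?". The function name `corpus_stats` is hypothetical.

```python
import re
from collections import Counter

# Assumed tokenization rules (not taken from the paper):
# a word is a run of letters/digits, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W_]+(?:-[^\W_]+)*")
# A sentence is assumed to end at one or more of . ! ? followed by
# whitespace or the end of the text.
SENTENCE_END_RE = re.compile(r"[.!?]+(?:\s+|$)")


def corpus_stats(articles):
    """Return total words, unique words, and sentence count for an
    iterable of article texts."""
    vocab = Counter()
    n_sentences = 0
    for text in articles:
        # Lowercase so "Taman" and "taman" count as one unique word.
        vocab.update(word.lower() for word in WORD_RE.findall(text))
        # Count the non-empty pieces left after splitting on sentence ends.
        n_sentences += sum(
            1 for piece in SENTENCE_END_RE.split(text) if piece.strip()
        )
    return {
        "words": sum(vocab.values()),
        "unique_words": len(vocab),
        "sentences": n_sentences,
    }


if __name__ == "__main__":
    sample = [
        "Anak-anak bermain di taman. Mereka senang!",
        "Taman itu luas.",
    ]
    print(corpus_stats(sample))
```

A simple punctuation-based splitter like this miscounts abbreviations and decimal numbers, so figures produced this way would be rough estimates rather than a reproduction of the reported numbers.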
