Development of Focused Crawlers for Building Large Punjabi News Corpus

Open Access

Author(s) - Gurjot Singh Mahi, Amandeep Verma

Publication year - 2021

Publication title - Journal of ICT Research and Applications

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.145

H-Index - 11

eISSN - 2338-5499

pISSN - 2337-5787

DOI - 10.5614/itbj.ict.res.appl.2021.15.3.1

Subject(s) - web crawler, computer science, World Wide Web, focused crawler, Python (programming language), the Internet, information retrieval, set (abstract data type), search engine, web page, web development, static web page, programming language, operating system

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
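The abstract does not describe how the crawlers work internally, so the following is only a minimal sketch of the general focused-crawling idea, not the authors' implementation. It assumes that article URLs can be recognized by a section prefix in the path and that article text sits in `<p>` elements. The names `PageParser`, `crawl`, `fetch`, `allowed_host` and `section_prefixes`, as well as the `news.example` pages, are hypothetical and chosen for illustration. The fetch function is passed in so that the sketch runs offline against a small in-memory site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects hyperlinks and paragraph text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def crawl(seed, fetch, allowed_host, section_prefixes, max_pages=100):
    """Breadth-first crawl that only follows links inside the focus area.

    fetch: hypothetical callable mapping a URL to an HTML string (or None).
    A link is in focus when it stays on allowed_host and its path starts
    with one of section_prefixes (an assumed URL convention).
    """
    prefixes = tuple(section_prefixes)
    queue = deque([seed])
    seen = {seed}
    corpus = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Store text only for in-focus pages that contain paragraphs.
        if parser.paragraphs and urlparse(url).path.startswith(prefixes):
            corpus[url] = "\n".join(parser.paragraphs)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve, drop fragment
            parts = urlparse(link)
            in_focus = (parts.netloc == allowed_host
                        and parts.path.startswith(prefixes))
            if in_focus and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


# Hypothetical in-memory site standing in for a real news website.
pages = {
    "https://news.example/": (
        '<a href="/sports/a1">one</a>'
        '<a href="/ads/buy">ad</a>'
        '<a href="https://other.example/sports/z">external</a>'
    ),
    "https://news.example/sports/a1":
        '<p>First story</p><a href="/sports/a2#comments">next</a>',
    "https://news.example/sports/a2":
        '<p>Second <a href="/">home</a> story</p>',
}

corpus = crawl("https://news.example/", pages.get,
               "news.example", ["/sports/"])
```

In this sketch the advertisement link and the external link are never queued, which is what makes the crawler focused rather than general. A real crawler would also need to fetch pages over the network, respect each site's robots.txt, throttle its requests, and use site-specific rules to separate article text from menus and other page furniture.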
