Open Access
A dataset to evaluate Hindi Word Embeddings
Author(s) -
Vimal Kumar Soni,
Dinesh Gopalani,
Mahesh Chandra Govil
Publication year - 2021
Publication title -
IOP Conference Series: Materials Science and Engineering
Language(s) - English
Resource type - Journals
eISSN - 1757-899X
pISSN - 1757-8981
DOI - 10.1088/1757-899x/1131/1/012015
Subject(s) - computer science , hindi , natural language processing , artificial intelligence , word (group theory) , task (project management) , annotation , similarity (geometry) , stop words , construct (python library) , linguistics , philosophy , management , economics , image (mathematics) , programming language , preprocessor
Current approaches to Natural Language Processing (NLP) challenges typically crawl data from online sources and apply shallow or deep learning methods to build task-specific models on that data. Word vectors generated by such methods are applied to several NLP tasks and are commonly evaluated on the word similarity task. For English, not only is a huge amount of data available, but multiple datasets also exist for evaluating the performance of the developed models. The situation is different for Indian languages, and for Hindi in particular. To address this gap, we propose a dataset for evaluating word similarity in Hindi. The construction process and the subsequent annotation process are described in detail. To construct the dataset, 353 word pairs from the most popular English dataset were first selected and translated, and the translations were verified by Hindi experts. The word pairs were then annotated independently by 11 native Hindi speakers, who were selected according to multiple criteria. The final dataset has been evaluated with CBOW and Skip-gram models.
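The word similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common convention of averaging the annotators' scores for each pair, scoring the same pair with the cosine similarity of the two word vectors, and reporting the Spearman rank correlation between the two lists. The abstract does not state which correlation measure or rating scale the authors used, so both are assumptions here, and the word pairs, ratings, and vectors below are invented for illustration only; they are not taken from the dataset.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, vectors):
    """pairs: (word1, word2, annotator_scores); vectors: word -> list of floats.

    Pairs with a word missing from `vectors` are skipped.
    Returns (spearman_rho, number_of_pairs_scored).
    """
    human, model = [], []
    for w1, w2, scores in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(mean(scores))
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), len(human)


# Hypothetical toy data: invented ratings (three annotators) and 3-d vectors.
toy_pairs = [
    ("बाघ", "बिल्ली", [7.0, 7.5, 8.0]),   # tiger, cat
    ("किताब", "कागज़", [7.0, 6.5, 7.5]),  # book, paper
    ("राजा", "गोभी", [0.5, 0.0, 1.0]),    # king, cabbage
    ("चाँद", "तारा", [8.0, 8.5, 7.5]),    # moon, star (no vectors: skipped)
]
toy_vectors = {
    "बाघ": [0.9, 0.1, 0.0],
    "बिल्ली": [0.8, 0.2, 0.1],
    "किताब": [0.1, 0.9, 0.2],
    "कागज़": [0.2, 0.8, 0.3],
    "राजा": [0.1, 0.2, 0.9],
    "गोभी": [0.3, 0.9, 0.0],
}

rho, covered = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f} over {covered} of {len(toy_pairs)} pairs")
```

In practice the `vectors` dictionary would be filled from a trained CBOW or Skip-gram model, and the number of skipped pairs is worth reporting alongside the correlation, since out-of-vocabulary words reduce the effective size of the test set.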