A word‐level language identification strategy for resource‐scarce languages
Author(s) - Asubiaro Toluwase, Adegbola Tunde, Mercer Robert, Ajiferuke Isola
Publication year - 2018
Publication title - Proceedings of the Association for Information Science and Technology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.193
H-Index - 14
ISSN - 2373-9231
DOI - 10.1002/pra2.2018.14505501004
Subject(s) - computer science, yoruba, trigram, natural language processing, artificial intelligence, igbo, language identification, hausa, word (group theory), identification (biology), character (mathematics), natural language, linguistics, mathematics, philosophy, botany, geometry, biology
This study is based on the premise that computers can be trained to predict the language of a word (textual or audio) from its character n‐gram pattern, without recourse to the language's dictionary. With the growth of multilingual collections and the need for automatic means of cleaning textual datasets, this paper presents a strategy for identifying the language of individual words in a body of text. The strategy suits resource‐scarce languages, which lack the large electronic datasets required for machine learning and natural language processing studies and whose dictionaries may not be available. We focused on three African languages: Hausa, Igbo, and Yoruba. A training corpus in each language was used to obtain the probabilities of character trigrams in that language. Because English is often mixed with these resource‐scarce languages in texts, we also obtained trigram probabilities from an English training corpus. These probabilities were then used to identify the language of each word in test corpora containing bilingual texts. Our strategy achieved average precision, recall, and F1 values of about 97%, 91%, and 94%, respectively.
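
The general idea of scoring a word against per-language character trigram probabilities can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how words are padded, how unseen trigrams are handled, or how scores are combined. The boundary markers (`^`, `$`), the add-one smoothing, the summed log-probability score, the function names, and the tiny word lists (unaccented placeholders, not the paper's corpora) are all assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the character trigrams of a word.

    Assumption: the word is lowercased and padded with '^' and '$'
    so that word-initial and word-final patterns are captured.
    """
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(corpus_words):
    """Count trigram occurrences over a list of words in one language."""
    counts = Counter()
    for word in corpus_words:
        counts.update(trigrams(word))
    return counts


def log_score(word, counts, vocab_size):
    """Sum the log probabilities of a word's trigrams under one model.

    Assumption: add-one (Laplace) smoothing, so that trigrams unseen
    in training still receive a small non-zero probability.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size))
        for t in trigrams(word)
    )


def identify(word, models):
    """Return the language whose trigram model scores the word highest."""
    # Shared trigram vocabulary across all models, plus one slot for unseen.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models, key=lambda lang: log_score(word, models[lang], vocab_size))


# Toy training data for illustration only (hypothetical, far too small
# for real use; Yoruba tone marks and diacritics are omitted).
models = {
    "english": train("the language of each word in the text is identified".split()),
    "yoruba": train("bawo ni omo ile oruko owo ojo ise".split()),
}

for token in ["the", "omo"]:
    print(token, identify(token, models))
```

With real corpora, one such model per language (for example Hausa, Igbo, Yoruba, and English) would be trained, and each word of a mixed-language text would be labelled independently, which is what makes the approach word-level rather than document-level.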