A word‐level language identification strategy for resource‐scarce languages

Asubiaro Toluwase; Adegbola Tunde; Mercer Robert; Ajiferuke Isola

Author(s) - Asubiaro Toluwase, Adegbola Tunde, Mercer Robert, Ajiferuke Isola

Publication year - 2018

Publication title - Proceedings of the Association for Information Science and Technology

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.193

H-Index - 14

ISSN - 2373-9231

DOI - 10.1002/pra2.2018.14505501004

Subject(s) - computer science, yoruba, trigram, natural language processing, artificial intelligence, igbo, language identification, hausa, word (group theory), identification (biology), character (mathematics), natural language, linguistics, mathematics, philosophy, botany, geometry, biology

This study is based on the premise that computers can be trained to predict the language of a word (textual or audio) by learning from its character n‐gram pattern, without recourse to the language's dictionary. With the growth of multilingual collections and the need for automatic means of cleaning textual datasets, this paper presents a strategy for identifying the language of individual words in a body of text. The strategy is suitable for resource‐scarce languages, which lack the large electronic datasets required for machine learning and natural language processing studies and whose dictionaries may not be available. In this study, we focused on three African languages: Hausa, Igbo, and Yoruba. A training corpus in each of these languages was used to obtain the probabilities of character trigrams in the language. Because English is often mixed with these resource‐scarce languages in texts, we also obtained the probabilities of trigrams in an English training corpus. These probabilities were then used to identify the language of each word in test corpora containing bilingual texts. Our strategy achieved average precision, recall, and F1 values of about 97%, 91%, and 94%, respectively.
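The abstract describes estimating character-trigram probabilities per language and using them to label individual words. The following is a minimal sketch of that general idea, not the authors' implementation: the add-one smoothing, the `^` and `$` word-boundary markers, the function names, and the tiny English and Hausa word lists are all assumptions made for illustration. A real system would be trained on a full corpus per language.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the character trigrams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"  # '^' and '$' are assumed boundary symbols
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(corpus_words):
    """Count trigram occurrences over a list of words from one language."""
    counts = Counter()
    for word in corpus_words:
        counts.update(trigrams(word))
    return counts


def log_prob(word, counts, vocab_size):
    """Log-probability of a word's trigrams under one language model.

    Uses add-one (Laplace) smoothing so unseen trigrams get a small,
    non-zero probability. This smoothing choice is an assumption.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size))
        for t in trigrams(word)
    )


def identify(word, models):
    """Return the name of the language whose model scores the word highest."""
    # Shared trigram vocabulary across all models, plus one slot for unseen.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models, key=lambda lang: log_prob(word, models[lang], vocab_size))


# Toy training data (illustrative only; far too small for real use).
MODELS = {
    "english": train(["the", "water", "house", "market",
                      "child", "people", "thing", "there"]),
    "hausa": train(["gida", "ruwa", "yaro", "kasuwa",
                    "mutum", "sannu", "abinci", "gari"]),
}

if __name__ == "__main__":
    for w in ["then", "garin", "there", "ruwa"]:
        print(w, identify(w, MODELS))
```

Scoring each word independently, as here, mirrors the word-level framing of the abstract: every token in a bilingual text is labelled separately, with no sentence context and no dictionary lookup. With training lists this small, words sharing no trigrams with either list are effectively decided by smoothing alone, so the labels are only meaningful for words that overlap the training data.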
