
Investigating Data Sharing in Speech Recognition for an Under-Resourced Language: The Case of Algerian Dialect
Author(s) - Mohamed Amine Menacer, Kamel Smaïli
Publication year - 2021
Publication title - Computer Science and Information Technology (CS & IT)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5121/csit.2021.110308
Subject(s) - computer science, modern standard arabic, baseline (sea), word error rate, natural language processing, artificial intelligence, arabic, speech recognition, word (group theory), language model, spoken language, linguistics, philosophy, oceanography, geology
The Arabic language has many varieties, including its standard form, Modern Standard Arabic (MSA), and its spoken forms, the dialects. These dialects are typical examples of under-resourced languages, for which automatic speech recognition remains an unresolved problem. To address this problem, we recorded several hours of spoken Algerian dialect and used them to train a baseline model. We then improved this model by exploiting data from other languages that influence the dialect, integrating their data into one large corpus and investigating three approaches: multilingual training, multitask learning and transfer learning. The best performance was achieved when the amount of acoustic data taken from each additional language was limited and balanced relative to the amount of data available for the studied dialect. This approach improved the word error rate by 3.8% compared with the baseline system trained only on the dialect data.
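
The reported gain is measured in word error rate (WER), the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard metric, not the authors' scoring code: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The abstract does not state whether the 3.8% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; real scoring tools
    usually normalize the text first (case, punctuation, spelling variants).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.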