
Hybrid CTC-Attention Network-Based End-to-End Speech Recognition System for Korean Language
Author(s) -
Hosung Park,
Changmin Kim,
Hyunsoo Son,
Soonshin Seo,
JiHwan Kim
Publication year - 2022
Publication title -
Journal of Web Engineering / Journal of Web Engineering On Line
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.151
H-Index - 13
eISSN - 1544-5976
pISSN - 1540-9589
DOI - 10.13052/jwe1540-9589.2126
Subject(s) - computer science, hidden Markov model, speech recognition, end-to-end principle, language model, Korean language, artificial neural network, artificial intelligence, character (mathematics), time delay neural network, natural language processing, linguistics, philosophy, geometry, mathematics
This study proposes an end-to-end automatic speech recognition system for Korean based on a hybrid CTC-attention network. Speech recognition systems based on deep neural network/hidden Markov models (DNN/HMM) have driven dramatic improvements in this area. However, it is difficult for non-experts to develop such systems for new applications. End-to-end approaches simplify the speech recognition system into a single-network architecture, so a system can be built without expert knowledge. We therefore propose a hybrid CTC-attention network as an end-to-end speech recognition model for Korean. The model uses a connectionist temporal classification (CTC) objective function during attention model training, which improves both recognition accuracy and training speed. For most languages, end-to-end speech recognition uses characters as output labels. For Korean, however, character-based end-to-end recognition is inefficient because the language has 11,172 possible characters, a relatively large number compared with other languages. For example, English has 26 characters and Japanese has 50 characters. To address this problem, we use 49 Korean graphemes as output labels. Experimental results show a character error rate (CER) of 10.02% when 740 hours of Korean training data are used.
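The character count in the abstract follows from how Unicode composes Hangul syllables: 19 initial consonants × 21 vowels × 28 final positions (27 final consonants plus "no final") gives 11,172 precomposed characters. The sketch below shows one way to split such a character into its component jamo using that arithmetic. It is an illustration only: the function name `decompose` and the use of Unicode conjoining jamo as output symbols are assumptions, and the abstract does not say how its 49-grapheme label set is defined, so this is not presented as the authors' label inventory.

```python
# Illustrative sketch: split precomposed Hangul syllables into jamo.
# The constants follow the Unicode Hangul syllable composition scheme.
# The output uses positional (conjoining) jamo, which is an assumption
# for illustration and not the 49-grapheme set mentioned in the abstract.

HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable
NUM_INITIALS = 19      # initial consonants
NUM_MEDIALS = 21       # vowels
NUM_FINALS = 28        # 27 final consonants plus "no final"
NUM_SYLLABLES = NUM_INITIALS * NUM_MEDIALS * NUM_FINALS  # 11,172

INITIAL_BASE = 0x1100  # first conjoining initial consonant
MEDIAL_BASE = 0x1161   # first conjoining vowel
FINAL_BASE = 0x11A7    # one below the first conjoining final consonant


def decompose(text: str) -> list[str]:
    """Return the jamo sequence for text, leaving other characters as-is."""
    out = []
    for ch in text:
        index = ord(ch) - HANGUL_BASE
        if 0 <= index < NUM_SYLLABLES:
            initial, rest = divmod(index, NUM_MEDIALS * NUM_FINALS)
            medial, final = divmod(rest, NUM_FINALS)
            out.append(chr(INITIAL_BASE + initial))
            out.append(chr(MEDIAL_BASE + medial))
            if final:  # final index 0 means the syllable has no final consonant
                out.append(chr(FINAL_BASE + final))
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out


if __name__ == "__main__":
    print(NUM_SYLLABLES)
    print(decompose("한국어"))
```

With labels at the jamo level, the output layer needs only a few dozen symbols rather than one class per syllable block, which is the motivation the abstract gives for grapheme labels.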