Open Access

Improved Transcription and Speaker Identification System for Concurrent Speech in Bahasa Indonesia Using Recurrent Neural Network

Author(s) - Muhammad Bagus Andra, Tsuyoshi Usagawa

Publication year - 2021

Publication title - IEEE Access

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.587

H-Index - 127

ISSN - 2169-3536

DOI - 10.1109/access.2021.3077441

Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation

Bahasa Indonesia is one of the most prominent low-resource languages, yet it still lacks development in communication-assisting technology. This paper proposes an improved system for generating transcripts and identifying speakers from concurrent speech in Bahasa Indonesia. The proposed method is applicable to situations such as online meetings and remote conferences. The system combines a Reinforcement Learning (RL) model with pitch-aware speech separation to identify the speakers in concurrent speech. A Recurrent Neural Network (RNN) generates the text transcript, which is then improved by an external language model and a spelling correction model. The proposed system was able to identify up to five speakers with a variable degree of confidence and to generate a transcript for each of them, with better quality than other methods when evaluated with several metrics. The results show that the proposed method performs better than the baseline method, even in the single-speaker situation, and functions in the simultaneous-speech situation, with an average Word Error Rate (WER) of 16.59% for two speakers, 26.72% for three speakers, and 31.50% for four speakers.
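
For readers unfamiliar with the Word Error Rate figures quoted in the abstract, the metric is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a system transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not the authors' evaluation code, the function name `word_error_rate` is hypothetical, and the Bahasa Indonesia sentence in the usage note is an invented example, not data from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), assuming whitespace tokenization
    and no text normalization such as lowercasing or punctuation removal.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref_words)
```

As a usage example, `word_error_rate("saya pergi ke pasar", "saya pergi pasar")` counts one deleted word against a four-word reference, giving 0.25, that is, a WER of 25%. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.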
