ASR for Under-Resourced Languages From Probabilistic Transcription

Open Access

Author(s) - Mark A. Hasegawa-Johnson, Preethi Jyothi, Daniel McCloy, Majid Mirbagheri, Giovanni M. di Liberto, Amit Das, Bradley Ekin, Chunxi Liu, Vimal Manohar, Hao Tang, Edmund C. Lalor, Nancy F. Chen, Paul Hager, Tyler Kekona, Rose Sloan, Adrian K. C. Lee

Publication year - 2016

Publication title - IEEE/ACM Transactions on Audio, Speech, and Language Processing

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.916

H-Index - 56

eISSN - 2329-9304

pISSN - 2329-9290

DOI - 10.1109/taslp.2016.2621659

Subject(s) - signal processing and analysis; computing and processing; communication, networking and broadcast technologies; general topics for engineers

In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: A probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semisupervised learning technique, in which a cross-lingual ASR first labels unlabeled speech, and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which nonspeakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which nonspeakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR was trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing.
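To make the idea of a probabilistic transcript concrete, the following is a minimal toy sketch of noisy-channel decoding in the spirit of the mismatched crowdsourcing step described in the abstract. It is not the authors' implementation. Everything in it is an assumption made for illustration: the `CHANNEL` table, the `PRIOR` over candidate phone sequences, the one-letter-per-phone alignment, and the independence of transcribers are all hypothetical simplifications. Given several nonsense transcripts of the same utterance, it applies Bayes' rule to return a probability mass function over candidate phone sequences.

```python
from math import prod

# Hypothetical toy channel model: P(letter written | phone heard).
# These numbers are invented for illustration only.
CHANNEL = {
    "p": {"p": 0.7, "b": 0.3},
    "b": {"b": 0.6, "p": 0.4},
    "a": {"a": 0.9, "e": 0.1},
}

# Hypothetical prior over candidate phone sequences (for example, from a
# text-derived phone language model). Also invented for illustration.
PRIOR = {("p", "a"): 0.5, ("b", "a"): 0.5}


def channel_likelihood(letters, phones):
    """P(letters | phones), assuming exactly one letter per phone."""
    if len(letters) != len(phones):
        return 0.0
    return prod(CHANNEL[ph].get(ch, 0.0) for ch, ph in zip(letters, phones))


def probabilistic_transcript(transcripts, prior=PRIOR):
    """Return a PMF over candidate phone sequences.

    Assumes the transcribers are conditionally independent given the
    true phone sequence, so their likelihoods multiply.
    """
    scores = {}
    for phones, p in prior.items():
        likelihood = prod(channel_likelihood(t, phones) for t in transcripts)
        scores[phones] = p * likelihood
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no candidate explains the transcripts")
    # Normalize so the scores sum to one (Bayes' rule).
    return {phones: s / total for phones, s in scores.items()}


# Three hypothetical nonspeaker transcripts of the same utterance.
pt = probabilistic_transcript(["pa", "ba", "pa"])
for phones, prob in sorted(pt.items(), key=lambda kv: -kv[1]):
    print("".join(phones), round(prob, 3))
```

The result is a distribution rather than a single label, which is the point of the approach: downstream ASR training can weight each candidate phone sequence by its probability instead of committing to one possibly wrong transcript.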
