Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation
Author(s) -
Richard Lee Lai,
Jen-Cheng Hou,
I-Chun Chern,
Kuo-Hsuan Hung,
Yi-Ting Chen,
Mandar Gogate,
Tughrul Arslan,
Amir Hussain,
Chii-Wann Lin,
Yu Tsao
Publication year - 2025
Publication title -
IEEE Transactions on Biomedical Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.148
H-Index - 200
eISSN - 1558-2531
pISSN - 0018-9294
DOI - 10.1109/tbme.2025.3610284
Subject(s) - Bioengineering; Computing and Processing; Components, Circuits, Devices and Systems; Communication, Networking and Broadcast Technologies
Objective: Individuals with hearing impairments face challenges in comprehending speech, particularly in noisy environments. This study explores the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations. Methods: We propose a speech enhancement framework called Self-Supervised Learning-based AVSE (SSL-AVSE), which uses visual cues such as lip and mouth movements along with the corresponding speech. Features are extracted using the AV-HuBERT model and refined through a bidirectional LSTM. Experiments were conducted using the Taiwan Mandarin speech with video (TMSV) dataset. Results: Objective evaluations showed improvements in Perceptual Evaluation of Speech Quality (PESQ) from 1.43 to 1.67 and in Short-Time Objective Intelligibility (STOI) from 0.70 to 0.74. Normalized covariance measure (NCM) scores increased by up to 87.2% over the noisy baseline. Subjective listening tests further demonstrated maximum gains of 45.2% in speech quality and 51.9% in word intelligibility. Conclusion: SSL-AVSE consistently outperforms audio-only speech enhancement (AOSE) and conventional AVSE baselines, and listening tests confirm its effectiveness with statistically significant gains. In addition to its strong performance, SSL-AVSE demonstrates cross-lingual generalization: although it was pretrained on English data, it performs effectively on Mandarin speech. This finding highlights the robustness of the features extracted by a pretrained foundation model and their applicability across languages. Significance: To the best of our knowledge, no prior work has explored the application of AVSE to CI simulations. This study provides the first evidence that incorporating visual information can significantly improve the intelligibility of vocoded speech in CI scenarios.
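The abstract describes a pipeline in which AV-HuBERT audio-visual features are refined by a bidirectional LSTM to enhance noisy speech. The following is a minimal sketch of such an enhancement head; the layer sizes, the 768-dimensional feature dimension, and the mask-based objective are illustrative assumptions rather than the authors' exact configuration, and the AV-HuBERT encoder is treated as an external feature extractor whose outputs are passed in.

```python
# Minimal sketch of a BLSTM enhancement head operating on pre-extracted
# AV-HuBERT features, predicting a spectral mask for the noisy magnitude
# spectrogram. Dimensions and the masking objective are assumptions.
import torch
import torch.nn as nn


class BLSTMEnhancementHead(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=256, n_freq_bins=257):
        super().__init__()
        # Bidirectional LSTM refines the frame-level audio-visual features.
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Project BLSTM outputs to a per-frame, per-bin mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, n_freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, av_features, noisy_magnitude):
        # av_features:     (batch, frames, feat_dim) from the AV-HuBERT encoder
        # noisy_magnitude: (batch, frames, n_freq_bins) noisy STFT magnitude
        refined, _ = self.blstm(av_features)
        mask = self.mask_head(refined)
        return mask * noisy_magnitude  # enhanced magnitude estimate


if __name__ == "__main__":
    # Random tensors stand in for real AV-HuBERT features and spectrograms.
    head = BLSTMEnhancementHead()
    feats = torch.randn(1, 100, 768)
    noisy = torch.rand(1, 100, 257)
    enhanced = head(feats, noisy)
    print(enhanced.shape)  # torch.Size([1, 100, 257])
```

In the study's setting, the enhanced spectrogram would presumably be converted back to a waveform and then processed by the CI vocoder simulation before intelligibility is assessed with PESQ, STOI, NCM, and listening tests.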