Convolution-Augmented Transformers for Enhanced Speaker-Independent Dysarthric Speech Recognition
Author(s) -
Zihan Zhong,
Qianli Wang,
Satwinder Singh,
Clarion Mendes,
Mark Hasegawa-Johnson,
Waleed Abdulla,
Seyed Reza Shahamiri
Publication year - 2025
Publication title -
IEEE Transactions on Neural Systems and Rehabilitation Engineering
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 1.093
H-Index - 140
eISSN - 1558-0210
pISSN - 1534-4320
DOI - 10.1109/tnsre.2025.3610792
Subject(s) - bioengineering , computing and processing , robotics and control systems , signal processing and analysis , communication, networking and broadcast technologies
Dysarthria is a motor speech disorder characterized by muscle movement difficulties that complicate verbal communication. It poses significant challenges to Automatic Speech Recognition (ASR) systems because of data scarcity and high speaker variability among dysarthric individuals. This study investigates speaker-independent (SI) approaches to assist speakers with communication impairments. First, we developed dysarthric SI models using a Conformer-based system and a three-stage transfer-learning pipeline that employs a selective layer-freezing parameter-efficient fine-tuning (PEFT) strategy to mitigate data scarcity. We pre-trained on standard speech and then progressively adapted the models to two dysarthric datasets, TORGO and UA-Speech. Second, we introduced a benchmark framework for evaluating the generalizability of SI models with cross-dataset validation, a previously unexplored approach in dysarthric ASR that provides a more realistic evaluation scenario. The results demonstrate that the proposed dysarthric SI models outperform all baseline systems. Specifically, on the TORGO dataset, our models improved word recognition accuracy by 21.9% for isolated speech and reduced the word error rate by 18.5% for continuous speech. On UA-Speech, our optimal dysarthric SI model achieved a word recognition improvement of 14.6% over Whisper and 28.3% over the base model for isolated speech. Nevertheless, our cross-dataset testing showed that models tended to produce isolated words when asked to transcribe continuous speech for severe dysarthria, highlighting the need to further improve SI generalization.
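The abstract reports two metrics: word error rate for continuous speech and word recognition accuracy for isolated speech. The sketch below shows how these metrics are conventionally computed; it is not the authors' evaluation code. The function names, whitespace tokenization, and case-insensitive matching are assumptions made for illustration, and the abstract does not specify the paper's text normalization.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_recognition_accuracy(
    references: Sequence[str], hypotheses: Sequence[str]
) -> float:
    """Fraction of isolated-word utterances transcribed exactly.

    Assumption: matching ignores case and surrounding whitespace.
    """
    if not references or len(references) != len(hypotheses):
        raise ValueError("need equal-length, non-empty lists")
    correct = sum(
        r.strip().lower() == h.strip().lower()
        for r, h in zip(references, hypotheses)
    )
    return correct / len(references)
```

Under this definition, a model that emits only a few isolated words for a continuous utterance is penalized mainly through deletions, which is consistent with the cross-dataset failure mode described above.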