
Pretraining and Fine-tuning Techniques for Electrolaryngeal Speech Enhancement Based on Sequence-to-sequence Voice Conversion
Author(s) -
Ding Ma,
Lester Phillip Violeta,
Kazuhiro Kobayashi,
Tomoki Toda
Publication year - 2025
Publication title -
IEEE Transactions on Audio, Speech and Language Processing
Language(s) - English
Resource type - Magazines
eISSN - 2998-4173
DOI - 10.1109/taslpro.2025.3577374
Subject(s) - signal processing and analysis , computing and processing , fields, waves and electromagnetics
We describe novel training methods based on sequence-to-sequence (seq2seq) voice conversion (VC) to address two practical issues in electrolaryngeal (EL)-speech-to-normal-speech conversion (EL2SP): 1) low-resource training data, and 2) the large domain shift encountered during transfer learning. Seq2seq VC is promising for EL2SP but suffers performance degradation without sufficient high-quality parallel training data. The common method addresses the low-resource issue with transfer learning, following a direct pretraining–fine-tuning paradigm. However, in EL2SP, large domain shifts exist between the larger-scale upstream corpora of normal speech and the target EL2SP dataset, so the common method cannot achieve effective transfer learning, which limits EL2SP performance. Therefore, we present training methods with multistage pretraining and fine-tuning techniques, in particular an encoder adaptation training and a two-stage fine-tuning method, both leveraging low-quality synthetic data (SD), to improve transferability. During pretraining, in addition to transferring knowledge from a more easily accessible text-to-speech (TTS) database into seq2seq VC pretraining, encoder adaptation training further minimizes the representation learning gap of the encoder in comprehending EL speech, facilitating smoother transfer to downstream EL2SP. Subsequently, the two-stage EL2SP fine-tuning yields generalized and stable performance. Moreover, by effectively utilizing low-quality SD, our techniques relax training data demands and enhance practicality. Experimental results demonstrate that our methods dramatically outperform a baseline using the common method in terms of conversion quality and intelligibility. Comparative analyses confirm progressive performance gains with deeper system designs.