Pretraining and Fine-tuning Techniques for Electrolaryngeal Speech Enhancement Based on Sequence-to-sequence Voice Conversion

Ding Ma; Lester Phillip Violeta; Kazuhiro Kobayashi; Tomoki Toda

Open Access

Author(s) - Ding Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, Tomoki Toda

Publication year - 2025

Publication title - IEEE Transactions on Audio, Speech and Language Processing

Language(s) - English

Resource type - Magazines

eISSN - 2998-4173

DOI - 10.1109/taslpro.2025.3577374

Subject(s) - signal processing and analysis; computing and processing; fields, waves and electromagnetics

We describe novel training methods based on sequence-to-sequence (seq2seq) voice conversion (VC) to address two practical issues in electrolaryngeal (EL)-speech-to-normal-speech conversion (EL2SP): 1) low-resource training data, and 2) the large domain shift encountered during transfer learning. Seq2seq VC is promising for EL2SP but suffers performance degradation without sufficiently high-quality parallel training data. The common method addresses the low-resource issue through transfer learning, following a direct pretraining–fine-tuning paradigm. However, because large domain shifts exist between the larger-scale upstream normal-speech corpora and the target EL2SP dataset, the common method cannot achieve effective transfer learning in EL2SP, which limits performance. We therefore present training methods with multistage pretraining and fine-tuning techniques, in particular an encoder adaptation training and a two-stage fine-tuning method, both leveraging low-quality synthetic data (SD), to improve transferability. During pretraining, in addition to transferring knowledge from a more easily accessible text-to-speech (TTS) database into seq2seq VC pretraining, encoder adaptation training further narrows the representation learning gap of the encoder in comprehending EL speech, facilitating smoother transfer to downstream EL2SP. Subsequently, the two-stage EL2SP fine-tuning yields generalized and stable performance. Moreover, by effectively utilizing low-quality SD, our techniques relax the demands on training data and enhance practicality. Experimental results demonstrate that our methods dramatically outperform a baseline using the common method in terms of conversion quality and intelligibility. Comparative analyses confirm progressive performance gains with deeper system designs.
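
The stage ordering described in the abstract can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the stage names, the `Stage` class, `run_pipeline`, and `dummy_train` are all hypothetical, and the abstract does not specify which data each fine-tuning stage uses, so that detail is left out. The only thing the sketch shows is that each stage starts from the checkpoint produced by the previous one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str   # hypothetical label, not taken from the paper
    phase: str  # "pretraining" or "fine-tuning", as grouped in the abstract
    note: str   # paraphrase of what the abstract says about the stage


# Stage order as described in the abstract; all names are illustrative.
PIPELINE: List[Stage] = [
    Stage("tts_pretraining", "pretraining",
          "knowledge transfer from a TTS database"),
    Stage("encoder_adaptation", "pretraining",
          "adapts the encoder to EL speech, leveraging synthetic data"),
    Stage("finetune_stage_1", "fine-tuning",
          "first EL2SP fine-tuning stage (data not specified in abstract)"),
    Stage("finetune_stage_2", "fine-tuning",
          "second EL2SP fine-tuning stage (data not specified in abstract)"),
]


def run_pipeline(
    stages: List[Stage],
    train_fn: Callable[[Stage, Optional[list]], list],
    init_checkpoint: Optional[list] = None,
) -> Optional[list]:
    """Run stages in order, initializing each from the previous checkpoint."""
    checkpoint = init_checkpoint
    for stage in stages:
        checkpoint = train_fn(stage, checkpoint)
    return checkpoint


def dummy_train(stage: Stage, checkpoint: Optional[list]) -> list:
    # Placeholder for real training: it only records the checkpoint lineage.
    return (checkpoint or []) + [stage.name]


lineage = run_pipeline(PIPELINE, dummy_train)
print(" -> ".join(lineage))
```

In a real system, `dummy_train` would be replaced by a function that loads the previous checkpoint, trains on the data for that stage, and returns the new checkpoint. The common direct pretraining and fine-tuning baseline mentioned in the abstract would correspond to a shorter stage list.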
