Open Access
End-to-End Multi-Speaker FastSpeech2 with Hierarchical Decoder
Author(s) - Majid Adibian, Hossein Zeinali
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3589120
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Multi-speaker text-to-speech (TTS) systems play a crucial role in applications such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of each speaker. In this paper, we aim to improve the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker TTS synthesis, in both closed-set and open-set speaker scenarios, while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to improve speaker fidelity in synthesized speech. We also investigate several methods for integrating speaker embeddings into the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results show that our modified FastSpeech2 model significantly outperforms the baseline in both closed-set and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2.
