End-to-End Multi-Speaker FastSpeech2 with Hierarchical Decoder

Majid Adibian; Hossein Zeinali

Open Access

Author(s) - Majid Adibian, Hossein Zeinali

Publication year - 2025

Publication title - IEEE Access

Language(s) - English

Resource type - Magazines

SCImago Journal Rank - 0.587

H-Index - 127

eISSN - 2169-3536

DOI - 10.1109/access.2025.3589120

Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation

Multi-speaker text-to-speech (TTS) systems play a crucial role in different applications, such as personalized voice assistants, audiobooks, and multilingual speech synthesis. These systems aim to generate high-quality, natural-sounding speech while preserving the distinct characteristics of different speakers. In this paper, we strive to enhance the naturalness and speaker similarity of the FastSpeech2 model in multi-speaker text-to-speech synthesis across closed and open-set speaker scenarios while preserving its high inference speed and lightweight architecture. Specifically, we introduce a hierarchical decoder structure and a speaker similarity loss function to enhance speaker fidelity in synthesized speech. Additionally, we investigate various methods for integrating speaker embeddings within the model and propose an end-to-end training strategy to mitigate error propagation, an inherent limitation of cascaded models. Experimental results demonstrate that our modified FastSpeech2 model significantly outperforms the baseline in closed and open-set scenarios. The proposed model achieves an absolute improvement of 0.89 in Mean Opinion Score (MOS) and 0.44 in Speaker Similarity MOS (SMOS) while maintaining the high inference speed of FastSpeech2.
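
The abstract does not give the form of the speaker similarity loss. As a rough illustration only, one common way to express such a term is the cosine distance between a speaker embedding extracted from the synthesized speech and one extracted from the reference speech, added to the usual reconstruction loss with a weight. The sketch below assumes that formulation; the function names, the `weight` parameter, and the use of plain Python lists as embeddings are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_similarity_loss(synth_emb: Sequence[float],
                            ref_emb: Sequence[float]) -> float:
    """Assumed form: 1 - cosine similarity (0 when the embeddings align)."""
    return 1.0 - cosine_similarity(synth_emb, ref_emb)


def total_loss(recon_loss: float,
               synth_emb: Sequence[float],
               ref_emb: Sequence[float],
               weight: float = 1.0) -> float:
    """Hypothetical combination: reconstruction loss plus a weighted
    speaker similarity term. The weighting scheme is an assumption."""
    return recon_loss + weight * speaker_similarity_loss(synth_emb, ref_emb)
```

In an actual training setup the embeddings would come from a speaker encoder and the loss would be computed on tensors so that gradients flow back into the synthesizer; the scalar version here only shows the arithmetic.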
