Research Library

Open Access
Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations
Author(s)
Yejin Jeon,
Yunsu Kim,
Gary Geunbae Lee
Publication year: 2024
Zero-shot multi-speaker TTS aims to synthesize speech in the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations in adapting to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation, obtained via the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses into the final speaker representation, we employ attention pooling. Finally, since the target text utterance must be generated in the desired voice, we adopt adaptive layer normalization to fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.
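The three components described in the abstract — speaker attributes obtained by subtraction, attention pooling over multi-stream hypotheses, and adaptive layer normalization for fusion — can be illustrated with a minimal numerical sketch. This is not the authors' implementation; all dimensions, weight matrices, and variable names below are hypothetical placeholders (random values stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # feature dimension (hypothetical)
n_streams = 4  # number of multi-stream Transformer hypotheses (hypothetical)
t = 5          # target text sequence length (hypothetical)

# 1) Negation feature learning: model speaker attributes as the deviation
#    of the complete audio representation from its content component.
full_audio = rng.normal(size=(n_streams, d))  # per-stream full audio encodings
content = rng.normal(size=(n_streams, d))     # per-stream content encodings
speaker_hyps = full_audio - content           # subtraction strips content info

# 2) Attention pooling: unify the per-stream speaker hypotheses into one
#    final speaker representation with a (here, random) pooling query.
query = rng.normal(size=(d,))
scores = speaker_hyps @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax attention weights
speaker = weights @ speaker_hyps              # pooled speaker vector, shape (d,)

# 3) Adaptive layer normalization: fuse speaker identity into the text
#    representation by predicting a per-feature scale and shift from it,
#    instead of concatenating text and audio features.
W_gamma = rng.normal(size=(d, d))
W_beta = rng.normal(size=(d, d))
gamma = speaker @ W_gamma                     # speaker-conditioned scale
beta = speaker @ W_beta                       # speaker-conditioned shift
text = rng.normal(size=(t, d))                # target text representations
mu = text.mean(axis=-1, keepdims=True)
sigma = text.std(axis=-1, keepdims=True)
fused = gamma * (text - mu) / (sigma + 1e-5) + beta  # shape (t, d)
```

In a trained model the pooling query and the scale/shift projections would be learned jointly with the rest of the network; the sketch only shows how the subtraction, pooling, and modulation operations compose.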
Language(s): English
