Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi


Open Access



Publication year - 2016

Publication title - Interspeech 2016

Language(s) - English

Resource type - Conference proceedings

DOI - 10.21437/interspeech.2016-1034

Subject(s) - syllable, speech recognition, computer science, speech synthesis, natural language processing, artificial intelligence

A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system which exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted and followed by a discussion of the results.
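
As a rough illustration of the bag-of-phones idea mentioned in the abstract, the sketch below builds an order-free count vector over a phone inventory for one syllable and tiles a syllable-level vector onto the frames of that syllable. Everything named here is an assumption for illustration: the phone inventory `PHONES`, the helper names `bag_of_phones` and `append_syllable_vector`, and the use of counts rather than binary indicators. The abstract says the frame-level network receives a learned suprasegmental representation; the raw count vector only stands in for it here.

```python
from collections import Counter

import numpy as np

# Hypothetical phone inventory; the phone set used in the paper is not given here.
PHONES = ["sil", "k", "ae", "t", "s", "ih", "ng"]
PHONE_INDEX = {p: i for i, p in enumerate(PHONES)}


def bag_of_phones(syllable):
    """Return a count vector over PHONES for one syllable (phone order is discarded)."""
    vec = np.zeros(len(PHONES), dtype=np.float32)
    for phone, count in Counter(syllable).items():
        vec[PHONE_INDEX[phone]] = count
    return vec


def append_syllable_vector(frame_feats, syll_vec):
    """Tile one syllable-level vector onto every frame belonging to that syllable."""
    tiled = np.tile(syll_vec, (frame_feats.shape[0], 1))
    return np.hstack([frame_feats, tiled])


# Assumed toy input: the syllable "cat" and five frames of 3-dimensional features.
syllable = ["k", "ae", "t"]
bop = bag_of_phones(syllable)
frames = np.zeros((5, 3), dtype=np.float32)
augmented = append_syllable_vector(frames, bop)
```

In a hierarchical setup like the one the abstract describes, `bop` would more plausibly be an input to a syllable-level network, and the compact representation that network learns would be the vector appended to the frame-level inputs.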
