Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis
Author(s) -
Manuel Sam Ribeiro,
Oliver Watts,
Junichi Yamagishi
Publication year - 2016
Publication title -
Interspeech 2016
Language(s) - English
Resource type - Conference proceedings
DOI - 10.21437/interspeech.2016-1034
Subject(s) - syllable, speech recognition, computer science, speech synthesis, natural language processing, artificial intelligence
A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system that exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted, followed by a discussion of the results.
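The bag-of-phones representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the encoding, so the details here are assumptions: the phone inventory `PHONE_SET` is a hypothetical toy set, the function name `bag_of_phones` is illustrative, and the vector holds per-phone counts (a binary presence vector would be an equally plausible reading). The point is only that a syllable is mapped to a fixed-length, order-independent vector over the phone inventory.

```python
from collections import Counter

# Hypothetical toy phone inventory (assumption); a real system would use
# the full phone set produced by its text-analysis front end.
PHONE_SET = ["sil", "k", "ae", "t", "s", "ih", "n"]


def bag_of_phones(syllable_phones, phone_set=PHONE_SET):
    """Return a fixed-length count vector over phone_set for one syllable.

    The order of phones inside the syllable is discarded; only how often
    each phone occurs is kept (count encoding is an assumption here).
    """
    counts = Counter(syllable_phones)
    unknown = set(counts) - set(phone_set)
    if unknown:
        raise ValueError(f"phones not in inventory: {sorted(unknown)}")
    return [counts[p] for p in phone_set]


# The syllable "cat" as a list of phone labels.
print(bag_of_phones(["k", "ae", "t"]))  # [0, 1, 1, 1, 0, 0, 0]
```

In a hierarchical setup such as the one the abstract describes, a vector like this would be one of the inputs to the syllable-level network, not to the frame-level network directly.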