Open Access
Research on Speech Synthesis Technology Based on Rhythm Embedding
Author(s) - Tianxin Wu, Lasheng Zhao, Qiang Zhang
Publication year - 2020
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1693/1/012127
Subject(s) - rhythm , speech recognition , computer science , embedding , speech synthesis , feature (linguistics) , encoder , set (abstract data type) , field (mathematics) , distortion (music) , artificial intelligence , mathematics , linguistics , acoustics , philosophy , physics , bandwidth (computing) , pure mathematics , programming language , operating system , amplifier , computer network
In recent years, Text-To-Speech (TTS) technology has developed rapidly, and increasing attention has been paid to narrowing the gap between synthetic and real speech, in the hope that synthesized speech can carry natural rhythm. This paper proposes a rhythmic feature embedding method for TTS based on the Tacotron2 model, which has emerged in the TTS field in recent years. First, rhythmic features are extracted with the WORLD vocoder, which reduces redundant information in the rhythmic features. Then, rhythmic feature fusion based on a Variational Auto-Encoder (VAE) network enhances the rhythmic information. Experiments were carried out on the LJSpeech-1.0 dataset, and the synthesized speech was evaluated both subjectively and objectively. Compared with the comparative literature, the subjective blind listening test (ABX) score increased by 25%, while the objective Mel Cepstral Distortion (MCD) value declined to 12.77.
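The objective metric cited in the abstract, Mel Cepstral Distortion, has a standard closed form: a frame-wise scaled Euclidean distance between mel-cepstral coefficient vectors, averaged over frames. A minimal sketch in NumPy (the function name, array layout, and the convention of excluding the 0th energy coefficient are illustrative assumptions, not details from the paper; the paper's exact evaluation protocol is not specified here):

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-averaged Mel Cepstral Distortion in dB.

    mc_ref, mc_syn: arrays of shape (frames, dims) holding time-aligned
    mel-cepstral coefficients, typically with the 0th (energy)
    coefficient already removed. Lower values mean the synthesized
    spectrum is closer to the reference.
    """
    diff = mc_ref - mc_syn
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Average over all frames to get a single score for the utterance
    return float(np.mean(per_frame))
```

In practice the two coefficient sequences are usually aligned first (e.g. by dynamic time warping) before this distance is averaged, since synthesized and reference utterances rarely have identical durations.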
