
Research on Speech Synthesis Technology Based on Rhythm Embedding
Author(s) -
Tianxin Wu,
Lasheng Zhao,
Qiang Zhang
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1693/1/012127
Subject(s) - rhythm , speech recognition , computer science , embedding , speech synthesis , feature (linguistics) , encoder , set (abstract data type) , field (mathematics) , distortion (music) , artificial intelligence , mathematics , linguistics , acoustics , philosophy , physics , bandwidth (computing) , pure mathematics , programming language , operating system , amplifier , computer network
In recent years, Text-To-Speech (TTS) technology has developed rapidly, and increasing attention has been paid to narrowing the gap between synthetic and real speech, in the hope that synthesized speech can carry natural rhythm. This paper proposes a rhythmic feature embedding method for TTS built on the Tacotron2 model, which has become prominent in the TTS field in recent years. First, rhythmic features are extracted with the WORLD vocoder, which reduces redundant information in the rhythmic features. Then, the rhythmic features are fused through a Variational Auto-Encoder (VAE) network to enhance the rhythmic information. Experiments are carried out on the LJSpeech-1.0 dataset, and the synthesized speech is evaluated both subjectively and objectively. Compared with the baseline in the literature, the subjective blind listening test (ABX) preference score increased by 25%, while the objective Mel Cepstral Distortion (MCD) value declined to 12.77.
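To make the first step concrete, the sketch below shows one plausible way to extract rhythm-related features with the WORLD vocoder through the `pyworld` Python bindings. This is an illustrative reconstruction, not the authors' exact pipeline; the frame period and the choice of log-F0 plus a voiced/unvoiced flag as the rhythmic features are assumptions.

```python
# Minimal sketch: rhythm-related feature extraction with the WORLD
# vocoder (pyworld). Feature choices here are illustrative assumptions.
import numpy as np
import soundfile as sf
import pyworld as pw

def extract_rhythm_features(wav_path, frame_period_ms=5.0):
    """Return per-frame log-F0 and a voiced/unvoiced mask, two signals
    commonly used to characterize speech rhythm and prosody."""
    x, fs = sf.read(wav_path)                     # mono waveform, float
    x = np.ascontiguousarray(x, dtype=np.float64) # pyworld expects float64
    # Coarse F0 estimation with DIO, then refinement with StoneMask.
    f0, t = pw.dio(x, fs, frame_period=frame_period_ms)
    f0 = pw.stonemask(x, f0, t, fs)
    vuv = (f0 > 0).astype(np.float32)             # voiced/unvoiced per frame
    # Interpolate over unvoiced frames so the pitch contour is continuous.
    voiced = np.nonzero(f0)[0]
    if voiced.size > 0:
        f0 = np.interp(np.arange(len(f0)), voiced, f0[voiced])
    log_f0 = np.log(np.maximum(f0, 1e-8)).astype(np.float32)
    return np.stack([log_f0, vuv], axis=-1)       # shape: (frames, 2)
```

Restricting the extracted stream to a compact pitch-and-voicing representation, rather than the full WORLD analysis (spectral envelope, aperiodicity), is one way to realize the paper's goal of reducing redundant information in the rhythmic features.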
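For the second step, the following is a minimal sketch, assuming a PyTorch implementation, of a VAE-style encoder that compresses a rhythmic feature sequence into a latent "rhythm embedding" for conditioning a Tacotron2-like decoder. The GRU summarizer and all layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: VAE reference encoder producing a rhythm embedding.
# Architecture details are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class RhythmVAE(nn.Module):
    def __init__(self, feat_dim=2, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim), e.g. log-F0 + V/UV per frame
        _, h = self.rnn(feats)                 # final hidden state summarizes the sequence
        h = h.squeeze(0)                       # (batch, hidden_dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence against a standard normal prior
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()
```

In a Tacotron2-style system, the latent vector `z` would typically be broadcast along the time axis and concatenated with the text-encoder outputs, so the attention and decoder stages can condition on the injected rhythmic information; the KL term is added to the synthesis loss during training.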
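Finally, the objective metric reported in the abstract can be computed with the standard MCD formula. The sketch below assumes the reference and synthesized utterances have already been aligned frame-to-frame (e.g. via dynamic time warping, not shown) and that the energy coefficient c0 is excluded, which are common but not paper-confirmed conventions.

```python
# Minimal sketch: Mel Cepstral Distortion (MCD) between aligned
# mel-cepstral sequences. Alignment and c0 exclusion are assumptions.
import numpy as np

def mcd(mcep_ref, mcep_syn):
    """mcep_*: (frames, D) mel-cepstral coefficients, c0 excluded,
    frame-aligned. Returns the frame-averaged MCD in dB."""
    diff = mcep_ref - mcep_syn
    # Standard formula: (10 / ln 10) * sqrt(2 * sum_d diff_d^2) per frame
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))))
```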