
Unit Generation Based on Phrase Break Strength and Pruning for Corpus‐Based Text‐to‐Speech
Author(s) - Kim Sanghun, Lee Youngjik, Hirose Keikichi
Publication year - 2001
Publication title - ETRI Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.295
H-Index - 46
eISSN - 2233-7326
pISSN - 1225-6463
DOI - 10.4218/etrij.01.0101.0403
Subject(s) - speech synthesis , computer science , pruning , vector quantization , speech recognition , phrase , sentence , cluster analysis , artificial intelligence , reduction (mathematics) , word error rate , set (abstract data type) , natural language processing , mathematics , geometry , agronomy , biology , programming language
This paper discusses two important issues in corpus‐based synthesis: synthesis unit generation based on phrase break strength information and pruning of redundant synthesis unit instances. First, a new sentence set for recording was designed to build an efficient synthesis database reflecting the characteristics of the Korean language. To obtain prosodic-context-sensitive units, we graded major prosodic phrases into five distinct levels according to pause length and then discriminated intra‐word triphones using these levels. Using synthesis units carrying phrase break strength information, synthetic speech was generated and evaluated subjectively. Second, a new pruning method based on weighted vector quantization (WVQ) was proposed to eliminate redundant synthesis unit instances from the synthesis database. WVQ takes the relative importance of each instance into account when clustering similar instances with a vector quantization (VQ) technique. The proposed method was compared, through objective and subjective evaluations of synthetic speech quality, with two conventional pruning methods: one that simply limits the maximum number of instances, and one based on normal VQ‐based clustering. For the same reduction rate in the number of instances, the proposed method showed the best performance. At a 45% reduction rate, the synthetic speech showed almost no perceptible degradation compared with speech synthesized without instance reduction.
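As a rough illustration of the break-strength grading described in the abstract, the following Python sketch maps a pause duration to one of five break-strength levels and attaches that level to an intra-word triphone label. The threshold values, the "BS" label format, and the function names are assumptions made here for illustration; the paper does not specify them.

    # A minimal sketch, assuming illustrative pause thresholds and a made-up
    # label format: grade a pause into one of five break-strength levels and
    # attach the level to a triphone label. None of these values are from the paper.

    PAUSE_THRESHOLDS_MS = [50, 120, 250, 500]  # assumed boundaries between levels 1..5

    def break_strength(pause_ms: float) -> int:
        """Map a pause duration in milliseconds to a break-strength level 1..5."""
        level = 1
        for threshold in PAUSE_THRESHOLDS_MS:
            if pause_ms > threshold:
                level += 1
        return level

    def label_triphone(left: str, center: str, right: str, pause_ms: float) -> str:
        """Attach the break-strength level to a triphone label, e.g. 'a-n+i/BS4'."""
        return f"{left}-{center}+{right}/BS{break_strength(pause_ms)}"

    # A 300 ms pause falls into level 4 under the assumed thresholds.
    print(label_triphone("a", "n", "i", 300.0))  # -> a-n+i/BS4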
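The WVQ pruning step can be pictured along the following lines: instances are clustered with a weighted variant of VQ (approximated below with weighted k-means), so that more important instances pull the codewords toward themselves, and a single representative per cluster is kept while the rest are discarded. The importance weights, the feature representation, the function name wvq_prune, and the choice of the highest-weight member as the cluster representative are assumptions of this sketch, not details taken from the paper.

    # Hypothetical sketch of WVQ-style pruning with weighted k-means standing
    # in for VQ. Inputs: instance feature vectors and per-instance importance
    # weights; output: indices of the instances kept after pruning.

    import numpy as np

    def wvq_prune(features, weights, n_clusters, n_iters=20, seed=0):
        """features: (N, D) float array; weights: (N,) importance scores."""
        rng = np.random.default_rng(seed)
        centroids = features[rng.choice(len(features), n_clusters, replace=False)]

        for _ in range(n_iters):
            # Assign each instance to its nearest centroid.
            dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Weighted centroid update: important instances pull centroids harder.
            for k in range(n_clusters):
                mask = assign == k
                if mask.any():
                    w = weights[mask][:, None]
                    centroids[k] = (w * features[mask]).sum(axis=0) / w.sum()

        # Keep the highest-weight instance of each cluster as its representative.
        kept = []
        for k in range(n_clusters):
            members = np.flatnonzero(assign == k)
            if members.size:
                kept.append(members[weights[members].argmax()])
        return np.array(kept)

    # Example: prune 1000 synthetic instances down to roughly 100 representatives.
    X = np.random.default_rng(1).normal(size=(1000, 12))
    w = np.random.default_rng(2).uniform(0.1, 1.0, size=1000)
    print(wvq_prune(X, w, n_clusters=100).shape)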