Article details

Research area
Text to speech

In Proceedings of Interspeech 2012, Portland, Oregon, USA


Alexander Sorin, Slava Shechtman, Vincent Pollet

Psychoacoustic segment scoring for multi-form speech synthesis


In multi-form segment synthesis, output speech is constructed by splicing waveform segments with statistically modeled and regenerated parametric speech segments. The fraction of model-derived segments is called the model-template ratio. The motivation of this work is to further increase the flexibility of multi-form synthesis while maintaining high speech quality at high model-template ratios. An approach is presented in which the representation type of a segment is selected per acoustic leaf. We introduce a novel method for leaf representation selection based on a psychoacoustic segment stationarity score. Additionally, we present refinements to multi-form segment concatenation for synthesis at high model-template ratios, including boundary-constrained statistical parametric synthesis and time-domain alignment based on multi-peak analysis of cross-correlation.
