Text to speech
Alexander Sorin, Slava Shechtman, Vincent Pollet
In expressive TTS and voice transformation systems, implantation of expressive prosody derived from external out-of-domain sources often leads to extreme pitch modification that compromises the naturalness of the synthesized speech. In this work we investigate and prove a hypothesis that the naturalness loss is in part attributed to a violation of a fundamental relationship between the instantaneous pitch frequency and instantaneous energy of a speech signal. We propose an enhancement for pitch modification where the instantaneous energy is modified coherently with the pitch frequency and demonstrate the potential of this method in a subjective listening evaluation. The proposed approach is complementary to and can be combined with spectrum shape transformation methods for achieving the maximal possible quality of pitch modification.