
Text to speech

Text-to-Speech (TTS) is a cornerstone Nuance technology with a long research history, combining ideas from several teams and from new hires. We are now one of the largest TTS research groups in the world.




TTS for commercial use
The commercial importance of TTS continues to grow, with systems deployed in embedded devices such as toys, cell phones and GPS units, and in large-scale systems for directory assistance and customer care. Recently, we have been working on novel domains such as news reading and information query, which create new challenges in comprehension, naturalness and expressivity.
Each market imposes its own demands on the technology: a TTS system on an embedded device must compete for limited “real estate”, while server systems must be computationally efficient, able to service hundreds of simultaneous requests in real time while providing high-quality synthesis. And in all these applications, TTS must be robust to a broad range of input material.

Evolving TTS – our current research topics
TTS research at Nuance investigates a broad range of topics, including advanced NLU, prosody, statistical and hybrid selection and synthesis, as well as design-related work such as voice selection, script design, recording methods and advanced analytics. Here are some examples of typical problems we research:

  • Advanced Parametric Speech Synthesis
    Multi-Form Segment (MFS) TTS is a speech synthesis technique that combines a model framework (statistical models or artificial neural network models) with segments of different representational structures. Both natural speech fragments (templates) and models are used, resulting in higher quality and naturalness compared to parametric TTS, and higher flexibility, generalization and smoothness compared to unit-selection-based TTS. MFS TTS generates output speech ranging from parametric to fully concatenative speech synthesis. It uses a multi-layered uniform parameterization based on a sinusoidal representation and a Mel-Regularised Cepstral Coefficients (MRCC) parameterization of the spectral envelope and phase information. Additional layers of residual noise components are optionally added to the template segments to further increase the waveform reconstruction quality. During synthesis, the spectral and phase parameters of the sequence of segments are converted to a sinusoidal model representation. This allows great flexibility for manipulating the prosody and spectrum of TTS output.
  • Prosody prediction for natural and expressive Speech Synthesis
    Natural and expressive TTS requires an accurate prosody prediction model. This model must capture the rich semantic, syntactic, and pragmatic information speakers tacitly make use of when producing speech. Moreover, these features interact in a complex, dynamic way with contextual effects from neighboring and distant inputs. Prosody prediction is hence affected by an interaction of short- and long-term contextual factors. We employ recurrent neural networks (RNNs) that are deep in time and can store state information from an arbitrarily large input history when making a prediction.
  • Compound and Named Entity Recognition for TTS
    The recognition of compounds and named entities is crucial for TTS to achieve natural prosody. For instance, English speakers expect the emphasis on “razor blade” to be on the first word, but on the second word in “rusty blade”.
    New compounds and named entities are created every day. We apply a combination of dictionary-based, rule-based and data-driven approaches to compound- and named-entity-related processing (recognition, default prominence prediction, part-of-speech prediction, etc.), using statistical techniques such as memory-based learning, maximum entropy and conditional random fields. We measure the progress of our research by computing symbolic accuracy figures for various prediction categories, e.g. noun-noun compounds, and by using subjective measures such as the TTS Mean Opinion Score.
  • Domain/Style specific TTS
    In order to improve the naturalness of our TTS, we explore the prediction and usage of a “style of speech” feature, a categorical pragmatic property of a document or text to be spoken. For example, it may denote a domain (e.g. Weather versus Wikipedia) or a style (e.g. Neutral versus Joyful). These features can operate above, at, or below the sentence level. We use a maximum entropy classifier, trained on domains clustered by vocabulary and syntactic structure to classify documents; and we investigate using changes in pronunciations to signal mood.
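The sinusoidal reconstruction step at the heart of MFS synthesis can be sketched in a few lines. This is a minimal, single-frame illustration, not Nuance's implementation: the function name and parameters are assumptions, and a real system would interpolate parameters across frames and add the residual noise components described above. The flexibility mentioned in the text is visible directly: scaling the frequencies shifts pitch, and changing the frame length stretches duration.

```python
import math

def sinusoidal_frame(amps, freqs_hz, phases, sample_rate=16000, n_samples=160):
    """Reconstruct one frame of speech as a sum of sinusoids.

    Each (amplitude, frequency, phase) triple describes one sinusoidal
    component. Prosody can be manipulated by scaling freqs_hz (pitch)
    or n_samples (duration) before reconstruction.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = sum(a * math.cos(2 * math.pi * f * t + p)
                     for a, f, p in zip(amps, freqs_hz, phases))
        frame.append(sample)
    return frame

# A 10 ms "voiced" frame built from three harmonics of a 100 Hz fundamental.
frame = sinusoidal_frame([1.0, 0.5, 0.25], [100.0, 200.0, 300.0], [0.0, 0.0, 0.0])
```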
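The recurrent-network idea behind the prosody prediction work can be illustrated with a minimal Elman RNN. The weights below are toy values, not a trained model; the point is the mechanism: the hidden state carries information from arbitrarily distant inputs forward into each prediction, which is what lets the model combine short- and long-term context.

```python
import math

def elman_step(x, h_prev, W_xh, W_hh, b_h):
    """One step of a simple recurrent (Elman) network; the hidden state
    accumulates context from the entire input history seen so far."""
    h = []
    for j in range(len(b_h)):
        s = b_h[j]
        s += sum(W_xh[j][i] * xi for i, xi in enumerate(x))
        s += sum(W_hh[j][k] * hk for k, hk in enumerate(h_prev))
        h.append(math.tanh(s))
    return h

def run_sequence(xs, W_xh, W_hh, b_h):
    """Feed a sequence of per-token feature vectors through the RNN
    and return the hidden state after each token."""
    h = [0.0] * len(b_h)
    states = []
    for x in xs:
        h = elman_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Toy, untrained weights: 1 input feature, 2 hidden units.
W_xh = [[0.8], [-0.4]]
W_hh = [[0.1, 0.2], [0.3, -0.1]]
b_h = [0.0, 0.0]
states = run_sequence([[1.0], [0.0], [1.0]], W_xh, W_hh, b_h)
```

Because the hidden state is recurrent, two sequences ending in the same token still yield different final states when their earlier tokens differ, which is exactly the long-range memory the prosody model relies on.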
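A heavily simplified, dictionary-based sketch of default prominence prediction for two-word phrases, using the “razor blade” vs. “rusty blade” contrast above. The lexicon and labels are invented for illustration; production systems combine dictionaries with rule-based and statistical recognizers (e.g. conditional random fields) to cover newly coined compounds.

```python
# Toy compound lexicon; real systems use large dictionaries plus
# rule-based and statistical recognizers for unseen compounds.
COMPOUND_LEXICON = {("razor", "blade"), ("cell", "phone")}

def default_prominence(w1, w2):
    """Default prominence for a two-word phrase: known compounds are
    left-prominent ("RAZOR blade"), other pairs right-prominent
    ("rusty BLADE")."""
    if (w1.lower(), w2.lower()) in COMPOUND_LEXICON:
        return ((w1, "PROMINENT"), (w2, "reduced"))
    return ((w1, "reduced"), (w2, "PROMINENT"))
```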
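As a rough stand-in for the maximum entropy domain classifier mentioned above, the sketch below scores a document against hand-picked domain vocabularies. The vocabularies and scoring are invented for illustration; in practice the domains are clustered from data by vocabulary and syntactic structure, and the classifier is trained rather than hand-written.

```python
# Hand-picked stand-ins for vocabulary clusters learned from data.
DOMAIN_VOCAB = {
    "weather": {"rain", "sunny", "forecast", "temperature"},
    "encyclopedia": {"born", "history", "known", "century"},
}

def classify_domain(text):
    """Pick the domain whose vocabulary best overlaps the document's words
    (a toy substitute for a trained maximum entropy classifier)."""
    words = set(text.lower().split())
    scores = {d: len(words & vocab) for d, vocab in DOMAIN_VOCAB.items()}
    return max(scores, key=scores.get)
```

The predicted label can then condition the synthesis style, e.g. a brisker reading for Weather than for Wikipedia text.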

Explore recent publications by Nuance Text to Speech researchers.



Selected articles

Automatic Detection of Prosody Phrase Boundaries for Text-to-Speech Systems

Automatic acquisition of the prosodic phrase boundary detecting rules from the text and speech corpora has always been a difficulty for TTS systems. We collected


Dealing with Polyphone in Text-to-Speech Systems Using How-Net

Dealing with polyphones is an important part of Chinese text-to-speech system. Because the pronunciation of a Chinese character is directly related to the meaning of


Statistical corpus-based speech segmentation

An automatic speech segmentation technique is presented that is based on the alignment of a target speech signal with a set of different reference speech


Linguistic features weighting for a text-to-speech system without prosody model

This paper presents a Non-Uniform Unit selection-based Text-To-Speech synthesizer. Nowadays, systems use prosodic models that do not allow the prosody to vary as


Multi-Pass Pronunciation Adaptation

A mapping between words and pronunciations (potential phonetic realizations) is a key component of speech recognition systems. Traditionally, this has been encoded in a lexicon where each pronunciation


Synthesis by generation and concatenation of multiform segments

Machine generated speech can be produced in different ways however there are two basic methods for synthesizing speech in widespread use. One method generates speech


Refined Statistical Model Tuning for Speech Synthesis

This paper describes a number of approaches to refine and tune statistical models for speech synthesis. The first approach is to tune the sizes of



Uniform speech parameterization for multi-form segment synthesis

In multi-form segment synthesis speech is constructed by sequencing speech segments of different nature: model segments, i.e. mathematical abstractions of speech and template segments, i.e.


Towards automatic phonetic segmentation for TTS

Phonetic segmentation is an important step in the development of a concatenative TTS voice. This paper introduces a segmentation process consisting of two phases. First,


