TTS for commercial use
The commercial importance of TTS continues to grow, with systems deployed in embedded devices such as toys, cell phones, and GPS units, and in large-scale systems for directory assistance and customer care. Recently, we have been working on novel domains such as news reading and information query, which create new challenges in comprehension, naturalness, and expressivity.
Each market imposes its own demands on the technology: a TTS system on an embedded device must compete for limited “real estate”, while server systems must be computationally efficient, able to service hundreds of simultaneous requests in real time while providing high-quality synthesis. And in all these applications, TTS must be robust to a broad range of input material.
Evolving TTS – our current research topics
TTS research at Nuance investigates a broad range of topics, including advanced NLU, prosody, statistical and hybrid selection and synthesis, as well as design questions such as voice selection, script design, recording methods, and advanced analytics. Here are some examples of typical problems we research:
- Advanced Parametric Speech Synthesis
Multi-Form Segment (MFS) TTS is a speech synthesis technique that combines a model framework (statistical models or artificial neural network models) with segments of different representational structures. Both natural speech fragments (templates) and models are used, resulting in higher quality and naturalness compared to parametric TTS, and higher flexibility, generalization, and smoothness compared to unit-selection-based TTS. MFS TTS generates output speech ranging from parametric to fully concatenative speech synthesis. It uses a multi-layered uniform parameterization based on a sinusoidal representation and a Mel-Regularised Cepstral Coefficients (MRCC) parameterization of the spectral envelope and phase information. Additional layers of residual noise components can optionally be added to the template segments to further increase the waveform reconstruction quality. During synthesis, the spectral and phase parameters of the sequence of segments are converted to a sinusoidal model representation, which allows great flexibility in manipulating the prosody and spectrum of the TTS output.
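To make the sinusoidal-model step concrete, here is a minimal sketch of how a frame of speech can be reconstructed as a sum of sinusoids. The function name and the toy partials are illustrative only; a real system like the one described above derives these parameters from the MRCC spectral-envelope and phase layers and interpolates them across frames.

```python
import math

def synthesize_frame(partials, n_samples, sample_rate):
    """Reconstruct one frame of speech as a sum of sinusoids.

    `partials` is a list of (amplitude, frequency_hz, phase) tuples --
    a simplified stand-in for the per-frame spectral and phase
    parameters described in the text.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = sum(a * math.cos(2 * math.pi * f * t + p)
                     for a, f, p in partials)
        frame.append(sample)
    return frame

# Toy example: a 100 Hz fundamental plus two weaker harmonics,
# rendered as a 10 ms frame at 16 kHz.
partials = [(1.0, 100.0, 0.0), (0.5, 200.0, 0.0), (0.25, 300.0, 0.0)]
frame = synthesize_frame(partials, n_samples=160, sample_rate=16000)
```

Because the signal is regenerated from parameters rather than copied from stored waveforms, prosody changes (e.g. shifting the partial frequencies to raise pitch) become simple parameter edits.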
- Prosody prediction for natural and expressive Speech Synthesis
Natural and expressive TTS requires an accurate prosody prediction model. This model must capture the rich semantic, syntactic, and pragmatic information speakers tacitly make use of when producing speech. Moreover, these features interact in a complex, dynamic way with contextual effects from neighboring and distant inputs. Prosody prediction is hence affected by an interaction of short- and long-term contextual factors. We employ recurrent neural networks (RNNs) that are deep in time and can store state information from an arbitrarily large input history when making a prediction.
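The recurrence that lets an RNN carry state across an arbitrarily long input history can be sketched in a few lines. This is a deliberately tiny Elman-style model with scalar, hand-picked weights; the real prosody predictors described above use vector-valued linguistic features per word and learned weight matrices.

```python
import math

def rnn_step(x, h, w_xh, w_hh):
    """One step of a minimal Elman recurrence: h' = tanh(w_xh*x + w_hh*h).

    Scalar weights keep the sketch readable; a trained model would use
    weight matrices over feature vectors.
    """
    return math.tanh(w_xh * x + w_hh * h)

def predict_prominence(features, w_xh=0.8, w_hh=0.5):
    """Fold an arbitrarily long feature history into a single hidden
    state, then map the final state to a prominence score in (0, 1)
    via a logistic output layer."""
    h = 0.0
    for x in features:
        h = rnn_step(x, h, w_xh, w_hh)
    return 1.0 / (1.0 + math.exp(-h))

# Each element stands in for one word's (here one-dimensional) features.
score = predict_prominence([1.0, 0.0, 1.0, 1.0])
```

The key property is that `h` summarizes everything seen so far, so a prediction at the end of a sentence can still be influenced by material many words back, which is exactly the long-range contextual effect the text describes.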
- Compound and Named Entity Recognition for TTS
The recognition of compounds and named-entities is crucial for TTS to achieve natural prosody. For instance, English speakers expect the emphasis on “razor-blade” to be on the first word, but on the second word in “rusty blade”.
New compounds and named-entities are created every day. We apply advanced approaches (dictionary-based, rule-based, and statistical) to compound and named-entity related processing (recognition, default prominence prediction, part-of-speech prediction, etc.), using statistical techniques such as memory-based learning, maximum entropy, and conditional random fields. We measure the progress of our research by computing symbolic accuracy figures for various prediction categories, e.g. noun-noun compounds, and by using subjective measures such as TTS Mean Opinion Score.
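A toy dictionary-based predictor illustrates the default-prominence task for the “razor-blade” versus “rusty blade” contrast above. The lexicon lookup is a stand-in for the statistical recognizers (memory-based learning, maximum entropy, CRFs) mentioned in the text, and the lexicon entries are invented for the example.

```python
def default_prominence(word1, word2, compound_lexicon):
    """Assign default prominence for a two-word noun phrase.

    Noun-noun compounds typically carry leftmost prominence
    ("RAZOR blade"), while ordinary modifier-noun phrases stress the
    head ("rusty BLADE"). The lexicon lookup is a simplified stand-in
    for a trained recognizer.
    """
    if (word1, word2) in compound_lexicon:
        return "left"   # compound: emphasize the first word
    return "right"      # ordinary phrase: emphasize the head noun

lexicon = {("razor", "blade"), ("news", "reader")}
print(default_prominence("razor", "blade", lexicon))  # left
print(default_prominence("rusty", "blade", lexicon))  # right
```

Because new compounds appear daily, a static lexicon alone is insufficient, which is why the statistical back-off models matter in practice.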
- Domain/Style specific TTS
To improve the naturalness of our TTS, we explore the prediction and use of a “style of speech” feature, a categorical pragmatic property of a document or text to be spoken. For example, it may denote a domain (e.g. Weather versus Wikipedia) or a style (e.g. Neutral versus Joyful). These features can operate above, at, or below the sentence level. We use a maximum entropy classifier, trained on domains clustered by vocabulary and syntactic structure, to classify documents, and we investigate using changes in pronunciation to signal mood.
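The domain classification step can be sketched as a log-linear (maximum-entropy style) model: each domain scores a document by summing per-token weights, and a softmax turns the scores into probabilities. The domains, tokens, and weights below are invented for illustration and are not trained Nuance models.

```python
import math

def classify(tokens, weights):
    """Score each domain with a log-linear model:
    P(domain | doc) is proportional to exp(sum of per-token weights).

    `weights` maps domain -> {token: weight}. Unknown tokens
    contribute a weight of zero.
    """
    scores = {d: sum(w.get(t, 0.0) for t in tokens)
              for d, w in weights.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {d: math.exp(s) / z for d, s in scores.items()}

# Hypothetical feature weights for two domains.
weights = {
    "weather": {"rain": 2.0, "sunny": 1.5, "forecast": 2.5},
    "wiki":    {"born": 2.0, "century": 1.5, "article": 1.0},
}
probs = classify("forecast says rain".split(), weights)
best = max(probs, key=probs.get)
```

In a trained maximum-entropy classifier the weights would be fit to labeled documents; the clustering by vocabulary and syntax described above determines which domain labels the model is trained to distinguish.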
Explore recent publications by Nuance Text to Speech researchers.