Speech recognition

Turning speech into text is at the heart of an amazing variety of products and services that enrich people’s lives. Most of the world’s successful speech solutions today have Nuance speech technology inside.

Our goal: Near-perfect speech recognition for everybody in the world.
Nuance has been a pioneer in speech and language technologies for more than 30 years. Our data centers host billions of speech transactions every month in over 40 languages from hundreds of applications. We continuously expand our research grid to explore this avalanche of data. Our researchers, experts in the fields of speech recognition, statistical modeling, deep machine learning, and linguistics, use these computational and data resources to continuously advance the boundaries of what can be done with speech technology.

Current applications for consumers and companies.
We optimize our technology for four main application scenarios.

– Our personal assistant solutions enable people to communicate with their devices on human terms. Our systems understand people’s intentions and provide appropriate responses. Drivers operate their GPS units, make phone calls, and listen to messages using our robust speech solutions. Speech makes these interactions easier and safer. In that sense, speech technology saves lives. We continuously improve our accuracy, latency, and robustness, and extend our models to new domains, accents, languages, and devices.

– Our document creation solution powers Dragon NaturallySpeaking, the Nuance flagship speech recognition product. We develop a highly personalized speech recognition solution for each user without explicit training. Our solution not only accurately transcribes the words people dictate, but also formats the resulting written documents.

– Medical professionals use our dictation solutions to generate millions of reports every day. We offer both “front-end” solutions, where doctors see and correct reports as they dictate, and “back-end” solutions, where users speak into a microphone, and are later presented with corrected, formatted reports for signature.

– Most spoken communication takes place between people. Our transcription solution accurately converts the spoken words in conversational speech, particularly voicemails, into text. We focus our research on particular challenges in conversational speech, e.g. sloppy formulation and articulation, difficult recording conditions, multiple speakers, and unpredictable content.

Our solutions are implemented in server-based systems, embedded systems, and hybrid systems that use both server and embedded components. We work closely with our hosted operations and frequently roll out new algorithms and models.

Where we’re headed next
Here are some representative examples of the problems we research:

  • Acoustic modeling 
    Neural Nets, and particularly “Deep” Neural Nets, provide substantial performance improvements for many speech recognition tasks. Our work around NNs covers network architectures (e.g. DNN, CNN, RNN/LSTM), input features, training algorithms (parallelization, sequence training), and runtime optimizations (and of course other issues). DNNs require well-labeled data and are hard to adapt to new speakers, devices, and acoustic conditions. So we are also interested in using our large corpora of unlabeled data in many languages for DNN training, and in rapidly adapting DNNs for new acoustic conditions.
  • Language modeling
    Recent years have witnessed a loosening of the death grip of Kneser-Ney (KN) NGram models on state-of-the-art language modeling, with exponential class models (aka Model M) and more recently various large-scale continuous space language models (aka Feed-Forward NN, RNN, LSTM) achieving superior perplexity and word error rate performance over a range of tasks. The recurrent versions of these neural models drop the long-held NGram Markov approximation altogether. The gains in performance with these “new” models come at the cost of a significant increase in training times as compared to KN models. Meanwhile, cloud-based dictation services have opened the floodgates of (unsupervised) in-domain training data, which KN models are only too happy to consume and benefit from. This presents an interesting challenge: what decisions about model architecture, training implementation and infrastructure (e.g. CPU, GPU, CPU cluster, multiple GPUs), objective function, parameter initialization, optimization method, data selection, and model combination lead to the best-performing model within a practical timeframe robustly across application domains? When theory and engineering collide, the most interesting problems are born. We’re having fun solving these problems every day (but don’t worry, there are still a few left).
  • Research engineering
    The research engineering department’s goal is to turn complex algorithms into efficient, robust software. We create and maintain well-engineered toolkits that support researchers’ flexibility to create and test new algorithms, and we create engines and models which support products and applications in over 40 languages. Our customers range from a single researcher trying something new, to millions of users running Nuance products on their own devices or recognition services provided by our data centers. Our model training toolkit runs on our dedicated large-scale computing grid, and our engines run on anything from the smallest devices to large cloud servers.
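
The language modeling item above names perplexity and word error rate as the yardsticks for comparing models. As a rough illustration only, the following Python sketch implements the textbook definitions of both metrics. It is not Nuance's evaluation code; the function names, the whitespace tokenization, and the sample sentences are assumptions made for this example.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def perplexity(word_probabilities):
    """Perplexity = exp of the average negative log probability per word.

    word_probabilities holds the probability a language model assigned
    to each word of a test text, given its history.
    """
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


if __name__ == "__main__":
    # One substitution ("my" -> "the") and one deletion ("now")
    # against a four-word reference.
    print(word_error_rate("call my office now", "call the office"))
    # A model that gives every word probability 0.25 is as uncertain
    # as a uniform choice among four words.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower is better for both metrics. Perplexity can be computed from text alone, while word error rate requires running the recognizer against reference transcripts.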

Explore recent publications by Nuance Speech Recognition researchers.



Selected articles

Low-Complexity Pitch Estimation Based on Phase Differences Between Low-Resolution Spectra

Detection of voiced speech and estimation of the pitch frequency are important tasks for many speech processing algorithms. Pitch information can be used, e.g., to

Detection of Voiced Speech and Pitch Estimation for Applications with Low Spectral Resolution

Speech enhancement algorithms are employed in many applications, such as hands-free telephones, or speech recognizers, to recover a speech signal that is recorded in a

A single-channel non-intrusive C50 estimator correlated with speech recognition performance

Several intrusive measures of reverberation can be computed from measured and simulated room impulse responses, over the full frequency band or for each individual mel-frequency

Voice Activity Detection Based on Modulation-Phase Differences

Many speech processing algorithms rely on voice activity detection (VAD) that separates speech from noise. For this task, several features have been introduced that employ

Kurtosis-Controlled Babble Noise Suppression

When a speech application is employed in a crowded environment, the user’s voice superposes with many interfering voices. This babble noise is a challenge for

Statistical Signal Processing Techniques for Robust Speech Recognition

Automatic speech recognition is becoming increasingly more important, with commercial applications such as call steering, dictation or voice-enabled personal assistance systems. Although successful in many

Features for voice activity detection: a comparative analysis

In many speech signal processing applications, voice activity detection (VAD) plays an essential role for separating an audio stream into time intervals that contain speech

Listening Test to Determine the Mismatch Between Signal-to-Noise Ratio and Human Perception

Evaluations of speech enhancement systems are typically based on artificially generated noisy speech signals. A common approach to quantify the weighting of speech and background

A morphological approach to single-channel wind-noise suppression

Today, a variety of technical devices deploy spoken language processing technology. In many practical use cases, not only stationary ambient noises but non-stationary interferences, such

A practical beamformer-postfilter system for adaptive speech enhancement in non-stationary noise environments

In this contribution we present an adaptive beamformer-postfilter system which can be used to suppress non-stationary noises. The emphasis lies on the spatial filtering property

Upcoming events

ASRU 2017
Okinawa, Japan
- December 20

Nuance is a sponsor of ASRU 2017. Come and meet us there, for example at the student lunch, or join the Nuance talk: Matthew Gibson, Gary
