Article details

Research area
Speech recognition

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Yun Tang, Aanchan Mohan, Richard C. Rose, Chengyuan Ma

Deep neural network trained with speaker representation for speaker normalization


A method for speaker normalization in deep neural network (DNN) based discriminative feature estimation for automatic speech recognition (ASR) is presented. This method is applied in the context of a DNN configured for auto-encoder-based low-dimensional bottleneck (AE-BN) feature extraction, where the derived features are used as input to a continuous Gaussian density hidden Markov model (HMM/GMM) based ASR decoder. While AE-BN features are known to provide significant reductions in ASR word error rate (WER) with respect to more conventional spectral-magnitude-based features, there is no general agreement on how these networks can reduce the impact of speaker variability by incorporating prior knowledge of the speaker. An approach is presented in this paper in which spectrum-based DNN inputs are augmented with speaker inputs derived from separate regression-based speaker transformations. It is shown that the proposed method reduces the WER by 3% relative to the best speaker-adapted AE-BN CDHMM system.
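The input-augmentation step described in the abstract can be sketched as follows: spectral input frames are concatenated with a per-speaker representation vector before the forward pass to the bottleneck layer. All dimensions, the tanh nonlinearity, and the function names below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only: 40-dim spectral frames with an
# 11-frame context window, a 30-dim speaker representation, and a 39-dim
# bottleneck layer. The paper's actual configuration may differ.
n_spectral = 40 * 11
n_speaker = 30
n_bottleneck = 39

def augment_input(spectral_frames, speaker_vec):
    """Concatenate spectral input with the per-speaker representation vector."""
    return np.concatenate([spectral_frames, speaker_vec])

# Toy, untrained encoder weights for a single bottleneck layer; in practice
# these would be learned by training the auto-encoder with backpropagation.
W_enc = rng.standard_normal((n_bottleneck, n_spectral + n_speaker)) * 0.01
b_enc = np.zeros(n_bottleneck)

def bottleneck_features(spectral_frames, speaker_vec):
    """Forward pass to the low-dimensional bottleneck (AE-BN) layer."""
    x = augment_input(spectral_frames, speaker_vec)
    return np.tanh(W_enc @ x + b_enc)

# Example: one augmented frame mapped to low-dimensional features that an
# HMM/GMM decoder would consume.
frames = rng.standard_normal(n_spectral)
spk = rng.standard_normal(n_speaker)
feats = bottleneck_features(frames, spk)
print(feats.shape)
```

The point of the sketch is only the data flow: the speaker vector enters the network as extra input dimensions alongside the spectral features, so the bottleneck representation can become less sensitive to speaker variability.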
