PhD Thesis, Saarland University
Automatic speech recognition (ASR) is becoming increasingly important, with commercial applications such as call steering, dictation, and voice-enabled personal assistant systems. Although successful in many respects, the performance of such systems can degrade significantly in noisy environments such as a crowded restaurant. This is because noise introduces a mismatch between the clean speech features on which the ASR system was trained and the noisy speech features encountered in the operational environment.
This dissertation mitigates this degradation in performance using two fundamentally different approaches: speech feature enhancement (SFE) techniques, which minimize the mismatch between clean and noisy features, and missing feature reconstruction (MFR) techniques, which infer the values of noise-corrupted frequency bins from uncorrupted ones. Particular contributions include (1) a phase-averaged model of how noise corrupts clean speech features, (2) improved noise estimation with a Monte Carlo variant of the expectation-maximization algorithm, (3) an adaptive level-of-detail transform that allows for more accurate transformations of Gaussian random variables, and (4) a bounded conditional mean imputation technique.
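To make the mismatch concrete, the sketch below implements the widely used log-spectral interaction model for additive noise, which models of this kind build on: if clean speech and noise add in the power domain, the noisy log feature is y = x + log(1 + exp(n - x)). This is a minimal illustration of the general setting, not the thesis's phase-averaged model (which additionally accounts for the phase difference between speech and noise); the function name is illustrative.

```python
import numpy as np

def noisy_log_feature(x, n):
    """Log-spectral interaction model for additive noise.

    Assuming speech and noise add in the power domain (phase terms
    neglected), exp(y) = exp(x) + exp(n), so the noisy log feature is
        y = x + log(1 + exp(n - x)).
    Written with log1p for numerical stability when n << x.
    """
    return x + np.log1p(np.exp(n - x))

# Example: a clean log-power feature of 5.0 corrupted by noise at 4.0
# is pushed upward, while noise far below the speech level barely
# changes the feature at all.
print(noisy_log_feature(5.0, 4.0))   # noticeably above 5.0
print(noisy_log_feature(5.0, -5.0))  # almost exactly 5.0
```

The key point for SFE is that this relationship is nonlinear in x and n, which is why enhancement methods must propagate Gaussian uncertainty through a nonlinearity rather than simply subtracting a noise estimate.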
In addition to the above, it is shown that both SFE and MFR techniques can be derived within the same mathematical framework, simply by using different models of how noise corrupts clean speech features.