IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy
Raymond Brueckner, Björn Schuller
Non-verbal speech cues play an important role in human communication, for example in expressing emotional states or maintaining the conversational flow. In this paper we investigate the effect of applying deep bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks to the Interspeech 2013 Computational Paralinguistic Social Signals Sub-Challenge dataset, which requires frame-wise, speaker-independent detection and classification of laughter and filler vocalizations in speech. BLSTM networks tend to outperform conventional neural network architectures whenever the recognition or regression task relies on exploiting temporal context information.
We introduce deep BLSTM models by stacking several BLSTMs and by combining non-recurrent deep neural networks with BLSTMs.
We demonstrate that this new approach achieves significant improvements over previous attempts, raising the state-of-the-art unweighted average area-under-the-curve (UAAUC) value from 92.4 % to 94.0 %. This is the best result on this task reported in the literature so far.
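To make the evaluation measure concrete, the following is a minimal sketch of how a UAAUC value could be computed from frame-wise posteriors. It assumes that UAAUC is the plain mean of the one-vs-rest areas under the ROC curve for the laughter and filler classes; the function names, the "garbage" label for all other frames, and the toy posteriors are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """Area under the ROC curve via the Mann-Whitney statistic.

    Counts the fraction of (positive, negative) frame pairs in which the
    positive frame receives the higher score; ties count as half a win.
    Quadratic in the number of frames, which is fine for a sketch only.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def uaauc(
    posteriors: Dict[str, List[float]],
    frame_labels: List[str],
    classes: Tuple[str, ...] = ("laughter", "filler"),
) -> Tuple[float, Dict[str, float]]:
    """Unweighted average of the one-vs-rest AUCs of the given classes."""
    per_class = {
        c: auc(posteriors[c], [label == c for label in frame_labels])
        for c in classes
    }
    return sum(per_class.values()) / len(classes), per_class


# Toy data (hypothetical): six frames with per-class posterior scores.
frame_labels = ["garbage", "laughter", "laughter", "filler", "garbage", "filler"]
posteriors = {
    "laughter": [0.1, 0.9, 0.4, 0.2, 0.5, 0.1],
    "filler": [0.2, 0.1, 0.3, 0.8, 0.1, 0.6],
}

score, per_class = uaauc(posteriors, frame_labels)
print(per_class)  # {'laughter': 0.875, 'filler': 1.0}
print(score)      # 0.9375
```

Because each class contributes equally to the mean regardless of how many frames it covers, a rare class such as laughter weighs as much as a frequent one, which is the usual motivation for an unweighted average on imbalanced frame-level data.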