A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition

Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, Hermann Ney
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New Orleans, USA, March 2017

Recent experiments show that deep bidirectional long short-term memory (BLSTM) recurrent neural network acoustic models outperform feedforward neural networks for automatic speech recognition (ASR). However, their training requires a lot of tuning and experience. In this work, we provide a comprehensive overview over various BLSTM training aspects and their interplay within ASR, which has been missing so far in the literature. We investigate on different variants of optimization methods, batching, truncated backpropagation, and regularization techniques such as dropout, and we study the effect of size and depth, training models of up to 10 layers. This includes a comparison of computation times vs. recognition performance. Furthermore, we introduce a pretraining scheme for LSTMs with layer-wise construction of the network showing good improvements especially for deep networks. The experimental analysis mainly was performed on the Quaero task, with additional results on Switchboard. The best BLSTM model gave a relative improvement in word error rate of over 15% compared to our best feed-forward baseline on our Quaero 50h task. All experiments were done using RETURNN and RASR, RWTH’s extensible training framework for universal recurrent neural networks and ASR toolkit. The training configuration files are publicly available.

» Show BibTeX

title={A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition},
author={Zeyer, Albert and Doetsch, Patrick and Voigtlaender, Paul and Schl{\"u}ter, Ralf and Ney, Hermann},
booktitle={IEEE International Conference on Acoustics, Speech, and Signal Processing},

Disclaimer Home Visual Computing institute RWTH Aachen University