Very deep multilingual convolutional neural networks for LVCSR

Tom Sercu, Christian Puhrsch, Brian Kingsbury, Yann LeCun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers. There are multiple convolutional layers before each pooling layer, with small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low-resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data), achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result so far.
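The VGG-style design the abstract describes (several 3×3 convolutional layers before each pooling layer, up to 14 weight layers in total) can be sketched as a simple layer schedule. This is an illustrative sketch only, not the authors' code: the block sizes and channel counts below are assumptions chosen so that the schedule reaches 14 weight layers (pooling layers carry no weights), and do not reproduce the paper's exact configurations.

```python
# Illustrative sketch of a VGG-style layer schedule for acoustic modeling.
# Assumption: a hypothetical deepest variant with 10 convolutional and
# 4 fully connected weight layers (14 total); channel counts are made up.

def vgg_style_config(conv_blocks, fc_layers):
    """Expand a block spec into a flat layer list.

    conv_blocks: list of (num_conv_layers, out_channels); each block is a
    run of 3x3 convolutions followed by one 2x2 pooling layer.
    fc_layers: number of fully connected layers on top.
    """
    layers = []
    for n_convs, channels in conv_blocks:
        layers += [f"conv3x3-{channels}"] * n_convs  # stacked small kernels
        layers.append("pool2x2")                     # pooling: no weights
    layers += ["fc"] * fc_layers
    return layers

# Hypothetical configuration: 3 + 3 + 4 = 10 conv layers, plus 4 FC layers.
layers = vgg_style_config([(3, 64), (3, 128), (4, 256)], fc_layers=4)
weight_layers = [l for l in layers if l != "pool2x2"]
print(len(weight_layers))  # 14 weight layers in this sketch
```

Stacking several 3×3 convolutions between pooling steps gives the same receptive field as one large kernel but with fewer parameters and more nonlinearities, which is the design principle the abstract borrows from the VGG ImageNet 2014 networks.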

Original language: English (US)
Title of host publication: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4955-4959
Number of pages: 5
Volume: 2016-May
ISBN (Electronic): 9781479999880
DOI: 10.1109/ICASSP.2016.7472620
State: Published - May 18 2016
Event: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China
Duration: Mar 20 2016 - Mar 25 2016

Other

Other: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Country: China
City: Shanghai
Period: 3/20/16 - 3/25/16


Keywords

  • Acoustic Modeling
  • Convolutional Networks
  • Multilingual
  • Neural Networks
  • Speech Recognition

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Electrical and Electronic Engineering

Cite this

Sercu, T., Puhrsch, C., Kingsbury, B., & LeCun, Y. (2016). Very deep multilingual convolutional neural networks for LVCSR. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings (Vol. 2016-May, pp. 4955-4959). [7472620] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2016.7472620
