A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition

Liang Lu, Xingxing Zhang, Kyunghyun Cho, Steve Renals

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the requirement of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus using a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demonstrating that this approach warrants further investigation.
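The encoder-decoder pattern the abstract describes — an encoder RNN compressing a sequence of acoustic vectors into a representation, and a decoder RNN emitting words conditioned on it — can be illustrated with a minimal sketch. This is not the paper's model (which uses far larger networks and is trained on Switchboard); it is an untrained toy with made-up dimensions and plain Elman-style recurrences, intended only to show the data flow from acoustic frames to a word sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real systems are far larger).
n_acoustic = 13   # e.g. one MFCC vector per acoustic frame
n_hidden = 32     # recurrent state size
n_vocab = 10      # toy output vocabulary of word IDs

# Randomly initialised parameters (an actual system would train these).
W_xh = rng.normal(0, 0.1, (n_hidden, n_acoustic))  # encoder: input -> state
W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))    # encoder recurrence
W_dh = rng.normal(0, 0.1, (n_hidden, n_hidden))    # decoder recurrence
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))    # context -> decoder state
W_yh = rng.normal(0, 0.1, (n_vocab, n_hidden))     # decoder state -> logits

def encode(frames):
    """Run the encoder RNN over acoustic frames; the final hidden state
    serves as a fixed-length context vector summarising the utterance."""
    h = np.zeros(n_hidden)
    for x in frames:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

def decode(context, max_len=5):
    """Greedy decoding: at each step the decoder state is updated from its
    previous state and the context, and the highest-scoring word is emitted."""
    h = context.copy()
    words = []
    for _ in range(max_len):
        h = np.tanh(W_dh @ h + W_ch @ context)
        logits = W_yh @ h
        words.append(int(np.argmax(logits)))
    return words

frames = rng.normal(size=(50, n_acoustic))  # a 50-frame toy "utterance"
hypothesis = decode(encode(frames))         # sequence of word IDs
```

Note that no HMM, pronunciation lexicon, or separate language model appears anywhere in this pipeline: the mapping from acoustic frames to words is carried entirely by the two recurrent networks, which is the end-to-end property the paper investigates.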

Original language: English (US)
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
Pages: 3249-3253
Number of pages: 5
Volume: 2015-January
State: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 - Sep 10 2015

Keywords

  • Deep neural networks
  • Encoder-decoder
  • End-to-end speech recognition
  • Recurrent neural networks

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Lu, L., Zhang, X., Cho, K., & Renals, S. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 3249-3253). International Speech Communication Association.

@inproceedings{2b291ee19a8e49028efc9d3d6552241c,
title = "A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition",
abstract = "Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the requirement of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus using a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demonstrating that this approach warrants further investigation.",
keywords = "Deep neural networks, Encoder-decoder, End-to-end speech recognition, Recurrent neural networks",
author = "Liang Lu and Xingxing Zhang and Kyunghyun Cho and Steve Renals",
year = "2015",
language = "English (US)",
volume = "2015-January",
pages = "3249--3253",
booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
publisher = "International Speech Communication Association",

}

TY - GEN

T1 - A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition

AU - Lu, Liang

AU - Zhang, Xingxing

AU - Cho, Kyunghyun

AU - Renals, Steve

PY - 2015

Y1 - 2015

N2 - Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the requirement of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus using a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demonstrating that this approach warrants further investigation.

AB - Deep neural networks have advanced the state-of-the-art in automatic speech recognition, when combined with hidden Markov models (HMMs). Recently there has been interest in using systems based on recurrent neural networks (RNNs) to perform sequence modelling directly, without the requirement of an HMM superstructure. In this paper, we study the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words. We investigated this approach on the Switchboard corpus using a training set of around 300 hours of transcribed audio data. Without the use of an explicit language model or pronunciation lexicon, we achieved promising recognition accuracy, demonstrating that this approach warrants further investigation.

KW - Deep neural networks

KW - Encoder-decoder

KW - End-to-end speech recognition

KW - Recurrent neural networks

UR - http://www.scopus.com/inward/record.url?scp=84959173420&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959173420&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84959173420

VL - 2015-January

SP - 3249

EP - 3253

BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

PB - International Speech Communication Association

ER -