Learning N-gram language models from uncertain data

Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, Brian Roark

Research output: Contribution to journal › Article

Abstract

We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semisupervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semisupervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer or (2) using samples from lattices with standard algorithms versus (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of video categories. We show that categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].
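
The central computation described above can be sketched compactly. Suppose that, for each n-gram, the uncertain sample (e.g., a recognition lattice) supplies a posterior probability for each candidate occurrence. If occurrences are treated as independent (a simplifying assumption made here for illustration; the paper computes these quantities over full lattices with weighted-automaton algorithms), the distribution of the n-gram's true count is Poisson-binomial, and the resulting expected counts-of-counts can stand in for the integer counts-of-counts in the standard Katz/Good-Turing discounts d_r = ((r+1)*n_{r+1}/(r*n_r) - theta) / (1 - theta), where theta = (k+1)*n_{k+1}/n_1 and k is the discount threshold. The Python sketch below illustrates this pipeline; all function names and the input format are hypothetical, and this is not the OpenGrm implementation.

    from collections import defaultdict

    def count_histogram(occurrence_probs, max_count):
        """P(the n-gram truly occurs exactly j times), for j = 0..max_count.

        occurrence_probs gives a posterior p for each candidate occurrence;
        occurrences are assumed independent (a simplification). This is the
        standard Poisson-binomial dynamic program.
        """
        hist = [1.0] + [0.0] * max_count
        for p in occurrence_probs:
            for j in range(max_count, 0, -1):
                hist[j] = hist[j] * (1.0 - p) + hist[j - 1] * p
            hist[0] *= 1.0 - p
        return hist

    def expected_counts_of_counts(ngram_posteriors, max_count=10):
        """n_bar[r] = sum over all n-grams of P(count == r), for r >= 1."""
        n_bar = defaultdict(float)
        for probs in ngram_posteriors.values():
            hist = count_histogram(probs, max_count)
            for r in range(1, max_count + 1):
                n_bar[r] += hist[r]
        return n_bar

    def katz_discounts(n_bar, k=5):
        """Katz/Good-Turing discount ratios d_r for r = 1..k, computed from
        expected (fractional) counts-of-counts instead of integer ones.

        Assumes n_bar[r] > 0 for all r <= k + 1; a real implementation must
        guard these denominators and clamp degenerate discounts.
        """
        theta = (k + 1) * n_bar[k + 1] / n_bar[1]
        return {
            r: ((r + 1) * n_bar[r + 1] / (r * n_bar[r]) - theta) / (1.0 - theta)
            for r in range(1, k + 1)
        }

    # Hypothetical usage: map each n-gram to its occurrence posteriors,
    # e.g. {("the", "cat"): [0.9, 0.7], ("the", "hat"): [0.4], ...}, then:
    #   d = katz_discounts(expected_counts_of_counts(posteriors), k=5)

As in standard Katz back-off, only n-grams with counts at or below the threshold k are discounted and the freed probability mass is redistributed through the back-off distribution; the generalization is simply that fractional, histogram-derived counts-of-counts replace the integer ones.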

Original language: English (US)
Pages (from-to): 2323-2327
Number of pages: 5
Journal: Proceedings of Interspeech 2016
Volume: 08-12-September-2016
DOIs: 10.21437/Interspeech.2016-1093
State: Published - 2016

Fingerprint

  • Uncertain Data
  • N-gram
  • Language Model
  • Baseline
  • Test Set
  • Speech Recognition
  • Histogram
  • Learning
  • Output
  • Model

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Kuznetsov, V., Liao, H., Mohri, M., Riley, M., & Roark, B. (2016). Learning N-gram language models from uncertain data. Proceedings of Interspeech 2016, 08-12-September-2016, 2323-2327. https://doi.org/10.21437/Interspeech.2016-1093

Learning N-gram language models from uncertain data. / Kuznetsov, Vitaly; Liao, Hank; Mohri, Mehryar; Riley, Michael; Roark, Brian.

In: Proceedings of Interspeech 2016, Vol. 08-12-September-2016, 2016, p. 2323-2327.

Research output: Contribution to journal › Article

Kuznetsov, V, Liao, H, Mohri, M, Riley, M & Roark, B 2016, 'Learning N-gram language models from uncertain data', Proceedings of Interspeech 2016, vol. 08-12-September-2016, pp. 2323-2327. https://doi.org/10.21437/Interspeech.2016-1093
Kuznetsov, Vitaly ; Liao, Hank ; Mohri, Mehryar ; Riley, Michael ; Roark, Brian. / Learning N-gram language models from uncertain data. In: Proceedings of Interspeech 2016. 2016 ; Vol. 08-12-September-2016. pp. 2323-2327.
@article{a1c4d40cbc1d423a8b75b518467701b7,
title = "Learning N-gram language models from uncertain data",
abstract = "We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semisupervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semisupervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer or (2) using samples from lattices with standard algorithms versus (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of video categories. We show that categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].",
author = "Vitaly Kuznetsov and Hank Liao and Mehryar Mohri and Michael Riley and Brian Roark",
year = "2016",
doi = "10.21437/Interspeech.2016-1093",
language = "English (US)",
volume = "08-12-September-2016",
pages = "2323--2327",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",

}

TY - JOUR

T1 - Learning N-gram language models from uncertain data

AU - Kuznetsov, Vitaly

AU - Liao, Hank

AU - Mohri, Mehryar

AU - Riley, Michael

AU - Roark, Brian

PY - 2016

Y1 - 2016

N2 - We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semisupervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semisupervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer or (2) using samples from lattices with standard algorithms versus (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of video categories. We show that categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].

AB - We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semisupervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semisupervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer or (2) using samples from lattices with standard algorithms versus (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm provides models that yield solid improvements over the baseline on the full test set, and, further, achieves these gains without hurting performance on any of the set of video categories. We show that categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].

UR - http://www.scopus.com/inward/record.url?scp=84994242466&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84994242466&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2016-1093

DO - 10.21437/Interspeech.2016-1093

M3 - Article

AN - SCOPUS:84994242466

VL - 08-12-September-2016

SP - 2323

EP - 2327

JO - Proceedings of Interspeech 2016

JF - Proceedings of Interspeech 2016

ER -