Discriminative topic segmentation of text and speech

Mehryar Mohri, Pedro Moreno, Eugene Weinstein

Research output: Contribution to journal › Article

Abstract

We explore automated discovery of topically coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms that employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the second algorithm using a compact support-vector-based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.
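
The idea of locating topic boundaries at extrema of a similarity signal can be illustrated with a minimal sketch. This is not the authors' method: it substitutes plain bag-of-words cosine similarity for the co-occurrence-based measure described in the abstract, and the function names, the `window` size, and the `threshold` value are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_signal(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap.

    `sentences` is a list of token lists; entry k of the result describes
    the gap between sentence k and sentence k + 1.
    """
    signal = []
    for gap in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, gap - window):gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + window] for w in s)
        signal.append(cosine(left, right))
    return signal


def boundaries(signal, threshold=0.5):
    """Return sentence indices where the signal has a local minimum below `threshold`."""
    found = []
    for i, value in enumerate(signal):
        prev = signal[i - 1] if i > 0 else float("inf")
        nxt = signal[i + 1] if i + 1 < len(signal) else float("inf")
        if value < threshold and value <= prev and value <= nxt:
            found.append(i + 1)  # the new segment starts at sentence i + 1
    return found
```

Given a list of tokenized sentences, `boundaries(similarity_signal(sentences))` returns the indices of sentences that open a new segment under these assumptions. A simple count-based similarity like this one has no defense against recognition errors or off-topic content, which is the weakness the abstract says the support-vector-based window description is meant to address.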

Original language: English (US)
Pages (from-to): 533-540
Number of pages: 8
Journal: Journal of Machine Learning Research
Volume: 9
State: Published - 2010

Fingerprint

Segmentation
Support Vector
Compact Support
Speech Recognition
Extremum
Acoustic noise
Speech
Text
Output
Demonstrate
Experiment
Similarity
Experiments

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Cite this

Discriminative topic segmentation of text and speech. / Mohri, Mehryar; Moreno, Pedro; Weinstein, Eugene.

In: Journal of Machine Learning Research, Vol. 9, 2010, p. 533-540.

@article{b0535371cbcb4815a926a7c05472d14b,
title = "Discriminative topic segmentation of text and speech",
abstract = "We explore automated discovery of topically coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms that employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the second algorithm using a compact support-vector-based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.",
author = "Mehryar Mohri and Pedro Moreno and Eugene Weinstein",
year = "2010",
language = "English (US)",
volume = "9",
pages = "533--540",
journal = "Journal of Machine Learning Research",
issn = "1532-4435",
publisher = "Microtome Publishing",

}

TY - JOUR

T1 - Discriminative topic segmentation of text and speech

AU - Mohri, Mehryar

AU - Moreno, Pedro

AU - Weinstein, Eugene

PY - 2010

Y1 - 2010

N2 - We explore automated discovery of topically coherent segments in speech or text sequences. We give two new discriminative topic segmentation algorithms that employ a new measure of text similarity based on word co-occurrence. Both algorithms function by finding extrema in the similarity signal over the text, with the second algorithm using a compact support-vector-based description of a window of text or speech observations in word similarity space to overcome noise introduced by speech recognition errors and off-topic content. In experiments over speech and text news streams, we show that these algorithms outperform previous methods. We observe that topic segmentation of speech recognizer output is a more difficult problem than that of text streams; however, we demonstrate that by using a lattice of competing hypotheses rather than just the one-best hypothesis as input to the segmentation algorithm, the performance of the algorithm can be improved.

UR - http://www.scopus.com/inward/record.url?scp=84862272190&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862272190&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84862272190

VL - 9

SP - 533

EP - 540

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

SN - 1532-4435

ER -