A corpus and phonetic dictionary for Tunisian Arabic speech recognition

Abir Masmoudi, Mariem Ellouze Khemakhem, Yannick Estève, Lamia Hadrich Belguith, Nizar Habash

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we describe an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method proposed in this paper for automatically generating a phonetic dictionary is rule-based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the grapheme-to-phoneme mapping is around 9%.
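
The rule-based approach summarized in the abstract (pronunciation rules backed by a lexicon of exceptions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rules, the exception entries, the Latin-script input, and the phoneme symbols are hypothetical and are not taken from the paper. The error measure is also an assumed reading of "word error rate", namely the fraction of words whose generated pronunciation differs from a reference.

```python
from typing import Dict, List, Tuple

# Hypothetical exception lexicon: words whose pronunciation is listed
# explicitly because the rules below would get them wrong.
EXCEPTIONS: Dict[str, List[str]] = {
    "metro": ["m", "i", "t", "r", "u"],
}

# Hypothetical ordered grapheme-to-phoneme rules. Digraphs are listed
# first, and the first rule that matches at a position is applied.
RULES: List[Tuple[str, List[str]]] = [
    ("sh", ["S"]),
    ("kh", ["x"]),
]


def g2p(word: str) -> List[str]:
    """Return a phoneme list for `word`: exceptions first, then rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phones: List[str] = []
    i = 0
    while i < len(word):
        for grapheme, phonemes in RULES:
            if word.startswith(grapheme, i):
                phones.extend(phonemes)
                i += len(grapheme)
                break
        else:
            # No rule matched: pass the single grapheme through unchanged.
            phones.append(word[i])
            i += 1
    return phones


def word_error_rate(predicted: Dict[str, List[str]],
                    reference: Dict[str, List[str]]) -> float:
    """Fraction of reference words whose predicted phonemes differ."""
    if not reference:
        raise ValueError("reference dictionary is empty")
    errors = sum(1 for w, ref in reference.items() if predicted.get(w) != ref)
    return errors / len(reference)


# Build a small pronunciation dictionary from a hypothetical word list.
dictionary = {w: g2p(w) for w in ["sharq", "khubz", "metro"]}
```

Checking the exception lexicon before the rules keeps irregular words (for example loanwords) out of the rule set, so the rules can stay simple and general.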

Original language: English (US)
Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Publisher: European Language Resources Association (ELRA)
Pages: 306-310
Number of pages: 5
ISBN (Electronic): 9782951740884
State: Published - Jan 1 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: May 26 2014 to May 31 2014

Keywords

  • Grapheme-to-phoneme
  • Phonetic dictionary
  • Speech recognition
  • Tunisian Arabic

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Cite this

Masmoudi, A., Khemakhem, M. E., Estève, Y., Belguith, L. H., & Habash, N. (2014). A corpus and phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 306-310). European Language Resources Association (ELRA).

@inproceedings{d7e585583a42493cb34c436806a8b327,
title = "A corpus and phonetic dictionary for Tunisian Arabic speech recognition",
abstract = "In this paper we describe an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method proposed in this paper for automatically generating a phonetic dictionary is rule-based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the grapheme-to-phoneme mapping is around 9{\%}.",
keywords = "Grapheme-to-phoneme, Phonetic dictionary, Speech recognition, Tunisian Arabic",
author = "Abir Masmoudi and Khemakhem, {Mariem Ellouze} and Yannick Est{\`e}ve and Belguith, {Lamia Hadrich} and Nizar Habash",
year = "2014",
month = "1",
day = "1",
language = "English (US)",
pages = "306--310",
booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN
T1 - A corpus and phonetic dictionary for Tunisian Arabic speech recognition
AU - Masmoudi, Abir
AU - Khemakhem, Mariem Ellouze
AU - Estève, Yannick
AU - Belguith, Lamia Hadrich
AU - Habash, Nizar
PY - 2014/1/1
Y1 - 2014/1/1
AB - In this paper we describe an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method proposed in this paper for automatically generating a phonetic dictionary is rule-based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the grapheme-to-phoneme mapping is around 9%.
KW - Grapheme-to-phoneme
KW - Phonetic dictionary
KW - Speech recognition
KW - Tunisian Arabic
UR - http://www.scopus.com/inward/record.url?scp=85037106055&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037106055&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85037106055
SP - 306
EP - 310
BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
PB - European Language Resources Association (ELRA)
ER -