A corpus and phonetic dictionary for tunisian Arabic speech recognition

Abir Masmoudi, Mariem Ellouze Khemakhem, Yannick Estève, Lamia Hadrich Belguith, Nizar Habash

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper, to automatically generate a phonetic dictionary, is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.

Original languageEnglish (US)
Title of host publicationProceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
EditorsNicoletta Calzolari, Khalid Choukri, Sara Goggi, Thierry Declerck, Joseph Mariani, Bente Maegaard, Asuncion Moreno, Jan Odijk, Helene Mazo, Stelios Piperidis, Hrafn Loftsson
PublisherEuropean Language Resources Association (ELRA)
Pages306-310
Number of pages5
ISBN (Electronic)9782951740884
StatePublished - Jan 1 2014
Event9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: May 26 2014May 31 2014

Publication series

NameProceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014

Other

Other9th International Conference on Language Resources and Evaluation, LREC 2014
CountryIceland
CityReykjavik
Period5/26/145/31/14

    Fingerprint

Keywords

  • Grapheme-to-phoneme
  • Phonetic dictionary
  • Speech recognition
  • Tunisian Arabic

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Cite this

Masmoudi, A., Khemakhem, M. E., Estève, Y., Belguith, L. H., & Habash, N. (2014). A corpus and phonetic dictionary for tunisian Arabic speech recognition. In N. Calzolari, K. Choukri, S. Goggi, T. Declerck, J. Mariani, B. Maegaard, A. Moreno, J. Odijk, H. Mazo, S. Piperidis, & H. Loftsson (Eds.), Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 306-310). (Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014). European Language Resources Association (ELRA).