A corpus and phonetic dictionary for Tunisian Arabic speech recognition

Abir Masmoudi, Mariem Ellouze Khemakhem, Yannick Estève, Lamia Hadrich Belguith, Nizar Habash

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    In this paper we describe an effort to create a corpus and a phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models. The method we propose for automatically generating the phonetic dictionary is rule based: we define a set of pronunciation rules and a lexicon of exceptions. To assess the performance of our phonetic rules, we evaluate the pronunciation dictionary on two types of corpora. The word error rate of the word-level grapheme-to-phoneme mapping is around 9%.
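
    The rule-plus-exceptions approach summarized in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the Latin-script toy words, the rewrite rules, the phoneme symbols, the exception entry, and the function names `g2p` and `word_error_rate` are hypothetical and are not the rules, lexicon, or evaluation code from the paper. The error measure is also an assumed reading of the abstract, namely the fraction of words whose generated pronunciation differs from a reference.

```python
# Illustrative sketch only: rules, symbols, and entries are placeholders,
# not the pronunciation rules or exception lexicon defined in the paper.

# Exception lexicon: words whose pronunciation is listed directly
# because the rules below would get them wrong.
EXCEPTIONS = {
    "taxi": ["t", "a", "k", "s", "i"],
}

# Rewrite rules: grapheme sequence -> phoneme sequence (toy examples).
RULES = {
    "sh": ["S"],
    "th": ["T"],
    "aa": ["a:"],
}

MAX_RULE_LEN = max(len(graphemes) for graphemes in RULES)


def g2p(word):
    """Return a phoneme list for `word`: exceptions first, then rules."""
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phones = []
    i = 0
    while i < len(word):
        # Try the longest grapheme sequence first.
        for n in range(MAX_RULE_LEN, 0, -1):
            chunk = word[i:i + n]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: map the grapheme to itself.
            phones.append(word[i])
            i += 1
    return phones


def word_error_rate(reference):
    """Fraction of words whose generated pronunciation differs from the reference."""
    if not reference:
        raise ValueError("reference dictionary is empty")
    errors = sum(1 for word, phones in reference.items() if g2p(word) != phones)
    return errors / len(reference)


if __name__ == "__main__":
    # Hypothetical reference pronunciations for the toy words.
    reference = {
        "shaab": ["S", "a:", "b"],
        "taxi": ["t", "a", "k", "s", "i"],
        "that": ["t", "h", "a", "t"],  # the toy rules get this one wrong
    }
    for word in reference:
        print(word, " ".join(g2p(word)))
    print(f"word error rate: {word_error_rate(reference):.2%}")
```

    Checking the exception lexicon before applying any rule keeps irregular words out of the rule set, which is the usual motivation for pairing the two components.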

    Original language: English (US)
    Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
    Publisher: European Language Resources Association (ELRA)
    Pages: 306-310
    Number of pages: 5
    ISBN (Electronic): 9782951740884
    State: Published - Jan 1 2014
    Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
    Duration: May 26 2014 - May 31 2014

    Keywords

    • Grapheme-to-phoneme
    • Phonetic dictionary
    • Speech recognition
    • Tunisian Arabic

    ASJC Scopus subject areas

    • Linguistics and Language
    • Library and Information Sciences
    • Education
    • Language and Linguistics

    Cite this

    Masmoudi, A., Khemakhem, M. E., Estève, Y., Belguith, L. H., & Habash, N. (2014). A corpus and phonetic dictionary for Tunisian Arabic speech recognition. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 306-310). European Language Resources Association (ELRA).

    @inproceedings{d7e585583a42493cb34c436806a8b327,
    title = "A corpus and phonetic dictionary for Tunisian Arabic speech recognition",
    keywords = "Grapheme-to-phoneme, Phonetic dictionary, Speech recognition, Tunisian Arabic",
    author = "Abir Masmoudi and Khemakhem, {Mariem Ellouze} and Yannick Est{\`e}ve and Belguith, {Lamia Hadrich} and Nizar Habash",
    year = "2014",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "306--310",
    booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
    publisher = "European Language Resources Association (ELRA)",

    }

    TY - GEN

    T1 - A corpus and phonetic dictionary for Tunisian Arabic speech recognition

    AU - Masmoudi, Abir

    AU - Khemakhem, Mariem Ellouze

    AU - Estève, Yannick

    AU - Belguith, Lamia Hadrich

    AU - Habash, Nizar

    PY - 2014/1/1

    Y1 - 2014/1/1

    KW - Grapheme-to-phoneme

    KW - Phonetic dictionary

    KW - Speech recognition

    KW - Tunisian Arabic

    UR - http://www.scopus.com/inward/record.url?scp=85037106055&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85037106055&partnerID=8YFLogxK

    M3 - Conference contribution

    SP - 306

    EP - 310

    BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014

    PB - European Language Resources Association (ELRA)

    ER -