A multidialectal parallel corpus of Arabic

Houda Bouamor, Nizar Habash, Kemal Oflazer

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

    Original language: English (US)
    Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
    Publisher: European Language Resources Association (ELRA)
    Pages: 1240-1245
    Number of pages: 6
    ISBN (Electronic): 9782951740884
    State: Published - Jan 1 2014
    Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
    Duration: May 26 2014 to May 31 2014

    Fingerprint

    dialect
    colloquial
    Arab
    speaking
    Parallel Corpora
    art
    communication
    Arabic Dialects
    resources
    community

    Keywords

    • Arabic
    • Dialects
    • Parallel Corpus

    ASJC Scopus subject areas

    • Linguistics and Language
    • Library and Information Sciences
    • Education
    • Language and Linguistics

    Cite this

    Bouamor, H., Habash, N., & Oflazer, K. (2014). A multidialectal parallel corpus of Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 1240-1245). European Language Resources Association (ELRA).

    @inproceedings{9b4a95aadfa24ff4971b1d47ab89ef23,
    title = "A multidialectal parallel corpus of Arabic",
    abstract = "The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.",
    keywords = "Arabic, Dialects, Parallel Corpus",
    author = "Houda Bouamor and Nizar Habash and Kemal Oflazer",
    year = "2014",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "1240--1245",
    booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
    publisher = "European Language Resources Association (ELRA)",

    }

    TY - GEN
    T1 - A multidialectal parallel corpus of Arabic
    AU - Bouamor, Houda
    AU - Habash, Nizar
    AU - Oflazer, Kemal
    PY - 2014/1/1
    Y1 - 2014/1/1
    AB - The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.
    KW - Arabic
    KW - Dialects
    KW - Parallel Corpus
    UR - http://www.scopus.com/inward/record.url?scp=85026863473&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=85026863473&partnerID=8YFLogxK
    M3 - Conference contribution
    SP - 1240
    EP - 1245
    BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
    PB - European Language Resources Association (ELRA)
    ER -