Curras: an annotated corpus for the Palestinian Arabic dialect

Mustafa Jarrar, Nizar Habash, Faeq Alrimawi, Diyam Akra, Nasser Zalmout

    Research output: Contribution to journal › Article

    Abstract

    In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.

    Original language: English (US)
    Pages (from-to): 745-775
    Number of pages: 31
    Journal: Language Resources and Evaluation
    Volume: 51
    Issue number: 3
    DOI: https://doi.org/10.1007/s10579-016-9370-7
    State: Published - Sep 1 2017

    Keywords

    • Arabic morphology
    • Conventional Orthography for Dialectal Arabic
    • Dialectal Arabic
    • Palestinian Arabic
    • Palestinian corpus
    • Word annotation

    ASJC Scopus subject areas

    • Language and Linguistics
    • Education
    • Linguistics and Language
    • Library and Information Sciences

    Cite this

    Curras: an annotated corpus for the Palestinian Arabic dialect. / Jarrar, Mustafa; Habash, Nizar; Alrimawi, Faeq; Akra, Diyam; Zalmout, Nasser.

    In: Language Resources and Evaluation, Vol. 51, No. 3, 01.09.2017, p. 745-775.

    Jarrar, M, Habash, N, Alrimawi, F, Akra, D & Zalmout, N 2017, 'Curras: an annotated corpus for the Palestinian Arabic dialect', Language Resources and Evaluation, vol. 51, no. 3, pp. 745-775. https://doi.org/10.1007/s10579-016-9370-7
    @article{0fb38f8d82b74d24ad061690ff57691d,
    title = "Curras: an annotated corpus for the Palestinian Arabic dialect",
    keywords = "Arabic morphology, Conventional Orthography for Dialectal Arabic, Dialectal Arabic, Palestinian Arabic, Palestinian corpus, Word annotation",
    author = "Mustafa Jarrar and Nizar Habash and Faeq Alrimawi and Diyam Akra and Nasser Zalmout",
    year = "2017",
    month = "9",
    day = "1",
    doi = "10.1007/s10579-016-9370-7",
    language = "English (US)",
    volume = "51",
    pages = "745--775",
    journal = "Language Resources and Evaluation",
    issn = "1574-020X",
    publisher = "Springer Netherlands",
    number = "3",
    }

    TY  - JOUR
    T1  - Curras
    T2  - an annotated corpus for the Palestinian Arabic dialect
    AU  - Jarrar, Mustafa
    AU  - Habash, Nizar
    AU  - Alrimawi, Faeq
    AU  - Akra, Diyam
    AU  - Zalmout, Nasser
    PY  - 2017/9/1
    Y1  - 2017/9/1
    AB  - In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.
    KW  - Arabic morphology
    KW  - Conventional Orthography for Dialectal Arabic
    KW  - Dialectal Arabic
    KW  - Palestinian Arabic
    KW  - Palestinian corpus
    KW  - Word annotation
    UR  - http://www.scopus.com/inward/record.url?scp=85001544989&partnerID=8YFLogxK
    UR  - http://www.scopus.com/inward/citedby.url?scp=85001544989&partnerID=8YFLogxK
    U2  - 10.1007/s10579-016-9370-7
    DO  - 10.1007/s10579-016-9370-7
    M3  - Article
    VL  - 51
    SP  - 745
    EP  - 775
    JO  - Language Resources and Evaluation
    JF  - Language Resources and Evaluation
    SN  - 1574-020X
    IS  - 3
    ER  -