Curras: an annotated corpus for the Palestinian Arabic dialect

Mustafa Jarrar, Nizar Habash, Faeq Alrimawi, Diyam Akra, Nasser Zalmout

Research output: Contribution to journal, Article

Abstract

In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.

Original language: English (US)
Pages (from-to): 745-775
Number of pages: 31
Journal: Language Resources and Evaluation
Volume: 51
Issue number: 3
DOI: 10.1007/s10579-016-9370-7
State: Published - Sep 1 2017

Keywords

  • Arabic morphology
  • Conventional Orthography for Dialectal Arabic
  • Dialectal Arabic
  • Palestinian Arabic
  • Palestinian corpus
  • Word annotation

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences

Cite this

Curras: an annotated corpus for the Palestinian Arabic dialect. / Jarrar, Mustafa; Habash, Nizar; Alrimawi, Faeq; Akra, Diyam; Zalmout, Nasser.

In: Language Resources and Evaluation, Vol. 51, No. 3, 01.09.2017, p. 745-775.


Jarrar, Mustafa; Habash, Nizar; Alrimawi, Faeq; Akra, Diyam; Zalmout, Nasser. / Curras: an annotated corpus for the Palestinian Arabic dialect. In: Language Resources and Evaluation. 2017; Vol. 51, No. 3. pp. 745-775.
@article{0fb38f8d82b74d24ad061690ff57691d,
title = "Curras: an annotated corpus for the Palestinian Arabic dialect",
abstract = "In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.",
keywords = "Arabic morphology, Conventional Orthography for Dialectal Arabic, Dialectal Arabic, Palestinian Arabic, Palestinian corpus, Word annotation",
author = "Mustafa Jarrar and Nizar Habash and Faeq Alrimawi and Diyam Akra and Nasser Zalmout",
year = "2017",
month = "9",
day = "1",
doi = "10.1007/s10579-016-9370-7",
language = "English (US)",
volume = "51",
pages = "745--775",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
number = "3",

}

TY - JOUR
T1 - Curras
T2 - an annotated corpus for the Palestinian Arabic dialect
AU - Jarrar, Mustafa
AU - Habash, Nizar
AU - Alrimawi, Faeq
AU - Akra, Diyam
AU - Zalmout, Nasser
PY - 2017/9/1
Y1 - 2017/9/1

AB - In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We start in the article with a background description that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56 K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.

KW - Arabic morphology
KW - Conventional Orthography for Dialectal Arabic
KW - Dialectal Arabic
KW - Palestinian Arabic
KW - Palestinian corpus
KW - Word annotation
UR - http://www.scopus.com/inward/record.url?scp=85001544989&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85001544989&partnerID=8YFLogxK
U2 - 10.1007/s10579-016-9370-7
DO - 10.1007/s10579-016-9370-7
M3 - Article
AN - SCOPUS:85001544989
VL - 51
SP - 745
EP - 775
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
SN - 1574-020X
IS - 3
ER -