Supervised collaboration for syntactic annotation of Quranic Arabic

Kais Dukes, Eric Atwell, Nizar Habash

    Research output: Contribution to journal › Review article

    Abstract

    The Quranic Arabic Corpus (http://corpus.quran.com) is a collaboratively constructed linguistic resource initiated at the University of Leeds, with multiple layers of annotation including part-of-speech tagging, morphological segmentation (Dukes and Habash 2010) and syntactic analysis using dependency grammar (Dukes and Buckwalter 2010). The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400-year-old central religious text of Islam. This project contrasts with other Arabic treebanks by providing a deep linguistic model based on the historical traditional grammar known as i′rāb (إعراب). By adapting this well-known canon of Quranic grammar into a familiar tagset, it is possible to encourage online annotation by Arabic linguists and Quranic experts. This article presents a new approach to linguistic annotation of an Arabic corpus: online supervised collaboration using a multi-stage approach. The different stages include automatic rule-based tagging, initial manual verification, and online supervised collaborative proofreading. A popular website attracting thousands of visitors per day, the Quranic Arabic Corpus has approximately 100 unpaid volunteer annotators, each suggesting corrections to existing linguistic tagging. To ensure a high-quality resource, a small number of expert annotators are promoted to a supervisory role, allowing them to review or veto suggestions made by other collaborators. The Quran also benefits from a large body of existing historical grammatical analysis, which may be leveraged during this review. In this paper we evaluate and report on the effectiveness of the chosen annotation methodology. We also discuss the unique challenges of annotating Quranic Arabic online and describe the custom linguistic software used to aid collaborative annotation.

    Original language: English (US)
    Pages (from-to): 33-62
    Number of pages: 30
    Journal: Language Resources and Evaluation
    Volume: 47
    Issue number: 1
    DOI: 10.1007/s10579-011-9167-7
    State: Published - Mar 12 2013

    Fingerprint

    linguistics
    grammar
    resources
    expert
    historical analysis
    Islam
    website
    Annotation
    Quran
    Syntax
    methodology
    Tagging

    Keywords

    • Arabic
    • Collaborative annotation
    • Corpus
    • Quran
    • Treebank

    ASJC Scopus subject areas

    • Language and Linguistics
    • Education
    • Linguistics and Language
    • Library and Information Sciences

    Cite this

    Supervised collaboration for syntactic annotation of Quranic Arabic. / Dukes, Kais; Atwell, Eric; Habash, Nizar.

    In: Language Resources and Evaluation, Vol. 47, No. 1, 12.03.2013, p. 33-62.


    @article{47d389ee807045bdbe505860c5c3e654,
    title = "Supervised collaboration for syntactic annotation of Quranic Arabic",
    keywords = "Arabic, Collaborative annotation, Corpus, Quran, Treebank",
    author = "Kais Dukes and Eric Atwell and Nizar Habash",
    year = "2013",
    month = "3",
    day = "12",
    doi = "10.1007/s10579-011-9167-7",
    language = "English (US)",
    volume = "47",
    pages = "33--62",
    journal = "Language Resources and Evaluation",
    issn = "1574-020X",
    publisher = "Springer Netherlands",
    number = "1",

    }

    TY - JOUR

    T1 - Supervised collaboration for syntactic annotation of Quranic Arabic

    AU - Dukes, Kais

    AU - Atwell, Eric

    AU - Habash, Nizar

    PY - 2013/3/12

    Y1 - 2013/3/12

    KW - Arabic

    KW - Collaborative annotation

    KW - Corpus

    KW - Quran

    KW - Treebank

    UR - http://www.scopus.com/inward/record.url?scp=84874724008&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84874724008&partnerID=8YFLogxK

    U2 - 10.1007/s10579-011-9167-7

    DO - 10.1007/s10579-011-9167-7

    M3 - Review article

    VL - 47

    SP - 33

    EP - 62

    JO - Language Resources and Evaluation

    JF - Language Resources and Evaluation

    SN - 1574-020X

    IS - 1

    ER -