Building an Arabic machine translation post-edited corpus

Guidelines and annotation

Wajdi Zaghouani, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor, Kemal Oflazer

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    We present our guidelines and annotation procedure to create a human-corrected, post-edited machine translation corpus for Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure a high annotation agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement measures were performed to check the annotation quality. The created corpus of manually post-edited translations of English-to-Arabic articles is the largest to date for this language pair.

    Original language: English (US)
    Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
    Publisher: European Language Resources Association (ELRA)
    Pages: 1869-1876
    Number of pages: 8
    ISBN (Electronic): 9782951740891
    State: Published - Jan 1 2016
    Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
    Duration: May 23 2016 to May 28 2016

    Other

    Other: 10th International Conference on Language Resources and Evaluation, LREC 2016
    Country: Slovenia
    City: Portoroz
    Period: 5/23/16 to 5/28/16

    Keywords

    • Annotation
    • Guidelines
    • Post-editing

    ASJC Scopus subject areas

    • Linguistics and Language
    • Library and Information Sciences
    • Language and Linguistics
    • Education

    Cite this

    Zaghouani, W., Habash, N., Obeid, O., Mohit, B., Bouamor, H., & Oflazer, K. (2016). Building an Arabic machine translation post-edited corpus: Guidelines and annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 1869-1876). European Language Resources Association (ELRA).

    @inproceedings{f395e42b909e4ee596ce549bd174c060,
    title = "Building an Arabic machine translation post-edited corpus: Guidelines and annotation",
    abstract = "We present our guidelines and annotation procedure to create a human corrected machine translated post-edited corpus for the Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure a high annotation agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement measures were performed to check the annotation quality. The created corpus of manual post-edited translations of English to Arabic articles is the largest to date for this language pair.",
    keywords = "Annotation, Guidelines, Post-editing",
    author = "Wajdi Zaghouani and Nizar Habash and Ossama Obeid and Behrang Mohit and Houda Bouamor and Kemal Oflazer",
    year = "2016",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "1869--1876",
    booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",
    publisher = "European Language Resources Association (ELRA)",

    }

    TY - GEN
    T1 - Building an Arabic machine translation post-edited corpus
    T2 - Guidelines and annotation
    AU - Zaghouani, Wajdi
    AU - Habash, Nizar
    AU - Obeid, Ossama
    AU - Mohit, Behrang
    AU - Bouamor, Houda
    AU - Oflazer, Kemal
    PY - 2016/1/1
    Y1 - 2016/1/1
    N2 - We present our guidelines and annotation procedure to create a human corrected machine translated post-edited corpus for the Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure a high annotation agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement measures were performed to check the annotation quality. The created corpus of manual post-edited translations of English to Arabic articles is the largest to date for this language pair.
    KW - Annotation
    KW - Guidelines
    KW - Post-editing
    UR - http://www.scopus.com/inward/record.url?scp=85037070308&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=85037070308&partnerID=8YFLogxK
    M3 - Conference contribution
    SP - 1869
    EP - 1876
    BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
    PB - European Language Resources Association (ELRA)
    ER -