Exploiting Arabic diacritization for high quality automatic annotation

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present a novel technique for Arabic morphological annotation. The technique uses diacritization to produce morphological annotations whose quality is comparable to that of human annotators. Although Arabic text is generally written without diacritics, diacritized text is already available in large corpora covering several genres. Furthermore, diacritization can be produced at low cost for new text, since it requires no specialized training beyond what educated Arabic typists already know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to improve its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.
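
The idea of enriching an analyzer's input with diacritics can be illustrated with a minimal sketch. This is not the authors' system: the candidate list, the `Analysis` class, the lemma notation, and the function names are all hypothetical, and words are written in Buckwalter-style transliteration to keep the example in ASCII. The sketch assumes that an analyzer has already proposed several analyses for the undiacritized word, and it keeps only those whose diacritized form agrees with whatever diacritics (full or partial) the input supplies.

```python
from dataclasses import dataclass

# Buckwalter-style diacritic symbols (short vowels, sukun, shadda,
# tanween, dagger alif). Every other character is treated as a letter.
DIACRITICS = set("aiuo~FNK`")


@dataclass(frozen=True)
class Analysis:
    """One hypothetical analyzer output for a word."""
    diac: str   # fully diacritized form
    lemma: str
    pos: str


def split_letters(word):
    """Split a word into (letter, attached diacritics) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            letter, marks = units[-1]
            units[-1] = (letter, marks + ch)
        else:
            units.append((ch, ""))
    return units


def is_compatible(observed, candidate):
    """True if the candidate has the same letters as the observed word and
    carries every diacritic that the observed word supplies."""
    obs = split_letters(observed)
    cand = split_letters(candidate)
    if [letter for letter, _ in obs] != [letter for letter, _ in cand]:
        return False
    return all(set(o) <= set(c) for (_, o), (_, c) in zip(obs, cand))


def filter_analyses(observed, analyses):
    """Keep only the analyses consistent with the observed diacritics."""
    return [a for a in analyses if is_compatible(observed, a.diac)]


# Hypothetical candidate analyses for the undiacritized word "ktb".
CANDIDATES = [
    Analysis("kataba", "katab", "verb"),   # 'he wrote'
    Analysis("kutiba", "katab", "verb"),   # 'it was written'
    Analysis("kutub", "kitAb", "noun"),    # 'books'
]

if __name__ == "__main__":
    for word in ("ktb", "kutb", "kutub"):
        kept = [a.diac for a in filter_analyses(word, CANDIDATES)]
        print(word, "->", kept)
```

With no diacritics (`ktb`) all three candidates survive; one diacritic (`kutb`) narrows the set to two; the fully diacritized `kutub` leaves a single analysis. Treating the observed diacritics as a subset constraint is a design choice of this sketch, made so that partially diacritized input can use the same check as fully diacritized input.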

Original language: English (US)
Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Publisher: European Language Resources Association (ELRA)
Pages: 4298-4304
Number of pages: 7
ISBN (Electronic): 9782951740891
State: Published - Jan 1 2016
Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: May 23 2016 to May 28 2016

Keywords

  • Annotation
  • Arabic
  • Diacritization
  • Morphology

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Language and Linguistics
  • Education

Cite this

Habash, N., Shahrour, A., & Al-Khalil, M. (2016). Exploiting Arabic diacritization for high quality automatic annotation. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 4298-4304). European Language Resources Association (ELRA).

@inproceedings{fd6d253a6f314c8f85e1806fa17f91ed,
title = "Exploiting Arabic diacritization for high quality automatic annotation",
abstract = "We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97{\%} on lemma, part-of-speech, and tokenization combined.",
keywords = "Annotation, Arabic, Diacritization, Morphology",
author = "Nizar Habash and Anas Shahrour and Muhamed Al-Khalil",
year = "2016",
month = "1",
day = "1",
language = "English (US)",
pages = "4298--4304",
booktitle = "Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN
T1 - Exploiting Arabic diacritization for high quality automatic annotation
AU - Habash, Nizar
AU - Shahrour, Anas
AU - Al-Khalil, Muhamed
PY - 2016/1/1
Y1 - 2016/1/1
AB - We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.
KW - Annotation
KW - Arabic
KW - Diacritization
KW - Morphology
UR - http://www.scopus.com/inward/record.url?scp=85037136978&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037136978&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85037136978
SP - 4298
EP - 4304
BT - Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
PB - European Language Resources Association (ELRA)
ER -