Determining case in Arabic: Learning complex linguistic behavior requires complex linguistic features

Nizar Habash, Ryan Gabbard, Owen Rambow, Seth Kulick, Mitch Marcus

Research output: Contribution to conference (Paper)

Abstract

This paper discusses automatic determination of case in Arabic. This task is a major source of errors in full diacritization of Arabic. We use a gold-standard syntactic tree, and obtain an error rate of about 4.2%, with a machine learning based system outperforming a system using hand-written rules. A careful error analysis suggests that when we account for annotation errors in the gold standard, the error rate drops to 0.8%, with the hand-written rules outperforming the machine learning-based system.

Original language: English (US)
Pages: 1084-1092
Number of pages: 9
State: Published - Dec 1 2007
Event: 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007 - Prague, Czech Republic
Duration: Jun 28 2007 to Jun 28 2007

Other

Other: 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007
Country: Czech Republic
City: Prague
Period: 6/28/07 to 6/28/07

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
  • Artificial Intelligence
  • Human-Computer Interaction
  • Linguistics and Language

Cite this

Habash, N., Gabbard, R., Rambow, O., Kulick, S., & Marcus, M. (2007). Determining case in Arabic: Learning complex linguistic behavior requires complex linguistic features. 1084-1092. Paper presented at 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007, Prague, Czech Republic.

@conference{d2cddd055cb64ccb8b03c3d1b8222811,
title = "Determining case in Arabic: Learning complex linguistic behavior requires complex linguistic features",
abstract = "This paper discusses automatic determination of case in Arabic. This task is a major source of errors in full diacritization of Arabic. We use a gold-standard syntactic tree, and obtain an error rate of about 4.2{\%}, with a machine learning based system outperforming a system using hand-written rules. A careful error analysis suggests that when we account for annotation errors in the gold standard, the error rate drops to 0.8{\%}, with the hand-written rules outperforming the machine learning-based system.",
author = "Nizar Habash and Ryan Gabbard and Owen Rambow and Seth Kulick and Mitch Marcus",
year = "2007",
month = "12",
day = "1",
language = "English (US)",
pages = "1084--1092",
note = "2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007 ; Conference date: 28-06-2007 Through 28-06-2007",

}

TY - CONF
T1 - Determining case in Arabic
T2 - Learning complex linguistic behavior requires complex linguistic features
AU - Habash, Nizar
AU - Gabbard, Ryan
AU - Rambow, Owen
AU - Kulick, Seth
AU - Marcus, Mitch
PY - 2007/12/1
Y1 - 2007/12/1
N2 - This paper discusses automatic determination of case in Arabic. This task is a major source of errors in full diacritization of Arabic. We use a gold-standard syntactic tree, and obtain an error rate of about 4.2%, with a machine learning based system outperforming a system using hand-written rules. A careful error analysis suggests that when we account for annotation errors in the gold standard, the error rate drops to 0.8%, with the hand-written rules outperforming the machine learning-based system.
UR - http://www.scopus.com/inward/record.url?scp=55649104405&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=55649104405&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:55649104405
SP - 1084
EP - 1092
ER -