Developing and using a pilot dialectal Arabic treebank

Mohamed Maamouri, Ann Bies, Tim Buckwalter, Mona Diab, Nizar Habash, Owen Rambow, Dalila Tabessi

Research output: Contribution to conference (Paper)

Abstract

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. We also describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our preexisting MSA resources and the new dialectal corpus.

Original language: English (US)
Pages: 443-448
Number of pages: 6
State: Published - Jan 1 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: May 22 2006 to May 28 2006


Fingerprint

dialect
linguistics
resources
Treebank
telephone
Arabic Dialects
Resources
Syntax
Parsers
Parsing
language
evaluation
experience

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Maamouri, M., Bies, A., Buckwalter, T., Diab, M., Habash, N., Rambow, O., & Tabessi, D. (2006). Developing and using a pilot dialectal Arabic treebank. 443-448. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.

@conference{bd3f8117e2404f498826f156a06a1c4d,
title = "Developing and using a pilot dialectal Arabic treebank",
author = "Mohamed Maamouri and Ann Bies and Tim Buckwalter and Mona Diab and Nizar Habash and Owen Rambow and Dalila Tabessi",
year = "2006",
month = "1",
day = "1",
language = "English (US)",
pages = "443--448",
note = "5th International Conference on Language Resources and Evaluation, LREC 2006 ; Conference date: 22-05-2006 Through 28-05-2006",

}

TY - CONF

T1 - Developing and using a pilot dialectal Arabic treebank

AU - Maamouri, Mohamed

AU - Bies, Ann

AU - Buckwalter, Tim

AU - Diab, Mona

AU - Habash, Nizar

AU - Rambow, Owen

AU - Tabessi, Dalila

PY - 2006/1/1

Y1 - 2006/1/1

UR - http://www.scopus.com/inward/record.url?scp=84942663602&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84942663602&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:84942663602

SP - 443

EP - 448

ER -