The MADAR Arabic dialect corpus and lexicon

Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, Kemal Oflazer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present two resources that were created as part of the Multi Arabic Dialect Applications and Resources (MADAR) project. The first is a large parallel corpus of 25 Arabic city dialects in the travel domain. The second is a lexicon of 1,045 concepts with an average of 45 words from 25 cities per concept. These resources are the first of their kind in terms of the breadth of their coverage and the fine location granularity. The focus on cities, as opposed to regions in studying Arabic dialects, opens new avenues to many areas of research from dialectology to dialect identification and machine translation.

Original language: English (US)
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 3387-3396
Number of pages: 10
ISBN (Electronic): 9791095546009
State: Published - Jan 1 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018 → May 12 2018

Other

Other: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 5/7/18 → 5/12/18

Keywords

  • Arabic Dialects
  • Lexicon
  • Parallel Corpus

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., ... Oflazer, K. (2019). The MADAR Arabic dialect corpus and lexicon. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 3387-3396). European Language Resources Association (ELRA).

@inproceedings{5dd17e80b34245b5bf928b67e2ef1b88,
title = "The {MADAR} Arabic dialect corpus and lexicon",
abstract = "In this paper, we present two resources that were created as part of the Multi Arabic Dialect Applications and Resources (MADAR) project. The first is a large parallel corpus of 25 Arabic city dialects in the travel domain. The second is a lexicon of 1,045 concepts with an average of 45 words from 25 cities per concept. These resources are the first of their kind in terms of the breadth of their coverage and the fine location granularity. The focus on cities, as opposed to regions in studying Arabic dialects, opens new avenues to many areas of research from dialectology to dialect identification and machine translation.",
keywords = "Arabic Dialects, Lexicon, Parallel Corpus",
author = "Houda Bouamor and Nizar Habash and Mohammad Salameh and Wajdi Zaghouani and Owen Rambow and Dana Abdulrahim and Ossama Obeid and Salam Khalifa and Fadhl Eryani and Alexander Erdmann and Kemal Oflazer",
year = "2019",
month = "1",
day = "1",
language = "English (US)",
pages = "3387--3396",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN
T1 - The MADAR Arabic dialect corpus and lexicon
AU - Bouamor, Houda
AU - Habash, Nizar
AU - Salameh, Mohammad
AU - Zaghouani, Wajdi
AU - Rambow, Owen
AU - Abdulrahim, Dana
AU - Obeid, Ossama
AU - Khalifa, Salam
AU - Eryani, Fadhl
AU - Erdmann, Alexander
AU - Oflazer, Kemal
PY - 2019/1/1
Y1 - 2019/1/1
N2 - In this paper, we present two resources that were created as part of the Multi Arabic Dialect Applications and Resources (MADAR) project. The first is a large parallel corpus of 25 Arabic city dialects in the travel domain. The second is a lexicon of 1,045 concepts with an average of 45 words from 25 cities per concept. These resources are the first of their kind in terms of the breadth of their coverage and the fine location granularity. The focus on cities, as opposed to regions in studying Arabic dialects, opens new avenues to many areas of research from dialectology to dialect identification and machine translation.
AB - In this paper, we present two resources that were created as part of the Multi Arabic Dialect Applications and Resources (MADAR) project. The first is a large parallel corpus of 25 Arabic city dialects in the travel domain. The second is a lexicon of 1,045 concepts with an average of 45 words from 25 cities per concept. These resources are the first of their kind in terms of the breadth of their coverage and the fine location granularity. The focus on cities, as opposed to regions in studying Arabic dialects, opens new avenues to many areas of research from dialectology to dialect identification and machine translation.
KW - Arabic Dialects
KW - Lexicon
KW - Parallel Corpus
UR - http://www.scopus.com/inward/record.url?scp=85045416149&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045416149&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85045416149
SP - 3387
EP - 3396
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Piperidis, Stelios
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Hasida, Koiti
A2 - Mazo, Helene
A2 - Choukri, Khalid
A2 - Goggi, Sara
A2 - Mariani, Joseph
A2 - Moreno, Asuncion
A2 - Calzolari, Nicoletta
A2 - Odijk, Jan
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
ER -