A multidialectal parallel corpus of Arabic

Houda Bouamor, Nizar Habash, Kemal Oflazer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

Original language: English (US)
Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Publisher: European Language Resources Association (ELRA)
Pages: 1240-1245
Number of pages: 6
ISBN (Electronic): 9782951740884
State: Published - Jan 1 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: May 26 2014 - May 31 2014

Other

Other: 9th International Conference on Language Resources and Evaluation, LREC 2014
Country: Iceland
City: Reykjavik
Period: 5/26/14 - 5/31/14

Fingerprint

dialect
colloquial
Arab
speaking
Parallel Corpora
art
communication
Arabic Dialects
resources
community

Keywords

  • Arabic
  • Dialects
  • Parallel Corpus

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Cite this

Bouamor, H., Habash, N., & Oflazer, K. (2014). A multidialectal parallel corpus of Arabic. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 1240-1245). European Language Resources Association (ELRA).

@inproceedings{9b4a95aadfa24ff4971b1d47ab89ef23,
title = "A multidialectal parallel corpus of Arabic",
abstract = "The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.",
keywords = "Arabic, Dialects, Parallel Corpus",
author = "Houda Bouamor and Nizar Habash and Kemal Oflazer",
year = "2014",
month = "1",
day = "1",
language = "English (US)",
pages = "1240--1245",
booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - A multidialectal parallel corpus of Arabic

AU - Bouamor, Houda

AU - Habash, Nizar

AU - Oflazer, Kemal

PY - 2014/1/1

Y1 - 2014/1/1

N2 - The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic-speaking communities. These dialects vary widely from region to region and, to a lesser extent, from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and, recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

KW - Arabic

KW - Dialects

KW - Parallel Corpus

UR - http://www.scopus.com/inward/record.url?scp=85026863473&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85026863473&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85026863473

SP - 1240

EP - 1245

BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014

PB - European Language Resources Association (ELRA)

ER -