A parallel corpus of Arabic-Japanese news articles

Go Inoue, Nizar Habash, Yuji Matsumoto, Hiroyuki Aoyama

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

Original languageEnglish (US)
Title of host publicationLREC 2018 - 11th International Conference on Language Resources and Evaluation
EditorsHitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
PublisherEuropean Language Resources Association (ELRA)
Pages918-924
Number of pages7
ISBN (Electronic)9791095546009
StatePublished - Jan 1 2019
Event11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018May 12 2018

Other

Other11th International Conference on Language Resources and Evaluation, LREC 2018
CountryJapan
CityMiyazaki
Period5/7/185/12/18

Fingerprint

news
language
typology
statistics
News Articles
Parallel Corpora
Tokyo
Machine Translation

Keywords

  • Arabic
  • Japanese
  • Machine Translation
  • Parallel Corpus
  • Sentence Alignment

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Inoue, G., Habash, N., Matsumoto, Y., & Aoyama, H. (2019). A parallel corpus of Arabic-Japanese news articles. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 918-924). European Language Resources Association (ELRA).

A parallel corpus of Arabic-Japanese news articles. / Inoue, Go; Habash, Nizar; Matsumoto, Yuji; Aoyama, Hiroyuki.

LREC 2018 - 11th International Conference on Language Resources and Evaluation. ed. / Hitoshi Isahara; Bente Maegaard; Stelios Piperidis; Christopher Cieri; Thierry Declerck; Koiti Hasida; Helene Mazo; Khalid Choukri; Sara Goggi; Joseph Mariani; Asuncion Moreno; Nicoletta Calzolari; Jan Odijk; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. p. 918-924.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Inoue, G, Habash, N, Matsumoto, Y & Aoyama, H 2019, A parallel corpus of Arabic-Japanese news articles. in H Isahara, B Maegaard, S Piperidis, C Cieri, T Declerck, K Hasida, H Mazo, K Choukri, S Goggi, J Mariani, A Moreno, N Calzolari, J Odijk & T Tokunaga (eds), LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), pp. 918-924, 11th International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, 5/7/18.
Inoue G, Habash N, Matsumoto Y, Aoyama H. A parallel corpus of Arabic-Japanese news articles. In Isahara H, Maegaard B, Piperidis S, Cieri C, Declerck T, Hasida K, Mazo H, Choukri K, Goggi S, Mariani J, Moreno A, Calzolari N, Odijk J, Tokunaga T, editors, LREC 2018 - 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA). 2019. p. 918-924
Inoue, Go ; Habash, Nizar ; Matsumoto, Yuji ; Aoyama, Hiroyuki. / A parallel corpus of Arabic-Japanese news articles. LREC 2018 - 11th International Conference on Language Resources and Evaluation. editor / Hitoshi Isahara ; Bente Maegaard ; Stelios Piperidis ; Christopher Cieri ; Thierry Declerck ; Koiti Hasida ; Helene Mazo ; Khalid Choukri ; Sara Goggi ; Joseph Mariani ; Asuncion Moreno ; Nicoletta Calzolari ; Jan Odijk ; Takenobu Tokunaga. European Language Resources Association (ELRA), 2019. pp. 918-924
@inproceedings{f9909e330f9d4736906de4f7b88ee438,
title = "A parallel corpus of Arabic-Japanese news articles",
abstract = "Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.",
keywords = "Arabic, Japanese, Machine Translation, Parallel Corpus, Sentence Alignment",
author = "Go Inoue and Nizar Habash and Yuji Matsumoto and Hiroyuki Aoyama",
year = "2019",
month = "1",
day = "1",
language = "English (US)",
pages = "918--924",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN

T1 - A parallel corpus of Arabic-Japanese news articles

AU - Inoue, Go

AU - Habash, Nizar

AU - Matsumoto, Yuji

AU - Aoyama, Hiroyuki

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

AB - Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

KW - Arabic

KW - Japanese

KW - Machine Translation

KW - Parallel Corpus

KW - Sentence Alignment

UR - http://www.scopus.com/inward/record.url?scp=85059886906&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059886906&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85059886906

SP - 918

EP - 924

BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation

A2 - Isahara, Hitoshi

A2 - Maegaard, Bente

A2 - Piperidis, Stelios

A2 - Cieri, Christopher

A2 - Declerck, Thierry

A2 - Hasida, Koiti

A2 - Mazo, Helene

A2 - Choukri, Khalid

A2 - Goggi, Sara

A2 - Mariani, Joseph

A2 - Moreno, Asuncion

A2 - Calzolari, Nicoletta

A2 - Odijk, Jan

A2 - Tokunaga, Takenobu

PB - European Language Resources Association (ELRA)

ER -