A parallel corpus of Arabic-Japanese news articles

Go Inoue, Nizar Habash, Yuji Matsumoto, Hiroyuki Aoyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Much work has been done on machine translation between major language pairs such as Arabic-English and English-Japanese, thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, little research has been conducted on the Arabic-Japanese language pair because parallel data is scarce, even though the pair is a good example of two typologically contrasting languages. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: a document-level parallel corpus in XML format and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

Original language: English (US)
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 918-924
Number of pages: 7
ISBN (Electronic): 9791095546009
Publication status: Published - Jan 1 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018 → May 12 2018

Other

Other: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 5/7/18 → 5/12/18

Keywords

  • Arabic
  • Japanese
  • Machine Translation
  • Parallel Corpus
  • Sentence Alignment

ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Inoue, G., Habash, N., Matsumoto, Y., & Aoyama, H. (2019). A parallel corpus of Arabic-Japanese news articles. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 918-924). European Language Resources Association (ELRA).