A parallel corpus of Arabic-Japanese news articles

Go Inoue, Nizar Habash, Yuji Matsumoto, Hiroyuki Aoyama

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Much work has been done on machine translation between major language pairs, including Arabic-English and English-Japanese, thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, little research has been conducted on the Arabic-Japanese language pair because parallel data are scarce, even though the two languages offer a good example of contrasting typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: a document-level parallel corpus in XML format and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

    Original language: English (US)
    Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
    Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
    Publisher: European Language Resources Association (ELRA)
    Pages: 918-924
    Number of pages: 7
    ISBN (Electronic): 9791095546009
    State: Published - Jan 1 2019
    Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
    Duration: May 7 2018 to May 12 2018

    Keywords

    • Arabic
    • Japanese
    • Machine Translation
    • Parallel Corpus
    • Sentence Alignment

    ASJC Scopus subject areas

    • Linguistics and Language
    • Education
    • Library and Information Sciences
    • Language and Linguistics

    Cite this

    Inoue, G., Habash, N., Matsumoto, Y., & Aoyama, H. (2019). A parallel corpus of Arabic-Japanese news articles. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 918-924). European Language Resources Association (ELRA).

    @inproceedings{390078773b3c4585bb2d283923ba4b19,
    title = "A parallel corpus of Arabic-Japanese news articles",
    abstract = "Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.",
    keywords = "Arabic, Japanese, Machine Translation, Parallel Corpus, Sentence Alignment",
    author = "Go Inoue and Nizar Habash and Yuji Matsumoto and Hiroyuki Aoyama",
    year = "2019",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "918--924",
    editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
    booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
    publisher = "European Language Resources Association (ELRA)",

    }

    TY - GEN

    T1 - A parallel corpus of Arabic-Japanese news articles

    AU - Inoue, Go

    AU - Habash, Nizar

    AU - Matsumoto, Yuji

    AU - Aoyama, Hiroyuki

    PY - 2019/1/1

    Y1 - 2019/1/1

    AB - Much work has been done on machine translation between major language pairs including Arabic-English and English-Japanese thanks to the availability of large-scale parallel corpora with manually verified subsets of parallel sentences. However, there has been little research conducted on the Arabic-Japanese language pair due to its parallel-data scarcity, despite being a good example of interestingly contrasting differences in typology. In this paper, we describe the creation process and statistics of the Arabic-Japanese portion of the TUFS Media Corpus, a parallel corpus of translated news articles collected at Tokyo University of Foreign Studies (TUFS). Part of the corpus is manually aligned at the sentence level for development and testing. The corpus is provided in two formats: A document-level parallel corpus in XML format, and a sentence-level parallel corpus in plain text format. We also report the first results of Arabic-Japanese phrase-based machine translation trained on our corpus.

    KW - Arabic

    KW - Japanese

    KW - Machine Translation

    KW - Parallel Corpus

    KW - Sentence Alignment

    UR - http://www.scopus.com/inward/record.url?scp=85059886906&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85059886906&partnerID=8YFLogxK

    M3 - Conference contribution

    SP - 918

    EP - 924

    BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation

    A2 - Isahara, Hitoshi

    A2 - Maegaard, Bente

    A2 - Piperidis, Stelios

    A2 - Cieri, Christopher

    A2 - Declerck, Thierry

    A2 - Hasida, Koiti

    A2 - Mazo, Helene

    A2 - Choukri, Khalid

    A2 - Goggi, Sara

    A2 - Mariani, Joseph

    A2 - Moreno, Asuncion

    A2 - Calzolari, Nicoletta

    A2 - Odijk, Jan

    A2 - Tokunaga, Takenobu

    PB - European Language Resources Association (ELRA)

    ER -