Translate, predict or generate: Modeling rich morphology in statistical machine translation

Ahmed El Kholy, Nizar Habash

Research output: Contribution to conference (Paper)

Abstract

We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process, which maps source tokens to target tokens. Alternatively, these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a step separate from translation and generation. We focus on three morphological features that we demonstrate through a manual error analysis to be most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-of-the-art baseline (phrase-based SMT) of almost 1% absolute BLEU on a medium-sized training set. Our best configuration models the determiner as part of core translation, predicts gender and number separately, and handles the rest of the features through generation.
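The division of labor in the best configuration can be pictured as a three-step pipeline. The following is a toy sketch under stated assumptions, not the authors' system: the `Token` structure, the function names (`translate`, `predict_gender_number`, `generate`), the tiny `PHRASE_TABLE`, the placeholder lemma `LEMMA_BOOK`, and the rules inside each step are all hypothetical and exist only to show which step decides which feature.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Token:
    lemma: str                    # target-side lemma produced by translation
    det: bool = False             # determiner clitic, decided during translation
    gender: Optional[str] = None  # "m" or "f", filled in by the prediction step
    number: Optional[str] = None  # "sg" or "pl", filled in by the prediction step


# Hypothetical phrase table: source phrase -> (placeholder target lemma, determiner flag).
PHRASE_TABLE = {
    ("the", "book"): ("LEMMA_BOOK", True),
    ("the", "books"): ("LEMMA_BOOK", True),
    ("a", "book"): ("LEMMA_BOOK", False),
}


def translate(source):
    """Translate: map source tokens to a target lemma and model the determiner here."""
    lemma, det = PHRASE_TABLE[tuple(source)]
    return Token(lemma=lemma, det=det)


def predict_gender_number(source, token):
    """Predict: combine source and target information to set gender and number."""
    # Toy source-side cue: an English plural "-s" on the last word.
    number = "pl" if source[-1].endswith("s") else "sg"
    # Toy target-side lexicon lookup (assumed values, not real data).
    gender = {"LEMMA_BOOK": "m"}.get(token.lemma, "m")
    return replace(token, gender=gender, number=number)


def generate(token):
    """Generate: stand-in for inflection; renders the features as a placeholder string."""
    prefix = "DET+" if token.det else ""
    return f"{prefix}{token.lemma}.{token.gender}.{token.number}"


def pipeline(source):
    token = translate(source)
    token = predict_gender_number(source, token)
    return generate(token)
```

With these toy tables, `pipeline(["the", "books"])` returns `"DET+LEMMA_BOOK.m.pl"`. A real generation step would produce an inflected Arabic surface form using target monolingual context rather than a feature string.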

Original language: English (US)
Pages: 27-34
Number of pages: 8
State: Published - Jan 1 2012
Event: 16th Annual Conference of the European Association for Machine Translation, EAMT 2012 - Trento, Italy
Duration: May 28 2012 to May 30 2012

Other

Other: 16th Annual Conference of the European Association for Machine Translation, EAMT 2012
Country: Italy
City: Trento
Period: 5/28/12 to 5/30/12

Fingerprint

Error analysis
Modeling
Statistical Machine Translation
Determiners
Clitics
Inflection
Translation Process
Language

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software

Cite this

El Kholy, A., & Habash, N. (2012). Translate, predict or generate: Modeling rich morphology in statistical machine translation. 27-34. Paper presented at 16th Annual Conference of the European Association for Machine Translation, EAMT 2012, Trento, Italy.

@conference{ac406f69831f4486866ad774706f651f,
title = "Translate, predict or generate: Modeling rich morphology in statistical machine translation",
abstract = "We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process mapping source tokens to target tokens. Alternatively these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a separate step from translation and generation. We focus on three morphological features that we demonstrate through a manual error analysis to be most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-of-the-art baseline (phrase-based SMT) of almost 1{\%} absolute BLEU on a medium size training set. Our best configuration models the determiner as part of core translation and predicts gender and number separately, and handles the rest of the features through generation.",
author = "{El Kholy}, Ahmed and Habash, Nizar",
year = "2012",
month = "1",
day = "1",
language = "English (US)",
pages = "27--34",
note = "16th Annual Conference of the European Association for Machine Translation, EAMT 2012 ; Conference date: 28-05-2012 Through 30-05-2012",

}

TY - CONF

T1 - Translate, predict or generate

T2 - Modeling rich morphology in statistical machine translation

AU - El Kholy, Ahmed

AU - Habash, Nizar

PY - 2012/1/1

Y1 - 2012/1/1

N2 - We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process mapping source tokens to target tokens. Alternatively these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a separate step from translation and generation. We focus on three morphological features that we demonstrate through a manual error analysis to be most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-of-the-art baseline (phrase-based SMT) of almost 1% absolute BLEU on a medium size training set. Our best configuration models the determiner as part of core translation and predicts gender and number separately, and handles the rest of the features through generation.

UR - http://www.scopus.com/inward/record.url?scp=85000936558&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85000936558&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85000936558

SP - 27

EP - 34

ER -