New reordering and modeling approaches for statistical machine translation

Ruiz Costa-Jussà, Marta

Supervised by:

José Adrián Rodríguez Fonollosa Director

University of defense: Universitat Politècnica de Catalunya (UPC)

Defense date: 17 September 2008

Examination committee:

José Bernardo Mariño Acebal Chair
Lluís Márquez Villodre Secretary
Carmen García Mateo Member
Philipp Koehn Member
Holger Schwenk Member

Type: Thesis

Teseo: 275161 DIALNET

Abstract

This thesis focuses on the statistical machine translation (SMT) framework, and primarily on the definition and experimentation of novel algorithms for building a correct structural reordering of translated words. Moreover, challenging techniques for language modeling and system combination are successfully applied to state-of-the-art SMT systems.
To begin, a thorough study of the SMT state of the art is performed. Ngram- and phrase-based SMT feature functions are described. The former, which has been developed in our research group, is used as a baseline system; the latter, given its popularity, is used to explore the new techniques further during experimentation.
Then, the introduction of continuous space language models is reported and analyzed in an Ngram-based system that uses translation and target language models. The continuous space language modeling technique is based on projecting word indices onto a continuous space. The resulting probability functions are smooth functions of the word representation. Events are better estimated than with standard smoothing methods, as shown by a significant reduction in perplexity. This better probability estimation allows for an improvement in translation quality.
Moreover, this thesis performs a two-system combination of the phrase-based and Ngram-based systems. Multiple outputs of both systems, with their corresponding scores, are concatenated, and for each translation produced by one system the score given by the other system is computed. The final translation is chosen by simultaneously considering the scores given by both systems.
Finally, this thesis proposes the introduction of novel statistical reordering techniques in an SMT system. The first approach is based on an algorithm that detects, learns and infers pairs of words in the source language that swap in the target language, providing accurate local reorderings.
The second approach, statistical machine reordering (SMR), consists of generating weighted reordering hypotheses using the same powerful techniques as SMT systems, in order to undo the source language structure and make it more similar to the target language structure. The translation challenge is therefore divided into two steps: predicting the order of the words in the target language, and then replacing the reordered source words with target language words. In order to infer new reorderings that were not learnt during training, the SMR system uses word classes instead of the words themselves. In order to integrate the SMR and SMT systems correctly, the two are concatenated by means of a word graph. This is an elegant and efficient reordering approach that is capable of achieving significantly improved translation quality in the target language.
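
The two-system combination described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `combine_nbest`, the scorer callables `score_a` and `score_b`, and the weighted sum used to merge the two scores are all assumptions made for illustration, since the abstract only states that both scores are considered simultaneously.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: (source sentence, hypothesis) -> log-score.
Scorer = Callable[[str, str], float]


def combine_nbest(
    source: str,
    nbest_a: List[Tuple[str, float]],
    nbest_b: List[Tuple[str, float]],
    score_a: Scorer,
    score_b: Scorer,
    weight_a: float = 0.5,
    weight_b: float = 0.5,
) -> Tuple[str, float]:
    """Pick one translation from the concatenated n-best lists of two systems.

    Each hypothesis keeps the score from the system that produced it and is
    scored by the other system on demand. The weighted sum is an assumed
    combination rule, not one stated in the abstract.
    """
    # Concatenate both lists, remembering the scores that are already known.
    known: Dict[str, Dict[str, float]] = {}
    for hyp, score in nbest_a:
        known.setdefault(hyp, {})["a"] = score
    for hyp, score in nbest_b:
        known.setdefault(hyp, {})["b"] = score
    if not known:
        raise ValueError("both n-best lists are empty")

    best_hyp, best_total = "", float("-inf")
    for hyp, scores in known.items():
        # Ask the opposite system for any score that is still missing.
        s_a = scores["a"] if "a" in scores else score_a(source, hyp)
        s_b = scores["b"] if "b" in scores else score_b(source, hyp)
        total = weight_a * s_a + weight_b * s_b
        if total > best_total:
            best_hyp, best_total = hyp, total
    return best_hyp, best_total
```

In practice the two weights would presumably be tuned on held-out data; the equal defaults here are only placeholders.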