New reordering and modeling approaches for statistical machine translation

Ruiz Costa-Jussà, Marta

Supervised by:

José Adrián Rodríguez Fonollosa Thesis supervisor

University of defense: Universitat Politècnica de Catalunya (UPC)

Defense date: 17 September 2008

Examination committee:

José Bernardo Mariño Acebal Chair
Lluís Márquez Villodre Secretary
Carmen García Mateo Member
Philipp Koehn Member
Holger Schwenk Member

Type: Doctoral thesis

Teseo: 275161 DIALNET

Abstract

This thesis focuses on the statistical machine translation (SMT) framework, and primarily on the definition and experimentation of novel algorithms for building a correct structural reordering of translated words. Moreover, challenging techniques for language modeling and system combination are successfully applied to state-of-the-art SMT systems. To begin, a thorough study of the SMT state of the art is performed. The feature functions of Ngram-based and phrase-based SMT are described. The former, which has been developed in our research group, is used as a baseline system, and the latter, given its popularity, is used to explore the new techniques further during experimentation. Then, the introduction of continuous space language models is reported and analyzed in an Ngram-based system that uses translation and target language models. The continuous space language modeling technique is based on projecting word indices onto a continuous space. The resulting probability functions are smooth functions of the word representation. Events are better estimated than with standard smoothing methods, as shown by a significant reduction in perplexity. This better probability estimation leads to an improvement in translation quality. Moreover, this thesis performs a two-system combination of the phrase-based and Ngram-based systems. Multiple outputs of both systems, with their corresponding scores, are concatenated, and for each translation the score given by the opposite system is computed. The final translation is chosen by simultaneously considering the scores given by both systems. Finally, this thesis proposes the introduction of novel statistical reordering techniques in an SMT system. The first approach is based on an algorithm that detects, learns and infers pairs of words in the source language that swap in the target language, providing accurate local reorderings.
The second approach consists of generating weighted reordering hypotheses using the same powerful techniques as SMT systems, in order to undo the source language structure and make it more similar to the target language structure. The translation challenge is thus divided into two steps: predicting the order of the words in the target language, and then substituting these words with words in the target language. To infer new reorderings that were not learnt during training, the statistical machine reordering (SMR) system uses word classes instead of the words themselves. To integrate the SMR and SMT systems correctly, the two are concatenated by means of a word graph. This is an elegant and efficient reordering approach that is capable of achieving significantly improved translations in the target language.
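
The two-system combination summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function and parameter names are hypothetical, the scores are assumed to be log-scores, and the equally weighted sum used to consider both scores "simultaneously" is an assumption, since the abstract does not state how the scores are combined.

```python
from typing import Callable, Dict, List, Optional, Tuple

NBest = List[Tuple[str, float]]   # (hypothesis, log-score from its own system)
Scorer = Callable[[str], float]   # hypothetical: scores any hypothesis with one system


def combine_nbest(
    nbest_a: NBest,
    nbest_b: NBest,
    score_with_a: Scorer,
    score_with_b: Scorer,
    weight_a: float = 0.5,        # assumed weights, not taken from the thesis
    weight_b: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Pick the hypothesis with the best weighted sum of both systems' scores."""
    # Concatenate both n-best lists, keeping the scores that are already known.
    known: Dict[str, List[Optional[float]]] = {}
    for hyp, score in nbest_a:
        known.setdefault(hyp, [None, None])[0] = score
    for hyp, score in nbest_b:
        known.setdefault(hyp, [None, None])[1] = score

    combined: Dict[str, float] = {}
    for hyp, (score_a, score_b) in known.items():
        # Ask the opposite system for any score that is still missing.
        if score_a is None:
            score_a = score_with_a(hyp)
        if score_b is None:
            score_b = score_with_b(hyp)
        combined[hyp] = weight_a * score_a + weight_b * score_b

    best = max(combined, key=combined.get)
    return best, combined


# Toy data: system A stands in for the Ngram-based system, B for the phrase-based one.
nbest_a = [("the red house", -1.0), ("the house red", -1.5)]
nbest_b = [("the red house", -1.2), ("a red house", -0.9)]
# Toy scorers: dictionary lookups replace real decoder rescoring.
score_with_a = {"a red house": -3.0}.__getitem__
score_with_b = {"the house red": -4.0}.__getitem__

best, combined = combine_nbest(nbest_a, nbest_b, score_with_a, score_with_b)
print(best)
```

In this toy run each hypothesis ends up with one score per system, and the hypothesis that both systems rate reasonably well wins over one that only a single system prefers. In practice the weights would be tuned on held-out data.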