Advances in fully-automatic and interactive phrase-based statistical machine translation

Doctoral thesis by Daniel Ortiz Martínez

This thesis presents contributions to fully-automatic statistical machine translation and to interactive statistical machine translation. Statistical machine translation poses three problems: the modelling problem, the training problem and the search problem. This thesis presents contributions to all three.

Regarding the modelling problem, an alternative derivation of phrase-based statistical translation models is proposed. This derivation introduces a set of statistical submodels governing different aspects of the translation process. The resulting submodels can also be introduced as components of a log-linear model.

Regarding the training problem, we propose an alternative estimation technique for phrase-based models that tries to reduce the strong heuristic component of the standard estimation technique. The proposed technique considers the phrase pairs that compose the phrase model as part of complete bisegmentations of the source and target sentences. We demonstrate theoretically and empirically that the proposed estimation technique can be executed efficiently. Experimental results obtained with the open-source Thot toolkit, also presented in this thesis, show that the alternative estimation technique yields phrase models with lower perplexity than those obtained by means of the standard estimation technique. However, the reduction in model perplexity did not translate into improvements in translation quality.

To deal with the search problem, we propose a search algorithm based on the branch-and-bound search paradigm. The proposed algorithm generalises different search strategies, which can be selected by modifying the input parameters. We carried out experiments to evaluate the performance of the proposed search algorithm. Additionally, we study an alternative formalisation of the search problem in which the best alignment at the phrase level is obtained given the source and target sentences. To solve this problem, smoothing techniques are applied to the phrase table, and the standard search algorithm for phrase-based statistical machine translation is modified to explore the space of possible alignments. Empirical results show that the proposed techniques can be used to generate phrase-based alignments efficiently and robustly.

One disadvantage of phrase-based models is their huge size when they are estimated from very large corpora. In this thesis, we propose techniques to alleviate this problem during both the estimation and the decoding stages. For this purpose, main memory requirements are transformed into hard disk requirements. Experimental results show that the hard disk accesses do not significantly decrease the efficiency of the SMT system.

With respect to interactive statistical machine translation, the contributions are twofold. On the one hand, we present alternative techniques to implement interactive machine translation systems. On the other hand, we propose an interactive machine translation system that is able to learn from user feedback by means of online learning techniques.

We propose two alternative techniques for interactive statistical machine translation. The first is based on the generation of partial alignments at the phrase level. This approach constitutes an application of the phrase-based alignment generation techniques that are also proposed in this thesis. The second tackles the interactive machine translation process by means of word graphs and stochastic error-correction models. It differs from other approaches described in the literature in that it introduces error-correction techniques into the statistical framework of the interactive machine translation process. We carried out experiments to evaluate the two proposed techniques, showing that they are competitive with state-of-the-art interactive machine translation systems. In addition, these techniques have been used to implement an interactive machine translation prototype following a client-server architecture.

Finally, the above-mentioned interactive machine translation system with online learning is based on statistical models that can be incrementally updated. The main difficulty in defining incremental versions of the statistical models involved in the interactive translation process appears when such models are estimated by means of the expectation-maximisation algorithm. To solve this problem, we propose applying the incremental version of this algorithm. The proposed interactive machine translation system with online learning was empirically evaluated, demonstrating that the system is able to learn from scratch or from previously estimated models. The results also show that the interactive machine translation system with online learning significantly outperforms other state-of-the-art systems described in the literature.
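For readers unfamiliar with two formulations the abstract relies on, the log-linear combination of submodels and the interactive translation setting, the following is a sketch of how they are commonly written in the statistical machine translation literature. It is the standard textbook form, not necessarily the notation used in the thesis. All symbols are assumptions introduced here for illustration: f is the source sentence, e a candidate target sentence, h_m the feature functions (for example, the submodels), lambda_m their weights, e_p the target prefix validated by the user, and e_s the suffix proposed by the system.

```latex
% Log-linear model: M feature functions h_m with weights \lambda_m
\Pr(e \mid f) =
  \frac{\exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(f, e) \right)}
       {\sum_{e'} \exp\left( \sum_{m=1}^{M} \lambda_m \, h_m(f, e') \right)}

% Fully-automatic translation: search for the best target sentence
\hat{e} = \operatorname*{arg\,max}_{e} \; \sum_{m=1}^{M} \lambda_m \, h_m(f, e)

% Interactive translation: search for the best suffix e_s
% given the source sentence f and the user-validated prefix e_p
\hat{e}_s = \operatorname*{arg\,max}_{e_s} \; \Pr(e_s \mid f, e_p)
```

Because the denominator of the first expression does not depend on e, the search only needs the weighted sum of feature scores, which is why new submodels can be added as further terms of the sum.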

 

Academic details of the doctoral thesis "Advances in fully-automatic and interactive phrase-based statistical machine translation"

  • Thesis title:  Advances in fully-automatic and interactive phrase-based statistical machine translation
  • Author:  Daniel Ortiz Martínez
  • University:  Politécnica de Valencia
  • Date of thesis defence:  07/10/2011

 

Supervision and examination committee

  • Thesis supervisor
    • Ismael García Varea
  • Examination committee
    • Committee chair: Enrique Vidal Ruiz
    • Hermann Ney (member)
    • Marcello Federico (member)
    • Philipp Koehn (member)

 
