Statistical approaches for natural language modelling and monotone statistical machine translation

Doctoral thesis by Jesús Andres Ferrer

This thesis gathers several contributions to statistical pattern recognition and, more specifically, to several natural language processing (NLP) tasks. It revisits three well-known statistical techniques: parameter estimation, loss function design and probability modelling. These techniques are applied to NLP tasks such as text classification (TC), language modelling (LM) and statistical machine translation (SMT).

In parameter estimation, we tackle the smoothing problem by proposing a constrained domain maximum likelihood estimation (CDMLE) technique. CDMLE avoids the need for the smoothing stage that causes maximum likelihood estimation (MLE) to lose its good theoretical properties. The technique is applied to text classification by means of the naive Bayes classifier. CDMLE is then extended to leaving-one-out MLE and applied to LM smoothing. The results obtained on several LM tasks show an improvement in perplexity over standard smoothing techniques.

Concerning the loss function, we carefully study the design of loss functions other than the 0-1 loss. We focus on loss functions that retain a decoding complexity similar to that of the 0-1 loss function while providing more flexibility. Many candidate loss functions are presented and analysed on several statistical machine translation tasks and for several translation models. We also analyse some notable translation rules, such as the direct translation rule, and give further insight into log-linear models, which are, in fact, particular cases of loss functions.

Finally, several monotone translation models are proposed based on well-known modelling techniques. First, an extension to the GIATI technique is proposed to infer finite-state transducers (FSTs). Next, a phrase-based monotone translation model inspired by hidden Markov models is proposed. Lastly, a phrase-based hidden semi-Markov model is introduced. This last model yields slight improvements over the baseline under some circumstances.

 

Academic details of the doctoral thesis "Statistical approaches for natural language modelling and monotone statistical machine translation"

  • Thesis title:  Statistical approaches for natural language modelling and monotone statistical machine translation
  • Author:  Jesús Andres Ferrer
  • University:  Politécnica de Valencia
  • Thesis defence date:  05/02/2010

 

Supervision and examination committee

  • Thesis supervisor
    • Alfons Juan Ciscar
  • Examination committee
    • Committee chair: Enrique Vidal Ruiz
    • Marcello Federico (member)
    • Koehn (member)
    • (member)

 
