Fault tolerance configuration for uncoordinated checkpoints

Doctoral thesis by Leonardo Fialho De Queiroz

Parallel computers are growing in complexity and in number of components. Component miniaturisation and concentration are the major root causes of the failures increasingly seen on these computers. Thus, in order to run to completion, parallel applications should use a fault tolerance strategy. A widely used strategy is rollback-recovery, which consists of saving the application state periodically. When a fault occurs, the application resumes its execution from the most recently saved state. These fault tolerance protocols add overhead to the parallel application's execution. With a coordinated checkpointing protocol it is easy to estimate the application execution time, as well as to calculate how frequently checkpoints should be taken. In fact, very precise models now exist to estimate the application execution time and the checkpoint interval. However, coordinated checkpointing may not be the best solution for providing fault tolerance on next-generation parallel computers. In other words, the current paradigm of fault tolerance for parallel applications is not suitable for future parallel computers. Fault tolerance protocols such as uncoordinated checkpointing allow each process of the parallel application to save its state independently of the other processes. Combining uncoordinated checkpointing with logging of message-passing events avoids the drawbacks of this sort of protocol, such as the domino effect and orphan messages. This is the emergent paradigm of fault tolerance for scalable parallel applications. However, there is as yet no suitable model for estimating the execution time of a parallel application protected by uncoordinated checkpointing, nor a convenient model for calculating how frequently those checkpoints should be taken.
The objective of this thesis is to define suitable models that can be used with each paradigm: the coordinated and the uncoordinated. These models should provide an estimate of the application's wall-clock time under each fault tolerance paradigm, as well as a methodology for defining the values of the variables used to calculate the checkpointing interval. The main motivation of this work is to provide the knowledge necessary to face the emergent fault tolerance paradigm and, at the same time, to make it suitable for use by parallel application users.
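As a rough illustration of what a checkpoint-interval model looks like in the coordinated case, the sketch below uses Young's classical first-order approximation, in which the interval is the square root of twice the checkpoint cost times the mean time between failures. This is not the model developed in the thesis; the function names and sample values are hypothetical and chosen only for the example.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed known).
    mtbf_s: mean time between failures of the system, in seconds (assumed known).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def failure_free_time(work_s: float, checkpoint_cost_s: float, interval_s: float) -> float:
    """Wall-clock time with no failures: useful work plus one checkpoint
    per completed interval. Recovery and re-execution costs are ignored."""
    n_checkpoints = math.floor(work_s / interval_s)
    return work_s + n_checkpoints * checkpoint_cost_s


if __name__ == "__main__":
    interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=100.0)
    print(interval)                                  # 20.0
    print(failure_free_time(100.0, 2.0, interval))   # 110.0
```

A model of this kind assumes that all processes checkpoint together at a single global cost. That assumption does not carry over to uncoordinated checkpointing, which is the gap the abstract describes.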

 

Academic details of the doctoral thesis «Fault tolerance configuration for uncoordinated checkpoints»

  • Thesis title:  Fault tolerance configuration for uncoordinated checkpoints
  • Author:  Leonardo Fialho De Queiroz
  • University:  Autónoma de Barcelona
  • Thesis defence date:  08/07/2011

 

Supervision and examination committee

  • Thesis supervisor
    • Dolores Isabel Rexachs Del Rosario
  • Examination committee
    • Committee chair: Casiano Rodríguez León
    • Ramon Doallo Biempica (member)
    • (member)
    • (member)

 
