George Bosilca, Zizhong Chen, Jack Dongarra, and Julien Langou (2007)
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
SIAM Journal on Scientific Computing, 30(1):102-116.
A simple checkpoint-free fault-tolerant scheme for parallel iterative methods is given. Assuming that when one processor fails, all of its data is lost and the system is recovered with a new processor, this scheme computes a new approximate solution from the data of the non-failed processors. The iterative method is then restarted from this new vector. The main advantage of this technique over standard checkpointing is that it adds no extra computation to the iterative solver. In particular, if no failure occurs, the fault-tolerant application is the same as the original application. The main drawback is that, after a failure, the method no longer converges as the original method does. In this paper, we present this recovery technique as well as some implementations of checkpointing in iterative methods. Finally, experiments are presented to compare the two techniques. The fault-tolerant MPI library used is FT-MPI. Both iterative linear solvers and iterative eigensolvers are considered.
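The abstract does not spell out how the new approximate solution is computed from the surviving data. One plausible realization for a linear system Ax = b, sketched here purely as an assumption, is to rebuild the lost entries x_l by solving the local system A_ll x_l = b_l - A_ls x_s, where x_s holds the surviving entries, and then to restart the iteration. The Jacobi solver, the small tridiagonal test matrix, the index set `lost`, and all function names below are hypothetical choices for illustration; failure is simulated in a single process rather than with FT-MPI.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def jacobi_step(A, b, x):
    """One Jacobi sweep: x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def residual_norm(A, b, x):
    """Max-norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
               for i in range(n))


def recover_lost_block(A, b, x, lost):
    """Rebuild entries `lost` of x using only surviving entries of x.

    Assumed recovery rule: solve A_ll x_l = b_l - A_ls x_s.
    """
    lost_set = set(lost)
    surviving = [j for j in range(len(b)) if j not in lost_set]
    local = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in surviving) for i in lost]
    y = solve_dense(local, rhs)
    x_new = list(x)
    for k, i in enumerate(lost):
        x_new[i] = y[k]
    return x_new


# Hypothetical test problem: diagonally dominant tridiagonal system.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(n)] for i in range(n)]
b = [1.0] * n

# Run a few iterations, then simulate a processor failure.
x = [0.0] * n
for _ in range(10):
    x = jacobi_step(A, b, x)

lost = [2, 3]                      # entries owned by the failed processor
for i in lost:
    x[i] = float("nan")            # the data is gone

# Recover from surviving data only, then restart the iteration.
x = recover_lost_block(A, b, x, lost)
x_recovered = list(x)
for _ in range(50):
    x = jacobi_step(A, b, x)
final_residual = residual_norm(A, b, x)
```

In this sketch the failure-free path is just the plain Jacobi loop, which mirrors the abstract's point that nothing is added to the solver when no failure occurs. The cost of recovery is paid only after a failure, and the restarted iteration begins from a vector that differs from the one that was lost.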