
George Bosilca, Zizhong Chen, Jack Dongarra, and Julien Langou (2007)

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

SIAM Journal on Scientific Computing, 30(1):102-116.

A simple checkpoint-free fault-tolerant scheme for parallel iterative methods is given. Assuming that when one processor fails, all its data is lost and the system is recovered with a new processor, this scheme computes a new approximate solution from the data of the non-failed processors. The iterative method is then restarted from this new vector. The main advantage of this technique over standard checkpointing is that no extra computation is added to the iterative solver. In particular, if no failure occurs, the fault-tolerant application is the same as the original application. The main drawback is that, after a failure, the convergence of the method is no longer the same as that of the original method. In this paper, we present this recovery technique as well as some implementations of checkpoints in iterative methods. Finally, experiments are presented to compare the two techniques. The fault-tolerant MPI library used is FT-MPI. Both iterative linear solvers and iterative eigensolvers are considered.

Also available as a University of Tennessee, Department of Computer Science Technical Report #UT-CS-04-538.
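
The recovery step described in the abstract can be illustrated with a small serial sketch. It is not the paper's implementation: the function names, the tridiagonal test system, and the specific recovery rule are assumptions made for illustration. The rule assumed here rebuilds the lost entries of the iterate by approximately solving the failed processor's block of equations while holding the surviving entries fixed. No MPI or FT-MPI calls are used; the "failed processor" is simply the second half of the index range.

```python
# Illustrative sketch only: the function names, the test matrix, and the
# block-interpolation recovery rule are assumptions, not the paper's code.

def build_system(n):
    """Tridiagonal, diagonally dominant A, with b chosen so that x = (1, ..., 1)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 4.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    b = [sum(row) for row in A]  # b = A * ones
    return A, b


def jacobi_sweep(A, b, x):
    """One Jacobi iteration for A x = b; stands in for any iterative method."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual_norm(A, b, x):
    """Infinity norm of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def recover_lost_block(A, b, x, lost, sweeps=50):
    """Rebuild the entries of x that were owned by the failed processor.

    Assumed rule: approximately solve
        A[lost, lost] * x[lost] = b[lost] - A[lost, rest] * x[rest]
    with Jacobi sweeps restricted to the lost rows; surviving entries stay fixed.
    """
    n = len(b)
    x = list(x)
    for i in lost:
        x[i] = 0.0  # the failed processor's data is gone
    for _ in range(sweeps):
        updates = {
            i: (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in lost
        }
        for i, value in updates.items():
            x[i] = value
    return x


def run_with_failure(n=8, sweeps_before=5, sweeps_after=40):
    """Iterate, lose half of the vector, recover without a checkpoint, restart."""
    A, b = build_system(n)
    x = [0.0] * n
    for _ in range(sweeps_before):
        x = jacobi_sweep(A, b, x)
    lost = range(n // 2, n)  # indices owned by the (simulated) failed processor
    x = recover_lost_block(A, b, x, lost)
    residual_at_restart = residual_norm(A, b, x)
    for _ in range(sweeps_after):
        x = jacobi_sweep(A, b, x)
    return residual_at_restart, x
```

Calling `run_with_failure()` returns the residual norm at the restart point and the final iterate. In the failure-free path only `jacobi_sweep` runs, which mirrors the abstract's point that the scheme adds no work to the solver when nothing fails; the cost is paid only inside `recover_lost_block`, and the restarted iteration begins from an approximate vector rather than a saved one.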
