Fault Tolerant Linear Algebra
In cooperation with the Self Adaptive Numerical Software (SANS) effort, we are implementing a strategy for running iterative methods under the Virtual Grid Execution System (vgES). This work uses the FT-LA (fault-tolerant linear algebra) package from the University of Tennessee together with FT-MPI and GridSolve. (More information about FT-MPI can also be found under Programming Tools.) FT-MPI is a fault-tolerant implementation of the MPI 1.2 specification being developed at the University of Tennessee, Knoxville (UTK). It adds fault tolerance by extending the error modes of MPI so that damaged communicators can be repaired. We are exploring FT-MPI as a mechanism for adding fault tolerance to MPI applications that run under vgES. The initial implementation connects to VGrADS through the GridSolve framework. GridSolve is middleware created to provide a seamless bridge between simple, standard programming interfaces, such as Fortran, C, and Matlab, and desktop Scientific Computing Environments (SCEs).
FT-LA will analyze the matrix to choose the iterative method and to estimate the number of iterations the computation will need to converge. Given that estimate, the number of operations per iteration for the specific matrix and method, and the time constraint, FT-LA will call vgES to construct an appropriate Virtual Grid (VG): in this case, a set of processors to solve the problem, most likely within a single cluster. vgES will return a set of processors along with a prediction of the number of process failures expected during the execution time. This predicted failure rate might, for example, be provided by the automatic resource characterization system. FT-LA will then set up the problem and use vgES to launch the job on the processors, monitor the health of the system, and, if required, initiate recovery from any detected faults.
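The sizing step described above can be illustrated with a rough back-of-the-envelope calculation. The following Python sketch is not the FT-LA or vgES interface. The function names, the parameters, and the simple cost model (fixed sustained rate per processor, constant parallel efficiency, independent failures at a constant rate) are all assumptions made only for illustration.

```python
import math


def estimate_processors(iterations, flops_per_iteration, time_limit_s,
                        flops_per_proc_per_s, parallel_efficiency=0.5):
    """Hypothetical estimate of how many processors to request.

    Assumes every processor sustains the same rate and that a fixed
    fraction of it (parallel_efficiency) is usable for the solve.
    """
    total_flops = iterations * flops_per_iteration
    sustained_rate = flops_per_proc_per_s * parallel_efficiency
    return max(1, math.ceil(total_flops / (time_limit_s * sustained_rate)))


def expected_failures(processors, time_limit_s, failures_per_proc_per_s):
    """Hypothetical expected failure count, assuming independent
    failures at a constant per-processor rate."""
    return processors * time_limit_s * failures_per_proc_per_s


# Illustrative numbers only: 1000 iterations of 2 Gflop each,
# a 100 s time limit, and 1 Gflop/s peak per processor.
procs = estimate_processors(1000, 2_000_000_000, 100, 1_000_000_000)
failures = expected_failures(procs, 100, 1e-6)
```

In a real deployment the failure prediction would come back from vgES rather than being computed by the caller. The second function only shows how such a prediction relates to the size and duration of the request.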
Once running on the VG, FT-LA must be able to tolerate a small number of process failures during a large-scale computation. The goal is to develop scalable fault tolerance techniques that help make future high performance computing applications self-adaptive and able to survive faults. The fundamental challenge in this research is scalability. To address this challenge, we are:
Extending existing diskless checkpointing techniques so that they scale better on large-scale high performance computing systems.
Designing checkpoint-free fault tolerance techniques for linear algebra computations to survive process failures without checkpoint or rollback recovery.
Developing coding approaches and novel erasure correcting codes to help applications to survive multiple simultaneous process failures.
The fault tolerance schemes we are developing are scalable in the sense that the overhead of tolerating the failure of a fixed number of processes does not increase as the total number of processes in the parallel system increases.
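As a minimal illustration of the idea behind checksum-based diskless checkpointing, the Python sketch below keeps an element-wise sum of all process-local data blocks and uses it to rebuild the block of a single failed process. It is a toy model under stated assumptions, not the FT-LA implementation. Processes are represented as plain lists, the function names are hypothetical, and a simple sum tolerates only one failure at a time. Surviving several simultaneous failures needs the stronger erasure codes mentioned above.

```python
def make_checksum(blocks):
    """Element-wise sum of all process-local blocks.

    In a diskless scheme this would be held in the memory of an
    extra (checkpoint) process rather than written to disk.
    """
    return [sum(column) for column in zip(*blocks)]


def recover_block(blocks, checksum, failed):
    """Rebuild the block of the failed process by subtracting the
    surviving blocks from the checksum. blocks[failed] is never read."""
    recovered = list(checksum)
    for rank, block in enumerate(blocks):
        if rank == failed:
            continue
        for j, value in enumerate(block):
            recovered[j] -= value
    return recovered


# Three hypothetical processes, each holding two values.
blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
checksum = make_checksum(blocks)

# Simulate losing process 1 and rebuilding its data.
restored = recover_block(blocks, checksum, failed=1)
```

Note that the checksum occupies one extra block of storage regardless of how many processes contribute to it. With floating-point data the recovered values can differ from the originals by rounding error, which a production scheme would need to account for.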