VGrADS at Rice University

George Bosilca, Remi Delmas, Jack Dongarra, and Julien Langou (2009)

Algorithm-based fault tolerance applied to High Performance Computing

Journal of Parallel and Distributed Computing, 69(4):410 - 416.

Abstract

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518-528] to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure has occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
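
The checksum idea behind the cited Huang and Abraham technique can be illustrated with a small serial sketch. This is not the authors' parallel subroutine; it only shows the general principle, assuming integer matrices (so checksums compare exactly, with no floating-point tolerance) and at most one corrupted entry. The function names are hypothetical and chosen for illustration. A row of column sums is appended to A and a column of row sums to B; their product then carries checksums for C, and a single bad entry shows up as exactly one inconsistent row and one inconsistent column.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row[:] + [sum(row)] for row in b]


def locate_and_correct(cf):
    """Check a full-checksum matrix in place.

    cf has shape (n+1) x (m+1): the last row and last column are checksums.
    Returns None if every checksum agrees, or the (row, col) position of the
    single data entry that was repaired. Raises ValueError otherwise.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other row entries.
        others = sum(cf[i][:m]) - cf[i][j]
        cf[i][j] = cf[i][m] - others
        return (i, j)
    raise ValueError("error pattern not correctable with a single checksum")


# Illustrative data (assumed, not taken from the paper).
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(with_column_checksum(A), with_row_checksum(B))

C_full[1][0] ^= 1 << 10          # simulate a bit-flip in one entry of C
repaired_at = locate_and_correct(C_full)
C = [row[:-1] for row in C_full[:-1]]   # strip checksums to recover C
```

After the repair, `C` equals the ordinary product of `A` and `B`, and `repaired_at` gives the position of the flipped entry. In a distributed setting the same redundancy can be held by extra processes, which is the direction the abstract describes, but that part is outside this sketch.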

Note

Published

URL http://www.sciencedirect.com/science/article/B6WKJ-4V8GB44-2/2/2658d9756341bece20e06d1485456678

by Asim YarKhan — last modified 2009-08-27 06:18
