
Daniel A Reed, Charng-da Lu, and Celso L Mendes (2006)

Reliability Challenges in Large Systems

Future Generation Computer Systems, 22(3):293–302.

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques that detect imminent failures in the environment and that allow an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.

by admin last modified 2008-04-30 12:20

