Lavanya Ramakrishnan and Daniel Reed (2008)

Performability Modeling for Scheduling and Fault Tolerance Strategies for Scientific Workflows

In: ACM/IEEE International Symposium on High Performance Distributed Computing (HPDC).

Scientific applications have diverse characteristics and resource requirements. When combined with the complexity of underlying distributed resources on which they execute (e.g., Grid, cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. To enable a dynamic environment that can account for such changes while providing required QoS, next-generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model and use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.

by Lavanya Ramakrishnan last modified 2008-05-15 11:10
VGrADS Collaborators include:

Rice University, UCSD, UH, UCSB, UTK, ISI