Daniel Nurmi, Rich Wolski, and John Brevik (2004)
Model-Based Checkpoint Scheduling for Volatile Resource Environments
University of California Santa Barbara, Department of Computer Science, Technical Report 2004-25, Santa Barbara, CA 93106.
In this paper, we describe a system for application checkpoint scheduling in volatile resource environments. Our approach combines historical measurements of resource availability with an estimate of checkpoint/recovery delay to generate checkpoint intervals that minimize overhead. When executing in a desktop computing or resource harvesting context, long-running applications must checkpoint, since resources can be reclaimed by their owners without warning. Our system records the historical availability of each resource and fits a statistical model to the observations using either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM). When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application’s execution, evaluates the expected overhead as a function of the checkpoint interval, and numerically minimizes this quantity. Using Condor as a target platform, we investigate the effectiveness of this technique when fitting exponential, Weibull, 2-phase hyperexponential, and 3-phase hyperexponential distributions to observed availability data. To verify our method and compare the distributions under identical conditions, we use observations taken from the Condor pool at the University of Wisconsin together with trace-based simulation. We examine the practical value of our approach by applying an implementation of our system to a test application running on the “live” Condor system. Finally, we conclude with a verification of the simulated results against the experimental observations. Our results indicate that application efficiency is relatively insensitive to the choice of distribution (among the ones we investigate) but that induced network load is not.
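The fit-then-optimize pipeline described in the abstract can be illustrated with a minimal sketch for the exponential case only. This is not the paper's Markov state-transition model: it assumes a simplified textbook cost model in which failures arrive at a constant rate, recovery itself is failure-free, and a failed segment is restarted from the last checkpoint. The function names, the search bracket, and the sample durations are all hypothetical and chosen purely for illustration.

```python
import math


def fit_exponential_rate(durations):
    """MLE for an exponential distribution: rate = n / sum of observations."""
    return len(durations) / sum(durations)


def overhead_ratio(interval, rate, checkpoint_cost, recovery_cost):
    """Expected wall-clock time per unit of useful work (simplified model).

    Assumption: one segment is `interval` seconds of work plus one checkpoint.
    With exponential failures and failure-free recovery, the expected time to
    finish a segment of length a is (exp(rate * a) - 1) * (1 / rate + R).
    """
    segment = interval + checkpoint_cost
    expected = (math.exp(rate * segment) - 1.0) * (1.0 / rate + recovery_cost)
    return expected / interval


def optimal_interval(rate, checkpoint_cost, recovery_cost, tol=1e-9):
    """Golden-section search for the interval minimizing overhead_ratio.

    The bracket [0.001 / rate, 10 / rate] is an illustrative assumption.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-3 / rate, 10.0 / rate
    while hi - lo > tol * hi:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if overhead_ratio(a, rate, checkpoint_cost, recovery_cost) < \
                overhead_ratio(b, rate, checkpoint_cost, recovery_cost):
            hi = b  # minimum lies in [lo, b]
        else:
            lo = a  # minimum lies in [a, hi]
    return 0.5 * (lo + hi)


# Hypothetical availability durations in seconds (not data from the paper).
sample_durations = [4000.0, 16000.0, 10000.0, 10000.0]
sample_rate = fit_exponential_rate(sample_durations)
best_interval = optimal_interval(sample_rate, checkpoint_cost=50.0,
                                 recovery_cost=50.0)
print(f"rate={sample_rate:.6f}/s, interval={best_interval:.1f}s")
```

For small checkpoint costs, the result should land close to the familiar first-order approximation sqrt(2C / rate). Handling the Weibull or hyperexponential fits would require replacing both the estimator and the cost model, since those distributions are not memoryless.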