Automatic Resource Characterization

by admin — last modified 2007-12-14 12:00

To provide the automatic resource characterization needed to build virtual grids, we are developing models and prediction techniques for resource availability. This capability is critical to good resource selection, efficient use, management for fault tolerance, and improved virtualization of resources. The Execution System team has produced several significant results in this area, including:

development of a high-performance, scalable Virtual Grid (VG) interface to the NWS for delivery of resource characterizations,
development of new statistical techniques for predicting and quantitatively characterizing machine availability and batch-queue wait times, and
development of a prototype automatic checkpoint scheduler.

Our first resource characterization system was a set of resource availability sensors for Linux/Unix workstations and Condor-controlled resources that used the Network Weather Service (NWS) to record resource status. Obtaining availability measurements suitable for automatic characterization proved more difficult than we first anticipated, because previously available sensing data (e.g. NCSA administrative logs) was not accurate enough to form models and make resource predictions, particularly with quantifiable confidence bounds. Improving the sensors was therefore a necessary precursor to our VGrADS work.

We eventually generalized the NWS into a scalable virtualization system capable of producing a quantitative “picture” of the grid resource pool, including network characteristics in addition to availability. By automatically sensing the topology of networked resources, the system constructs a virtualization by eliding redundant data and replacing it with automatically generated forecasts. For example, to produce a map of the network connectivity between hosts at UCSB and those at UTK, the system determines that the campus-to-campus network dominates the performance any host at one campus observes when communicating with any host at the other campus. Thus the campus-to-campus connectivity need only be measured once, and the resulting forecasts can be used for any pair of hosts. Because the forecasting can take place either inside the NWS or in the client API library, this system can deliver data to the vgES rapidly and scalably. We have designed and implemented a vgES–NWS interface that allows the wealth of NWS information to be concisely transferred both to the vgFAB (guiding resource selection) and to individual VGs (enabling applications to manage their use across resources in their VG). We have also tested this system with 10,000 forecasts to confirm that it performs well enough to handle grids at scale.
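The idea of measuring a campus-to-campus link once and reusing its forecast for every host pair can be sketched as follows. This is only an illustration, not the NWS or vgES interface: the class `SiteForecastMap`, the host names, the assumption of a symmetric path, and the sliding-window median used as a forecaster are all hypothetical stand-ins.

```python
from statistics import median


class SiteForecastMap:
    """Share one forecast per site pair among all host pairs (illustrative only)."""

    def __init__(self, host_to_site):
        self.host_to_site = dict(host_to_site)
        self.samples = {}  # (site_a, site_b) -> list of bandwidth samples in Mb/s

    def _key(self, host_a, host_b):
        # Unordered site pair: this sketch assumes the path is symmetric.
        sites = (self.host_to_site[host_a], self.host_to_site[host_b])
        return tuple(sorted(sites))

    def record(self, host_a, host_b, mbps):
        # A measurement between any two hosts is filed under their site pair.
        self.samples.setdefault(self._key(host_a, host_b), []).append(mbps)

    def forecast(self, host_a, host_b, window=5):
        # Stand-in forecaster: median of the most recent samples for the site pair.
        history = self.samples.get(self._key(host_a, host_b))
        if not history:
            return None
        return median(history[-window:])


# Hypothetical hosts; only one UCSB-UTK host pair is ever measured.
hosts = {"a.ucsb.edu": "UCSB", "b.ucsb.edu": "UCSB",
         "x.utk.edu": "UTK", "y.utk.edu": "UTK"}
net = SiteForecastMap(hosts)
for sample in (90.0, 110.0, 100.0):
    net.record("a.ucsb.edu", "x.utk.edu", sample)

# A different host pair on the same campuses reuses the same forecast.
print(net.forecast("b.ucsb.edu", "y.utk.edu"))  # prints 100.0
```

Storing data per site pair rather than per host pair is what keeps the amount of measurement roughly proportional to the number of site pairs instead of the number of host pairs.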

The statistical results focus on a new set of forecasting methods that produce verifiable confidence bounds for both machine availability and batch-queue waiting times. Availability predictions are critical to the development of performance-efficient fault-tolerance mechanisms. These forecasts use Maximum Likelihood Estimation (MLE) and Expectation Maximization (EM) to derive availability models from measurement data. The system uses two goodness-of-fit tests (Kolmogorov-Smirnov and Anderson-Darling) to evaluate competing models and identify the most accurate fit. Our evaluations have (somewhat surprisingly) shown that a simple two-parameter model provides nearly as much accuracy as more complex ones such as EM-fit hyperexponential models with 6 or more parameters. With a quantitatively accurate prediction of host availability, it is possible to determine both effective checkpointing strategies that adapt to current conditions and effective replication strategies in which the replicas can be virtualized into a single resource by the vgES. More generally, availability data for resources can be included in the vgES resource selection and VG construction process.
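As a rough illustration of fitting candidate availability models by MLE and ranking them with a goodness-of-fit statistic, the sketch below fits an exponential and a two-parameter Weibull to a set of availability durations and compares their Kolmogorov-Smirnov distances. The choice of the Weibull as the two-parameter family, the synthetic data, and all function names are assumptions made for this example; the text above does not name the distribution, and the Anderson-Darling test and the EM-fit hyperexponential are omitted for brevity.

```python
import math
import random


def fit_exponential(xs):
    """MLE for an exponential distribution: rate = 1 / sample mean."""
    return len(xs) / sum(xs)


def fit_weibull(xs, lo=0.01, hi=50.0, iters=100):
    """MLE for a two-parameter Weibull; requires strictly positive durations."""
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)

    def weights(k):
        # x**k rescaled by max(x)**k to avoid overflow; ratios are unchanged.
        return [math.exp(k * (lx - max_log)) for lx in logs]

    def score(k):
        # Likelihood equation for the shape k; this expression increases with k.
        w = weights(k)
        return sum(wi * lx for wi, lx in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    for _ in range(iters):  # bisection on the shape parameter
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    w = weights(shape)
    scale = math.exp(max_log) * (sum(w) / len(w)) ** (1.0 / shape)
    return shape, scale


def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and a model CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic availability durations in seconds (not measured data).
random.seed(1)
durations = [random.weibullvariate(1000.0, 0.5) for _ in range(2000)]

rate = fit_exponential(durations)
shape, scale = fit_weibull(durations)
ks_exp = ks_statistic(durations, lambda x: 1.0 - math.exp(-rate * x))
ks_wei = ks_statistic(durations, lambda x: 1.0 - math.exp(-((x / scale) ** shape)))
best = "weibull" if ks_wei < ks_exp else "exponential"
print(shape, scale, ks_exp, ks_wei, best)
```

The model with the smaller KS distance is taken as the better fit here; a fuller treatment would also convert the statistic into a significance level before accepting either model.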

To determine the effectiveness of our new techniques, we also developed an optimal checkpoint scheduler that automatically determines the durations between checkpoints that optimize application execution for a given resource. Using the same availability measures, we generate an optimal (with respect to the best-fit distribution type and parameters) checkpoint schedule. In addition to improving application performance in comparison to previous checkpointing schemes, our scheduler also dramatically reduces the network load generated by regular checkpoints that must be stored remotely, thereby reducing network contention.
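To make the notion of an optimal checkpoint interval concrete, the sketch below handles the simplest case only: exponentially distributed failures, a fixed checkpoint cost, and no recovery cost. Under those assumptions the expected time to finish a work segment of length T followed by a checkpoint of cost C is (e^{λ(T+C)} − 1)/λ, and the interval that minimizes expected time per unit of useful work can be found numerically and compared with Young's first-order approximation sqrt(2·C·MTBF). This is a stand-in for illustration, not the scheduler described above, which optimizes with respect to the best-fit distribution; the function names and the numbers are hypothetical.

```python
import math


def expected_cost_per_work(interval, checkpoint_cost, mtbf):
    """Expected wall-clock time per unit of useful work (exponential failures)."""
    lam = 1.0 / mtbf
    segment = interval + checkpoint_cost
    # Expected time to complete one segment when a failure restarts the segment.
    expected = (math.exp(lam * segment) - 1.0) / lam
    return expected / interval


def optimal_interval(checkpoint_cost, mtbf, iters=200):
    """Ternary search for the minimizing interval; the cost is unimodal in it."""
    lo, hi = 1e-3, 10.0 * mtbf
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if expected_cost_per_work(a, checkpoint_cost, mtbf) < \
                expected_cost_per_work(b, checkpoint_cost, mtbf):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical resource: 60 s per checkpoint, mean time between failures of 10 h.
cost, mtbf = 60.0, 36000.0
t_num = optimal_interval(cost, mtbf)
t_young = young_interval(cost, mtbf)
print(t_num, t_young, expected_cost_per_work(t_num, cost, mtbf))
```

For a non-exponential availability model the expected-cost function changes and the best interval generally depends on how long the resource has already been up, which is why a schedule of varying intervals, rather than a single fixed one, is produced in that case.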

In a similar vein, we have developed a technique for automatically producing a confidence range for the time a particular job will wait in a given batch queue. While we have not developed a deployable prototype (as we have for the checkpoint scheduling system), we have verified the technique using archival logging data spanning ten years at various NSF and DOE sites, an empirical evaluation that consists of over 1,000,000 predictions. This study gave us the confidence to begin work on a prototype implementation for real-grid use. The technique is now used by the slotted virtual grid mechanism to manage “probabilistic” reservations on systems that do not support true resource reservations.
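One way to attach a confidence level to a queue-wait prediction without assuming any particular distribution is a binomial argument on order statistics, sketched below. Given n past wait times, the k-th smallest is an upper bound on the q-quantile of wait time with confidence P(Binomial(n, q) ≤ k − 1), so the smallest k reaching the desired confidence gives the tightest such bound. The text above does not specify the technique in this detail, so treat this as an assumed illustration; the function names and the sample data are hypothetical.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))


def wait_time_upper_bound(waits, quantile=0.95, confidence=0.95):
    """Smallest observed wait that bounds the given quantile at the given confidence.

    Returns None when the history is too short to support the requested confidence.
    """
    ordered = sorted(waits)
    n = len(ordered)
    for k in range(1, n + 1):
        # The k-th order statistic is >= the true quantile exactly when
        # at most k - 1 samples fall below that quantile.
        if binomial_cdf(k - 1, n, quantile) >= confidence:
            return ordered[k - 1]
    return None


# Hypothetical history: 100 past waits of 1..100 minutes.
history = list(range(1, 101))
print(wait_time_upper_bound(history))       # prints 99
print(wait_time_upper_bound(history[:10]))  # prints None (too few samples)
```

A statement such as "with 95% confidence, at least 95% of jobs wait no longer than the returned value" is the kind of bound a probabilistic reservation could be built on.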

VGrADS at Rice University
