VGrADS at Rice University

Participants

admin — 2008-04-23T18:02:55Z

Relative Performance of Scheduling Algorithms in Grid Environments

admin — 2009-11-04T19:09:14Z

Effective scheduling is critical for the performance of an application launched onto the Grid environment. Finding effective scheduling algorithms for this problem is a challenging research area. Many scheduling algorithms have been proposed, studied, and compared on heterogeneous parallel computers, but few studies compare the performance of scheduling algorithms in Grid environments. The Grid is unique because of the drastic cost differences between inter-cluster and intra-cluster data transfers. In this paper, we compare several scheduling algorithms that represent two classes of schedulers used for Grid computing. We analyze the results to explain how different resource environments and workflow application structures affect the performance of these algorithms. Based on our experiments, we introduce a new measurement called effective ACP that could drastically improve the performance of some schedulers.

Model-Based Checkpoint Scheduling for Volatile Resource Environments

admin — 2008-04-30T22:20:30Z

In this paper, we describe a system for application checkpoint scheduling in volatile resource environments. Our approach combines historical measurements of resource availability with an estimate of checkpoint/recovery delay to generate checkpoint intervals that minimize overhead. When executing in a desktop computing or resource harvesting context, long-running applications must checkpoint, since resources can be reclaimed by their owners without warning. Our system records the historical availability from each resource and fits a statistical model to the observations using either Maximum Likelihood Estimation (MLE) or Expectation Maximization (EM). When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application’s execution, evaluates the expected overhead as a function of the checkpoint interval, and numerically optimizes this quantity. Using Condor as a target platform, we investigate the effectiveness of this technique fitting exponential, Weibull, 2-phase hyperexponential and 3-phase hyperexponential distributions to observed availability data. To verify our method and compare the distributions each against the same conditions, we use observations taken from the Condor pool at the University of Wisconsin and trace-based simulation. We examine the practical value of our approach by observing an implementation of our system when applied to a test application that is then run on the “live” Condor system. Finally, we conclude with a verification of the simulated results against the experimental observations. Our results indicate that application efficiency is relatively insensitive to the choice of distribution (among the ones we investigate) but that induced network load is not.
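
The idea of evaluating expected overhead as a function of the checkpoint interval and optimizing it numerically can be illustrated with a much simpler model than the Markov state-transition model the abstract describes. The sketch below assumes exponentially distributed availability only, a fixed checkpoint cost, and a fixed recovery cost that is itself not interrupted by failures. The function names, parameter names, and the golden-section search are illustrative choices, not taken from the paper.

```python
import math

def expected_segment_time(work, ckpt_cost, recovery_cost, failure_rate):
    """Expected wall-clock time to finish `work` seconds of computation plus
    one checkpoint when failures arrive at an exponential rate (assumption).
    After each failure the segment restarts and pays `recovery_cost`."""
    span = work + ckpt_cost
    return (math.exp(failure_rate * span) - 1.0) * (1.0 / failure_rate + recovery_cost)

def overhead_ratio(work, ckpt_cost, recovery_cost, failure_rate):
    """Expected wall-clock seconds spent per second of useful work."""
    return expected_segment_time(work, ckpt_cost, recovery_cost, failure_rate) / work

def golden_section_min(f, lo, hi, tol=1e-3):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0

def optimal_interval(ckpt_cost, recovery_cost, failure_rate):
    """Work interval between checkpoints that minimizes the overhead ratio."""
    objective = lambda work: overhead_ratio(work, ckpt_cost, recovery_cost, failure_rate)
    return golden_section_min(objective, 1e-3, 10.0 / failure_rate)

if __name__ == "__main__":
    # Hypothetical numbers: 10-hour mean availability, 60 s checkpoint and recovery.
    rate = 1.0 / 36000.0
    best = optimal_interval(60.0, 60.0, rate)
    print("interval (s):", round(best, 1))
    print("overhead ratio:", round(overhead_ratio(best, 60.0, 60.0, rate), 4))
```

When the checkpoint cost is small relative to the mean availability, the numerical optimum lands close to the classical first-order approximation sqrt(2 * checkpoint cost / failure rate). Replacing the exponential assumption with a Weibull or hyperexponential fit would require a different expression for the expected segment time, since those distributions are not memoryless.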

Modeling Machine Availability in Enterprise and Wide-area Distributed Computing Environments

admin — 2009-11-04T18:57:59Z

In this paper, we consider the problem of modeling machine availability in enterprise-area and wide-area distributed computing settings. Using availability data gathered from three different environments, we detail the suitability of four potential statistical distributions for each data set: exponential, Pareto, Weibull, and hyperexponential. In each case, we use software we have developed to determine the necessary parameters automatically from each data collection. To gauge suitability, we present both graphical and statistical evaluations of the accuracy with which each distribution fits each data set. For all three data sets, we find that a hyperexponential model fits slightly more accurately than a Weibull, but that both are substantially better choices than either an exponential or Pareto. We also test the independence of individual machine measurements and the stationarity of the underlying statistical process model for each data set. These results indicate that either a hyperexponential or Weibull model effectively represents machine availability in enterprise and Internet computing environments.
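
As a rough illustration of fitting candidate distributions to availability durations, the sketch below computes maximum likelihood estimates for the exponential and Weibull cases and compares their log-likelihoods. It is not the software the abstract refers to: the function names are hypothetical, the Weibull shape is found by plain bisection on the standard likelihood equation, and the Pareto and hyperexponential fits are omitted.

```python
import math
import random

def fit_exponential(samples):
    """MLE for an exponential distribution: rate = n / sum(x)."""
    n = len(samples)
    total = sum(samples)
    rate = n / total
    loglik = n * math.log(rate) - rate * total
    return rate, loglik

def fit_weibull(samples, lo=0.01, hi=10.0, iters=100):
    """MLE for a two-parameter Weibull distribution.
    The shape k solves sum(x^k ln x)/sum(x^k) - 1/k - mean(ln x) = 0,
    which is increasing in k, so bisection on [lo, hi] is sufficient.
    Assumes all samples are strictly positive and the shape lies in [lo, hi]."""
    n = len(samples)
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / n

    def score(k):
        powers = [x ** k for x in samples]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2.0
    scale = (sum(x ** shape for x in samples) / n) ** (1.0 / shape)
    loglik = (n * math.log(shape) - n * shape * math.log(scale)
              + (shape - 1.0) * sum(logs)
              - sum((x / scale) ** shape for x in samples))
    return shape, scale, loglik

if __name__ == "__main__":
    # Synthetic availability durations (seconds); real traces would replace this.
    rng = random.Random(7)
    data = [rng.weibullvariate(3600.0, 0.6) for _ in range(2000)]
    print("exponential:", fit_exponential(data))
    print("weibull:", fit_weibull(data))
```

Because the exponential distribution is the Weibull with shape 1, the Weibull log-likelihood is never lower on the same data; a fair comparison of fit quality would also use goodness-of-fit tests of the kind the abstract mentions.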

The Virtual Grid Description Language: vgDL

admin — 2008-04-30T22:20:30Z

Simple resource specification, resource selection, and effective binding are critical capabilities for Grid middleware. We describe the Virtual Grid, an abstraction for dynamic grid applications to deal with complex resource environments. Elements of the Virtual Grid include a novel resource description language (vgDL) and a resource selection and binding component (vgFAB), which accepts a vgDL specification and returns a Virtual Grid, that is, a set of selected and bound resources. The goals of vgFAB are efficiency, scalability, robustness to high resource contention, and the ability to produce results with quantifiable high quality. We present the design of vgDL, showing how it captures application-level resource abstractions using resource aggregates and connectivity amongst them.

Efficient Resource Description and High Quality Selection for Virtual Grids

admin — 2008-04-30T22:20:30Z

Simple resource specification, resource selection, and effective binding are critical capabilities for Grid middleware. We describe the Virtual Grid, an abstraction for providing these capabilities in complex resource environments. Elements of the Virtual Grid include a novel resource description language (vgDL) and a resource selection and binding component (vgFAB), which accepts a vgDL specification and returns a Virtual Grid, that is, a set of selected and bound resources. The goals of vgFAB are efficiency, scalability, robustness to high resource contention, and the ability to produce results with quantifiable high quality. We present the design of vgDL, showing how it captures application-level resource abstractions using resource aggregates and connectivity amongst them. We present and evaluate a prototype implementation of vgFAB. Our results show that resource selection and binding for virtual grids of 10,000's of resources can scale up to grids with millions of resources, identifying good matches in less than one second. Further, these matches have quantifiable quality, enabling applications to have high confidence in the results. We demonstrate the effectiveness of our combined selection and binding approach in the presence of resource contention, showing that robust selection and binding can be achieved at moderate cost.

Combined Selection and Binding for Competitive Resource Environments

admin — 2008-04-30T22:20:30Z

A critical technology for Grid computing is the ability to describe, select, and bind appropriate resources for synchronous use by applications. Our Virtual Grid Description Language (vgDL) allows applications to describe and manage resources conveniently via application-level abstractions and enables a novel approach to application-driven resource management called "finding and binding". We explore the viability of this strategy to enable applications to obtain complex resource collections in competitive resource environments. Our evaluation shows that combined resource "selection and binding" can scale well to millions of hosts, identifying good matches in a few seconds for the most complex resource requests. The combined selection and binding improves success rates for complex requests and the advantage increases for more complex requests and higher resource competition. Combined Selection and Binding can double the resource utilization (from 30% to 70%) at which synchronous resource allocation and use across as many as sixteen resource managers is possible.

Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems

admin — 2008-04-30T22:20:30Z

We tackle the problem of scheduling task graphs onto a heterogeneous set of machines, where each processor has a probability of failure governed by an exponential law. The goal is to design algorithms that optimize both makespan and reliability. First, we provide an optimal scheduling algorithm for independent unitary tasks where the objective is to maximize the reliability subject to makespan minimization. For the bi-criteria case, we provide an algorithm that approximates the Pareto-curve. Next, for independent non-unitary tasks, we show that the product failure rate × unitary instruction execution time is crucial to distinguish processors in this context. Based on these results we are able to let the user choose a trade-off between reliability maximization and makespan minimization. For general task graphs we provide a method for converting scheduling heuristics on heterogeneous clusters into heuristics that take reliability into account. Here again, we show how we can help the user to select a trade-off between makespan and reliability. Publisher: ACM.
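
To make the role of the product failure rate × unitary instruction execution time concrete, the sketch below evaluates makespan and reliability for an assignment of independent unitary tasks. It assumes each processor runs its tasks back to back from time zero and that a schedule succeeds only if no processor fails while busy, so reliability is exp(-sum of failure rate × busy time). The `Processor` type and function names are hypothetical, and this is not the paper's scheduling algorithm.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Processor:
    name: str
    failure_rate: float   # failures per second (exponential law)
    unit_time: float      # seconds to execute one unitary task

def makespan(procs, counts):
    """Completion time when processor p runs counts[p.name] unit tasks back to back."""
    return max(counts[p.name] * p.unit_time for p in procs)

def reliability(procs, counts):
    """Probability that no processor fails during its busy period (assumption above)."""
    exposure = sum(p.failure_rate * p.unit_time * counts[p.name] for p in procs)
    return math.exp(-exposure)

def rank_by_risk_per_task(procs):
    """Order processors by failure_rate * unit_time, lowest risk per task first."""
    return sorted(procs, key=lambda p: p.failure_rate * p.unit_time)

if __name__ == "__main__":
    # Hypothetical processors: A is faster but fails more often per task executed.
    procs = [Processor("A", 1e-4, 2.0), Processor("B", 5e-5, 3.0)]
    counts = {"A": 10, "B": 5}
    print([p.name for p in rank_by_risk_per_task(procs)])
    print(makespan(procs, counts), reliability(procs, counts))
```

Under these assumptions, moving a task to a processor with a smaller product always raises reliability, while makespan depends on how evenly the busy times are balanced, which is the source of the trade-off.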

Application-level Resource Provisioning on the Grid

admin — 2009-11-04T18:50:48Z

provisioning that employ agreement-based resource management. These algorithms allow user-level resource allocation and scheduling of applications that are structured as a precedence-constrained set of tasks. We present a provisioning model where the resource availability in the Grid can be enumerated as a set of slots. A slot is defined as a number of processors available from a certain start time for a certain duration at a certain cost. Using a cost model that combines the cost of resource allocation and the expected application runtime, we evaluate the performance of the Min-Min and of the Genetic algorithm (GA)-based heuristics for a range of synthetic applications. We show that the GA paired with a list scheduling algorithm can obtain significantly better solutions than the Min-Min heuristic alone.
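
For readers unfamiliar with the Min-Min baseline, the following is a minimal sketch of the classic heuristic for independent tasks. It deliberately ignores the precedence constraints, slot model, and allocation cost discussed in the abstract, and the input layout (`exec_time[t][m]`) is an assumption made for illustration.

```python
def min_min(exec_time):
    """Classic Min-Min for independent tasks.
    exec_time[t][m] is the runtime of task t on machine m.
    Returns (schedule, makespan), where schedule[t] = (machine, start, finish)."""
    n_machines = len(exec_time[0])
    ready = [0.0] * n_machines          # time at which each machine becomes free
    unscheduled = set(range(len(exec_time)))
    schedule = {}
    while unscheduled:
        best = None
        # For every unscheduled task, find its minimum completion time,
        # then commit the task whose minimum is smallest overall.
        for t in sorted(unscheduled):
            m = min(range(n_machines), key=lambda j: ready[j] + exec_time[t][j])
            finish = ready[m] + exec_time[t][m]
            if best is None or finish < best[0]:
                best = (finish, t, m)
        finish, t, m = best
        schedule[t] = (m, ready[m], finish)
        ready[m] = finish
        unscheduled.remove(t)
    return schedule, max(ready)

if __name__ == "__main__":
    # Three tasks on two machines (hypothetical runtimes).
    plan, span = min_min([[3, 5], [2, 4], [6, 1]])
    print(plan, span)
```

A GA-based approach would instead search over many candidate assignments and score each with a cost function, which is where the combined allocation cost and runtime model described above would enter.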

Cross Architecture Performance Predictions for Scientific Applications Using Parameterized Models

admin — 2008-04-30T22:20:30Z

This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architecture-neutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.

Scheduling tasks with precedence constraints on heterogeneous distributed computing systems

admin — 2009-09-29T17:13:20Z

Efficient scheduling is essential to exploit the tremendous potential of high performance computing systems. Scheduling tasks with precedence constraints is a well studied problem and a number of heuristics have been proposed. In this thesis, we first consider the problem of scheduling task graphs in heterogeneous distributed computing systems (HDCS) where the processors have different capabilities. A novel, list scheduling-based algorithm to deal with this particular situation is proposed. The algorithm takes into account the resource scarcity when assigning the task node weights. It incorporates the average communication cost between the scheduling node and its node when computing the Earliest Finish Time (EFT). Comparison studies show that our algorithm performs better than related work overall. We next address the problem of scheduling task graphs to both minimize the makespan and maximize the robustness in HDCS. These two objectives are conflicting and an epsilon-constraint method is employed to solve the bi-objective optimization problem. We give two definitions of robustness based on tardiness and miss rate. We also prove that slack is an effective metric to be used to adjust the robustness. The overall performance of a schedule must consider both the makespan and robustness. Experiments are carried out to validate the performance of the proposed algorithm. The uncertain nature of the task execution times and data transfer rates is usually neglected by traditional scheduling heuristics. We model those performance characteristics of the system as random variables. A stochastic scheduling problem is formulated to minimize the expected makespan and maximize the robustness. We propose a genetic algorithm-based approach to tackle this problem. Experimental results show that our heuristic generates schedules with smaller makespan and higher robustness compared with other deterministic approaches.
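
The Earliest Finish Time computation at the heart of list scheduling can be sketched as follows. This is the conventional definition used by list schedulers on heterogeneous processors, without insertion into idle gaps; it does not include the thesis's resource-scarcity weighting or its averaged communication term, and all names and data layouts are assumptions made for illustration.

```python
def earliest_finish_time(task, proc, exec_time, preds, comm, placed, proc_ready):
    """Conventional EFT of `task` on `proc`.
    exec_time[task][proc]: runtime of the task on that processor.
    preds[task]: list of predecessor tasks (missing key means no predecessors).
    comm[(pred, task)]: transfer time if the two tasks run on different processors.
    placed[pred]: (processor, finish_time) of each already scheduled predecessor.
    proc_ready[proc]: time at which the processor becomes free."""
    data_ready = 0.0
    for pred in preds.get(task, []):
        pred_proc, pred_finish = placed[pred]
        # Data produced on the same processor is assumed to be available at once.
        transfer = 0.0 if pred_proc == proc else comm[(pred, task)]
        data_ready = max(data_ready, pred_finish + transfer)
    start = max(proc_ready[proc], data_ready)
    return start + exec_time[task][proc]

def best_processor(task, n_procs, exec_time, preds, comm, placed, proc_ready):
    """Pick the processor that gives the smallest EFT for the task."""
    return min(
        range(n_procs),
        key=lambda p: earliest_finish_time(task, p, exec_time, preds, comm, placed, proc_ready),
    )

if __name__ == "__main__":
    # Hypothetical two-task chain a -> b on two processors.
    placed = {"a": (0, 4.0)}
    preds = {"b": ["a"]}
    comm = {("a", "b"): 3.0}
    exec_time = {"b": [5.0, 1.0]}
    proc_ready = [4.0, 0.0]
    print(best_processor("b", 2, exec_time, preds, comm, placed, proc_ready))
```

In the example, processor 1 wins even after paying the transfer cost because the task runs much faster there, which is the kind of trade-off a list scheduler resolves one task at a time.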

SCHMIB: Segregating Clusters Hierarchically Making Improved Bounds

admin — 2008-04-30T22:20:30Z

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have the option of choosing between different queues (having different charging rates) potentially on a number of different machines where they have access. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. It thus becomes desirable to provide a prediction for the amount of time a job can expect to wait in the queue at a given time. Further, it is natural to expect that attributes of an incoming job, specifically the number of processors requested and the amount of time requested, might impact that job's wait time. Previous work has shown that it is possible to determine meaningful upper-bounds on queuing delay using a simple non-parametric technique, particularly when site administrators provide information for how jobs should be grouped by processor count. In this work, we explore the possibility of generating more accurate predictions by automatically grouping jobs having similar attributes using model-based clustering. Moreover, we implement this clustering technique for a time series of jobs so that predictions of future wait times can be generated in real time. 
Using trace-based simulation on data from 7 machines over a 9-year period from across the country, comprising over one million job records, we show that clustering either by requested time or by requested number of processors generally produces more accurate predictions than the earlier more naive approaches, that automatic clustering outperforms administrator-determined clusterings, and that clustering by requested time or the product of requested nodes and requested execution time is substantially more effective than clustering by requested number of processors.

Robust task scheduling in non-deterministic heterogeneous systems

admin — 2008-04-30T22:20:30Z

Quantifying Machine Availability in Networked and Desktop Grid Systems

admin — 2009-11-04T19:00:40Z

In this paper, we examine the problem of predicting machine availability in desktop and enterprise computing environments. Predicting the duration that a machine will run until it restarts (availability duration) is critically useful to application scheduling and resource characterization in federated systems. We describe one parametric model fitting technique and two non-parametric prediction techniques, comparing their accuracy in predicting the quantiles of empirically observed machine availability distributions. We describe each method analytically and evaluate its precision using a synthetic trace of machine availability constructed from a known distribution. To detail their practical efficacy, we apply them to machine availability traces from three separate desktop and enterprise computing environments, and evaluate each method in terms of the accuracy with which it predicts availability in a trace driven simulation. Our results indicate that availability duration can be predicted with quantifiable confidence bounds and that these bounds can be used as conservative bounds on lifetime predictions. Moreover, a non-parametric method based on a binomial approach generates the most accurate estimates.
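
A binomial approach to bounding a quantile can be sketched with order statistics. The code below is a generic textbook construction under the assumption of independent, identically distributed samples from a continuous distribution; it is not claimed to be the exact procedure evaluated in the paper, and the function names are hypothetical. It returns a value that lies at or below the true q-quantile with at least the requested confidence, which is the conservative direction for availability durations.

```python
from math import comb

def binom_tail_at_least(n, q, k):
    """P(Binomial(n, q) >= k). Intended for modest n (a few hundred samples)."""
    return sum(comb(n, j) * q ** j * (1.0 - q) ** (n - j) for j in range(k, n + 1))

def quantile_lower_bound(samples, q, confidence):
    """Largest order statistic X_(k) with P(X_(k) <= q-quantile) >= confidence.
    X_(k) is at or below the q-quantile exactly when at least k of the n samples
    fall at or below it, an event with probability P(Binomial(n, q) >= k).
    Returns None when there are too few samples to support any bound."""
    ordered = sorted(samples)
    n = len(ordered)
    best_k = None
    for k in range(1, n + 1):
        if binom_tail_at_least(n, q, k) >= confidence:
            best_k = k
        else:
            break  # the tail probability only shrinks as k grows
    return None if best_k is None else ordered[best_k - 1]

if __name__ == "__main__":
    # Hypothetical availability durations in minutes.
    durations = list(range(1, 101))
    print(quantile_lower_bound(durations, 0.5, 0.95))
```

The bound makes no assumption about the shape of the availability distribution, which is why it can serve as a baseline against parametric fits; the price is that small samples may support no bound at all for extreme quantiles.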

Intelligent Monitoring for Adaptation in Grid Applications

admin — 2008-04-30T22:20:30Z

Grid applications access distributed, and often shared, resources. One consequence of this resource sharing is that measured application performance can vary widely and in unexpected ways. Determining the causes of poor performance, due to either anomalous application behavior or contention for shared resource use, and adapting to changing circumstances are critical to creation of robust Grid applications. Performance contracts and real-time adaptive control are two mechanisms to realize soft performance guarantees in Grid environments. Performance contracts formalize the relationship between application performance needs and resource capabilities. During execution, contract monitors use performance data to verify that expectations are met. When the contracted specifications are not satisfied, the system can choose to either adapt the application to available resources or reschedule the application on a new set of resources that can satisfy the original contract specifications. We describe an infrastructure for Grid application contract development and monitoring. This infrastructure, based on the Autopilot toolkit, provides flexible and scalable tools to assess both application and system behavior.