Scheduling Workflow Applications
We have developed strategies for scheduling and executing workflow applications like EMAN, EOL, LEAD, and Montage on grid resources. Our first attempts, focusing on EMAN and Montage, were described in some detail in the GrADS and VGrADS overview paper, which was supported by both awards. More recent work has extended the EMAN work and considered LEAD in more detail.
A workflow application is one in which a set of tasks is linked by producer-consumer communications. We usually represent this as a directed acyclic graph (DAG), with node weights representing computation and edge weights representing communication volume. In a grid context, the scheduler must perform two essential tasks: map the DAG onto (a subset of) the available resources, and order the tasks on each resource to respect the communication dependences. Moreover, in a realistic grid environment, different types of processors will have different computation times for the same task. (By “computation,” we mean all operations in a single task, including memory access and file I/O, which are sometimes accounted separately.) Similarly, the communication time for a given amount of data depends on which pair of processors is sending and receiving. In short, the weights in the DAG depend on the schedule, which significantly complicates the problem.
The ultimate goal of the scheduler is to solve these problems so as to produce the most efficient possible execution. That is, a workflow scheduler is fundamentally an optimization routine that attempts to minimize the execution time (“makespan”) of a set of tasks on a given set of processors. Because DAG scheduling is NP-complete, we rely on heuristic strategies to pursue this goal. VGrADS has chosen to investigate a priori (sometimes called off-line) scheduling strategies, in which the schedule is computed before execution begins. To do this, we rely on performance estimation tools also developed in VGrADS.
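The two paragraphs above can be made concrete with a small sketch. The tasks, processors, and all costs below are hypothetical; the point is only that node and edge weights depend on the mapping (same-processor communication is free, and compute time varies by processor), so the makespan can only be evaluated for a particular candidate schedule.

```python
# Hypothetical workflow DAG: node weights are abstract operation counts,
# edge weights are data volumes (MB). All numbers are illustrative.
work = {"a": 100.0, "b": 50.0, "c": 80.0, "d": 40.0}
edges = {("a", "b"): 10.0, ("a", "c"): 20.0,
         ("b", "d"): 5.0, ("c", "d"): 5.0}
speed = {"p1": 2.0, "p2": 1.0}   # ops/sec per processor (heterogeneous)
bw = 5.0                         # MB/s between distinct processors

def makespan(mapping, order):
    """Finish time of the whole DAG, given a task->processor mapping and a
    topological order. Communication is free when producer and consumer
    share a processor -- this is why the weights depend on the schedule."""
    finish = {}
    proc_free = {p: 0.0 for p in speed}   # when each processor frees up
    for t in order:
        # Earliest start: all predecessors done and their data delivered.
        est = 0.0
        for (u, v), vol in edges.items():
            if v == t:
                comm = 0.0 if mapping[u] == mapping[t] else vol / bw
                est = max(est, finish[u] + comm)
        start = max(est, proc_free[mapping[t]])
        finish[t] = start + work[t] / speed[mapping[t]]
        proc_free[mapping[t]] = finish[t]
    return max(finish.values())

all_p1 = {t: "p1" for t in work}
print(makespan(all_p1, ["a", "b", "c", "d"]))  # serial on the fast processor
```

A scheduler's job is to search over such mappings (and orders) for one minimizing this quantity, which is what makes the problem NP-complete in general.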
Our scheduler experiments to date have been generally successful, although no single method is best for all applications. A few examples show the breadth of our attempts.
In one of our preliminary runs for an EMAN experiment, a portion of the nodes experienced heavy load from competing jobs. This motivated us to examine the sensitivity of our scheduler to inaccuracy in the performance model. We produced four runs of the same data set on two of the clusters (the RTC at Rice and medusa at UH), varying two components of the scheduler:
- Using the VGrADS scheduler (“Heuristic” on the following chart) or a randomized scheduler (“Random”).
- Using the VGrADS performance model (“Accurate” on the following chart) or a simple clock-cycle-based performance model (“Inaccurate”).
We have created schedulers that make full use of Virtual Grid (VG) technology. We refer to this as the “two-phase” scheduling strategy: the first phase selects a VG, and the second phase maps the application onto the chosen VG. Our first tests found that a simple greedy mapping heuristic, coupled with our VG abstraction, produced excellent application turnaround times (defined as the total time to schedule and execute the application). Below, we reproduce a typical graph from the relevant paper. Note how the bars marked “VG” are consistently competitive in makespan (computation time) and low in scheduling time (the sum of VG creation and scheduling).
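The two-phase structure can be sketched as follows. This is not the VGrADS implementation; the resource names, costs, and the particular selection rule (take the k fastest resources) are illustrative stand-ins for the real VG selection and greedy mapping heuristics.

```python
import heapq

def select_vg(resources, k):
    """Phase 1 (toy version): pick the k fastest resources as the VG."""
    return heapq.nlargest(k, resources, key=resources.get)

def greedy_map(tasks, vg, resources):
    """Phase 2: greedily map each task, in order, to the VG node on which
    it would finish earliest, given the load already assigned there."""
    free = {r: 0.0 for r in vg}        # time at which each VG node frees up
    mapping = {}
    for t, ops in tasks:
        best = min(free, key=lambda r: free[r] + ops / resources[r])
        free[best] += ops / resources[best]
        mapping[t] = best
    return mapping

resources = {"r1": 4.0, "r2": 3.0, "r3": 1.0, "r4": 0.5}   # ops/sec
vg = select_vg(resources, 2)                                # ["r1", "r2"]
tasks = [("t1", 8.0), ("t2", 6.0), ("t3", 4.0)]
print(greedy_map(tasks, vg, resources))
```

The design point is that phase 1 shrinks the search space before phase 2 runs, which is why scheduling time stays low even when the full resource pool is large.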
We have also begun a comparative study of list-based and level-based schedulers, exemplified by the Heterogeneous Earliest Finish Time (HEFT) and Levelized Heuristic Based Scheduling (LHBS) strategies. Although the work is still in progress, we find the initial results striking. In short, they show that simple scheduling, particularly in conjunction with reducing the number of resources considered, is highly effective. In particular, we now understand better how constraining the choice of resources improves both scheduler performance and the generated schedule, by avoiding “false economies” that trade computation performance for communication performance. A representative chart of the results follows; SLR is a normalized performance measure for the makespan. The chart clearly shows that, for a selection of random DAGs, constraining the set of considered resources produces better schedules than any other heuristic tested, particularly for DAGs with more potential parallelism than the available resources can exploit. The paper reporting these results received the “Best Sustained Technical Contribution Award” at the CCGrid'07 conference.
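The “false economy” can be illustrated with a deliberately extreme toy example (all numbers hypothetical): a linear chain of tasks spread round-robin across two processors pays a transfer cost at every stage boundary and loses far more to data movement than it gains, while the same chain constrained to the single fastest resource pays nothing for communication.

```python
def chain_makespan(stage_ops, procs, speed, bw, volume):
    """Makespan of a linear chain mapped round-robin over `procs`;
    communication costs volume/bw whenever consecutive tasks change
    processor, and nothing when they stay put."""
    t = 0.0
    prev = None
    for i, ops in enumerate(stage_ops):
        p = procs[i % len(procs)]
        if prev is not None and prev != p:
            t += volume / bw            # pay to move the intermediate data
        t += ops / speed[p]
        prev = p
    return t

speed = {"fast": 2.0, "slow": 1.0}
stages = [10.0, 10.0, 10.0, 10.0]      # four chained tasks, 10 ops each
spread = chain_makespan(stages, ["fast", "slow"], speed, bw=0.5, volume=5.0)
constrained = chain_makespan(stages, ["fast"], speed, bw=0.5, volume=5.0)
print(spread, constrained)             # the constrained schedule wins
```

A chain has no parallelism at all, so this is the worst case for spreading; the study's point is that the same effect, in milder form, also hurts DAGs whose parallelism exceeds what the resources can actually exploit.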
We have developed a prototype scheduler for grids in which resources are controlled by batch queues. The key insight for this work is that top-down schedulers, such as the ones we used successfully in the EMAN application, use an Estimated Start Time (EST) for each resource to choose the next task to be mapped. In the original formulation, the EST for a task depended only on the completion times of its predecessors and the communication time to receive their data. We use predictions of batch-queue wait time derived by VGrADS researchers to refine this estimate: the new EST for mapping a task to a resource is essentially the maximum of the old EST and the time when the resource will become available through the queue. We demonstrated this system on small EMAN tests at SC'05, and a report of the results was accepted at SC'06. The chart below, taken from the paper, shows that the majority of tests benefited from knowledge of queue status at scheduling time.
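The EST refinement is a one-line change, sketched below. The function names and inputs are placeholders; in the actual system the queue-wait value comes from the batch-queue wait-time predictors developed by VGrADS researchers.

```python
def dag_est(pred_finish_times, comm_times):
    """Original EST: every predecessor has finished and its data arrived."""
    if not pred_finish_times:
        return 0.0
    return max(f + c for f, c in zip(pred_finish_times, comm_times))

def refined_est(pred_finish_times, comm_times, predicted_queue_wait):
    """Refined EST: the task also cannot start before the batch queue
    releases the resource, so take the max of the DAG-derived EST and the
    predicted queue wait for that resource."""
    return max(dag_est(pred_finish_times, comm_times), predicted_queue_wait)

# Heavily loaded queue (predicted to clear at t=120) dominates the DAG EST:
print(refined_est([50.0, 80.0], [5.0, 2.0], 120.0))   # 120.0
# Lightly loaded queue: the DAG dependences dominate:
print(refined_est([50.0, 80.0], [5.0, 2.0], 10.0))    # 82.0
```

Because the refinement only raises the EST, it steers the mapper away from resources with long predicted waits without otherwise changing the top-down scheduling algorithm.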