Lavanya Ramakrishnan, Daniel Nurmi, Anirban Mandal, Charles Koelbel, Dennis Gannon, T. M. Huang, Yang-Seok Kee, Graziano Obertelli, Kiran Thyagaraja, Rich Wolski, Asim YarKhan, and Dmitri Zagorodnov (2009)
VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
In: SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis, Portland, OR.
Today's scientific workflows use distributed heterogeneous resources
through diverse grid and cloud interfaces that are often hard to
program. In addition, especially for time-sensitive critical applications,
predictable quality of service is necessary across these distributed
resources. VGrADS' virtual grid execution system (vgES) provides a
uniform qualitative resource abstraction over grid and cloud
systems. We apply vgES to schedule a set of deadline-sensitive
weather forecasting workflows. Specifically, this paper reports on our
experiences with (1) virtualized reservations for batch-queue systems,
(2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud),
our own clusters (batch queue) and Eucalyptus (cloud) resources, and
(3) fault tolerance through automated task replication. The combined
effect of these techniques was to enable a new workflow planning
method to balance performance, reliability and cost considerations.
The results point toward improved resource selection and execution
management support for a variety of e-Science applications over grid
and cloud systems.