Fault Tolerance

by admin — last modified 2007-12-14 12:00

Grid applications have diverse requirements for performance and reliability. We address the requirements of some applications through our work on Fault-Tolerant MPI and Fault-Tolerant Linear Algebra (described elsewhere). However, other applications have needs that are difficult to enforce, given variability across grid resources. For these, we must provide support within vgES.

Although there are tools and mechanisms to monitor performance and ensure reliability, few let users express reliability policies from the application's perspective, map those policies to resource capabilities, and then coordinate and enforce the resulting strategies. To address this, we have designed extensions to the Virtual Grid (VG) API that let users state their reliability expectations when specifying resource requirements. With these extensions, the VG will

allow applications to describe their reliability needs for resource selection, either as a collective qualitative level or as specific quantitative requirements,

adjust fault tolerance levels and expectations at run time,

and allow applications to register a callback for fault conditions that may require application intervention.
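The callback registration in the last item can be pictured with a minimal sketch. Everything here is hypothetical: the class `FaultNotifier`, its methods, and the condition name `"node_failure"` are illustrative assumptions, not part of the Virtual Grid API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical callback signature: (condition name, details) -> None.
FaultCallback = Callable[[str, dict], None]


class FaultNotifier:
    """Illustrative registry; not the actual VG interface."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[FaultCallback]] = defaultdict(list)

    def register(self, condition: str, callback: FaultCallback) -> None:
        """Ask to be called back when `condition` is reported."""
        self._callbacks[condition].append(callback)

    def report(self, condition: str, details: dict) -> int:
        """Invoke every callback registered for `condition`.

        Returns the number of callbacks invoked, so the caller can tell
        whether the application asked to intervene at all.
        """
        handlers = self._callbacks.get(condition, [])
        for callback in handlers:
            callback(condition, details)
        return len(handlers)
```

An application would register a handler once, for example `notifier.register("node_failure", handler)`, and the execution system would call `report` when it detects that condition. A return value of 0 means no application intervention was requested.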

We have defined a high-level qualitative reliability metric space (from "excellent" to "poor") that users can use to request resources. For example, a user can request a HighReliabilityBag of 16 nodes. The VG maps the qualitative levels to well-defined quantitative levels of reliability (e.g., "excellent" means 90–100%) to enable runtime monitoring and adaptation. We expect that the exact definitions of the levels may vary or evolve over time.
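The mapping from qualitative to quantitative levels can be sketched as a small lookup table. Only the "excellent" range (90–100%) is taken from the text above. The intermediate level names "good" and "fair", all other ranges, and the function names are assumptions made for illustration.

```python
from typing import Dict, Tuple

# (lower bound, upper bound) in percent, listed from best to worst.
# Only "excellent" is from the text; the rest are assumed placeholders.
RELIABILITY_LEVELS: Dict[str, Tuple[float, float]] = {
    "excellent": (90.0, 100.0),
    "good": (75.0, 90.0),   # assumed
    "fair": (50.0, 75.0),   # assumed
    "poor": (0.0, 50.0),    # assumed
}


def classify(reliability_pct: float) -> str:
    """Return the qualitative level for a measured reliability percentage."""
    if not 0.0 <= reliability_pct <= 100.0:
        raise ValueError("reliability must be between 0 and 100 percent")
    # Levels are checked best-first; each lower bound is inclusive.
    for level, (low, _high) in RELIABILITY_LEVELS.items():
        if reliability_pct >= low:
            return level
    raise AssertionError("unreachable: 'poor' starts at 0")


def meets(requested_level: str, reliability_pct: float) -> bool:
    """True if a measured reliability satisfies the requested level."""
    low, _high = RELIABILITY_LEVELS[requested_level]
    return reliability_pct >= low
```

Keeping the ranges in one table means the level definitions can change, as anticipated above, without touching the code that monitors or selects resources.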

We have also defined a simple constraint and policy language that lets applications specify the conditions under which a callback is needed or an action should be applied. For example, a policy might state that if the application is using between 16 and 64 processors with less than "excellent" reliability, the task should be over-provisioned. We are experimenting with various policies that might work in this environment. The VG execution system will build on the integrated interface to the Network Weather Service (NWS) and UNC's existing Health Application Programming Interface (HAPI) toolkit to collect fault-indicating data, which will drive the fault tolerance strategy within the execution system.
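The example policy (16 to 64 processors, below "excellent" reliability, over-provision) can be written as a condition paired with an action. This is a sketch of the idea only, not the syntax of the policy language itself. The names `Policy`, `ResourceState`, `evaluate`, the action string `"over_provision"`, and the intermediate levels "fair" and "good" are all assumptions.

```python
from dataclasses import dataclass
from typing import List

# Worst to best. "fair" and "good" are assumed intermediate levels.
LEVEL_ORDER = ["poor", "fair", "good", "excellent"]


@dataclass
class ResourceState:
    processors: int    # processors the application is currently using
    reliability: str   # current qualitative reliability level


@dataclass
class Policy:
    min_processors: int
    max_processors: int
    below_level: str   # policy applies when reliability is below this
    action: str        # hypothetical action name

    def matches(self, state: ResourceState) -> bool:
        in_range = self.min_processors <= state.processors <= self.max_processors
        below = (LEVEL_ORDER.index(state.reliability)
                 < LEVEL_ORDER.index(self.below_level))
        return in_range and below


def evaluate(policies: List[Policy], state: ResourceState) -> List[str]:
    """Return the actions of every policy whose condition holds."""
    return [p.action for p in policies if p.matches(state)]
```

With `Policy(16, 64, "excellent", "over_provision")`, a state of 32 processors at "good" reliability yields the over-provisioning action, while the same processor count at "excellent" reliability yields no action.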

VGrADS at Rice University
