Fault Tolerance
Grid applications have diverse requirements for performance and reliability. We address the requirements of some applications through our work on Fault-Tolerant MPI and Fault-Tolerant Linear Algebra (described elsewhere). However, other applications have needs that are difficult to enforce, given variability across grid resources. For these, we must provide support within vgES.
Although tools and mechanisms exist to monitor performance and ensure reliability, few allow users to express reliability policies from the application’s perspective, map them to resource capabilities, and then coordinate and enforce the resulting strategies. To address this gap, we have designed extensions to the Virtual Grid API that let users state their reliability expectations when specifying resource requirements. Specifically, with these extensions the VG will
- allow applications to describe collective qualitative reliability or specific quantitative requirements for resource selection,
- adjust fault tolerance levels and expectations at run-time,
- and allow applications to register a callback for fault conditions that might require application intervention.
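The three capabilities above can be illustrated with a minimal sketch. It is not the Virtual Grid API: the names `Reliability`, `ReliabilitySpec`, `adjust`, `register_callback`, and `report_fault` are hypothetical, as are all reliability levels other than Excellent and the `min_availability` field used to stand in for a quantitative requirement.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional


class Reliability(IntEnum):
    """Hypothetical qualitative levels, ordered from lowest to highest."""
    POOR = 0
    FAIR = 1
    GOOD = 2
    EXCELLENT = 3


@dataclass
class ReliabilitySpec:
    """Hypothetical reliability portion of a resource request."""
    level: Optional[Reliability] = None       # qualitative requirement
    min_availability: Optional[float] = None  # quantitative requirement in [0, 1]
    callbacks: List[Callable[[str], None]] = field(default_factory=list)

    def adjust(self, level: Reliability) -> None:
        """Change the expected reliability level at run-time."""
        self.level = level

    def register_callback(self, handler: Callable[[str], None]) -> None:
        """Register a handler for faults that need application intervention."""
        self.callbacks.append(handler)

    def report_fault(self, description: str) -> None:
        """Pass a fault description to every registered handler."""
        for handler in self.callbacks:
            handler(description)


# Usage: request Good reliability, then raise the expectation while running.
seen: List[str] = []
spec = ReliabilitySpec(level=Reliability.GOOD, min_availability=0.99)
spec.register_callback(seen.append)
spec.adjust(Reliability.EXCELLENT)
spec.report_fault("node lost")
```

In this sketch the application supplies the handler, so the decision about how to react to a fault stays with the application rather than with the execution system.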
We have also defined a simple constraint and policy language that lets applications specify the conditions under which a callback should be invoked or an action applied. For example, a policy might state that if the application is using between 16 and 64 processors with less than Excellent reliability, the task should be over-provisioned. We are experimenting with various policies that might work in this environment. The VG execution system will build on the integrated interface to the Network Weather Service (NWS) and UNC's existing Health Application Programming Interface (HAPI) toolkit to collect fault-indicating data that will drive the fault tolerance strategy within the execution system.
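The over-provisioning example can be written as a condition paired with an action. The sketch below is only an illustration in Python, not the policy language itself: `State`, `Rule`, `actions_for`, and the reliability levels other than Excellent are hypothetical names, and treating "between 16 and 64" as inclusive is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List


class Reliability(IntEnum):
    """Hypothetical qualitative levels, ordered from lowest to highest."""
    POOR = 0
    FAIR = 1
    GOOD = 2
    EXCELLENT = 3


@dataclass(frozen=True)
class State:
    """What the policy inspects: processor count and observed reliability."""
    processors: int
    reliability: Reliability


@dataclass(frozen=True)
class Rule:
    """A condition on the state paired with the action to apply."""
    condition: Callable[[State], bool]
    action: str


def actions_for(state: State, rules: List[Rule]) -> List[str]:
    """Return the actions of every rule whose condition holds for the state."""
    return [rule.action for rule in rules if rule.condition(state)]


# The example policy from the text; the inclusive bounds are an assumption.
OVER_PROVISION = Rule(
    condition=lambda s: 16 <= s.processors <= 64
    and s.reliability < Reliability.EXCELLENT,
    action="over-provision",
)

triggered = actions_for(State(32, Reliability.GOOD), [OVER_PROVISION])
```

Here `triggered` holds the single action `"over-provision"`, because 32 processors fall inside the range and Good ranks below Excellent. A state with Excellent reliability, or with fewer than 16 processors, yields an empty list.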