Qualitative Performance Analysis for Large-scale Scientific Workflows

PhD thesis, Duke University, Durham, NC, USA.

Abstract

Today, large-scale scientific applications are both data driven and distributed.
To support the scale and inherent distribution of these applications, significant
heterogeneous and geographically distributed resources are required over long
periods of time to ensure adequate performance. Furthermore, the behavior of
these applications depends on a large number of factors related to the application,
the system software, the underlying hardware, and other running applications, as
well as potential interactions among these factors.
Most Grid application users are primarily concerned with obtaining the result
of the application as fast as possible, without worrying about the details involved
in monitoring and understanding factors affecting application performance. In this
work, we aim to provide application users with a simple and intuitive performance
evaluation mechanism that operates while their long-running Grid applications
or workflows execute. Our performance evaluation mechanism provides a qualitative and periodic
assessment of the application’s behavior by informing the user whether the application’s
performance is expected or unexpected. Furthermore, it can help improve overall
application performance by informing and guiding fault-tolerance services when the
application exhibits persistent unexpected performance behaviors.
This thesis tests two hypotheses about qualitatively assessing application
behavioral states in long-running scientific Grid applications: (1) it is necessary to extract
temporal information from performance time series data, and (2) it is sufficient to extract
variance and pattern as specific examples of temporal information. Evidence supporting
these hypotheses can lead to the ability to qualitatively assess the overall behavior of the
application and, if needed, to offer the most likely diagnosis of the underlying problem.
To test these hypotheses, we develop and evaluate a general qualitative performance
analysis framework that incorporates (a) techniques from time series analysis and machine
learning to extract and learn from data the structural and temporal features associated with
application performance in order to reach a qualitative interpretation of the application’s
behavior, and (b) mechanisms and policies to reason over time and across the distributed
resource space about the behavior of the application.
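
As a rough illustration of part (b), the sketch below shows one possible policy for reasoning over time and across resources. It is not the thesis's actual mechanism: the function name `persistent_unexpected`, the label strings, and the parameters `window`, `min_unexpected`, and `min_resources` are all hypothetical choices made for this example. The policy flags an application only when several resources each report mostly unexpected behavior over their most recent assessment periods.

```python
def persistent_unexpected(labels_by_resource, window=5, min_unexpected=4,
                          min_resources=2):
    """Hypothetical persistence policy (an assumption, not the thesis's policy).

    labels_by_resource maps a resource name to its list of periodic
    qualitative labels, each either "expected" or "unexpected".
    Returns (flag, resources), where flag is True when at least
    min_resources resources each show at least min_unexpected
    "unexpected" labels within their last `window` periods.
    """
    flagged = []
    for name, labels in labels_by_resource.items():
        recent = labels[-window:]
        # Require a full window so a short history cannot trigger the flag.
        if len(recent) == window and recent.count("unexpected") >= min_unexpected:
            flagged.append(name)
    return len(flagged) >= min_resources, sorted(flagged)


# Usage with made-up labels for three made-up resources.
labels = {
    "node-a": ["expected"] + ["unexpected"] * 4,
    "node-b": ["unexpected"] * 5,
    "node-c": ["expected"] * 5,
}
flag, resources = persistent_unexpected(labels)
print(flag, resources)
```

A result like this could be passed to a fault-tolerance service only when the flag is raised, so that a single transient unexpected period does not trigger any action.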
Experiments with two scientific applications from meteorology and astronomy compare
signatures generated from instantaneous values of performance data with those generated
from temporal characteristics. The results support the first hypothesis: temporal
information must be extracted from performance time series data to interpret the behavior
of these applications accurately. Furthermore, the temporal signatures generated for these
applications, which incorporate variance and pattern information, have distinct
characteristics during well-performing versus poorly performing executions. This distinction
allows the framework to classify instances of similar behavior accurately, which is supporting
evidence for the second hypothesis. The proposed framework’s ability to generate a qualitative
assessment of performance behavior for scientific applications using temporal information
present in performance time series data represents a step towards simplifying and improving
the quality of service for Grid applications.
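
To make the contrast between instantaneous and temporal signatures concrete, here is a minimal sketch under stated assumptions. It does not reproduce the thesis's feature extraction: the function names, the three pattern labels ("flat", "ramp", "oscillating"), the tolerance `tol`, and the sample values are all invented for illustration. It shows two windows of a metric that end on the same instantaneous value but differ clearly in variance and pattern.

```python
import statistics


def instantaneous_signature(series):
    """Signature from the most recent sample only."""
    return series[-1]


def pattern_label(series, tol=0.5):
    """Coarse, hypothetical pattern label based on successive differences."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Ignore changes smaller than the tolerance; they are treated as noise.
    significant = [d for d in diffs if abs(d) > tol]
    if not significant:
        return "flat"
    sign_changes = sum(
        1 for a, b in zip(significant, significant[1:]) if (a > 0) != (b > 0)
    )
    return "ramp" if sign_changes == 0 else "oscillating"


def temporal_signature(series, tol=0.5):
    """Signature from the whole window: (population variance, pattern label)."""
    return statistics.pvariance(series), pattern_label(series, tol)


# Two made-up windows of a utilization-like metric.
steady = [50.0, 50.1, 49.9, 50.0, 50.05, 50.0]
thrashing = [10.0, 90.0, 10.0, 90.0, 10.0, 50.0]

# The instantaneous view cannot tell the two windows apart.
print(instantaneous_signature(steady), instantaneous_signature(thrashing))

# The temporal view separates them on both variance and pattern.
print(temporal_signature(steady))
print(temporal_signature(thrashing))
```

In this toy case a classifier given only the last sample would see identical inputs, whereas one given the variance and pattern pair would see two clearly different signatures.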

by Emma Buneci — last modified 2009-09-29 06:59

VGrADS at Rice University
