<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
         xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="search_rss">
  <title>VGrADS at Rice University</title>
  <link>./</link>
  
  <description>
    These are the search results for the query, showing results 1 to 8.
  </description>

  <image rdf:resource="logo.jpg"/>

  <items>
    <rdf:Seq>
      <rdf:li rdf:resource="publications/Caniou2009HPG"/>
      <rdf:li rdf:resource="publications/Seymour2009TCP"/>
      <rdf:li rdf:resource="publications/Bosilca2009CRC"/>
      <rdf:li rdf:resource="publications/inproceedingsreference200908118895152445"/>
      <rdf:li rdf:resource="publications/Bosilca2009ABF"/>
      <rdf:li rdf:resource="publications/Brady2008"/>
      <rdf:li rdf:resource="publications/Li2008RSE"/>
      <rdf:li rdf:resource="Members/yarkhan"/>
    </rdf:Seq>
  </items>

</channel>

    <item rdf:about="publications/Caniou2009HPG">        <title>High Performance GridRPC Middleware</title>        <link>publications/Caniou2009HPG</link>        <description>A simple way to offer a Grid access through a middleware is to use the GridRPC paradigm. It is based on the classical RPC model and extended to Grid environments. Client can access to remote servers as simply as a function call. Several middlewares are compliant to this paradigm as DIET, GridSolve, or Ninf-G. Actors of these projects have worked together to design a standard API within the Open Grid Forum. In this chapter we give an overview of this standard and the current works around the data management. Three use cases are introduced through a detailled descriptions of DIET, GridSolve, and Ninf-G middleware features. Finally applications for each middleware are shown to appreciate how they take benefit of the GridRPC API.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-08-27T16:18:12Z</dc:date>        <dc:type>Incollection Reference</dc:type>    </item>
    <item rdf:about="publications/Seymour2009TCP">        <title>Transparent cross-platform access to software services using GridSolve and GridRPC</title>        <link>publications/Seymour2009TCP</link>        <description>Distributed computing can be daunting even for experienced programmers. Although many projects have been created to facilitate developing distributed applications, they are often quite complex in themselves. While many scientific applications could benefit from distributed  computing, the complexity of the programming models can be a high barrier to entry, especially since many of these applications are developed by domain scientists without extensive training in software development.  Thus, we believe that the paramount design consideration of a distributed computing model should be ease of use.  With this in mind, we will discuss GridRPC, which is a model for remote procedure call in the context of a computational Grid or other loosely coupled distributed computing environment.  Then we will discuss GridSolve, an implementation of the GridRPC model.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-09-17T20:46:52Z</dc:date>        <dc:type>Incollection Reference</dc:type>    </item>
    <item rdf:about="../icl.cs.utk.edu/news_pub/submissions/lemarinier0731.pdf">        <title>Constructing Resiliant Communication Infrastructure for Runtime Environments</title>        <link>publications/Bosilca2009CRC</link>        <description>In this paper, we present and analyze a self-stabilizing algorithm1 to transform the underlying communication infrastructure provided by the launching service into a BMG, and maintain it in spite of failures. We demonstrate that this algorithm is scalable, tolerate transient failures, and adapt itself to topology changes.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-09-16T19:53:36Z</dc:date>        <dc:type>Inproceedings Reference</dc:type>    </item>
    <item rdf:about="www.netlib.org/lapack/lawnspdf/lawn221.pdf">        <title>Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems</title>        <link>publications/inproceedingsreference200908118895152445</link>        <description>Multicore systems have increasingly gained importance in both shared-memory and distributed-memory environments. This paper presents a dynamic task scheduling approach to executing dense linear algebra algorithms on multicore systems (either shared- or distributed-memory). We use a task-based library to replace the existing linear algebra sub- routines such as PBLAS to transparently provide the same interface and computational function as the ScaLAPACK library. Linear algebra programs are written with the task- based library and executed by a dynamic runtime system. We mainly focus our runtime system design on the met- ric of performance scalability. We propose an algorithm to solve data dependences without process cooperation in a dis- tributed manner. We have implemented the runtime system and applied it to three linear algebra algorithms: Cholesky factorization, LU factorization, and QR factorization. Our experiments on both shared-memory machines (16-core In- tel Tigerton, 32-core IBM Power6) and distributed-memory machines (Cray XT4 using 1024 cores) demonstrate that our runtime system is able to achieve good scalability. Further- more, we provide analytical analysis to show why the tiled algorithms are scalable and the expected execution time.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-08-27T16:18:12Z</dc:date>        <dc:type>Inproceedings Reference</dc:type>    </item>
    <item rdf:about="../www.sciencedirect.com/science/article/B6WKJ-4V8GB44-2/2/2658d9756341bece20e06d1485456678">        <title>Algorithm-based fault tolerance applied to High Performance Computing</title>        <link>publications/Bosilca2009ABF</link>        <description>We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable &amp; Fault-Tolerant Comp.) 33 (1984) 518-528] to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65\% of the machine peak efficiency and less than 12\% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-08-27T16:18:12Z</dc:date>        <dc:type>Article Reference</dc:type>    </item>
    <item rdf:about="../hcl.ucd.ie/biblio/230">        <title>Experiments with SmartGridSolve: Achieving higher performance by improving the GridRPC model</title>        <link>publications/Brady2008</link>        <description>The paper presents SmartGridSolve, an extension of                   GridSolve, the programming system for high                   performance computing. The extension is aimed at                   higher performance of Grid applications by providing                   the functionality for collective mapping of a group                   of tasks on to a network topology that is fully                   connected. This functionality was achieved with only                   a minor addition to the GridRPC API.  The key to the                   implementation of collective mapping was to separate                   the mapping of tasks from their execution which is                   one atomic operation in the GridRPC model of                   GridSolve.  This paper demonstrates the performance                   gained by collective mapping with a real-life                   astrophysical experiment. The presented results show                   a significant speedup of 2.17 executing this                   application on a small network of two servers.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-08-27T16:18:12Z</dc:date>        <dc:type>Inproceedings Reference</dc:type>    </item>
    <item rdf:about="../icl.cs.utk.edu/projectsfiles/netsolve/pubs/gridsolve_sequencing.pdf">        <title>Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve</title>        <link>publications/Li2008RSE</link>        <description>GridSolve employs a standard RPC-based model for solving computational problems. There are two deficiencies associated with this model when a computational problem essentially forms a workflow consisting of a set of tasks, among which there exist data dependencies. First, intermediate results are passed among tasks going through the client, resulting in additional data transport between the client and the servers, which is pure overhead. Second, since the execution of each individual task is a separate RPC session, it is difficult to exploit the potential parallel ismamong tasks. This paper presents a request sequencing technique that eliminates those limitations and solves the above problems. The core features of this work include automatic DAG construction and data dependency analysis, direct inter-server data transfer, and the capability of parallel task execution.</description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2009-08-27T16:18:12Z</dc:date>        <dc:type>Article Reference</dc:type>    </item>
    <item rdf:about="Members/yarkhan">        <title>yarkhan</title>        <link>Members/yarkhan</link>        <description></description>        <dc:publisher>No publisher</dc:publisher>        <dc:creator>yarkhan</dc:creator>        <dc:rights></dc:rights>                <dc:date>2007-12-19T16:20:51Z</dc:date>        <dc:type>Folder</dc:type>    </item>

</rdf:RDF>
