Encyclopedia of Life (EOL)
The Encyclopedia of Life (EOL) is a
collaborative global project designed to catalog the complete genome of
every living species in a flexible reference system. It is an open
collaboration led by the San Diego Supercomputer Center, with developments in several major areas. In VGrADS we focused on the major computational element of EOL, the script called iGAP (integrated genome annotation pipeline). This application creates protein sequence annotations by gluing together a number of well-known community software packages. The key goal of iGAP is to discover
relationships across genomes, and thus involves extensive computation
and access to databases that contain data derived from previous
processing of multiple genomes (dozens initially and ultimately
hundreds).
iGAP's community software packages operate on input files ("sequence files") as well as on community databases (actually, flat ASCII files containing biological sequences). Like EMAN, the script is essentially a workflow application performing a pipeline of operations on each input genome. The analysis we are concerned with is shown in the figure below, in somewhat idealized form.
The overall model for the EOL computation is then as follows: a set of independent jobs (one job per genome), where each job consists of independent sub-jobs (one sub-job per sequence) and each sub-job consists of a linear sequence of several steps. At the onset of the VGrADS project, EOL was deployed in production on a multi-institution grid, with two software technologies (APST and BSMW) for managing the deployment. Our broad goal was to demonstrate the benefit of our Virtual Grid (VG) abstraction and of the VGrADS approach in general, when applied to the EOL application.
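The three-level structure described above (independent jobs, independent sub-jobs, and a linear chain of steps) can be illustrated with a minimal Python sketch. It is not the iGAP implementation: the step names, the `make_step` helper, and the thread pool are hypothetical stand-ins chosen only to make the structure concrete, and the real steps (such as PSI-BLAST and 123D) are external programs that are not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical placeholder for a pipeline step: takes data, returns data.
Step = Callable[[str], str]


def make_step(name: str) -> Step:
    """Return a placeholder step that wraps its input in the step name."""
    def step(data: str) -> str:
        return f"{name}({data})"
    return step


# Assumed three-step pipeline; the actual number and kind of steps differ.
PIPELINE: List[Step] = [make_step("step1"), make_step("step2"), make_step("step3")]


def run_sub_job(sequence: str) -> str:
    """One sub-job: apply the steps to a single sequence, strictly in order."""
    result = sequence
    for step in PIPELINE:
        result = step(result)
    return result


def run_job(sequences: List[str], pool: ThreadPoolExecutor) -> List[str]:
    """One job (one genome): its sub-jobs are independent, so they run concurrently."""
    return list(pool.map(run_sub_job, sequences))


def run_all(genomes: Dict[str, List[str]], workers: int = 4) -> Dict[str, List[str]]:
    """Run every job. Jobs are independent too; for simplicity this sketch
    handles them one after another and parallelizes only within a job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {name: run_job(seqs, pool) for name, seqs in genomes.items()}


if __name__ == "__main__":
    genomes = {"genomeA": ["seq1", "seq2"], "genomeB": ["seq3"]}
    for name, annotations in run_all(genomes).items():
        print(name, annotations)
```

Because only the steps inside a sub-job depend on one another, all sub-jobs across all genomes could in principle be scheduled at once, which is what makes a loosely coupled collection of resources a plausible fit for this kind of workload.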
As part of the evaluation and refinement of our ideas for VGs, we developed a simple resource abstraction for the EOL application, including macro-scale modeling of EOL and the details of some components (PSI-BLAST and 123D) that represent the major compute-intensive parts of the EOL pipeline. The conclusion of this analysis was much the same as that of the EMAN analysis: VGs need the homogeneous “cluster” and heterogeneous “bag of” resource abstractions, along with coarse notions of network proximity.
While this initial work with the EOL application was invaluable in defining and validating the VG concept, the initial plan of using EOL as a driver for VGrADS research and development activities beyond conceptual design was not completed. Working with the actual EOL software within the VGrADS project turned out to be infeasible, for two reasons. First, there was a legacy software problem: the EOL software turned out to be an ad hoc, intricate, and hard-to-maintain code base, and VGrADS had limited programming support. Second, the EOL software and funding were in flux, making it risky for VGrADS to participate in the project. However, VGrADS technology is eminently applicable to applications in the same domain as EOL, i.e., bioinformatics applications that use sequence databases and sequence comparison tools for identifying matching sequences. The EOL experience also pointed us to some additional candidate applications. So, in spite of this setback, we are confident that our results will carry over to future applications in this domain.