The Encyclopedia of Life (EOL)
Application Description

Henri Casanova, Andrew Chien, Richard Huang
April 4, 2004

URL: http://eol.sdsc.edu

1. Background

The Encyclopedia of Life (EOL) is a collaborative global project designed to catalog the complete genome of every living species in a flexible reference system. It is an open collaboration led by the San Diego Supercomputer Center, and currently has three major development areas: (i) creating protein sequence annotations using the integrated genome annotation pipeline (iGAP); (ii) storing these annotations in a data warehouse where they are integrated with other data sources; and (iii) a toolkit area that presents the data to users together with useful annotation and visualization tools.

In VGrADS we will focus on (i) because it is the major computational element. The key goal of (i) is to discover relationships across genomes, and it thus involves extensive computation and access to databases that contain data derived from the iGAP processing of multiple genomes (dozens initially and ultimately hundreds). However, this coupling across genomes is achieved exclusively through accesses and updates to these shared databases. In the future, (ii) may be of interest as well.

The two goals of the EOL applications effort are to achieve full automation (just hit "Go" for each genome), and to maximize the number of genomes annotated per week or per month.
This is important because new genomes are being sequenced all the time, and existing genomes are updated and may need to be reprocessed. Note that the overall genome throughput goal is different from minimizing turnaround for annotating one genome.

2. EOL Today

EOL is not a single code, but rather a script (iGAP) that glues together a number of well-known community software packages. These packages operate on input files ("sequence files") as well as on community databases. We describe both these databases and the computations used by EOL below. We only provide a high-level description and refer the reader to the papers on the EOL web site for all details (and explanations of acronyms).

2.1. The Databases

Note that in the bioinformatics context "database" does not mean a relational database but instead a flat ASCII file that contains a series of biological sequences. We will use the term "database" in this sense hereafter. EOL uses essentially two community databases of significant size:

- NCBI NR protein database [Size after installation: 3.7G]: a non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq.
- FOLDLIB [Size after installation: 2.2G]: a comprehensive protein fold database built by the EOL researchers. It contains protein domains from SCOP, domains parsed using the protein domain parser (PDP), full-length Protein Data Bank (PDB) chains, and chains not classified by SCOP but associated with SCOP using combinatorial extension (CE), a structural-similarity search algorithm. Comparative and fold recognition models of three-dimensional structure are derived from FOLDLIB.

2.2. The computation

EOL processes genomes, which can all be processed independently and in any order. Each genome consists of a number of sequences. These sequences can be processed independently and in any order.
Each sequence is processed in a series of steps (a workflow), implemented as a set of shell scripts that call community codes one after another for genomes and subsequences, as well as for the protein sequences described in the genome. This sequence of steps is called the iGAP "pipeline", a term borrowed from biological science. However, this terminology can be misleading to computer scientists, as there need not be any concurrency across stages of the pipeline. We describe below the basic steps of iGAP and then state which ones require significant computational power (it turns out that only a few of the EOL components are responsible for almost all the compute cycles needed by the application).

(1) Use a set of filter programs to determine the low-complexity regions as well as transmembrane regions, signal-peptide sequences, and coiled coils in a particular genome.
(2) Determine sequence similarity hits by pairwise sequence comparison using WU-BLAST. WU-BLAST is used because it is very fast and performed best in benchmark studies.
(3) Generate PSI-BLAST profiles for each input protein sequence against the FOLDLIB sequences. This step can be repeated several times for better accuracy.
(4) Use the 123D program to provide additional mapping to FOLDLIB using fold recognition.

PSI-BLAST and 123D far outweigh the other components in terms of computational cost.

The overall model for the EOL computation is then as follows: a set of independent jobs (one job per genome), where each job consists of independent sub-jobs (one sub-job per sequence), where each sub-job consists of a "chain" of the above 4 steps. Note that in the above chain, the PSI-BLAST step can be repeated for better accuracy. A simple benchmark on a 2GHz Pentium shows that 123D takes about 40 minutes, and that a single run of PSI-BLAST takes about 2.5 minutes. So in the typical case in which one runs PSI-BLAST 3 times (7.5 minutes in total), 123D is more expensive by a factor of approximately 5.
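
The cost figures quoted above can be turned into a rough estimate. The following Python sketch is illustrative only and is not part of iGAP: the class and function names are hypothetical, only the two dominant components are costed because the text gives no timings for the filters or WU-BLAST, and the benchmark figures are assumed to apply to a single sequence, which the text does not state explicitly.

```python
from dataclasses import dataclass

# Benchmark figures quoted in the text (2 GHz Pentium), in minutes.
# Assumption: each figure is the cost of processing one sequence.
PSI_BLAST_MINUTES = 2.5
MINUTES_123D = 40.0


@dataclass
class SubJob:
    """One sequence pushed through the four-step chain (hypothetical type)."""
    sequence_id: str
    psi_blast_runs: int = 3  # the "typical case" mentioned in the text

    def steps(self):
        # Ordered chain; the PSI-BLAST step is repeated psi_blast_runs times.
        return (["filters", "WU-BLAST"]
                + ["PSI-BLAST"] * self.psi_blast_runs
                + ["123D"])

    def dominant_minutes(self):
        # Filters and WU-BLAST are left out: no timings are given for them.
        return self.psi_blast_runs * PSI_BLAST_MINUTES + MINUTES_123D


def cost_ratio(psi_blast_runs=3):
    """How many times more expensive 123D is than all PSI-BLAST runs."""
    return MINUTES_123D / (psi_blast_runs * PSI_BLAST_MINUTES)


def ideal_genome_minutes(sub_jobs, processors):
    """Optimistic wall-clock estimate: perfect load balance, no data movement."""
    total = sum(job.dominant_minutes() for job in sub_jobs)
    return total / processors


if __name__ == "__main__":
    # Made-up workload: a genome of 1000 sequences on 100 processors.
    genome = [SubJob(f"seq{i}") for i in range(1000)]
    print(round(cost_ratio(), 2))
    print(ideal_genome_minutes(genome, processors=100))
```

With three PSI-BLAST runs the ratio is 40 / 7.5, or about 5.3, which matches the "factor 5 approximately" stated above. The genome size and processor count in the usage lines are invented for illustration.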

Protein sequences range from 30 bytes to 30 KBytes; however, large sequences are typically broken up into sub-sequences that are less than 2 KBytes or 3 KBytes. The average length is 440 bytes.

2.3. Current Deployment model

At the moment EOL is deployed on a Grid that aggregates AIX and Linux clusters in several institutions (see http://eol.sdsc.edu:8080/eol/resources.jsp). Note that a few Sun machines are required for some components that run only on that architecture. The databases are fully replicated and installed on each cluster (either on each local disk, or over GPFS, maybe even NFS).

EOL is deployed with APST (http://grail.sdsc.edu/projects/apst), which handles all logistics of application deployment (interaction with Globus, SGE, PBS, etc.), and with the Biology Workflow Management System (BWMS), which was developed specifically for EOL. Essentially, EOL researchers submit a full genome for computation to APST, and APST schedules and runs all involved sequences through the iGAP steps. APST uses all of the cluster resources symmetrically, scheduling a genome computation over the pooled cluster resources based on load and computational capabilities. Because the data is fully replicated, essentially no consideration is given to data movement cost or locality.

EOL was demonstrated in its current deployment at SC'03. A paper reporting on this deployment was accepted for publication and will be available soon. The status of ongoing computations can be seen on-line at http://eol.sdsc.edu:8080/eol/genomestatus1.jsp.

3. More general model

It is probably interesting to think about a more general model than just the EOL application described above. The model could be:

- A set of D databases that can be distributed, replicated, streamed, cached, etc.
- A set of S software components arranged in a DAG (maybe just a simple chain, maybe two separate chains, etc.), where each component may require one of the databases (or maybe more).
- The application consists either of a set of N of the above DAGs that must be processed in as little time as possible, or of an infinite set of the above DAGs that must be processed at the maximum rate.
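
The general model can be written down as a small data structure. The Python sketch below is only one possible reading of it: the Component type and all function names are hypothetical, the assignment of the NR database to WU-BLAST is an assumption (the text only states that PSI-BLAST and 123D work against FOLDLIB), and the two bounds ignore database placement and data movement entirely.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Component:
    """One software component of the DAG (hypothetical type)."""
    name: str
    databases: frozenset = frozenset()  # databases this component requires
    depends_on: tuple = ()              # names of predecessor components


def execution_order(components):
    """Return one valid execution order for a single DAG."""
    graph = {c.name: set(c.depends_on) for c in components}
    return list(TopologicalSorter(graph).static_order())


def databases_needed(components):
    """Union of the databases required anywhere in the DAG."""
    needed = set()
    for c in components:
        needed |= c.databases
    return needed


def makespan_lower_bound(n_dags, minutes_per_dag, processors):
    """Finite case: N chain-shaped DAGs, as little total time as possible."""
    # A chain cannot finish faster than its own length, however many
    # processors there are; otherwise the work is spread evenly.
    return max(minutes_per_dag, n_dags * minutes_per_dag / processors)


def max_rate(minutes_per_dag, processors):
    """Infinite case: upper bound on DAGs completed per minute."""
    return processors / minutes_per_dag


# The iGAP chain as an instance of the model.
IGAP_CHAIN = [
    Component("filters"),
    # Assumption: the text does not say which database WU-BLAST searches.
    Component("WU-BLAST", frozenset({"NR"}), ("filters",)),
    Component("PSI-BLAST", frozenset({"FOLDLIB"}), ("WU-BLAST",)),
    Component("123D", frozenset({"FOLDLIB"}), ("PSI-BLAST",)),
]

if __name__ == "__main__":
    print(execution_order(IGAP_CHAIN))
    print(sorted(databases_needed(IGAP_CHAIN)))
```

The two bound functions correspond to the two application variants in the last item: makespan_lower_bound to the finite set of N DAGs, and max_rate to the infinite set processed at maximum rate. A scheduler that accounts for database distribution or caching would do worse than these bounds.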