Workflows for Sequence Analysis on the Dutch Grid
This page provides the source code and extra information about the workflows described in: Initial steps towards a production platform for DNA sequence analysis on the grid. (BMC Bioinformatics 2010, 11:598)
Angela CM Luyf, Barbera DC van Schaik, Michel de Vries, Frank Baas, Antoine HC van Kampen, Silvia D Olabarriaga.
The workflows are developed to run on the Dutch Grid using the EBioScience infrastructure. They are described in the scufl language and can be enacted by the MOTEUR workflow management system.
If you would like to have access to the EBioScience services or want to install your own server with these services, please contact Silvia Olabarriaga: s.d.olabarriaga <at> amc.nl.
The raw sequences (as generated by the sequencer software) are converted to the FASTA format and searched in parallel in a database using BLAST and/or BLAT. After the search completes, the alignment output is filtered based on configurable criteria (this filtering step is only available for BLAST). In this case the criterion is the maximum number of hits to return per query sequence.
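The filtering step can be illustrated with a minimal Python sketch. It is not the ParseBlast Perl script shipped in the archive. It assumes tab-separated BLAST output (for example as produced by `blastall -m 8`) in which the first column is the query identifier and the hits of each query are already listed best first. The function name `filter_max_hits` is hypothetical.

```python
from collections import defaultdict


def filter_max_hits(lines, max_hits_per_query):
    """Keep at most max_hits_per_query alignments per query, in input order."""
    seen = defaultdict(int)  # number of hits kept so far for each query id
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query_id = line.split("\t", 1)[0]  # assumption: column 1 is the query id
        if seen[query_id] < max_hits_per_query:
            seen[query_id] += 1
            kept.append(line)
    return kept


# Made-up example records (query id, subject id, percent identity).
example = [
    "read1\trefA\t98.5",
    "read1\trefB\t97.0",
    "read1\trefC\t90.1",
    "read2\trefA\t99.0",
]
filtered = filter_max_hits(example, 2)
```

With a limit of 2, the third hit of `read1` is dropped and the single hit of `read2` is kept.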
Download and extract Sequence_WF.rar
This results in the directories Sequence_WF/Blast and Sequence_WF/Blat, containing the following subdirectories and files:
- bin -> executables (this is where the dependent executables must be placed)
- GasW -> xml description files of the workflow components
- perlScripts -> perl scripts, the implementation of workflow components
- Scufl -> the workflows in scufl format
- shFiles -> shell scripts, the implementation of workflow components
Current Grid workflows:
- /Sequence_WF/Blast/Scufl/sffToFasta_Blast_Blat.scufl (depends on the Blat and Blast directories)
The paths in the scufl and GasW xml files have been replaced with the placeholder "your_grid_directory"; adapt them to your own grid directory. All files are stored on an LFC server. The scufl files do not necessarily have to be placed on the LFC server.
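Adapting the placeholder by hand in every file is error-prone, so a small script can help. The following is a minimal sketch, assuming the files are UTF-8 text, that the workflow descriptions end in ".scufl" and the component descriptions in ".xml", and that the placeholder appears literally as "your_grid_directory". The function name `adapt_paths` and the example grid path are hypothetical.

```python
from pathlib import Path

PLACEHOLDER = "your_grid_directory"


def adapt_paths(root, grid_directory, suffixes=(".scufl", ".xml")):
    """Replace the placeholder in all scufl and xml files below root.

    Returns the list of files that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8")
            if PLACEHOLDER in text:
                path.write_text(
                    text.replace(PLACEHOLDER, grid_directory), encoding="utf-8"
                )
                changed.append(path)
    return changed


# Example call (the grid path is made up; use your own directory):
# adapt_paths("Sequence_WF", "grid/my_vo/my_user/Sequence_WF")
```

The script rewrites files in place, so it is worth keeping a copy of the extracted archive before running it.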
Copy the dependent executables into the "bin" directory. They are listed below and in the file "Files.txt" in this directory.
- Blat source and documentation
- Blastall, standalone application and databases
- formatdb a BLAST Database Related Tool
Roche 454's sffinfo utility is not available for download as it is part of the Roche sfftools package.
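Before uploading anything to the grid it can be useful to check that the "bin" directory is complete. The sketch below assumes the executables are named `blat`, `blastall`, `formatdb` and `sffinfo`; these names are an assumption, and the authoritative list is the file "Files.txt" in the bin directory. The function name `missing_executables` is hypothetical.

```python
import os
from pathlib import Path

# Assumed file names; consult "Files.txt" in the bin directory for the real list.
EXPECTED = ["blat", "blastall", "formatdb", "sffinfo"]


def missing_executables(bin_dir, names=EXPECTED):
    """Return the names that are absent from bin_dir or not executable."""
    missing = []
    for name in names:
        candidate = Path(bin_dir) / name
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing


# Example call:
# print(missing_executables("Sequence_WF/Blast/bin"))
```

An empty list means every expected file is present and has its executable bit set.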
Running the workflows
Start workflow using the VBrowser
Assuming you have access to the EBioScience infrastructure and have installed the VBrowser, the workflow starts by clicking on the scufl file.
User input for Sff2Fasta_Blast_ParseBlast.scufl:
- SffFile ... one or more files in sff format
- DatabaseFasta.gz ... reference database(s) in fasta format and compressed with gzip
- MaxBlastHitsPerEntry ... maximum hits to return per query sequence
The output of each component of the workflow is stored on the LFC server; the directory and filename should be adapted in the corresponding GasW xml file.
The output of the sff2Fasta component consists of sequence files in fasta format, which this component extracts from the input file in sff format. These files and the reference database are the input for the BLAST component. The output files of BLAST, together with the user-specified maximum number of hits per query, are used as input for the ParseBlast component. The ParseBlast component returns a list of BLAST alignments.
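The data flow between the first two components can be sketched as a chain of command lines. This is only an illustration under stated assumptions, not the content of the shipped shell scripts: it assumes the reference database has already been decompressed, that a nucleotide search with `blastn` is wanted, and that tabular output (`-m 8`) is acceptable. The helper names `build_commands` and `run_commands` and the output file names are hypothetical.

```python
import subprocess


def build_commands(sff_file, database_fasta, prefix="run"):
    """Describe the sff -> fasta -> BLAST chain as a list of commands."""
    fasta = f"{prefix}.fasta"
    blast_out = f"{prefix}.blast.tsv"
    return [
        # sffinfo -s writes the reads in FASTA format to standard output
        {"argv": ["sffinfo", "-s", sff_file], "stdout": fasta},
        # formatdb indexes the reference; -p F marks it as nucleotide data
        {"argv": ["formatdb", "-i", database_fasta, "-p", "F"], "stdout": None},
        # blastall searches the reads against the indexed reference
        {
            "argv": ["blastall", "-p", "blastn", "-d", database_fasta,
                     "-i", fasta, "-o", blast_out, "-m", "8"],
            "stdout": None,
        },
    ]


def run_commands(commands):
    """Run the commands in order, redirecting stdout where requested."""
    for command in commands:
        if command["stdout"] is None:
            subprocess.run(command["argv"], check=True)
        else:
            with open(command["stdout"], "w") as handle:
                subprocess.run(command["argv"], stdout=handle, check=True)


commands = build_commands("reads.sff", "reference.fasta")
```

Calling `run_commands(commands)` only works on a system where sffinfo, formatdb and blastall are installed and on the PATH. Building the list without running it makes the chain easy to inspect first.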
All files can be managed and copied to other locations (e.g. from grid storage to your own pc or server and vice versa) with the VBrowser.
Run components on a local system
At the moment the workflows can only be executed on a grid infrastructure. You will need a grid certificate and access to the EBioScience services to run them.
All shell scripts (the implementation of the workflow components) can run independently on a Unix system. In that case the files from the "bin", "perlScripts" and "shFiles" directories need to be placed in one directory. An example of how to run each shell script is available in the code of the script itself.
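Gathering the files into one directory can be done by hand or with a few lines of Python. The sketch below assumes the directory layout of the extracted archive and copies only regular files. The function name `collect_components` and the target directory name are hypothetical. If two source directories contain a file with the same name, the later one overwrites the earlier one.

```python
import shutil
from pathlib import Path


def collect_components(workflow_dir, target_dir,
                       subdirs=("bin", "perlScripts", "shFiles")):
    """Copy all files from the given subdirectories into one flat directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for sub in subdirs:
        source = Path(workflow_dir) / sub
        if not source.is_dir():
            continue  # skip subdirectories that are not present
        for item in sorted(source.iterdir()):
            if item.is_file():
                # copy2 preserves permission bits, so executables stay executable
                shutil.copy2(item, target / item.name)
                copied.append(item.name)
    return copied


# Example call:
# collect_components("Sequence_WF/Blast", "local_run")
```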
* b.d.vanschaik <at> amc.uva.nl , Barbera van Schaik
* a.c.luyf <at> amc.uva.nl , Angela Luijf