Workflows for Sequence Analysis on the Dutch Grid

This page provides the source code and extra information about the workflows described in:
Initial steps towards a production platform for DNA sequence analysis on the grid. (BMC Bioinformatics 2010, 11:598)
Angela CM Luyf, Barbera DC van Schaik, Michel de Vries, Frank Baas, Antoine HC van Kampen, Silvia D Olabarriaga.

The workflows are developed to run on the Dutch Grid using the EBioScience infrastructure. They are described in the Scufl language and can be enacted by the MOTEUR workflow management system.

If you would like access to the EBioScience services, or want to install your own server with these services, please contact Silvia Olabarriaga: s.d.olabarriaga <at> amc.nl.

The raw sequences (as generated by the sequencer software) are converted to FASTA format and searched in parallel against a database using BLAST and/or BLAT. In the BLAST workflow, the alignment output is then filtered on configurable criteria; in this case, the maximum number of hits to return per query sequence.

Download

Download and extract Sequence_WF.rar.

This results in the directories Sequence_WF/Blast and Sequence_WF/Blat, containing the following subdirectories and files:

  • bin -> executables (this is where the dependent executables must be placed)
  • GasW -> XML description files of the workflow components
  • perlScripts -> Perl scripts, the implementation of workflow components
  • Scufl -> the workflows in scufl format
  • shFiles -> shell scripts, the implementation of workflow components

Current Grid workflows:

  • /Sequence_WF/Blat/Scufl/sffToFasta_Blat.scufl
  • /Sequence_WF/Blast/Scufl/sffToFasta_Blast_Blat.scufl (depends on both the Blat and Blast directories)
  • /Sequence_WF/Blast/Scufl/sffToFasta_Blast_ParseBlast.scufl

The paths in the Scufl and GasW XML files have been replaced by the placeholder "your_grid_directory"; change this to your own grid directory. All files are stored on an LFC server, although the Scufl files do not necessarily have to be placed there.
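The placeholder substitution described above can be scripted rather than done by hand. The following is a minimal Python sketch, not part of the distributed workflows. It assumes the files are UTF-8 text and that the descriptor files end in ".scufl" or ".xml"; the function name adapt_paths and the replacement value are hypothetical.

```python
from pathlib import Path

# Placeholder string used in the distributed Scufl and GasW XML files.
PLACEHOLDER = "your_grid_directory"


def adapt_paths(root, grid_directory, suffixes=(".scufl", ".xml")):
    """Replace the placeholder in every matching file under root.

    Returns the list of files that were changed.
    Assumption: the files are UTF-8 encoded text.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8")
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, grid_directory),
                            encoding="utf-8")
            changed.append(path)
    return changed
```

A call such as adapt_paths("Sequence_WF", "/grid/my_vo/my_user") would rewrite the files in place, where the second argument is only an illustrative value and should be your actual grid directory. Working on a copy of the extracted archive keeps the original files available for comparison.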

Copy the dependent executables into the "bin" directory. They are listed below and in the file "Files.txt" in this directory.

Dependent executables:

  • Blat source and documentation
  • Blastall, standalone application and databases
  • formatdb, a BLAST database-related tool

Roche 454's sffinfo utility is not available for download as it is part of the Roche sfftools package.
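Before uploading or running anything, it can help to verify that the "bin" directory really contains the dependent executables. The sketch below is one possible check in Python. The file names in the usage note are assumptions made for illustration; the authoritative list is the "Files.txt" file in the "bin" directory.

```python
import os
from pathlib import Path


def missing_executables(bin_dir, required):
    """Return the names in required that are absent from bin_dir
    or present but not marked executable."""
    missing = []
    for name in required:
        candidate = Path(bin_dir) / name
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing
```

For example, missing_executables("Sequence_WF/Blast/bin", ["blastall", "formatdb", "sffinfo"]) returns the subset of those (assumed) names that still has to be copied in or made executable.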

Running the workflows

Start workflow using the VBrowser

Assuming you have access to the EBioScience infrastructure and have installed the VBrowser, you start the workflow by clicking on the scufl file.

User input for sffToFasta_Blast_ParseBlast.scufl:

  • SffFile ... one or more files in sff format
  • DatabaseFasta.gz ... reference database(s) in fasta format and compressed with gzip
  • MaxBlastHitsPerEntry ... maximum hits to return per query sequence

Running workflow

Output files

The output of each component of the workflow is stored on the LFC server. The output directory and filename should be adapted in the corresponding GasW description files.

The sff2Fasta component extracts the sequences from the input file in sff format and writes them as sequence files in FASTA format. These files and the reference database are the input for the BLAST component. The BLAST output files, together with the "maximum hits per query" setting, are the input for the ParseBlast component, which returns a list of BLAST alignments.
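To illustrate the kind of filtering the ParseBlast step performs, here is a minimal Python sketch that keeps at most a given number of hits per query sequence. It is not the Perl implementation shipped in "perlScripts". It assumes tab-separated BLAST output with the query identifier in the first column (such as blastall produces with -m 8) and hits listed in the order BLAST reported them; the function name limit_hits_per_query is hypothetical.

```python
from collections import defaultdict


def limit_hits_per_query(lines, max_hits):
    """Keep at most max_hits alignment lines per query, preserving input order.

    Assumption: each line is tab-separated with the query id in column 1.
    """
    seen = defaultdict(int)  # number of hits kept so far, per query id
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query_id = line.split("\t", 1)[0]
        if seen[query_id] < max_hits:
            seen[query_id] += 1
            kept.append(line)
    return kept
```

Passing an open file object as lines and the MaxBlastHitsPerEntry value as max_hits gives the reduced list of alignments. Because the sketch relies on input order, it keeps the first hits reported for each query and does not re-rank them by score.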

All files can be managed and copied to other locations (e.g. from grid storage to your own pc or server and vice versa) with the VBrowser.

Run components on a local system

At the moment the workflows can only be executed on a grid infrastructure. You will need a grid certificate and access to the EBioScience services to run them.

All shell scripts (the implementation of the workflow components) can run independently on a Unix system. In that case, the files from the "bin", "perlScripts" and "shFiles" directories need to be placed together in one directory. An example of how to run each shell script is available in the code of the script itself.
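Gathering the three directories into one can be done with a short script. The following Python sketch shows one way to do it; the function name collect_components and the target directory are hypothetical, and the sketch stops with an error if two files share a name instead of silently overwriting one of them.

```python
import shutil
from pathlib import Path


def collect_components(workflow_dir, target_dir,
                       subdirs=("bin", "perlScripts", "shFiles")):
    """Copy every file from the given subdirectories into one flat directory.

    Returns the list of copied destination paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in subdirs:
        source = Path(workflow_dir) / name
        if not source.is_dir():
            raise FileNotFoundError(f"missing directory: {source}")
        for path in sorted(source.iterdir()):
            if path.is_file():
                destination = target / path.name
                if destination.exists():
                    raise FileExistsError(f"name clash: {path.name}")
                shutil.copy2(path, destination)  # copy2 keeps permission bits
                copied.append(destination)
    return copied
```

For example, collect_components("Sequence_WF/Blast", "local_run") would fill a local_run directory (an illustrative name) from which the shell scripts can then be started as described in their own code.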

Contact

* b.d.vanschaik <at> amc.uva.nl , Barbera van Schaik
* a.c.luyf <at> amc.uva.nl , Angela Luijf

Topic revision: r13 - 2011-01-10 - AngelaLuijf
 