e-BioInfra Tutorial at the NBIC Conference:
port applications to the grid and run them as part of a workflow
19 April 2011
Location: NBIC Conference
Instructors: Barbera van Schaik, Mark Santcroos, Aldo Jongejan, Antoine van Kampen and Silvia D. Olabarriaga
NBIC conference 2011 - all tutorials: https://wiki.nbic.nl/index.php/NBIC_Conference_2011:_BioAssist_Tutorials_Installation
URL to this tutorial: http://www.bioinformaticslaboratory.nl/twiki/bin/view/BioLab/NBICTutorial
The tutorial demonstrates how to port and run applications on the grid with the eBioScience infrastructure. The participant learns how to run an existing grid workflow, wrap an application as a workflow component and link components in a workflow.
IMPORTANT: when using the VBrowser, do not delete or copy any grid files that you did not create yourself! The LFC is shared storage, and its use is based on trust.
A virtual machine is available with all software pre-installed. Alternatively, you can install the VBrowser, a wrapper script and the Moteur2 application on your own machine. The VBrowser is used to access files on grid resources, copy files to and from them, and start grid workflows. The wrapper script is used to wrap executables as workflow components. The Moteur2 application can be used to connect workflow components.
Install VirtualBox and load the eBioInfra virtual machine (VM)
- Install the latest version of VirtualBox on your machine. Binaries for a number of platforms are available on the USB stick.
- Download the VM if you did not receive a USB stick: eBioInfra.ovf (16KB), eBioInfra-disk1.vmdk (1.7GB)
- Start VirtualBox and import the eBioInfra.ovf file (File > Import Appliance)
- Start the eBioInfra virtual machine
Installation of VBrowser, wrapper script and Moteur2 on your own system
- Perl (for the executable wrapper)
- Java 1.6 (for VBrowser and Moteur2)
- Graphviz (for Moteur2)
Background and manuals:
Loading the grid certificate
The tutorial certificate was deactivated after the conference. If you would like to do this tutorial, ask Aldo or Barbera how to proceed.
Get acquainted with the interface
Start the VBrowser and activate the grid certificate
- Locate the installation of the VBrowser software on your computer (tip: menu)
- Start the VBrowser
- Log in with the guest certificate by pressing the "!" button. Leave all settings unchanged and fill in the password. Push the "Create" button and then "OK"
Copy a file to the grid and examine the stored file
- Locate directory /grid/vlemed/NBICTutorial (tip: LFC resource)
- Create a subdirectory there to store your files (referred to below as “MYDIR”). You may need to refresh the screen before the new directory appears.
- Copy a (small) file from your local computer to MYDIR and observe the messages shown on the “Transferring…” pop-up (tip: open new VBrowser window)
- Look at the properties of the new file. On which host was it stored?
- Look at the file replicas. What is the complete name of the physical file?
- Add a new replica for this file. What is the complete name of the new physical file?
- Look again at the properties of the file. On which host(s) is it stored?
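The relation between one logical file and its replicas, which the last steps explore, can be pictured as a mapping from a single logical name to several physical locations. The Python sketch below is only an illustration: the class `ReplicaCatalog`, the file name `notes.txt`, the storage hosts under `example.org` and the `srm://` URL layout are all hypothetical, and none of this reproduces the LFC interface or the names you will see in the VBrowser.

```python
from urllib.parse import urlsplit


class ReplicaCatalog:
    """Toy catalog: one logical file name maps to a list of physical replicas."""

    def __init__(self):
        self._replicas = {}

    def add_replica(self, logical_name, physical_url):
        # Adding a replica never changes the logical name, only the list behind it.
        self._replicas.setdefault(logical_name, []).append(physical_url)

    def replicas(self, logical_name):
        return list(self._replicas.get(logical_name, []))

    def hosts(self, logical_name):
        # The hosts that hold a copy are derived from the replica URLs.
        return sorted({urlsplit(url).hostname for url in self.replicas(logical_name)})


catalog = ReplicaCatalog()
lfn = "/grid/vlemed/NBICTutorial/MYDIR/notes.txt"              # hypothetical file
catalog.add_replica(lfn, "srm://se1.example.org/data/file-0001")  # hypothetical host
catalog.add_replica(lfn, "srm://se2.example.org/data/file-0002")  # hypothetical host

print(catalog.replicas(lfn))
print(catalog.hosts(lfn))
```

After the second replica is added, the logical name is unchanged while the list of hosts grows, which is what the file properties in the VBrowser should reflect.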
Run an existing workflow on the grid
- Go to the /grid/vlemed/NBICTutorial directory
- Browse to directory HelloWorld
- Right-click on the HelloWorld.gwendia file and select "View with.." > "Other.." > "ViewerMoteur2"
- Fill in your name (without spaces). Make sure you press enter after each value.
- Submit the workflow by pressing the "Run workflow" button
- The jobs can be monitored from the VBrowser and from an internet browser. To use the latter, copy and paste the URL into an internet browser.
- Monitor the workflow execution (tip: click on the workflow components)
- Locate the stdout, stderr and output files
- Look at the generated files and answer the questions
  - On which host did the job run?
  - How long did the job take to run? (tip: inspect stdout, stderr)
- Copy the result file to your desktop
Run an existing workflow with multiple input parameters
- Run the workflow again with more than one input value (press enter after each value!)
- Look at the generated files and answer the questions
  - Where did the jobs run?
  - Where were the files stored?
  - Were there any errors/warnings during the execution?
Port an application to the grid
Examine the files of the hello-world workflow
- Take a look at the hello-world.sh file (on your desktop in directory HelloWorld)
- Test the shell script via the command line
- To port an application to the grid it is necessary to properly define the in- and output parameters and files. The example shell script has one input value (inputParam) and one output file (outputFile). A wrapper has to be written for the executable.
- Take a look at the HelloWorld.xml file. This is the GASW wrapper for the executable that you would like to run on the grid.
- One or more components can be linked in a workflow. This is defined in the Gwendia language or in the Scufl language. Take a look at the HelloWorld.gwendia file to get an impression of the Gwendia language.
Port a hello-world.sh script to the grid with a wrapper script
- It is not necessary to write these files yourself; they can be generated automatically with a wrapper script.
- Open a terminal and start the command "create_GASW_SCUFL_GWENDIA". If you are not using the virtual machine, start the Perl script "create_GASW_SCUFL_GWENDIA.pl" instead. Follow the instructions on the screen. An example is given below.
  - About the LFC directory: fill in the grid directory name where you would like to place the files, e.g. /grid/vlemed/NBICtutorial/MYDIR
  - About the names of the in- and output parameters/files: they are not allowed to start with "in", "out" or "result"
  - About the template for the output file (/grid/vlemed/NBICtutorial/MYDIR/result/$na1_%s.txt): you can generate output file names automatically by supplying a template name.
    - $dir1 means: place the output file in the same directory as the input file (not applicable in this example)
    - $na1 is the name of the first input parameter/file, $na2 of the second parameter/file, etc.
    - %s is replaced by a random number, which prevents files from being overwritten accidentally
- Copy the generated files and the hello-world.sh file to your directory on the grid
- Run the workflow (the gwendia file)
This script creates a Gasw xml file, the scufl AND gwendia file with the given in- and output.
The scufl and gwendia file will have the same name as you enter for the Gasw file preceded by WF_ (e.g. WF_"name".scufl or "name".gwendia), these files are written in the output directory.
Enter the name for the output directory e.g.:
"outputXML" (will be created in working directory)
"the full path" /home/user/creategasw/outputXML)
Press enter to use the default directory: xmlfiles):
Created directory /home/ebioinfra/Desktop/workflowfiles
Enter the name for your Gasw outputfile, e.g. myGaswFile.xml.
If you enter an existing file(name), this file will be overwritten!!!
Opened file: /home/ebioinfra/Desktop/workflowfiles/MyHelloWorld.xml
Enter the directory on the LFC where you will store the Gasw xml file (e.g. /grid/vlemed/AMC-e-BioScience/thisworkflow/gasw)
Enter the access type for the executable (LFN or URL):
Enter the name of the executable:
Enter the path where the executable is stored (e.g. /grid/vlemed/user
Writing to MyHelloWorld.xml:
Enter the number of parameters to use (press enter if none):
Enter the parameter name for parameter no. 1:
Enter the parameter option for parameter no. 1 (e.g. -l)
Writing to MyHelloWorld.xml:
<input name="YourNamePlease" option="-i"/>
Enter the number of input files (press enter if none):
Enter the number of output files (press enter if none):
Enter the description for output file no. 1:
Enter the output filename AND directory for output file no. 1 : (e.g. /grid/vlemed/user/outdir/Out_Filename.txt or $dir(n)/$na(n)/%s_output_file)
Writing to MyHelloWorld.xml:
<output name="HelloTxt" option = "no0">
Enter the number of sandbox files (press enter if none):
Enter the requirement value (e.g. (other.GlueCEUniqueID == "ce.gina.sara.nl:2119/jobmanager-pbs-medium" || other.GlueCEUniqueID == "deimos.htc.biggrid.nl:2119/jobmanager-pbs-medium")
(press enter if none):
Written the files /home/ebioinfra/Desktop/workflowfiles/MyHelloWorld.xml and /home/ebioinfra/Desktop/workflowfiles/WF_MyHelloWorld.scufl and /home/ebioinfra/Desktop/workflowfiles/MyHelloWorld.gwendia .
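The output-file template used in this section ($dir1, $na1 and %s) can be illustrated with a short Python sketch. This is not the code used by the wrapper script or by the grid software. The function name `expand_template`, the choice to use the base name of an input for $naN, the range of the random number and the example input value "Barbera" are assumptions made for illustration only.

```python
import posixpath
import random
import re


def expand_template(template, inputs, rng=random):
    """Expand $dirN, $naN and %s in an output-file template (illustrative only)."""

    def replace_variable(match):
        kind, index = match.group(1), int(match.group(2))
        value = inputs[index - 1]            # $na1 / $dir1 refer to the first input
        if kind == "dir":
            return posixpath.dirname(value)  # directory of the input file
        return posixpath.basename(value)     # assumption: name without its directory

    expanded = re.sub(r"\$(dir|na)(\d+)", replace_variable, template)
    # %s becomes a random number so that repeated runs do not overwrite each other.
    return expanded.replace("%s", str(rng.randint(0, 999999)))


template = "/grid/vlemed/NBICtutorial/MYDIR/result/$na1_%s.txt"
print(expand_template(template, ["Barbera"]))  # hypothetical input value
```

Each call produces a path of the form .../result/Barbera_<number>.txt with a different number, so two runs with the same input value are unlikely to write to the same file.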
Create a workflow with multiple components with the Moteur application
You can connect components with the Gwendia workflow language. Moteur2 is a visual editor to create workflows. You will build a workflow where sequence reads (data source: 1000 Genomes Project) are aligned with BWA against one chromosome (data source: UCSC) and at the same time are analyzed by FastQC.
Copy the Gasw files to your desktop
- Go to grid directory lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF
- Copy the files BwaIllumina.xml and FastqToFastQC.xml to your desktop. You will need to examine these files later.
Create a new workflow
- Open the Moteur2 editor (menu > Applications > Moteur)
- Give the workflow a name via Edit > Rename workflow
Add the BWA component
- Add a new workflow component: right click on the canvas, select "add processor", and rename the processor, for example to "bwa" (double click on the "new processor")
- The next step is to configure the component (right click on the component and select "configure")
- The in- and output parameters must be exactly the same (case-sensitive) and in the same order as described in the Gasw file. Open BwaIllumina.xml, remove the parameters "in" and "out" in the "Ports" tab and enter the names of the in- and output parameters as described in the Gasw file
  - In: Lane, SampleName, Library, ParametersTxt, Pair1FastqGz, Pair2FastqGz, ReferenceTarGz
  - Out: Bam, Bai
- You can leave the "Type" unchanged (String)
- Go to the "Iterators" tab
- Right click on the cross sign and change it to "dot iterator". More complex iteration strategies are possible, but not from the visual editor
- Go to the "Processor" tab and select Processor type "GASW". Fill in the location of the Gasw file on grid storage including the protocol and entire path (lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/BwaIllumina.xml)
- Close the "processor bwa configuration" window
Add in- and outputs to the workflow
- Create an input node (right click on the canvas and select "add input")
- Rename the input to "Lane". The names of these input nodes do not have to match the names in the Gasw file exactly
- Connect the input to the input port of the BWA processor by selecting "Data link from" > "out" (right mouse button) and dropping this on the "Lane" input of the BWA processor
- Create inputs for the other input ports (SampleName, Library, ParametersTxt, Pair1FastqGz, Pair2FastqGz, ReferenceTarGz) and link them to the corresponding input ports of the BWA component
- Create outputs (Bam and Bai) and link the output port of the BWA component to these outputs (right click on the BWA component > data link from...)
- Save the workflow on the desktop as "myworkflow.gwendia"
Add the FastQC component
- Add a new processor and give it the name "fastqc"
- Examine the "FastqToFastQC.xml" file for the in- and output names and enter these in the "Ports" tab
- Go to the "Processor" tab, select type "GASW" and enter the URI of the fastqc gasw file (lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/FastqToFastQC.xml)
- Close the "processor fastqc configuration" window
Create an output node for the fastqc component and link the fastq-inputs to this component
- Add a new output named "fastqc_zip" to the canvas and link the output port "fileOut.zip" to this output
- Link the "Pair1FastqGz" input to the "fastqc" component
- Save the workflow
Run the workflow
- Browse to your desktop in the VBrowser (you will probably need to refresh the view before you can see your workflow)
- Start your workflow (right click > view with.. > Other.. > ViewerMoteur2)
- Enter the input parameters. You can use fake names for Lane, SampleName and Library. Don't forget to press [enter] after each value.
  - ParametersTxt: lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/bwa-default.txt
  - Pair1FastqGz: lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/sequences-1000g/SRR063088_1.filtaz.fastq.gz
  - Pair2FastqGz: lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/sequences-1000g/SRR063088_2.filtaz.fastq.gz
  - ReferenceTarGz: lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/database-hg19/index_chr13.fa.tar.gz
- Run the workflow and monitor the progress via the Web service URL. The alignment can take a while, so you may want to check the results later in the day.
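The file locations entered in this tutorial must include the protocol, the catalog host and the entire path. The Python sketch below shows how such a location breaks down into those parts and how a typed value could be checked before submitting. It uses only the standard library; the helper names `split_grid_uri` and `is_complete` are made up for this illustration and are not part of the VBrowser or Moteur2.

```python
from urllib.parse import urlsplit


def split_grid_uri(uri):
    """Split a grid file location into protocol, host, port and path."""
    parts = urlsplit(uri)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


def is_complete(uri):
    """True when protocol, host and an absolute path are all present."""
    info = split_grid_uri(uri)
    return bool(info["protocol"] and info["host"] and info["path"].startswith("/"))


uri = "lfn://lfc.grid.sara.nl:5010/grid/vlemed/NBICtutorial/Sequence_WF/bwa-default.txt"
info = split_grid_uri(uri)
print(info)
print(is_complete(uri))
print(is_complete("/grid/vlemed/NBICtutorial/Sequence_WF/bwa-default.txt"))
```

The last call returns False because a bare path lacks the protocol and host.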
When you need to enter many parameters, it is easier to edit an input-values file. You can save the parameters via the "Save to file" button and then add more values manually (or write a script to generate this file).
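Once each input has several values, the iteration strategy chosen in the "Iterators" tab determines which values are combined into a job. As a rough illustration in Python: a dot iterator pairs values by position, while a cross iterator combines every value with every other. The function names and the input values below are hypothetical, and the sketch does not cover how Moteur2 treats inputs with different numbers of values.

```python
from itertools import product


def dot_iterate(*inputs):
    """Pair values by position: the i-th value of every input forms one job."""
    return list(zip(*inputs))


def cross_iterate(*inputs):
    """Combine every value of each input with every value of the others."""
    return list(product(*inputs))


lanes = ["lane1", "lane2"]        # hypothetical values
samples = ["sampleA", "sampleB"]  # hypothetical values

print(dot_iterate(lanes, samples))    # two jobs
print(cross_iterate(lanes, samples))  # four jobs
```

For paired inputs such as Pair1FastqGz and Pair2FastqGz, pairing by position keeps the two files of one sample together, whereas a cross iterator would also combine files from different samples.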
If you would like more information about the research activities of the e-bioscience group, or access to the e-BioInfra, please contact us.