Mail your answers to the questions below to firstname.lastname@example.org before February 4, 2012
Virus discovery and other metagenomics studies can be conducted using next-generation sequencing and bioinformatics techniques. Read the story about virus discovery HERE to get acquainted with the experiments and data handling.
Sequence data from a Roche sequencer
The sequencing of the experiments was done on a Roche 454 FLX+ system.
A typical sequencing run takes about 8 hours and produces 1 GB of sequence data in SFF format.
- Why do you think the Roche sequencer is used, and not the SOLiD sequencer?
- What kind of information is stored in the SFF files that are produced by the Roche FLX+ sequencer?
Basic Local Alignment Search Tool (BLAST)
Once you have obtained the sequences, you need to identify each fragment. The program BLAST can be used for this purpose. It was developed by the NCBI in 1990 and is still very popular today. With BLAST you can search for similar sequences in public databases. For our experiments we use the "blastn" program.
- What do you use the "blastn" program for?
- What kind of sequence format does the "blastn" program expect?
Analyze the sequences below with the "Nucleotide blast" program. Select the non-redundant (NR) database for the search.
- What does the non-redundant database contain?
- To which organisms do the sequences below belong?
BLAST via the e-BioInfra gateway
A next-generation sequencing experiment generates one million sequences per sequencing run. With the NCBI BLAST you can only upload 200 sequences at a time, so analyzing a complete experiment is a very time-consuming task. In addition, you might want to repeat the analysis when a new version of the NR database becomes available, since sequences that cannot be identified today might be identified tomorrow, once more sequences have been added to the database. To scale up the analysis, the BLAST application was implemented in the e-BioInfra gateway.
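To get a feeling for the scale mentioned above, the following minimal Python sketch splits a FASTA file into batches of at most 200 sequences and computes how many uploads one run would need. It is an illustration only, not part of the gateway; the function names `read_fasta`, `batches` and `uploads_needed` are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records: Iterable[Record], size: int = 200) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` records."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def uploads_needed(n_sequences: int, size: int = 200) -> int:
    """Number of uploads needed for n_sequences (ceiling division)."""
    return -(-n_sequences // size)


if __name__ == "__main__":
    # One million sequences at 200 per upload.
    print(uploads_needed(1_000_000))
```

With one million sequences and 200 sequences per upload, a single run would require 5000 separate uploads, which is why a batch approach on the grid is attractive.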
Files with multiple sequences can be uploaded in two formats: the FASTA format and the SFF format. In the latter case the workflow will first convert the sequences to FASTA format before performing the BLAST search. In the following exercise you are going to upload a small sequence file to the grid with the e-BioInfra gateway and analyze it with BLAST. You will also analyze three larger experiments that have already been uploaded to grid storage.
Data on the data staging machine is synchronized with grid storage every 15 minutes. We will therefore first work with the data that is already stored on the grid and return to the uploaded sequences later.
- The dataset that you are going to work with is stored in the directory /grid/vlemed/ebioinfragate/studentNN/exp. Fill in "exp" for the "Custom Input Data Directory" and keep the database directory unchanged.
- Compare the three experiments with the "genbank175_ribosomal" and "genbank175_viral" databases using the default BLAST settings.
- In the Experiments list you will see all the experiments that have been performed so far with this account.
Hopefully the other sequence file has now been transferred to grid storage. If not: coffee break.
- How many jobs are running?
- At which site(s) are the jobs running? (You might have to wait until a job is complete before you see this)
- Download the result files when the analysis is complete.
The sequences that are uploaded via the gateway are stored in a default location in your account directory. Go to BLAST again, leave the directories unchanged and compare the sequences with the bacterial, viral and ribosomal databases of genbank175.
- Are the results the same as with the NCBI BLAST?
- Are the patients infected with a virus? If so, which one?
- Which group of sequences is worthwhile to explore further?
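When comparing results it can help to reduce each result file to one best hit per query sequence. The sketch below does this in Python under one important assumption: that the result file is in the 12-column tab-separated BLAST format (as produced by BLAST+ with -outfmt 6). The result files from the gateway may be in a different format, in which case the parsing has to be adapted. The function names `best_hits` and `queries_without_hits` are hypothetical.

```python
import csv
from typing import Dict, Iterable, List, Tuple

# Assumed column order of tabular BLAST output (BLAST+ -outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
Hit = Tuple[str, float, float]  # (subject id, e-value, bit score)


def best_hits(lines: Iterable[str]) -> Dict[str, Hit]:
    """Keep, for each query, the hit with the highest bit score."""
    best: Dict[str, Hit] = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip empty lines, comment lines and incomplete rows.
        if not row or row[0].startswith("#") or len(row) < 12:
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def queries_without_hits(all_queries: Iterable[str], hits: Dict[str, Hit]) -> List[str]:
    """Return the query ids that have no hit in the given results."""
    return [q for q in all_queries if q not in hits]


if __name__ == "__main__":
    # "blast_results.tsv" is a placeholder file name.
    with open("blast_results.tsv") as handle:
        for query, (subject, evalue, bitscore) in best_hits(handle).items():
            print(query, subject, evalue, bitscore, sep="\t")
```

Running `best_hits` on the result files of two databases and comparing the dictionaries shows which queries are matched in both, in one, or in neither.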
If it takes too long, or if you can't open the files, you can download the result files here (right-mouse-click-etc, otherwise your browser won't like it):
Converted SFF file
Blast results of 3 experiments against virus database
Blast result of one of the three experiments against the entire Genbank NT database
Blast result of the 6 sequences that you have uploaded