Assignment 1: Understanding Bio Seqio, Bio Seq, and Bio Seqfeature Objects

In-Class Exercises Nov 6th

cp –r /xdisk/ssolonen/ecol553_nov6/ .

Today we will be exploring two BioPerl modules, Bio::SeqIO and Bio::SearchIO, each designed to handle very diverse data types. In our examples, we will focus on two files you are already familiar with: the Genbank file containing an annotated genome and the BLAST result file containing descriptions of sequence homology between a query and a database sequence.

Inside the directory you copied above are several files: two perl scripts, a fasta file of viral protein sequences, and an annotated virus genome in Genebank format, and three files containing the word “Glimmer”. The last three relate to the homework; ignore them for now.

Assignment 1: Understanding Bio::SeqIO, Bio::Seq, and Bio::SeqFeature objects

- Open the genbank_cds_pull.pl script and try filling in the missing comments. What does each part of the script do? What objects are used?

- Try running the script on the virus genome Genebank file. Does the output match your expectations?

- Go to http://www.bioperl.org/wiki/HOWTO:Feature-Annotation and find out how to extract features using their tags. Does our script use this method?

- Visit the Bio::Seq page at search.cpan.org. What does the “new” method do in this module? Find the attributes of the Bio::Seq object that we set in our script. What other parameters can a Bio::Seq object hold?

Assignment 2: Understanding Bio::SearchIO Result, Hit, and HSP objects

- Open the blast_to_tab.pl script and try your hand at adding commentary containing what you think the function of each section of code is and what is being manipulated there.

- Try running this script on the example “Vibrio_Phage_VHML.gbk.Glimmer.bln” BLAST output file. Be sure to pipe the output to a descriptive filename, but don’t over-write the example “Vibrio_Phage_VHML.gbk.Glimmer.tab” file also found in this directory.

- Go to http://www.bioperl.org/wiki/HOWTO:SearchIO and find the methods used to access hit object information and HSP object information. What objects would you access to find e-values, descriptions for the query sequences, or descriptions of the subject sequences?

Homework 9 due Tuesday, Nov 13th

We will be building up from the scripts used in the in-class exercises. Our objective for this homework is to 1) obtain select CDS sequence data and functional descriptions from a Genbank file, 2) BLAST search each CDS against a virus protein database, and 3) parse the blast output to create a tab-delimited file comparing the CDS functional descriptions to the descriptions of homologous proteins from the virus protein database. Try matching “putative” or “hypothetical” CDS descriptions. Will any of the homologous database proteins have more assertive and/or specific functional annotations?

Part 1

A) Starting with the genbank_cds_pull.pl script, modify it so that another command line argument is introduced. This argument will be a pattern that you will use to search annotation found in the Vibrio_Phage VHML.gbk file. Modify the SeqIO output object to save to “Vibrio_Phage_VHML.gbk.[pattern].fa” file. Save your new script as genbank_cds_search.pl

B) Extract the contents of the “note” field in the annotation of each CDS. This field contains functional descriptions of the CDS region. Modify the script so that a sequence is only output if the contents of the “note” field contain the user-specified pattern (case-insensitive).

C) Fill out the Bio::Seq object by 1) capturing the CDS “protein_id” annotation and using it as the accession number for the Bio::Seq object and 2) using the contents of the “note” field as the sequence description.

Part 2

A) Build BLAST protein database from the virus protein fasta file using the “makeblastdb” command. Don’t forget to load the BLAST module from your command line first!

B) Use blastx to search your database using the CDS fasta sequences as a query. Set the e-value threshold at 1e-3. (See example from homework #2 if you need a refresher; no need to run a PBS script though, this BLAST search is small enough to run from the command line).

Part 3

A) Starting with the blast_to_tab.pl script, obtain the query description, the hit e-value, and the hit description and append them as the last columns of tab-delimited output. Save your script as blast_to_tab_mod.pl

B) Count the number of HSPs each hit got and append this number as the first column.

For an example of the files created in each part of this homework:

- Part 1, run with “Glimmer” as the search word: Vibrio_Phage_VHML.gbk.Glimmer.fa

- Part 2, run with 1e-03 e-vlaue threshold: Vibrio_Phage_VHML.gbk.Glimmer.bln

- Part 3, run on the above .bln file: Vibrio_Phage_VHML.gbk.Glimmer.tab