Genome Annotation

Bioinformatics Lab 2

Tucson High School

Introduction

Today we will learn how to annotate a genome. This process involves finding genes in a genome sequence and determining the function of those genes based on data available in public sequence databases such as GenBank. We will find and annotate genes in our genome, P-HM1, a T4-like phage that infects Prochlorococcus in the oceans.

Directions

Part 1. Gene Finding

We will look for genes in our genome sequence using a program called Prodigal. Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene-finding program developed at Oak Ridge National Laboratory and the University of Tennessee.

  1. Go to the following website:
  2. Upload the genome sequence from your desktop using the Browse button. Use the following file: genome_assembly/genome/P-HM1-genome.fa
  3. Select “Gene Coordinates with Protein Translations” under the output options.
  4. Run Prodigal by pressing “Begin prodigal analysis”.
  5. Keep the web page open to use in the next section.
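To get a feel for what a gene finder is looking for, here is a minimal Python sketch of a naive open reading frame (ORF) scan. It is not how Prodigal works: Prodigal scores candidate genes with dynamic programming, searches both strands, and considers alternative start codons. This sketch only looks on the forward strand for stretches that begin with ATG and end at a stop codon. The function name, the `min_codons` parameter, and the toy DNA string are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return 1-based (start, end) coordinates of ATG...stop ORFs
    on the forward strand, checking all three reading frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from the start codon through the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence (hypothetical, not from P-HM1): ATG AAA TTT TAG in frame 3
toy_dna = "CCATGAAATTTTAGGG"
print(find_orfs(toy_dna))  # [(3, 14)]
```

The coordinates are reported the same way Prodigal reports them in its output: the first and last base of the gene, counting from 1.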

Part 2. Gene Annotation

Using the protein sequences from the step above, we will try to find the function of several genes by comparing them to databases of known genes.

  1. Open another web browser or tab and go to the following website:
  2. Click on “protein blast” under the “Basic BLAST” header.
  3. Copy any of the protein sequences you generated above with the Prodigal gene finder and paste it into the box that says “enter accession, gi, or FASTA sequence”. For example:

>Prodigal Gene 1 # 1 # 561 # 1
MYLSLKLHFTTDTFDYFKYGNAAKASQQSFDSRRDKFFFVKLSRTFKEDELREFFVANMI
VEDKVYPATLVREGAKNYQEYLKRKQSLTYRFKEDVITLHEVSQKFDKLFIIDGMHPPLL
KAHLGGRISIETLAIFHKIFNYVENFDKIIKEEIVWRPIRNRILKYEPFIFIDKGKYKNI
IKQQYV
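The header line of the example record above carries the gene's coordinates. The short Python sketch below pulls them out and checks them against the protein length. It assumes the three numbers after the `#` signs are the start position, the end position, and the strand (1 for forward, -1 for reverse); the function name is made up for illustration.

```python
# Example record copied from this handout.
RECORD = """>Prodigal Gene 1 # 1 # 561 # 1
MYLSLKLHFTTDTFDYFKYGNAAKASQQSFDSRRDKFFFVKLSRTFKEDELREFFVANMI
VEDKVYPATLVREGAKNYQEYLKRKQSLTYRFKEDVITLHEVSQKFDKLFIIDGMHPPLL
KAHLGGRISIETLAIFHKIFNYVENFDKIIKEEIVWRPIRNRILKYEPFIFIDKGKYKNI
IKQQYV"""

def parse_prodigal_record(text):
    """Split one FASTA record into its header fields and protein sequence.
    Assumes the header looks like: >name # start # end # strand"""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    name, start, end, strand = [field.strip() for field in header.split("#")]
    protein = "".join(line.strip() for line in lines[1:])
    return {
        "name": name,
        "start": int(start),
        "end": int(end),
        "strand": "+" if int(strand) == 1 else "-",
        "protein": protein,
    }

gene = parse_prodigal_record(RECORD)
gene_length_nt = gene["end"] - gene["start"] + 1
print(gene["name"], gene["strand"], gene_length_nt, len(gene["protein"]))
# Prodigal Gene 1 + 561 186
```

The gene is 561 bases long, which is 187 codons. The last codon is the stop codon, so the protein has 186 amino acids.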

  4. Leave all other options as default and click on “BLAST”.
  5. Look at the top matches for your protein sequence and answer the following questions to yourself. Does the sequence match a phage? How good are the matches? How much of the sequence matches (coverage), and how many mismatches do you have (identity)? Do you believe your protein is exactly the same as, similar to, or not at all the same as what it hit?
  6. Repeat steps 1-5 for 10 random genes and fill out the table below with the best hit, or a very close hit, to a phage.

Gene ID / Hit description / Coverage / Identity / E-value / Phage? / Function?
Gene 1 / gp59 [Prochlorococcus phage P-SSM4] / 97% / 36% / 4e-29 / yes / helicase
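As a rough guide to reading the coverage and identity columns, the Python sketch below computes both numbers for a tiny made-up alignment. It is a simplification under stated assumptions: identity is taken as the share of alignment columns where the two sequences have the same letter, and coverage as the share of the query's letters that fall inside the alignment. The two aligned strings and the query length of 10 are hypothetical and do not come from a real BLAST result.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one alignment.
    Gaps in either aligned sequence are written as '-'."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    # Columns where both sequences show the same amino acid.
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    identity = 100 * matches / len(aligned_query)
    # Query letters (not gaps) that take part in the alignment.
    query_residues = sum(ch != "-" for ch in aligned_query)
    coverage = 100 * query_residues / query_length
    return identity, coverage

# Hypothetical example: 8 of the query's 10 amino acids align to the hit,
# and 6 of those 8 columns are identical.
identity, coverage = identity_and_coverage("MYLSLKLH", "MYISLRLH", 10)
print(identity, coverage)  # 75.0 80.0
```

A hit can have high coverage but low identity, like Gene 1 in the table (97% and 36%): almost the whole protein lines up with the hit, but only about a third of the aligned positions are the same amino acid.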

Questions

  1. What is coverage a measure of?
  2. What is identity a measure of?
  3. Were you able to find a hit for most, some, or only a few of the genes you compared against GenBank (a public sequence repository)? Why do you think this is?
  4. For the gene with the best hit of the ten above (based on coverage and identity), does the gene code for something that is common in phages or specific to our phage? Why do you think this is?
  5. Suppose the best hit for a phage gene was to a bacterium rather than a phage. What are some possible explanations for this?