Assignment 1 (50 points)

The goals of this exercise:

* To assess the significance of the similarity between sequences (alignment, P and E values…)

* To get experience with some of the basic multiple sequence alignment procedures

* To see how different algorithms produce different alignments

* To try one example of alignment-based database searching

* To determine common motif elements within a given set of amino acid or nucleotide sequences

* To practice using BioEdit to edit sequences and alignments

Part 1: Find a pair of DNA sequences that show no significant similarity but are homologous

1. Choose a protein-coding DNA sequence (save it in FASTA format)

2. Translate it (save the translation in FASTA format)

3. Search a database using the protein sequence

4. Choose a significant but distant hit

5. Get its original DNA sequence (save it in FASTA format)

6. Compare this sequence to the initial DNA sequence (Alignments, P and E values…)
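
Step 2 above can also be done by hand in a few lines of Python, which is a useful check on whatever translation tool you use. The sketch below is only an illustration: it assumes the standard genetic code, reading frame 1, and a sequence that starts at the first base of the coding region. The function names translate and to_fasta are made up for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G and the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence in frame 1 up to the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_fasta("example_protein", translate("ATGGCCTAA")), end="")
```

If your sequence comes from an organism or organelle with a different genetic code, the table above does not apply and must be replaced.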

* Submit

1. The two DNA sequences, with any relevant information about them

2. Their translations

3. The top list of hits (~20) from the database search (not the alignments!)

4. The alignment of the two DNA sequences

5. The alignment of the two protein sequences

6. The assessment of similarity significance for the DNA and the protein comparisons
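
When writing up the significance assessment, it can help to see how E values and P values relate. For BLAST-style statistics, E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths, and P = 1 - e^(-E). The sketch below is only an illustration with made-up function names. Real search programs use "effective" lengths with edge corrections, so the E values they report will differ somewhat from this simple formula.

```python
import math


def evalue_from_bitscore(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses raw lengths; search programs apply effective-length corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, pvalue_from_evalue(e))
```

For small E (below about 0.01) P and E are nearly equal, which is why the two are often used interchangeably for strong hits. For large E they diverge, because P can never exceed 1.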

Part 2: Find a pair of protein sequences (A and C) that show no significant similarity but are homologous

1. Choose a protein sequence A (or a coding DNA sequence, and translate it). Pick a sequence that interests you.

2. Search a database of your choice with protein sequence A (see slides)

3. Pick a significant but distant hit, B

4. Search a database with protein sequence B

5. Pick a significant but distant hit, C (C should probably not be in A's search output, but it might be there with a high E value)

6. Compare A, B and C in pairs (see lecture notes for the websites)

7. Show that there is no significant similarity between A and C

8. Show the significant similarities

-between A and B

-between B and C

9. Pay attention to the region of overlap between A-B and B-C. If they are not "the same", repeat from step 4, but search with only the part of B that is similar to A

* Submit

1. The three protein sequences with relevant information

2. The three pairwise alignments

3. The three significance estimates and conclusions
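
The check in step 9 comes down to comparing two coordinate ranges on sequence B: the part of B aligned to A, and the part of B aligned to C. The sketch below is only an illustration. It assumes 1-based inclusive coordinates as printed in typical alignment output, and the function name overlap_on_b and the example coordinates are made up.

```python
def overlap_on_b(ab_region, bc_region):
    """Compare two 1-based inclusive (start, end) ranges on sequence B.

    Returns (overlap_length, fraction), where fraction is the overlap
    divided by the length of the shorter of the two ranges.
    """
    start = max(ab_region[0], bc_region[0])
    end = min(ab_region[1], bc_region[1])
    overlap = max(0, end - start + 1)
    shorter = min(
        ab_region[1] - ab_region[0] + 1,
        bc_region[1] - bc_region[0] + 1,
    )
    return overlap, overlap / shorter


if __name__ == "__main__":
    # Hypothetical coordinates: B residues 10-110 align to A, 60-200 to C.
    print(overlap_on_b((10, 110), (60, 200)))
```

A fraction near 1 suggests that A-B and B-C share the same region of B. A fraction near 0 suggests that A and C match different domains of B, in which case the homology between A and C does not follow.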

Part 3: Multi-sequence alignment

1. Pick a protein sequence, again using one that interests you

2. Search SwissProt (http://ca.expasy.org/tools/blast/) or NCBI (http://www.ncbi.nih.gov/BLAST/), using any pairwise-based database search program

* Keep the output for later on!

3. Choose at least 3 significant but not identical hits

* They should be no more than 80% identical with the query sequence

* For interesting results, they should also be <80% identical among themselves.

Fill in the table below with the pairwise % identity between the hits:

        Hit1    Hit2    Hit3
Hit1     -       %       %
Hit2     -       -       %
Hit3     -       -       -

* The more distant the sequences are, the more interesting the results will be

4. Align the query sequence and the hits using ClustalW or ClustalX. Both require FASTA-formatted input files.

5. Align the same sequences using the BlockMaker server (at http://blocks.fhcrc.org/blocks/blockmkr/make_blocks.html)

* Submit the blocks produced, in text format

* Answer the following questions:

Were all the regions aligned by BlockMaker aligned similarly by Clustal? Conversely, are there regions that were aligned well by Clustal that BlockMaker didn't report?

If not, can you explain the differences? Which result looks more reliable?

If yes, which program do you prefer, and why?
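
The % identity table in step 3 can be filled in from pairwise alignments of the hits. The sketch below is only an illustration with made-up function names and toy sequences. It assumes the two sequences are already aligned to equal length with "-" as the gap character, and it counts identity over columns where neither sequence has a gap. Other programs use different denominators (for example, the full alignment length), so their percentages may differ slightly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence is gapped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x != gap and y != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_table(aligned):
    """Percent identity for every unordered pair in a dict of name -> aligned sequence."""
    names = list(aligned)
    return {
        (a, b): percent_identity(aligned[a], aligned[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Toy aligned sequences, not real data.
    hits = {"Hit1": "ACDE-G", "Hit2": "ACDQWG", "Hit3": "AC-QWG"}
    for pair, pid in identity_table(hits).items():
        print(pair, round(pid, 1))
```

Only the upper triangle of the table is computed, which matches the cells marked % in the table for step 3.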
