Internet-based sequence analyses 2014-03-19

Molecular ecology and evolution (BIOR25)

1. Internet-based sequence analyses

The aim of this DryLab is to use the internet (GenBank) to find cytochrome b gene sequences from primates, align these sequences, and measure some basic distances between them. With the tools you learn in this DryLab, you will be able to compare DNA sequences that you have generated yourself with sequences from other species stored in, for example, GenBank. The same methods can also be used to identify species from DNA sequences.

1.1 The approach in this lab is to find DNA sequences of the cytochrome b gene from some species of primates, align these sequences, and calculate sequence divergence at the DNA and amino acid levels. Known sequences can be found in GenBank via Entrez Nucleotide at NCBI (you can later use the same methods and tools to analyse sequences from your own lab work).

Human / Homo sapiens
Chimpanzee / Pan troglodytes
Gorilla / Gorilla gorilla
Sumatran orangutan / Pongo pygmaeus abelii
Red-cheeked gibbon / Hylobates gabriellae
Rhesus macaque / Macaca mulatta

GenBank homepage:

1.2 Get cytochrome b sequences from different primate species. Type “gorilla gorilla mitochondrion complete” in the search box, choose “Nucleotide”, and click Go to run the search in GenBank. You sometimes have to experiment with the keywords to find what you want. Matching sequences are then displayed. Choose, for example, “Gorilla gorilla gorilla mitochondrion, complete genome” (click on the blue text). The next page shows the annotated record for the gorilla mitochondrion, with the whole nucleotide sequence at the end of the page, and you have to find where the cytochrome b sequence begins. Scroll down until you find the following text:

gene 14171..15311

/gene="CYTB"

/db_xref="GeneID:6742684"

Click on “gene” to select the cyt b sequence, i.e. nucleotides 14171 to 15311.
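
GenBank feature coordinates such as 14171..15311 are 1-based and include both end points, whereas Python counts from 0 and excludes the end of a slice. The following is a minimal sketch of the conversion, assuming the genome is available as a plain text string; the helper name `extract_feature` and the toy sequence are illustrative only and are not part of GenBank or BioEdit.

```python
def extract_feature(genome: str, start: int, end: int) -> str:
    """Return the bases from start to end (both 1-based and inclusive)."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the sequence")
    # Shift the start by one; Python's end index is already exclusive.
    return genome[start - 1:end]


# Length implied by the coordinates in the GenBank record above.
cytb_length = 15311 - 14171 + 1
print(cytb_length)  # 1141

# Toy example (made-up sequence): positions 3..6 of a ten-base string.
print(extract_feature("AACCGGTTAA", 3, 6))  # CCGG
```

The length check is a quick way to confirm that the sequence you copied covers the whole gene.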

1.3 Copy from GenBank to BioEdit. The next step is to transfer the sequence from GenBank to BioEdit (the program for sequence analysis). Before you can paste the sequence into BioEdit, you have to convert it to the so-called FASTA format (line one starts with the > sign followed by the sequence name and description; the actual sequence runs from line two onwards).

Click on “FASTA” (lower right).

Select the nucleotide sequence (including the first line) and copy (Ctrl+C).
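
To make the FASTA layout concrete, here is a minimal Python sketch that reads FASTA text into names and sequences. It is not how BioEdit imports data; the function name `parse_fasta` and the example record are made up for illustration.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 hypothetical example record
atgaccccaa
tacgc
"""
print(parse_fasta(example))  # {'seq1': 'ATGACCCCAATACGC'}
```

Note that the sequence may be wrapped over several lines in the file, which is why the lines are joined into one continuous string.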

The next step is to paste the sequence into BioEdit.

  • Open the program BioEdit. [If you do not have BioEdit installed, download and install the latest version first.] Go to “File” / “New alignment”.
  • Make sure that “Mode:” in BioEdit is set to “Edit” and “Insert” (upper left corner).
  • Click in the box for the new alignment to place the cursor there, then choose “File” / “Import from Clipboard”. BioEdit then automatically converts the text to a continuous nucleotide sequence and adds it to your alignment file.
  • Change the name with “Sequence” / “Rename” / “Edit Title”. Avoid blank spaces and symbols in the name! This will make it easier to find the sequence in a phylogenetic tree later on.
  • Save the file (on student computers: in GU-Student / My Documents).
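
If you have many sequences to rename, the naming rule (no blank spaces, no symbols) can be applied automatically. This is a small sketch under the assumption that letters, digits and underscores are safe; `clean_name` is a hypothetical helper, not a BioEdit function.

```python
import re


def clean_name(name: str) -> str:
    """Replace each run of characters other than letters, digits and '_' by '_'."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name.strip()).strip("_")


print(clean_name("Gorilla gorilla gorilla (cyt b)"))  # Gorilla_gorilla_gorilla_cyt_b
```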

1.4. Add cytochrome b gene sequences from another 5-6 different species to your alignment file. You can do this either by searching for taxonomic names / gene regions as above, or by taking one cytochrome b sequence, e.g. from the gorilla, and running a BLAST search (1.5). Note that some of the options to explore in MEGA (later in this exercise) do not work if you have fewer than four sequences.

1.5 Execute a BLAST (Basic Local Alignment Search Tool):

  • Go to GenBank.

  • Find BLAST (in the menu under Nucleotide Tools). A click takes you to the BLAST page.

  • Select “nucleotide BLAST”. You now see an open field where you paste your copied sequence. Select the database “Nucleotide collection (nr/nt)”.

  • Click on the “BLAST” button (lower left). The search may take up to a few minutes! The result of the BLAST search shows first an illustration of the “Distribution of 100 Blast hits on the Query Sequence”, followed by a long list with the names of the hits (sequences that show some similarity to your sequence). Hits are ranked by similarity, with the most similar sequence at the top of the list.

  • What do the E-value and Query coverage mean?
  • Do a new BLAST search with an arbitrary sequence of about 100 nucleotides, and change “Optimize for” to “Somewhat similar sequences (blastn)”. You can type in a sequence yourself or copy and paste this one: “gagggctttcggtatgcttgcacacattccggttcggctgcgtggtgcagatgacagatagcagatagacccttgtgtgtgcgaaatgtgtgcgagagcagagagatttccatttggccattggacccttggtaattgggaaacctta”
  • Compare the query coverage and E-value with the results from the gorilla BLAST search.
  • BLAST the gorilla sequence again and continue. Choose more than four sequences from different species and add these to the file with the gorilla sequence. This can be done in several ways. The simplest way is to do the same as you did with the gorilla sequence and add them one by one. A quicker way is to download all your chosen sequences in one batch. To do this, select the sequences you want by ticking the box next to each sequence name. Click Download (at the top of the list of sequences), tick “FASTA (aligned sequences)” and then click Continue. Save the .txt file and open it in BioEdit. Select the sequences, then copy and paste them into the file with the gorilla sequence. Change the names of the sequences to something handier than the GenBank names.
  • Make sure all your sequences are aligned and of the same length. If they are not, insert or delete bases at the beginning so that the sequences match each other, and then trim the sequences so that they all have the same length.
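
The final check (all sequences the same length) is easy to automate. The sketch below assumes the sequences are held in a non-empty Python dictionary of name to sequence; both function names are hypothetical. Trimming at the end only makes sense once the sequences already start at the same position, which you still have to verify by eye in BioEdit.

```python
def check_alignment(seqs: dict) -> int:
    """Return the common length, or raise if the sequences differ in length."""
    lengths = {name: len(seq) for name, seq in seqs.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"unequal lengths: {lengths}")
    return next(iter(lengths.values()))


def trim_to_shortest(seqs: dict) -> dict:
    """Cut the end of every sequence so that all match the shortest one."""
    shortest = min(len(seq) for seq in seqs.values())
    return {name: seq[:shortest] for name, seq in seqs.items()}


# Toy sequences (made up), already starting at the same position.
toy = {"gorilla": "ATGACCCCAATA", "human": "ATGACTCCG"}
trimmed = trim_to_shortest(toy)
print(check_alignment(trimmed))  # 9
```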

1.7. Open MEGA5 (a free program available for download).

1.8. Open your file [Open a File/Session] and answer the following questions as they appear:

  • “How do you want to open the file?” Analyze
  • “Nucleotide sequences” OK
  • “Protein-coding DNA” Yes
  • Select genetic code “Vertebrate mitochondrial”.

In order to look at the data:

  • Click Data / Explore Active Data

1.9. Explore the buttons at the top of the Sequence Data Explorer window (and try to understand what each of them means):

  • C – Conserved sites
  • V – Variable sites
  • Pi – Parsimony-informative sites
  • S – Singleton sites
  • 0 – 0-fold degenerate
  • 2 – 2-fold degenerate
  • 4 – 4-fold degenerate
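
The first four categories can be worked out column by column. The sketch below uses the usual definitions: a conserved site has one base in all sequences, a variable site has at least two; a variable site is parsimony-informative if at least two different bases each occur at least twice, and a singleton otherwise. It is a simplified illustration on a made-up alignment, not MEGA's implementation, and MEGA's handling of gaps and missing data may differ.

```python
from collections import Counter


def classify_site(column: str) -> str:
    """Classify one alignment column as 'C', 'Pi' or 'S' (Pi and S are variable)."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    if len(counts) < 2:
        return "C"
    # Number of different bases that occur at least twice in this column.
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "Pi" if repeated >= 2 else "S"


def count_sites(seqs: list) -> Counter:
    """Count site categories over all columns of an alignment."""
    columns = ("".join(col) for col in zip(*seqs))
    return Counter(classify_site(col) for col in columns)


toy_alignment = ["ATGCA", "ATGCG", "ATACG", "ACACT"]
result = count_sites(toy_alignment)
print(result["C"], result["S"], result["Pi"])  # 2 2 1
```

The number of variable sites (V) is the sum of the singleton and parsimony-informative counts, here 3.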

1.10. Compute DNA distances

  • Click Data / Quit Data Viewer
  • In the menu, go to Distance / Compute Pairwise Distances
  • In the Options summary, select “Model/Method = p-distance”, “Substitution type = Nucleotide” and “Gaps/Missing Data Treatment = Pairwise deletion”
  • Interpret the distances between the species
  • Do this separately for the 1st, 2nd and 3rd codon positions. Why are the distances not the same?
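
The p-distance is simply the proportion of compared sites at which two sequences differ. A minimal sketch of the calculation, assuming aligned sequences that start at codon position 1; the function names and the two toy sequences are made up, and with pairwise deletion a site is skipped only for the pair in which one sequence has a gap or an ambiguous base.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites, skipping sites with a gap or ambiguous base."""
    compared = differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # pairwise deletion: ignore this site for this pair only
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def codon_position(seq: str, position: int) -> str:
    """Return every third base, starting at codon position 1, 2 or 3."""
    return seq[position - 1::3]


s1 = "ATGACCCCAATA"
s2 = "ATGACTCCGATG"
print(p_distance(s1, s2))  # 0.25
for k in (1, 2, 3):
    # Prints 0.0, 0.0 and 0.75 for positions 1, 2 and 3 of these toy sequences.
    print(k, p_distance(codon_position(s1, k), codon_position(s2, k)))
```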

1.11. Compute amino acid distances

  • In the menu, go to Distance / Compute Pairwise Distances and choose “Substitution type = Amino acid” and “Model/Method = p-distance”
  • Compare these distances with the DNA distances
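
To see what happens behind the amino acid option, the sketch below translates DNA with the vertebrate mitochondrial genetic code (NCBI translation table 2) and then computes a p-distance on the proteins. It is an illustration on made-up toy sequences, not MEGA's implementation; the names `translate` and `aa_p_distance` are hypothetical, and codons that cannot be translated (for example those containing gaps) are marked X and skipped.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 2 (vertebrate mitochondrial); codon order TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate complete codons from position 1; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def aa_p_distance(prot1: str, prot2: str) -> float:
    """Proportion of differing amino acids among positions translated in both."""
    pairs = [(a, b) for a, b in zip(prot1, prot2) if "X" not in (a, b)]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


p1 = translate("ATGACCCCAATA")
p2 = translate("ATGACTCCGATG")
print(p1, p2)                 # MTPM MTPM
print(aa_p_distance(p1, p2))  # 0.0
```

In this toy pair the two DNA sequences differ at three of twelve sites, yet the translated proteins are identical.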
