Molecular Genetics and Evolution
Between 1990 and 2003, scientists working on an international research project known as the Human Genome Project identified and mapped the roughly 20,000-25,000 genes in the human genome. Scientists have also mapped the complete genomes of over 1000 species, including the fruit fly, mouse, and E. coli, and have partially mapped many thousands more. They have also mapped thousands of:
Transcriptomes – RNA nucleotide sequences
Exomes – Exon-only sequences (the parts of genes that remain in mRNA after introns are spliced out)
Proteomes – Proteins and their amino acid sequences
The location and complete sequence of the genes and sequences in each of these species are available for anyone in the world to access via the internet.
Why is this information important? Being able to identify the precise location and sequence of human genes will allow us to better understand genetic diseases. In addition, learning about the sequence of genes in other species helps us understand evolutionary relationships among organisms. Many of our genes are identical or similar to those found in other species.
Suppose you identify a single gene that is responsible for a particular disease in fruit flies. Is that same gene found in humans? Does it cause a similar disease? It would take you nearly 10 years to read through the entire human genome to try to locate the same sequence of bases as that in fruit flies. This definitely isn’t practical, so a sophisticated technological method is needed. Bioinformatics is a field that combines statistics, mathematical modeling, and computer science to analyze biological data. Using bioinformatics methods, entire genomes can be quickly compared in order to detect genetic similarities and differences.
In this activity, you will use publicly available bioinformatics tools to construct phylogenetic trees of species that interest you. This activity will take place in three parts:
1. Follow instructions to build a phylogenetic tree from a protein sequence, for assigned species
2. On your own, build a phylogenetic tree for those same species, but from DNA this time
3. Choose your own species and gene/protein to study, and use one or more of the methods that you have learned to explain some element of evolutionary history for those species.
Work with partners who will be on the field trip the same day as you.
As you work, be sure to SAVE the sequences that you find in a Word document, and OFTEN!
Let’s get started! You will mainly be accessing information and tools available on the National Center for Biotechnology Information website.
National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)
The National Center for Biotechnology Information is part of the U.S. National Library of Medicine, and its website is sponsored by the federal government. The goal of the site is to advance science and health by providing access to biomedical and genomic information. It is a HUGE site with MANY amazing tools. It can be faster or slower to access depending upon how many scientists are using it at the moment!
Part 1: Phylogenetic Tree from Protein Sequence
Let’s build a phylogenetic tree to address the question, did whales evolve from terrestrial (land) or aquatic animals? We’ll study these species to answer our question:
Organism / Common Name / Scientific Name / Order / Relationship to Whales
Whale 1 / Blue Whale / Balaenoptera musculus / Cetacea / -
Whale 2 / Humpback Whale / Megaptera novaeangliae / Cetacea / -
Manatee/Dugong / Dugong / Dugong dugon / Sirenia
Seal / Harbor Seal / Phoca vitulina / Carnivora
Sea Lion / California Sea Lion / Zalophus californianus / Carnivora
Walrus / Walrus / Odobenus rosmarus / Carnivora
Land Carnivore / Dog/Wolf / Canis lupus / Carnivora
Artiodactyl / Hippopotamus / Hippopotamus amphibius / Artiodactyla
Perissodactyl / Indian Rhinoceros / Rhinoceros unicornis / Perissodactyla
Elephant / African Elephant / Loxodonta africana / Proboscidea
We will study the amino acid sequence of the protein cytochrome b. Cytochrome b is a core protein of complex III in the electron transport chain of cellular respiration, where it helps move protons across the mitochondrial membrane. It is found in all eukaryotes. (Pay attention here, this will be important to think about when you get to part 3!) It’s often used in phylogenetic studies because all eukaryotes conduct respiration, so they all have this protein, and it’s so fundamentally essential to cells that mutations rarely persist in it. Such slowly changing proteins are more useful for looking deeper into evolutionary history than proteins that change frequently, but they are not helpful for looking into the recent past. When it’s time for you to choose a protein or gene to study, bear in mind that you’ll want to choose one with appropriate species coverage and an appropriate mutation rate, depending on which species you study!
Procedure - Retrieving Sequences:
1. Open a web browser and proceed to this URL for the NCBI homepage:
http://www.ncbi.nlm.nih.gov
2. The aim of our investigation today is to compare the cytochrome b sequences among different mammal groups. In the box next to “Search” (see picture above) click on the dropdown menu and select the “Protein” Database. In the box next to “for” type the phrase “cytb Mammalia” and click on the “Search” button (cytb is one of the server’s abbreviations for cytochrome b).
3. Each result lists a hyperlink to the protein as well as the scientific name of the organism from which the protein was obtained. As you scroll through the results, you will notice that there are upwards of 2,000 pages of entries for this particular protein. Most of the entries should say “378 aa” or “379 aa”. This means that the entry is for a protein with 378 or 379 amino acids. Others are close, like 380 or 377; these are close enough in size that they are also likely to be full and complete.
As you click through the result pages, you may notice there are proteins that are fragments and are much smaller in size than the complete proteins. Later, you want to be careful NOT to select such fragment records in your investigation. All protein samples you investigate should be the full normal size and listed as CYTB proteins. We need to compare apples to apples.
4. To access each individual organism, you must click on the hyperlink for the organism.
One early organism to come up in this example is Herpestes javanicus. Click on the blue hyperlink, and look at the line that says “SOURCE” to find its common English name: Small Indian Mongoose. Thanks to the source line, you can usually search either for scientific names or for English names.
This organism page has a great deal of information. In addition to the organism’s taxonomy, it provides links to research conducted on the protein, and, important for our investigation, a copy of the protein sequence for the organism is found at the bottom of the page.
5. Time to find the protein sequences for our study species. If you want to find many related species, you could search for groups. For instance, in the search bar, enter “cytb cetacea” (cetacea = the mammalian order to which whales belong) and you should get many search results, all for whales.
6. To find just one species, instead type “cytb blue whale”. The link for Balaenoptera musculus, commonly known as the blue whale, should come up. Click on the link once you find it.
7. The information provided is organized in the exact same way as the page you viewed earlier. Our interest is in the amino acid sequence information. Although, as you saw earlier, the amino acid sequence is listed at the bottom of the page, this format will be difficult for us to deal with when we conduct our analysis. The easier format can be found by clicking on the “FASTA” link (it may not be in the same place on the page as it is in this screenshot, look around for it).
8. On the FASTA sequence page, select and copy the sequence as shown here.
9. After you select and copy the information, open a Microsoft Word document and paste this information into your Word document.
Notice that the amino acid sequence is given as a bunch of letters. Each letter stands for one amino acid, the same way that “A” stands for an adenine nucleotide in a DNA sequence. Here is the key:
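For reference, the standard one-letter codes for the 20 common amino acids can be written out as a small Python lookup table. This is only a sketch for looking up letters; the names AMINO_ACID_KEY and spell_out are made up for this handout and are not part of any NCBI tool.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACID_KEY = {
    "A": "Alanine",       "R": "Arginine",      "N": "Asparagine",
    "D": "Aspartic acid", "C": "Cysteine",      "E": "Glutamic acid",
    "Q": "Glutamine",     "G": "Glycine",       "H": "Histidine",
    "I": "Isoleucine",    "L": "Leucine",       "K": "Lysine",
    "M": "Methionine",    "F": "Phenylalanine", "P": "Proline",
    "S": "Serine",        "T": "Threonine",     "W": "Tryptophan",
    "Y": "Tyrosine",      "V": "Valine",
}


def spell_out(sequence):
    """Return the full amino acid names for a string of one-letter codes."""
    return [AMINO_ACID_KEY[letter] for letter in sequence.upper()]


# The first three letters of the blue whale cytochrome b sequence are M, T, N.
print(spell_out("MTN"))
```

Any letter outside these 20 (for example X, which databases sometimes use for an unknown amino acid) will raise a KeyError in this sketch.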
10. SAVE your Word Document and modify your sequences so that they can be interpreted. To do so, you must remove any blank spaces from the headings so that the computer does not mistake letters for amino acids. In other words:
The sequence given for the blue whale in FASTA format looks like this:
>gi|5835008|ref|NP_007068.1|CYTB_10457 cytochrome b [Balaenoptera musculus]
MTNIRKTHPLMKIINDAFIDLPTPSNISSWWNFGSLLGLCLIVQILTGLFLAMHYTPDTMTAFSSVTHIC
RDVNYGWVIRYLHANGASMFFICLYAHMGRGLYYGSHAFRETWNIGVILLFTVMATAFVGYVLPWGQMSF
WGATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHFILPFIIMALAIVHLIFLHETGSNNPTG
IPSDMDKIPFHPYYTIKDILGALLLILTLLMLTLFAPDLLGDPDNYTPANPLSTPAHIKPEWYFLFAYAI
LRSIPNKLGGVLALLLSILVLALIPMLHTSKQRSMMFRPFSQFLFWVLVADLLTLTWIGGQPVEHPYVIV
GQLASILYFLLILVLMPVTSLIENKLMKW
11. If you pasted this into an alignment program as is, letters in the heading could be interpreted as amino acids, and you would not be able to accurately compare species. To avoid this problem, you must modify the heading. Thus, the blue whale should be entered as:
>Blue.Whale
MTNIRKTHPLMKIINDAFIDLPTPSNISSWWNFGSLLGLCLIVQILTGLFLAMHYTPDTMTAFSSVTHIC
RDVNYGWVIRYLHANGASMFFICLYAHMGRGLYYGSHAFRETWNIGVILLFTVMATAFVGYVLPWGQMSF
WGATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHFILPFIIMALAIVHLIFLHETGSNNPTG
IPSDMDKIPFHPYYTIKDILGALLLILTLLMLTLFAPDLLGDPDNYTPANPLSTPAHIKPEWYFLFAYAI
LRSIPNKLGGVLALLLSILVLALIPMLHTSKQRSMMFRPFSQFLFWVLVADLLTLTWIGGQPVEHPYVIV
GQLASILYFLLILVLMPVTSLIENKLMKW
The greater than (>) symbol indicates that what comes next is a name. For a name made up of two or more words, you MUST put a period in between each word! For example, if you are working with a Blue Whale, you must modify its name to read as:
>Blue.Whale
If you fail to do so, the program will read the w-h-a-l-e (in whale) as amino acids rather than a name!
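If you would rather not retype each heading by hand, the renaming step can be sketched in a few lines of Python. This is a minimal illustration, not part of NCBI or COBALT; the function name simplify_fasta is hypothetical, and the example shows only the first and last sequence lines of the blue whale record to keep it short.

```python
def simplify_fasta(record_text, common_name):
    """Replace the FASTA heading with '>Common.Name' and keep the sequence lines."""
    lines = [line.strip() for line in record_text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("A FASTA record must start with a '>' heading line")
    # Join the words of the name with periods so the name contains no spaces.
    new_heading = ">" + ".".join(common_name.split())
    return "\n".join([new_heading] + lines[1:])


# Abbreviated record: heading plus the first and last sequence lines only.
example = """>gi|5835008|ref|NP_007068.1|CYTB_10457 cytochrome b [Balaenoptera musculus]
MTNIRKTHPLMKIINDAFIDLPTPSNISSWWNFGSLLGLCLIVQILTGLFLAMHYTPDTMTAFSSVTHIC
GQLASILYFLLILVLMPVTSLIENKLMKW"""

print(simplify_fasta(example, "Blue Whale"))
```

Only the heading line changes; the amino acid letters themselves are passed through untouched.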
12. Use the database to find cytochrome b sequences for each species given at the start of this section. Find each organism’s sequence by searching the protein database using the abbreviation of the protein (cytb) and the scientific name or English name of the organism. It can help to write the name in quotation marks, for example: cytb “Balaenoptera musculus”
Be sure to double check your spelling for each scientific name, and make sure that you select a full copy of the cytochrome b protein (it should be approximately 378 amino acids in length). You may wish to collaborate with another student or pair, dividing the labor and then E-Mailing each other your documents. However, if you do this, be sure you DON’T INCLUDE whichever species they chose for their wild card option! You should only include yours.
After modifying all of your sequences, be sure to save your Word Document!!! You will be using it in later parts of the activity!
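Before moving on, it can be worth double-checking your collected sequences. The sketch below reads pasted FASTA text and flags names that still contain spaces and sequences whose length suggests a fragment. The function names and the 375 to 382 length window are assumptions chosen for this handout (based on the "approximately 378" guideline above), not rules from NCBI.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence (line breaks removed)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_records(records, min_len=375, max_len=382):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for name, seq in records.items():
        if " " in name:
            problems.append(f"{name!r}: name contains a space")
        if not min_len <= len(seq) <= max_len:
            problems.append(f"{name!r}: length {len(seq)} is outside {min_len}-{max_len}")
    return problems


# Toy data with very short sequences, so the length window is narrowed for the demo.
demo = ">Blue.Whale\nMTNIRK\nTHPLMK\n>Humpback Whale\nMTN\n"
for problem in check_records(parse_fasta(demo), min_len=10, max_len=14):
    print(problem)
```

With real cytochrome b records you would call check_records with its default window, or adjust the window if you are studying a different protein.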
Sequence Alignments and Phylogenetic Tree Generation
13. A feature of the NCBI webpage is the Constraint Based Multiple Alignment Tool (COBALT). COBALT uses the resources of NCBI to compare sequences and put them into a progressive multiple alignment. A simple phylogenetic tree can then be constructed from this alignment.
http://www.ncbi.nlm.nih.gov/tools/cobalt/
14. Return to your Word document that contains your cytochrome sequences. Select all of your sequences, Copy them and Paste them into the box where it indicates you should paste sequences.
15. Click the blue “Align” button. The alignment itself can take several moments to run, depending on how many scientists are currently on the website. Once it has completed, it will show you the analysis that it ran between all species. The information provided will include information on things like number of amino acids in the protein and information about the similarities and differences that occur in the sequences. You can get the program to construct a phylogenetic tree by clicking on the “Phylogenetic Tree” hyperlink in the top left hand corner of the page.
A phylogenetic tree shows the evolutionary relationships among various species or populations. These trees also take the amount of evolutionary change (a rough proxy for time) into consideration through the distances between sequences (or the “leaves” of the tree). Unchecking the “Show Distance” box will construct a cladogram, a tree that ignores these distances; choosing “Slanted” slightly changes the format of the tree.
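To make the idea of a "distance between sequences" concrete, here is a minimal sketch that computes the fraction of aligned positions at which two sequences differ (often called a p-distance). It is an illustration only and is not claimed to be the calculation COBALT performs; the function name and the toy aligned sequences are invented for this example.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions that differ, skipping positions with a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # a gap in either sequence is not counted as a comparison
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


# Invented, already-aligned toy sequences (real alignments are hundreds of letters long).
toy_alignment = {
    "Species.A": "MTNIRKTHPL",
    "Species.B": "MTNIRKSHPL",
    "Species.C": "MTPLRKSH-L",
}

for (name_1, seq_1), (name_2, seq_2) in combinations(toy_alignment.items(), 2):
    print(name_1, name_2, round(p_distance(seq_1, seq_2), 3))
```

Pairs with smaller distances would sit closer together on a tree; tree-building programs start from a table of distances like this one and then apply a clustering method.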
Results: Take a screen capture (PrtScn key) of your phylogenetic tree, paste it into a Word document, LABEL the image “Part 1 – Protein” and email the document to yourself to ensure you have it tomorrow.
Part 2: Phylogenetic Tree from DNA Sequence
What if we analyze the DNA sequence for a protein, instead of the amino acid sequence? Would you expect to get results that are identical, similar, or dissimilar?
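One thing to keep in mind as you think about this question: several different codons can code for the same amino acid, so two DNA sequences can differ at positions where the proteins they encode do not. The sketch below illustrates this with a deliberately partial codon table (only the codons used in the example) and two invented 12-base sequences; it is not a full translation tool.

```python
# Partial codon table: only the codons needed for the toy example below.
# These assignments are the same in the standard and vertebrate mitochondrial codes.
PARTIAL_CODON_TABLE = {
    "ATG": "M",
    "ACC": "T", "ACT": "T",
    "AAC": "N", "AAT": "N",
    "CTA": "L", "CTG": "L",
    "CCC": "P",
}


def translate(dna):
    """Translate a DNA string three bases at a time using the partial table."""
    return "".join(PARTIAL_CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


dna_1 = "ATGACCAACCTA"  # invented sequence
dna_2 = "ATGACTAATCCC"  # invented sequence

protein_1 = translate(dna_1)
protein_2 = translate(dna_2)

print("DNA differences:    ", count_differences(dna_1, dna_2))
print("Protein differences:", count_differences(protein_1, protein_2))
```

In this toy case the DNA sequences differ at more positions than the proteins do, which is one reason a DNA tree and a protein tree for the same gene need not look identical.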
Building a phylogenetic tree from DNA is simple now that you know how to do it from amino acid sequences. The only difference is that instead of choosing “Protein” in the search bar at the top of the page, choose “Nucleotide.”
1. Let’s stick with cytochrome b for now, but pick FOUR species to study out of the group that we used in part 1. (Don’t choose the blue whale or sea lion; as of this writing, they do not have this gene sequenced and uploaded.) Search for their cytochrome b NUCLEOTIDE sequences. However, you need to add some terms to your search. Say you want to search for the Eurasian wolf. You should enter:
cytb “canis lupus lupus” NOT partial NOT complete genome
The “not partial” part helps ensure that you only get results that give you the whole gene. “Not complete genome” makes sure you get results that are only for that gene, not for whole chromosomes or genomes! (Your results will all say “mitochondrial” in this case because the cytb gene happens to be mitochondrial DNA, not nuclear DNA. This is not problematic.)
Notice that your results for a whole gene are each labeled approximately “1,140 bp.” That means the sequence is 1,140 base pairs long. Any results much longer or shorter are not good for comparison. For instance, the hippopotamus returns two search results, one that is 16,407 bp and labeled “complete mitochondrial sequence” and the other that is 1,140 bp and labeled “gene.” Clearly, one of these is useful for this study and the other is not. The African Elephant also has a number of unhelpful search results; read carefully!
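The 1,140 bp figure lines up with the protein lengths from part 1: three bases make one codon, so 1,140 bases make 380 codons. If we assume the gene record ends with a stop codon (an assumption; not every record does), that leaves 379 amino acids. A small sketch of the arithmetic, with a made-up function name:

```python
def expected_protein_length(gene_bp, includes_stop_codon=True):
    """Estimate protein length in amino acids from gene length in base pairs."""
    if gene_bp % 3 != 0:
        raise ValueError("Gene length is not a whole number of codons")
    codons = gene_bp // 3
    # A stop codon ends translation but does not add an amino acid.
    return codons - 1 if includes_stop_codon else codons


print(expected_protein_length(1140))  # 379
```

A result whose length is nowhere near a multiple of three, or far from 1,140, is a hint that it is a fragment or a larger region rather than the single gene.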
2. Click the FASTA link, just like before, and copy what you find there, rewriting the species name just as in part 1, e.g. >Humpback.Whale
3. To make a phylogenetic tree, you will not be able to use COBALT. Instead, we will use a similar alignment tool called Clustal. Navigate to:
Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/)