Bioinformatics, Part 2
Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute to Washington University.
Glossary
Genome – The entire amount of genetic information for an organism. The human genome is the set of 46 chromosomes.
Homologous – With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both residues contain a carboxylic acid side chain.
Sequence alignment – a sequence alignment is a way of arranging the sequences present in DNA, RNA, or proteins so as to identify regions that are similar.
Multiple sequence alignment – a sequence alignment of three or more biological sequences.
Conserved – the amino acid residues at a position in a multiple sequence alignment are identical throughout the alignment.
Conservative residue change – the amino acid residues at a position in a multiple sequence alignment are homologous.
ClustalW – A program for making multiple sequence alignments. www.ebi.ac.uk/clustalw/index.html
ExPASy – Expert Protein Analysis System - us.expasy.org/ A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.
FASTA – Short for “FAST-All” (since it works on both nucleotide and amino acid sequences). Associated with this software is a way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly. The sequence itself does NOT have any spaces, numbers, or other formatting; a long sequence may be wrapped over several lines, as in the example below, and most programs ignore those line breaks. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:
>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
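The FASTA rules described above (a title line starting with “>”, followed by the sequence in one-letter code) can be sketched as a small reader. This is a minimal illustration in Python, not the parser used by any of the programs in this exercise; the function name parse_fasta is hypothetical, and the sketch assumes that line breaks and stray spaces inside a sequence should simply be ignored.

```python
def parse_fasta(text):
    """Return a dict mapping each title (without the '>') to its sequence."""
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A title line starts a new record.
            title = line[1:].strip()
            records[title] = []
        elif title is not None:
            # Sequence lines are joined; spaces are dropped (an assumption).
            records[title].append(line.replace(" ", ""))
    return {name: "".join(chunks) for name, chunks in records.items()}


# The K-Ras example from the glossary entry above.
example = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```

Storing each record's lines in a list and joining them once at the end keeps the sketch simple and avoids repeated string concatenation for long sequences.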
Sequence Manipulation Suite – bioinformatics.org/sms/ a website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.
Procedure
NCBI – Gene
1. Go (again) to the NCBI homepage: http://www.ncbi.nlm.nih.gov
2. Search in the “Gene” database for “Homo sapiens PTGS2”. Click on the “PTGS2” entry. The section NCBI Reference Sequences (RefSeq) gives RefSeq accession numbers for the mRNA sequence of Homo sapiens prostaglandin G/H synthase 2 precursor. (The number starts with NM_.)
Write one of them here______.
3. Open the RefSeq entry by clicking on that number (first link in the section), then click on “FASTA” (near the top of the page). Copy the nucleotide sequence (including the title line designated by the “>” symbol) and paste it into a text or Word document.
4. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for “FASTA” in the Glossary: understanding the FASTA format will help in working with the bioinformatics programs.
5. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq Protein Product” link, which is in the second column of the page, then selecting the FASTA format again. Follow the steps given above to save the amino acid sequence in FASTA format as a document called PTGS2prot.doc.
Swiss-Prot Entry
1. Go to the ExPASy website (http://us.expasy.org/). Under Databases select “UniProtKB” (a protein knowledgebase). At the top of the page, click “Fields >” (to the right of the search box). For the first field, select “Protein Name”, and for the “Term” enter Phospholipase C gamma 1. Click “Add & Search”, then click “Fields” again; for the field “Organisms”, use the term “Homo sapiens”. Click “Add & Search” again. Select the one entry that has been reviewed (marked with the gold star).
2. What is the “accession number” of this protein?
3. Click on the accession number. Write at least three alternate names for this protein.
4. In which two areas of the cell is this protein found? (Under “cellular component”)
5. What is its “cofactor” (needed for the enzyme to function)?
6. What are the length (in amino acids) and the molecular mass (in daltons) of isoform 1 of PLC gamma 1 (under “Sequences”)?
7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE database. Enter the accession number in the search box. Has anyone reported 2-D gel electrophoresis data?
Sequence Manipulation
1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).
2. Under the menu entry “DNA Analysis”, click on “Translate”.
3. Clear the data entry box by clicking on “Clear”.
4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it into the data entry box on the Sequence Manipulation website.
5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.
6. When the Output window opens with your results, copy and paste the sequence into a Word document and save it as “translate.doc” on your desktop.
7. Compare this sequence in the “translate.doc” file with the sequence in the “PTGS2prot.doc”.
What are the first residues that are the same in the sequences?
Do the sequences look like they are the same? (Note: protein sequences should start with a methionine, M.)
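To see why the choice of reading frame matters in the steps above, the translation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the Sequence Manipulation Suite: the names CODON_TABLE and translate are hypothetical, the standard genetic code is assumed, stop codons are written as “*”, and any codon containing an unrecognized letter is written as “X”.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (first base changes slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=1):
    """Translate a nucleotide sequence in reading frame 1, 2, or 3 (direct strand)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    # Step through complete codons only; leftover bases at the end are ignored.
    for i in range(frame - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# A made-up nine-base sequence, translated in each of the three frames.
for frame in (1, 2, 3):
    print(frame, translate("ATGGCCTAA", frame))
```

Shifting the frame by a single base changes every codon, so only one of the three frames can reproduce the protein sequence saved in PTGS2prot.doc.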
Multiple Sequence Alignment with ClustalW
1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.
2. The following are 6 FASTA formatted sequences of PTGS2 from different organisms. Copy and paste all of the FASTA formatted sequences into the data entry box.
>dog [Canis familiaris]
MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLYLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDHKRGPAFTKGL
GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHV
PEHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGL
TQFVESFSRQIAGRV
AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAA
GLEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>cow [Bos taurus]
MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQF
FKTDFERGPAFTKGK
NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHV
PEHLKFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGL
TQFVESFTRQRAGRV
AGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAA
ELEALYGDIDAMEFY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICSNVKG
CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL
>mouse [Mus musculus]
MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCT
TPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQF
FKTDHKRGPGFTRGL
GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHI
PENLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAA
ELKALYSDIDVMELY
PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL
>Rabbit
MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCS
TPEFLTRIKLLLKPT
PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSW
EAFSNLSYYTRALPP
VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDLKRGPAFTKGL
GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHI
PAHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGL
TQFVESFTRQIAGRV
AGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAA
ELEALYGDIDAVELY
PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIV
NTASIQSLICNNVKG
CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL
>pig [Sus scrofa]
MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCT
TPEFLTRIKLFLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSW
EAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF
FKTDQKRGPAFTKGQ
GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHT
PEHLRFAVGHEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYV
QHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGI
TQFVESFSRQIAGRV
AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAA
ELEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII
NTASIQSLICNNVKG
CPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL
>coral [Gersemia fruticosa]
MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYG
VNCEKPNWSTWFKAL
IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHD
YATMEAHYNRSYFAR
TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHF
THEFFKTIYHSPAFT
WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYP
PNTPEDQKFALGHPF
YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIE
DYVNHLANYNLKLTY
NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVK
HGMSSFVDSMSKGLC
GQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKM
SAELQEVYGDVNAVD
LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFD
MVKTASLEKLFCQNI
AGECPLVTFTVPDDIARETRKVLEARDEL
For alignment select “Full”; for output format, select “aln w/numbers” so that particular residues (amino acids) in the alignment can be found; for the Output order select “input”. Click the “Run” button located in the lower right.
3. View the output, starting with the SCORES table:
SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
=================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89
Note that each specific pair of sequences is compared; dog to cow, for example. You would expect a higher score (right column; the similarity of the two protein sequences) between two mammals than between a mouse and the coral. What is the similarity score for the sequences from mouse and coral? ______
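A pairwise score of this kind can be thought of as the percentage of aligned positions at which two sequences carry the same residue. The sketch below computes such a percent identity for two sequences that have already been aligned, with “-” marking gaps. It is an illustration only: the function name percent_identity is hypothetical, the two short sequences are made up, and the choice to skip gapped positions is an assumption, so the result need not match the number ClustalW reports.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned positions where the two residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # positions with a gap are not counted (an assumption)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two made-up aligned fragments; '-' marks a gap.
print(round(percent_identity("MLAR-ALV", "MLFRAALL"), 1))
```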
View the cladogram at the bottom of the page. (To learn more about cladograms go to en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are most similar, based on this view? (Or can one even tell?)
Now for the most important part of this ClustalW analysis: an amino acid by amino acid comparison of the same protein from different species. Go a little way down the web page and find ALIGNMENT. A button labeled “Show Colors” will be displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below. (This option only works when you have chosen ALN or GCG as the output format.)
Residues / Color / Property
AVFPMILW / Red / Small: small or hydrophobic; includes aromatic except Tyr
DE / Blue / Acidic
RHK / Magenta / Basic
STYHCNGQ / Green / Hydroxyl + Amine + Basic - Q
Others / Gray
CONSENSUS SYMBOLS: An alignment will display by default the following symbols denoting the degree of conservation observed in each column:
Symbol / Meaning
* / The residues in that column are identical in all sequences in the alignment.
: / Conserved substitutions are present, according to the COLOR table above.
. / Semi-conserved substitutions are present.
(space) / ?
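The first two symbols can be illustrated with a short Python sketch that labels each column of an alignment using the residue groups from the color table in this handout. It is a simplified illustration, not the ClustalW implementation: the names COLOR_GROUPS and consensus_line are hypothetical, the “.” (semi-conserved) symbol is left out because the handout does not define which substitutions qualify, and the program itself may use different residue groups than the color table.

```python
# Residue groups taken from the color table in this handout.
COLOR_GROUPS = [
    set("AVFPMILW"),  # red
    set("DE"),        # blue
    set("RHK"),       # magenta
    set("STYHCNGQ"),  # green
]


def consensus_line(aligned_sequences):
    """Return one symbol per column: '*' identical, ':' same color group, ' ' otherwise."""
    symbols = []
    for column in zip(*aligned_sequences):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")  # a gap in any sequence: no symbol (an assumption)
        elif len(residues) == 1:
            symbols.append("*")
        elif any(residues <= group for group in COLOR_GROUPS):
            symbols.append(":")
        else:
            symbols.append(" ")
    return "".join(symbols)


# Three made-up aligned fragments of equal length.
print(repr(consensus_line(["MKDA", "MRES", "MHEG"])))
```

Because histidine (H) appears in two groups of the color table, the sketch accepts a column as conservative if all of its residues fit within any single group.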
Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino acids to some physicochemical properties. Exarchos et al. BMC Bioinformatics, 2009, 10:113 (Creative Commons Attribution License)
Copy the alignment of amino acids in various species and paste it into a Word document. To make this file readable, do the following things:
a) Go to “Page Set-up” under “File” and change the page orientation to landscape.
b) Select all text and change to “Courier” font, size 10. Courier is the best font for alignments because all the letters are the same width. This is one of the major secrets of working with FASTA sequences.
c) Save this file to the desktop as “ClustalW.doc” and print it (send the file to yourself by email or place it on a floppy or flash drive). Place a printed copy in your lab notebook.
4. Review the alignment. What does the presence of a space under a column in the alignment indicate about the relation of the residues?
5. Find the longest string of conserved (defined in glossary) residues (watch out for strings at the ends of rows). How many residues does it contain?
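One way to check a hand count is to join the consensus-symbol rows of the alignment end to end and look for the longest unbroken run of “*”. The Python sketch below does this; the function name longest_conserved_run and the example rows are hypothetical, and the sketch assumes that each row has been trimmed so that it contains exactly one symbol per alignment column (no sequence names or position numbers).

```python
import re


def longest_conserved_run(consensus_rows):
    """Length of the longest unbroken run of '*' across consecutive rows."""
    # Joining the rows first lets a run continue from the end of one row
    # into the start of the next.
    consensus = "".join(consensus_rows)
    runs = re.findall(r"\*+", consensus)
    return max((len(run) for run in runs), default=0)


# Two made-up consensus rows; the longest run spans the row break.
print(longest_conserved_run(["**:* ***", "**. *"]))
```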