Bioinformatics, Part 2

Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi; the work was funded by a Howard Hughes Medical Institute grant to Washington University.

Glossary

Genome – The entire genetic information of an organism. The human nuclear genome is distributed over 23 pairs of chromosomes (46 chromosomes in a typical body cell).

Homologous – With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both residues contain a carboxylic acid side chain.

Sequence alignment – A way of arranging DNA, RNA, or protein sequences so that regions of similarity can be identified.

Multiple sequence alignment – a sequence alignment of three or more biological sequences.

Conserved – the amino acid residues at a position in a multiple sequence alignment are identical throughout the alignment.

Conservative residue change – the amino acid residues at a position in a multiple sequence alignment are homologous.

ClustalW – A program for making multiple sequence alignments. www.ebi.ac.uk/clustalw/index.html

ExPASy – Expert Protein Analysis System (us.expasy.org/). A server maintained by the Swiss Institute of Bioinformatics. Home of Swiss-Prot, a large, extensively annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site as a free download.

FASTA – Short for “FAST-All”, a sequence search program that works on both nucleotide and amino acid sequences. The name is also used for the sequence format associated with the software, which matters because many bioinformatics programs require input in FASTA format. Each sequence has a title line that begins with “>”, followed by any text needed to name the sequence. The title line ends with a paragraph mark (hit the return key); this is how programs know that the title is not part of the sequence. The sequence itself follows in one-letter code, with no spaces, numbers, or other formatting; a long sequence may be wrapped over several lines. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM

Sequence Manipulation Suite – bioinformatics.org/sms/ a website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.
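
The FASTA format described in the glossary is simple enough to read with a few lines of code. The following is a minimal Python sketch, not part of the original exercise: the function name parse_fasta is made up for illustration, and it assumes that any line starting with “>” is a title line and that everything up to the next title line is sequence.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined; stray spaces are dropped.
            chunks.append(line.replace(" ", ""))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK
HKEKMSKDGKKKKKKSKTKCVIM
"""

records = parse_fasta(example)
for title, sequence in records:
    print(title, "-", len(sequence), "residues")
```

Because the title line is recognized only by its leading “>”, a missing “>” would cause the title text to be read as part of the sequence, which is why the format has to be followed exactly.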

Procedure

NCBI – Gene

1. Go (again) to the NCBI homepage: http://www.ncbi.nlm.nih.gov

2. Search in the “Gene” database for “Homo sapiens PTGS2”. Click on the “PTGS2” entry. The section NCBI Reference Sequences (RefSeq) gives RefSeq accession numbers for the mRNA sequence of Homo sapiens prostaglandin G/H synthase 2 precursor. (The number starts with NM_.)

Write one of them here______.

3. Open the RefSeq entry by clicking on that number (first link in the section), then click on “FASTA” (near the top of the page). Copy the nucleotide sequence (including the title line designated by the “>” symbol) and paste it into a text or Word document.

4. Save the file as PTGS2rna.doc (or .txt) on your desktop. Review the entry for “FASTA” in the Glossary: understanding the FASTA format will help in working with the bioinformatics programs.

5. The amino acid sequence is conveniently obtained by first clicking on the “RefSeq Protein Product” link, which is in the second column of the page, then selecting the FASTA format again. Follow the steps given above to save the amino acid sequence in FASTA format as a document called PTGS2prot.doc.

Swiss-Prot Entry

1. Go to the ExPASy website (http://us.expasy.org/). Under Databases select “UniProtKB” (a protein knowledgebase). At the top of the page, click “Fields >” (to the right of the search box). For the first field, select “Protein Name” and enter Phospholipase C gamma 1 as the “Term”. Click “Add & Search”, then click “Fields” again; for the field “Organisms”, use the term “Homo sapiens”. Click “Add & Search” again. Select the one entry that has been reviewed (marked with a gold star).

2. What is the “accession number” of this protein?

3. Click on the accession number. Write at least three alternate names for this protein.

4. In which two areas of the cell is this protein found? (Under “cellular component”)

5. What is its “cofactor” (needed for the enzyme to function)?

6. What are the length (in amino acids) and the molecular mass (in daltons) of PLC gamma 1 isoform 1 (under “Sequences”)?

7. Return to the home page of the ExPASy Proteomics Server; select the SWISS-2DPAGE database. Enter the accession number in the search box. Has anyone reported 2-D gel electrophoresis data?
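
The mass reported for a protein entry is computed from its sequence. As a rough illustration of that arithmetic (a sketch only, not the database's own method): add up the mass of each residue and add one water molecule for the free ends of the chain. The residue masses below are commonly tabulated average values quoted from memory, so treat them as approximate assumptions; the function name protein_mass is hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
# Assumption: commonly tabulated average values; check a reference table
# before relying on them.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain: H on the N-terminus, OH on the C-terminus


def protein_mass(sequence):
    """Approximate average mass (Da) of an unmodified protein chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


peptide = "MTEYK"  # first five residues of the K-Ras example in the Glossary
print(len(peptide), "residues,", round(protein_mass(peptide), 1), "Da")
```

The sketch ignores post-translational modifications and removed signal peptides, so small differences from a database value are to be expected.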

Sequence Manipulation

1. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).

2. Under the menu entry “DNA Analysis”, click on “Translate”.

3. Clear the data entry box by clicking on “Clear”.

4. Copy the mRNA sequence in FASTA format from your file (PTGS2rna.doc) and paste it into the data entry box on the Sequence Manipulation website.

5. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.

6. When the Output window opens with your results, copy and paste the sequence into a Word document and save it as “translate.doc” on your desktop.

7. Compare this sequence in the “translate.doc” file with the sequence in the “PTGS2prot.doc”.

What are the first residues that are the same in the sequences?

Do the sequences look like they are the same? (Note: protein sequences should start with a methionine, M.)
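
To see what the “Translate” tool does with a reading frame, here is a minimal Python sketch (an illustration under stated assumptions, not the Sequence Manipulation Suite's actual code). It assumes the standard genetic code, reads the sequence in the direct (forward) direction only, and marks stop codons with “*”; the names CODON_TABLE and translate are made up for this example.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(sequence, frame=1):
    """Translate a nucleotide sequence in forward reading frame 1, 2, or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    sequence = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    # Start at offset frame-1 and read complete codons only.
    for i in range(frame - 1, len(sequence) - 2, 3):
        protein.append(CODON_TABLE.get(sequence[i:i + 3], "X"))  # X = unknown
    return "".join(protein)


toy = "ATGGCCTAA"  # a toy sequence, not PTGS2
for frame in (1, 2, 3):
    print("frame", frame, translate(toy, frame))
```

Shifting the start by one or two nucleotides changes every codon, so the three frames give unrelated amino acid strings; only one of them can match the sequence saved in “PTGS2prot.doc”.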

Multiple Sequence Alignment with ClustalW

1. Go to the ClustalW2 website, http://www.ebi.ac.uk/Tools/clustalw2/index.html.

2. The following are six FASTA-formatted sequences of PTGS2 from different organisms. Copy and paste all of the FASTA-formatted sequences into the data entry box.

>dog [Canis familiaris]
MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCSTPEFLTRIKLYLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDHKRGPAFTKGL
GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHVPEHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGLTQFVESFSRQIAGRV
AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAAGLEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTAFSVQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL

>cow [Bos taurus]
MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQFFKTDFERGPAFTKGK
NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHVPEHLKFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGLTQFVESFTRQRAGRV
AGGRNLPVAVEKVSKASIDQSREMKYQSFNEYRKRFLVKPYESFEELTGEKEMAAELEALYGDIDAMEFY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICSNVKG
CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL

>mouse [Mus musculus]
MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCTTPEFLTRIKLLLKPT
PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSWEAFSNLSYYTRALPP
VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQFFKTDHKRGPGFTRGL
GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHIPENLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGLTQFVESFTRQIAGRV
AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAAELKALYSDIDVMELY
PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL

>Rabbit
MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCSTPEFLTRIKLLLKPT
PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSWEAFSNLSYYTRALPP
VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDLKRGPAFTKGL
GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHIPAHLQFAVGQEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGLTQFVESFTRQIAGRV
AGGRNVPPAVQKVAKASIDQSRQMKYQSLNEYRKRFLLKPYESFEELTGEKEMAAELEALYGDIDAVELY
PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIVNTASIQSLICNNVKG
CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL

>pig [Sus scrofa]
MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCTTPEFLTRIKLFLKPT
PNTVHYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSWEAFSNLSYYTRALPP
VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQFFKTDQKRGPAFTKGQ
GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHTPEHLRFAVGHEVFGL
VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIGETIKIVIEDYVQHLSGYHFKLKFDPE
LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGITQFVESFSRQIAGRV
AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAAELEALYGDIDAMELY
PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKIINTASIQSLICNNVKG
CPFTSFSVQDPQLAKTVTINASSSHSGLDDINPTVLLKERSTEL

>coral [Gersemia fruticosa]
MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYGVNCEKPNWSTWFKAL
IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHDYATMEAHYNRSYFAR
TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMFFAQHFTHEFFKTIYHSPAFT
WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYPPNTPEDQKFALGHPF
YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIEDYVNHLANYNLKLTY
NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVKHGMSSFVDSMSKGLC
GQMSHHNHGAYTLDVAVEVIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKMSAELQEVYGDVNAVD
LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFDMVKTASLEKLFCQNI
AGECPLVTFTVPDDIARETRKVLEARDEL

For alignment select “Full”; for output format, select “aln w/numbers” so that particular residues (amino acids) in the alignment can be found; for the Output order select “input”. Click the “Run” button located in the lower right.

3. View the output, starting with the SCORES table:

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
=================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89

Note that each pair of sequences is compared separately (dog to cow, for example). You would expect a higher score (right column; a measure of how similar the two protein sequences are) between two mammals than between a mouse and the coral. What is the similarity score for the mouse and coral proteins? ______
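
A pairwise score of this kind can be thought of as a percent identity: the share of aligned positions at which two sequences carry the same residue. The Python sketch below illustrates that idea under a simplifying assumption (positions with a gap, written “-”, in either sequence are not counted); it is not claimed to reproduce ClustalW's exact scoring, and the function name percent_identity is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percent of compared positions that are identical in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First ten residues of the dog and mouse sequences given above.
dog = "MLARALVLCA"
mouse = "MLFRAVLLCA"
print(percent_identity(dog, mouse))
```

Ten residues are far too few to say anything about the full proteins; the snippet only shows how the counting works.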

View the cladogram at the bottom of the page. (To learn more about cladograms go to en.wikipedia.org/wiki/Cladogram.) Switch to the phylogram view. Which two species are most similar, based on this view? (Or can one even tell?)

Now for the most important part of this ClustalW analysis: an amino acid by amino acid comparison of the same protein from different species. Scroll down the web page to ALIGNMENT. A button labeled “Show Colors” is displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below. (This option only works when you have chosen ALN or GCG as the output format.)

Residues / Color / Property
AVFPMILW / Red / Small: small or hydrophobic; includes aromatic except Tyr
DE / Blue / Acidic
RHK / Magenta / Basic
STYHCNGQ / Green / Hydroxyl + Amine + Basic - Q
Others / Gray

CONSENSUS SYMBOLS: An alignment will display by default the following symbols denoting the degree of conservation observed in each column:

Symbol / Meaning
* / The residues in that column are identical in all sequences in the alignment.
: / Conserved substitutions are present, according to the COLOR table above.
. / Semi-conserved substitutions are present.
(space) / ?

Figure 1. A Venn diagram showing the relationship of the 20 naturally occurring amino acids to some physicochemical properties. Exarchos et al. BMC Bioinformatics, 2009, 10:113 (Creative Commons Attribution License)

Copy the alignment of amino acids in various species and paste it into a Word document. To make this file readable, do the following things:

a) Go to “Page Set-up” under “File” and change the page orientation to landscape.

b) Select all text and change it to “Courier” font, size 10. Courier is the best font for alignments because it is monospaced: every letter has the same width, so the residues in each column of the alignment stay lined up.

c) Save this file to the desktop as “ClustalW.doc” (send the file to yourself by email or place it on a floppy or flash drive), then print it. Place a copy in your lab notebook.

4. Review the alignment. What does the presence of a space under a column in the alignment indicate about the relation of the residues?

5. Find the longest string of conserved (defined in glossary) residues (watch out for strings at the ends of rows). How many residues does it contain?
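
Counting by eye is easy to get wrong when a conserved stretch continues from the end of one row to the start of the next. As a cross-check, the minimal Python sketch below joins the consensus-symbol lines of the alignment and finds the longest unbroken run of “*”. It assumes you have already copied the symbol line from under each block of the alignment and trimmed it so that it covers exactly the residue columns; the function name longest_conserved_run is made up for this example.

```python
import re


def longest_conserved_run(consensus_lines):
    """Length of the longest unbroken run of '*' across joined consensus lines."""
    # Joining the rows first lets a run continue from one row into the next.
    consensus = "".join(consensus_lines)
    runs = re.findall(r"\*+", consensus)
    return max((len(run) for run in runs), default=0)


# Toy symbol lines, not taken from the PTGS2 alignment.
rows = [
    "*:****",
    "***.**",
]
print(longest_conserved_run(rows))
```

In the toy input the longest run spans the row break: four symbols at the end of the first row plus three at the start of the second.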