BIOINFORMATICS
BIO 208 Genetics revised s09
Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned. The simplest tasks in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) make up the majority of such databases. Bioinformatics also includes the development of new algorithms and statistics for assessing relationships among members of these large databases, and the analysis and interpretation of various data, including nucleotide and amino acid sequences, protein domains, and protein structures (computational biology).
Computational molecular biology includes:
o finding the genes in the DNA sequences of various organisms
o developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
o clustering protein sequences into families of related sequences and developing protein models
o aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.
NCBI = National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, the NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.
BLAST = Basic Local Alignment Search Tool
BLAST programs can be used to search both DNA and protein sequences on the NCBI server. The program you will use is BLASTp (p for protein).
GenBank
GenBank has grown from 680,338 base pairs in 1982 to 22 billion base pairs in 2002. Over 30,000 people per day access GenBank online.
OMIM
Online Mendelian Inheritance in Man is a comprehensive compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype and is updated daily.
PubMed
PubMed is a service of the U.S. National Library of Medicine that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to 1948. PubMed includes links to free full text articles and related resources.
PROBLEM CONTEXT (adapted from National Science Foundation supported NYSCATE module field tested at MCCC 2002)
You and your research partners attended a presentation at which you learned of an effective folk remedy used for the prevention of fungal disease in humans. The pasty nature of the remedy is provided by the structural protein, keratin. During the presentation, evidence was presented suggesting that there is a lower incidence of breast cancer and heart disease among those who take the folk remedy.
The folk remedy contains the following:
Water
Salt
Pigeon feather extract
Muskmelon seeds
Southern copperhead snake venom
Your start-up biotech company is interested in the therapeutic effects of these agents and needs to identify the specific protein responsible for the observed anti-cancer and anti-heart disease effects.
After identifying the active protein, your company will isolate the gene encoding the protein. The gene will be engineered and cloned into bacteria. By growing the bacteria in culture, you will be able to purify large quantities of the protein which can be used in FDA regulated clinical trials.
Three amino acid sequences have been isolated from the folklore remedy. You will use the NCBI’s BLAST program to search the protein database to identify the protein from which these amino acid sequences were obtained. You will hypothesize as to which protein might possess the anti-cancer, anti-heart disease function desired.
Objectives:
Ø To fully understand the purpose of the laboratory exercise including the problem context
Ø To determine what NCBI stands for and what type of information it contains
Ø To view some of the organism databases offered in the BLAST assembled genomes
Ø To convert amino acid sequences into FASTA format
Ø To utilize the BLAST searching tool to identify proteins using short amino acid sequences
Ø To evaluate the contents of the folklore remedy with respect to their use in medicine
Ø To identify a journal article relevant to the anti-cancer or anti-heart disease effects of the component(s) in the folk remedy
Ø To explain why the selected protein is likely to have anti-cancer and anti-heart disease activity
I. IDENTITY OF THE UNKNOWN PROTEINS
All sequences for the BLAST alignment programs are entered in FASTA format. View the table on the last page of this handout to see the one letter FASTA code for each of the amino acids.
Question 1. What is the 1 letter code for the following amino acid sequence?
Methionine – Lysine – Leucine – Tyrosine – Serine – Leucine – Leucine – Serine – Leucine – Leucine – Phenylalanine – Leucine – Glycine – Valine – Leucine – Tryptophan – Arginine – Serine – Glutamic acid – Glycine – Valine – Alanine – Serine – Serine – Serine – Asparagine – Aspartic acid – Aspartic acid – Valine – Glycine
1 letter code (FASTA format) → ______
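The conversion asked for in Question 1 is a table lookup, so an answer worked out by hand can be checked with a short script. The following is a minimal Python sketch and not part of the original exercise: the function name `names_to_one_letter` and the hyphen-separated input format are assumptions made here, and the dictionary simply restates the table on the last page of this handout.

```python
# One-letter codes for the 20 standard amino acids, keyed by lowercase full
# name (restated from the table at the end of the handout).
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartic acid": "D",
    "cysteine": "C", "glutamine": "Q", "glutamic acid": "E", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}


def names_to_one_letter(names: str) -> str:
    """Convert hyphen-separated full amino acid names to one-letter codes."""
    # Treat en dashes like plain hyphens, then split on the hyphens.
    parts = names.replace("–", "-").split("-")
    codes = []
    for part in parts:
        key = " ".join(part.lower().split())  # trim and collapse whitespace
        if not key:
            continue  # skip empty pieces left by stray separators
        if key not in ONE_LETTER:
            raise ValueError(f"Unknown amino acid name: {part.strip()!r}")
        codes.append(ONE_LETTER[key])
    return "".join(codes)


# A short made-up example (not one of the exercise sequences):
print(names_to_one_letter("Glycine - Alanine - Aspartic acid - Tryptophan"))  # prints GADW
```

A misspelled name raises an error instead of being skipped silently, which makes typing mistakes easy to spot before the sequence is submitted to BLAST.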
The three proteins isolated from the folklore remedy have the following amino acid sequences:
Protein 1: The above sequence from Question 1
Protein 2: MSCYNPCLPC QPCGPTPLAN
Protein 3: dapanpccda atcklttgsq cadglccdqc
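Proteins 2 and 3 are written in blocks of ten residues, and protein 3 is in lower case. In a FASTA file, each sequence is preceded by a header line that begins with ">"; BLAST also accepts a bare sequence without a header. The Python sketch below shows one way to tidy the sequences before pasting them into the query box. The function name `to_fasta`, the record names, and the line width of 60 are illustrative assumptions, not requirements of the exercise.

```python
def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Return a FASTA record: a '>' header line followed by the sequence."""
    # Drop the spaces that group residues in blocks of ten and
    # upper-case the letters so all three proteins look alike.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned.isalpha():
        raise ValueError("Sequence should contain letters only")
    # Wrap long sequences at a fixed width (60 is an arbitrary choice here).
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


print(to_fasta("protein_2", "MSCYNPCLPC QPCGPTPLAN"))
print(to_fasta("protein_3", "dapanpccda atcklttgsq cadglccdqc"))
```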
A. Searching the BLAST database
To use BLAST, access the NCBI home page at http://www.ncbi.nlm.nih.gov/
Ø From the top of the main page, click on BLAST
Question 2. What are the scientific AND associated common names of 4 species whose assembled nucleotide genomes can be searched with BLAST? For example, Pan troglodytes is the West African chimpanzee.
Ø Click on protein BLAST (searches a protein database with a protein or amino acid query)
Ø Enter the amino acid sequence of protein 1 in FASTA format (see above)
Ø Select the nr database = non-redundant protein sequences. Scroll down and click on BLAST.
View the color key for alignment scores. A score above 50 will be considered relevant in this exercise. You can either click on the bar that represents the highest score, or scroll down to "Sequences producing significant alignments" and view the sequence that produced the strongest hit.
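The rule used in this exercise (keep hits that score above 50, then look at the strongest one first) can be written out as a short Python sketch. The `Hit` type, the function name `relevant_hits`, and the three example hits are placeholders invented for illustration; they are not real BLAST results, and BLAST itself reports more statistics than a single score.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    description: str
    score: float


def relevant_hits(hits, threshold=50.0):
    """Keep hits scoring above the threshold, strongest first."""
    kept = [h for h in hits if h.score > threshold]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Placeholder values for illustration only; these are NOT real BLAST results.
example = [
    Hit("hypothetical hit A", 42.0),
    Hit("hypothetical hit B", 61.5),
    Hit("hypothetical hit C", 55.0),
]
for hit in relevant_hits(example):
    print(f"{hit.score:6.1f}  {hit.description}")
```

The first line printed corresponds to the strongest hit, which is the one examined in the questions that follow.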
Question 3. The sequence that has the best match has the highest score. What is the probable identity of the protein?
Ø Click on the gb accession number and read the information on this protein
Question 4. What organism does the unknown protein appear to have been isolated from? List both genus and species AND common name.
Question 5. Scroll down to the bottom of the page and examine the amino acid sequence. How many amino acids long is the entire protein?
Question 6. Examine the amino acid sequence carefully and locate your original query sequence within it. Which amino acid positions in the protein (from what position to what position) does your query sequence span? ______ to ______
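Questions 5 and 6 amount to counting residues and finding a substring, so answers found by eye can be double-checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `clean` and `locate_query` are made up here, the sequences are invented examples, and the cleaning step assumes the sequence was copied from a record page that shows it in numbered, space-separated blocks.

```python
def clean(text: str) -> str:
    """Keep letters only (drops position numbers, spaces, and line breaks)."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def locate_query(protein: str, query: str):
    """Return the 1-based (start, end) positions of query in protein, or None."""
    protein, query = clean(protein), clean(query)
    index = protein.find(query)  # 0-based index, -1 if the query is absent
    if index == -1:
        return None
    start = index + 1          # protein positions are counted from 1
    end = index + len(query)   # inclusive end position
    return start, end


# Made-up sequences for illustration (not from the exercise):
full_length = "1 magwtrsllk e"
print(len(clean(full_length)))             # length of the whole protein: 11
print(locate_query(full_length, "WTRS"))   # prints (4, 7)
```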
C. Determination of protein function
Search Wikipedia to find out the function of protein 1 in the organism you identified. Wikipedia is a free online encyclopedia which anyone can edit. It contains accurate information with respect to the bioinformatics exercise.
Re-read the problem context on the second page of the handout. Examine the medical applications and components of the folklore remedy. This will assist you in determining the function of the protein you have identified.
Question 7. What is the function of this protein in the folklore remedy? Based on the problem context, would your company pursue this protein?
D. Identification of Additional Proteins in the Folklore Medicine
Examine the FASTA sequence of protein 2. Use the table on the last page of the handout to determine the amino acid sequence of this protein.
Question 8. The amino acid sequence of protein 2 is (the 3 letter amino acid abbreviations can be used):
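Going from the one-letter code back to three-letter abbreviations is the reverse lookup in the same table. The Python sketch below illustrates it on a made-up sequence; the function name `one_to_three` and the hyphen separator are assumptions made here, and the dictionary restates the table on the last page of the handout, including the ambiguity codes B and Z.

```python
# Three-letter abbreviations keyed by one-letter code (from the handout table).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "B": "Asx", "C": "Cys",
    "Q": "Gln", "E": "Glu", "Z": "Glx", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser",
    "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(sequence: str, sep: str = "-") -> str:
    """Convert a one-letter sequence to three-letter abbreviations."""
    residues = "".join(sequence.split()).upper()  # drop spaces, ignore case
    unknown = sorted(set(residues) - set(THREE_LETTER))
    if unknown:
        raise ValueError(f"Unknown one-letter code(s): {unknown}")
    return sep.join(THREE_LETTER[r] for r in residues)


# A short made-up example (not one of the exercise sequences):
print(one_to_three("ga dw"))  # prints Gly-Ala-Asp-Trp
```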
Ø Use protein BLAST to determine the identity of this protein.
Question 9.
a. What is the probable identity of protein 2?
b. What is the genus and species of the organism that protein 2 was isolated from (remember to click on the gb accession number)?
c. What is the common name of the organism?
d. Which one of the folklore ingredients did you identify?
e. Reread the problem context that describes the uses of the folk remedy. The protein you identified has a structural, or binding, function. Explain (Wikipedia can be used):
Ø Use protein BLAST to determine the identity of protein 3
Question 10.
a. What is the probable identity of protein 3?
b. What are the genus and species of the organism from which protein 3 was isolated (remember to click on a gb accession number)?
c. What is the common name of the organism?
d. Which one of the folklore ingredients did you identify?
e. What is (are) the function(s) of the protein with respect to your interests in the biotech startup company described in the problem context (search Wikipedia, read the information, and click on the "ADAM protein" link under "See also")?
Ø Scroll down the Wikipedia page and examine the article written about this protein. List one of the articles that has been published in a journal (Journal names are italicized, textbooks have ISBN numbers).
Question 11.
Identify the author(s), title, journal, and year of publication for the relevant article that you have chosen.
II. SUMMARY
Provide an analysis of your investigation. Of the three proteins analyzed, which do you think is the candidate responsible for reducing a person’s risk of developing breast cancer and heart disease? Which element of the folklore remedy would you purify and pursue as a pharmaceutical? What is the basis for your opinion? What are the roles of the 2 proteins not selected for further study in the folklore remedy?
One- and three-letter amino acid symbols
Amino acid                    Three-letter  One-letter
Alanine                       Ala           A
Arginine                      Arg           R
Asparagine                    Asn           N
Aspartic acid                 Asp           D
Asparagine or Aspartic acid   Asx           B
Cysteine                      Cys           C
Glutamine                     Gln           Q
Glutamic acid                 Glu           E
Glutamine or Glutamic acid    Glx           Z
Glycine                       Gly           G
Histidine                     His           H
Isoleucine                    Ile           I
Leucine                       Leu           L
Lysine                        Lys           K
Methionine                    Met           M
Phenylalanine                 Phe           F
Proline                       Pro           P
Serine                        Ser           S
Threonine                     Thr           T
Tryptophan                    Trp           W
Tyrosine                      Tyr           Y
Valine                        Val           V