March 1, 2006 Laboratory 5 Bio/CS–251

A Brief Tour of the Swiss-Prot/TrEMBL Database

Today we introduce a specialized protein database that, according to BFD (page 118), conveys the vision of its founder, Amos Bairoch. The database is still supervised by Bairoch and a large team of annotators at EBI in Hinxton and SIB in Geneva. This team is assisted by TrEMBL, an automatic translation tool for nucleotide entries submitted to GenBank and EMBL, which uses sequence similarity as a main criterion. These sequences are then visually inspected and manually corrected. Bairoch gives final approval of these submissions, and they become bona fide SWISS-PROT entries. Thus one of the key protein information resources rests almost entirely on the shoulders of this one remarkable man. Let’s explore and enjoy the results of his efforts!

We will be engaged with Chapter 4 of Claverie & Notredame’s BFD but, as usual, supply our own protein sequence. Just so our friends at NCBI won’t think that we don’t love them any more, go to the NCBI web page.

a.  In the “Search” pull down menu, select Protein and in the “For” box type “Whale Myoglobin”.

b.  Choose the first entry on the “Entrez Protein” page, P02185.

c.  You are now on the familiar “Brief” Summary page.

Q1: How do we know that the Accession Number that we are using is the correct Swiss-Prot Accession number?

Q2: What entry tells us that we are indeed dealing with Sperm Whale myoglobin?

d.  If we click on the link under DBSOURCE (P02185), we are taken to a page on UniProt, which combines SWISS-PROT with some other protein databases. This can give us an alternative entrance into SWISS-PROT. Click on the “Niceprot View” link at the top of the page and you are in SWISS-PROT at the reference you are exploring.

e.  Enlarge this window and continue your exploration.

f.  Use the reading from BFD p. 124 to assist you with the next questions.

Q3: When was this sequence first integrated into SWISS-PROT? Does it replace any earlier entries? Has it ever needed to be modified? Are researchers still working on this sequence?

Q4: What is the function of this protein?

Q5: Obtain the list of the other proteins in the SWISS-PROT database that are in the same family as this protein. Assuming that there are about 100 references per screen, how many similar proteins are listed in SWISS-PROT?

Protein visualization is a powerful tool. The structural data behind it are generally obtained by X-ray diffraction. Go to the 3-D Structure Databases section of the report and choose the ExPASy link for the first reference.

Q6: Copy and paste the Header line and the 13 lines following it here.

Also paste the picture of the Ribbon 3D structure of the protein here.

Your instructor will supply you with the information to include the graphical display in this report.

A more interesting picture can be obtained by following the RCSB link and choosing the KiNG Display Option. This will allow you to use your mouse to move the 3-D structure around and explore its secondary structure. You can turn certain aspects of the structure on and off by checking and unchecking the different boxes on the left side bar.

Q7: How many different types of secondary structure (α-helix, β-sheet, and random coil) occur in this protein?

Q8: How many alpha helices are included in this representation?

Q9: How many random coils are present?

Q10: What is a “het”? Is this “het” composed of amino acids? What is its function?

Now, we would like you to apply your learning with myoglobin to study the primary, secondary, and tertiary structure of a different protein, the familiar TATA-Binding Protein (TBP). TBP, as you will recall, is a universally conserved protein in eukaryotes. This protein binds directly to DNA at TATAAA or similar sequences and, along with a host of associated proteins, recruits RNA Polymerase II to the promoters of protein-encoding genes so that they can be transcribed into messenger RNA (mRNA).

g.  Begin with the primary structure of TBP. Go to the NCBI homepage and search under “Protein” for “TBP Homo sapiens”. Open the first entry on the Entrez Protein page, and convert it to FASTA format. Paste the FASTA-formatted entry here:

h. Analyze the primary structure of this protein:

Q11: What is the length of this protein?

Q12: At a glance, this protein sequence has one strikingly unusual feature. What is this feature?

Q13: Calculate the % frequency for the most abundant amino acid in TBP. (For ease of reading, it may be useful to convert the lower-case protein sequence to upper-case letters).
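If you would like to check your Q13 arithmetic, the short Python sketch below shows one way to do it: strip the FASTA header, convert the sequence to upper case, and compute the percentage for the most frequent residue. The function names and the toy record are hypothetical and chosen only for illustration; the toy record is not the TBP sequence, so substitute the FASTA entry you pasted above.

```python
from collections import Counter


def parse_fasta(text):
    """Return the sequence from a single-record FASTA string, upper-cased."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    # Skip the header line (it starts with '>') and join the sequence lines.
    return "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()


def most_abundant_residue(seq):
    """Return (residue, count, percent) for the most frequent residue."""
    if not seq:
        raise ValueError("empty sequence")
    residue, count = Counter(seq).most_common(1)[0]
    return residue, count, 100.0 * count / len(seq)


# Toy record for illustration only; this is NOT the TBP sequence.
example_fasta = """>toy_record hypothetical example
mkqqqqal
qqgk
"""

seq = parse_fasta(example_fasta)
residue, count, percent = most_abundant_residue(seq)
print(len(seq), residue, count, round(percent, 1))
```

The first number printed is the sequence length (compare Q11), followed by the most abundant residue, its count, and its percentage.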

i.  More analysis of the primary structure of TBP: Use the Back button to return to the previous NCBI page (the GenPept entry for TBP Homo sapiens). Scroll down to CDS under FEATURES, and open the link for the UniProt entry. From here open the link under ENTRY NAME. This should lead you to the UniProt/TrEMBL (SWISS-PROT) page for this protein.

After navigating to the UniProt/TrEMBL page once again via the “Niceprot View” link, scroll to the bottom of this SWISS-PROT page, and use ProtParam to learn more about the amino acid composition.

Paste the ProtParam output for amino acid composition and molecular formula here:

Q14: Does this protein contain all 20 of the possible amino acids? Explain.

Q15: Does your calculation in Q13 above agree with the calculation from ProtParam? Explain.

j. Analyze the secondary structure of TBP: For this section, follow the guidelines and information on pp. 341-348 of BFD for predicting α-helices, β-strands/β-sheets, and random coils. Go to the following link: http://bioinf.cs.ucl.ac.uk/psipred/. This is the gateway to PSIPRED, the Protein Structure Prediction Server at University College London. Open the link titled CLICK HERE TO ACCESS THE SERVER, enter the TBP sequence and your e-mail address, and submit the job. It will take approximately 5-15 minutes to receive the output via e-mail, so go ahead to the next lab exercise.

Q16: Paste the e-mail output from PSIPRED here, and answer the following questions:

(1) How many α-helices are predicted?

(2) How many β-strands are predicted?

(3) How many random coils are predicted?
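Counting segments by eye is error-prone, so here is a minimal Python sketch you could use to double-check your Q16 answers. It assumes the PSIPRED report contains lines beginning with "Pred:" that hold one letter per residue (H for helix, E for strand, C for coil); confirm this against the e-mail you actually receive. The function names and the example report are hypothetical, and the example is not a real TBP prediction.

```python
from itertools import groupby


def join_pred_lines(report):
    """Concatenate the state letters from every 'Pred:' line of a report."""
    parts = []
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Pred:"):
            parts.append(line.split(":", 1)[1].strip())
    return "".join(parts)


def count_segments(pred):
    """Count contiguous runs of H (helix), E (strand), and C (coil)."""
    counts = {"H": 0, "E": 0, "C": 0}
    for state, _run in groupby(pred.upper()):
        if state in counts:
            counts[state] += 1  # each run of identical letters is one segment
    return counts


# Hypothetical report fragment for illustration only.
example_report = """Conf: 98877
Pred: CCHHH
  AA: MKQAL

Conf: 77889
Pred: HCEEC
  AA: QGKVT
"""

pred = join_pred_lines(example_report)
print(pred, count_segments(pred))
```

Note that a segment here means an uninterrupted run of one letter, so a helix broken by a single coil residue is counted as two helices.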

Keep in mind that the PSIPRED program only predicts the secondary structure. In fact, the exact secondary and tertiary structure for human TBP was solved using X-ray crystallography in 1996 (Nikolov et al. 1996. Crystal structure of a human TATA box binding protein/TATA element complex. PNAS USA 93: 4862-4867). Let’s examine this structure, and compare it with the PSIPRED predictions.

k.  Analyze the tertiary structure of human TBP: Return to the UniProtKB/TrEMBL (SWISS-PROT) page and scroll to Cross-references, then open the link to the Protein Data Bank (PDB). Under the photo of the TBP molecule, under Display Options, open the link to the KiNG tool (Kinemages), and wait for the applet to load. Use this tool as before to answer the following questions:

Drag and drop the image of human TBP here:

Q17: Write out the secondary structure of this protein, beginning with the NH2-terminal α-helix and ending with the COOH-terminal β-strand. Note that the N- and C-termini are proximal to one another. For example, you would write the α- and β- sequence of the protein as follows: α-β-α-β-… etc., and indicate whether the β-sheets are parallel or antiparallel.

Q18: A bit redundant with Q17, but to reinforce the concept: How many β-strands does the protein contain? And how many β-sheets does it contain?

Q19: Compare the actual, known secondary structure from PDB with the PSIPRED prediction. Are they the same? Are they reasonably similar? Or, do they appear to come from different universes? How does this comparison help you to judge the accuracy of the PSIPRED tool?
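One simple way to put a number on the Q19 comparison is per-residue agreement: the percentage of positions at which the predicted and observed states match. The Python sketch below assumes you have written both the prediction and the observed structure as strings of H, E, and C covering exactly the same residues; if the solved structure covers only part of the sequence, trim the prediction to the same range first. The function name and both example strings are hypothetical and are not TBP data.

```python
def percent_agreement(predicted, observed):
    """Percentage of positions where two equal-length state strings match."""
    if len(predicted) != len(observed):
        raise ValueError("strings must cover the same residues")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted.upper(), observed.upper()))
    return 100.0 * matches / len(predicted)


# Hypothetical strings for illustration only.
predicted = "CCHHHHHCCEEEEC"
observed = "CHHHHHHCCCEEEC"

print(round(percent_agreement(predicted, observed), 1))
```

A single percentage hides where the disagreements fall, so it is still worth looking at whether the mismatches sit at the ends of helices and strands or replace whole elements.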

Q20: Back to the PDB tertiary structure: TBP interacts specifically with the minor groove of the DNA at two locations. Does it do so via α-helices, β-sheets, or random coils?

Q21: What structural motif(s) in the protein is likely to interact non-specifically with the phosphate backbone of the DNA? Explain briefly.

Q22: Is the TATA-Binding Protein associated with any human diseases? Specifically, does the glutamine-rich repeat sequence play a role in diseases associated with TBP? How will you/did you locate information on this topic?
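When reading about the glutamine-rich repeat mentioned in Q22, it can help to know how long the repeat is in the sequence you downloaded. The Python sketch below finds the longest uninterrupted run of one residue (Q by default) and reports where it starts. The function name and the toy sequence are hypothetical; the toy sequence is not TBP, so paste in your own upper-case sequence.

```python
from itertools import groupby


def longest_run(seq, residue="Q"):
    """Return (start, length) of the longest run of `residue`.

    `start` is 1-based; (0, 0) is returned if the residue never occurs.
    """
    best_start, best_len = 0, 0
    position = 1  # 1-based position of the current run's first residue
    for aa, run in groupby(seq.upper()):
        run_len = len(list(run))
        if aa == residue and run_len > best_len:
            best_start, best_len = position, run_len
        position += run_len
    return best_start, best_len


# Toy sequence for illustration only; this is NOT the TBP sequence.
toy = "MKQQQQALQQGK"
print(longest_run(toy))
```

Because only uninterrupted runs are counted, a repeat broken by even one other residue is reported as separate, shorter runs.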