Goldminer Tutorial

There are two approaches to gene annotation. The first is by matching against known and annotated genes in existing databases, with representatives such as BLAST (Altschul et al. 1990; Altschul et al. 1997), FASTA (Pearson and Lipman 1988; Pearson and Miller 1992), or pFAM (Bateman et al. 1999; Bateman et al. 2004). The second, best represented by GENSCAN (Burge and Karlin 1997) and GLIMMER (Salzberg et al. 1998), is by matching with known gene structures and involves two categories of methods: neural network algorithms and hidden Markov models. In this laboratory, we will gain advanced computational skills in the first approach. Goldminer is a Windows-based program that automates local and remote BLASTing and pFAM searches to annotate sequences such as expressed sequence tags (ESTs).

Gene annotation in large-scale gene expression studies is similar to genome annotation in that both involve a large number of sequence fragments to be annotated and that both would search against databases of known genes. The main difference between the two is that genome annotation would also involve annotating non-transcribed sequences whereas gene annotation in gene expression studies involves annotation of only transcribed sequences.

Suppose that you study gene expression in goldfish brains and have already accumulated a large number of ESTs (expressed sequence tags). How are you going to know what gene products these ESTs code for? If there were a well-annotated goldfish genome, you could just BLAST the ESTs against it. Unfortunately, no goldfish genome is available.

Fortunately, there is a well-annotated zebrafish genome. Because goldfish and zebrafish are phylogenetically related, we can BLAST the goldfish ESTs against the zebrafish genome. This can be done fairly quickly because we can have the zebrafish genome installed locally. Those goldfish ESTs that have no match in the zebrafish genome can then be BLASTed against the GenBank databases maintained by NCBI. Remote BLASTing is quite slow, so one should do remote BLAST only with the subset of sequences that find no match by local BLAST.
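The two-stage strategy above (local BLAST first, remote BLAST only for what is left) can be sketched in a few lines of Python. This is only an illustration of the selection logic, not Goldminer's own code: the record layout, the function names and the 0.01 cutoff are assumptions, and the hits listed are made-up values.

```python
# Hypothetical record layout: (est_id, matched_gene_or_None, e_value_or_None).
E_VALUE_CUTOFF = 0.01  # assumed cutoff; pick one suited to your own data


def needs_remote_blast(hit, cutoff=E_VALUE_CUTOFF):
    """Return True if an EST had no local match or only a weak one."""
    est_id, gene, e_value = hit
    return gene is None or e_value is None or e_value > cutoff


def select_for_remote(local_hits, cutoff=E_VALUE_CUTOFF):
    """Return the IDs of ESTs that should be sent to the remote server."""
    return [hit[0] for hit in local_hits if needs_remote_blast(hit, cutoff)]


# Made-up local BLAST results, for illustration only.
local_hits = [
    ("A1", "gene_x", 1e-50),  # strong local match: keep it
    ("A2", None, None),       # no local match: send to remote BLAST
    ("A3", "gene_y", 0.5),    # weak local match: send to remote BLAST
]

remote_queue = select_for_remote(local_hits)
print(remote_queue)
```

Only the ESTs in remote_queue would then be submitted to NCBI, which keeps the slow remote step as small as possible.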

Unless you happen to know the function of the matched gene returned from a BLAST query, BLASTing itself provides little information on gene function. Yet the function of the gene is crucially important in interpreting altered gene expression patterns. For this reason, one essential component of gene annotation is to classify genes into functional categories. The pFAM server consists of a database of multiple alignments of protein domains or conserved protein regions, together with the associated searching tools. The aligned protein domains are associated with the function of the proteins. The profile hidden Markov models (profile HMMs) built from the pFAM alignments are used to assign a new protein to an existing protein family.

Objectives:

1. Develop a clear conceptual framework for gene annotation by advanced BLAST and pFAM searches using Goldminer (the “gold” part comes from goldfish, as you might have guessed).

2. Gain hands-on experience in annotating a subset of goldfish ESTs.

Procedures:

1. Install Goldminer from http://aix1.uottawa.ca/~xxia/software/goldminer.htm. Unless your computer is extremely old, all you need to do is click the Goldminer.msi file, click the Open button, and then click the Next button a few times in response to the dialog boxes. The program is large because I packed a zebrafish CDS database with it. The default installation directory is C:\Program Files\Goldminer. Under this directory, three subdirectories are created during the installation process:

a. Plate directory which contains a single sample file: CaNCBI.FAS with 49 sequences from 49 goldfish mRNAs. You may put your own sequences into this same directory.

b. BLASTDB directory which contains the sample zebrafish CDS BLAST files for you to practice local BLAST with the CaNCBI.FAS file.

c. ESTDB directory which contains files with annotated sequences.

2. Use Goldminer to do annotation.

a. Open EST sequence file

i. Click Start|All Programs|Goldminer to start the program

ii. Click ‘Tools|Options’ to set the program defaults (You do not need to do this if you use the Goldminer default). The “EST plate directory” is where you should store your unannotated sequence files. The default is GoldminerDir\Plate (where GoldminerDir is the Goldminer installation directory, being “C:\Program Files\Goldminer” by default). The “EST database directory” stores files containing annotated or partially annotated sequences, and the default is GoldminerDir\ESTDB. The “BLAST program directory” is where the BLAST programs are located and you are advised to leave it as the default. The “BLAST database directory” is where you have stored your personal local BLAST databases, and the default is GoldminerDir\BLASTDB.

iii. Click ‘File|Open plate files’ to read in the CaNCBI file. Goldminer can recognize many different sequence formats, but FASTA format is the most frequently used sequence format in gene expression studies. Hence the default of FASTA format. At this point, we do not know what genes these sequences are, and most columns are blank.
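If you have not seen the FASTA format before: each record is a header line starting with “>” followed by one or more lines of sequence. The short Python sketch below reads such text into (header, sequence) pairs. It is a minimal illustration only; the function name and the two records are made up, and Goldminer's own reader is not shown here.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up records, for illustration only.
example = """>clone_1 made-up example
ATGGCC
AAGTAA
>clone_2 made-up example
ATGTTT
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```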

b. Local BLAST. You are advised to always do local BLAST first so that only a small fraction of the sequences will then be searched against remote BLAST databases in NCBI. This reduces the chance of overloading the NCBI BLAST server.

i. Click ‘BLAST|ReBlast against genomic DB’.

ii. A dialog appears. In the ‘EST option’, click ‘ReBlast All ESTs’. In the bottom frame, leave the default unchanged, i.e., ‘Blast against local database’.

iii. Specify the local database by clicking the “Browse” button. If you keep the default, you will see the zebrafish.rna file. Double-click it to set.

iv. Set other BLAST parameters if necessary. Leave as default if you do not know what they mean.

v. Click the ‘Done’ button to start local BLASTing. Once the BLASTing is finished (it may take quite a while depending on your computer speed), you will see the output with some ESTs annotated with zebrafish genes.

vi. Many of the ESTs have now been annotated against zebrafish genes, with highly significant e-values. You may note that sequence A2 has no match.

vii. A few hidden functions

· Now right click anywhere in the ‘Matched gene’ column and click ‘Find’. In the dialog, enter ‘casein kinase 1’ (without quotes) and click OK. You will find three genes (D5-D7) highlighted in red (you may have to scroll down to see them). If the sequences are from your own cloning experiment, this would mean that the transcript of the casein kinase 1 gene has been cloned multiple times, and that the clones all match the same zebrafish casein kinase 1 gene (NM_152951.1). This provides useful information in two ways. First, the casein kinase 1 gene in goldfish must be highly expressed in the brain tissue. Second, if you are studying gene expression by spotted cDNA microarray, then there is no need to spot these replicate clones of the casein kinase 1 transcript into multiple sets of probe cells. One set of probe cells is sufficient.

· Now right click the ‘GeneID’ entry for sequence D5 (i.e., NM_152951.1) and click ‘GenBank Sequence’. The annotated zebrafish casein kinase 1 gene is displayed for you to obtain further information about the gene.

· You may also right click the ‘GeneID’ entry for sequence D5 (i.e., NM_152951.1) and then click ‘Show HSP’ (HSP stands for high-scoring segment pair) to see the details of the matched segments.

viii. Click ‘File|Save’ to save the sequences.
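The replicate-clone check described under ‘A few hidden functions’ (several ESTs matching the same GeneID) can also be done in bulk rather than one search at a time. The Python sketch below groups ESTs by matched GeneID and keeps only the IDs hit more than once. The function name and the list layout are assumptions for illustration; D5-D7 and NM_152951.1 come from the sample data above, while the D4 entry is a made-up placeholder.

```python
from collections import defaultdict


def find_replicate_clones(annotations):
    """Group EST IDs by matched GeneID and keep genes hit by more than one EST."""
    groups = defaultdict(list)
    for est_id, gene_id in annotations:
        if gene_id is not None:  # ESTs without a match are skipped
            groups[gene_id].append(est_id)
    return {gene: ests for gene, ests in groups.items() if len(ests) > 1}


# (est_id, gene_id) pairs as they might be read from the annotation table.
annotations = [
    ("D4", "PLACEHOLDER_ID"),  # made-up entry, for illustration only
    ("D5", "NM_152951.1"),
    ("D6", "NM_152951.1"),
    ("D7", "NM_152951.1"),
    ("A2", None),              # no local match
]

replicates = find_replicate_clones(annotations)
print(replicates)
```

Each key in the result is a gene for which only one clone needs to be spotted on the array.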

c. Remote BLAST against NCBI database: You may have noted that several sequences (e.g., A2) do not have a matched entry or have only a poorly matched entry in the zebrafish BLAST database. Naturally you would wish to know if you can find better matches in other databases. It is often impractical to store all databases locally because of the sheer amount of disk space needed and because it is very difficult to keep these terabyte-size databases up to date. So we will take advantage of the regularly updated databases maintained at NCBI.

i. Click ‘BLAST|ReBlast against genomic DB’.

ii. A dialog appears. In the ‘EST option’, set the option to ReBlast ESTs with e-value greater than 0.01 (or a smaller cutoff if you prefer). In the bottom frame, choose the option to ‘Blast against NCBI databases’. What is an e-value? What does the default e-value of 0.01 mean?

iii. Specify the NCBI database (or just leave the default of ‘nr’ which stands for non-redundant) and set other BLAST parameters if necessary (or just use the default value).

iv. Click the ‘BLAST’ button to start BLASTing against the chosen NCBI database. Note that the NCBI BLAST server often needs to handle thousands of queries per hour, and is prone to being flooded. We could be selfish and send all queries to BLAST quickly, but selfishness is incompatible with a civilized society. So Goldminer sends only one query EST at a time and does not send another until the first has been processed. This guarantees that NCBI will never identify us as bad citizens (or the Goldminer programmer as an inconsiderate scientist). You can leave Goldminer to do its job and go about other business. Because remote BLASTing is slow, a progress bar is implemented to alleviate your frustration (no progress bar is implemented for local BLAST, which is fairly fast).

v. Once the BLASTing is over, those sequences that had no match or only a poor match may have found new matches or may stay the same as before. You may note that A2, which had no match after local BLAST, now has a good match. At this point there is still no information on functional classification.

vi. Click ‘File|Save’ to save the sequences.
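The “one query at a time” policy described in step iv can be sketched as a simple loop in which the next query is sent only after the previous one has returned. This is an assumption-laden illustration, not Goldminer's actual implementation: blast_one_at_a_time, submit and pause_seconds are hypothetical names, and fake_submit is a stand-in that does no network access at all.

```python
import time


def blast_one_at_a_time(queries, submit, pause_seconds=0.0):
    """Send queries sequentially; the next starts only after the previous returns."""
    results = {}
    total = len(queries)
    for done, (est_id, sequence) in enumerate(queries, start=1):
        results[est_id] = submit(sequence)  # blocks until this query is finished
        print(f"{done}/{total} queries processed")  # crude progress report
        if done < total and pause_seconds > 0:
            time.sleep(pause_seconds)  # optional extra courtesy delay
    return results


def fake_submit(sequence):
    """Stand-in for a real remote search; it just returns the sequence length."""
    return len(sequence)


# Made-up queries, for illustration only.
results = blast_one_at_a_time([("A2", "ATGGCC"), ("A3", "ATG")], fake_submit)
print(results)
```

In real use, submit would be whatever function performs a single remote search and waits for its result; the loop itself never has more than one query outstanding.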

d. Remote search against the pFAM server hosted at Washington University in St. Louis.

i. Click ‘Sequence|Translate into AA’ to translate the nucleotide sequences into amino acid sequences. You will be asked to specify a translation table. All known translation tables have been implemented in Goldminer. For our sequences, the first (Standard) translation table should be used. Goldminer will then translate each sequence in three frames and choose the one with the fewest stop codons (shown as “*” in amino acid sequences). You will find that some sequences have “*” in the resulting translation. This can have two causes: (1) the sequence may contain non-coding fragments that are nevertheless treated as coding during the translation, and (2) the sequence could be a transcript from the mitochondrial genome, which means that the “Standard” translation table is not appropriate for the sequence. I will show you how to deal with these two problems later.

ii. Click ‘Func.Pred.|pFAM’ and set the parameters. If you do not know them, just use the default.

iii. Click the ‘Submit’ button to start. It may take a long time to finish. So a progress bar is implemented.

iv. Once the search is complete, most sequences will have been functionally annotated. A few of them will find no match, which may be due to wrong translation.

v. Click ‘File|Save’ to save the sequences.

vi. Click ‘Sequence|Restore ESTs’ to restore the original nucleotide sequences.

vii. Click ‘Func.Pred.|pFAM’ and leave the default option of ‘Nucleotide sequences translated into aa sequences in three frames’

viii. Choose either ‘RePFAM ESTs with no match’ or ‘RePFAM ESTs with e-values larger than, say, 0.01’.

ix. Click the ‘Submit’ button to start. Those previously mistranslated sequences should now have matches.

x. There might still be sequences with no matches, and these may be mitochondrial sequences. The annotations added to these sequences after the local and remote BLAST will tell you whether they are mitochondrial sequences.

xi. Click ‘File|Save’ to save your file.

xii. Right-clicking any entry in the ‘pFamID’ column and then clicking ‘pFam Gene’ will take you to the pFAM seed protein in its function group. For example, right-clicking the first pFamID, i.e., ‘DEAD’, will take you to the full annotation of the DEAD/DEAH box helicase at the pFAM server.
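The frame-selection rule described in step d.i (translate in three frames, keep the frame with the fewest stop codons) can be sketched as follows. This is a minimal illustration, not Goldminer's code: only the standard translation table is included, ties are broken in favour of the earlier frame (an assumption, since the tie rule is not stated), unrecognized codons are written as “X”, and the test sequence is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2)."""
    sequence = sequence.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(sequence) - 2, 3):
        # Codons containing ambiguous bases are reported as "X".
        protein.append(CODON_TABLE.get(sequence[i:i + 3], "X"))
    return "".join(protein)


def best_frame(sequence):
    """Return (frame, protein) for the forward frame with the fewest stops."""
    translations = [(frame, translate(sequence, frame)) for frame in range(3)]
    # min() keeps the first minimum, so ties go to the earlier frame.
    return min(translations, key=lambda item: item[1].count("*"))


# Made-up sequence: frames 0 and 2 each contain a stop codon, frame 1 does not.
frame, protein = best_frame("TGATCTAGGC")
print(frame, protein)
```

A sequence from the mitochondrial genome would need a different codon table in place of AMINO_ACIDS, which is why such sequences can still show “*” under the standard table.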

3. There are a few more functions (such as copy, sort, change, find, etc.) in Goldminer. The column widths can be resized by the user. The last column is for custom annotation. Right-click and click ‘Change’ to add whatever notes you wish. You can right-click to find other functions.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215:403-410

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402

Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Research 27:260-262

Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR (2004) The Pfam protein families database. Nucleic Acids Research 32:D138-D141

Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268:78-94

Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences USA 85:2444-2448

Pearson WR, Miller W (1992) Dynamic programming algorithms for biological sequence comparison. Methods in Enzymology 210:575-601

Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544-548