Dabbling in Bioinformatics

Bio/CS – 251 April 12, 2006

Laboratory 10: Hunting for genes in a DNA sequence

Today, we examine web-based tools for finding genes hidden in eukaryotic genomes. We will locate and identify genes in stretches of DNA sequence from Aspergillus nidulans, a common bread mold fungus and the favorite organism studied by Dr. James.

This laboratory will rely on methods and websites that are presented in Chapter 5, pp. 158-172 in Bioinformatics for Dummies.

Objective: Obtain the following DNA sequence, and ORF it! (In other words, find all of the Open Reading Frames, or ORFs, and determine what proteins they encode.)

Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/

In the Search pull-down menu, select Gene and press Go. A new page will appear. Along the left column, point your browser to “Genomic Biology”. Then, along the right column of the new page, under “Genome Resources”, find the line labeled “Aspergillus”. On that line, click the button labeled with a G. A new page, “Aspergillus Genome Resources”, will appear. Along the right column, open the link to “A. nidulans Database at the Broad Institute”. This will direct you to the A. nidulans website, which is constructed and maintained by the Broad Institute at the Massachusetts Institute of Technology (MIT).

a. The A. nidulans genome sequence is ~30 million base pairs in length, or about 1/100th the size of the human genome, and codes for roughly 8500 genes. It is broken up into manageable-sized chunks called contigs. A contig is one contiguous stretch of DNA assembled from a number of smaller, overlapping sequences. Today you will identify and study all of the genes encoded in a small, 15,000 bp sub-region of Contig #26. To obtain this chunk of DNA sequence, do the following:

b. Check that your browser is pointed to the Aspergillus nidulans database:

http://www.broad.mit.edu/annotation/fungi/aspergillus_nidulans/

c. Point your browser to “Browse Regions”.

d. In the box labeled “Supercontig number”, enter 1.26

In the box labeled “Start”, enter 357000

In the box labeled “Stop”, enter 372000

Click on the hotlink button labeled “DNA Sequence”

Copy/paste this 15 kb sequence here, and convert it to 10 pt courier font:

For the following exercises, follow along in pp. 158-163 in BFD.

e. Use ORF Finder to locate all of the potential Open Reading Frames (ORFs) in this 15 kb stretch of DNA. ORF Finder will predict ORFs, i.e., long stretches of DNA that could potentially contain a protein-coding portion of a gene. ORF Finder is a graphical analysis tool that finds all open reading frames of a selectable minimum size, usually 100 nucleotides.
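For the curious, the idea behind an ORF scan can be sketched in a few lines of Python. This is a minimal illustration, not ORF Finder's actual implementation: it assumes that an ORF runs from an ATG to the next in-frame stop codon, it looks at only one reading frame of the sequence it is given, and the function name find_orfs_in_frame is made up for this handout.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs_in_frame(seq, frame=0, min_len=100):
    """Return (start, end) pairs for ATG...stop ORFs in one reading frame.

    Positions are 0-based and 'end' is exclusive, so seq[start:end] is the ORF
    including its stop codon. ORFs shorter than min_len nucleotides are skipped,
    and an ORF that runs off the end of the sequence without a stop is ignored.
    """
    seq = seq.upper()
    orfs = []
    start = None  # position of the ATG that opened the current ORF, if any
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            end = i + 3
            if end - start >= min_len:
                orfs.append((start, end))
            start = None
    return orfs
```

Calling find_orfs_in_frame(seq, frame=1) scans the same sequence one nucleotide further along; lowering min_len shows how quickly short, spurious ORFs pile up.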

(1) To access ORF Finder, go to NCBI: http://www.ncbi.nlm.nih.gov

(2) Under HOTSPOTS in the right column, choose ORF Finder:

http://www.ncbi.nlm.nih.gov/gorf/gorf.html

(3) Copy/paste the 15 kb sequence into the ORF Finder box (just the sequence!), and click on the OrfFind button.
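If the copied text carries a header line, position numbers, or spaces, it can be reduced to bare bases before pasting. The sketch below is one way to do that, assuming a FASTA-style header beginning with ">" and a sequence made up only of A, C, G, T, and N; the function name clean_sequence is hypothetical.

```python
def clean_sequence(text):
    """Drop FASTA header lines, digits, and whitespace; return uppercase bases."""
    # Header lines start with '>' in FASTA format and are not part of the sequence.
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    # Keep only base letters; this discards spaces and position numbers.
    return "".join(ch for ch in "".join(lines).upper() if ch in "ACGTN")
```

After cleaning, len(clean_sequence(text)) should come out at roughly 15,000 for the region requested above.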

Six parallel horizontal bars will appear on the screen. Each will contain a number of blue boxes. Each blue box represents one potential ORF, i.e., one potential exon. You will see that a blizzard of potential ORFs can be found in this region, but only a small number of them represent genuine protein-coding regions. Your assignment is to find out which ones belong to real genes! Begin by pasting a copy of the screen below. Also, keep this page open for further investigations.

f. Before you test the ORFs to see which ones are real, answer the following questions:

Q1: What do the six bars represent? Explain why there are six bars, and explain how the three parallel bars at the top differ from the three parallel bars at the bottom.

Q2: Could a single gene be contained in multiple, adjacent or overlapping ORFs? In other words, is the protein-coding region of a gene necessarily contained in a single ORF, or could the protein-coding region be broken up into more than one ORF? Why or why not?

g. Sometimes it can help to determine which potential ORFs are real by comparing your output from ORF Finder with another gene-finding tool, called GeneMark.

Leave your ORF Finder window open, showing the ORF map of the 15 kb region we are studying.

Open a new window, and point your browser to http://opal.biology.gatech.edu/GeneMark/.

Then, choose the link corresponding to “Gene Prediction in Eukaryotes” associated with the rat icon. Click on the link labeled GeneMark-E and GeneMark.hmm-E. Follow the directions on pp. 162-163, choose C. elegans as the available species most closely related to A. nidulans, and click on the Start GeneMark.hmm button. You will receive a PDF output that shows the position of all probable genes in the 15 kb sequence.

Unfortunately this tool, like ORF Finder, also predicts many more functional ORFs than really exist. Careful examination of the output may help to narrow the field. However, another way to make sense of your ORFs is to return to ORF Finder and use the associated blastp feature to BLAST a subset of the ORFs.

h. Return to the ORF Finder window, and proceed as follows:

Find the FOUR (4) bona fide genes in this 15,000 bp region. Find all of the ORFs (exons) corresponding to these four genes, as follows:

(1) Based on the assumption that the longer the ORF, the more likely it is to represent a bona fide gene, use Blastp to BLAST the largest eight (8) ORFs, for starters. For each of these 8 BLAST searches, do the following:

(a) First, click on the blue ORF that you intend to BLAST. Use an organized approach: Begin at the left, and work to the right. When you click on the desired ORF, the screen will refresh and the highlighted ORF will become purple. Also, the DNA sequence of the ORF, and a corresponding translation, will be displayed. For each ORF that you search, paste the DNA + protein sequences into a MSWord file, and label it clearly for identification purposes.

(b) Second, BLAST the ORF. For each Blastp search, ask for a graphical output and specify 10 descriptions + 10 alignments. Obtain the output, and then paste the output below the sequence + translation from (a) above. Use 10 point Courier font throughout.

During this effort, you will need to use your judgement to assess the quality of the Blastp hits that are produced, and decide if the hits are significant or if they are meaningless. In any event, for the time being save the output of these searches. Clues for making good judgements include the following:

1. E-value: Is the e-value less than 10^-15 (written 1e-15 in BLAST output)?

2. Does the ORF contain a putative conserved domain? If so, what is it? List it or copy in a description of the conserved functional domain (a conserved domain is a protein region that is the same or very similar in many proteins, because it provides a function that is common to many proteins)

If the answer to these two questions is YES, then you have probably hit a bona fide gene.
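The two-question rule of thumb above can be written down as a small check. This is only a sketch of the handout's criteria, not part of BLAST: the function name looks_like_real_gene and its arguments are hypothetical, and the e-value and domain description are values you would read off the Blastp output yourself.

```python
def looks_like_real_gene(e_value, conserved_domain, threshold=1e-15):
    """Apply the two clues: a very small e-value AND a putative conserved domain.

    e_value          -- e-value of the hit being judged (BLAST may report 0.0)
    conserved_domain -- description of the conserved domain, or None if none was found
    threshold        -- cutoff from this handout (10^-15)
    """
    return e_value < threshold and bool(conserved_domain)
```

A hit with an e-value of 1e-40 and a named domain passes; a hit with an e-value of 0.001, or one with no conserved domain, does not. Borderline cases still call for your own judgement.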

Each time you begin work with a new ORF, start a new page in your MSWord document.

i. After you have identified each of the four different genes, go back and BLAST the appropriate smaller ORFs that are adjacent to each identified gene on either side, to learn if the gene is contained on more than one ORF.

j. Completing the assignment:

To complete this assignment to identify the four real genes, you will probably need to BLAST 17-18 total ORFs from this 15,000 nt sequence.

Please submit the following to complete this assignment:

(1) Sequence + translation of each ORF that belongs to a real gene.

(2) Blastp outputs for each real-gene ORF that include 10 descriptions + one keystone alignment to an orthologous gene whose function is well-described and well-understood. In other words, don’t necessarily choose an alignment because it has the lowest (best) e-value; an alignment to a “hypothetical protein” is uninformative. If your 8th-best alignment is the first one to list a protein with a real name (e.g., cyclic AMP-dependent protein kinase), and this alignment’s e-value is similar to those of the 7 better matches, then use this identification for your Aspergillus nidulans ORF(s).

(3) A schematic diagram depicting the order of the four genes and the distances separating each one.

(4) In addition, the schematic diagram must show the relative position and the reading frame of each ORF belonging to a gene. If multiple ORFs (exons) belong to the same gene, this must be clearly described and diagrammed.

(5) Finally, after we’ve made you go through all of this labor, we’re going to teach you another tool that you can use to check the veracity of your work.

-- Go to the A. nidulans database

(http://www.broad.mit.edu/annotation/fungi/aspergillus_nidulans/)

-- As before, point your browser to “Browse Regions”.

- In the box labeled “Supercontig number”, enter 1.26

- In the box labeled “Start”, enter 357000

- In the box labeled “Stop”, enter 372000

-- This time, instead of obtaining the 15 kb DNA sequence, obtain the Feature Map for this region.

-- WOW! Print out the Feature Map and answer the following questions:

A. What do the blue boxes represent?

B. What do the green boxes represent?

C. Click on each blue box corresponding to a gene that you identified by using ORF Finder. This will reveal more annotation about each sequence.

Answer the following question: Does your repetitive blastp searching agree with the automated gene-finding annotation with regard to gene identity and gene structure?

Be sure to document similarities and differences between your manual efforts and the autocalling software that Feature Map uses.
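One way to organize that comparison is to tabulate, for each ORF you called by hand, which annotated features it overlaps and by how many base pairs. The sketch below is only an illustration: the function names are hypothetical, intervals are assumed to be inclusive (start, end) positions on the same coordinate system, and the coordinates in the usage note are made-up placeholders, not real results.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compare(manual, annotated):
    """Map each manually called ORF to the annotated features it overlaps.

    manual, annotated -- dicts of name -> (start, end)
    Returns a dict of ORF name -> {feature name: overlap in bp}.
    """
    report = {}
    for orf_name, orf_span in manual.items():
        hits = {}
        for feat_name, feat_span in annotated.items():
            shared = overlap(orf_span, feat_span)
            if shared > 0:
                hits[feat_name] = shared
        report[orf_name] = hits
    return report
```

For instance, with the placeholder inputs manual = {"ORF_A": (100, 400)} and annotated = {"gene_1": (350, 900), "gene_2": (1000, 1200)}, compare(manual, annotated) reports that ORF_A shares 51 bp with gene_1 and nothing with gene_2. An ORF with an empty entry is one that the automated annotation did not call.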