CSH Computational Genomics Course

Nov 2-8, 2005

Workshop VII (?): Comparative Genomics for Gene Regulatory Elements

These exercises will continue to examine the ENm008 region from the last workshop, but now we’ll look at conservation of transcription factor binding sites and regulatory potential. Of course, you can look at any locus you like, but you should stick with the human genome, May 2004 (hg17) assembly.

1. Conserved transcription factor binding sites (cTFBSs)

Ivan Ovcharenko designed zPicture and Mulan to export their alignments to the rVista and MultiTF servers to obtain information on TFBSs and whether or not they are conserved.

If your RID from the last workshop is still “active,” use it to return to your multiple alignments in Mulan. If not, try m11040822693626, or repeat the alignments.

From the summary page (which includes the dynamic visualization and the dot-plots), send the alignments to MultiTF. Stick with the default settings on the page “Defining transcription factor binding sites.”

Choose the “transcription factors” GATA1, NFE2, and TAL1BETAE4. There are multiple entries related to GATA1 and TAL1; these are other weight matrices for the binding site. Feel free to choose others and follow your own interests. Click on submit. (My RID for this exercise is mlr11042005210219578; you may want to use it if you run into problems.)

View the results as “dynamic visualization.” Where do you see groups of cTFBSs? Compare these to the pattern of “all sites”, i.e. including the sites found in only a single species. Which is more selective?

You may wish to explore other features, such as the ability to search for any user-defined consensus sequence (on the page “Defining transcription factor binding sites”) and a tool for finding clusters of cTFBSs. In fact, dcode.org has a server that finds such clusters genome-wide for human and mouse. Pairwise alignments can be sent from zPicture to an equivalent cTFBS finder called rVista.
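To make the idea of a user-defined consensus search concrete, here is a minimal Python sketch that scans both strands of a sequence for an IUPAC consensus. It is not how MultiTF or rVista work internally (they use weight matrices). The function names and the test sequence are made up for illustration, and WGATAR is used only as the commonly quoted GATA consensus.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA string (IUPAC codes allowed)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_consensus(seq, consensus):
    """Return sorted (start, end, strand) hits, 0-based and half-open.

    Hypothetical helper for illustration; coordinates always refer to
    the forward strand of `seq`.
    """
    seq = seq.upper()
    hits = set()
    motifs = (("+", consensus.upper()), ("-", reverse_complement(consensus)))
    for strand, motif in motifs:
        pattern = "".join(IUPAC[base] for base in motif)
        # A lookahead lets overlapping matches be reported as well.
        for match in re.finditer("(?=(%s))" % pattern, seq):
            hits.add((match.start(), match.start() + len(motif), strand))
    return sorted(hits)


if __name__ == "__main__":
    # Made-up sequence with one site on each strand.
    for hit in find_consensus("CCAGATAAGGCTTATCTCC", "WGATAR"):
        print(hit)
```

A consensus search like this reports every match in a single sequence, which corresponds to the “all sites” view. Requiring the match to fall in an aligned, conserved position in a second species is what makes a site a cTFBS.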

2. Regulatory potential

At the UCSC Genome Browser, bring up the 5x Regulatory Potential track (full mode, max set to 0.2) and TFBS conserved (pack mode) (both are under “Expression and Regulation”), along with the RefSeq genes and Conservation. Zoom in on chr16:100,001-170,000.

Do you see any noncoding regions with high RP and some cTFBSs that look like good candidates for cis-regulatory modules?

Note that this version of cTFBSs has similarities and differences to those obtained with MultiTF. Can you think of factors that would contribute to the differences?

3. Predict some cis-regulatory modules (CRMs)

Use Galaxy to import data from the UCSC Table Browser to predict some CRMs, e.g. in chr16:100,001-170,000.

For 5x Regulatory Potential, use the Table Browser filter to get positions with RP>=0.1. Import it as “data points”, which is UCSC’s “wiggle” format.

At Galaxy, convert the wiggle format to interval format, using the converter located under “Get Genomic Scores”. Then merge them, using the “merge” tool under “Operate on Genomic Intervals.”
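If you want to see what the filter, convert, and merge steps amount to, the following Python sketch does the same thing on a small list of scored windows. It is only an illustration under stated assumptions: the function names and the example numbers are invented, coordinates are treated as 0-based and half-open, and directly adjacent windows are merged, which may or may not match the Galaxy tool's exact behavior.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (chrom, start, end) intervals.

    Assumes 0-based, half-open coordinates, so an interval starting
    where the previous one ends is treated as adjacent and merged.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def high_score_intervals(scored, threshold=0.1):
    """Keep windows scoring at or above the threshold, then merge them.

    `scored` is a list of (chrom, start, end, score) tuples, a
    simplified stand-in for wiggle data converted to intervals.
    """
    kept = [(c, s, e) for c, s, e, score in scored if score >= threshold]
    return merge_intervals(kept)


if __name__ == "__main__":
    # Invented scores, not real Regulatory Potential values.
    windows = [
        ("chr16", 100, 105, 0.15),
        ("chr16", 105, 110, 0.12),
        ("chr16", 110, 115, 0.05),
        ("chr16", 120, 125, 0.30),
    ]
    print(high_score_intervals(windows))
```

Merging matters because the score track comes as many short windows. Without it, each high-scoring window would be counted as its own candidate interval.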

Import cTFBS.

Import the exons of RefSeq Genes. After you query for RefSeq genes, the next page gives you the option of limiting the output to exons.

Subtract the exon intervals from the high RP intervals (“Difference” under “Operate on Genomic Intervals”).

For the non-exonic, high RP intervals, find the ones that overlap with a cTFBS (“Overlap” under “Operate on Genomic Intervals”).

These are rather stringently selected candidate CRMs.
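The subtract-then-overlap logic used to select these candidates can be sketched in a few lines of Python. This is an illustration, not Galaxy's implementation: the function names and example coordinates are invented, coordinates are assumed to be 0-based and half-open, and subtraction here removes only the overlapping bases rather than discarding whole intervals.

```python
def subtract(intervals, masks):
    """Remove from each interval every base covered by a mask interval.

    Both arguments are lists of (chrom, start, end) tuples with
    0-based, half-open coordinates (an assumption of this sketch).
    """
    result = []
    for chrom, start, end in intervals:
        pieces = [(start, end)]
        for m_chrom, m_start, m_end in masks:
            if m_chrom != chrom:
                continue
            remaining = []
            for p_start, p_end in pieces:
                if m_end <= p_start or m_start >= p_end:
                    # No overlap: keep the piece unchanged.
                    remaining.append((p_start, p_end))
                    continue
                if p_start < m_start:
                    remaining.append((p_start, m_start))  # left flank
                if m_end < p_end:
                    remaining.append((m_end, p_end))      # right flank
            pieces = remaining
        result.extend((chrom, s, e) for s, e in pieces)
    return result


def overlapping(intervals, features):
    """Return the intervals that overlap at least one feature."""
    return [
        iv for iv in intervals
        if any(f[0] == iv[0] and f[1] < iv[2] and iv[1] < f[2] for f in features)
    ]


if __name__ == "__main__":
    # Invented coordinates for illustration only.
    high_rp = [("chr16", 100, 200), ("chr16", 300, 400), ("chr16", 500, 600)]
    exons = [("chr16", 150, 170), ("chr16", 290, 420)]
    ctfbs = [("chr16", 110, 120), ("chr16", 160, 165), ("chr16", 550, 560)]

    non_exonic = subtract(high_rp, exons)
    candidates = overlapping(non_exonic, ctfbs)
    print(len(candidates), candidates)
```

The order of operations matters: a cTFBS that lies inside an exon (like the one at 160-165 in the example) no longer supports any interval once the exons have been subtracted.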

How many did you find and where are they?
