Lab 1. Genome Browsing

Date: 5/12/13

Table of Contents

Exercise 1. UCSC Genome Browser

Exercise 2. Integrative Genomics Viewer (IGV)

Exercise 3. Ensembl genome browser

Exercise 4. Viewing custom data

Data that you will need for these assignments can be found at:

Exercise 1. UCSC Genome Browser


General information:

Briefly, a genome browser ‘maps’ mRNA sequences and other information onto the genome sequence. Usually, information from one specific source (such as ‘mRNAs from GenBank’ or ‘human-mouse conservation’) is shown in a separate ‘track’. The trick is to select the information (the tracks) you are interested in without getting overwhelmed by the rest.

Task 1.

1. FINDING A GENE IN THE GENOME

a) Click “Genomes” on the blue bar at the top of the screen. This brings you to the Genome Browser Gateway, where you can select between different assemblies for different genomes. Select the human genome assembly from March 2006 (NCBI36/hg18; note that newer human assemblies are also listed, so choose this one explicitly). In the box labeled “position or search term”, you can type in the name of a gene, an accession number or a chromosomal region. Some examples are given further down on the web page. For this exercise, we will investigate a gene called ADAM2, so enter that name in the position box and click “Submit”.

b) You should now see a list of genes (mRNA sequences, really) associated with the text “ADAM2”. The regions of the genome where these mRNA sequences align are also indicated as chromosome:start-end (the numbers are base positions on the chromosome). The different sections in the list (Known genes, RefSeq genes etc.) correspond to tracks in the Genome Browser; this will become clear soon. Try to find the ADAM2 gene in the list. Does it align in multiple genomic locations? If not, why do you see the same gene several times? Click on one of the hyperlinks for ADAM2.
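The chromosome:start-end notation used in the result list can be handled programmatically. The following is a minimal sketch in Python, assuming positions are 1-based and inclusive as displayed by the browser; the function names and the coordinates are made up for illustration and are not the real ADAM2 locus.

```python
import re


def parse_position(position):
    """Parse a 'chrom:start-end' string into (chrom, start, end).

    Commas used as thousands separators are accepted.
    """
    match = re.fullmatch(r"\s*(\w+):\s*([\d,]+)\s*-\s*([\d,]+)\s*", position)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def region_length(position):
    """Number of bases in the region (1-based, inclusive coordinates)."""
    _, start, end = parse_position(position)
    return end - start + 1


# Hypothetical coordinates for illustration only (not the real ADAM2 locus).
example = "chr8:1,000,001-1,050,000"
print(parse_position(example))
print(region_length(example))
```

Comparing the lengths of the regions reported for each hit is one quick way to see whether several entries in the list describe the same locus.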

2. ADJUSTING THE DISPLAY

a) You should now be presented with a stunning view of a chromosomal region. At the very top, we see a cartoon image of the chromosome we are looking at. Of course, the gene occupies a very small part of it, so the red marker close to the center of the chromosome shows the location of the ‘window’ we are looking at. Just below the cartoon is the actual window showing some different data sources that map to this region. At the top of the image is a scale that tells you which region of the chromosome you are looking at in actual numbers (genomic coordinates). Below are a number of tracks, showing different features in this particular region (the defaults are ‘STS Markers’, ‘UCSC Known genes’, ‘RefSeq Genes’, ‘mRNAs from GenBank’, ‘ESTs’, ‘conservation tracks’, ‘SNPs’ and ‘Repeat Elements’).

To avoid information overload, you can select which tracks to display from a number of pull-down menus under the image. As you see, there are MANY tracks to choose from, and many of them have different display modes (available options are full, pack, squish, dense and hide). The tracks of primary interest are usually those that display alignments of mRNA and EST sequences to the genome. Make sure that Known Genes, RefSeq Genes, Human mRNAs and Conservation are displayed in ‘full’. Adjust spliced ESTs to be displayed in 'pack' or 'full'. Hide or display other tracks as you like. Note that each track name is a hyperlink that brings up information about how the track was constructed. When you are done, click the 'refresh' button above the pull-down menus to see the new settings in effect. If you are still unhappy with how a track is displayed, you can click on the track name in the image to expand or collapse that track.

b) Above the image are buttons for moving and zooming. Zoom out to get an idea of the genomic context.

3. INTERPRETING THE VIEW

a) Start by looking at the “Human mRNAs” track. Make sure that you have them in full view. Each figure consisting of boxes connected by lines represents the alignment of one mRNA sequence (the accession is given to the left) to the genome. It is important to remember that each one is a spliced mRNA molecule aligned to the genome, so the alignment consists of aligned blocks corresponding to exons (boxes) separated by large gaps corresponding to introns (connecting lines between boxes). The arrows indicate the direction of transcription inferred from the sequences. The “RefSeq Genes” track shows alignments of mRNA sequences from the RefSeq database to the genome. The “UCSC Known Genes” track summarizes the most reliable information from various sources (UniProt, RefSeq and GenBank).

b) Go back to the view of the genomic region. Do the mRNA and EST sequences indicate this gene to be alternatively spliced? Since there are artifacts in sequence databases, you should carefully inspect the evidence for odd splice variants before you believe in them.

c) Go back to the view of the genomic region and turn on the ‘Genscan Genes’ track. Make sure the track is shown in full. How well does the Genscan track agree with the mRNA alignments (you might need to zoom out to make sure the entire predicted gene is displayed)? Why could that be?
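The box-and-line picture described in step 3a can be expressed as a list of exon intervals, with introns as the gaps between them. The sketch below is only an illustration in Python: the coordinates are made up, and the half-open interval convention is an assumption chosen for convenience.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exon blocks.

    exons: list of (start, end) tuples, half-open, in any order.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns


# Hypothetical exon blocks of one aligned mRNA (made-up coordinates).
exons = [(100, 250), (1200, 1350), (5000, 5400)]
introns = introns_from_exons(exons)

# Length of the spliced mRNA versus the stretch of genome it covers.
mrna_length = sum(end - start for start, end in exons)
genomic_span = max(end for _, end in exons) - min(start for start, _ in exons)

print(introns)
print(mrna_length, genomic_span)
```

The difference between the two lengths is why a short mRNA can occupy a very wide window in the browser.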

4. EVOLUTIONARY CONSERVATION TRACKS

a) Look at the Conservation track. This track shows you the level of conservation between human and a number of other species, based on whole-genome alignments. Note that the Y-axis is not a measure of percentage identity, but likelihood. What parts of the ADAM2 gene seem to be conserved? Are the alternatively spliced exon(s) conserved? Is there conservation upstream of the gene? Use your biology skills to explain.

b) Let's try to find the orthologous mouse gene. The most intuitive way to do it would perhaps be to choose a mouse assembly in the Genome Browser Gateway and enter ADAM2 in the position field, just as we did for human. However, this approach is risky, since orthologs do not always have the same names. In this case, it turns out that the intuitive approach gives you a clue as to where the mouse ortholog is located, but not a reliable answer (try it!). It is better to click on the gene name and look at the description of that gene. If you scroll down the page you find homologs in other species and can click on the mouse homolog.

Here is another approach: Open up a new Genome Browser window and select BLAT from the blue bar at the top. BLAT takes a sequence and aligns it with one of the genome assemblies on the UCSC site. Select the most recent mouse genome assembly. In a separate window, find the sequence of one of the human ADAM2 mRNAs that you have looked at, display it as FASTA and paste it into the large input field on the BLAT page. Set query type to “translated RNA” (Why translated? When would it make sense not to use the translated sequence?) and click Submit. The format of the search results should look familiar. Note that the entire mRNA sequence could not be aligned. Try to explain why not. Find the best alignment and click the 'browser' hyperlink to see that region of the mouse genome. Note that your alignment is displayed as a separate track.

Does it correspond to any mouse mRNAs and/or ESTs? Zoom out! This is just one way to find a potential ortholog. Try to think of a few other ways; you should know some by now.

c) Compare the gene structures (exon-intron structures) of the human and mouse genes. Can you find the same splice variants in the two organisms? Are the genes of approximately equal length? What about the mRNAs?
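As background for the “translated RNA” setting in step 4b: a translated search compares sequences at the protein level rather than the nucleotide level. The Python sketch below illustrates that general idea only; it is not how BLAT is implemented, the two sequences are made up, and the codon table is deliberately partial (only the codons used here).

```python
# Partial codon table: only the codons needed for this example.
CODONS = {
    "CTG": "L", "TTA": "L",
    "GCT": "A", "GCA": "A",
    "GGT": "G", "GGA": "G",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a coding sequence codon by codon (frame 0 only)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up coding sequences that differ only at synonymous positions.
seq_a = "CTGGCTGGTAAA"
seq_b = "TTAGCAGGAAAG"

print(identity(seq_a, seq_b))                        # nucleotide level
print(identity(translate(seq_a), translate(seq_b)))  # protein level
```

The two sequences disagree at several nucleotide positions yet encode the same peptide, which is the kind of situation to keep in mind when answering the questions in step 4b.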

5. GENE EXPRESSION AND REGULATION TRACKS

Click again on the name and have a look at the description of the gene. There you can find information about the function of the gene (Gene Ontology), domains in the gene and other interesting stuff.

Now look at the microarray expression data where you can find data from several different tissues and experiments. For now, look at the Normal Human Tissue arrays. In which human tissues is this gene mainly transcribed?

If you are interested in the medical relevance of this gene, click on the quick link to OMIM (Online Mendelian Inheritance in Man), which is the main disease gene database that is freely available.

6. LOADING CUSTOM TRACKS

We have provided you with a BED file containing peaks from a histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq experiment in a mouse myoblast cell line. The file, C2C12_myoblast_H3K4me3.bed, can be downloaded from the assignment website.

To view this data in the context of all other information available at UCSC, go to genomes, select the mouse assembly mm9, and click “add custom tracks”. Upload the bed file and go to the browser window. If you are interested in the methylation state in the promoter region of a specific gene, you may type the name of the gene in the “gene” window. Or you may search a specific region by writing the location in the “position” window.
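Before uploading, it can help to look at what a BED file contains. The following Python sketch reads the first columns of BED-formatted lines (chromosome, start, end and an optional name), assuming the standard BED convention of 0-based, half-open coordinates. The example lines and the `Peak` and `parse_bed` names are made up for illustration and are not taken from C2C12_myoblast_H3K4me3.bed.

```python
from collections import namedtuple

Peak = namedtuple("Peak", "chrom start end name")


def parse_bed(lines):
    """Parse BED lines, skipping blank lines, comments and track/browser lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        peaks.append(Peak(chrom, start, end, name))
    return peaks


# Made-up example lines for illustration.
sample = """track name="H3K4me3_demo" description="hypothetical peaks"
chr6\t100200\t100800\tpeak_1
chr6\t104000\t104600\tpeak_2
chr7\t5000\t5600
""".splitlines()

peaks = parse_bed(sample)
for peak in peaks:
    # With half-open coordinates the peak width is simply end - start.
    print(peak.chrom, peak.start, peak.end, peak.end - peak.start)
```

To try it on the provided file, pass an open file handle to `parse_bed` instead of the sample lines.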

Search for the gene Ssbp1.

To view other information on regulation, go to the section “Expression and regulation” and select the tracks that you think might be relevant. A suggestion is to choose some datasets with transcription factor binding sites (TFBS) and histone modifications.

Are there any ChIP-seq peaks from our experiment surrounding that gene? Have any H3K4me3 peaks been detected in other experiments? In what types of tissues or cell lines? Are there any other types of histone modifications reported in the same region?

Zoom out 10x to see the neighboring genes. Do they also have H3K4me3 peaks?
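The questions above amount to asking which peaks overlap a window, first around the gene and then around a window ten times wider. A minimal Python sketch of that check follows, assuming half-open coordinates; the window and peak coordinates are made up and do not describe the real Ssbp1 locus.

```python
def zoom_out(start, end, factor=10):
    """Widen a window around its centre by the given factor."""
    centre = (start + end) // 2
    half = (end - start) * factor // 2
    return max(0, centre - half), centre + half


def overlapping(peaks, chrom, start, end):
    """Peaks (chrom, start, end) sharing at least one base with the window."""
    return [p for p in peaks if p[0] == chrom and p[1] < end and p[2] > start]


# Hypothetical gene window and peaks (made-up coordinates).
gene_window = ("chr6", 100000, 101000)
peaks = [
    ("chr6", 100200, 100800),
    ("chr6", 104000, 104600),
    ("chr6", 200000, 200500),
    ("chr7", 100200, 100800),
]

chrom, start, end = gene_window
at_gene = overlapping(peaks, chrom, start, end)
wide_start, wide_end = zoom_out(start, end)
in_neighbourhood = overlapping(peaks, chrom, wide_start, wide_end)

print(at_gene)
print((wide_start, wide_end), in_neighbourhood)
```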

(Optional) Prepare PCR primers for the H3K4me3 peak

Get the DNA sequence for the H3K4me3 peak of Ssbp1. Use Primer3 to identify potential primer sequences by pasting in the DNA sequence. Take the primers and the PCR fragment and map them to the genome using BLAT to confirm that they indeed target a fragment inside the ChIP-seq peak.
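A quick sanity check of candidate primers can also be done by hand. The Python sketch below computes the reverse complement, GC content and a rough melting temperature using the simple Wallace rule (2 degrees per A/T, 4 degrees per G/C), and extracts the fragment between two exactly matching primers. This is an illustration only: Primer3 uses more detailed thermodynamic models, and the template sequence and function names here are made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C bases."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def amplicon(template, forward, reverse):
    """Fragment between two exactly matching primers, or None if not found."""
    template = template.upper()
    i = template.find(forward.upper())
    j = template.find(reverse_complement(reverse).upper())
    if i == -1 or j == -1 or j < i:
        return None
    return template[i:j + len(reverse)]


# Made-up template sequence, for illustration only.
template = "GGGATGCGTACGTTAGCCTAGGCTAACGTTAGCATCGGATCCAGGG"
forward = template[3:13]                        # primer on the top strand
reverse = reverse_complement(template[33:43])   # primer on the bottom strand

product = amplicon(template, forward, reverse)
print(forward, wallace_tm(forward), gc_content(forward))
print(reverse, wallace_tm(reverse), gc_content(reverse))
print(len(product))
```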

Exercise 2. Integrative Genomics Viewer (IGV)


General information: The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. Unlike the UCSC Genome Browser, IGV runs on your own computer, which makes loading data sets much faster.

Task 1. Download and Install IGV

Go to and run the Integrative Genomics Viewer.

Task 2. Search and visualize the same gene (ADAM2) as used for UCSC Genome Browser (Exercise 1)

Change zoom-in levels and browse the nearby genes. Notice how interactive and fast the browser is.

Task 3. Access and visualize the human tissue RNA-Seq data

Is GATA4 expressed at higher levels in heart than skeletal muscle?

Why isn’t this comparison of mapped reads quantitative?

Does the heart subject have any SNPs (single nucleotide polymorphisms) in GATA4?

Where? Zoom in on the variant.

Beyond IGV: Is this variant a known SNP? (need external data sources)

Exercise 3. Ensembl genome browser

General information: Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.

Website with Ensembl tutorials:

Task 1. Download homology data

Go to the mouse Uox (urate oxidase) gene.

Is this gene missing in any primates?

Download Uox homologs (in FASTA format) from as many species as possible.
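Once the homologs are downloaded, a short script can list what the file contains. The following Python sketch reads FASTA-formatted lines and reports the length of each record; the records and header layout shown are made up, and the headers in a real Ensembl download will look different.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration; not real Uox sequences.
sample = """>species_one hypothetical Uox homolog
ATGGCTCATT
ACCGT
>species_two hypothetical Uox homolog
ATGGCACAT
""".splitlines()

records = list(read_fasta(sample))
lengths = {header.split()[0]: len(seq) for header, seq in records}
print(lengths)
```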

More information on the computational inference of homologs in Ensembl:

Task 2. BioMart data download

Use BioMart to get a list of all human genes on Chromosome 1 and the corresponding mouse homologs.
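The table exported from BioMart can be summarized with a few lines of Python. This sketch assumes a tab-separated export with one row per gene-homolog pair; the column names "Gene stable ID" and "Mouse gene stable ID" and all identifiers below are assumptions for illustration, so adjust them to match the header of your own export.

```python
import csv
import io


def summarise_homologs(tsv_text, gene_col, homolog_col):
    """Map each gene to the set of homolog IDs listed for it (possibly empty)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    homologs = {}
    for row in reader:
        gene = row[gene_col]
        homologs.setdefault(gene, set())
        if row[homolog_col]:
            homologs[gene].add(row[homolog_col])
    return homologs


# Made-up export with placeholder identifiers.
sample = (
    "Gene stable ID\tMouse gene stable ID\n"
    "HUMAN_GENE_A\tMOUSE_GENE_1\n"
    "HUMAN_GENE_A\tMOUSE_GENE_2\n"
    "HUMAN_GENE_B\t\n"
    "HUMAN_GENE_C\tMOUSE_GENE_3\n"
)

table = summarise_homologs(sample, "Gene stable ID", "Mouse gene stable ID")
without_homolog = sorted(g for g, h in table.items() if not h)
one_to_many = sorted(g for g, h in table.items() if len(h) > 1)
print(len(table), without_homolog, one_to_many)
```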

Exercise 4. Viewing custom data in IGV

General information: Genome browsers are also great for visualizing your own (or someone else's) data in the context of all other genome information. In this exercise, you will view next-generation sequencing data from a new enhancer mapping method (called STARR-Seq). Find more information about the method here:

The raw data was downloaded from the European Nucleotide Archive (ENA) at EBI and mapped to the human hg19 genome using the STAR aligner.

The exercise is to compare the enhancer mappings generated using STARR-Seq with those from other existing methods, such as DNase hypersensitive site sequencing, FAIRE, or ChIP-Seq for various histone marks or co-factors.

Task 1. Browse the data in UCSC Genome Browser or IGV

The short-read sequencing data aligned to the human genome is available in two files, starrseq.bam and starrseq.bam.bai, at the assignment server.
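Browsers generally need the .bai index to sit next to the .bam file, so it is worth checking both downloads before loading them. The Python sketch below tests whether a file starts with the BAM magic bytes (a BAM file is gzip-compatible and its decompressed content begins with `BAM\1`) and whether a matching `.bai` file exists. The function names are made up, and the demo uses a tiny stand-in file rather than starrseq.bam.

```python
import gzip
import os
import tempfile


def looks_like_bam(path):
    """True if the file is gzip-compatible and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        return False


def has_index(bam_path):
    """True if '<bam_path>.bai' exists next to the BAM file."""
    return os.path.exists(bam_path + ".bai")


# Demo on a stand-in file that contains only the magic bytes.
with tempfile.TemporaryDirectory() as tmp:
    fake_bam = os.path.join(tmp, "demo.bam")
    with gzip.open(fake_bam, "wb") as handle:
        handle.write(b"BAM\x01")
    demo_is_bam = looks_like_bam(fake_bam)
    demo_has_index = has_index(fake_bam)

print(demo_is_bam, demo_has_index)
```

For the lab data, call `looks_like_bam("starrseq.bam")` and `has_index("starrseq.bam")` from the folder where you saved the files.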