Analysis of pre-mRNA secondary structures

and alternative splicing

Michael Hiller

Department of Developmental Biology, StanfordUniversity,StanfordCA94305, USA

Email:

Keywords:

pre-mRNA secondary structure, single-stranded, conserved secondary structure

1. Abstract

While many RNA species like tRNA or miRNA have conserved secondary structures that are essential for their function, the traditional view is that mRNAs are unstructured. However, spliced and unspliced mRNAs can formsecondary structures that influence many processing steps. There is increasing evidence that pre-mRNA secondary structure plays a role in alternative and constitutive splicing. For example, structures can affect binding of proteins or RNAs to splice sites or regulatory motifs. I present methods to analyze secondary structures in the context of alternative splicing using web-based tools. In particular, I focus on findingsingle-stranded regions, checking if a mutation leads to structural changes and finding alternative splice events that overlap conserved structures.

2. Theoretical background

2.1. Alternative splicing and secondary structures

Pre-mRNA secondary structures can influence alternative splicing by several mechanisms.

(i) Splicing factors have domains that bind to single-stranded RNA. Consequently, sequestration of a binding site into a double-strandcan prevent protein binding. This was shown for the mouse fibronectin EDA exon, where a structural change brings a critical splice enhancer into a double-strand, which leads to exon skipping [1]. Bioinformatics analysis found that experimentally verified splicing factor binding sites tend to be single-stranded and that seems to hold for predicted splice enhancers in exons too[2]. Furthermore, some mutations that change alternative splicing patterns might do so by changing secondary structures [2].

(ii) Secondary structures can sequester splice sites, making them inaccessible to the factors that recognize them. This seems to be a frequent and often conserved mechanism [3, 4].

(iii) The splicing of mutually exclusive exons can be controlled by long-range secondary structures. For example, structures conserved over 300 million years of insect evolution control the splicing of a complex cluster of mutually exclusive exons in the DSCAM gene [5].

(iv) Secondary structures can effectively shorten long distances (e.g. between splice sites) by bringing regions spatially close together. This can have an effect on the splicing efficiency and on alternative splicing [3].

(v) Secondary structures formed by sequences flanking alternative exons might ‘loop out’the exon and contribute to its skipping [6].

2.2. Computational prediction of secondary structures

Currently there are no experimental methods to observe pre-mRNA folding in real conditions, i.e. with ongoing transcription, splicing and other processing steps. Therefore, secondary structure analysis often relies on computational predictions. The fundamental principle behind secondary structure prediction tools like mfold [7] or RNAfold[8] is to find the structure that minimizes the free energy (the most stable structure).

In general, one should consider not only the optimal but also suboptimal structures or directly compute the probability of a structure in the thermodynamic ensemble of all structures.This probability of a structuresis given by

where the term exp(-E(s)/RT) represents the statistical weight of the structure s, E(s) is the free energy of s, R is the gas constant, T is the temperature and Q is the Boltzmann partition function (the summed statistical weight of all possible structures). The partition function is also used to compute the probability of individual base-pairs.

To measure the single-strandedness of a particular region,one can compute the probabilitythat all nucleotides (nt) in this region are unpaired. This probability (denoted PU value) is given by

where Eall is the free energy of the ensemble of all structures andEunpaired is the free energy of all structures that are completely single-stranded in this region[9]. The PU value is between 0 and 1 with higher values indicating a higher single-strandedness.

Conservation of a secondary structure is a good indication that the structure has a functional role. Two widely used methods to find conserved structures in multiple sequence alignments are EvoFold [10] and RNAz [11]. Both approaches search for stable structures with compensatory mutations that preserve base-pairs.

3. Protocol

3.1 Input sequences

Since (alternative) splicing happens at the pre-mRNA level, unspliced transcript sequences (which should be identical to the genome sequence) must be analyzed.An important consideration is the length of the sequences to analyze. Since splicing happens during transcription it is unrealistic to analyze entire pre-mRNAs.In general, short-range structures are more likely to occur in the cell than long-range structures[2], therefore the sequence should be limited to the region of interest and the length should not exceed several hundred nt. However, long-range secondary structures do occur [3, 5]. Since the accuracy of prediction algorithms drops for long sequences, one should require further evidence for the existence of a long-range structure such as structural conservation [5].

3.2 Predicting single-stranded regions

Double-strands can inhibit the interaction of splicing factors with regulatory motifs. Thus, single-stranded splicing motifs can have a stronger effect on the splicing pattern [2].With limited resources, secondary structure prediction can help to prioritize further motif analysis by identifying single-stranded regions first and then regulatory motifs within them.

3.2.1 Computing base-pair probability plots

To compute the probabilities for all possible base-pairs, go to the RNAfold web server [8] at and input your sequence into the text box. T characters are automatically converted to U, but there is no warning if the sequence contains characters other than A,C,G,T, or U. Select the radio button “minimum free energy (MFE) and partition function”.

For sequences from species other than mammals, the energy parameters might have to be rescaled to a temperature different from 37°C (“rescale energy parameters to given temperature (C)”in “Show advanced options”). After “proceed”the entire sequence will be folded. To get the base-pair probability plot, click on the “pdf”link in the line “You may look at the dot plot containing the base-pair probabilities”. An explanation of this plotand how to find single-stranded regions is given in Figure 1.A similar plot can be computed with mfold (

3.2.2 Computing PU values

PU values are an alternative way as they quantify single-strandedness of a region directly. PU values can be computed with the server at

It operates in two different modes. In the global mode, the PU value for a region is computed by folding the entire sequence. In the local mode, the PU value is computed by folding a symmetrical local window around the region where the window size is determined by the region size and the length of the up-/downstream context. The region size has to be fixed (3-9 nt, called “length of motif”) and a PU value is computed using a sliding window procedure. To avoid a strong dependency on a fixed context length, the local mode allows to vary the length of the symmetrical window by setting “minimal context length”(11 to 30 nt) to a value less than “maximal context length”. In this case, the PU value for a region is the average of the PU values computed from the different context lengths. illustrates that procedure.

The output is a line plot that shows the PU value for each starting point of a region (Figure 2B). High values indicate regions that are more likely to be single-stranded. Note that in the local mode no PU value can be computed for the first and last “maximal context length”nt.

3.3 Predicting if a mutation leads to structural changes

Mutagenesis experimentsare essential to find thepre-mRNA regions thatregulate alternative splicing. Observed changes in a splicing pattern are usually interpreted as changes in regulatory motifs, however, such mutations might also lead to structural changes, which alter the structure conformation of unaffected motifs[2]. Thus, structural changes should be considered in the experiment design. For example, if several mutations would destroy a putative motif,one should pick the mutation that leads to the smallest structural change.

3.3.1 Comparing base-pair probability plots

Use the protocol of 3.2.1 to compute two base-pair probabilities plots. A comparison of the two plots shows if and in which regions structural changes might occur. The variation of the dot-size in the plot indicateschanges in the base-pair probabilities.

3.3.2 Computing PU values

Changes in single-strandedness can be computed with the webserver at

This comparison uses the local mode of PU value computation averaging over context lengths 11 to 30 nt, therefore input sequences should have at least 30 nt up- and downstream of the mutated region.Input the wild-type and mutant sequence in the two textboxes. The computation can take a few minutes.

The output consists of 5 aligned bar charts. The first two plots show the splice enhancer/silencer scores from [12]for each starting point of a hexamer in the wild-type and mutant sequence. This shows if the mutation creates or destroys regulatory motifs. Hexamers with ascore >0.8 and <−0.8 are considered to be strong enhancers and silencers, respectively. The next two plots show the PU valuesfor the wild-type and mutant sequence. PU value differences are given in the fifth plot where values 0 correspond to hexamers that are more single-stranded in the wild-typesequence and values 0 correspond to hexamers that are more single-stranded in the mutant sequence. These 5 plots show if hexamers that become more single/double-stranded correspond to strong enhancer or silencer motifs or if the single-strandedness of unchanged splicing motifs is drastically changed.

3.4 Finding alternative splice events that overlap conserved secondary structures

The University of California Santa Cruz (UCSC) Table Browser [13]allows this analysis for human.

Go to Set the clade to “Mammal”, genometo “Human”,assemblyto “Mar. 2006”, group to "Genes and Gene Prediction Tracks" and track to "Alt Events”.

Click on the radio button "genome" in the line "region" for a genome-wide analysis or input a specific region (chromosome:start-end) in the “position”textbox.

To filter for specific splice event types, click on Filter: “create”and set name "does" match "type", where type can be cassetteExon, retainedIntron, altThreePrime,altFivePrime, altPromoter, altFinish or bleedingExon. Click "submit".

Click on intersection: “create”, set group to "Genes and Gene Prediction Tracks", set track to "EvoFold", enable the radio button "All Alt Events records that have any overlap with EvoFold", click "submit".

To get links to the regions where alternative splice events overlap EvoFold predictions, set “output format”to "hyperlinks to Genome Browser". The results can also be downloaded for further analysis in BED or GTF format or displayed as a "custom track" in the Genome Browser (the regions will be labeled "table browser query on knownAlt" in the browser window and are shown as black bars on top).

A complete guide to the UCSC table browser is available at annotations are available for the human ENCODE regions (group "Pilot ENCODE Regions and Genes", track "Vienna RNAz"). An EvoFold track is also available for D. melanogaster assembly “Apr. 2006”or “Apr. 2004”.

4 Example of an experiment

The mouse Src gene has an 18 nt neural cell-specific alternative exon[14]. The sequence of the upstream intron flank and the exon is

AACCAGTATCCTCCCTTGTCCTGGGCCCTGTCTTCGCACCTCAGCCTCTCCTTCTCTCTGCTTCTCTCTCGCTGGCCCTTAGGAGGAAGGTGGATGTCAG

where the underlined parts indicate two regulatory motifs bound by the polypyrimidine tract binding protein (PTB) [14]. In Figure 2, this sequence is used to demonstrate how to find single-stranded regions and how to test if mutations lead to structural changes.

5 Troubleshooting

Problem / Solution
Runtime too long / Try to analyze shorter sequences or split them into overlapping parts. The Vienna RNA package (RNAfold) can also be downloaded for Windows and Linux to run it locally.
Sequences too long / Try to split them in overlapping parts and analyze those separately.
Correctness of sequences / Unspliced RNA sequences should be 100% identical to genomic sequences. This can be checked with Blat for species with assembled genomes. Blat converts U to T.

Acknowledgments

I thank Klaus Huse and Rileen Sinha for helpful comments.Funding was provided bythe German Research Foundation (Hi 1423/2-1) and the Human Frontier Science Program (LT000896/2009-L).

Figures

Figure 1: Base-pair probability plot

Larger dots represent higher probabilities. Probable stems can be identified as diagonals of large dots. Two structures are shown at the bottom in dot-bracket notation. Single-stranded regions can be identified as blocks ‘reflected’at the main diagonal that contain no probable base-pairs.

Figure 2: Example analysis.

(A) Base-pair probability plot (protocol 3.2.1). Larger single-stranded regions were highlighted. The two PTB binding sites (red) [14] are found in these regions. The plot also indicates that the exon and 3’splice site are sequestered in a stem.

(B) PU value plots (protocol 3.2.2 or 3.3.2; using hexamers and context lengths 11 to 30 nt) show that the upstream PTB site (CTCTCT, yellow) is rather single-stranded (top). Mutating this site to CTGGGT (middle) likely leads to structural changes and decreases the single-strandedness of this motif, while a mutation to CTAAAT (bottom) does not change the predicted structure.

Abbreviations

PTB, polypyrimidine tract binding protein

PU, probability of being unpaired

[1]E. Buratti, A.F. Muro, M. Giombi, D. Gherbassi, A. Iaconcig, F.E. Baralle, RNA folding affects the recruitment of SR proteins by mouse and human polypurinic enhancer elements in the fibronectin EDA exon, Mol Cell Biol 24 (2004) 1387-1400.

[2]M. Hiller, Z. Zhang, R. Backofen, S. Stamm, Pre-mRNA secondary structures influence exon recognition, PLoS Genet 3 (2007) e204.

[3]V.A. Raker, A.A. Mironov, M.S. Gelfand, D.D. Pervouchine, Modulation of alternative splicing by long-range RNA structures in Drosophila, Nucleic Acids Res 37 (2009) 4533-4544.

[4]P.J. Shepard, K.J. Hertel, Conserved RNA secondary structures promote alternative splicing, RNA 14 (2008) 1463-1469.

[5]B.R. Graveley, Mutually exclusive splicing of the insect Dscam pre-mRNA directed by competing intronic RNA secondary structures, Cell 123 (2005) 65-73.

[6]Y. Lian, H.R. Garner, Evidence for the regulation of alternative splicing via complementary DNA sequence repeats, Bioinformatics 21 (2005) 1358-1364.

[7]M. Zuker, Mfold web server for nucleic acid folding and hybridization prediction, Nucleic Acids Res 31 (2003) 3406-3415.

[8]I.L. Hofacker, Vienna RNA secondary structure server, Nucleic Acids Res 31 (2003) 3429-3431.

[9]M. Hiller, R. Pudimat, A. Busch, R. Backofen, Using RNA secondary structures to guide sequence motif finding towards single-stranded regions, Nucleic Acids Res 34 (2006) e117.

[10]J.S. Pedersen, G. Bejerano, A. Siepel, K. Rosenbloom, K. Lindblad-Toh, E.S. Lander, J. Kent, W. Miller, D. Haussler, Identification and classification of conserved RNA secondary structures in the human genome, PLoS Comput Biol 2 (2006) e33.

[11]S. Washietl, I.L. Hofacker, P.F. Stadler, Fast and reliable prediction of noncoding RNAs, Proc Natl Acad Sci U S A 102 (2005) 2454-2459.

[12]M.B. Stadler, N. Shomron, G.W. Yeo, A. Schneider, X. Xiao, C.B. Burge, Inference of splicing regulatory activities by sequence neighborhood analysis, PLoS Genet 2 (2006) e191.

[13]D. Karolchik, A.S. Hinrichs, T.S. Furey, K.M. Roskin, C.W. Sugnet, D. Haussler, W.J. Kent, The UCSC Table Browser data retrieval tool, Nucleic Acids Res 32 (2004) D493-496.

[14]R.C. Chan, D.L. Black, The polypyrimidine tract binding protein binds upstream of neural cell-specific c-src exon N1 to repress the splicing of the intron downstream, Mol Cell Biol 17 (1997) 4667-4676.

1