Structure of RNA, RNA World, role of splicing and alternative splicing history.

Emanuele Buratti, Maurizio Romano and Francisco E. Baralle*

International Centre of Genetic Engineering and Biotechnology (ICGEB), Trieste, Italy.

*address correspondence to: Prof. Francisco E. Baralle, Padriciano 99, 34149 Trieste, Italy, Phone: +39-040-3757337, Fax: +39-040-3757361, E-mail: .

ABSTRACT

RNA molecules were initially considered to be simple middle-men between the information stored in DNA and the effector functions of proteins. This view, however, did not last very long. In fact, the picture that soon began to emerge was one in which RNA molecules play a crucial role in determining the nature of this information. We know today that RNA molecules are fundamental components of the catalytic processes occurring in macromolecular machineries such as the spliceosome and the ribosome. In addition, they have been shown to affect the regulatory networks in constitutive and alternative splicing processes and the fine tuning of translational processes. Furthermore, new discoveries stemming from recent global analyses of the human transcriptome have highlighted the presence of a vast array of small and large noncoding RNAs whose function and regulation are still largely unknown. Today, more than ever before in the history of RNA research, there is a strong consensus that RNA molecules occupy a central position in many cellular processes, certainly many more than could be foreseen or guessed at even just ten years ago. The purpose of this chapter is to provide an overview of these advances and a brief history of the discovery of the splicing process itself.

Keywords: RNA, RNA structure, RNA World, Alternative Splicing History.

Structure of RNA.

At their most basic molecular level, RNA molecules can be represented by a linear sequence of four classical bases: adenine and guanine (A/G, both purines), cytosine and uracil (C/U, both pyrimidines). In the RNA molecule, each of these bases (schematically represented in Fig.1) is bound to the 1' position of a ribose sugar that, through its 3' position, utilizes a phosphate group to link with the 5' position of the next ribose. One of the most important features that distinguish RNA from DNA is the presence of a hydroxyl group (-OH) in the 2' position of the ribose sugar. The presence of this hydroxyl group makes RNA more vulnerable to degradation. In general, RNA molecules in the eukaryotic cell are considered to be in a single-stranded configuration. However, when RNA molecules form double-stranded helical filaments, the bases can interact through hydrogen bonds: cytosine with guanine, adenine with uracil, and guanine with uracil. It should be noted, however, that these are only the most common combinations and that several "noncanonical" base pairs have been described in RNA molecules [1]. The helical filaments formed by these interactions (which usually occur between nearby regions) represent the first hierarchical level of RNA structure, usually referred to as secondary structure. It should also be noted that the presence of the 2' hydroxyl group favours the formation of A-form helices, as opposed to the B-form most commonly found in DNA. In a second stage, the helices themselves can interact with each other to form what is known as the tertiary structure [2,3]. The rules that exactly define the final outcome of these folding processes and the various factors that influence them are still the subject of active studies [4]. The reason is that an apparently simple process in vitro is often complicated in vivo by factors that are difficult to categorize, for example, the role played by proteins in the stabilization of the structure or in chaperoning processes [5].
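
The pairing rules described above can be illustrated with a short sketch. It is only a toy model: it encodes the three common pairs named in the text (C-G, A-U and the G-U pair), ignores the noncanonical pairs of reference [1], and says nothing about helix stability. The names `PAIRS`, `can_pair` and `forms_helix` are hypothetical and introduced here for illustration only.

```python
# Toy model of the common RNA base pairs named in the text:
# C-G, A-U and G-U. Noncanonical pairs are deliberately left out.
PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def can_pair(base_a: str, base_b: str) -> bool:
    """Return True if the two bases form one of the common pairs."""
    return (base_a.upper(), base_b.upper()) in PAIRS


def forms_helix(segment_a: str, segment_b: str) -> bool:
    """Return True if two equal-length segments, both written 5'->3',
    can pair at every position when aligned antiparallel."""
    if len(segment_a) != len(segment_b):
        return False
    return all(can_pair(a, b) for a, b in zip(segment_a, reversed(segment_b)))
```

For example, under these assumptions `forms_helix("GGAC", "GUUC")` is true because the second position is a G-U pair, whereas `forms_helix("GGAC", "GAAC")` is false.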

In any case, the practical consequence of these numerous possibilities for intra- and inter-molecular interactions is that they confer on RNA molecules an extraordinary flexibility at the structural level [6,7]. The structural characteristics of RNA molecules also have important consequences at the functional level, as they widen and modulate the possible interactions and metabolic processes that the molecule can execute [8].

It is this extraordinary flexibility that over the years has profoundly challenged the classical view that considered RNA as a simple "middle-man" between the information stored in the DNA sequence of an organism and the proteins, the effector molecules. It is now unquestioned, in fact, that cellular RNAs in their various forms (described briefly below) also play an ever increasing role in gene regulation, either through their catalytic properties or through post-transcriptional modifications. As nicely put in a recent review by Mendes Soares and Valcarcel, we have only begun to characterize these influences, and today's ever increasing complexity of the RNA transcriptome can be compared to Borges's infinite "Book of Sand", where new pages appear every time the reader decides to open it [9].

Enzymatic RNAs and RNA World

One of the most interesting features of RNA is its ability to act as an enzyme [10]. In the 1980s, in fact, Thomas R. Cech and Sidney Altman observed for the first time that an intron in Tetrahymena thermophila was capable of catalyzing its own removal and that the RNA component of bacterial RNase P was itself catalytic. Since then, many such RNA molecules (called "ribozymes") that have intrinsic enzyme-like activity in the complete absence of protein cofactors have been described, and they were recently reviewed by Toor et al. [11]. More recently, the catalytic activity of RNA has also been proposed to be responsible for the correct functioning of the spliceosome [12] and for peptide bond formation in the ribosome [13].

It is the discovery of the enzymatic activity of RNAs that led Walter Gilbert to the concept of a primitive "RNA World" which could have existed on planet Earth even before the appearance of proteins and DNA [14]. A schematic depiction of the RNA World hypothesis is reported in Fig.2. It is, of course, impossible to prove that anything resembling the RNA World actually existed in the world's history, but the fact that such a hypothesis can even be made in the first place clearly demonstrates the flexibility of RNA to play several physiological roles.

Common types of human RNAs

In the last decade a lot of progress has been made in the general field of RNA biology and biochemistry, as recently reviewed [15]. Several topics, such as the discovery of RNA interference and microRNAs, have literally transformed the research field with regard to both our theoretical knowledge and our practical possibilities. Before going on to alternative splicing (the main focus of this book), it is worth providing an overview of the different RNAs that can be found in the cell nucleus. It is now known, in fact, that the human genome encodes a whole range of structural and mediator RNAs and especially some new families of small RNAs (schematically reported in Fig.3) whose function is the subject of very active research [16,17].

The most abundant species are the ribosomal RNAs (rRNA) [13]. They are the core of the ribosomes, the ribonucleoprotein particles in charge of translating the information encoded in mRNAs into proteins in all cells. The ribosomes in eukaryotes are formed by two subunits: the 60S and the 40S, named according to their sedimentation coefficient. These subunits contain the 28S, 5.8S and 5S rRNAs and the 18S rRNA, respectively, tightly associated with proteins. The transfer RNAs (tRNAs) are the other key players in translation. Each tRNA is associated with an amino acid and recognizes the messenger RNA (mRNA) through a three-nucleotide sequence, the anticodon, that pairs with the complementary codon on the mRNA.

There is an additional class of relatively abundant small RNAs, the small nucleolar RNAs (snoRNAs) [18]. The snoRNAs, as indicated by their name, localize to the nucleolus and are mainly involved in rRNA maturation, although they also play important roles in protein translation, mRNA splicing, and genome stability. There are two classes of snoRNAs (C/D and H/ACA box) that function as ribonucleoprotein (RNP) complexes to guide the enzymatic modification of target RNAs. Generally, C/D box snoRNAs guide the methylation of target RNAs while H/ACA box snoRNAs guide pseudouridylation [19]. It has also been recently discovered that snoRNAs can be additionally processed to yield smaller molecules, called sno-derived RNAs (sdRNAs), that are associated with Ago7 and may thus be involved in gene silencing and transcriptional repression processes [20].

Another important and well characterized type of RNA is the small nuclear RNA (snRNA). Based on sequence homology and common protein factors, the snRNAs are divided into two classes, the Sm and Lsm classes. The Sm class is made up of U1, U2, U4, U4atac, U5, U7, U11 and U12, whereas the Lsm class is made up of U6 and U6atac. After assembly with snRNP proteins, the resulting snRNP particles form the core of the spliceosome (major or minor) and catalyse the removal of introns from pre-mRNA. The only exception is represented by the U7 snRNP, whose function is in histone pre-mRNA 3' end processing.

An important class of small RNAs discovered in the early 1990s is represented by microRNAs (miRNAs) [21,22]. These are RNAs of 21 to 23 nucleotides that regulate gene expression through imperfect base pairing with the 3’ untranslated region of messenger RNAs [23,24]. During the last decades hundreds of miRNAs have been cloned and mapped to the human genome. Functionally connected to miRNAs are the short interfering RNAs (siRNAs), which are also RNA molecules of 21 to 23 nucleotides and, like the miRNAs, associate with members of the Argonaute family of proteins. Both miRNAs and siRNAs fall under the umbrella of what is called "RNA interference", or "RNAi". However, as opposed to miRNAs, siRNAs are not encoded in the genome; their synthesis is instead triggered by the presence of long double-stranded RNAs, usually associated with viral infections. Moreover, they recognize perfectly complementary RNAs and induce their cleavage and subsequent degradation.
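
The contrast drawn above between the imperfect pairing of miRNAs and the perfect complementarity required by siRNAs can be made concrete by counting mismatches between a small RNA and a candidate target site. The sketch below is a deliberate simplification: it considers only A-U and G-C pairs and treats every position equally, which is an assumption of the example rather than a description of how targets are recognized in the cell. The function name `count_mismatches` and the sequences in the usage note are hypothetical.

```python
# Watson-Crick partners only; this is an assumption of the sketch.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def count_mismatches(small_rna: str, target_site: str) -> int:
    """Count non-complementary positions between a small RNA and a
    target site of the same length. Both sequences are written 5'->3',
    so the target is read in reverse to align the strands antiparallel."""
    if len(small_rna) != len(target_site):
        raise ValueError("sequences must have the same length")
    return sum(
        COMPLEMENT[a] != b
        for a, b in zip(small_rna.upper(), reversed(target_site.upper()))
    )
```

With the illustrative small RNA `UGAGGUAG`, the site `CUACCUCA` is perfectly complementary (zero mismatches, the situation described for siRNAs), while `CUACGUCA` carries one mismatch (the kind of imperfect pairing described for miRNAs).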

The last big group of small RNAs, discovered less than five years ago, is that of the Piwi-interacting RNAs (piRNAs), recently reviewed by Choudhuri [25]. They are approximately 30 nucleotides long and associate with the Piwi proteins, which constitute a cluster within the Argonaute super-family of proteins. The expression of both piRNAs and Piwi proteins is restricted to the male germ line. So far thousands of these small RNAs have been cloned. Many of them map to unique sites in the genome. However, a large proportion maps to many positions in the genome that correspond to repetitive elements. Although the mechanisms of action of the piRNAs are not clear, they are thought to prevent retro-transposition of the millions of transposable elements present in the genome [26].

In any case, the tale of small RNAs is still unfolding. For example, the most recent discovery comes from the analysis of the human transcriptome where it has been observed that many promoters, particularly those of highly expressed transcripts, produce relatively short RNAs that are transcribed just downstream from the transcription start site (TSS) [27]. The function of these transcripts is at the moment completely unknown but it has been speculated that they might somehow help to regulate transcriptional activity.

Finally, it is also worth mentioning that alongside these classes of small noncoding RNAs, mammalian genomes are also known to produce many long non-coding RNAs with a whole range of functions that include the control of mRNA abundance, initiation of translation, transposon jumping, chromosome architecture, stem cell maintenance, organ development, and carcinogenesis [28,29].

Mammalian Messenger RNAs (mRNAs)

Mature mammalian messenger RNAs (mRNAs) can be divided into three regions: the protein coding sequence, the 5’ untranslated region (5’UTR) and the 3’ untranslated region (3’UTR). The coding sequence goes from the protein initiation codon (AUG) to the stop codon (UAA, UAG or UGA). Mature mRNAs are obtained from precursors that are normally referred to as pre-mRNAs. Pre-mRNAs are the first product of gene transcription and usually contain the sequences that will form the mature mRNA (exons) separated by other sequences (referred to as introns). The splicing reaction involves the recognition of exon boundaries by the spliceosomal machinery and the excision of the introns. Splicing can be constitutive (the exon in question always forms part of the mRNA) or alternative (the specific exon can be excised from a proportion of the mRNAs). The advantage of alternative splicing resides in the possibility of obtaining multiple mature transcripts starting from a single pre-mRNA (some of these possibilities are schematically represented in Fig.4).
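
The organisation of the mature mRNA and the effect of skipping an exon can be sketched in a few lines. The example rests on stated simplifications: translation is assumed to start at the first AUG in the sequence, the stop codon is counted as part of the coding sequence, and exons are simply concatenated strings. The names `split_mrna` and `splice`, and the sequences used, are hypothetical and serve only as an illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna: str):
    """Split a mature mRNA into (5'UTR, coding sequence, 3'UTR).

    Simplifying assumption: the first AUG is the initiation codon.
    The coding sequence returned here includes the stop codon.
    Returns None if no AUG or no in-frame stop codon is found."""
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Walk codon by codon, in frame with the AUG, until a stop codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


def splice(exons, skip=()):
    """Join exons in order, leaving out those whose index is in `skip`
    (a toy stand-in for alternative exon skipping)."""
    return "".join(exon for idx, exon in enumerate(exons) if idx not in skip)
```

Joining the three toy exons `["GGAUG", "GCC", "UAAUU"]` gives one mRNA, while skipping the middle exon gives a shorter one; passing each result to `split_mrna` shows how the same precursor can yield different coding sequences.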

Several chapters in this book are devoted to the way the spliceosome is assembled and how it operates (both normally and in disease). For this reason the rest of this chapter will be devoted to providing a brief history of alternative splicing, focusing especially on the early years and the scientific breakthroughs that have laid the foundations of today's research.

A brief history of splicing (1977-1994).

The history of splicing begins in the summer of 1977, when different researchers presented the results of their studies at the Cold Spring Harbor symposium. Philip Sharp and his colleagues S. Berget, A.J. Berk and T. Harrison carried out an electron microscopy study in which purified mRNA coding for the major adenovirus coat protein (hexon) was annealed to a portion of the adenoviral DNA genome encompassing the whole hexon gene [30]. Interestingly, the two chains hybridized perfectly over short segments but not over their entire length. It was apparent that the RNA leader sequence was not complementary to DNA sequences located upstream from the hexon gene. These results indicated that the mature hexon RNA was derived by joining four separate segments of viral RNA. In parallel, Richard Roberts with colleagues L. Chow, R. Gelinas, and T. Broker showed using electron microscopy that the 5' terminal sequences of several adenovirus mRNAs, isolated late in infection, were complementary to genomic sequences distant from the DNA from which the main coding sequence of each mRNA is transcribed [31]. As a result of this discovery, both group leaders were awarded the Nobel Prize in Physiology or Medicine in 1993.

The most appealing theory proposed to explain all these observations was that a recombination event determined an intramolecular ligation at the RNA level, based on the annealing between leader sequences and coding segments with the removal of intervening sequences. Importantly, between 1977 and 1978 several lines of evidence confirmed the presence of introns in animal genes: R.A. Flavell and A.J. Jeffreys found them in the rabbit beta globin coding sequence [32], and Pierre Chambon and Bert O'Malley independently demonstrated the presence of intervening sequences in the chicken ovalbumin gene [33,34]. Finally, the group led by S. Tonegawa (Nobel Prize in 1987) demonstrated the presence of similar genomic inserts also in a mouse immunoglobulin gene [35]. At that point, it was clear that in higher eukaryotes “silent” intervening sequences occurred within protein coding sequences as a general rule, rather than representing the peculiarity of just a few genes. At the beginning of 1978, Walter Gilbert, who had developed a method for determining the nucleotide sequence of portions of genes, suggested the names "exons" and "introns" to describe coding and intervening sequences [36].

The next step beyond these discoveries was to map the sequences that might allow exonic and intronic segments to be distinguished. This was achieved by two research groups, both working on the chicken ovalbumin gene, who highlighted the recurrence of four or five bases at intron/exon borders [33,37]. These prototype sequences were identified as the 5’ and 3’ splice sites and were summarized as the “GT-AG” rule. Later, the examination of non-consensus splice sites in a variety of genes [38] would lead to the discovery of the "AT-AC" minor spliceosome [39,40]. During these same years, a lot of research was focused on elucidating the "mechanism" of splicing, and the progress made in these early years has been very well described in a recent reflection article by Christine Guthrie [41].
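
As a minimal illustration of the “GT-AG” rule, the sketch below labels an intron by its first and last two bases in the DNA sequence. It is not a splice site predictor: real splice sites involve longer consensus sequences, and the terminal dinucleotides alone do not establish which spliceosome removes a given intron. The function name `classify_intron` and the test sequences are hypothetical.

```python
def classify_intron(intron_dna: str) -> str:
    """Label an intron (DNA sense strand, 5'->3') by its terminal
    dinucleotides: 'GT-AG', 'AT-AC', or 'other'."""
    seq = intron_dna.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have both boundaries")
    ends = (seq[:2], seq[-2:])
    if ends == ("GT", "AG"):
        return "GT-AG"
    if ends == ("AT", "AC"):
        return "AT-AC"
    return "other"
```

Under these assumptions, `classify_intron("GTAAGTCCTTTCAG")` returns `"GT-AG"` and `classify_intron("ATATCCTTTTAC")` returns `"AT-AC"`.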

These first insights into the mechanisms underlying constitutive splicing also suggested the possibility that a single heterogeneous nuclear RNA might be the substrate for generating different mature mRNAs. In 1982, there was the first description of developmental control of alternative splicing, when the calcitonin gene was found to produce by alternative splicing the hormone calcitonin and a calcitonin gene-related peptide (CGRP) [42]. During the same year, the first report of a disease connected to alternative splicing described a thalassemic patient carrying a pentanucleotide deletion in intron 1 of the β-globin gene [43]. In the following years, alternative splicing was shown to be a common mechanism used by different cell types or developmental stages to produce structurally and functionally different protein isoforms, such as in the human fibronectin gene [44] and the cardiac troponin T gene [45].

Later, in 1988, it was shown through electron microscopy that pre-mRNA splicing can occur on the nascent transcripts of early Drosophila embryo genes [46]. Following this discovery, "intron-definition" and "exon-definition" models were developed, particularly in the Berget lab, to explain the way in which the spliceosome defines these entities [47].