Title: Array analysis

Pierre de la Grange

GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, Paris, France

*Address correspondence to: Pierre de la Grange, GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, 1 avenue Claude Vellefaux, 75010 Paris, France; tel: +33 (0) 157276839; fax: +33 (0) 157276831; E-mail:

  1. Abstract

Alternative splicing is the main mechanism that increases transcriptome diversity by generating multiple RNA isoforms from a single gene. This mechanism affects more than 90% of human genes and is altered in many diseases. Recently, expression microarrays have been developed that can detect changes in exon-level expression and splice site selection. Currently, the biggest challenges for expression microarrays dedicated to the study of alternative splicing are the bioinformatic analysis of array data, their RT-PCR validation and their subsequent biological interpretation. Despite these problems, microarrays have revealed an unexpected number of alternative splicing events, whatever the experimental conditions compared to find these events (pathologies, tissues, hormonal treatment, siRNA against splicing or transcription factors, etc.). It is important to underline that arrays dedicated to splicing analysis also provide robust analysis of global gene expression (i.e., transcriptional effects) and thus could, in the medium term, replace current standard technologies for large-scale gene expression analysis such as the Affymetrix U133 arrays.

  2. Theoretical background

2.1. Microarray: general principles

The microarray principle relies on base pairing (hybridization between nucleic acids), as Northern blotting does. The major difference from Northern blotting is that the expression of many genes can be detected simultaneously (from thousands of genes to the whole genome). The majority of arrays used to study alternative splicing use oligonucleotides attached to glass slides. Each oligonucleotide is designed to target a specific genomic region, and one spot gathers several thousand copies of the same oligonucleotide. The signal of a spot depends on the number of its oligonucleotides that are hybridized, which should be proportional to the amount of the corresponding RNA region in the sample.

Microarray slides can be produced by ink-jet printing (Agilent, ExonHit, Jivan) or by photolithography (Affymetrix). The major advantage of ink-jet printing is that arrays can be easily customized, since no photolithographic masks need to be generated. The drawback is the smaller number of spots per array (currently around 250,000). The major advantages of arrays generated by photolithography are their high number of spots (currently around 6,500,000) and high reproducibility between arrays.

2.2. Probe design of splicing microarrays: interest and limitation

Since a probe sequence can be designed to target a specific genomic region, all the different gene regions can be targeted, including exons, parts of exons, introns, and even exon-exon junctions. The information on exon/intron gene structure and on the different alternative events (i.e., the annotations) used for a microarray design (i.e., the selection of probes for the array) is often based on known annotations from publicly available databases (from EST and cDNA alignments against genomic sequences) and can also include information from predictions (e.g., in silico gene and exon predictions, cross-species conservation). Depending on the array used (see below), two major kinds of probes can be included on the chip: exon probes and exon-exon junction probes.

Exon probes detect expression of a specific exon or a specific exon part (e.g., the part corresponding to an alternative 3’ splice site, or an intronic region corresponding to an intron retention event). This kind of probe is also useful for studying global gene expression regulation (i.e., the transcriptional effect). Another advantage of this kind of probe is that events not known to be alternative can be detected (e.g., a “new” cassette exon). Annotations based on in silico predictions and cross-species conservation can also lead to the discovery of “new” genes and exons.

Exon-exon junction probes hybridize half to the end of one exon and half to the beginning of the next exon. The set of exon-exon junction probes depends on the alternative splicing events occurring between these two exons (e.g., if exon 3 is known to be a cassette exon, exon-exon junction probes will be designed between exons 2 and 3, between exons 3 and 4, and between exons 2 and 4). This kind of design is particularly efficient for studying the regulation of alternative events but should not be used for studying global gene expression regulation. Another limitation of this design is that, because of limited array space, exon-exon junction probes are designed to target known alternative events only (however, custom microarrays designed to study the expression regulation of only a few genes can include all possible exon-exon junctions).
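As an illustration of the cassette exon example above, the following minimal Python sketch builds junction probe sequences by joining the end of one exon to the beginning of another. The function names, the even probe length (so that each half has the same size) and the toy exon sequences are assumptions made for illustration only; they do not describe any vendor's actual probe design procedure.

```python
def junction_probe(upstream_exon, downstream_exon, probe_length=24):
    """Join the last half-probe of one exon to the first half-probe of another."""
    if probe_length % 2 != 0:
        raise ValueError("this sketch assumes an even probe length")
    half = probe_length // 2
    if len(upstream_exon) < half or len(downstream_exon) < half:
        raise ValueError("exon shorter than half a probe")
    return upstream_exon[-half:] + downstream_exon[:half]


def cassette_junction_probes(exons, cassette, probe_length=24):
    """Return the three junction probes around a cassette exon.

    `exons` maps exon numbers to sequences (hypothetical input format).
    For cassette exon 3, the junctions are 2-3, 3-4 and 2-4 (exon skipping).
    """
    up, down = cassette - 1, cassette + 1
    pairs = [(up, cassette), (cassette, down), (up, down)]
    return {
        f"{a}-{b}": junction_probe(exons[a], exons[b], probe_length)
        for a, b in pairs
    }


# Toy sequences, far shorter than real exons, with a short probe for readability.
exons = {
    2: "ACGTACGTACGTAAAA",
    3: "CCCCGGGGTTTTACGT",
    4: "GGGGACGTACGTACGT",
}
probes = cassette_junction_probes(exons, cassette=3, probe_length=8)
for name, seq in probes.items():
    print(name, seq)
```

The 2-4 probe only hybridizes fully to transcripts that skip exon 3, whereas the 2-3 and 3-4 probes report its inclusion.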

2.3. Available splicing microarrays

Two kinds of microarrays can be used for splicing studies: commercial microarrays and custom microarrays.

Three companies provide commercial microarrays for splicing studies. Affymetrix provides the GeneChip® “Exon Array” system, which is based on exon probes only but is able to detect the expression of more than one million exons for Human, Mouse and Rat (the 200,000 known exons plus around 800,000 putative exons). Currently, it is the most widely used microarray for splicing studies.

ExonHit Therapeutics provides the “SpliceArray” (for Human, Mouse and Rat), which is based on both exon and exon-exon junction probes. As described above, annotations are based on known alternative events only.

JIVAN provides the “SpliceExpress” that is based on both exon and exon-exon junction probes. Annotations are also based on known alternative events.

Both Affymetrix and Agilent provide custom microarrays on demand. Agilent technology allows flexible custom data development (synthesis of oligonucleotides to be printed on the chip is) and provides free software named “eArray” that greatly facilitates the design of probes.

Affymetrix is developing a new generation of microarrays for splicing studies (“Junction Array”). This array will include both exon and exon-exon junction probes. No information is available regarding when this microarray will be commercialized. Other companies are certainly developing, or will develop, new microarrays dedicated to splicing.

2.4. The different steps of the microarray data treatment

Starting from data acquisition by chip scanning and signal quantification, several steps are necessary to obtain relevant results that can be further exploited by biologists (see figure 1). The first one is the normalization of data. This step is necessary to compare intensities across all the chips of a given experiment, so that differences found in the analysis step come from biological effects and not from other factors (e.g., date of the experiment, technician who performed the experiment, technical variation).

Another step in the pre-treatment of data (before the statistical analysis) is background subtraction. Each spot of the chip yields a signal value. This signal can be separated into two parts: the first corresponds to the specific signal due to expression of the targeted genomic region, and the second corresponds to a non-specific background signal. Thus, the goal of this step is to estimate the general background intensity and then to subtract this background from every probe intensity.

The objective of the statistical analysis step is to find relevant differences in gene expression between the tested experimental conditions at the gene level (transcriptional effect) and at the exon level (splicing effect).

Once the lists of regulated genes and exons have been obtained, a visual inspection of these results is necessary. It makes it possible to check the quality of the results (reproducibility, fold-change…) and also to start their biological interpretation (e.g., what kind of alternative event is regulated?).

Subsequent functional analysis can be performed to predict the functional consequences of the predicted regulations (e.g., in which pathways are the predicted regulated genes involved?).

  3. Protocol

A microarray project aiming to study gene expression regulation between two experimental conditions should include at least six arrays: for statistical reasons, each condition must be tested in triplicate in order to detect biological effects and to distinguish them from technical variation. With this in mind, a project with 6 Exon Arrays will produce around 40 million data points to analyze (6 arrays with around 6,500,000 spots each).

3.1. Normalization

Most commonly, normalization is based on all genes on the array. The underlying assumption is that the expression level of the majority of genes does not change between two conditions. Microarray intensities should always be examined on a log2 scale. This scaling should roughly make the variance the same across all intensities. Differences of log2 intensities give the log2 ratios (M values) for a comparison. Then, a robust estimate of a “rescaling” factor (e.g., the median of the differences) has to be computed. There are many normalization methods. Which method is the most stable and gives the best results depends on the type of data, the image analysis program, etc. To determine the best method, it is a good idea to try several methods initially on a few datasets and to inspect the results visually using controls:

- Scale normalization: the simplest way to normalize data is to adjust the scale of the data, e.g., to set the median of the differences to 0. This does not account for any region- or intensity-dependent effects;

- Quantile: similar in idea to scale normalization but more drastic, as all of the quantiles are adjusted and not only the 50% quantile (median). This type of normalization is the most commonly used for Affymetrix arrays;

- Other methods are also applicable but are not described here (Lowess, VSN…).
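The two methods described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any particular analysis package: the function names and the toy matrix are hypothetical, intensities are assumed to be already on a log2 scale, and tied values are not given any special treatment in the quantile step.

```python
import numpy as np


def scale_normalize_m_values(log2_treated, log2_control):
    """Scale normalization: shift the M values so that their median is 0."""
    m = np.asarray(log2_treated, dtype=float) - np.asarray(log2_control, dtype=float)
    return m - np.median(m)


def quantile_normalize(log2_intensities):
    """Quantile normalization of a (probes x arrays) matrix.

    Every array (column) is forced to share the same distribution:
    the mean of the sorted columns. Ties are not averaged in this sketch.
    """
    x = np.asarray(log2_intensities, dtype=float)
    order = np.argsort(x, axis=0)                     # rank order within each array
    sorted_x = np.take_along_axis(x, order, axis=0)   # each column sorted
    reference = sorted_x.mean(axis=1)                 # shared reference distribution
    result = np.empty_like(x)
    # Put the k-th reference value where the k-th smallest value of each column was.
    np.put_along_axis(result, order, reference[:, None], axis=0)
    return result


# Toy data: 4 probes measured on 3 arrays (log2 intensities).
x = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
])
q = quantile_normalize(x)
m = scale_normalize_m_values(x[:, 0], x[:, 1])
print(q)
print(m)
```

After quantile normalization, all columns contain the same set of values, only in a different order; after scale normalization, only the median of the M values is constrained.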

3.2. Background subtraction

Each microarray carries many control probes that are used to estimate the background intensity. For example, in the case of the Affymetrix Exon Arrays, the background estimate is based on the GC content of probes. Affymetrix probe length is always 25 nt. For each GC content (from 1 to 25), there are around 1,000 control probes with that GC content that do not target any transcribed genomic region. The corrected signal intensity of a given probe is obtained by calculating the median intensity of the control probes with the same GC content, and then subtracting this value from the raw signal intensity of the probe.
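The GC-based correction described above can be sketched as follows. This is a simplified illustration, assuming that probe sequences and raw intensities are available as plain Python lists; the function names and the very short toy sequences are hypothetical, and real Exon Array probes are 25 nt long.

```python
import numpy as np


def gc_count(sequence):
    """Number of G and C bases in a probe sequence."""
    s = sequence.upper()
    return s.count("G") + s.count("C")


def subtract_gc_background(probe_seqs, probe_raw, control_seqs, control_raw):
    """Subtract from each probe the median intensity of the control probes
    that have the same GC count. Probes whose GC count has no matching
    control probe are returned as NaN in this sketch."""
    control_raw = np.asarray(control_raw, dtype=float)
    control_gc = np.array([gc_count(s) for s in control_seqs])
    background = {
        int(gc): float(np.median(control_raw[control_gc == gc]))
        for gc in np.unique(control_gc)
    }
    corrected = []
    for seq, raw in zip(probe_seqs, probe_raw):
        bg = background.get(gc_count(seq))
        corrected.append(np.nan if bg is None else raw - bg)
    return np.array(corrected)


# Toy example with 4-nt sequences.
control_seqs = ["GCAT", "CGTA", "ATGC", "ATAG", "CATA"]
control_raw = [10.0, 14.0, 12.0, 5.0, 7.0]
probe_seqs = ["GGAT", "TTAC", "GCGC"]
probe_raw = [100.0, 50.0, 30.0]

corrected = subtract_gc_background(probe_seqs, probe_raw, control_seqs, control_raw)
print(corrected)
```

Using the median rather than the mean keeps the background estimate robust to a few control probes with unexpectedly high signal.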

3.3. Statistical analysis

ExonHit Therapeutics and Jivan provide their own analysis systems for their chips, but Affymetrix does not provide any software for its Exon Arrays. Currently, several algorithms and software packages are available to analyze data from these arrays. The corresponding algorithms are based on several methods:

- The Splicing Index (the logarithm of the ratio of the exon signal to the total signal from the gene: log2[exon/total]) [1];

- PAC (Pattern-Based Correlation);

- MIDAS (Microarray Detection of Alternative Splicing);

- ASNOVA [2].

For more information about these methods, see the corresponding “white paper” on the Affymetrix website [3].
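To make the Splicing Index more concrete, the sketch below computes log2[exon/total] for each sample and then compares two conditions by taking the difference of the condition means. The comparison step is an assumption made here for illustration (the text above only defines the log ratio itself), and the function names and toy intensities are hypothetical; the exact computation used in [1] or in commercial software may differ.

```python
import numpy as np


def normalized_exon_level(exon_signal, gene_signal):
    """log2 of the exon signal relative to the total signal of its gene."""
    exon = np.asarray(exon_signal, dtype=float)
    gene = np.asarray(gene_signal, dtype=float)
    return np.log2(exon / gene)


def splicing_index_change(exon_treated, gene_treated, exon_control, gene_control):
    """Difference between conditions of the mean log2[exon/total].

    Positive values suggest higher relative inclusion of the exon
    in the treated condition; negative values suggest the opposite.
    """
    treated = normalized_exon_level(exon_treated, gene_treated)
    control = normalized_exon_level(exon_control, gene_control)
    return float(treated.mean() - control.mean())


# Toy triplicates: gene-level signal is unchanged, exon signal doubles.
change = splicing_index_change(
    exon_treated=[200.0, 200.0, 200.0],
    gene_treated=[400.0, 400.0, 400.0],
    exon_control=[100.0, 100.0, 100.0],
    gene_control=[400.0, 400.0, 400.0],
)
print(change)
```

Dividing by the gene-level signal is what separates a splicing effect from a purely transcriptional one: if the exon and the whole gene change by the same factor, the value returned is 0.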

Based on these methods, several commercial software packages and services are available. Three of them are the most widely used:

- EASANA (GenoSplice technology):

- Genomics Suite (Partek):

- XRay (Biotique Systems):

Genomics Suite and XRay are client software and must be installed on the user’s computer. It is important to underline that specific knowledge and/or some learning time may be necessary before these programs can be used easily. EASANA is provided as a service: the user sends the CEL files (obtained after scanning the chips) to the GenoSplice server, and the company sends back the result files and provides assistance with biological interpretation and/or other personalized services. No splicing or bioinformatics knowledge is necessary. EASANA is based on an algorithm developed within a EURASNET team.

3.4. Visualization of data

Generally, analysis software provides a visualization system for checking results. Most of them show the splicing index curve, as does, for example, Genomics Suite from Partek (figure 1A). The BLIS interface from Biotique Systems (the provider of XRay) displays the signal intensity of probesets in the different experiments (figure 1B). The EASANA visualization module from GenoSplice technology (developed in collaboration with EURASNET teams) displays both the mean intensity across all pairs of experiments (i.e., treatment vs. control) and the intensity in each pair of experiments, to show the reproducibility between experiments. A simple color system makes it easy to see intensity variation at the gene and exon levels (figure 1C). Notably, signal intensities are displayed at the probe level (i.e., not at the probeset level).

3.5. Functional analysis of results

Functional analysis can be performed with powerful free software and websites. Two examples are DAVID (david.abcc.ncifcrf.gov) [4] and PANTHER [5]. For these two tools, the user has to select a list of reference genes (e.g., the human Refseq genes) and to input a list of genes of interest in order to analyze their functions and the pathways in which they are involved. For an Affymetrix Exon Array project, this list can be the list of regulated genes (transcriptional effect) or the list of genes for which regulation at the exon level was predicted (splicing effect).
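The role of the reference list can be illustrated with a simple over-representation test. The sketch below uses a one-sided hypergeometric test, which is a common way to ask whether a pathway contains more genes of interest than expected by chance; it is not claimed to be the statistic implemented by DAVID or PANTHER, and the function name and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_reference, n_reference_in_pathway,
                       n_selected, n_selected_in_pathway):
    """One-sided hypergeometric p-value.

    Probability of finding at least `n_selected_in_pathway` pathway genes
    when drawing `n_selected` genes at random from the reference list.
    """
    max_hits = min(n_reference_in_pathway, n_selected)
    total = comb(n_reference, n_selected)
    tail = sum(
        comb(n_reference_in_pathway, k)
        * comb(n_reference - n_reference_in_pathway, n_selected - k)
        for k in range(n_selected_in_pathway, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 reference genes, 200 of them in a pathway,
# 700 regulated genes, 20 of which belong to that pathway.
p = enrichment_p_value(20000, 200, 700, 20)
print(p)
```

The result depends directly on the reference list: the same list of regulated genes tested against a smaller or tissue-specific reference can give a very different p-value.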

  4. Example of an experiment

In order to identify exons regulated by the splicing factors PTB and nPTB, an Exon Array project led by Christopher W.J. Smith (Cambridge University, UK) was conducted. HeLa cells treated with siRNAs targeting PTB/nPTB were compared to those treated with control siRNA (to be published).

These microarray data were analyzed with different analysis systems, including EASANA from GenoSplice technology. This system provides lists of regulated genes and regulated exons, each with two levels of confidence (“high” and “low”). The “high” level only considers “high-quality probes” from well-annotated genes, by filtering probes according to their specificity in addition to their expression (GC content, overlap with repeat regions, cross-hybridization). The “low” level includes all probes corresponding to well-annotated genes. EASANA predicted 721 genes regulated by siPTB/nPTB using the “high” confidence level (280 up; 441 down) and 1,543 genes regulated by siPTB/nPTB using the “low” confidence level (450 up; 1,093 down). In terms of exon regulation, 218 exons were predicted to be regulated by siPTB/nPTB using the “high” confidence level and 2,273 using the “low” confidence level. In both lists, exon 15 of the KIAA0652 gene was predicted to be specifically included with siPTB/nPTB. This exon was also predicted by the other analysis systems that were run on this experiment. The visualization systems from Genomics Suite (Partek), BLIS (Biotique Systems) and EASANA (GenoSplice technology) are presented in figure 1 for this event.

  5. Troubleshooting

The major problem with array experiments is the poor agreement of their results with other methods, notably RT-PCR. The validation rate can be as low as 35% [6]. In the majority of cases it is around 50-80%. Since these numbers only address false positives, the real error rates, which include false negatives, will be much higher. One reason for the poor reproducibility could be the large number of unknown RNAs that often overlap with known transcripts [7].

The next problem concerns data analysis. The various software packages and algorithms do not give the same results for the same project. Even if one or two software packages or services are better than the others, several systems should ideally be used to gather as many results as possible. The choice between software and services can be made according to the knowledge and capacities available within the team. In addition, it is also possible to develop an in-house analysis system. However, this can take a very long time, and the results may not be as relevant as those provided by existing solutions.

Another important constraint is that array experiments give no connectivity information between distant exons, even with junction probes. For example, if two alternative events are predicted to be regulated in the same gene, it is not possible to know whether regulation of event #1 is associated with event #2 or not.

Figure legends

Figure 1: Visualization systems for Affymetrix Exon Array data: example of the siPTB/nPTB effect on exon 15 of the KIAA0652 gene

Screenshots of visualization using Genomics Suite (Partek), BLIS (Biotique Systems) and EASANA (GenoSplice technology) are provided. For each, a green rectangle indicates the position of exon 15 of KIAA0652, which is regulated by siPTB/nPTB.

  A. Genomics Suite (Partek). 1) Structure of the gene according to Refseq; 2) Splicing Index of each probeset, on the same scale as the gene scheme; blue = siPTB/nPTB and red = siCTRL.
  B. BLIS (Biotique Systems). 1) Tracks corresponding to gene annotation from EnsEMBL and Refseq; 2) Intensity of probesets in the six samples: the first three lines correspond to siCTRL and the last three lines to siPTB/nPTB.
  C. EASANA (GenoSplice technology). 1) Options available to filter the probes to be displayed according to their expression level and specificity (GC content, overlap with repeat regions, cross-hybridization); 2) Exon/intron gene structure with alternative events in red (exon 15 is known to be a cassette exon); 3) Regulation at the probe level: each bar corresponds to one probe, the color of the bar corresponds to probe regulation (red = up-regulation and green = down-regulation), and the exon position is indicated by the grey track.

References