How noisy and replicable are DNA microarray data?
Suman Sundaresh1,3,*, She-pin Hung2,3,*, G.Wesley Hatfield2,3, and Pierre Baldi1,3
1. School of Information and Computer Science, University of California, Irvine CA 92697
2. Department of Microbiology and Molecular Genetics, College of Medicine, University of California, Irvine CA 92697
3. Institute for Genomics and Bioinformatics, University of California, Irvine CA 92697
E-mail: , , , (corresponding author)
* These authors contributed equally to this work.
Abstract: This paper analyzes variability in highly replicated measurements of DNA microarray data conducted on nylon filters and Affymetrix GeneChipsTM with different cDNA targets, filters and imaging technology. Replicability is assessed quantitatively using correlation analysis as a global measure and differential expression analysis and ANOVA at the level of individual genes.
Keywords: DNA microarrays, sources of variation, replication, correlation, differential expression analysis, ANOVA
Bibliographical notes: Suman Sundaresh is a PhD student in the Computer Science Department at UC Irvine. She gained her MSc and BSc (Hons) in Computer Science from the National University of Singapore. Her research interests are in the areas of data mining, machine learning and biomedical informatics.
She-pin Hung is a post doctoral researcher in the Department of Microbiology and Molecular Genetics conjunctions with the Institute for Genomics and Bioinformatics at UC Irvine. She received her PhD from the University of California at Irvine in 2002. Her research interests are in the areas of global gene expression profiling with the use of DNA microarrays and bioinformatics.
G. Wesley (Wes) Hatfield, Ph.D., is a Professor of Microbiology and Molecular Genetics in the College of Medicine and Associate Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. Dr Hatfield holds a Ph.D. degree from Purdue University, and a B.A. degree from the University of California at Santa Barbara. His primary areas of scientific expertise include molecular biology, biochemistry, microbial physiology, functional genomics, and bioinformatics. His recent academic interests include the application and development of genomic and bioinformatics methods to elucidate the effects of chromosome structure and DNA topology on gene expression. He has received national recognition for his scientific contributions including the Eli Lilly and Company Research Award bestowed by the American Society of Microbiology.
Pierre Baldi is a Professor in the School of Information and Computer Science and the Department of Biological Chemistry and the Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. He received his PhD from the California Institute of Technology in 1986. From 1986 to 1988 he was a postdoctoral fellow at the University of California, San Diego. From 1988 to 1995 he held faculty and member of the technical staff positions at the California Institute of Technology and at the Jet Propulsion Laboratory. He was CEO of a startup company from 1995 to 1999 and joined UCI in 1999. He is the recipient of a 1993 Lew Allen Award at JPL and a Laurel Wilkening Faculty Innovation Award at UCI. Dr. Baldi has written over 100 research articles and four books. His research focuses in biological and chemical informatics, AI, and machine learning.
Introduction
This paper analyzes and quantifies certain aspects of “noise” contained in DNA microarray data. A DNA microarray experiment comprises several steps such as cDNA spotting, mRNA extraction, target preparation, hybridization, image scanning and analyses. These procedures can be further subdivided into dozens of other elementary steps each of which can introduce some amount of variability and noise.
In addition to the variability introduced by the instruments and the experimenter, there is biological variability which also has multiple sources ranging from fluctuations in the environment to the inherently stochastic nature of nano-scale regulatory chemistry (Barkai and Leibler, 2000; Hasty et al., 2000; McAdams and Arkin, 1999) [4,7,17] —transcription alone involves dozens of individual molecular interactions. These compounded forms of “noise” may lead one to doubt whether any reliable signal can be extracted at all from DNA microarrays. Here, we show that, while certainly noisy, DNA microarray data do contain reliable information.
In this study, we look at highly replicated (up to 32x) experiments performed by different experimenters at different times in the same laboratory, using, as a model organism, wild type Escherichia coli. In addition, we obtain these microarray measurements using two different formats, nylon filters and Affymetrix GeneChipsTM. Given the overwhelming number of variables that can in principle contribute to the variability, we focus on a particular subset of variables of great relevance to biologists. In particular, we measure the consistency of the results obtained using the filter technology across different filters and mRNA preparations. We also compare filters to Affymetrix GeneChipTM technology and study the effects of five different image processing methods.
Replicability is assessed quantitatively using correlation analysis and differential expression analysis. We use correlation as a global measure of similarity between two sets of measurements. While a correlation close to one is a good sign, it is a global measure that provides little information at the level of individual genes. Thus, we use differential expression analysis at the level of individual genes to detect which genes seem to behave differently in two different sets of measurements. The data sets and software used in our analysis are available over the Web at http://www.igb.uci.edu/servers/dmss.html.
Our approach differs from and complements previous related studies (Coombes et al. 2002; Piper et al., 2002) [5,18]. In particular, we use higher levels of replication (32x), relatively simpler biological samples (E. coli versus S. cerevisiae or human B-cell Lymphoma cell lines) and more diverse microarray technologies (filters and Affymetrix Gene Chips). Part of these other studies also focus on the analysis of variables that are outside the scope of the present study, such as exposure time or inter-laboratory variability.
Methods
Filter Dataset
The first dataset (“filter dataset”) we use consists of 32 sets of measurements from 16 nylon filter DNA microarrays containing duplicate probe sites for each of 4,290 open reading frames (ORFs) hybridized with 33P-labeled cDNA targets from wild-type Escherichia coli cells cultured at 37oC under balanced growth conditions in glucose minimal salts medium. The experimental design and methods for these experiments are described in detail in Arfin et. al. (2000), Baldi and Hatfield (2002) and Hung et. al. (2002) [1,2,10] and illustrated in Figure 1.
Each filter contains duplicate probes (spots) for each of the 4,290 open reading frames (ORFs) of the E. coli genome. In Experiment 1, Filters 1 and 2 were hybridized with 33P-labeled, random hexamer generated, cDNA targets complementary to each of three independently prepared RNA preparations (RNA1) obtained from the cells of three individual cultures of a wild-type (wt) E. coli strain. These three 33P-labeled cDNA target preparations were pooled prior to hybridization to the full-length ORF probes on the filters (Experiment 1). Following phosphorimager analysis, these filters were stripped and again hybridized with pooled, 33P-labeled cDNA targets complementary to each of another three independently prepared RNA preparations (RNA2) from the wt strain (Experiment 2).
This procedure was repeated two more times with filters 3 and 4, using two more independently prepared pools of cDNA targets (Experiment 3, RNA3; Experiment 4, RNA4). Another set of filters, Filters 3 and 4, were used for Experiments 3 and 4 as described for Experiments 1 and 2. This protocol results in duplicate filter data for four experiments performed with cDNA targets complementary to four independently prepared sets of pooled RNA. Thus, since each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, 16 measurements (D1-D16) for each ORF from four experiments were obtained. These procedures were performed with another two pairs of filters 5-8 for experiments 5-8 to obtain another 16 measurements (D17-D32) for each ORF.
The filter dataset is fairly representative of other filter datasets in the sense that it corresponds to experiments carried out by different people at different times in the same laboratory. In particular, of the 32 filter measurements, the data from measurements 1-16 were obtained 6 months later than the data from measurements 17-32. During this intervening period the efficiency of the 33P labeling was improved. Consequently, more signals marginally above background were detected on the filters for measurements 1-16. In fact, when we edit out all of the genes that contain one or more measurements at or below background in at least one experiment we observe the expression of 2,607 genes for measurements 1-16 and 1,579 genes for measurements 17-32. If we consider the dataset containing all 32 measurements, we find that only 1257 genes have all 32 expression measurements above background. The log (natural) transformed values (Speed, 2002) [19] of these above background 1257 gene expression values for all 32 measurements were used for subsequent analyses.
GeneChip Dataset
To address another DNA microarray technology, we use a second dataset (“GeneChip dataset”) that contains data from four Affymetrix GeneChipTM experiments that measure the expression levels of the same E. coli RNA preparations used for the filter experiments 1-4. The experimental design and methods for these experiments are illustrated in Figure 2 and described in detail by Hung et al. (2002) [10]. The four GeneChip measurements are each processed by five methods, MAS 4.0 and MAS 5.0 software of Affymetrix, dChip software of Li and Wong (2001) [15], RMA (Irizarry et. al., 2003) [12,13] and GCRMA (Wu and Irizarry, 2004) [21]. The dataset thus consists of 20 replicate measurements. The 2370 genes whose expression levels are above background for all the 20 measurements are used in the subsequent analyses. We log (natural) transformed the measures processed with MAS 4.0, 5.0 and dChip. RMA and GCRMA functions differ from most other expression measuring methods as they return the expression measures in log (base 2).
The gene expression measurements for each experiment of both datasets from the filters and the GeneChips are globally normalized by dividing each expression measurement with a value above background on all sixteen filters or four GeneChips by the sum of all the gene expression measurements of that filter or GeneChip. Thus, the signal for each measurement can be expressed as a fraction of the total signal for each filter or GeneChip, or, by implication, as a fraction of total mRNA. This normalization is not applied to the RMA and GCRMA measurements which already have a built-in normalization step.
The datasets obtained from the experiments described above allow us to investigate the effects of not only the environmental and biological factors, but also the consistency of measurements taken from two different DNA microarray technologies.
Image Processing Software
The GeneChip dataset contains 20 replicates, where each of the four GeneChip measurements is processed with five image processing software, MAS4.0, dChip and MAS5.0, RMA and GCRMA.
In the Affymetrix MAS 4.0 software, the mean and standard deviation of the PM (perfect match) - MM (mismatch) differences of a probe set in one array are computed after excluding the maximum and the minimum values obtained for that probe set. If, among the remaining probe pairs, a difference deviates by more than 3SD from the mean, that probe pair is declared an outlier and not used for the average difference calculation of both the control and the experimental array. A flaw of this approach is that a probe with a large response might well be the most informative but may be consistently discarded. Furthermore, if multiple arrays are compared at the same time, this method tends to exclude probes inconsistently measured among GeneChips.
Li and Wong (2001) [15] developed a statistical model-based analysis method to detect and handle cross-hybridizing probes, image and/or GeneChip defects, and to identify outliers across GeneChip sets. A probe set from multiple chips is modeled and the standard deviation between a fitted curve and the actual curve for each probe set for each GeneChip is calculated. Probe pair sets containing an anomalous probe pair measurement(s) are declared outliers and discarded. The remaining probe pair sets are remodeled and the fitted curve data is used for average difference calculations. These methods are implemented in a software program, dChip, which can be obtained from the authors.
A different empirical approach to improve the consistency of average difference measurements has been implemented in the more recent Affymetrix MAS 5.0 software. In this implementation, if the MM value is less than the PM value MAS 5.0 uses the MM value directly. However, if the MM value is larger than the PM value, MAS 5.0 creates an adjusted MM value based on the average difference intensity between the ln PM and ln MM, or if the measurement is too small, some fraction of PM. The adjusted MM values are used to calculate the ln (PM – adjusted MM) for each probe pair. The signal for a probe set is calculated as a one-step bi-weight estimate of the combined differences of all of the probe pairs of the probe set.
In Irizarry et. al., (2003) [12,13], it is demonstrated that the ln(PM-adjusted MM) technique in MAS5.0 results in gene expression estimates with elevated variances. The RMA (robust multi-array analysis) approach applies a global background adjustment and normalization and fits a log-scale expression effect plus probe effect model robustly to the data. This method has been implemented in the Bioconductor affy package (http://www.bioconductor.org). The cel files of this GeneChip data set were pre-processed with the default setting at the probe level and the expression value of each gene was obtained.
An extension of RMA is discussed in Wu and Irizarry (2004) [21] based on molecular hybridization theory takes into account the GC content of each of the probe sequence for the calculation of non-specific binding. This method, called GCRMA, is also available as part of the Bioconductor project in the gcrma package. The cel files of this GeneChip data set were pre-processed with the default setting at the probe level and the expression value of each gene was obtained.
Correlation and Differential Expression Analyses
To measure the consistency between two sets of measurements globally, such as different filters or different cDNA target preparations, we use the Pearson’s correlation coefficient. We compute and analyze matrices of correlation coefficients between different sets of measurements.
