Preprocessing Affymetrix Data

PREPROCESSING AFFYMETRIX DATA.

Preprocessing of Affymetrix data is an articulated process consisting of the following steps:

Importing the CEL files and the PHENODATA;
Summarize the expression values per each probe set.

Summarizing expression values is constituted of the following steps:

BACKGROUND correction;
NORMALIZATION;
SUMMARIZATION.

All these operations are supported in the package affy.

Preliminary operations.

1. Start R-Gui;

2. Choose the directory where the CEL files and the PHENODATA file are contained as the working directory;

3. Load the affy package, through the command: library(affy).

Importing data.

dat <- ReadAffy()

where dat is the new AffyBatch object that will be created at the end of the process operated by the function ReadAffy.

Now we can import the phenodata.

> pData(dat)<-read.table("phenod.txt", header=T, row.names=1, sep="\t")

the phenodata object contains all the information about the samples of our dataset. It is always important to keep track off all the information about our samples, since they can strongly affect the final results.

> pData(dat)

diagnosis age gender kidney subject

GSM11805.CEL NORM 70 F R 1

GSM11814.CEL RCC 70 F R 1

GSM11823.CEL NORM 72 F R 2

GSM11830.CEL RCC 72 F R 2

GSM12067.CEL RCC 58 M L 3

GSM12075.CEL NORM 58 M L 3

GSM12079.CEL RCC 64 M R 4

GSM12098.CEL NORM 64 M R 4

GSM12100.CEL RCC 55 M L 5

GSM12105.CEL RCC 65 F R 6

GSM12268.CEL NORM 51 M L 7

GSM12270.CEL RCC 67 M R 8

GSM12283.CEL NORM 67 M R 8

GSM12298.CEL RCC 50 M L 9

GSM12300.CEL NORM 50 M L 9

GSM12399.CEL RCC 65 M L 10

GSM12444.CEL NORM 65 M L 10

The MIAME information is stored in the slot @description of the dat object. We can visualize it:

> dat@description

Experiment data

Experimenter name:

Laboratory: my lab

Contact information:

Title:

URL:

PMIDs:

No abstract available.

Information is available on: preprocessing

We can fill in the information regarding the Experimenter name, the Laboratory, Contact information, and Title:

Experimenter name:

> dat@description@name="Dario Greco"

Laboratory:

> dat@description@lab="DNA-microarray Lab, Institute of Biotechnology"

Contact information:

> dat@description@contact=""

Title:

> dat@description@title="the greatest finding of the century"

now we can visualize again the MIAME slot:

> dat@description

Experiment data

Experimenter name: Dario Greco

Laboratory: DNA-microarray Lab, Institute of Biotechnology

Contact information:

Title: the greatest finding of the century

URL:

PMIDs:

No abstract available.

Information is available on: preprocessing

Exploring the affybatch.

Typing just the name of the Affybatch we have created importing the data, we get a summary about the dataset.

> dat

AffyBatch object

size of arrays=712x712 features (67343 kb)

cdf=HG-U133A (22283 affyids)

number of samples=17

number of genes=22283

annotation=hgu133a

We can also visualize the phenodata of the dataset.

> pData(dat)

disease age gender kidney subject

GSM11805.CEL NORM 70 F R 1

GSM11814.CEL RCC 70 F R 1

GSM11823.CEL NORM 72 F R 2

GSM11830.CEL RCC 72 F R 2

GSM12067.CEL RCC 58 M L 3

GSM12075.CEL NORM 58 M L 3

GSM12079.CEL RCC 64 M R 4

GSM12098.CEL NORM 64 M R 4

GSM12100.CEL RCC 55 M L 5

GSM12105.CEL RCC 65 F R 6

GSM12268.CEL NORM 51 M L 7

GSM12270.CEL RCC 67 M R 8

GSM12283.CEL NORM 67 M R 8

GSM12298.CEL RCC 50 M L 9

GSM12300.CEL NORM 50 M L 9

GSM12399.CEL RCC 65 M L 10

GSM12444.CEL NORM 65 M L 10

It is possible to visually explore the dataset using different methods.

We can produce a density / intensity histogram.

Notice that the shape of the curves is far away from being normal.

> hist(dat)

The boxplot is another way to visualize the distribution of the intensities in each array of the dataset.

>boxplot(dat)

It is also possible to visualize the image representing each CEL file (in this example, we visualize the first slide of the dataset only).

In this way, it is possible to pinpoint technical problems occurring eventually only to one region of the array.

>image(dat[,1])

In order to access the intensities of PM and MM, we can use the functions pm and mm.

In this example, we visualize the PM and the MM intensities for the probe cells of the probeset “78383_at” in the first three arrays of the dataset.

> pm(dat, "78383_at")[,1:3]

GSM11805.CEL GSM11814.CEL GSM11823.CEL

78383_at1 129.8 197.8 133.5

78383_at2 330.3 454.3 308.0

78383_at3 150.3 183.5 150.3

78383_at4 139.8 192.3 163.3

78383_at5 528.3 585.5 670.3

78383_at6 234.8 298.5 250.3

78383_at7 198.3 206.8 186.8

78383_at8 203.5 261.0 253.8

78383_at9 88.3 115.8 119.5

78383_at10 283.0 355.3 270.0

78383_at11 224.3 308.5 274.8

78383_at12 463.3 643.8 484.3

78383_at13 79.5 66.0 79.0

78383_at14 60.3 83.0 73.0

78383_at15 73.3 69.0 77.3

78383_at16 506.5 506.3 529.3

> mm(dat, "78383_at")[,1:3]

GSM11805.CEL GSM11814.CEL GSM11823.CEL

78383_at1 124.8 139.8 128.3

78383_at2 128.8 131.3 118.0

78383_at3 171.0 192.3 176.8

78383_at4 165.3 205.3 177.5

78383_at5 552.0 777.5 780.8

78383_at6 96.0 114.0 108.5

78383_at7 173.0 236.3 199.3

78383_at8 97.0 131.8 107.3

78383_at9 84.3 94.0 98.3

78383_at10 138.5 152.0 154.0

78383_at11 165.0 261.3 230.5

78383_at12 228.8 298.0 260.3

78383_at13 72.8 69.5 68.3

78383_at14 58.3 68.0 77.5

78383_at15 78.5 71.0 68.3

78383_at16 226.0 277.0 258.3

We can produce a plot with the PM intensity of each probe cell of the probeset in each array. As you can notice, the shape of the curves is very similar for each array (different line colors). This gives a demonstration of the idea of “probe response” that is behind the Li and Wong model (MBEI)

> matplot(pm(dat, “78383_at”), type=”l”, xlab=”probe”, ylab=”PM intensity 78383_at”)

In the same way, the MM intensities are plotted:

> matplot(mm(dat, “78383_at”), type=”l”, xlab=”probe”, ylab=”PM intensity 78383_at”)

Summarizing the expression values.

In affy there are several methodologies available to correct background, normalize and summarize expression values per each probe set of the dataset.

Here we will use the RMA and MAS5 methods that are implemented into ready made functions. We will also explore quickly the possibilities contained in the function expresso().

In order to apply the RMA method, we can type:

> datrma<-rma(dat)

We can visualize a summary of this new object. As you can notice, the class of this object is exprSet. Moreover, the phenodata are transferred from the AffyBatch object.

> datrma

Expression Set (exprSet) with

22283 genes

17 samples

phenoData object with 5 variables and 17 cases

varLabels

disease: disease

age: age

gender: gender

kidney: kidney

subject: subject

> pData(datrma)

disease age gender kidney subject

GSM11805.CEL NORM 70 F R 1

GSM11814.CEL RCC 70 F R 1

GSM11823.CEL NORM 72 F R 2

GSM11830.CEL RCC 72 F R 2

GSM12067.CEL RCC 58 M L 3

GSM12075.CEL NORM 58 M L 3

GSM12079.CEL RCC 64 M R 4

GSM12098.CEL NORM 64 M R 4

GSM12100.CEL RCC 55 M L 5

GSM12105.CEL RCC 65 F R 6

GSM12268.CEL NORM 51 M L 7

GSM12270.CEL RCC 67 M R 8

GSM12283.CEL NORM 67 M R 8

GSM12298.CEL RCC 50 M L 9

GSM12300.CEL NORM 50 M L 9

GSM12399.CEL RCC 65 M L 10

GSM12444.CEL NORM 65 M L 10

For the MAS5:

> datmas5<-mas5(dat)

The functions rma and mas5 are implemented in C++ language and for this reason they are fast and not too much memory demanding.

THE expresso() FUNCTION.

For a more flexible strategy, it is possible to use the function expresso():

For instance, the following function gives the same results of the function rma() but it is much slower and memory demanding:

eset <- expresso(affybatch.example,bgcorrect.method="rma", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")

In the package affy there are available also functions to run only one of the steps included in the expresso() function.

Background correction.

It is possible to visualize the available methods with the following command:

> bgcorrect.methods

[1] "mas" "none" "rma" "rma2"

In order to perform only background correction on the Affybatch object, just type:

> dat.bgc <- bg.correct(dat, method="rma")

Normalization.

It is possible to visualize the available methods with the following command:

> normalize.AffyBatch.methods

[1] "constant" "contrasts" "invariantset" "loess"

[5] "qspline" "quantiles" "quantiles.robust"

In order to perform only the normalization, we can type:

> dat.norm<-normalize(dat, method="quantiles")

Note that both the functions bg.correct() and normalize()take in input and give in output an Affybatch object.

Similarly, it is possible to visualize all the implemented methods for the PM correction and summarization by the following commands:

> pmcorrect.methods

[1] "mas" "pmonly" "subtractmm"

> express.summary.stat.methods

[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"

Final operations.

Save the workspace giving a new name (e.g. “Affy_preprocessing.Rdata”);
Save the history.

Exercise.

Chose two different background correction methods and apply them separately to the Affybatch object containing the row data.

Chose two different normalization methods and apply them separately to the bg.corrected objects you have created in the previous step.

Now visualize the bg.corrected and normalized datasets using boxplot and hist.

Try to make your own conclusions about the performance.

Thank you for your attention!

Dario

For contacts:

Dario Greco

Institute of Biotechnology - University of Helsinki

Building Cultivator II

P.O.Box 56 Viikinkaari 4

FIN-00014 Finland

Office: +358 9 191 58951

Fax: +358 9 191 58952

Mobile: +358 44 023 5780

Email: