R Notes 2011 LAB 2

Topics covered:

·  One-way ANOVA

·  Levene’s test

·  ANOVA with nested design

> setwd("G:/Courses/A205/R/Lab2")

lab2a<-read.table('Lab2a.txt', header=T)

# read.table reads white spaces as separators as default. You can specify in

# the parameters of read.table the type of separator: e.g.: sep=”\t“ for tab

# only as separator. When you use a new command it is useful to ask the

# manual what are the default values and its usage (> ?read.table).

# alternatively: read.csv can be used for comma separated values or

# read.delim for tab separated values. Remember to look at the manual for the

# usable parameters: you will discover the very handy parameter sep, that

# allows to set how the values are separated: sep=“ “ is the default value

# for read.table, sep=”\t” or sep=”,” can be used for tab and comma separated

# values respectively.

> lab2a

Culture N_level

1 3DOk1 24.1

2 3DOk1 32.6

3 3DOk1 27.0

4 3DOk1 28.9

5 3DOk1 31.4

6 3DOk5 19.1

7 3DOk5 24.8

8 3DOk5 26.3

9 3DOk5 25.2

10 3DOk5 24.3

11 3DOk4 17.9

12 3DOk4 16.5

13 3DOk4 10.9

14 3DOk4 11.9

15 3DOk4 15.8

16 3DOk7 20.7

17 3DOk7 23.4

18 3DOk7 20.5

19 3DOk7 18.1

20 3DOk7 16.7

21 3DOk13 14.3

22 3DOk13 14.4

23 3DOk13 11.8

24 3DOk13 11.6

25 3DOk13 14.2

26 Comp 17.3

27 Comp 19.4

28 Comp 19.1

29 Comp 16.9

30 Comp 20.8

# When input data in R you should always check how R has interpreted them.

# Eventually you may need to make some adjustments. You can look at the data

# by typing again the name of the data.frame you created (Lab2_N), if the

# data are composed of thousand or rows you can visualize only the first of

# the last lines with the head and tail commands, respectively.

# The function str is useful to get a summary of the data and how R is

# interpreting them. If numerical factors, R by default will consider them

# continuous numbers; factors need to be recognized as such and composed of

# discrete levels; a very useful command is str.

> str(lab2a)

'data.frame': 30 obs. of 2 variables:

$ Culture: Factor w/ 6 levels "3DOk1","3DOk13",..: 1 1 1 1 1 4 4 4 4 4 ...

$ N_level: num 24.1 32.6 27 28.9 31.4 19.1 24.8 26.3 25.2 24.3 ...

# Or we can ask the question using the is.factor function (if you need to

# convert numerical values to factors you may use the function as.factor)

> lab2a<-read.table('Lab2a.txt', header=T)

> is.factor(lab2a$Culture)

[1] TRUE

> is.factor(lab2a$N_level)

[1] FALSE

# To select specific part of the table

> lab2a[lab2a$Culture=="Comp",]

Culture N_level

26 Comp 17.3

27 Comp 19.4

28 Comp 19.1

29 Comp 16.9

30 Comp 20.8

> lab2a[lab2a$N_level>="25",]

Culture N_level

2 3DOk1 32.6

3 3DOk1 27.0

4 3DOk1 28.9

5 3DOk1 31.4

8 3DOk5 26.3

9 3DOk5 25.2

One way ANOVA

# To define the one-way ANOVA model (result_variable~classification_variable)

# we can use the function lm. It is important to note the order of the

# arguments: the first argument is always the dependent variable (N_level).

# It is followed by the tilde symbol (~) and the independent variable(s).

> model<-lm(N_level~Culture, data=lab2a)

> anova(model)

Analysis of Variance Table

Response: N_level

Df Sum Sq Mean Sq F value Pr(>F)

Culture 5 845.72 169.144 25.363 7.537e-09 ***

Residuals 24 160.05 6.669

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# We use the anova function to visualize the ANOVA table of a linear model

# defined using lm (for aov model use summary). Summary can be used also with

# lm models and returns the R2:

> summary(model)

Call:

lm(formula = N_level ~ Culture, data = lab2a)

Residuals:

Min 1Q Median 3Q Max

-4.840 -1.750 0.660 1.245 3.800

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 19.8633 0.4715 42.130 < 2e-16 ***

Culture1 -7.7700 0.8166 -9.515 1.29e-09 ***

Culture2 -2.1433 0.4715 -4.546 0.000132 ***

Culture3 1.2633 0.3334 3.789 0.000896 ***

Culture4 -0.0540 0.2582 -0.209 0.836129

Culture5 -0.2327 0.2109 -1.103 0.280771

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.582 on 24 degrees of freedom [= Root MSE in SAS]

Multiple R-squared: 0.8409, Adjusted R-squared: 0.8077

F-statistic: 25.36 on 5 and 24 DF, p-value: 7.537e-09

# Reminder from previous lab: it is easy to calculate exact p-value from know

# F-values in R using pf.

> pf(25.36, 5, 24, lower.tail=F)

[1] 7.546316e-09

?pf
lower.tail= logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x].

# We can easily calculate the coefficient of variation:

> Nlevel_Mean<-mean(lab2a$N_level)

> Root_MSE<-sqrt(6.67)

> Root_MSE

[1] 2.582634

> Coeff_Var<-Root_MSE/Nlevel_Mean*100

> Coeff_Var

[1] 13.00202

In SAS

R-Square Coeff Var Root MSE Nlevel Mean

0.840866 13.00088 2.582408 19.86333

> plot(lab2a)

# same result boxplot(N_level ~ Culture, data=lab2a)

# We can easily extract the predicted and the residual values using the

# predict and residual functions.

> predi<-predict(model)

> predi

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

28.80 28.80 28.80 28.80 28.80 23.94 23.94 23.94 23.94 23.94 14.60 14.60 14.60 14.60 14.60 19.88 19.88

18 19 20 21 22 23 24 25 26 27 28 29 30

19.88 19.88 19.88 13.26 13.26 13.26 13.26 13.26 18.70 18.70 18.70 18.70 18.70

> resi<-residuals(model)

> resi

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

-4.70 3.80 -1.80 0.10 2.60 -4.84 0.86 2.36 1.26 0.36 3.30 1.90 -3.70 -2.70 1.20 0.82 3.52

18 19 20 21 22 23 24 25 26 27 28 29 30

0.62 -1.78 -3.18 1.04 1.14 -1.46 -1.66 0.94 -1.40 0.70 0.40 -1.80 2.10

# Adding the Predicted and residual vectors to the data.frame lab2a

> lab2a$predi<-predi

> lab2a$resi<-resi

# We can print the first 10 rows using head function to visualize the new

# table lab2a and use the write.table function to save it a tab delimited

# text file.

> head(lab2a, 10)

# 10 specifies the number of rows: 5 is default.

Culture N_level predi resi

1 3DOk1 24.1 28.80 -4.70

2 3DOk1 32.6 28.80 3.80

3 3DOk1 27.0 28.80 -1.80

4 3DOk1 28.9 28.80 0.10

5 3DOk1 31.4 28.80 2.60

6 3DOk5 19.1 23.94 -4.84

7 3DOk5 24.8 23.94 0.86

8 3DOk5 26.3 23.94 2.36

9 3DOk5 25.2 23.94 1.26

10 3DOk5 24.3 23.94 0.36

> write.table(Lab2a, "try3.txt", sep='\t')

Levene’s test

1- Calculating Levene’s test by hand (ANOVA of the square of the residuals)

> lab2a$resi2<-lab2a$resi^2

> modelLevene<-lm(lab2a$resi2~Culture, data=lab2a)

> anova(modelLevene)

Analysis of Variance Table

Response: lab2a$resi2

Df Sum Sq Mean Sq F value Pr(>F)

Culture 5 226.99 45.398 1.1528 0.3606

Residuals 24 945.17 39.382

2- Levene’s test using the ‘car’ package

# To install a new package, we can use the install.packages function.

> install.packages("car")

--- Please select a CRAN mirror for use in this session select close US location]

Content type 'application/zip' length 728229 bytes (711 Kb)

opened URL

downloaded 711 Kb

package 'car' successfully unpacked and MD5 sums checked

# The package is downloaded but not available. Each R session you need to

# activate the downloaded library (it saves memory to have only the necessary

# libraries open)

> library(car)

> help(car)

# If you want to change the list of default packages you need to modify the

# Rprofile file. Search the Rprofile file in your computer and open it with a

# txt editor (e.g. word pad). Once it’s open, you have to search for this

# part:

local({dp <- as.vector(Sys.getenv("R_DEFAULT_PACKAGES"))

if(identical(dp, "")) # marginally faster to do methods last

dp <- c("datasets", "utils", "grDevices", "graphics",

"stats", "methods", "car", "agricolae")

else if(identical(dp, "NULL")) dp <- character(0)

else dp <- strsplit(dp, ",")[[1]]

dp <- sub("[[:blank:]]*([[:alnum:]]+)", "\\1", dp) # strip whitespace

options(defaultPackages = dp)

})

# The part I have highlighted are the names of the two additional packages I

# want to be loaded at startup. Add the names of the packages between

# quotation mark. Save the new Rprofile file and restart R console to see the

# changes you have made. A word of advice: don’t change anything else in the

# Rprofile file, if you don’t know what you are doing…

# Running the Levene’s test: deviation from medians is the default. SAS

# calculates deviation from means.

> levene.test(model, center=mean)

Levene's Test for Homogeneity of Variance (center = mean)

Df F value Pr(>F)

group 5 0.5841 0.7119

24

# Caution: R/car uses a different version of the Levene’s test than SAS:

# R version is of Levene is based on absolute deviations (instead of the

# square deviations in SAS).

3- Alternative homogeneity of variance tests: Bartlett and Fligner-Killeen Tests

> bartlett.test(N_level~Culture, lab2a)

Bartlett test of homogeneity of variances

data: N_level by Culture

Bartlett's K-squared = 3.9834, df = 5, p-value = 0.5518

> fligner.test(N_level~Culture, lab2a)

Fligner-Killeen test of homogeneity of variances

data: N_level by Culture

Fligner-Killeen:med chi-squared = 2.7889, df = 5, p-value = 0.7325

ANOVA with nested design

> lab2b<-read.table('lab2b.txt', header=T)

> str(lab2b)

'data.frame': 72 obs. of 4 variables:

$ Trtmt : int 1 1 1 1 1 1 1 1 1 1 ...

$ Pot : int 1 1 1 1 2 2 2 2 3 3 ...

$ Plant : int 1 2 3 4 1 2 3 4 1 2 ...

$ Growth: num 3.5 4 3 4.5 2.5 4.5 5.5 5 3 3 ...

# Since R is not interpreting the first three variables as factors we need to

# declare that they are factors using the as.factor function.

> lab2b$Plant<-as.factor(lab2b$Plant)

> lab2b$Pot<-as.factor(lab2b$Pot)

> lab2b$Trtmt<-as.factor(lab2b$Trtmt)

> str(lab2b)

'data.frame': 72 obs. of 4 variables:

$ Trtmt : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

$ Pot : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...

$ Plant : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...

$ Growth: num 3.5 4 3 4.5 2.5 4.5 5.5 5 3 3 ...

To have all the variance components: nested design using linear mixed-effects models with lmer (lme4 package)

install.packages("lme4")

> library(lme4)

# We design a model where the dependent variable growth is function only of a

# random effect that considers the hierarchical design nesting of ‘Pot’

# within ‘Trtmt’ (1|Trtmt/Pot):

> model<-lmer(Growth ~ 1 + (1|Trtmt/Pot), data=lab2b)

> model

Linear mixed model fit by REML

Formula: Growth ~ 1 + (1 | Trtmt/Pot)

AIC BIC logLik deviance REMLdev

237.2 246.3 -114.6 230.3 229.2

Random effects:

Groups Name Variance Std.Dev.

Pot:Trtmt (Intercept) 0.30469 0.55198

Trtmt (Intercept) 2.81464 1.67769

Residual 0.93403 0.96645

Number of obs: 72, groups: Pot:Trtmt, 18; Trtmt, 6

Fixed effects:

Estimate Std. Error t value

(Intercept) 5.7847 0.7064 8.19

# Now we can calculate the variance components as percentage:

> variances<-c(0.30,2.8,0.93)

> variances/sum(variances)*100

[1] 7.444169 69.478908 23.076923

In SAS

SAS. Nested Random Effects Analysis of Variance for Variable Growth

Variance Sum of Error Variance Percent

Source DF Squares F Value Pr > F Term Mean Square Component of Total

Total 71 255.913194 3.604411 4.053356 100.0000

Trtmt 5 179.642361 16.69 <.0001 Pot 35.928472 2.814641 69.4398

Pot 12 25.833333 2.30 0.0186 Error 2.152778 0.304688 7.5169

Error 54 50.437500 0.934028 0.934028 23.0433

Growth Mean 5.78472222

Standard Error of Growth Mean 0.70640396

To have the p-value:

# Using lm

> anova(lm(Growth~Trtmt/Pot, lab2b))

Analysis of Variance Table

Response: Growth

Df Sum Sq Mean Sq F value Pr(>F)

Trtmt 5 179.642 35.928 38.4662 < 2e-16 ***

Trtmt:Pot 12 25.833 2.153 2.3048 0.01858 *

Residuals 54 50.438 0.934

# 38.5 is the incorrect F value for Trtmt because it is using the wrong error

# term. R computes F values using the residual MS as the error term (0.934

# in this case). The calculation of the correct F and P needs to be

# completed by hand. You need to know that the MSE Trtmt needs to be divided

# by the Trtmt/Pot error term (2.153) and not the residual.

Fvalue_trtmt<-35.928/2.1528

Fvalue_trtmt

16.69

pf(16.69, 5, 12, lower.tail=F)

4.880102e-05

Fvalue_treatment_pot<-2.1528/0.934

Fvalue_treatment_pot

2.30

pf(2.30, 12, 54, lower.tail=F)

0.01882839

In SAS

SAS: Tests of Hypotheses Using the Type III MS for Pot(Trtmt) as an Error Term

Source DF Type III SS Mean Square F Value Pr > F

Trtmt 5 179.6423611 35.9284722 16.687 <.0001

# We could use lm only and calculate the variance components by hand:

# MSSE= s2error = 0.93

# MSEE= s2 pot= (MSE - s2error)/4= (2.15-0.93)/4= 0.30

# MST= (MST –MSEE)/12= (35.93-2.15)/12=2.81

Nested design with ANOVA with multiple error terms (aov) [not covered in class]

> model_nest<-aov(Growth~Trtmt+Error(Trtmt/Pot), lab2b)

> summary(model_nest)

Error: Trtmt

Df Sum Sq Mean Sq

Trtmt 5 179.642 35.928

Error: Trtmt:Pot

Df Sum Sq Mean Sq F value Pr(>F)

Residuals 12 25.8333 2.1528

Error: Within

Df Sum Sq Mean Sq F value Pr(>F)

Residuals 54 50.437 0.934

# Again then the calculation of F and P needs to be completed by hand. You

# need to know that the MSE Trtmt needs to be divided by the Trtmt:Pot error

# term and not the residual.

PLS205 2011 2.8 R Lab 2