CSSS 508: Intro to R

3/03/06

Homework 8 Solutions

1) Download hw8prob1.dat from the class website and read it into R.

data1<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/hw8prob1.dat")

dim(data1)

[1] 50 2

Plot the data and find a linear regression model using x to predict y.

x<-data1[,1]

y<-data1[,2]

plot(x,y,pch=16,xlab="x",ylab="y")

fit1<-lm(y~x)

summary(fit1)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-8.985 -4.206 -1.001 1.945 20.172

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.4783 0.9890 8.573 3.04e-11 ***

x 2.5286 0.3236 7.814 4.21e-10 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 6.858 on 48 degrees of freedom

Multiple R-Squared: 0.5599, Adjusted R-squared: 0.5507

F-statistic: 61.05 on 1 and 48 DF, p-value: 4.215e-10

abline(fit1,col=2)

text(locator(1),"y = 8.48 + 2.53x")

See next page for graph.

The fitted line has an R2 of 0.55 and a residual standard error of 6.858.

Our line is plotted in red; it is pulled up (given a steeper slope) toward the points in the upper right-hand corner. There is a clear linear trend in the points along the lower half of the graph, but the upper right-hand group of points pulls the line away from it.

Are there any outliers that affected your line?

Use identify( ) to locate them, remove them from your data set, and refit the model.

I’m going to remove the 10 points in the upper right-hand corner.

outliers<-identify(x,y,labels=seq(1,50),n=10)

outliers

[1] 7 14 16 18 20 24 26 45 49 50
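If identify( ) isn’t convenient (e.g., in a non-interactive session), a similar selection could be made numerically. A minimal sketch, assuming the upper right-hand cluster corresponds to the 10 largest positive residuals from fit1 (this may not exactly reproduce the interactive selection):

outliers.alt<-sort(order(resid(fit1),decreasing=TRUE)[1:10]) # hypothetical alternative to identify( )

outliers.alt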


data1.b<-data1[-outliers,]

x.b<-data1.b[,1]

y.b<-data1.b[,2]

Plotting the new data on the old axes:

plot(x.b,y.b,pch=16,xlab="x",ylab="y",xlim=c(min(x),max(x)),ylim=c(min(y),max(y)))

fit1.b<-lm(y.b~x.b)

summary(fit1.b)

Call:

lm(formula = y.b ~ x.b)

Residuals:

Min 1Q Median 3Q Max

-5.10662 -1.91223 0.07102 1.88650 5.65643

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 5.7193 0.4369 13.09 1.17e-15 ***

x.b 1.8049 0.1476 12.23 9.65e-15 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 2.763 on 38 degrees of freedom

Multiple R-Squared: 0.7973, Adjusted R-squared: 0.792

F-statistic: 149.5 on 1 and 38 DF, p-value: 9.649e-15

abline(fit1.b,col=2)

text(locator(1),"y = 5.72 + 1.80x")

This line has a much higher R2 (0.79) and a much smaller residual standard error.

A better choice.
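To see the effect of removing those points, we can overlay both fitted lines on the full data; a quick sketch using the objects defined above:

plot(x,y,pch=16,xlab="x",ylab="y")

abline(fit1,col=2) # original fit, pulled toward the outliers

abline(fit1.b,col=4) # refit without the 10 removed points

legend("topleft",c("all points","outliers removed"),col=c(2,4),lty=1)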

2) Download hw8prob2.dat from the class website and read it into R.

data2<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/hw8prob2.dat")

dim(data2)

[1] 100 2

You have an independent x variable and a dependent y variable, but the relationship is not necessarily linear. Use plotting diagnostics, R2, etc. to choose a model for your data. Explore different combinations of powers of x.

Why did you choose the model you did?

x<-data2[,1]

y<-data2[,2]

Looking at the data to get some modeling ideas:

plot(x,y,pch=16,xlab="x",ylab="y")

Definitely not linear. We’re going to need higher powers of x.

x2<-x^2

x3<-x^3

x4<-x^4

x5<-x^5

Now we’ll examine different models.


summary(fit1<-lm(y~x))

Residuals:

Min 1Q Median 3Q Max

-11.8769 -1.6985 -0.1727 2.4313 5.7565

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.3277 0.3429 -3.872 0.000195 ***

x -3.6252 0.3521 -10.296 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.289 on 98 degrees of freedom

Multiple R-Squared: 0.5196, Adjusted R-squared: 0.5147

F-statistic: 106 on 1 and 98 DF, p-value: < 2.2e-16

summary(fit2<-lm(y~x2))

Residuals:

Min 1Q Median 3Q Max

-5.59277 -1.42059 -0.02910 1.36303 4.91466

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.7458 0.2636 6.624 1.89e-09 ***

x2 -4.2930 0.1911 -22.463 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.914 on 98 degrees of freedom

Multiple R-Squared: 0.8374, Adjusted R-squared: 0.8357

F-statistic: 504.6 on 1 and 98 DF, p-value: < 2.2e-16

summary(fit3<-lm(y~x3))

Residuals:

Min 1Q Median 3Q Max

-4.102964 -0.938776 0.006417 0.968818 4.458356

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.16302 0.16579 -0.983 0.328

x3 -2.22456 0.07442 -29.893 <2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.492 on 98 degrees of freedom

Multiple R-Squared: 0.9012, Adjusted R-squared: 0.9002

F-statistic: 893.6 on 1 and 98 DF, p-value: < 2.2e-16

summary(fit4<-lm(y~x4))

Residuals:

Min 1Q Median 3Q Max

-3.8945 -0.5833 0.1448 0.8800 3.0194

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.28782 0.13539 2.126 0.036 *

x4 -1.37384 0.03544 -38.763 <2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.174 on 98 degrees of freedom

Multiple R-Squared: 0.9388, Adjusted R-squared: 0.9381

F-statistic: 1503 on 1 and 98 DF, p-value: < 2.2e-16

summary(fit5<-lm(y~x5))

Residuals:

Min 1Q Median 3Q Max

-3.5818 -0.6192 0.1198 0.7658 2.6435

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.26050 0.13506 -1.929 0.0567 .

x5 -0.73593 0.01994 -36.908 <2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.229 on 98 degrees of freedom

Multiple R-Squared: 0.9329, Adjusted R-squared: 0.9322

F-statistic: 1362 on 1 and 98 DF, p-value: < 2.2e-16

Higher powers are starting to fit worse.
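Instead of typing out each summary, we could also collect the fit statistics in a loop; a short sketch that should reproduce the adjusted R-squared values above:

adj.r2<-sapply(1:5,function(k) summary(lm(y~I(x^k)))$adj.r.squared) # y ~ x^k for k = 1,...,5

names(adj.r2)<-paste("x",1:5,sep="")

adj.r2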

It looks as if we may need some terms like x3 or x4.

Let’s take a look at our residuals.

par(mfrow=c(2,2))

Then for each fit (fit1, fit2, fit3, fit4), changing the title to match the model:

plot(fit1$resid)

abline(h=0)

title("Y = X")

We want small residuals scattered around zero. Y = X3 and Y = X4 look pretty good.
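A compact way to produce all four residual plots at once (a sketch; the labels are only for the panel titles):

par(mfrow=c(2,2))

fits<-list(fit1,fit2,fit3,fit4)

labs<-c("Y = X","Y = X2","Y = X3","Y = X4")

for(i in 1:4){

plot(fits[[i]]$resid,ylab="residuals")

abline(h=0)

title(labs[i])

}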


Let’s see which terms are more important, head-to-head:

summary(lm(y~x+x2+x3+x4))

Residuals:

Min 1Q Median 3Q Max

-3.33273 -0.64887 0.07955 0.70970 3.03096

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.02287 0.19761 0.116 0.908117

x 1.83452 0.46280 3.964 0.000143 ***

x2 0.30258 0.36742 0.824 0.412267

x3 -2.58795 0.56223 -4.603 1.29e-05 ***

x4 -0.32827 0.27223 -1.206 0.230864

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.07 on 95 degrees of freedom

Multiple R-Squared: 0.9507, Adjusted R-squared: 0.9486

F-statistic: 458.2 on 4 and 95 DF, p-value: < 2.2e-16

It looks like a combination of x and x3 might be a good model.

summary(final<-lm(y~x+x3))

Residuals:

Min 1Q Median 3Q Max

-3.21089 -0.67664 0.08948 0.66755 3.16449

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.1104 0.1219 0.905 0.368

x 2.2751 0.2342 9.715 5.53e-16 ***

x3 -3.1499 0.1091 -28.866 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.068 on 97 degrees of freedom

Multiple R-Squared: 0.9499, Adjusted R-squared: 0.9489

F-statistic: 919.7 on 2 and 97 DF, p-value: < 2.2e-16

This model has the highest adjusted R2 and lowest residual standard error we’ve seen (albeit only very slightly better than the four-term model; we prefer the simpler model).
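We could back this choice up with a formal F-test, comparing the nested models with anova( ); a sketch that refits the four-term model (we didn’t store it above):

full4<-lm(y~x+x2+x3+x4) # hypothetical name for the four-term fit

anova(final,full4) # tests whether x2 and x4 add anything beyond x and x3

A large p-value here would confirm that the extra terms are not needed.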

We could also have used the step function to find the best combination for us.

full.fit<-lm(y~x+x2+x3+x4+x5)

step(full.fit)

Start: AIC= 20.39

y ~ x + x2 + x3 + x4 + x5

Df Sum of Sq RSS AIC

- x5 1 0.004 108.759 18.397

- x2 1 0.061 108.817 18.449

- x4 1 0.096 108.852 18.482

<none> 108.755 20.393

- x 1 17.896 126.651 33.627

- x3 1 21.583 130.338 36.496

Step: AIC= 18.4

y ~ x + x2 + x3 + x4

Df Sum of Sq RSS AIC

- x2 1 0.776 109.536 17.108

- x4 1 1.665 110.424 17.916

<none> 108.759 18.397

- x 1 17.989 126.748 31.703

- x3 1 24.256 133.016 36.530

Step: AIC= 17.11

y ~ x + x3 + x4

Df Sum of Sq RSS AIC

- x4 1 1.011 110.546 16.027

<none> 109.536 17.108

- x 1 20.319 129.855 32.125

- x3 1 25.391 134.927 35.957

Step: AIC= 16.03

y ~ x + x3

Df Sum of Sq RSS AIC

<none> 110.55 16.03

- x 1 107.56 218.11 81.98

- x3 1 949.59 1060.13 240.10

Call:

lm(formula = y ~ x + x3)

Coefficients:

(Intercept) x x3

0.1104 2.2751 -3.1499

step( ) chooses the same model: y = 0.1104 + 2.2751x - 3.1499x3
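As a final check, we can overlay the chosen model’s fitted curve on the scatterplot; a quick sketch using the final fit from above:

plot(x,y,pch=16,xlab="x",ylab="y")

ord<-order(x) # sort by x so the curve draws left to right

lines(x[ord],fitted(final)[ord],col=2)

title("y = 0.1104 + 2.2751x - 3.1499x3")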


3) Download hw8prob3.dat from the class website and read it into R.

data3<-read.table("http://www.stat.washington.edu/rnugent/teaching/csss508/hw8prob3.dat")

dim(data3)

[1] 100 6

You have 5 independent variables (x1,x2,x3,x4,x5) and one dependent variable y. Use lm( ) to estimate the regression model

y = B0 + B1x1 + B2x2 + B3x3 + B4x4 + B5x5.

Find the coefficient estimate and 95% confidence interval for each of the Bi. Plot all six estimates and their confidence intervals on the same plot (for easy comparison).

Draw a line at zero so you can easily identify which variables are significant. What conclusions do you draw?

x1<-data3[,1]

x2<-data3[,2]

x3<-data3[,3]

x4<-data3[,4]

x5<-data3[,5]

y<-data3[,6]

fit<-lm(y~x1+x2+x3+x4+x5)

summary(fit)

Residuals:

Min 1Q Median 3Q Max

-5.06003 -1.26856 0.06113 1.45287 4.18022

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.28071 2.92020 0.439 0.6620

x1 3.97565 2.51839 1.579 0.1178

x2 2.94527 0.09469 31.103 <2e-16 ***

x3 -3.17201 2.50976 -1.264 0.2094

x4 -3.96557 0.11176 -35.482 <2e-16 ***

x5 2.97360 1.50515 1.976 0.0511 .

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.91 on 94 degrees of freedom

Multiple R-Squared: 0.9588, Adjusted R-squared: 0.9566

F-statistic: 437.7 on 5 and 94 DF, p-value: < 2.2e-16

Based on these p-values, we should expect the confidence intervals for x2 and x4 to exclude zero. The interval for x5 will be borderline, and the intervals for the other two (x1, x3) will contain zero (not significant).


Finding the Estimates and the 95% Confidence Intervals:

coef<-summary(fit)$coef

ci<-cbind(coef[,1],round(coef[,1]+1.96*coef[,2],3),round(coef[,1]-1.96*coef[,2],3))

colnames(ci)<-c("Coef","Upper CI","Lower CI")

ci

Coef Upper CI Lower CI

(Intercept) 1.280706 7.004 -4.443

x1 3.975652 8.912 -0.960

x2 2.945265 3.131 2.760

x3 -3.172015 1.747 -8.091

x4 -3.965573 -3.747 -4.185

x5 2.973599 5.924 0.024

Now we plot them.

pt.est<-cbind(seq(1,nrow(ci)),ci[,1])

y.low<-min(-1,min(ci)-0.5)

y.high<-max(ci)+0.5

plot(pt.est,pch=16,cex=1.5,xlim=c(0,nrow(ci)+1),ylim=c(y.low,y.high),xaxt="n",xlab="",ylab="Coefficient")

for(i in 1:nrow(ci)){

points(cbind(c(i,i),ci[i,2:3]),col=2,pch=16)

lines(cbind(c(i,i),ci[i,2:3]),col=2)

text(i,y.low,paste("B",i-1))

}

abline(h=0,lwd=2,col="blue")

title("95% Confidence Intervals")

See next page for graph.

Note that the confidence interval for x5 is technically above 0. This significance is unexpected because the p-value for x5 is 0.0511, and it is not a rounding error. The p-values reported are from a t-distribution, not the normal. It is commonly assumed that for large sample sizes you can use z-values in place of t-values; that is, we used 1.96 to find the 95% confidence intervals when technically we should use qt(1-alpha/2,n-p-1) = qt(0.975,94) = 1.9855, with 94 the residual degrees of freedom from the summary above.

This slightly larger number would give us a slightly wider confidence interval.

Our interval for x5 would then contain zero and so not be significant.
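Note that R’s built-in confint( ) computes these intervals with the correct t quantiles directly; a quick check:

confint(fit,level=0.95) # t-based 95% intervals; the x5 interval should now contain zero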

Rebecca Nugent, Department of Statistics, U. of Washington