STAT 2607

Assignment #4

DUE: Mon. March 23 in class (Sec A)

Tues. March 24 in class (Sec B)

approx. 100 marks

1. Problem 1 from Assignment 3 Continued

The president of a small chain of corner stores would like to develop a first-order linear model to estimate total annual sales based on the number of sales agents, X1, and the advertising budget (in $1000s), X2. A small sample of size 5 gave the following results.

N.B. Remember that in Assignment 3 you concluded that annual sales was linearly related to at least one of the explanatory variables.

a) Calculate the correlation coefficient between the explanatory variables X1 and X2.

b) Based on your correlation coefficient in (a), would it make sense to interpret b1 as the estimated change in the average value of Y for a unit increase in X1? Why or why not?

c) What feature of the value for b2 might have led you to suspect that X1 and X2 were highly correlated?
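For part (a), the sample correlation between the two explanatory variables can be computed from the usual sums of squares and cross-products. The following is a sketch of the formula, where x_{i1} and x_{i2} denote the i-th observed values of X1 and X2 and n = 5 (this notation is assumed here and may differ from the course text):

```latex
r_{12} = \frac{\sum_{i=1}^{n} (x_{i1}-\bar{x}_1)(x_{i2}-\bar{x}_2)}
              {\sqrt{\sum_{i=1}^{n} (x_{i1}-\bar{x}_1)^2 \;\sum_{i=1}^{n} (x_{i2}-\bar{x}_2)^2}}
```

Values of r_{12} near +1 or -1 signal a strong linear association between the two explanatory variables.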

2. We wish to develop a regression model that predicts how well students will do in their first year at Carleton. The following variables were proposed as possibly being related to first-year GPA (Y).

X1 = age

X2 = high school average mark

X3 = gender

X4 = average no. of hours/week spent in Roosters

X5 = average no. of hours/week spent in class

X6 = average no. of hours/week spent on assignments

X7 = average no. of hours/week spent on studying

X8 = average no. of hours/week spent in paid employment

X9 = no. of courses

X10 = lives at home or not

a) Which explanatory variables are dummy variables?

b) Which ones might be subject to measurement error?

c) Which variables might be correlated with each other?

d) Think of at least 2 other variables that might (or should) be considered.

3. A study was conducted to examine the relationship between university salary, Y, the number of years of experience of the faculty member, X1, and the gender of the faculty member, X2.

For the model where

a) Find the separate equations relating E(Y|X) to X1 for males and for females.


b) Set up the null and alternative hypotheses for testing whether the lines for males and females are parallel.

4. An experiment was conducted to examine the corrosion resistance of 4 different brands of outdoor lacquer for brass lamps. Each lacquer was applied to its own independent random sample of brass lamps (4 samples in all), and all the lamps were put outside. The number of weeks until the first sign of corrosion was recorded for each lamp. The results are shown below.

Lacquer | Weeks until first sign of corrosion | Total
A       | 43  44  45  41  35  46              | 254
B       | 28  29  31  24  25  27              | 164
C       | 39  36  41  43  29  35              | 232
D       | 25  36  28  27  31  34              | 181

a) Set up the ANOVA table and test whether there are any differences in the average time to corrosion among the 4 different lacquers. Use α = .05.

b) If appropriate, use the Tukey multiple comparison test to identify which lacquers differ in average corrosion time. Use α = .05. Make a line summary of your results.
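As a reminder of the quantities that go into parts (a) and (b), here is a sketch of the one-way ANOVA computations for k = 4 lacquers, with n_j lamps for lacquer j and n lamps in total. The notation (SST for treatments, SSE for error, \omega for the Tukey critical difference, n_g for the common sample size per lacquer) is an assumption and may differ from the course text.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k} n_j\,(\bar{x}_j - \bar{\bar{x}})^2, & MST &= \frac{SST}{k-1},\\
SSE &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2, & MSE &= \frac{SSE}{n-k},\\
F &= \frac{MST}{MSE} \sim F_{k-1,\;n-k} \ \text{under } H_0, &
\omega &= q_{\alpha}(k,\;n-k)\,\sqrt{\frac{MSE}{n_g}}
\end{aligned}
```

Under this form of the Tukey procedure, two lacquer means are declared different when the absolute difference between their sample means exceeds \omega.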

5. A hospital administrator wished to study the relation between patient satisfaction (Y) and patient age (X1 in yrs.), severity of illness (X2 an index), and anxiety level (X3 an index). Larger values of Y, X2 and X3 are respectively associated with more satisfaction, increased severity of illness, and more anxiety. The administrator randomly selected 23 patients.

The data are in /CourseWare/Stat2607/patients.dat

Columns 1, 2, 3, 4 contain variables Y, X1, X2, X3 respectively.

Remember the form of your INFILE statement is:

INFILE '/CourseWare/Stat2607/patients.dat';

Remember also to put the following 4 statements at the top of your program:

DM OUTPUT 'CLEAR';

OPTIONS PAGESIZE=40;

OPTIONS LINESIZE=80;

FOOTNOTE 'yourname, studno';

a) Write a SAS program to fit a first-order linear model in the 3 explanatory variables (see SAS Manual Sec. 13.6 & Ex 13.6.1) and print out:

- the correlation matrix giving the pairwise correlation coefficients between all 4 variables (use the CORR option at the end of the PROC REG statement), i.e.

PROC REG LINEPRINTER CORR;

- the variance inflation factors (use the VIF option at the end of the MODEL statement; see p. 122, Ex 13.7.1)

- the XᵀX, XᵀY, and (XᵀX)⁻¹ matrices (use / XPX I at the end of the MODEL statement; see p. 119, Sec. 13.6 & Ex 13.6.1)

N.B. You do NOT need the CLM option for this question.

- residual plots of the ei vs the predicted values ŷi, and of the ei vs each of the explanatory variables. Note that you can do all this in the single statement plot r.*(p. x1 x2 x3); substituting whatever you called your explanatory variables. (See Sec. 13.6 & Sec. 13.4)

- a histogram of the residuals (OUTPUT statement in PROC REG), then use PROC CHART as in Assignment 2.
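The pieces listed above could be assembled roughly as follows. This is only a sketch under stated assumptions: the data set names patients and resids and the variable names y, x1, x2, x3 and resid are placeholders chosen for illustration, and the INPUT statement assumes the four columns are separated by blanks.

```sas
DM OUTPUT 'CLEAR';
OPTIONS PAGESIZE=40;
OPTIONS LINESIZE=80;
FOOTNOTE 'yourname, studno';

/* Read the 4 columns: Y, X1, X2, X3 (names are placeholders) */
DATA patients;
  INFILE '/CourseWare/Stat2607/patients.dat';
  INPUT y x1 x2 x3;
RUN;

/* CORR: correlation matrix; VIF: variance inflation factors;  */
/* XPX and I: cross-products matrices and the inverse          */
PROC REG DATA=patients LINEPRINTER CORR;
  MODEL y = x1 x2 x3 / VIF XPX I;
  PLOT r.*(p. x1 x2 x3);        /* residuals vs predicted and vs each X */
  OUTPUT OUT=resids R=resid;    /* save the residuals for the histogram */
RUN;

/* Histogram of the saved residuals */
PROC CHART DATA=resids;
  VBAR resid;
RUN;
```

The OUTPUT statement writes the residuals to a new data set so that PROC CHART can read them in a separate step.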

b) Write down the estimated regression function for this 3 variable model.

c) Based on the residual plot of the ei vs ŷi and the histogram of the ei, do there seem to be any really serious assumption violations? Explain. Do the residual plots of the ei vs the explanatory variables indicate that any of these variables might need to be modified to correct an extremely obvious assumption violation? Explain.

N.B. For the rest of the question, assume that your residual plots and the histogram of the residuals did not show violations that are serious enough to invalidate estimation and inference.

d) For testing whether there is a linear relationship between Y and the explanatory variables at α = .01: write down the calculated value of the F-statistic along with its p-value. Based on the p-value, would you reject H0? Why or why not?

e) If appropriate, test whether X3 contributes significantly to the model after X1 and X2 have been included. Use α = .01.

f) Would you reject ? How about ? A formal test is not necessary. Just give the calculated value of the test statistic and the reason for your answer in terms of its p-value.

g) Use the information from the output of the xpx option to find the fitted equation for the SLR of Y on X3 (anxiety).

h) Use the output from the i option to show that = 0.821 as given in the parameter estimates table.
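Parts (g) and (h) both rest on reading individual entries out of the printed matrices. The sketch below shows the relationships involved. It assumes that the XᵀX and XᵀY output contains n, \sum X_3, \sum X_3^2, \sum Y and \sum X_3 Y, and that part (h) refers to the standard error of an estimated coefficient b_k; the symbols a, b and c_{kk} are introduced here for illustration only.

```latex
\begin{aligned}
\text{SLR of } Y \text{ on } X_3:\quad
b &= \frac{\sum X_3 Y - \left(\sum X_3\right)\left(\sum Y\right)/n}
          {\sum X_3^2 - \left(\sum X_3\right)^2/n},
\qquad a = \bar{Y} - b\,\bar{X}_3,\\[4pt]
\text{standard error of } b_k:\quad
s_{b_k} &= \sqrt{MSE \cdot c_{kk}},
\qquad c_{kk} = \text{the diagonal element of } (X^{T}X)^{-1} \text{ corresponding to } b_k
\end{aligned}
```

Here MSE is the error mean square of the full three-variable model, taken from its ANOVA table.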

6. Use your output and results from question 5 and the output provided on the following pages to help answer the questions below.

a) Compare the estimated regression coefficients of X1 and X3 for the model containing only X1 and X3 with those found using the full model. What is the factor by which they have changed? Now compare the estimated regression coefficients of X1 and X2 for the model containing only X1 and X2 with those found using the full model. What is the factor by which they have changed?

b) Find SSR(X3 | X1, X2) and compare it with SSR(X3).

c) List all the indications of multicollinearity you can find between the 3 explanatory variables.

d) Would you conclude that, in the model containing all 3 X's, multicollinearity was a problem for interpretation of the regression coefficients?

e) Do X2 and X3 jointly make a significant contribution to the prediction of patient satisfaction in a model that includes X1? Use α = .05.

f) Which model would you choose? Why?
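Parts (b) and (e) use extra sums of squares and the partial F test. The sketch below states the general relationships and then checks them on a case that needs only the supplied listings: adding illness (X2) to the age-only model, using MODEL2 (age, illness) and MODEL3 (age only). The labels "reduced" and "full" and the use of error degrees of freedom df are notation assumed here.

```latex
\begin{aligned}
SSR(X_3 \mid X_1, X_2) &= SSR(X_1, X_2, X_3) - SSR(X_1, X_2)
                        = SSE(X_1, X_2) - SSE(X_1, X_2, X_3),\\[4pt]
F &= \frac{\left[SSE(\text{reduced}) - SSE(\text{full})\right] / \left(df_{\text{reduced}} - df_{\text{full}}\right)}
          {SSE(\text{full}) / df_{\text{full}}},\\[4pt]
\text{check:}\quad SSR(X_2 \mid X_1) &= 4081.21949 - 3678.43585 = 402.78364,\\
F &= \frac{402.78364 / 1}{103.19989} \approx 3.90
\end{aligned}
```

The value 3.90 agrees, up to rounding, with the square of the t value reported for illness in MODEL2 (-1.98), as expected when a single variable is added.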

The SAS System

The REG Procedure
Model: MODEL1
Dependent Variable: satisfied

Number of Observations Read                       24
Number of Observations Used                       23
Number of Observations with Missing Values         1

                        Analysis of Variance

                               Sum of         Mean
Source             DF         Squares       Square    F Value    Pr > F
Model               2      4063.98230   2031.99115      19.53    <.0001
Error              20      2081.23509    104.06175
Corrected Total    22      6145.21739

Root MSE          10.20107    R-Square    0.6613
Dependent Mean    61.34783    Adj R-Sq    0.6275
Coeff Var         16.62824

                        Parameter Estimates

                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      147.07512     16.73345       8.79      <.0001
age           1       -1.24336      0.29612      -4.20      0.0004
anxiety       1      -15.89064      8.25560      -1.92      0.0686

[Plot: The REG Procedure, Model: MODEL1, Dependent Variable: satisfied. RESIDUAL (vertical axis, -20 to 20) vs. Predicted Value of satisfied, PRED (horizontal axis, 35 to 85).]


[Histogram: Frequency (vertical axis, 1 to 7) vs. Residual (bar midpoints -16, -8, 0, 8, 16).]

The REG Procedure
Model: MODEL2
Dependent Variable: satisfied

Number of Observations Read                       24
Number of Observations Used                       23
Number of Observations with Missing Values         1

                        Analysis of Variance

                               Sum of         Mean
Source             DF         Squares       Square    F Value    Pr > F
Model               2      4081.21949   2040.60975      19.77    <.0001
Error              20      2063.99790    103.19989
Corrected Total    22      6145.21739

Root MSE          10.15873    R-Square    0.6641
Dependent Mean    61.34783    Adj R-Sq    0.6305
Coeff Var         16.55924

                        Parameter Estimates

                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      166.59133     24.90844       6.69      <.0001
age           1       -1.26046      0.28919      -4.36      0.0003
illness       1       -1.08932      0.55139      -1.98      0.0622

[Plot: The REG Procedure, Model: MODEL2, Dependent Variable: satisfied. RESIDUAL (vertical axis, -20 to 20) vs. Predicted Value of satisfied, PRED (horizontal axis, 30 to 85).]

[Histogram: Frequency (vertical axis, 1 to 8) vs. Residual (bar midpoints -16, -8, 0, 8, 16).]

The REG Procedure
Model: MODEL3
Dependent Variable: satisfied

Number of Observations Read                       24
Number of Observations Used                       23
Number of Observations with Missing Values         1

                        Analysis of Variance

                               Sum of         Mean
Source             DF         Squares       Square    F Value    Pr > F
Model               1      3678.43585   3678.43585      31.31    <.0001
Error              21      2466.78154    117.46579
Corrected Total    22      6145.21739

Root MSE          10.83816    R-Square    0.5986
Dependent Mean    61.34783    Adj R-Sq    0.5795
Coeff Var         17.66674

                        Parameter Estimates

                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1      121.83182     11.04221      11.03      <.0001
age           1       -1.52704      0.27288      -5.60      <.0001

[Plot: The REG Procedure, Model: MODEL3, Dependent Variable: satisfied. RESIDUAL (vertical axis, -20 to 20) vs. Predicted Value of satisfied, PRED (horizontal axis, 35 to 80).]

[Histogram: Frequency (vertical axis, 1 to 8) vs. Residual (bar midpoints -16, -8, 0, 8, 16).]

7. (Question 6 continued) Using SAS, fit your chosen model again, this time using the CLM and CLI options at the end of the MODEL statement (see Sec. 13.5, p. 120) to have confidence intervals printed out for the mean values and the individual values of patient satisfaction.
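A refit with the two interval options might look like the sketch below. The regressors shown (x1 x2) are placeholders only, not a recommendation of which model to choose, and the data set and variable names are the same illustrative assumptions as before.

```sas
/* Read the 4 columns: Y, X1, X2, X3 (names are placeholders) */
DATA patients;
  INFILE '/CourseWare/Stat2607/patients.dat';
  INPUT y x1 x2 x3;
RUN;

PROC REG DATA=patients;
  /* CLM: confidence limits for the mean of Y at each observation   */
  /* CLI: prediction limits for an individual value of Y            */
  /* Replace x1 x2 with the variables in your chosen model          */
  MODEL y = x1 x2 / CLM CLI;
RUN;
```

Both sets of limits are 95% limits unless a different level is requested. An observation with a missing Y is left out of the fit, but its predicted value and limits are still printed, which is useful for a case entered only to obtain a prediction.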

For your chosen model

a) Find a 95% C.I. estimate for β1.

b) Estimate with 95% confidence the average patient satisfaction for 35-year-old patients with a severity of illness level of 60 and an anxiety index of 2.0.

c) Predict with 95% confidence the patient satisfaction of Ms. Brown, if she is 48 years old and has X2 = 60 and X3 = 2.7.

N.B. Observation 24 with a missing value for Y is the one with X1 = 48, X2 = 60, X3 = 2.7.
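The three intervals in this question have the following general forms. This is a sketch in assumed notation: n = 23 is the number of observations used, p is the number of parameters (including the intercept) in the chosen model, \hat{Y}_h is the fitted value at the X values of interest, and s_{b_1} and s_{\hat{Y}_h} are the corresponding standard errors.

```latex
\begin{aligned}
\text{(a) coefficient:}\quad & b_1 \pm t_{\alpha/2,\;n-p}\; s_{b_1},\\
\text{(b) mean response:}\quad & \hat{Y}_h \pm t_{\alpha/2,\;n-p}\; s_{\hat{Y}_h},\\
\text{(c) individual prediction:}\quad & \hat{Y}_h \pm t_{\alpha/2,\;n-p}\; \sqrt{MSE + s_{\hat{Y}_h}^{2}}
\end{aligned}
```

The prediction interval in (c) is always wider than the interval in (b) at the same X values because it adds the variability of a single observation, MSE, to the uncertainty in the estimated mean.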