Homework 4 Solutions, Statistics 112, Fall 2006

This homework is due Thursday, November 2nd at the beginning of class.

1. This problem is based on Dielman, Problem 5.8. The data set MPGWT5.JMP contains data on the number of miles per gallon obtained by a car in city driving (CITYMPG) and the weight of a car in pounds (WEIGHT) for 147 cars listed in the Road and Track October 2002 issue. We would like to model E(CITYMPG|WEIGHT).

(a) Fit a simple linear regression model of Y=CITYMPG on X=WEIGHT. Construct a residual plot. What is the most obvious problem you see with the residual plot compared to what you would expect to see if the ideal simple linear regression model holds?

Bivariate Fit of CITYMPG By WEIGHT

Linear Fit

CITYMPG = 45.271204 - 0.0074892 WEIGHT

Summary of Fit

RSquare / 0.422743
RSquare Adj / 0.418762
Root Mean Square Error / 5.158216
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

The linearity assumption of simple linear regression is violated. High weight cars have mostly positive residuals.

(b) Using Tukey's Bulging rule, try three appropriate transformations to achieve a better fit. Which transformation is best in terms of maximizing the RSquare (equivalently, minimizing the root mean square error)? Does the transformation improve on the simple linear regression model?

By Tukey’s Bulging rule, we will try the transformations sqrt(X), log(X), and 1/X.
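All three candidates move X down the ladder of powers. As a minimal sketch of why that can help, the following uses synthetic, noise-free data (not the MPGWT5 data) in which the response is exactly linear in 1/x; the names `r_squared`, `candidates`, `weights`, and `mpg` are hypothetical and chosen only for illustration.

```python
import math

def r_squared(xs, ys):
    """R-squared of the least-squares line of ys on xs (squared correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Synthetic weights and a response that is exactly linear in 1/weight.
# These numbers are assumptions for illustration, not the homework data.
weights = list(range(1500, 6001, 100))
mpg = [-5.0 + 80000.0 / w for w in weights]

candidates = {
    "x": lambda w: w,
    "sqrt(x)": math.sqrt,
    "log(x)": math.log,
    "1/x": lambda w: 1.0 / w,
}

# R-squared of a straight-line fit of mpg on each transformed predictor.
r2 = {name: r_squared([f(w) for w in weights], mpg) for name, f in candidates.items()}
best = max(r2, key=r2.get)
```

With real data the ranking depends on the noise and on the true shape of the curve, which is why the three fits are compared below.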

Bivariate Fit of CITYMPG By sqrt(WEIGHT)

Linear Fit

CITYMPG = 71.445282 - 0.8895152 sqrt(WEIGHT)

Summary of Fit

RSquare / 0.447506
RSquare Adj / 0.443696
Root Mean Square Error / 5.046364
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 2990.8616 / 2990.86 / 117.4462 / <.0001
Error / 145 / 3692.5398 / 25.47
C. Total / 146 / 6683.4014

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / 71.445282 / 4.689627 / 15.23 / <.0001
sqrt(WEIGHT) / -0.889515 / 0.082079 / -10.84 / <.0001

Bivariate Fit of CITYMPG By log( WEIGHT)

Linear Fit

CITYMPG = 229.72854 - 25.870649 log( WEIGHT)

Summary of Fit

RSquare / 0.468821
RSquare Adj / 0.465158
Root Mean Square Error / 4.948062
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 3133.3206 / 3133.32 / 127.9778 / <.0001
Error / 145 / 3550.0808 / 24.48
C. Total / 146 / 6683.4014

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / 229.72854 / 18.47092 / 12.44 / <.0001
log( WEIGHT) / -25.87065 / 2.286862 / -11.31 / <.0001

Bivariate Fit of CITYMPG By 1/WEIGHT

Linear Fit

CITYMPG = -5.304672 + 82614.625 1/WEIGHT

Summary of Fit

RSquare / 0.500753
RSquare Adj / 0.49731
Root Mean Square Error / 4.797029
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 3346.7350 / 3346.74 / 145.4376 / <.0001
Error / 145 / 3336.6663 / 23.01
C. Total / 146 / 6683.4014

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / -5.304672 / 2.20236 / -2.41 / 0.0173
1/WEIGHT / 82614.625 / 6850.443 / 12.06 / <.0001

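In terms of RSquare, the 1/WEIGHT transformation is the best of the three (RSquare 0.500753 and root mean square error 4.797029, versus 0.468821 and 4.948062 for log(WEIGHT), and 0.447506 and 5.046364 for sqrt(WEIGHT)). It also improves on the simple linear regression model, which has RSquare 0.422743 and root mean square error 5.158216.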
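The RSquare and root mean square error values can be recovered from the Analysis of Variance tables above, since RSquare is the model sum of squares divided by the total, and the root mean square error is the square root of the error mean square. The sketch below does this check; the names `anova` and `fit_summary` are hypothetical.

```python
import math

# (Model SS, Error SS, Error DF) copied from the ANOVA tables above.
anova = {
    "sqrt(WEIGHT)": (2990.8616, 3692.5398, 145),
    "log(WEIGHT)": (3133.3206, 3550.0808, 145),
    "1/WEIGHT": (3346.7350, 3336.6663, 145),
}

def fit_summary(ss_model, ss_error, df_error):
    """Return (RSquare, root mean square error) from ANOVA quantities."""
    r2 = ss_model / (ss_model + ss_error)
    rmse = math.sqrt(ss_error / df_error)
    return r2, rmse

summaries = {name: fit_summary(*row) for name, row in anova.items()}

# The transformation with the largest RSquare also has the smallest RMSE,
# because all three models share the same total sum of squares and error DF.
best = max(summaries, key=lambda name: summaries[name][0])
```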

2. Problem 1 continued.

(a) Use polynomial regressions to model E(CITYMPG|WEIGHT). Use the procedure described in Notes 10 for choosing the best order of the polynomial.

Bivariate Fit of CITYMPG By WEIGHT

Polynomial Fit Degree=2

CITYMPG = 47.665858 - 0.0084975 WEIGHT + 0.0000026 (WEIGHT-3264.44)^2

Summary of Fit

RSquare / 0.484452
RSquare Adj / 0.477292
Root Mean Square Error / 4.891612
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / 47.665858 / 2.357422 / 20.22 / <.0001
WEIGHT / -0.008497 / 0.000731 / -11.63 / <.0001
(WEIGHT-3264.44)^2 / 0.0000026 / 6.304e-7 / 4.15 / <.0001

Polynomial Fit Degree=3

CITYMPG = 41.369975 - 0.0067296 WEIGHT + 0.0000047 (WEIGHT-3264.44)^2 - 1.5329e-9 (WEIGHT-3264.44)^3

Summary of Fit

RSquare / 0.510103
RSquare Adj / 0.499826
Root Mean Square Error / 4.785011
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / 41.369975 / 3.257562 / 12.70 / <.0001
WEIGHT / -0.00673 / 0.000964 / -6.98 / <.0001
(WEIGHT-3264.44)^2 / 0.0000047 / 9.906e-7 / 4.78 / <.0001
(WEIGHT-3264.44)^3 / -1.533e-9 / 5.6e-10 / -2.74 / 0.0070

Polynomial Fit Degree=4

CITYMPG = 38.130921 - 0.0056485 WEIGHT + 0.0000029 (WEIGHT-3264.44)^2 - 3.1065e-9 (WEIGHT-3264.44)^3 + 9.323e-13 (WEIGHT-3264.44)^4

Summary of Fit

RSquare / 0.516902
RSquare Adj / 0.503294
Root Mean Square Error / 4.768395
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / 38.130921 / 3.973443 / 9.60 / <.0001
WEIGHT / -0.005648 / 0.001228 / -4.60 / <.0001
(WEIGHT-3264.44)^2 / 0.0000029 / 0.000002 / 1.81 / 0.0727
(WEIGHT-3264.44)^3 / -3.106e-9 / 1.245e-9 / -2.49 / 0.0138
(WEIGHT-3264.44)^4 / 9.323e-13 / 6.59e-13 / 1.41 / 0.1597

When the order is 3, the third-order coefficient is significantly different from zero (p-value 0.0070). When the order is 4, the p-value for the fourth-order coefficient is 0.1597, so the fourth-order term is not needed. Thus the best order for the polynomial is 3.
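The selection rule applied here can be sketched as follows. This assumes the procedure is "raise the degree until the highest-order term is no longer significant, then keep the previous degree", with a 0.05 significance level; both the level and the function name `choose_order` are assumptions, since Notes 10 is not reproduced in this document.

```python
ALPHA = 0.05  # assumed significance level; not stated in this document

# p-value of the highest-order term at each degree, from the tables above.
# The reported "<.0001" is entered as 0.0001, which is an upper bound.
highest_term_p = {2: 0.0001, 3: 0.0070, 4: 0.1597}

def choose_order(p_by_degree, alpha=ALPHA):
    """Return the last degree before the highest-order term stops being significant."""
    chosen = 1  # fall back to the straight line if even degree 2 is not significant
    for degree in sorted(p_by_degree):
        if p_by_degree[degree] >= alpha:
            break
        chosen = degree
    return chosen

order = choose_order(highest_term_p)
```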

(b) Compare the best transformation chosen in Problem 1 to the polynomial regression model chosen in Problem 2(a). Use RSquare Adj to make the comparison. Which model do you prefer?

Note: If the transformation model’s RSquare Adj is within 0.01 of the polynomial regression model’s RSquare Adj, it is reasonable to prefer the transformation model since it is simpler.

RSquare Adj for the degree-3 polynomial regression model is 0.499826, and RSquare Adj for the 1/WEIGHT model is 0.49731. The two values are within 0.01 of each other, so by the note above we prefer the simpler 1/WEIGHT model.
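As a check on this comparison, the RSquare Adj values in the JMP output for the degree-3 fit and the 1/WEIGHT fit can be reproduced from RSquare, the sample size, and the number of predictors. The sketch below uses the standard adjusted R-squared formula; the names `adjusted_r2` and `prefer_transformation` are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

N = 147  # number of cars

adj_cubic = adjusted_r2(0.510103, N, 3)    # degree-3 polynomial: 3 predictors
adj_inverse = adjusted_r2(0.500753, N, 1)  # 1/WEIGHT model: 1 predictor

TOLERANCE = 0.01  # threshold from the note in part (b)

# Prefer the simpler transformation model unless the polynomial is
# better by more than the tolerance.
prefer_transformation = (adj_cubic - adj_inverse) <= TOLERANCE
```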

(c) Examine the residual plot of the model chosen in Problem 2(b). Based on looking at the residual plot, does the model provide an accurate description of E(CITYMPG|WEIGHT)?

Bivariate Fit of CITYMPG By 1/WEIGHT

Linear Fit

CITYMPG = -5.304672 + 82614.625 1/WEIGHT

There are still indications that the mean of the residuals is not zero over every range of 1/WEIGHT. For example, all residuals for 1/WEIGHT between .00042 and .00046 are less than zero, while most residuals right around .0004 are greater than zero. This means that the model does not provide an entirely accurate description of E(CITYMPG|WEIGHT).

(d) Using the model chosen in Problem 2(b), predict the CITYMPG of a car of weight 3000 pounds and find a 95% prediction interval.

CITYMPG = -5.304672 + 82614.625/3000 ≈ 22.23

The 95% prediction interval is (12.717, 31.750)
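The point prediction can be checked directly from the fitted coefficients. The interval itself comes from JMP and depends on the standard error of prediction, which is not reproduced here, so the sketch below only confirms that the reported interval is centered on the point prediction. Variable names are hypothetical.

```python
b0, b1 = -5.304672, 82614.625  # fitted 1/WEIGHT model from above
weight = 3000.0

# Point prediction of CITYMPG for a 3000-pound car.
prediction = b0 + b1 / weight

# 95% prediction interval as reported by JMP.
lower, upper = 12.717, 31.750
midpoint = (lower + upper) / 2
half_width = (upper - lower) / 2
```

The half-width of roughly 9.5 miles per gallon is about twice the root mean square error of 4.797, as expected for a 95% prediction interval near the center of the data.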

3. Problem 1 continued.

(a) Fit the transformation model E(CITYMPG|WEIGHT) = β0 + β1 log(WEIGHT).

Assuming that in fact E(CITYMPG|WEIGHT) = β0 + β1 log(WEIGHT), explain how to interpret the coefficient on log(WEIGHT).

CITYMPG = 229.72854 - 25.870649 log( WEIGHT)

A doubling of WEIGHT is associated with a change of β1 × log(2) = -25.870649 × 0.6931 ≈ -17.93 in the mean of CITYMPG, that is, a decrease of about 17.9 miles per gallon.
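The doubling interpretation follows from a one-line derivation, sketched here assuming that log is the natural logarithm and writing w for an arbitrary weight (a symbol introduced only for this derivation):

```latex
\begin{align*}
E(\text{CITYMPG}\mid \text{WEIGHT}=2w) - E(\text{CITYMPG}\mid \text{WEIGHT}=w)
  &= \beta_1 \log(2w) - \beta_1 \log(w) \\
  &= \beta_1 \log 2 \\
  &\approx -25.870649 \times 0.6931 \approx -17.93 .
\end{align*}
```

The difference does not depend on w, which is why the coefficient can be interpreted in terms of doublings. Rounding log 2 to 0.69 gives -17.85 instead.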

(b) Fit the transformation model E(log(CITYMPG)|WEIGHT) = β0 + β1 log(WEIGHT).

Assuming that in fact E(log(CITYMPG)|WEIGHT) = β0 + β1 log(WEIGHT), explain how to interpret the coefficient on log(WEIGHT).

log(CITYMPG) = 12.25682 - 1.1473781 log( WEIGHT)

A doubling of WEIGHT is associated with a multiplicative change of 2^β1 = 2^(-1.1473781) = 0.4514 in the median of CITYMPG.
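In the log-log model, multiplying WEIGHT by a factor c multiplies the median of CITYMPG by c raised to the power of the slope. The sketch below evaluates this for a doubling and, as an additional illustration not in the original solution, for a 10% increase in WEIGHT; the function name `median_multiplier` is hypothetical.

```python
b1 = -1.1473781  # slope of log(CITYMPG) on log(WEIGHT) from the fit above

def median_multiplier(factor, slope=b1):
    """Multiplicative change in median CITYMPG when WEIGHT is multiplied by factor."""
    return factor ** slope

doubling = median_multiplier(2.0)     # WEIGHT doubles
ten_percent = median_multiplier(1.1)  # WEIGHT rises by 10%

# Express the doubling effect as a percentage drop in the median.
percent_drop_doubling = (1 - doubling) * 100
```

Under this model, a car twice as heavy has a median CITYMPG about 55% lower.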

4. Dielman, Problem 5.6, page 201. The data set is in MOVIES5.JMP. Consider both transformations and polynomials in choosing the best model. Check that your chosen model is reasonable by making sure that the residual plot does not show any strong pattern in the mean of the residuals.

Bivariate Fit of TDOMGROSS By WEEKEND

Polynomial Fit Degree=2

TDOMGROSS = -0.957977 + 4.4917735 WEEKEND - 0.0214679 (WEEKEND-5.00878)^2

Summary of Fit

RSquare / 0.843231
RSquare Adj / 0.842206
Root Mean Square Error / 14.607
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 2 / 351178.94 / 175589 / 822.9560 / <.0001
Error / 306 / 65289.49 / 213
C. Total / 308 / 416468.43

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / -0.957977 / 0.997615 / -0.96 / 0.3377
WEEKEND / 4.4917735 / 0.169612 / 26.48 / <.0001
(WEEKEND-5.00878)^2 / -0.021468 / 0.006584 / -3.26 / 0.0012

Polynomial Fit Degree=3

TDOMGROSS = -2.456371 + 3.7387443 WEEKEND + 0.1128957 (WEEKEND-5.00878)^2 - 0.0027876 (WEEKEND-5.00878)^3

Summary of Fit

RSquare / 0.868072
RSquare Adj / 0.866774
Root Mean Square Error / 13.42178
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 3 / 361524.46 / 120508 / 668.9539 / <.0001
Error / 305 / 54943.97 / 180
C. Total / 308 / 416468.43

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / -2.456371 / 0.93775 / -2.62 / 0.0092
WEEKEND / 3.7387443 / 0.184833 / 20.23 / <.0001
(WEEKEND-5.00878)^2 / 0.1128957 / 0.018734 / 6.03 / <.0001
(WEEKEND-5.00878)^3 / -0.002788 / 0.000368 / -7.58 / <.0001

Polynomial Fit Degree=4

TDOMGROSS = -2.949138 + 3.7408212 WEEKEND + 0.1321539 (WEEKEND-5.00878)^2 - 0.0038927 (WEEKEND-5.00878)^3 + 0.0000146 (WEEKEND-5.00878)^4

Summary of Fit

RSquare / 0.868183
RSquare Adj / 0.866449
Root Mean Square Error / 13.43815
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / -2.949138 / 1.35115 / -2.18 / 0.0298
WEEKEND / 3.7408212 / 0.185104 / 20.21 / <.0001
(WEEKEND-5.00878)^2 / 0.1321539 / 0.042353 / 3.12 / 0.0020
(WEEKEND-5.00878)^3 / -0.003893 / 0.00221 / -1.76 / 0.0792
(WEEKEND-5.00878)^4 / 0.0000146 / 0.000029 / 0.51 / 0.6124

For the polynomial regression, the best order is 3: the third-order coefficient is significant (p-value <.0001), while the fourth-order coefficient is not (p-value 0.6124). In the residual plot for the degree-3 fit, the linearity assumption appears to be reasonable. The constant variance assumption doesn’t seem to hold, as the variance of the residuals increases as WEEKEND increases.

Bivariate Fit of TDOMGROSS By sqrt(WEEKEND)

Linear Fit

TDOMGROSS = -9.211707 + 19.18211 sqrt(WEEKEND)

Summary of Fit

RSquare / 0.731734
RSquare Adj / 0.73086
Root Mean Square Error / 19.07677
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t| /
Intercept / -9.211707 / 1.483541 / -6.21 / <.0001
sqrt(WEEKEND) / 19.18211 / 0.662878 / 28.94 / <.0001

When we do the transformation sqrt(weekend), the R squared is substantially less than that of the polynomial regression and the residual plot has a very strong pattern in the mean of the residuals, indicating violations of linearity.

Hence we choose the polynomial regression.
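To use the chosen degree-3 fit for prediction, note that JMP writes the higher-order terms in centered form, (WEEKEND-5.00878)^k, while the linear term is uncentered. A minimal sketch of evaluating the fitted equation follows; the function name `tdomgross_cubic` and the example input of 20 are hypothetical, and 5.00878 is presumably the sample mean of WEEKEND.

```python
CENTER = 5.00878  # centering constant JMP uses for the squared and cubed terms

def tdomgross_cubic(weekend):
    """Degree-3 fit for TDOMGROSS, with coefficients copied from the output above."""
    d = weekend - CENTER  # only the higher-order terms are centered
    return (-2.456371
            + 3.7387443 * weekend
            + 0.1128957 * d ** 2
            - 0.0027876 * d ** 3)

at_center = tdomgross_cubic(CENTER)  # higher-order terms vanish here
at_twenty = tdomgross_cubic(20.0)    # illustrative WEEKEND value
```

Because the cubic coefficient is negative, the fitted curve eventually bends downward, so predictions far beyond the observed range of WEEKEND should not be trusted.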

Note: Transformations only work well if the curvature of the relationship between the Y variable and X variable falls into one quadrant for Tukey’s bulging rule. Here, the curvature is in the fourth quadrant for low to middle values of WEEKEND, but in the first quadrant for high values of WEEKEND. For situations in which the curvature of the relationship between Y and X changes as X increases, polynomial regression is best.