Homework 4 Solutions, Statistics 112, Fall 2006
This homework is due Thursday, November 2nd at the beginning of class.
1. This problem is based on Dielman, Problem 5.8. The data set MPGWT5.JMP contains data on the number of miles per gallon obtained by a car in city driving (CITYMPG) and the weight of a car in pounds (WEIGHT) for 147 cars listed in the Road and Track October 2002 issue. We would like to model E(CITYMPG|WEIGHT).
(a) Fit a simple linear regression model of Y=CITYMPG on X=WEIGHT. Construct a residual plot. What is the most obvious problem you see with the residual plot compared to what you would expect to see if the ideal simple linear regression model holds?
Bivariate Fit of CITYMPG By WEIGHT
Linear Fit
CITYMPG = 45.271204 - 0.0074892 WEIGHT
Summary of Fit
RSquare / 0.422743
RSquare Adj / 0.418762
Root Mean Square Error / 5.158216
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
The linearity assumption of simple linear regression is violated. High weight cars have mostly positive residuals.
(b) Using Tukey's Bulging rule, try three appropriate transformations to achieve a better fit. Which transformation is best in terms of maximizing R² (equivalently, minimizing the root mean square error)? Does the transformation improve on the simple linear regression model?
By Tukey’s Bulging rule, we will try the transformations sqrt(X), log(X), and 1/X.
Bivariate Fit of CITYMPG By sqrt(WEIGHT)
Linear Fit
CITYMPG = 71.445282 - 0.8895152 sqrt(WEIGHT)
Summary of Fit
RSquare / 0.447506
RSquare Adj / 0.443696
Root Mean Square Error / 5.046364
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 2990.8616 / 2990.86 / 117.4462 / <.0001
Error / 145 / 3692.5398 / 25.47
C. Total / 146 / 6683.4014
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 71.445282 / 4.689627 / 15.23 / <.0001
sqrt(WEIGHT) / -0.889515 / 0.082079 / -10.84 / <.0001
Bivariate Fit of CITYMPG By log( WEIGHT)
Linear Fit
CITYMPG = 229.72854 - 25.870649 log( WEIGHT)
Summary of Fit
RSquare / 0.468821
RSquare Adj / 0.465158
Root Mean Square Error / 4.948062
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 3133.3206 / 3133.32 / 127.9778 / <.0001
Error / 145 / 3550.0808 / 24.48
C. Total / 146 / 6683.4014
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 229.72854 / 18.47092 / 12.44 / <.0001
log( WEIGHT) / -25.87065 / 2.286862 / -11.31 / <.0001
Bivariate Fit of CITYMPG By 1/WEIGHT
Linear Fit
CITYMPG = -5.304672 + 82614.625 1/WEIGHT
Summary of Fit
RSquare / 0.500753
RSquare Adj / 0.49731
Root Mean Square Error / 4.797029
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 3346.7350 / 3346.74 / 145.4376 / <.0001
Error / 145 / 3336.6663 / 23.01
C. Total / 146 / 6683.4014
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -5.304672 / 2.20236 / -2.41 / 0.0173
1/WEIGHT / 82614.625 / 6850.443 / 12.06 / <.0001
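In terms of R square, the 1/Weight transformation is the best (RSquare 0.500753, root mean square error 4.797029). It improves on the simple linear model, which has RSquare 0.422743 and root mean square error 5.158216.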
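As a check on this comparison, RSquare and Root Mean Square Error can be recomputed from the Analysis of Variance tables. The following is a minimal Python sketch; the dictionary and variable names are illustrative and not part of the JMP output. Because all three fits share the same C. Total sum of squares and the same error degrees of freedom, ranking by RSquare and ranking by root mean square error have to agree.

```python
import math

# (Model SS, Error SS, Error DF) copied from the three ANOVA tables above.
anova = {
    "sqrt(WEIGHT)": (2990.8616, 3692.5398, 145),
    "log(WEIGHT)": (3133.3206, 3550.0808, 145),
    "1/WEIGHT": (3346.7350, 3336.6663, 145),
}

summary = {}
for name, (ss_model, ss_error, df_error) in anova.items():
    r2 = ss_model / (ss_model + ss_error)   # RSquare = SS(Model) / SS(C. Total)
    rmse = math.sqrt(ss_error / df_error)   # Root Mean Square Error
    summary[name] = (r2, rmse)

# Largest RSquare and smallest RMSE should pick the same transformation.
best_by_r2 = max(summary, key=lambda k: summary[k][0])
best_by_rmse = min(summary, key=lambda k: summary[k][1])

for name, (r2, rmse) in summary.items():
    print(f"{name:13s} RSquare={r2:.6f}  RMSE={rmse:.6f}")
print("best by RSquare:", best_by_r2, "| best by RMSE:", best_by_rmse)
```

Both criteria select 1/WEIGHT, in agreement with the JMP summaries.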
2. Problem 1 continued.
(a) Use polynomial regressions to model E(CITYMPG|WEIGHT). Use the procedure described in Notes 10 for choosing the best order of the polynomial.
Bivariate Fit of CITYMPG By WEIGHT
Polynomial Fit Degree=2
CITYMPG = 47.665858 - 0.0084975 WEIGHT + 0.0000026 (WEIGHT-3264.44)^2
Summary of Fit
RSquare / 0.484452
RSquare Adj / 0.477292
Root Mean Square Error / 4.891612
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 47.665858 / 2.357422 / 20.22 / <.0001
WEIGHT / -0.008497 / 0.000731 / -11.63 / <.0001
(WEIGHT-3264.44)^2 / 0.0000026 / 6.304e-7 / 4.15 / <.0001
Polynomial Fit Degree=3
CITYMPG = 41.369975 - 0.0067296 WEIGHT + 0.0000047 (WEIGHT-3264.44)^2 - 1.5329e-9 (WEIGHT-3264.44)^3
Summary of Fit
RSquare / 0.510103
RSquare Adj / 0.499826
Root Mean Square Error / 4.785011
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 41.369975 / 3.257562 / 12.70 / <.0001
WEIGHT / -0.00673 / 0.000964 / -6.98 / <.0001
(WEIGHT-3264.44)^2 / 0.0000047 / 9.906e-7 / 4.78 / <.0001
(WEIGHT-3264.44)^3 / -1.533e-9 / 5.6e-10 / -2.74 / 0.0070
Polynomial Fit Degree=4
CITYMPG = 38.130921 - 0.0056485 WEIGHT + 0.0000029 (WEIGHT-3264.44)^2 - 3.1065e-9 (WEIGHT-3264.44)^3 + 9.323e-13 (WEIGHT-3264.44)^4
Summary of Fit
RSquare / 0.516902
RSquare Adj / 0.503294
Root Mean Square Error / 4.768395
Mean of Response / 20.82313
Observations (or Sum Wgts) / 147
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 38.130921 / 3.973443 / 9.60 / <.0001
WEIGHT / -0.005648 / 0.001228 / -4.60 / <.0001
(WEIGHT-3264.44)^2 / 0.0000029 / 0.000002 / 1.81 / 0.0727
(WEIGHT-3264.44)^3 / -3.106e-9 / 1.245e-9 / -2.49 / 0.0138
(WEIGHT-3264.44)^4 / 9.323e-13 / 6.59e-13 / 1.41 / 0.1597
When the order is 3, the third order coefficient in the regression model is significantly different from zero, but when the order is 4, the p-value for the coefficient of the fourth order is 0.1597, thus the best order for the polynomial is 3.
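The reasoning above can be written as a small rule: raise the degree while the highest-order coefficient remains significant, and stop at the first degree where it is not. The sketch below is one reading of that rule, not a transcription of Notes 10. The function name and the 0.05 significance level are assumptions, and "<.0001" is entered as its upper bound 0.0001.

```python
# p-value of the highest-order term at each polynomial degree,
# read from the Parameter Estimates tables above.
highest_term_pvalue = {2: 0.0001, 3: 0.0070, 4: 0.1597}


def choose_degree(pvalues, alpha=0.05):
    """Return the largest degree reached before the highest-order term
    first fails to be significant at level alpha (assumed 0.05)."""
    chosen = 1  # fall back to the straight-line fit
    for degree in sorted(pvalues):
        if pvalues[degree] >= alpha:
            break
        chosen = degree
    return chosen


best_degree = choose_degree(highest_term_pvalue)
print("chosen polynomial degree:", best_degree)
```

With the p-values reported above, the rule stops at degree 4 and returns degree 3.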
(b) Compare the best transformation chosen in Problem 1 to the polynomial regression model chosen in Problem 2(a). Use RSquare Adj to make the comparison. Which model do you prefer?
Note: If the transformation model’s RSquare Adj is within 0.01 of the polynomial regression model’s RSquare Adj, it is reasonable to prefer the transformation model since it is simpler.
RSquare Adj for the degree-3 polynomial regression model chosen in 2(a) is 0.499826, and RSquare Adj for the 1/Weight model is 0.49731. The difference (about 0.0025) is well within 0.01, so we prefer the simpler 1/Weight model.
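For reference, RSquare Adj can be recomputed from RSquare, the number of observations, and the number of predictors, and the 0.01 rule from the note can then be applied directly. This is a minimal sketch using the degree-3 polynomial chosen in 2(a) and the 1/WEIGHT fit; the function and variable names are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 147
adj_cubic = adjusted_r2(0.510103, n, p=3)    # degree-3 polynomial in WEIGHT
adj_inverse = adjusted_r2(0.500753, n, p=1)  # CITYMPG on 1/WEIGHT

tolerance = 0.01  # threshold stated in the note for part (b)
prefer_simpler = (adj_cubic - adj_inverse) <= tolerance

print(f"cubic: {adj_cubic:.6f}  1/WEIGHT: {adj_inverse:.6f}")
print("prefer the 1/WEIGHT model:", prefer_simpler)
```

The recomputed values agree with the JMP output to about six decimal places, and the gap between the two models is roughly 0.0025.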
(c) Examine the residual plot of the model chosen in Problem 2(b). Based on looking at the residual plot, does the model provide an accurate description of E(CITYMPG|WEIGHT)?
Bivariate Fit of CITYMPG By 1/WEIGHT
Linear Fit
CITYMPG = -5.304672 + 82614.625 1/WEIGHT
There are still indications that the mean of the residuals is not zero in every range of 1/Weight: for example, all residuals for 1/Weight between .00042 and .00046 are less than zero, while most residuals right around .0004 are greater than zero. This means that the model does not provide an entirely accurate description of E(CITYMPG|WEIGHT).
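The check used here (is the mean residual close to zero in every range of the explanatory variable?) can also be done numerically by averaging residuals within bins. The sketch below uses synthetic, noise-free data invented for illustration, not the MPGWT5 data: a curved relationship is fitted with a straight line, and the mean residual is computed in the low, middle, and high thirds of x.

```python
# Synthetic data for illustration only (NOT the MPGWT5 data):
# a curved relationship y = -5 + 82000/x, fitted with a straight line in x.
xs = [2000 + 100 * i for i in range(41)]  # 2000, 2100, ..., 6000
ys = [-5 + 82000 / x for x in xs]

# Least-squares line y = b0 + b1 * x.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Mean residual in the low, middle, and high thirds of x (xs is already sorted).
third = n // 3
bins = [residuals[:third], residuals[third:2 * third], residuals[2 * third:]]
bin_means = [sum(b) / len(b) for b in bins]
print("mean residual by third of x:", bin_means)
```

Bin means that are clearly positive at both ends and negative in the middle indicate a pattern in the mean of the residuals. That is the kind of departure from linearity to look for in a residual plot, even though the residuals average to zero overall.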
(d) Using the model chosen in Problem 2(b), predict the CITYMPG of a car of weight 3000 pounds and find a 95% prediction interval.
CITYMPG = -5.304672 + 82614.625/3000 = 22.23
The 95% prediction interval is (12.717, 31.750)
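The point prediction can be verified by direct substitution, and it should sit at the center of the prediction interval, since the interval is symmetric about the predicted value. A minimal sketch with illustrative variable names:

```python
b0, b1 = -5.304672, 82614.625  # fitted intercept and coefficient on 1/WEIGHT
weight = 3000.0
prediction = b0 + b1 / weight

lower, upper = 12.717, 31.750  # 95% prediction interval reported above
midpoint = (lower + upper) / 2
half_width = (upper - lower) / 2

print(f"prediction = {prediction:.4f}")
print(f"interval midpoint = {midpoint:.4f}, half-width = {half_width:.4f}")
```

The half-width is a little under two root mean square errors (4.797029), which is typical for a 95% prediction interval with 145 error degrees of freedom.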
3. Problem 1 continued.
(a) Fit the transformation model E(CITYMPG|WEIGHT) = β0 + β1 log(WEIGHT).
Assuming that in fact E(CITYMPG|WEIGHT) = β0 + β1 log(WEIGHT), explain how to interpret the coefficient on log(WEIGHT).
CITYMPG = 229.72854 - 25.870649 log( WEIGHT)
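A doubling of WEIGHT is associated with a change of -25.870649 × log(2) ≈ -17.93 in the mean of CITYMPG, that is, a decrease of about 17.9 miles per gallon (log is the natural logarithm).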
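The interpretation follows from the difference in fitted means at weights 2w and w: the intercept cancels, and slope × (log(2w) - log(w)) = slope × log(2) remains. The sketch below evaluates this with the natural logarithm; rounding log(2) to 0.69 and the slope to -25.87 gives about -17.85. The 10% figure is an extra illustration, not part of the original solution.

```python
import math

slope = -25.870649  # coefficient on log(WEIGHT), natural log

# Change in mean CITYMPG when WEIGHT is multiplied by a factor k is slope * log(k).
effect_of_doubling = slope * math.log(2)
effect_of_10pct = slope * math.log(1.10)  # illustrative: a 10% heavier car

print(f"doubling WEIGHT: {effect_of_doubling:.2f} mpg")
print(f"10% more WEIGHT: {effect_of_10pct:.2f} mpg")
```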
(b) Fit the transformation model E(log(CITYMPG)|WEIGHT) = β0 + β1 log(WEIGHT).
Assuming that in fact E(log(CITYMPG)|WEIGHT) = β0 + β1 log(WEIGHT), explain how to interpret the coefficient on log(WEIGHT).
log(CITYMPG) = 12.25682 - 1.1473781 log( WEIGHT)
A doubling of WEIGHT is associated with a multiplicative change of 2^(-1.1473781) = 0.4514 in the median of CITYMPG.
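In the log-log model, doubling WEIGHT adds slope × log(2) to the fitted log(CITYMPG), so back-transforming multiplies the fitted value by exp(slope × log(2)) = 2^slope. The sketch below computes that factor and confirms it equals the ratio of back-transformed fitted values at 6000 and 3000 pounds. Those two weights are chosen only for illustration, and reading the back-transformed value as a median assumes the errors on the log scale are symmetric.

```python
import math

intercept, slope = 12.25682, -1.1473781  # fitted log(CITYMPG) on log(WEIGHT)

factor_doubling = 2 ** slope


def fitted_median(weight):
    """Back-transform the fitted log(CITYMPG) to the CITYMPG scale."""
    return math.exp(intercept + slope * math.log(weight))


# Ratio of fitted values for a doubling of weight (illustrative weights).
ratio = fitted_median(6000) / fitted_median(3000)

print(f"multiplicative change for doubling WEIGHT: {factor_doubling:.4f}")
print(f"ratio of fitted values at 6000 vs 3000 lb: {ratio:.4f}")
```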
4. Dielman, Problem 5.6, page 201. The data set is in MOVIES5.JMP . Consider both transformations and polynomials in choosing the best model. Check that your chosen model is reasonable by making sure that the residual plot does not show any strong pattern in the mean of the residuals.
Bivariate Fit of TDOMGROSS By WEEKEND
Polynomial Fit Degree=2
TDOMGROSS = -0.957977 + 4.4917735 WEEKEND - 0.0214679 (WEEKEND-5.00878)^2
Summary of Fit
RSquare / 0.843231
RSquare Adj / 0.842206
Root Mean Square Error / 14.607
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 2 / 351178.94 / 175589 / 822.9560 / <.0001
Error / 306 / 65289.49 / 213
C. Total / 308 / 416468.43
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -0.957977 / 0.997615 / -0.96 / 0.3377
WEEKEND / 4.4917735 / 0.169612 / 26.48 / <.0001
(WEEKEND-5.00878)^2 / -0.021468 / 0.006584 / -3.26 / 0.0012
Polynomial Fit Degree=3
TDOMGROSS = -2.456371 + 3.7387443 WEEKEND + 0.1128957 (WEEKEND-5.00878)^2 - 0.0027876 (WEEKEND-5.00878)^3
Summary of Fit
RSquare / 0.868072
RSquare Adj / 0.866774
Root Mean Square Error / 13.42178
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309
Analysis of Variance
Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 3 / 361524.46 / 120508 / 668.9539 / <.0001
Error / 305 / 54943.97 / 180
C. Total / 308 / 416468.43
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -2.456371 / 0.93775 / -2.62 / 0.0092
WEEKEND / 3.7387443 / 0.184833 / 20.23 / <.0001
(WEEKEND-5.00878)^2 / 0.1128957 / 0.018734 / 6.03 / <.0001
(WEEKEND-5.00878)^3 / -0.002788 / 0.000368 / -7.58 / <.0001
Polynomial Fit Degree=4
TDOMGROSS = -2.949138 + 3.7408212 WEEKEND + 0.1321539 (WEEKEND-5.00878)^2 - 0.0038927 (WEEKEND-5.00878)^3 + 0.0000146 (WEEKEND-5.00878)^4
Summary of Fit
RSquare / 0.868183
RSquare Adj / 0.866449
Root Mean Square Error / 13.43815
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -2.949138 / 1.35115 / -2.18 / 0.0298
WEEKEND / 3.7408212 / 0.185104 / 20.21 / <.0001
(WEEKEND-5.00878)^2 / 0.1321539 / 0.042353 / 3.12 / 0.0020
(WEEKEND-5.00878)^3 / -0.003893 / 0.00221 / -1.76 / 0.0792
(WEEKEND-5.00878)^4 / 0.0000146 / 0.000029 / 0.51 / 0.6124
For the polynomial regression, the best order is 3, and we have the residual plot as shown above. The linearity assumption appears to be reasonable. The constant variance assumption doesn’t seem to hold, as the variance of the residuals increases as weekend values increase.
Bivariate Fit of TDOMGROSS By sqrt(WEEKEND)
Linear Fit
TDOMGROSS = -9.211707 + 19.18211 sqrt(WEEKEND)
Summary of Fit
RSquare / 0.731734
RSquare Adj / 0.73086
Root Mean Square Error / 19.07677
Mean of Response / 20.05893
Observations (or Sum Wgts) / 309
Parameter Estimates
Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -9.211707 / 1.483541 / -6.21 / <.0001
sqrt(WEEKEND) / 19.18211 / 0.662878 / 28.94 / <.0001
When we do the transformation sqrt(weekend), the R squared is substantially less than that of the polynomial regression and the residual plot has a very strong pattern in the mean of the residuals, indicating violations of linearity.
Hence we choose the polynomial regression.
Note: Transformations only work well if the curvature of the relationship between the Y variable and X variable falls into one quadrant for Tukey’s bulging rule. Here, the curvature is in the fourth quadrant for low to middle values of WEEKEND, but in the first quadrant for high values of WEEKEND. For situations in which the curvature of the relationship between Y and X changes as X increases, polynomial regression is best.