Multiple Regression Practice Problems Solutions Stat 112

1. (a) Pennsylvania’s predicted SAT score from the multiple regression model is 932.414 + (-3.074)*50 + 4.299*27.98 = 899.000, where -3.074 is the coefficient on TAKERS and 4.299 is the coefficient on EXPEND. Pennsylvania’s residual is 885 - 899.000 = -14.000.
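The arithmetic can be checked with a short script. This is only a sketch of the calculation: it pairs -3.074 with TAKERS (50) and 4.299 with EXPEND (27.98), which is the pairing that reproduces the stated prediction of 899.000. The variable names are ours.

```python
# Coefficients as reported in the solution for the regression of SAT on TAKERS and EXPEND.
intercept = 932.414
b_takers = -3.074   # coefficient on TAKERS
b_expend = 4.299    # coefficient on EXPEND

# Pennsylvania's values used in the solution.
takers, expend, observed_sat = 50, 27.98, 885

predicted_sat = intercept + b_takers * takers + b_expend * expend
residual = observed_sat - predicted_sat  # observed minus predicted

print(f"predicted = {predicted_sat:.3f}, residual = {residual:.3f}")
```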

(b) To test if the multiple regression model provides better predictions of SAT than just using the sample mean of SAT to predict SAT, we use the overall F test for the usefulness of predictors shown in the analysis of variance table. The null hypothesis for this test is that all of the explanatory variables have coefficient zero and the alternative is that at least one of the explanatory variables has a coefficient that is not zero. The p-value (Prob>F in the Analysis of Variance table) is <.0001. Thus, there is strong evidence that the multiple regression model provides better predictions of SAT than just using the sample mean of SAT to predict SAT.
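For reference, the overall F statistic behind this test can be written in its usual textbook form. The notation is ours, not taken from the output: SST is the total sum of squares, SSE is the residual sum of squares from the multiple regression, n is the number of observations and p is the number of explanatory variables.

```latex
F = \frac{(\mathrm{SST} - \mathrm{SSE})/p}{\mathrm{SSE}/(n - p - 1)},
\qquad
F \sim F_{p,\; n-p-1} \quad \text{under } H_0:\ \beta_1 = \cdots = \beta_p = 0 .
```

A large F (small Prob>F) means the explanatory variables reduce the residual sum of squares by more than would be expected by chance if all the coefficients were zero.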

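(c) An approximate 95% confidence interval for the coefficient on TAKERS is the estimated coefficient plus or minus two standard errors: $\hat{\beta}_{\text{TAKERS}} \pm 2 \cdot SE(\hat{\beta}_{\text{TAKERS}})$, with both quantities read from the TAKERS row of the Parameter Estimates table.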
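The "plus or minus two standard errors" rule can be sketched as a small helper. The function name and the numbers passed to it are hypothetical, chosen only to show the calculation; they are not the values from the regression output.

```python
def approx_95_ci(estimate, std_error):
    """Approximate 95% CI: estimate plus or minus two standard errors."""
    margin = 2 * std_error
    return estimate - margin, estimate + margin

# Hypothetical numbers for illustration only; not taken from the regression output.
lower, upper = approx_95_ci(estimate=-3.0, std_error=0.25)
print(f"({lower:.2f}, {upper:.2f})")
```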

(d) EXPEND helps to predict a state’s average SAT score once TAKERS has been taken into account if the coefficient on EXPEND in the multiple regression of SAT on TAKERS and EXPEND does not equal zero. The t-test in the EXPEND row of the parameter estimates tests the null hypothesis that the coefficient on EXPEND equals zero vs. the alternative hypothesis that it does not equal zero. The p-value for the t-test is .0001. Thus, there is strong evidence that EXPEND helps to predict SAT once TAKERS has been taken into account.

(e) We judge an observation to be influential if its Cook’s distance is greater than 1. If an observation is found to be influential, it is justified to delete it from the analysis if its leverage is greater than 2*p/n, where p is the number of explanatory variables. In that case we report that we omitted the observation and that our conclusions hold only for a reduced range of the explanatory variables, one that does not include the omitted observation’s values. By these criteria, it is justified to delete Alaska from the analysis (Cook’s distance greater than 1, leverage greater than 2*p/n=4/49=.08) but it is not justified to delete South Carolina (Cook’s distance less than 1).
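The two-part rule above can be written as a small check. This is a minimal sketch: the function name and the diagnostic values are hypothetical, and only p = 2 and n = 49 come from the solution.

```python
def deletion_justified(cooks_distance, leverage, n_predictors, n_obs):
    """Apply the rule from the solution: the point must be influential
    (Cook's distance > 1) and have high leverage (leverage > 2*p/n)."""
    influential = cooks_distance > 1
    high_leverage = leverage > 2 * n_predictors / n_obs
    return influential and high_leverage

# Hypothetical diagnostic values for illustration; p = 2 and n = 49 match the solution.
print(deletion_justified(cooks_distance=1.4, leverage=0.30, n_predictors=2, n_obs=49))
print(deletion_justified(cooks_distance=0.4, leverage=0.30, n_predictors=2, n_obs=49))
```

The first call satisfies both conditions; the second fails the Cook’s distance condition, so deletion is not justified regardless of leverage.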

(f) I would choose to use Log(Takers) because it eliminates the curvature. The residual plot from the regression of SAT on Takers has a quadratic pattern in the mean of the residuals whereas the residual plot from the regression of SAT on Log(Takers) does not have a pattern in the mean of the residuals. Also, the $R^2$ for the regression of SAT on Log(Takers) is .811, compared to .736 for the regression of SAT on Takers.

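2. (a) Consider the following regressions:

$E(\text{Accidents} \mid \text{Speed}) = \beta_0^* + \beta_1^* \, \text{Speed}$
$E(\text{Accidents} \mid \text{Speed}, \text{Cars}) = \beta_0 + \beta_1 \, \text{Speed} + \beta_2 \, \text{Cars}$
$E(\text{Cars} \mid \text{Speed}) = \gamma_0 + \gamma_1 \, \text{Speed}$

The omitted variable bias formula tells us that $\beta_1^* = \beta_1 + \beta_2 \gamma_1$.

The output from the regression of Cars on Speed gives the estimate of $\gamma_1$ and hence its sign. I would expect that $\beta_2$ will be greater than zero since it is the change in the mean number of accidents that is associated with an increase in the number of cars by one, holding fixed the average speed. Thus, I would expect the bias term $\beta_2 \gamma_1$ to have the same sign as the estimate of $\gamma_1$, and hence $\beta_1^*$ to differ from $\beta_1$ in that direction.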
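The omitted variable bias formula is an exact identity for least squares estimates, which can be checked numerically. The sketch below uses our own notation: `beta1_short` is the slope on Speed when Cars is omitted, `beta1` and `beta2` are the slopes on Speed and Cars in the regression that includes both, and `gamma1` is the slope from regressing Cars on Speed. The data are made up for illustration and are not the data from the problem.

```python
def centered_sums(u, v):
    """Sum of cross-products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Made-up data for illustration only (not the data from the problem).
speed = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0]
cars = [900.0, 820.0, 860.0, 700.0, 640.0, 600.0]
accidents = [12.0, 11.0, 14.0, 10.0, 11.0, 9.0]

s11 = centered_sums(speed, speed)
s22 = centered_sums(cars, cars)
s12 = centered_sums(speed, cars)
s1y = centered_sums(speed, accidents)
s2y = centered_sums(cars, accidents)

# Long regression: Accidents on Speed and Cars (normal equations for two predictors).
det = s11 * s22 - s12 ** 2
beta1 = (s22 * s1y - s12 * s2y) / det   # slope on Speed
beta2 = (s11 * s2y - s12 * s1y) / det   # slope on Cars

# Short regression: Accidents on Speed alone.
beta1_short = s1y / s11
# Auxiliary regression: Cars on Speed.
gamma1 = s12 / s11

# The two printed values agree up to floating-point rounding.
print(beta1_short, beta1 + beta2 * gamma1)
```

Because the identity holds for any data set, the direction of the bias depends only on the signs of the Cars slope and the Cars-on-Speed slope.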

(b) There is an interaction between Cars and Speed in the multiple regression model if the coefficient on Cars*Speed does not equal zero. The t-test of the null hypothesis that the coefficient on Cars*Speed equals zero (shown in the Parameter Estimates table in the row Cars*Speed) has p-value <.0001. Thus, there is strong evidence of an interaction between Cars and Speed.

The estimates from the multiple regression model are , thus

The estimated decrease in the mean number of accidents for a 5 MPH reduction in the speed limit is greater for weekdays than for weekends. The interaction between Cars and Speed is positive, so a given change in speed has a larger effect on the mean number of accidents when there are more cars, and the average number of cars is greater on weekdays than on weekends.
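The role of the interaction can be sketched in code. In a model with terms Cars, Speed and Cars*Speed, the change in the mean response for a change in Speed at a fixed number of cars is (coefficient on Speed + coefficient on Cars*Speed × cars) × change in Speed. The coefficients and car counts below are hypothetical, chosen only to show the sign pattern; they are not the estimates from the output.

```python
def change_in_mean_accidents(b_speed, b_interaction, cars, delta_speed):
    """Change in the mean response for a change in Speed at a fixed number of cars,
    in a model with terms Cars, Speed and Cars*Speed."""
    return (b_speed + b_interaction * cars) * delta_speed

# Hypothetical coefficients and car counts, chosen only to illustrate the sign pattern.
b_speed, b_interaction = 0.05, 0.0002
weekday = change_in_mean_accidents(b_speed, b_interaction, cars=1000, delta_speed=-5)
weekend = change_in_mean_accidents(b_speed, b_interaction, cars=400, delta_speed=-5)
print(weekday, weekend)
```

With a positive interaction coefficient, the weekday change is more negative than the weekend change, so the same speed reduction gives the larger decrease where traffic is heavier.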

3. (a) The residual plot shows a clear quadratic pattern in the mean of the residuals, indicating a violation of linearity. The scatterplot suggests that quadratic regression might be appropriate. No transformation suggested by Tukey’s Bulging Rule can be used because the curvature covers two quadrants of the circle in Tukey’s Bulging Rule. The residual plot does not suggest any pattern in the spread of residuals, indicating that the constant variance assumption is reasonable. The histogram of the residuals looks roughly bell shaped, indicating that the assumption of normality is reasonable. The box plot of Cook’s distances shows that there is no Cook’s distance that is greater than 1, indicating that there are no strongly influential points.

(b) The quadratic regression provides better predictions of mileage based on speed than the simple linear regression if the coefficient on the $\text{Speed}^2$ term in the multiple regression is not equal to zero. The t-test of the null hypothesis that the coefficient on $\text{Speed}^2$ equals zero (found in the Parameter Estimates table) has p-value <.0001. Thus, there is strong evidence that the quadratic regression provides better predictions of mileage based on speed than the simple linear regression.

The predicted mileages for speeds of 20, 50 and 70 MPH are 22.22, 29.70 and 26.78 respectively. Thus, the model suggests that, of these three speeds, it is best to drive at 50 MPH. This result illustrates the fact that in a quadratic regression model, the mean of Y given X will be increasing for some range of X and decreasing for some range of X.
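Three points determine a parabola, so the three predictions above are enough to recover the fitted quadratic approximately and to locate where it turns from increasing to decreasing. The sketch below does this with Newton divided differences. The function name is ours, and because the predictions are rounded to two decimals, the recovered coefficients and turning point are approximate.

```python
def quadratic_through(points):
    """Return (b0, b1, b2) of the parabola b0 + b1*x + b2*x**2 through three points,
    using Newton divided differences."""
    (x0, y0), (x1, y1), (x2, y2) = points
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    b2 = (d12 - d01) / (x2 - x0)
    b1 = d01 - b2 * (x0 + x1)
    b0 = y0 - b1 * x0 - b2 * x0 ** 2
    return b0, b1, b2

# Predicted mileages at 20, 50 and 70 MPH, as stated in the solution.
b0, b1, b2 = quadratic_through([(20, 22.22), (50, 29.70), (70, 26.78)])

# A parabola with b2 < 0 peaks at x = -b1 / (2 * b2).
peak_speed = -b1 / (2 * b2)
print(f"curve peaks near {peak_speed:.1f} MPH")
```

The recovered curvature is negative, so predicted mileage rises up to the turning point and falls beyond it, which is why 50 MPH beats both 20 and 70 MPH.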