Polynomial Models Example: 20.36 and 20.37

The number of car accidents on a particular stretch of highway seems to be related to the number of vehicles that travel over it and the speed at which they are traveling. A city alderman has decided to ask the county sheriff to provide him with statistics covering the last few years, with the intention of examining these data statistically so that he can (if possible) introduce new speed laws that will reduce traffic accidents. Using the number of accidents as the dependent variable, he obtains estimates of the number of cars passing along a stretch of road and their average speeds (in miles per hour). The observations for 60 randomly selected days are stored in file Xr20-36.

We first consider a first-order model.

Summary of Fit

RSquare / 0.055485
RSquare Adj / 0.022344
Root Mean Square Error / 2.408018
Mean of Response / 7.033333
Observations (or Sum Wgts) / 60

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 2 / 19.41598 / 9.70799 / 1.6742 / 0.1965
Error / 57 / 330.51735 / 5.79855
C. Total / 59 / 349.93333

Lack Of Fit

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Lack Of Fit / 56 / 330.01735 / 5.89317 / 11.7863 / 0.2281
Pure Error / 1 / 0.50000 / 0.50000
Total Error / 57 / 330.51735

Max RSq / 0.9986

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -12.87191 / 13.79006 / -0.93 / 0.3545
Cars / 0.3732604 / 0.258746 / 1.44 / 0.1546
Speed / 0.2699408 / 0.223214 / 1.21 / 0.2315

Effect Tests

Source / Nparm / DF / Sum of Squares / F Ratio / Prob > F
Cars / 1 / 1 / 12.066931 / 2.0810 / 0.1546
Speed / 1 / 1 / 8.480329 / 1.4625 / 0.2315

Residual by Predicted Plot

The model does not appear to be valid. The p-value for the F-test that cars and speed are useful as a group for predicting accidents is 0.1965. However, even if the model does not appear to be valid, we should still perform diagnostics: perhaps the reason the model does not appear to be valid is that one of the assumptions of multiple linear regression is violated.
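The F ratio and its p-value can be checked by hand from the Analysis of Variance table. The following is a minimal Python sketch; the function names are our own, and the p-value formula is a shortcut that holds only when the numerator has 2 degrees of freedom, as it does here. For other degrees of freedom a statistics library would be needed.

```python
def f_ratio(ss_model, df_model, ss_error, df_error):
    """Overall F statistic: model mean square divided by error mean square."""
    return (ss_model / df_model) / (ss_error / df_error)


def p_value_df2(f, df_error):
    """Upper-tail F probability; this closed form holds ONLY for 2 numerator df."""
    return (1.0 + 2.0 * f / df_error) ** (-df_error / 2.0)


# Sums of squares and degrees of freedom copied from the
# Analysis of Variance table of the first-order model (60 observations).
f_stat = f_ratio(19.41598, 2, 330.51735, 57)
p_value = p_value_df2(f_stat, 57)
print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
```

The printed values should agree with the F Ratio and Prob > F reported by JMP, up to rounding.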

First, let’s examine whether there are any influential points or outliers. The following is a plot of the Cook’s Distances (obtained by clicking the red triangle next to Response Accidents, clicking Save Columns and Cook’s D influence, then clicking on Graph, Overlay and putting Cook’s D Influence Accidents into Y and leaving X blank).

Overlay Plot

Point 45 has a much higher Cook’s Distance than any other point. Its cars and speed values are not unusual, but its response (accidents) is very unusual: it equals -1! A count of accidents cannot be negative, so this looks like a recording mistake. We should see what happens if we exclude point 45.
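JMP computes the Cook’s Distances for us. To see what the statistic measures, here is a minimal Python sketch that computes it from its leave-one-out definition: refit the model without observation i and measure how far all the fitted values move, scaled by the number of parameters times the mean square error. The data below are made up for illustration (they are not the Xr20-36 data), with an impossible response of -1 planted at row index 4, and all function names are our own.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(X, b):
    """Fitted values for every row of X."""
    return [sum(xj * bj for xj, bj in zip(row, b)) for row in X]


def cooks_distances(X, y):
    """Cook's distance of each observation via leave-one-out refits."""
    n, p = len(X), len(X[0])
    fitted = predict(X, ols(X, y))
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        b_without_i = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        refitted = predict(X, b_without_i)
        shift = sum((f - g) ** 2 for f, g in zip(fitted, refitted))
        distances.append(shift / (p * mse))
    return distances


# Hypothetical data: row index 4 has a planted, impossible response of -1.
cars = [40, 42, 44, 46, 48, 50, 52, 54]
speed = [55, 60, 52, 58, 54, 61, 53, 57]
accidents = [5, 6, 6, 7, -1, 8, 8, 9]

X = [[1.0, c, s] for c, s in zip(cars, speed)]  # intercept, Cars, Speed
d = cooks_distances(X, accidents)
most_influential = max(range(len(d)), key=lambda i: d[i])
print(most_influential, [round(v, 3) for v in d])
```

On this toy data the planted row should have the largest distance even though its cars and speed values are unremarkable, which mirrors what we see for point 45.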

Summary of Fit

RSquare / 0.096935
RSquare Adj / 0.064683
Root Mean Square Error / 2.141203
Mean of Response / 7.169492
Observations (or Sum Wgts) / 59

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 2 / 27.55913 / 13.7796 / 3.0055 / 0.0576
Error / 56 / 256.74595 / 4.5847
C. Total / 58 / 284.30508

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 0.9768679 / 12.73884 / 0.08 / 0.9391
Cars / 0.5775096 / 0.235643 / 2.45 / 0.0174
Speed / 0.0079135 / 0.208954 / 0.04 / 0.9699

Effect Tests

Source / Nparm / DF / Sum of Squares / F Ratio / Prob > F
Cars / 1 / 1 / 27.537513 / 6.0063 / 0.0174
Speed / 1 / 1 / 0.006576 / 0.0014 / 0.9699

Residual by Predicted Plot

Now the model appears much closer to being valid, with a p-value of 0.0576 for the F-test that cars and speed are useful for predicting accidents. There appears to be a slight indication of a pattern in the residual by predicted plot, with low residuals for a high number of accidents predicted. To try to isolate any problems and improve the model, we make plots of the residuals versus cars, speed and cars*speed (this last residual plot will help us spot whether there is a need for an interaction term). We do this by first saving the residuals (click on the red triangle next to Response Accidents, click on Save Columns and click Residuals) and then clicking on Graph, Overlay and putting Residuals Accidents into Y and the appropriate variable into X.

Overlay Plot

Overlay Plot

Overlay Plot

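In each of these plots, there is a slight indication of a pattern in the residuals, especially in the cars plot. We should try a second-order model, which adds the interaction term Cars*Speed and the squared terms Cars Squared and Speed Squared to the first-order model.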
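A second-order model in two predictors uses the two original variables plus their product and their squares. The following minimal Python sketch shows how those extra columns are built from Cars and Speed; the function name and the two sample rows are hypothetical, and the term labels simply follow the JMP output.

```python
def second_order_terms(cars, speed):
    """Return the five regressors of a full second-order model for one observation."""
    return {
        "Cars": cars,
        "Speed": speed,
        "Cars*Speed": cars * speed,    # interaction term
        "Cars Squared": cars ** 2,     # quadratic term in cars
        "Speed Squared": speed ** 2,   # quadratic term in speed
    }


# Hypothetical (cars, speed) pairs, used only to show the construction.
rows = [second_order_terms(c, s) for c, s in [(50, 60), (45, 55)]]
for row in rows:
    print(row)
```

Each new column is then entered into the regression alongside Cars and Speed, so the model has five terms plus the intercept.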

Summary of Fit

RSquare / 0.700252
RSquare Adj / 0.671974
Root Mean Square Error / 1.268039
Mean of Response / 7.169492
Observations (or Sum Wgts) / 59

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 5 / 199.08517 / 39.8170 / 24.7630 / <.0001
Error / 53 / 85.21992 / 1.6079
C. Total / 58 / 284.30508

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 557.09212 / 357.6697 / 1.56 / 0.1253
Cars / -64.49967 / 6.82079 / -9.46 / <.0001
Speed / -7.750985 / 11.72064 / -0.66 / 0.5113
Cars*Speed / 1.0453489 / 0.102389 / 10.21 / <.0001
Cars Squared / 0.1136098 / 0.096932 / 1.17 / 0.2464
Speed Squared / -0.021813 / 0.096626 / -0.23 / 0.8223

Residual by Predicted Plot

There is strong evidence that the second-order model is valid. Note the improvement in root mean square error from 2.14 to 1.27, the improvement in R² from 0.10 to 0.70 and the improvement in adjusted R² from 0.06 to 0.67. The second-order model provides much better predictions than the first-order model.
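The three summary measures compared here all come from the error and total sums of squares in the Analysis of Variance tables. A minimal Python sketch of the arithmetic follows; the function name is our own, and the inputs are copied from the two fits with row 45 excluded (59 observations).

```python
import math


def fit_summary(ss_error, ss_total, n_obs, n_terms):
    """Return (R-squared, adjusted R-squared, root mean square error)."""
    df_error = n_obs - n_terms - 1          # one more df is used by the intercept
    r2 = 1.0 - ss_error / ss_total
    adj_r2 = 1.0 - (ss_error / df_error) / (ss_total / (n_obs - 1))
    rmse = math.sqrt(ss_error / df_error)
    return r2, adj_r2, rmse


# First-order model: 2 terms (Cars, Speed).
first = fit_summary(256.74595, 284.30508, 59, 2)
# Second-order model: 5 terms (adds Cars*Speed, Cars Squared, Speed Squared).
second = fit_summary(85.21992, 284.30508, 59, 5)

for name, (r2, adj_r2, rmse) in [("first-order", first), ("second-order", second)]:
    print(f"{name}: RSquare={r2:.6f}  RSquare Adj={adj_r2:.6f}  RMSE={rmse:.6f}")
```

Adjusted R² penalizes the three extra terms, so its rise from about 0.06 to about 0.67 indicates that the improvement is not just a consequence of adding more variables.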

Miscellaneous:

1. Although multicollinearity did not appear to be a problem in this example, the way we would investigate multicollinearity is to look at the correlation matrix of the variables. We do this by clicking Analyze, Multivariate and putting Accidents, Cars and Speed into the Y columns. The result (with row 45 excluded) is

Multivariate

Correlations

Variable / Accidents / Cars / Speed
Accidents / 1.0000 / 0.3113 / 0.0087
Cars / 0.3113 / 1.0000 / 0.0126
Speed / 0.0087 / 0.0126 / 1.0000

1 rows not used due to missing values.
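Each entry of the correlation matrix is a Pearson correlation between two columns. The following minimal Python sketch computes one, and then converts the reported Cars and Speed correlation of 0.0126 into a variance inflation factor. The VIF is not part of the JMP output above; it is added here as one common way to read the correlation, and the formula 1 / (1 - r²) applies only to a model with exactly two predictors. The function names and the small data set are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r * r)


# Hypothetical columns, used only to exercise the function.
cars = [40, 42, 44, 46, 48, 50, 52, 54]
speed = [55, 60, 52, 58, 54, 61, 53, 57]
print(f"sample correlation = {pearson(cars, speed):.4f}")

# Correlation between Cars and Speed as reported in the table (row 45 excluded).
vif = vif_two_predictors(0.0126)
print(f"VIF = {vif:.6f}")
```

A VIF this close to 1 means that the correlation between Cars and Speed barely inflates the standard errors of their coefficients in the first-order model.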