Example of Three Predictor Multiple Regression/Correlation Analysis: Checking Assumptions, Transforming Variables, and Detecting Suppression
The data are from Guber, D.L. (1999). Getting what you pay for: The debate over equity in public school expenditures. Journal of Statistics Education, 7, 1-8. The research units are the fifty states of the USA. We shall pretend that they represent a random sample from a population of interest. The criterion variable is the mean SAT score in the state. The predictors are Expenditure ($ spent per student), Salary (mean salary of teachers), and Teacher/Pupil Ratio.
If we consider the predictor variables to be fixed (the regression model), then we need not worry about the shapes of their distributions. If we consider the predictor variables to be random (the correlation model), we do. It turns out that each of the predictors is distinctly positively skewed, and that the skewness can be greatly reduced by a negative reciprocal transformation.
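To make the negative reciprocal transformation concrete, here is a minimal Python sketch. It assumes NumPy is available, and the function names (skewness, negative_reciprocal) and the scores are hypothetical: the scores are not the Guber data, and they were built so that the transformation works perfectly. Real data will be less tidy.

```python
import numpy as np

def skewness(x):
    """Skewness as the mean cubed z-score (using the population SD)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def negative_reciprocal(x):
    """Return -1/x. The minus sign keeps the scores in their original order."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all scores must be positive")
    return -1.0 / x

# Hypothetical scores, NOT the Guber data: built as 1/u for evenly spaced u,
# so they are positively skewed and their reciprocals are symmetric.
scores = 1.0 / np.linspace(0.10, 0.25, 50)
transformed = negative_reciprocal(scores)

print(f"skewness before: {skewness(scores):.3f}")
print(f"skewness after:  {skewness(transformed):.3f}")
```

A plain reciprocal would reverse the order of the scores, so that the state with the highest expenditure would have the lowest transformed score. Taking the negative of the reciprocal avoids that, which keeps the signs of the correlations the same as with the untransformed variables.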
Here are the zero-order correlations for the untransformed variables:
Here is a regression analysis with the untransformed variables. I asked SPSS for a plot of the standardized residuals versus the standardized predicted scores. I also asked for a histogram of the residuals.
If you compare the beta weights with the zero-order correlations, it is obvious that we have some suppression taking place. The beta for expenditure is positive, but the zero-order correlation between SAT and expenditure was negative. For each of the other two predictors, the absolute value of beta exceeds the absolute value of its zero-order correlation with SAT.
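The comparison of beta weights with zero-order correlations can be automated. The following is only a sketch, assuming NumPy. The function names and the data are hypothetical (the data are simulated, not the Guber data), and the two "symptoms" it flags are simply the two described above: a beta whose sign is opposite that of r, and a beta whose absolute value exceeds that of r.

```python
import numpy as np

def correlations_and_betas(X, y):
    """Return (r, beta) for the predictors in the columns of X and criterion y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Zx = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize predictors
    zy = (y - y.mean()) / y.std()                    # standardize criterion
    r = Zx.T @ zy / len(y)                           # zero-order correlations
    beta, *_ = np.linalg.lstsq(Zx, zy, rcond=None)   # standardized weights
    return r, beta

def suppression_signs(r, beta):
    """Label each predictor by the two symptoms described in the text."""
    labels = []
    for r_j, b_j in zip(r, beta):
        if r_j * b_j < 0:
            labels.append("beta and r have opposite signs")
        elif abs(b_j) > abs(r_j):
            labels.append("|beta| exceeds |r|")
        else:
            labels.append("no sign of suppression")
    return labels

# Hypothetical simulated data, NOT the Guber data.
rng = np.random.default_rng(1)
signal = rng.normal(size=200)
nuisance = rng.normal(size=200)
x1 = signal + nuisance                    # useful predictor carrying nuisance variance
x2 = nuisance                             # suppressor: nearly unrelated to y
y = signal + 0.5 * rng.normal(size=200)

r, beta = correlations_and_betas(np.column_stack([x1, x2]), y)
labels = suppression_signs(r, beta)
for name, r_j, b_j, label in zip(["x1", "x2"], r, beta, labels):
    print(f"{name}: r = {r_j:+.3f}, beta = {b_j:+.3f} -> {label}")
```

In the simulated data, x2 has almost no correlation with y, yet including it lets the regression remove the nuisance variance from x1, so the beta for x1 is larger than its zero-order correlation.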
Here is a histogram of the residuals with a normal curve superimposed:
The residuals appear to be approximately normally distributed. The plot of standardized residuals versus standardized predicted scores will allow us visually to check for heterogeneity of variance, nonlinear trends, and normality of the residuals across values of the predicted variable. I have drawn in the regression line (error = 0). I see no obvious problems here.
Under the homoscedasticity assumption, there should be no correlation between the predicted scores and the error variance. The vertical spread of the dots in the plot above should not vary as we move from left to right. I squared the residuals and correlated them with the predicted values. If the residuals increased in variance as the predicted values increased, this correlation would be positive. It is close to zero, confirming my eyeball conclusion that there is no problem with that fairly common sort of heteroscedasticity.
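The check just described (square the residuals, then correlate them with the predicted values) is easy to script. Here is a minimal sketch, assuming NumPy. The function name and both simulated data sets are hypothetical and are there only to show what the statistic looks like with and without the problem. The analysis reported in this handout was done in SPSS.

```python
import numpy as np

def squared_residual_correlation(X, y):
    """Fit OLS with an intercept; correlate predicted values with squared residuals."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ coef
    residuals = y - predicted
    return float(np.corrcoef(predicted, residuals ** 2)[0, 1])

# Hypothetical simulated data, NOT the Guber data.
rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(1.0, 5.0, size=n)
y_homo = 2.0 * x + rng.normal(size=n)          # constant error variance
y_hetero = 2.0 * x + x * rng.normal(size=n)    # error SD grows with x

r_homo = squared_residual_correlation(x, y_homo)
r_hetero = squared_residual_correlation(x, y_hetero)
print(f"homoscedastic errors:   r = {r_homo:+.3f}")
print(f"heteroscedastic errors: r = {r_hetero:+.3f}")
```

Note that this correlation detects only variance that increases or decreases steadily with the predicted values. A pattern in which the spread is widest in the middle of the plot could leave the correlation near zero, which is why the visual check is still worth doing.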
Now let us look at the results using the transformed data.
The correlation matrix looks much like it did with the untransformed data.
The R² has increased a bit.
The transformation caused no major changes, which is comforting. Trust me that the residual plots still look OK too.
I wonder what high school teachers would think about the negative relationship between average state salary for teachers and average state SAT score? If we want better education should we lower teacher salaries? There is an important state characteristic that we should have but have not included in our model. Check out the JSE article to learn what that characteristic is.
Now, can we figure out what sort of suppression is going on here?
It looks like the expenditures variable is suppressing irrelevant variance in one or both of the other two predictors, or in a linear combination of them. Put another way, if we hold constant the effects of teacher salary and number of teachers per pupil, then the relationship between expenditures and SAT goes from negative to positive. Maybe the money is best spent on things other than hiring more teachers or better-paid teachers?
Let us look at two-predictor models.
No suppression between expenditures and teacher salary.
A little bit of classical suppression here, but not dramatic.
A little bit of cooperative suppression here, but not dramatic.
Maybe the expenditures variable is suppressing irrelevant variance in a linear combination of teacher salary and teacher/pupil ratio. I predicted SAT from salary and teacher/pupil ratio and saved the predicted scores as “predicted23.” Those predicted scores are a linear combination of teacher salary and teacher/pupil ratio, with lower salaries and higher teacher/pupil ratios being associated with higher SAT scores. When I correlate predicted23 with SAT I get .477, the R for SAT predicted from salary and teacher/pupil ratio. Watch what happens when I add the expenditures variable to the predicted23 combination.
As you can see, the expenditures variable suppresses irrelevant variance in the predicted23 combination of the other two predictor variables. When you hold total amount of expenditures constant, there is an increase in the predictive value of a linear combination of teacher salary and teacher/pupil ratio.
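The two-step procedure used above (save the predicted scores from two predictors, then add the third predictor to that combination) can be sketched as follows. This assumes NumPy. The function names and the simulated data are hypothetical: the variables x1, x2, and x3 stand in for expenditures, salary, and teacher/pupil ratio only by analogy, and the numbers printed are not the values from the Guber data.

```python
import numpy as np

def fit_predict(X, y):
    """Ordinary least squares with an intercept; return the predicted scores."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

def multiple_r(X, y):
    """Multiple R: the correlation between y and its OLS prediction from X."""
    return float(np.corrcoef(fit_predict(X, y), y)[0, 1])

# Hypothetical simulated data, NOT the Guber data. x2 and x3 each carry
# variance (nuisance) that is irrelevant to y; x1 measures mostly that nuisance.
rng = np.random.default_rng(3)
n = 200
nuisance = rng.normal(size=n)
part2 = rng.normal(size=n)
part3 = rng.normal(size=n)
x2 = part2 + nuisance
x3 = part3 + nuisance
x1 = nuisance + 0.3 * rng.normal(size=n)
y = part2 + part3 + rng.normal(size=n)

# Step 1: save the linear combination of x2 and x3 that best predicts y.
predicted23 = fit_predict(np.column_stack([x2, x3]), y)
r_23 = float(np.corrcoef(predicted23, y)[0, 1])

# Step 2: add x1 to that saved combination and see whether R increases.
r_23_plus_x1 = multiple_r(np.column_stack([predicted23, x1]), y)

print(f"R, y from predicted23 alone:  {r_23:.3f}")
print(f"R, y from predicted23 and x1: {r_23_plus_x1:.3f}")
```

The correlation between the saved predicted scores and the criterion is, by construction, the multiple R from the two-predictor model. Any increase in R at the second step, when the added variable has little zero-order correlation with the criterion, reflects that variable removing irrelevant variance from the saved combination.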
Karl L. Wuensch
East Carolina University, Dept. of Psychology
March, 2011