Regression to the Mean

Some examples to help understand “regression”

First, a story (courtesy of BVD’s Stats: Modeling the World textbook). Suppose a new AP Stats student enrolled in your class. Without seeing this student, what would be a good guess for this student’s height in inches? The mean height (in inches) of your current students would be a good guess.

What if you knew that this student’s IQ was 1.5 standard deviations above average? How would that change your prediction? Probably not at all, since we would not expect any correlation between IQ and height (r ≈ 0).

On the other hand, if we knew that their height in centimeters was 1.5 standard deviations above average, then we would know that their height in inches would be 1.5 standard deviations above average. This is, of course, because heights in centimeters and heights in inches are perfectly correlated (r = 1).

Finally, if we knew that this new student’s foot length was 1.5 standard deviations above average, then we would guess that their height is above average, too. But we would probably not guess that it is a full 1.5 standard deviations above average. It would be less than 1.5 standard deviations above average because foot length and height are not perfectly correlated. The factor used to calculate this “slipping” or “regressing” is the correlation coefficient, r. If the correlation between foot length and height is 0.72, then we can estimate the new student’s height by taking 0.72 × 1.5 and guessing that the new student is 1.08 standard deviations above average in height (according to our linear model of height vs. foot length).
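In standard-deviation units, the prediction rule is simply ẑ_y = r · z_x. Here is a minimal Python sketch of that rule, using the hypothetical r = 0.72 and z = 1.5 from the example above:

```python
# A minimal sketch of the prediction rule z_y-hat = r * z_x.
# The values of r and z_foot are the hypothetical ones from the
# example above, not measurements from real data.

def predict_z(r, z_x):
    """Predicted z-score of y, given the z-score of x."""
    return r * z_x

r = 0.72       # assumed correlation between foot length and height
z_foot = 1.5   # new student's foot length, in standard deviations

print(predict_z(r, z_foot))  # 1.08 -- "regresses" from 1.5 toward the mean (0)
```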

This “slippage” of less than perfect correlation between linearly associated variables is where the term “regression” emerged.
Another Example: Fat vs. Saturated Fats at McDonald’s

There is a positive, relatively linear association between the amounts of fat and saturated fat in McDonald’s menu items.

Here is the same data, graphed using the z-scores (z_x and z_y).

If there were perfect correlation between fat and saturated fat, then the slope of the least squares regression line would equal one (using z-scores, remember). In other words, for every increase of one standard deviation in fat, we would expect an increase of one standard deviation in saturated fat (z_satfat = z_fat).

Here is the graph with the line z_satfat = z_fat graphed:

But there is a better model for the data. Here is the graph with the LSRL added:

Compared to the z_satfat = z_fat line, the LSRL shows that for every additional standard deviation in the z_fat direction, there is slightly less than one additional standard deviation in the z_satfat direction. This “slippage” is where the term “regression” originated.
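In fact, when both variables are converted to z-scores, the slope of the LSRL is exactly r. Here is a quick numerical check (a Python sketch on made-up positively associated data, not the actual McDonald’s values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up positively associated data standing in for fat (x) and
# saturated fat (y); these are NOT the actual McDonald's values.
x = rng.normal(20, 10, size=200)
y = 0.4 * x + rng.normal(0, 3, size=200)

# Convert both variables to z-scores.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# The least-squares slope of zy on zx ...
slope = np.polyfit(zx, zy, 1)[0]
# ... matches the correlation coefficient r.
r = np.corrcoef(x, y)[0, 1]

print(round(slope, 6), round(r, 6))  # the two values agree
```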

“For many variables, natural processes work to ‘dampen’ extreme outliers and bring them closer to their respective means.”

Source: Jeffrey M. Stanton (Syracuse University), “Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors,” Journal of Statistics Education, Volume 9, Number 3 (2001).

Now let’s see if we can understand a different view of this so-called “regression effect” or “regression toward the mean.” On the graph below, notice the points near the vertical line z_x = 1. The mean of the z_y values of these points is centered around a z_y value that is less than 1. This again is an example of “regression,” or “slippage,” toward the mean (the mean is zero on this z-score graph).
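A small simulation shows the same effect (a sketch; the choice r = 0.8 is illustrative, not the actual fat/saturated-fat correlation):

```python
import numpy as np

rng = np.random.default_rng(1)
r = 0.8  # illustrative correlation, not taken from real data

# Draw standardized pairs (zx, zy) whose correlation is r.
n = 100_000
zx = rng.normal(size=n)
zy = r * zx + np.sqrt(1 - r**2) * rng.normal(size=n)

# Among points whose zx is close to 1, the average zy is about
# r * 1 = 0.8 -- closer to the mean (0) than the zx values were.
near_one = np.abs(zx - 1) < 0.05
print(zy[near_one].mean())
```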

What is R-Squared?

R-squared is frequently misunderstood, and its interpretation is often “parroted” without a clear, personal understanding. The following explanation will try to make r² clear.

Look again at the fat vs. saturated fat example. One of the common interpretations of r-squared is something like this: “80% of the variation in saturated fat in McDonald’s menu items can be explained by the linear model on fat.” To understand this statement, let’s start with the question: “How much variation is in saturated fats without looking at the linear model on fat?”

A logical response to this question might be: “Variation from what?” Well, when we calculate standard deviation and variance, we usually measure variation from the mean. So let’s measure the total amount of variation in saturated fats by finding the sum of the squares of the differences between each saturated fat value and the mean of the saturated fat values (called SST, or the total sum of squares). Below is a graph showing all the squares constructed from each data point to the mean (represented by the horizontal line). The sum of the areas of these squares (SST) is 4939.

But this horizontal line is clearly not the best linear model for the data. It looks like we could reduce the sum of the squares by “tilting” the line to an upward slope until we produce the smallest possible sum of squares…making the least squares regression line!

(These squares are actually the squares of all the residuals.) Notice that the sum of the squares HAS been reduced…all the way down to 966.3 (this is called the SSE—the sum of the squares of the errors). How much of the variation was removed by the linear model?

SST – SSE = 4939 – 966.3 = 3972.7

What percent of the SST was this?

(SST – SSE) ÷ SST = 3972.7 ÷ 4939 = 0.80435 (about 80%)

Notice this was the r² reported below the graph. The earlier interpretation should now make sense. The saturated fats for McDonald’s menu items had some variation to begin with. But if we consider the variation in saturated fats with respect to a linear model on fat for the same menu items, some of the variation has been reduced/accounted for/explained by this linear model. The percent that was “reduced” is the value of r².
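The whole computation can be written out in a few lines (a sketch on illustrative numbers; the actual menu data behind SST = 4939 and SSE = 966.3 are not reproduced here):

```python
import numpy as np

def variation_summary(x, y):
    """SST, SSE, and r^2 for the least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept

    sst = np.sum((y - y.mean()) ** 2)  # total variation about the mean
    sse = np.sum((y - y_hat) ** 2)     # variation left over after the line
    return sst, sse, (sst - sse) / sst

# Illustrative fat / saturated-fat pairs (in grams), NOT the real menu data:
fat = np.array([9.0, 11, 19, 25, 26, 30, 33, 40])
sat_fat = np.array([3.5, 4, 8, 10, 12, 10, 15, 17])

sst, sse, r2 = variation_summary(fat, sat_fat)
print(sst, sse, r2)  # r2 is the fraction of SST removed by the line
```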
