Does blocking really reduce variability?

Many modern introductory statistics courses include attention to study design. The Advanced Placement Statistics Exam discusses the role of blocking, which is said to “reduce variability”. However, neither the AP curriculum nor most other introductory courses cover the analysis of randomized block designs. This leaves the student a bit in the dark as to just how variability is reduced. This article attempts to give a partial explanation of this based on the two-sample t-test and simple linear regression, topics covered in the AP curriculum and most introductory college courses. We assume students have at least seen computer printouts for both of these techniques.

We will use data from an experiment to measure the effect of lighting conditions on the ability of humans to judge distance. Twenty-four people were randomly divided into two 'treatment' groups. All 24 were asked to judge how far they were from a number of different objects, and an average 'error' in judgment, in feet, was recorded for each person. One treatment group was shown the objects in bright sunshine, and the other under cloudy conditions.

[Minitab character boxplots of the errors for group 1 (sun) and group 2 (clouds), drawn on a common Errors axis running from 4.0 to 9.0 feet]

Parallel boxplots show that a reasonable model for these data would be a shift of about one and a half feet more error for the second (cloudy) group. With no apparent skewness or outliers, a two-sample t-test appears appropriate.
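If you want to draw the same picture yourself, here is a minimal sketch using matplotlib, assuming the 12 errors for each condition are already in two Python lists named error_sun and error_clouds (hypothetical names; the actual values are in the dataset described in the Technology Note).

import matplotlib.pyplot as plt

def parallel_boxplots(error_sun, error_clouds):
    fig, ax = plt.subplots()
    # Horizontal boxes, one per lighting condition, on a common error scale
    ax.boxplot([error_sun, error_clouds], vert=False)
    ax.set_yticklabels(["Sun", "Clouds"])
    ax.set_xlabel("Average error in judging distance (feet)")
    plt.show()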

Two-sample T for Error_Sun vs Error_Clouds

                N   Mean  StDev  SE Mean
Error_Sun      12  5.717  0.896     0.26
Error_Clouds   12   7.36   1.14     0.33

Difference = mu (Error_Sun) - mu (Error_Clouds)
Estimate for difference: -1.64167
T-Test of difference = 0 (vs not =): T-Value = -3.91  P-Value = 0.001  DF = 22
Pooled StDev = 1.0275

As you can see, we used the older version of the test with a pooled variance estimate and 22 degrees of freedom. If you normally teach the newer version that does not pool, then the only number above that will change is the degrees of freedom, which will then be 20. However, that will not always be the case. We pooled here to make a better parallel with what we will do next. Note that the standard deviations of 1.1 and 0.9 are remarkably close. You can use that closeness, or the virtually identical results the two versions give for these data, to justify pooling.
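For readers who use software other than Minitab, here is a minimal sketch of both versions of the test in Python with scipy, again assuming the errors are in hypothetical lists error_sun and error_clouds.

from scipy import stats

def both_t_tests(error_sun, error_clouds):
    # Pooled-variance ("older") version: the one in the printout above,
    # with 12 + 12 - 2 = 22 degrees of freedom.
    pooled = stats.ttest_ind(error_sun, error_clouds, equal_var=True)
    # Welch ("newer") version: with equal group sizes the t-value is the
    # same, but the estimated degrees of freedom come out near 20.
    welch = stats.ttest_ind(error_sun, error_clouds, equal_var=False)
    return pooled, welch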

There is a technique called analysis of variance (ANOVA) that extends the pooled version of the two-sample t-test to more than two groups. All we want to say at the moment is that it gives the same result when we have but two groups.

One-way ANOVA: Errors versus Group

Source  DF     SS     MS      F      P
Group    1  16.17  16.17  15.32  0.001
Error   22  23.23   1.06
Total   23  39.40

S = 1.027   R-Sq = 41.05%   R-Sq(adj) = 38.37%

The first thing to note is that the p-value is the same and so the decision is the same: there seems to be a real difference in the ability to judge distance under the two lighting conditions. Next, the mysterious F-value of 15.32 has a square root of 3.91, which is the t-value for the t-test. These two similarities will always exist when comparing pooled two-sample t with an ANOVA on the same two groups. ANOVA always pools.
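If you would like to see the F = t-squared relationship for yourself, here is a small sketch using scipy, with the same hypothetical lists as before.

from scipy import stats

def f_equals_t_squared(error_sun, error_clouds):
    t_stat, _ = stats.ttest_ind(error_sun, error_clouds, equal_var=True)
    f_stat, _ = stats.f_oneway(error_sun, error_clouds)
    # With only two groups the one-way ANOVA F statistic is the square of
    # the pooled t statistic (here 3.91 squared is 15.32, up to rounding).
    return t_stat ** 2, f_stat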

The remainder of the table (or all we need of it) can be related to similar tables in regression printouts. The sum of the squared residuals from the overall mean of all 24 observations is 39.40. Of this, 16.17 (or 41.05%) is “accounted for” by the lighting conditions. (We won’t get into what “accounted for” means here other than to say that R² means the same thing here as it does in regression.) Also as in regression, the value of “S” is (like every standard deviation) a typical value for how far the data vary from some model or summary of the data. The first standard deviation we learned was a typical value for how far a batch of numbers varied from their mean. For all 24 observations, the standard deviation is 1.309. For regression, “S” is a typical value for the residuals (vertical distances from the regression line). For this ANOVA, the “S = 1.027” is a typical value for how far the observations are from their respective group means.
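The pieces of the table fit together with nothing more than arithmetic. Here is a small check of S and R-Sq computed directly from the sums of squares printed in the one-way ANOVA table above.

import math

ss_total = 39.40                 # squared residuals from the overall mean of all 24 errors
ss_group = 16.17                 # part "accounted for" by lighting condition
ss_error = ss_total - ss_group   # 23.23, squared residuals from the two group means
r_sq = ss_group / ss_total       # about 0.41, the printed R-Sq of 41.05% up to rounding
s = math.sqrt(ss_error / 22)     # about 1.027, a typical distance from a group mean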

At this point we have to confess that we have been concealing one aspect of this study. There was a blocking variable: age. The subjects were in fact divided into three age groups of equal size, and then half of each age group was assigned to each lighting condition. We can do a two-way ANOVA that takes account of both lighting condition and age.

Two-way ANOVA: Errors versus Group, AgeGrp

Source  DF     SS       MS      F      P
Group    1  16.17  16.1704  24.02  0.000
AgeGrp   2   9.76   4.8800   7.25  0.004
Error   20  13.47   0.6733
Total   23  39.40

S = 0.8205   R-Sq = 65.82%   R-Sq(adj) = 60.69%

Again, we do not need to understand all of this, but note that R² went up and S went down, both good signs. We usually measure variability in these situations with the sum of squared residuals. We can see that this is unchanged for the total and for the Group variable, but now AgeGrp accounts for 9.76 and the Error sum has been reduced by just this amount from the one-way ANOVA. This is the reduction in variability to which the magic phrase applies. The unaccounted-for sum of squares dropped from 23.23 to 13.47. The percentage unaccounted for dropped from 100 - 41.05 = 58.95% to 100 - 65.82 = 34.18%. Either way you look at it, it was cut nearly in half.
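For those curious how such a blocked analysis might be run outside Minitab, here is a minimal sketch with statsmodels, assuming the data sit in a pandas DataFrame named depth with columns Errors, Group and AgeGrp (hypothetical names mirroring the Minitab worksheet).

import statsmodels.api as sm
import statsmodels.formula.api as smf

def blocked_anova(depth):
    # Additive model: lighting condition plus age group, no interaction,
    # matching the two-way ANOVA table above. C() treats the numeric
    # codes as categories.
    model = smf.ols("Errors ~ C(Group) + C(AgeGrp)", data=depth).fit()
    # In this balanced design the choice of sum-of-squares type does not matter.
    return sm.stats.anova_lm(model, typ=2)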

Let’s look at why this kind of decrease in variability is important. In a simple one-sample t-test, we look at a fraction whose denominator is a measure of variability: the (estimated) standard error of the mean. Other inference situations are similar, even if the formula is much more complicated, and even if we do not know what the formula is. What we can say is that it is this estimate of variability that blocking reduces. This generally makes our test statistic larger (F went from 15.32 to 24.02) and p-values smaller (0.001 to 0.000). This increases the power of a test and decreases the margin of error (or width of a confidence interval).
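As a rough illustration of the narrower interval, here is a sketch that turns each error mean square into a margin of error for the sun-versus-clouds difference, using only numbers printed in the two ANOVA tables above (the function name and the 95% level are our choices).

import math
from scipy import stats

def margin_of_error(ms_error, df_error, n_per_group=12, conf=0.95):
    # Standard error of the difference of two group means, each based on 12 people
    se_diff = math.sqrt(ms_error * (1 / n_per_group + 1 / n_per_group))
    t_crit = stats.t.ppf((1 + conf) / 2, df_error)
    return t_crit * se_diff

unblocked = margin_of_error(1.06, 22)      # roughly 0.87 feet
blocked = margin_of_error(0.6733, 20)      # roughly 0.70 feet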

Finally, here are the raw data and the fits for the final model.

[Table of the 24 observations showing Group, AgeGrp, Errors, and the FITS from the final model]
What model is that, you ask? Not one that can be expressed in an equation as readily as a regression model, but it is just as simple. Taking a young person in bright sunlight as the base (an average error of 5.22 feet), add on

  • 0.10 feet for being middle aged
  • 1.30 feet for being even older
  • 1.64 feet for cloudy conditions

You can verify this in the FITS column. That 1.64 is just the difference in the two means from the t-test we started with. But now, if we did a confidence interval for that difference, it would be narrower (details not shown here), and in addition we have learned something about the effect of age.
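The additive model is simple enough to write as a few lines of code. Here is a sketch that reproduces the FITS column from the effects listed above (the function and label names are ours).

def fitted_error(age_group, lighting):
    fit = 5.22              # base: a young person in bright sunlight
    if age_group == "middle":
        fit += 0.10         # middle-aged
    elif age_group == "older":
        fit += 1.30         # oldest group
    if lighting == "clouds":
        fit += 1.64         # cloudy conditions
    return fit

# For example, an older subject under clouds:
# fitted_error("older", "clouds") gives 5.22 + 1.30 + 1.64 = 8.16 feet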

Technology Note:

The data are part of the DEPTH dataset that comes with the Student Edition of Minitab 14, which was used to create all the printouts.