Psyc 771/772 / Turkheimer

Post-Hoc Comparisons

When a factor has more than two levels it takes more than one contrast code to code for it, which is why a factor with k levels has k-1 degrees of freedom. The PREs for the individual single-df contrasts, as opposed to the combined k-1 df effect for the whole factor, are called planned comparisons, and provide specific tests of the differences between the relevant groups or means of groups. The multi-df effect of the whole factor is called the omnibus effect. For a factor with k levels we can make k-1 independent comparisons among groups, because that's all the independent information there is to work with.

Sometimes, however, we might go into an analysis not knowing which groups we want to compare, and not wanting to limit ourselves to k-1 comparisons. We just want to look at pairs of groups and have some way of knowing whether the differences among the pairs are "significant." And not necessarily just pairs: we might also want to compare one group to the mean of two others, or make any of a very large number of other possible comparisons.

So here is an example: Five treatments for fever blisters, including a placebo, were randomly assigned to 30 patients. The data and SAS program on the master page describe the number of days to healing in the five groups.

Here is the output from the main part of PROC GLM:

General Linear Models Procedure

Dependent Variable: DAYS

                                  Sum of        Mean
Source             DF            Squares      Square   F Value   Pr > F
Model               4          36.466667    9.116667      3.90   0.0136
Error              25          58.500000    2.340000
Corrected Total    29          94.966667

  R-Square           C.V.      Root MSE     DAYS Mean
  0.383994       27.15454        1.5297        5.6333

Source             DF     Type I SS   Mean Square   F Value   Pr > F
GROUP               4     36.466667      9.116667      3.90   0.0136

Source             DF   Type III SS   Mean Square   F Value   Pr > F
GROUP               4     36.466667      9.116667      3.90   0.0136

So sr2 equals .38. But we learned last week that we are not done as long as there are 4 df in the numerator. If we had specific hypotheses about which of the pairwise mean differences we were interested in, we could test them with a set of planned comparisons using a CONTRAST statement in SAS. But sometimes we don't; what we want to do instead is look at the five means in a more exploratory way, and see where the large differences are.
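For example, a set of planned comparisons might look like this. The particular contrasts shown, and the assumption that group 1 is the placebo, are hypothetical here, just to illustrate the syntax:

proc glm;
  class group;
  model days=group;
  * each contrast is a single-df planned comparison;
  * coefficients must sum to zero across the five groups;
  contrast 'placebo vs all treatments' group 4 -1 -1 -1 -1;
  contrast 'placebo vs group 3'        group 1  0 -1  0  0;
run;

Each CONTRAST statement produces its own single-df F test in the output.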

Here are the means:

GROUP    N         Mean      Std Dev      Minimum      Maximum
1        6    7.5000000    1.6431677    5.0000000   10.0000000
2        6    5.0000000    1.2649111    3.0000000    6.0000000
3        6    4.3333333    1.0327956    3.0000000    6.0000000
4        6    5.1666667    1.4719601    3.0000000    7.0000000
5        6    6.1666667    2.0412415    3.0000000    9.0000000

We could compute d or t between any pair of these means we chose, and in fact for descriptive purposes this would be a very useful thing to do. But from the point of view of significance testing it is problematic, because for five means there are k(k-1)/2 = 10 pairwise comparisons, and many more comparisons based on more than two groups (e.g., group 1 vs. groups 2 and 3 combined). So if we were going to test the significance of all the comparisons we would do a whole lot of tests, which would inflate our experiment-wise Type I error rate substantially.

The most general approach to the problem of multiple hypothesis tests is called the Bonferroni correction. It is based on the fact that if you do k tests with individual error rates of α/k, the overall error rate can't be any worse than α. This means that you're always OK if you divide the total error rate you want to maintain (i.e., .05) by the total number of tests you are doing. In this case you are doing 10 pairwise tests, so you would be safe if you used an individual α of .005.
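The logic is just the union bound: the probability of at least one Type I error can't exceed the sum of the individual error rates,

$$P\left(\bigcup_{i=1}^{k} \text{error}_i\right) \le \sum_{i=1}^{k} P(\text{error}_i) = k \cdot \frac{\alpha}{k} = \alpha.$$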

The problem with this is that it is very conservative, often to the point of being ridiculous. Check the power of testing pairwise hypotheses with an n of six at α = .005. In addition, the Bonferroni correction is often stricter than it needs to be, in that you can derive less severe corrections that still do the job. There are a great many of these, and they all do pretty much the same thing. We will learn about two.

Most methods of this type work by estimating a minimum difference between group means that is significant at some level. You can then simply compare a difference to the minimum difference to see if it makes the grade.

The next most general (and, by the same token, next least powerful) method is called the Scheffé method. It can be applied either to pairs of means or to comparisons involving combinations of means, and it doesn't matter whether the group sizes are equal. If the group sizes are all equal and you only want to compare pairs of means, the somewhat more powerful Tukey method can be used.
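For reference, in the equal-n case both methods reduce to a minimum significant difference of the same general form (these are the standard formulas, with k groups of size n and error degrees of freedom dfE):

$$\text{MSD}_{\text{Tukey}} = q_{\alpha;\,k,\,df_E}\sqrt{\frac{MSE}{n}}, \qquad \text{MSD}_{\text{Scheffé}} = \sqrt{(k-1)\,F_{\alpha;\,k-1,\,df_E}} \cdot \sqrt{\frac{2\,MSE}{n}}.$$

Plugging in the values from the output below (MSE = 2.34, n = 6, q = 4.153, F = 2.759) gives 2.59 and 2.93, matching the SAS results.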

SAS computes these post-hoc tests (and a wide variety of others) via the MEANS statement in PROC GLM.

proc glm;
  class group;
  model days=group;
  means group/tukey scheffe lines;
run;

We have seen the first part of the statement before; it just generates the means of the five groups. Following the slash, we ask for Tukey and Scheffé post-hoc comparisons; the LINES option requests a certain format in the output that I find useful.

General Linear Models Procedure

Tukey's Studentized Range (HSD) Test for variable: DAYS

NOTE: This test controls the type I experimentwise error rate,

but generally has a higher type II error rate than REGWQ.

Alpha= 0.05 df= 25 MSE= 2.34

Critical Value of Studentized Range= 4.153

Minimum Significant Difference= 2.5938

Means with the same letter are not significantly different.

Tukey Grouping          Mean   N  GROUP

           A          7.5000   6  1
           A
      B    A          6.1667   6  5
      B    A
      B    A          5.1667   6  4
      B    A
      B    A          5.0000   6  2
      B
      B               4.3333   6  3

General Linear Models Procedure

Scheffe's test for variable: DAYS

NOTE: This test controls the type I experimentwise error rate but

generally has a higher type II error rate than REGWF for

all pairwise comparisons

Alpha= 0.05 df= 25 MSE= 2.34

Critical Value of F= 2.75871

Minimum Significant Difference= 2.9338

Means with the same letter are not significantly different.

Scheffe Grouping        Mean   N  GROUP

           A          7.5000   6  1
           A
      B    A          6.1667   6  5
      B    A
      B    A          5.1667   6  4
      B    A
      B    A          5.0000   6  2
      B
      B               4.3333   6  3

You see that both the Tukey and Scheffé methods compute a minimum difference between means that is "significant." The Tukey difference is a little smaller because the Tukey method is more powerful when its requirements are met. The A's and B's are the result of the LINES option. A given letter groups together means that are not significantly different from each other, so a pair of means is significantly different only if the two means share no letters. In this case the only such pair is group 1 and group 3.
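To see this from the numbers: the largest pairwise difference is 7.50 - 4.33 = 3.17, which exceeds both the Tukey (2.59) and the Scheffé (2.93) minimum significant differences. The next largest, 7.50 - 5.00 = 2.50, falls short of even the Tukey criterion, so no other pair reaches significance.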

I don't much like post-hoc tests, because they are too significance-testing oriented. They promote what I consider to be the worst way to think about your data: dichotomizing the comparisons into pairs that are "different" and "not different." But that is silly. The differences between the groups are whatever they are, big or small, and it is best, IMHO, to just describe them as such.

To help us think about this it will be useful to develop a new measure of effect size for categorical-variable ANOVAs. Basically it is an extension of d, which was the standardized difference between two group means, i.e.,

$$d = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p}.$$

Of course, a design like this is a simple one-way ANOVA with two levels for the single factor. Another way to express the effect would be as the average difference between the group means and the grand mean, but you would have to take into account that one group is on one side of the grand mean and the other is on the other side, so the signs of the two differences would cancel each other out. So you could instead take the average squared difference between the group means and the grand mean, and take the square root when you are done, i.e.,

$$d' = \frac{1}{s_p}\sqrt{\frac{(\bar{Y}_1 - \bar{Y}_{..})^2 + (\bar{Y}_2 - \bar{Y}_{..})^2}{2}}.$$

Note that this d' would be one half the value of the d we started with, because we are measuring the distance between the group means and the grand mean rather than the distance between the two means themselves.
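To see why, note that with two equal-sized groups the grand mean falls exactly halfway between the two group means, so

$$\bar{Y}_1 - \bar{Y}_{..} = \frac{\bar{Y}_1 - \bar{Y}_2}{2}, \qquad d' = \frac{|\bar{Y}_1 - \bar{Y}_2|}{2\,s_p} = \frac{d}{2}.$$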

This way of expressing the effect size generalizes nicely to the multigroup case, where it seems natural to compute the average deviation of the group means from the grand mean no matter how many of them there are. What's more, all the information we need to figure this out is included in a standard source table. First of all, although we haven't really focused on interpreting it this way, one way to express SSG is as a sum of squares of the group means around the grand mean,

$$SS_G = n_j \sum_{j=1}^{k} (\bar{Y}_j - \bar{Y}_{..})^2,$$

where nj is the number of subjects in each of the k groups.
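Checking this against our data (grand mean 5.6333, nj = 6):

$$SS_G = 6\left[(1.867)^2 + (-0.633)^2 + (-1.300)^2 + (-0.467)^2 + (0.533)^2\right] = 6(6.078) = 36.47,$$

which matches the source table.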

So SSG/nj is the total of the squared differences of the group means around the grand mean, and SSG/knj is the average squared difference across groups. And k times nj, the number of groups times the number of subjects per group, is just N. Now we need to standardize by the pooled standard deviation sp, and it turns out that sp2 is equal to MSE. Dividing by MSE and taking the square root, we get (call it f)

$$f = \sqrt{\frac{SS_G/N}{MSE}} = \sqrt{\frac{SS_G}{N \cdot MSE}}.$$

One good way to express this formula, using only quantities printed in the source table, is

$$f = \sqrt{\frac{(k-1)F}{N}},$$

since F = MSG/MSE = [SSG/(k-1)]/MSE.

If you go back to the source table and work this through, you will see that the square root of the average squared difference between the group means and the grand mean is 0.72.
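Working it through:

$$f = \sqrt{\frac{36.47}{30 \times 2.34}} = \sqrt{0.52} \approx 0.72, \qquad \text{or equivalently} \qquad f = \sqrt{\frac{4 \times 3.90}{30}} = \sqrt{0.52} \approx 0.72.$$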

Let’s get this number from the actual values to see what it means. Here is a table of the group means from the SAS output above:

Group     Mean    Mean-GM    (Mean-GM)^2    (Mean-GM)^2/MSE    |Mean-GM|/sqrt(MSE)
1         7.50      1.867           3.49               1.49                   1.22
5         6.17      0.537           0.29               0.12                   0.35
4         5.17     -0.46            0.21               0.09                   0.30
2         5.00     -0.633           0.40               0.17                   0.41
3         4.33     -1.30            1.70               0.73                   0.85

(GM = grand mean = 5.633; MSE = 2.34)

The mean of the (Mean-GM)^2/MSE column is .52, and the square root of .52 is .72, as above. So what do we conclude? Overall, the ANOVA showed that the group means differ from the grand mean by an average of .72 pooled standard deviations. An examination of the rightmost column allows you to compare the individual group deviations to this average: group 1 had a considerably larger deviation, groups 2, 4, and 5 had smaller deviations, and group 3 was pretty close to average.

I can't resist showing you something else I figured out. What if you are interested in solving the problem the way post-hoc comparisons do, in terms of differences between pairs of groups? It turns out that there is a relationship between the total of the squared deviations from the grand mean and the total of the k(k-1)/2 squared pairwise differences, as follows:

$$\sum_{i<j} (\bar{Y}_i - \bar{Y}_j)^2 = k \sum_{j=1}^{k} (\bar{Y}_j - \bar{Y}_{..})^2.$$

With a little algebra, you can then show that the square root of the average squared standardized pairwise difference (call it f') is equal to:

$$f' = f\sqrt{\frac{2k}{k-1}}.$$

In our case that works out to about 1.14 SDs. You can then use that as a basis for examining the various pairwise differences.
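Working through the arithmetic:

$$f' = 0.72\sqrt{\frac{2 \times 5}{5-1}} = 0.72\sqrt{2.5} \approx 1.14.$$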