ANOVA Overview

Josh Klugman

March 19th, 2009

1.0 The Logic of Significance Tests

Social science researchers are interested in proving causal relationships between variables. That is, they want to prove that a change in one variable will produce a change in another variable.

xy

x = independent variable a.k.a. predictor a.k.a. explanatory variable

y = dependent variable a.k.a. outcome a.k.a. response variable

When psychologists conduct an experiment on a sample of 50 people, they often are not interested in the results for the 50 people per se – they want to show that the results generalize to a broader population, a population that we generally cannot directly observe.

We use significance tests to show that an observed relationship in a sample can be generalized to a population. We see that variable x affects variable y in our sample, but that does not prove that the relationship exists in the population. It could be that by random chance our sample was a fluke, and we have a relationship that exists in our sample but is not “real” (does not occur in the population).[1]

To see if we can support the notion that a relationship exists in a population, we carry out a significance test.

A significance test is a thought experiment. We set up a null hypothesis that says there is NO relationship in the population. We assume the null hypothesis is right, and we calculate the p-value, which is the probability that we would see a relationship at least as strong as the one we observed in our sample. If the p-value is “low enough” we say we reject the null hypothesis. If the p-value is not low enough we say we have to retain the null hypothesis.

α = the threshold for whether or not the p-value is “low enough”. Conventionally it is set to .05.

Erroneously rejecting the null hypothesis is called Type I error (probability of making a Type I error = α).

Erroneously retaining the null hypothesis is called Type II error.

When you commit a Type I error, you are saying there is a relationship when in fact the relationship does NOT exist in the population.

When you commit a Type II error, you are saying you cannot prove a relationship exists when in fact it does exist in the population.

In social science, committing a Type I error is considered a bigger sin than Type II error. This is because social science is a conservative enterprise. Our default assumption is there is not a relationship between a given set of variables UNLESS we can prove otherwise. When you say a relationship does exist, you are challenging what we traditionally thought.

When you say a relationship does not exist, you are upholding our default assumption. To do this incorrectly is bad, but at least the door is left open for someone else to test the assumption.

2.0 The Logic of One-Way ANOVA

We use ANOVA to test the proposition that there is a causal relationship in which a categorical variable (say, experimental condition) affects an outcome.

With one-way ANOVA we are interested in testing for “significant” differences between three or more groups on some outcome.

Example: We are interested in determining if we can induce specific moods in our subjects. We use clips from movies as the mood-induction treatments, and we have three conditions: pleasant, neutral, and unpleasant. After the subject watches the clips, we measure their affect on a scale from 1 to 8, where 1 indicates sadness, and 8 indicates happiness. We get this data:

ȳj = mean for condition j in the sample

sj = standard deviation for condition j in the sample

sj² = variance for condition j in the sample

We see that there are differences between the means for the three conditions.

The question is, are these differences real, or are they just random differences caused by sampling variability?

We set up a null hypothesis that there are no differences between the true population means (μj). We want to knock down this null hypothesis.

H0: pleasant = neutral = unpleasant

Ha: Not all of the population means are equal

μj = mean for condition j in the population

In order to see if the populations have the same mean, the logic of ANOVA is to see how far apart the sample means are from each other, relative to the variation that occurs within the groups.

(let’s assume all six of these groups are normally distributed, where the mean equals the median).

In A and B, the differences between the means are the same. But in B, the differences are larger relative to the variability within the groups.

The logic of ANOVA says that in A, the mean differences are more likely due to random chance. In B, it is less likely we see these mean differences due to chance. We are more likely to reject the null hypothesis under B than under A.

3.0 Conducting a One-Way ANOVA (the omnibus F-test)

To conduct an ANOVA, we calculate an F statistic:

F = MSB / MSW = [Σj nj(ȳj − ȳ)² / (a − 1)] / [Σj Σi (yij − ȳj)² / (N − a)]

a = Number of groups

N = Total number of people

nj = number of people in group j

yij = Value of y for person i in group j

ȳj = Mean value of y for group j

ȳ = Grand mean (mean of whole sample)

Thick blue horizontal bar: the grand mean (ȳ)

Thin black horizontal bars: group means (ȳj)

Light blue vertical lines: distance between the group means and the grand mean (ȳj − ȳ)

Thin black vertical lines: distance between individuals and their respective group means (yij − ȳj)

For our example,

MSB = mean between-group sum of squares (the between-group sum of squares divided by its degrees of freedom, a − 1)

MSW = mean within-group sum of squares (the within-group sum of squares divided by its degrees of freedom, N − a)

The “between-group” sum of squares is sometimes called the “model sum of squares.” It represents the differences explained by our model.

The “within-group” sum of squares is also called the residual sum of squares or the unexplained sum of squares. These are the differences unexplained by our model.
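To make the bookkeeping concrete, here is a minimal Python sketch (assuming NumPy and SciPy are installed). The affect scores are hypothetical, since the handout’s data table is not reproduced here; the sketch computes F by hand and checks it against scipy.stats.f_oneway.

    import numpy as np
    from scipy import stats

    # Hypothetical affect scores (1 = sad, 8 = happy) for the three conditions
    groups = {
        "pleasant":   np.array([6, 7, 5, 8, 6]),
        "neutral":    np.array([4, 5, 4, 5, 4]),
        "unpleasant": np.array([2, 3, 1, 3, 2]),
    }

    a = len(groups)                                  # number of groups
    all_y = np.concatenate(list(groups.values()))
    N = all_y.size                                   # total number of people
    grand_mean = all_y.mean()

    # Between-group and within-group sums of squares
    ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

    msb = ssb / (a - 1)               # mean between-group sum of squares
    msw = ssw / (N - a)               # mean within-group sum of squares
    F = msb / msw
    p = stats.f.sf(F, a - 1, N - a)   # right-tail area = p-value

    print(F, p)
    print(stats.f_oneway(*groups.values()))  # should match the hand calculation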

4.0 The F-Statistic

In order for us to say that there is a real difference among the groups, the F statistic has to be, at a minimum, above 1. In other words, the between-group mean sum of squares has to be bigger than the within-group mean sum of squares.

This is because between-group differences in the sample are actually caused by between-group AND within-group differences in the population.

If F is less than or equal to 1, we can never be sure that the observed group differences reflect TRUE group differences.

But in order for us to conclude that there are true group differences, we need an F that is much larger than 1. How much larger? For that, we need to look at the F distribution.

The F distribution is a right-skewed distribution formed as the ratio of two chi-square variables, each divided by its degrees of freedom. It is specified with a numerator degrees of freedom (a − 1) and a denominator degrees of freedom (N − a).

Probability density functions for various F distributions:

Here is a probability density function for an F distribution with 2, 27 degrees of freedom:
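The ratio-of-chi-squares characterization can be checked by simulation. Here is a minimal sketch (assuming NumPy and SciPy are installed), using the 2 and 27 degrees of freedom from the example:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    dfn, dfd = 2, 27

    # Simulate F as a ratio of two chi-squares, each divided by its df
    chi2_num = rng.chisquare(dfn, size=100_000)
    chi2_den = rng.chisquare(dfd, size=100_000)
    f_simulated = (chi2_num / dfn) / (chi2_den / dfd)

    # The simulated 95th percentile should sit near the theoretical critical value
    print(np.quantile(f_simulated, 0.95))   # roughly 3.35
    print(stats.f.ppf(0.95, dfn, dfd))      # 3.35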

Social scientists evaluate the calculated F-statistic in two ways.

Method 1: Calculate the p-value, the probability of getting a higher F-statistic (finding the area in the right tail). If this area is “low enough” we can say we reject the null hypothesis and retain the alternative hypothesis. We denote the “low enough” threshold with α. The conventional α is .05. α should be determined at the outset.

P-value for this example: 7.34 × 10⁻¹⁰, which we commonly report as p < .001.

The P-value represents the probability of observing group differences at least as big as we have observed them if there are no true differences.

The statistical package SPSS automatically gives researchers the p-value (in the “Sig.” box).

Method 2: Determine α, and then find the “critical value” on the F distribution that bounds an area of α in the right tail (we denote critical F values as F*). If the F statistic is larger than F*, then you can reject the null hypothesis.

Critical value for an F(2, 27) distribution with α = .05: 3.35.

You can calculate p-values using the FDIST function in Excel, and you can calculate critical values using the FINV function.
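SciPy offers equivalents of those two Excel functions; here is a small sketch (f.sf gives the right-tail p-value like FDIST, and f.ppf gives the critical value like FINV), using the 2 and 27 degrees of freedom from the example:

    from scipy import stats

    dfn, dfd, alpha = 2, 27, 0.05

    # Critical value bounding an area of alpha in the right tail (cf. Excel's FINV)
    print(stats.f.ppf(1 - alpha, dfn, dfd))   # about 3.35

    # p-value for an observed F statistic (cf. Excel's FDIST): right-tail area
    F_observed = 35.0                         # placeholder; substitute your calculated F
    print(stats.f.sf(F_observed, dfn, dfd))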

5.0 Assumptions of ANOVA

  • The samples are drawn from populations with normal distributions (the continuous variable is normally distributed)
  • The samples are drawn from populations with equal variances
  • The cases in the samples are statistically independent of each other.

Violating the normality and equality-of-variances assumptions is usually not a big deal unless you have small groups or wildly different group sizes. In that case, you lose control of the probabilities of committing Type I and Type II errors, and you will need to use special techniques to account for these violations.

6.0 Contrasts

Let us use a different example.

Here is some hypothetical data on an experiment looking at various ways to treat hypertension (the outcome is systolic blood pressure, measured in mmHg).

H0: DrugTherapy = Biofeedback = Diet= Combo

Because the p-value is so low, we can reject the null hypothesis. We see there is at least one significant difference among these four groups, but maybe we are interested in testing for specific differences. Say, the differences between biofeedback and drug therapy.

Mean Drug Therapy – Biofeedback difference: 104 – 91 = 13.

We see that in our sample people with biofeedback have a lower systolic blood pressure by 13 mmHg. But again, we have to ask if this is a TRUE difference—if it really occurs in the population.

Again, we have to turn to the F statistic.

H0: DrugTherapy = Biofeedback

Alternatively:

H0: (1)DrugTherapy – (1)Biofeedback + 0Diet + 0Combo=0

p = .026

Let us test a more complicated contrast: the combination treatment versus all three others.

H0: (1)μCombo – (1/3)μDrugTherapy – (1/3)μBiofeedback – (1/3)μDiet = 0

p = 6.78 × 10⁻⁵

In general, the F statistic for a contrast is

F = (Σj cj ȳj)² / (MSW Σj cj²/nj), with 1 and N − a degrees of freedom,

where

cj = coefficient for group j as specified in the null hypothesis
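Here is a sketch of that calculation from group summaries (assuming NumPy and SciPy are installed). The drug therapy and biofeedback means come from the text; the diet and combination means, the group sizes, and the MSW value are placeholders for illustration only.

    import numpy as np
    from scipy import stats

    def contrast_F(means, ns, msw, coefs, df_within):
        """F test for the contrast H0: sum of c_j * mu_j = 0."""
        means, ns, coefs = map(np.asarray, (means, ns, coefs))
        psi_hat = np.sum(coefs * means)                      # estimated contrast
        F = psi_hat ** 2 / (msw * np.sum(coefs ** 2 / ns))
        p = stats.f.sf(F, 1, df_within)
        return F, p

    # Drug therapy vs. biofeedback: coefficients (1, -1, 0, 0)
    means = [104, 91, 95, 85]   # drug, biofeedback, diet (placeholder), combo (placeholder)
    print(contrast_F(means, ns=[5, 5, 5, 5], msw=70.0,
                     coefs=[1, -1, 0, 0], df_within=16))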

7.0 The Problem of Multiple Comparisons

For contrasts, alpha is the pairwise error rate. It is the rate of making a Type I error for a particular comparison of groups. If we set alpha to .05, the probability that we will incorrectly reject the null hypothesis for a particular comparison is .05. If we did a hundred contrasts (all with true null hypotheses), we would expect to incorrectly reject the null hypothesis about five times.

However, the probability that we will incorrectly reject at least one null hypothesis in the whole experiment is considerably larger. The experimentwise error rate depends on the type of contrast you do (for the sake of parsimony I will not get into this). The calculation for the highest possible experimentwise error rate is:

Experimentwise error rate (αEW) = 1 − (1 − α)^C, where C is the number of contrasts

If you do two contrasts, the highest possible experimentwise error rate is 1 − (.95)² = .0975. If we did a hundred experiments and did two contrasts for each experiment, we would expect to incorrectly reject a null hypothesis in about 9.75 experiments.
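A quick sketch of how fast this rate grows with the number of contrasts:

    # Highest possible experimentwise error rate for C contrasts at pairwise alpha = .05
    alpha = 0.05
    for C in (1, 2, 3, 5, 10):
        print(C, 1 - (1 - alpha) ** C)
    # 2 contrasts -> .0975; 10 contrasts -> about .40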

Usually we want to minimize our experimentwise error rate. There are a couple of approaches to do this.

Approach #1. Ignore the omnibus F test, and just do a small number of planned, theoretically-informed contrasts (no more than three).

Approach #2. Do as many planned contrasts as you want, but use the Bonferroni adjustment. (Although if you plan on having a large number of contrasts, you are shading over into post-hoc territory.)

Approach #3. If the omnibus test is significant, test for any contrasts that look interesting (or test for all possible contrasts) but use post-hoc adjustments – Tukey’s Wholly Significant Differences (WSD) for simple contrasts and the Scheffé adjustment for complex contrasts.

Planned contrast – a contrast you were interested in BEFORE the data is collected. Usually guided by theory.

Post-hoc contrast – a contrast you want to test AFTER looking at the data (or when you test for all possible comparisons).

(Post-hoc contrasts are more likely to lead to Type I error because you are testing for differences regardless of theory.)

7.1 Bonferroni Adjustment

Bonferroni contrasts involve simply setting:

αPC = αEW / C

αPC = pairwise alpha

αEW = experimentwise alpha

C = number of planned contrasts

In practice, this usually boils down to setting αPC equal to .05/C. If you plan on having three contrasts, then αPC will equal .0167.

Example:

Take the contrast we did between biofeedback and drug therapy. Let us say that was one of three planned contrasts. We saw that F(1,16) = 6.02 and unadjusted p = .026.

You can use one of three ways to figure out the significance of the adjusted contrast.

  1. Compare the unadjusted p to αPC. Our p is greater than .05/3 = .0167, so we have to retain the null hypothesis.
  2. Create an “adjusted p” by multiplying the unadjusted p by C and compare it to αEW. Adjusted p = .078 is greater than .05. Retain the null hypothesis.
  3. Find the adjusted critical value, the F* that bounds α = .0167 on an F(1, 16) distribution: F* = 7.14. F < F*, so retain the null hypothesis (see the sketch after this list).
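Here is a minimal sketch of the three checks with SciPy, using the F value and degrees of freedom from the example:

    from scipy import stats

    F, dfn, dfd = 6.02, 1, 16
    alpha_EW, C = 0.05, 3
    alpha_PC = alpha_EW / C                          # .0167

    p = stats.f.sf(F, dfn, dfd)                      # unadjusted p, about .026

    print(p < alpha_PC)                              # check 1: compare p to alpha_PC
    print(min(p * C, 1.0) < alpha_EW)                # check 2: Bonferroni-adjusted p
    print(F > stats.f.ppf(1 - alpha_PC, dfn, dfd))   # check 3: adjusted critical value, about 7.14

    # All three checks print False, so we retain the null hypothesis.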

Danger of Bonferroni adjustment: If you have a lot of planned contrasts the Bonferroni adjustment will be less powerful (more likely to commit Type II error) than post-hoc contrasts.

7.2 Tukey’s Wholly Significant Differences

We use the Tukey WSD for post-hoc contrasts involving only two groups. With the Tukey WSD, the critical value and the p-value come from a different distribution, the “studentized range distribution.” The logic of the studentized range distribution is that you can get a critical value for testing the difference between the group with the lowest mean and the group with the highest mean and still keep αPC(Min–Max) and αEW at .05. If there is going to be any difference between the groups, it is definitely going to occur between the group with the lowest mean and the group with the highest mean (in our hypertension example, this would be between the drug therapy and combination groups). We use the same critical value for other pairwise comparisons, which means that for non-maximum pairwise comparisons αPC is < .05 and αEW is still .05.

Values from the studentized range distribution are denoted as q. To get the critical value from this distribution, use F* = q²/2.

Example:

For the blood pressure experiment we did, we had four groups (a = 4) and we had 20 subjects, so the denominator degrees of freedom are 20 − 4 = 16. We need to find q(.05; 4, 16). We find q by looking it up in a statistical table; q = 4.046, and F* = q²/2 = 8.185.

For our drug therapy – biofeedback contrast, F(1,16) = 6.02, which is less than F* = 8.185. According to the Tukey WSD, we must retain the null hypothesis. We cannot prove a difference exists in the population.
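The same lookup can be done with SciPy’s studentized range distribution (available in SciPy 1.7 and later) instead of a printed table; a small sketch:

    from scipy import stats

    a, df_within, alpha = 4, 16, 0.05

    q = stats.studentized_range.ppf(1 - alpha, a, df_within)  # about 4.05
    f_star = q ** 2 / 2                                       # about 8.19

    F_contrast = 6.02            # drug therapy vs. biofeedback, from the example
    print(F_contrast > f_star)   # False: retain the null hypothesis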

7.3 Scheffé Adjustment For Complex Contrasts

We use the Scheffé test for all of our post-hoc contrasts if any of them are complex. If none of your post-hoc contrasts are complex, then do not use the Scheffé test as it is much less powerful than other techniques we have talked about for pairwise comparisons.

The Scheffé adjustment has a similar logic to the Tukey adjustment – the critical values come from a probability distribution for testing the biggest possible difference among the groups.

For the Scheffé we can go back to the F distribution. The critical value for a Scheffé test is:

F* = (a − 1) × F(α; a − 1, N − a)

Example:

For the complex contrast we tested for above, we found that F(1,16) = 28.40.

Our test statistic is greater than the critical value, so we can reject the null hypothesis with the Scheffé test.
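A sketch of that check with SciPy, using the four groups and 16 within-group degrees of freedom from the example:

    from scipy import stats

    a, N, alpha = 4, 20, 0.05
    f_star_scheffe = (a - 1) * stats.f.ppf(1 - alpha, a - 1, N - a)  # about 9.72

    F_contrast = 28.40                    # combination vs. the other three treatments
    print(F_contrast > f_star_scheffe)    # True: reject the null hypothesis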

8.0 Two-Way ANOVA

Most of the time, researchers are not interested in the relationships between only two variables. More often, they want to examine how multiple variables affect a particular outcome.

Hypertension experiment (outcome: systolic blood pressure):

                 Control   Drug Therapy   Biofeedback   Biofeedback & Drug
                 185       186            188           158
                 190       191            183           163
                 195       196            198           173
                 200       181            178           178
                 180       176            193           168
Mean             190       186            188           168
Grand Mean       183
s                7.91      7.91           7.91          7.91

Two-Way Approach:

                              Biofeedback
                        Absent    Present   Average
Drug Therapy   Absent   190       188       189
               Present  186       168       177
               Average  188       178       183

In this factorial ANOVA, we are looking at: (a) the main effect of drug therapy; (b) the main effect of biofeedback; and (c) the interaction effect of both the drug therapy and biofeedback treatments.

Main effects are the effect of a factor averaging across all the levels of all the other factors.

An interaction effect is when the effect of a factor is contingent on the level of another factor.

Main effect of drug therapy: Compare SBP (systolic blood pressure) of people without drug therapy to the SBP of people with drug therapy. Subjects undergoing drug therapy see a decline in SBP of 12 mmHg (189-177).

Main effect of Biofeedback: Compare SBP of people without biofeedback to SBP of people with biofeedback. Subjects undergoing biofeedback see a decline in SBP of 10 mmHg (188-178).

Interaction effect:

We can talk about the interaction between any two variables (a and b) in two ways; a short sketch of these calculations follows the list:

  • How does the effect of a differ across levels of b?
      • Effect of biofeedback without drug therapy (190 − 188 = 2)
      • Effect of biofeedback with drug therapy (186 − 168 = 18)
  • How does the effect of b differ across levels of a?
      • Effect of drug therapy without biofeedback (190 − 186 = 4)
      • Effect of drug therapy with biofeedback (188 − 168 = 20)
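Here is a minimal sketch (assuming NumPy is installed) that recovers these numbers from the 2 × 2 table of cell means; because the design is balanced, the marginal means are simple averages of the cell means.

    import numpy as np

    # Cell means: rows = drug therapy (absent, present), columns = biofeedback (absent, present)
    cell_means = np.array([[190.0, 188.0],
                           [186.0, 168.0]])

    row_means = cell_means.mean(axis=1)   # [189, 177]: drug therapy absent vs. present
    col_means = cell_means.mean(axis=0)   # [188, 178]: biofeedback absent vs. present

    print(row_means[0] - row_means[1])    # 12: main effect of drug therapy
    print(col_means[0] - col_means[1])    # 10: main effect of biofeedback

    # Effect of biofeedback at each level of drug therapy
    print(cell_means[0, 0] - cell_means[0, 1])   # 2: without drug therapy
    print(cell_means[1, 0] - cell_means[1, 1])   # 18: with drug therapy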

8.1 Terminology

Factor: Independent variable

Level: Value of a single independent variable

In the example, we have two factors (a two-way ANOVA). Each factor has two levels (absent/present). We designate a two-way factorial ANOVA with this notation: a × b, where a is the number of levels in the first factor, and b is the number of levels in the second factor.

In the example, we have a 2 × 2 ANOVA.

Cell: Combination of two or more levels from different independent variables.

In the example, we have 4 cells (neither biofeedback nor drug therapy; biofeedback only; drug therapy only; both biofeedback & drug therapy).

8.2 Omnibus F-tests

                              Biofeedback
                        Absent    Present   Average
Drug Therapy   Absent   190       188       189
               Present  186       168       177
               Average  188       178       183

Hypertension Experiment Individual Values

Sum of Squares Within:
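As a sketch, the within-cell sum of squares pools each person’s squared deviation from his or her own cell mean. Using the individual values from the two-way table above (assuming NumPy is installed):

    import numpy as np

    # Individual SBP values by cell (from the two-way table above)
    cells = {
        "control":            [185, 190, 195, 200, 180],
        "drug only":          [186, 191, 196, 181, 176],
        "biofeedback only":   [188, 183, 198, 178, 193],
        "biofeedback & drug": [158, 163, 173, 178, 168],
    }

    ssw = sum(((np.array(y) - np.mean(y)) ** 2).sum() for y in cells.values())
    print(ssw)   # 1000: each cell contributes 250 (variance 62.5 with 4 df)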