Lecture 10 - Power

Power: Probability of rejecting the null hypothesis in those situations when the null is false.

In terms of the 2 x 2 table of outcomes.

The combination of two states with two decisions leads to four possible outcomes.

Experiment Outcome / Situation that exists in the populations: Null is True / Situation that exists in the populations: Null is False
Retain Null / Correct Retention / Incorrect Retention. Probability: Type II error rate
Reject Null / Incorrect Rejection. Probability: Significance level / Correct Rejection. Probability: Power

Note that power is not an issue for the left side of the 2x2 table.

If we’re in the left side of the table, the Study Time Program was not effective – performance is the same for the study skills group and the no study skills group. So the population performance means are equal.

If we’re in the left side of the above table, the only thing that affects the probability of Rejection (or the probability of Retention) is the Significance Level of the statistical test. This is set (usually at .05) and does not depend on the outcome of the research.

Power is an issue only for the right side of the 2x2 table.

If we’re in the right side of the table, the Study Time Program is effective. There is some difference between the performance population means.

If we’re in the right side of the table, a whole bunch of factors affect the probability of Rejection. Those are the factors we’re considering here.

Obviously, if the null is false, then you want to do whatever you can to put yourself in the lower right cell.
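The two sides of the table can be illustrated with a small simulation. This is only a sketch under assumptions that are not part of the lecture: normally distributed scores, a known population SD of 1, 30 participants per group, a z-test in place of the t-test, and a true difference of 0.5 SD on the right side. The function name rejection_rate is made up for the illustration.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_diff, n_per_group=30, sigma=1.0, alpha=0.05,
                   reps=4000, seed=1):
    """Share of simulated experiments in which a two-tailed z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n_per_group) ** 0.5  # standard error of the mean difference
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps

# Left side of the table: the null is true (assumed true difference = 0).
left_side = rejection_rate(true_diff=0.0)
# Right side of the table: the null is false (assumed true difference = 0.5 SD).
right_side = rejection_rate(true_diff=0.5)

print(f"Null true : reject {left_side:.3f}, retain {1 - left_side:.3f}")
print(f"Null false: reject {right_side:.3f}, retain {1 - right_side:.3f}")
```

On the left side the rejection rate should hover near the significance level. On the right side the rejection rate is the power, and one minus it is the Type II error rate.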

Factors that affect Power in order of importance.

1. The effect size: When comparing means: How big the difference between population means actually is. When doing correlational research: how strong the relationship actually is in the population.

Definitions

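When comparing two population means: Effect size is the difference between the population means divided by the population standard deviation. This is symbolized as d = (μE – μC) / σ.

When investigating a relationship: Effect size = Population r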

If the population means are equal or the population r = 0, then effect size is 0, the null is true, and power is not an issue.

The larger the effect size, the more likely we are to detect it.

Analogy: The brightness of a distant star. The brighter it is, the easier it will be to detect.

How big is an effect size?

From Lance, C. E., & Vandenberg, R. J. (2008). Statistical and Methodological Myths and Urban Legends. Routledge.

Characterizations of effect sizes in terms of what Cohen considered small, medium, and large will be presented below.

2. The sample size. The only thing we really have control over. The larger the sample size, the greater the power. Sample size is the primary method of manipulating power.

Analogy: The size of our telescope. The larger the telescope, the greater the chance of detecting a star.

Note that increasing the sample size has no effect on probabilities computed in those situations in which the null is true – the left side of the 2x2 table above. If the null is true, the probability of incorrectly rejecting it depends only on the significance level. The significance level is set before the research is conducted.

3. The particular test chosen. For example, in the comparison of two groups, if the assumptions of the t-test are met, it is the most powerful way to compare means. The Mann-Whitney U-test is less powerful as a test to compare means than the t-test when the assumptions of the t are met.

4. The significance level. The larger the significance level, the larger the power.

But you can't have your cake and eat it too. Unfortunately, increasing the significance level increases the probability of a Type I error - since that's what significance level is.

5. The variability of scores within each population.

Recall that for two populations, the effect size is the difference in population means divided by the population standard deviation, (μ1 – μ0) / σ. Proper conduct of the experiment may affect the value of σ. The smaller the value of σ, the larger the power. Manipulating σ does not affect the probability of a Type I error.

Telescope analogy: Get rid of random atmospheric distortion.

6. Direction of alternative hypothesis. All other things being equal, a one-tailed alternative hypothesis is more powerful than a two-tailed alternative if you've specified the direction correctly.

Summarizing

Manipulation / Effect of manipulation if there is no difference in population means. (Left side of 2x2 table.) / Effect of manipulation if there is a difference in population means. (Right side of 2x2 table.)
Increase the effect size / No effect at all / Increases power
Increase sample size / No effect at all / Increases power
Choose a more powerful test / No effect at all / Increases power
Make significance level larger / Increases Type I error rate / Increases power
Decrease variability of scores within groups / No effect at all / Increases power
Choose appropriate one-tailed alternative hypothesis / No effect at all / Increases power
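The right-hand column of the table can be checked numerically. The sketch below uses a normal approximation to the power of a two-group comparison, which is not the exact t-test calculation a program such as SamplePower performs. The baseline numbers (mean difference of 5, SD of 10, 30 per group) are made up for illustration, and the choice of test (factor 3) is not modeled.

```python
from statistics import NormalDist

def power_two_groups(mean_diff, sigma, n_per_group, alpha=0.05, tails=2):
    """Approximate power for comparing two group means (normal approximation)."""
    norm = NormalDist()
    d = mean_diff / sigma                   # effect size
    shift = d * (n_per_group / 2) ** 0.5    # expected value of the test statistic
    z_crit = norm.inv_cdf(1 - alpha / tails)
    power = 1 - norm.cdf(z_crit - shift)    # rejections in the predicted direction
    if tails == 2:
        power += norm.cdf(-z_crit - shift)  # rejections in the opposite tail
    return power

# Hypothetical baseline: d = .5, n = 30 per group, alpha = .05, two-tailed.
baseline = power_two_groups(mean_diff=5, sigma=10, n_per_group=30)
changes = {
    "larger effect (mean difference 8)": power_two_groups(8, 10, 30),
    "larger sample (n = 60 per group)": power_two_groups(5, 10, 60),
    "larger significance level (.10)": power_two_groups(5, 10, 30, alpha=0.10),
    "less variability (SD = 7)": power_two_groups(5, 7, 30),
    "one-tailed, correct direction": power_two_groups(5, 10, 30, tails=1),
}
print(f"baseline power: {baseline:.2f}")
for label, value in changes.items():
    print(f"{label}: {value:.2f}")

# Left side of the table: with no difference in means, the rejection
# probability stays at alpha no matter how large the sample is.
for n in (30, 60, 300):
    print(n, round(power_two_groups(0, 10, n), 3))
```

Every manipulation raises power above the baseline, while the last loop shows that sample size leaves the rejection probability at the significance level when the population means are equal.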

How big is an effect size?

Measures of Effect Size for Common Statistical Tests

Each measure is listed as a population value followed by its sample estimate.

One Population t

Population value: d = (Actual Pop Mean – Hyp’d Pop Mean) / Pop SD
Sample estimate: d = (Sample Mean – Hyp’d Pop Mean) / Sample SD

Small = .2 / Medium = .5 / Large = .8

Two Independent Samples t

Population value: d = (Pop Mean 1 – Pop Mean 2) / Pop SD
Sample estimate: d = (Sample Mean 1 – Sample Mean 2) / Square root of Pooled Variance (S²p)

Small = .2 / Medium = .5 / Large = .8
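As an illustration of the sample estimate for two independent samples, here is a minimal sketch. The scores are invented for the example, and the function name cohens_d is just a label chosen here.

```python
from statistics import fmean, variance

def cohens_d(sample_1, sample_2):
    """Sample d: difference in sample means over the square root of the pooled variance."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_var = ((n1 - 1) * variance(sample_1)
                  + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    return (fmean(sample_1) - fmean(sample_2)) / pooled_var ** 0.5

# Made-up scores, for illustration only.
study_skills = [78, 85, 90, 72, 88, 95, 81, 79]
no_study_skills = [70, 82, 75, 68, 80, 85, 74, 77]

d = cohens_d(study_skills, no_study_skills)
print(f"sample d = {d:.2f}")
```

The result can then be compared against the Small, Medium and Large benchmarks.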

Two Correlated Samples t

Population value: d = (Pop Mean 1 – Pop Mean 2) / Pop SD
Sample estimate: d = (Sample Mean 1 – Sample Mean 2) / Square root of ((S²1 + S²2) / 2)

Small = .2 / Medium = .5 / Large = .8

But the correlation of paired scores, r, influences the actual effect size.
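One way to see the influence of r: a correlated samples t-test works on difference scores, and the SD of the difference scores shrinks as r grows. Assuming both sets of scores have the same SD, the SD of the differences is SD times the square root of 2(1 – r), so the mean difference measured in units of that SD is d divided by the square root of 2(1 – r). The sketch below just evaluates that expression. The function name is made up, and the equal-SD assumption is an addition, not something stated in these notes.

```python
def paired_effect_size(d, r):
    """Mean difference in units of the SD of difference scores (equal SDs assumed)."""
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    return d / (2 * (1 - r)) ** 0.5

# A medium d of .5 under several hypothetical correlations between paired scores.
for r in (0.0, 0.2, 0.5, 0.8):
    print(f"r = {r:.1f}: effect in difference-score units = {paired_effect_size(0.5, r):.2f}")
```

When r is .5 the two versions coincide. Higher correlations make the same d easier to detect, and lower ones make it harder.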

One Way independent samples ANOVA

Population value: f = SD of Population Means / Pop SD
Sample estimate: f = Sample SD of Sample Means / Square root of MS Within

Small = .1 / Medium = .25 / Large = .4

η² = Eta² = f² / (1 + f²)

Small = .01 / Medium = .059 / Large = .138
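The η² benchmarks follow directly from the f benchmarks. A quick check, with a function name chosen only for this illustration:

```python
def eta_squared_from_f(f):
    """Convert the ANOVA effect size f to eta squared."""
    return f ** 2 / (1 + f ** 2)

for label, f in (("Small", 0.10), ("Medium", 0.25), ("Large", 0.40)):
    print(f"{label}: f = {f:.2f} -> eta squared = {eta_squared_from_f(f):.3f}")
```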

Pearson r between two variables

Population value: Population r
Sample estimate: Sample r

Small = .10 / Medium = .30 / Large = .50

Determining Sample Size for upcoming research

It is important to take power into account when planning the sample size for research.

Following is an illustration of what must be considered when comparing two groups, a common situation.

I. Determine how big the population effect size is that you’re trying to detect. That is, determine how big a difference you’ll be trying to discover in your research.

Commonly asked question: How can we know what the difference will be in the population before we’ve conducted the experiment to discover if there is a difference? A Catch-22 situation.

From Lance, C. E., & Vandenberg, R. J. (2008). Statistical and Methodological Myths and Urban Legends. Routledge.

Mean of the red correlations is -.11, an estimate of the effect size for inconsistency as a predictor of GPA.

II. Determine the desired power – the probability of detecting the difference we think our manipulation will make. Typically, we want that probability to be as large as possible (1 would be great) but, realistically, we usually settle for the value .8. That value is to power analysis and sample size determination what .05 is to significance levels.

III. We then consult sample size tables or a computer program such as SamplePower 3 to determine the sample size required to detect the estimated effect with the desired power. A collection of power tables is available at -> Psychology 201 -> Power Tables.

Example 1 – Two Groups Research.

You plan to investigate a new method of teaching statistics. Prior to the research, you wish to determine how many participants will be required.

I. Population Effect size: Hmm. If your new method will only yield a small effect, then it probably wouldn’t be worth your efforts to pursue it. So you’re only interested in the new method if it yields a medium effect size, d=0.5. So plan the statistical analysis so that it will be likely to detect a medium or larger effect size. If the effect size is smaller than medium, your analysis might not detect it, but that’s OK, since a small effect size would mean that the method wasn’t that effective.

II. Power. We’d like at least a 90% chance of detecting a medium effect size. There’s no point in doing the research and the analysis if we can’t be quite sure that we’ll detect a useful difference.

III. Sample Power Output

Sample Power indicates that we’ll need 90+90 or 180 participants in order to have power of 92% to detect a difference of 0.5 standard deviations, a medium effect.

Biderman’s Power tables . . . -> Psychology 201 -> Power Tables

So, we’ll use 90 persons per group and have .92 probability of detecting a difference of .5 SDs.
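For readers without SamplePower or the tables, the numbers in Example 1 can be roughly cross-checked with a normal approximation. This is a sketch, not the routine SamplePower uses. Because it ignores the extra uncertainty of the t-test, it tends to come out a few participants short of what an exact calculation requires. Both function names are made up for the illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_power(d, power, alpha=0.05):
    """Approximate n per group for a two-tailed comparison of two means."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed comparison of two means."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return 1 - norm.cdf(z_alpha - shift) + norm.cdf(-z_alpha - shift)

# Medium effect, 90% power, as in Example 1.
print("approximate n per group:", n_per_group_for_power(0.5, 0.90))
# Power with the 90 per group chosen in the example.
print(f"approximate power with 90 per group: {approx_power(0.5, 90):.2f}")
```

Under this approximation, 90 per group gives power of about .92, in line with the SamplePower output described above.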

Example 2 – Correlational Research.

You are investigating a new test for predicting performance of students in a statistics curriculum. How big of a sample should you use?

1. Effect Size: Hmm. CA correlates about .5 with performance. But Conscientiousness correlates only about .2 with performance in academia. You decide that you are not interested in doing any more work on your test unless it correlates more highly with performance than does Conscientiousness. You decide that a medium effect size correlation coefficient, r=.3, is the effect size you are most interested in.

2. Power: Let’s choose a sample that will have a 90% chance of detecting a correlation of .3.

3. Sample Power Output

So the sample power output suggests that you’ll need 110 participants in order to have probability of .90 to detect a correlation of .3.
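A rough cross-check for Example 2 uses the Fisher z approximation for the sample size needed to detect a correlation. This is a sketch under that approximation, not the SamplePower routine, so the answer may differ from the program's output by a few participants. The function name is made up for the illustration.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, power, alpha=0.05):
    """Approximate N to detect a population correlation r (two-tailed, Fisher z)."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return ceil((z_total / atanh(r)) ** 2 + 3)

print("approximate N for r = .3, power = .90:", n_for_correlation(0.3, 0.90))
print("approximate N for r = .5, power = .90:", n_for_correlation(0.5, 0.90))
```

The first figure lands in the same neighborhood as the sample size reported above, and the second shows how much smaller a sample is needed for a large correlation.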

Biderman’s Power Table Output

Argh. Biderman didn’t prepare a power table for Correlations. Somebody get him to do that.

Why be concerned about Power?

1. Assuming we create treatments to make a difference, it only makes sense to conduct research that has the greatest probability of detecting the difference we set out to make.

2. To provide insight into reasons for failure to reject the null (failure to find differences).

If we fail to reject the null, it will be due to one of two reasons.

a. The manipulation we implemented had no effect - that is, the actual effect size was zero. Our treatment did not make a difference.

b. The manipulation had an effect, but the statistical test had insufficient power to detect the effect of our manipulation. Our treatment made a difference but we were too lazy or poor or ignorant to use enough participants, and we didn’t detect it.

FOR Study example.

We performed a study investigating the effect of Frame Of Reference (FOR) instructions on the validity of Conscientiousness as a predictor of GPA. Our original sample had 150 students. The FOR effect was not significant. For this and other reasons, the study was not accepted at a conference.

We followed up by adding 150 more participants, on the assumption that the population FOR effect size was small, e.g., r=.1 or .2. For the 300 participant sample, the difference in validities between the nonFOR and the FOR condition was .07, quite small, but statistically significant.

If you fail to reject, you should estimate the actual effect size in your data. If the estimated effect size is small, then this indicates that your manipulation was not as powerful as you might have expected.

But if the sample estimate of effect size was large while your statistical test was not significant, that suggests that your sample size was too small.

When we don’t want high power

In general, high power is good. If the null hypothesis is false, we want to be able to correctly reject it.

There are instances, however, when we may not want to detect a difference even if it is there.

Examples

1) We're not interested in the difference.

E.g., We're interested in the effect of Type of Training. A Gender difference is found. We're not interested in gender differences. Nuts! Now we have to deal with them.

2) We're overwhelmed by differences already and don't have time to deal with any others.

We've conducted research evaluating Type of Training, Sex, Type of Job, Age of Employee. A Gender difference is found. Rats! We don't have time to deal with the gender effect at the present time.

3) The difference is incredibly small.

Suppose the average statistical test performance of the population of I/O students is 84.3 while the average statistical test performance of the population of Research students is 84.31. With large enough samples, say 10,000 I/O students and 10,000 RM students if scores vary little within each group, the difference would be statistically significant. Oh wow! I really care!

This issue is what is referred to as the issue of statistical vs. practical significance.

Any difference, however small or inconsequential, can be made statistically significant by increasing power (usually through larger samples.) But whether a statistically significant difference is worth our dealing with is another question. Many times, statistically significant differences are not worth dealing with.
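To see how sample size alone can push a trivial difference past the significance threshold, the sketch below computes the two-tailed p value for an observed difference of .01 points at several sample sizes. The within-group SD of 10 is a hypothetical value chosen for the illustration, and a z-test is used in place of the t-test. How many participants it takes depends heavily on that SD.

```python
from statistics import NormalDist

def two_tailed_p(observed_diff, sigma, n_per_group):
    """Two-tailed p value for a difference between two group means (z-test)."""
    se = sigma * (2 / n_per_group) ** 0.5  # standard error of the difference
    z = abs(observed_diff) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical within-group SD of 10 points; observed difference of .01 points.
for n in (10_000, 1_000_000, 10_000_000):
    p = two_tailed_p(0.01, sigma=10, n_per_group=n)
    print(f"n per group = {n:>10,}: p = {p:.3f}")
```

The difference itself never changes. Only the sample size does, and with enough participants the p value drops below .05.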

For this reason, it has become common practice to report not only the statistical significance of a difference, but also a measure of sample effect size - the estimated size of the difference, measured in a standardized fashion. That way, small differences which were detected by extremely powerful statistical procedures can be recognized for what they are: small differences. The GLM procedure in SPSS can print such sample effect sizes.

Using SamplePower to obtain power and sample sizes

SamplePower 3 is an add-on module available with the SPSS suite of programs.

It can be used to compute power. More often than not, however, it’s used to compute the sample size required to have a prespecified power for proposed research. That’s what will be illustrated here.

SamplePower opens with a blank screen, except for a randomly chosen tip.

Pull down File and choose New.

Independent Groups t-test

1) Specify the effect size. Do that by changing one of the population means to the desired effect size. Either the mean of population 1 or the mean of population 2 can be changed.

2) Adjust the N per Group until the desired power appears below.

To get the exact sample size for Power = 80%, pull down the Tools menu and choose “Sample size for 80% Power.”


Population Correlation Coefficient

1) Set the Population Correlation to the desired value.

2) Pull down Tools and choose “Sample Size for 80% power.”

One way Analysis of Variance

1) Initial Screen. Click on the “Number of Levels” field or the “Effect Size” field.

2) Enter the Number of categories in the field on the right.

3) Click on the appropriate effect size.

Pull down the Tools menu and choose “Sample Size for 80% Power”.

Two Way chi-square

Chi-square is unusual in that the power, and thus the required sample size, for a given difference between proportions depends on the specific values those proportions take on.

Sample size required to detect a .05 difference - .40 vs. .45.

1) Choose the two population proportions whose difference you’ll want to detect.

2) Pull down the Tools menu and choose “Sample size for 80% Power.”

Sample size required to detect a .05 difference - .05 vs. .10.

Note that the sample size required to detect the difference between .05 and .10 is much smaller than that required to detect the difference between .40 and .45.

The bottom line is that when testing hypotheses about population proportions, you must specify not only the difference in proportions, but also the two specific proportions whose difference you wish to detect.
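The dependence on the specific proportions can be seen with a standard normal-approximation formula for comparing two independent proportions. This is a sketch, not the chi-square routine in SamplePower, so the sample sizes will not match the program's output exactly. The function name is made up for the illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_proportions(p1, p2, power=0.80, alpha=0.05):
    """Approximate n per group to detect p1 vs. p2 (two-tailed, normal approximation)."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)  # larger when proportions are near .5
    return ceil(z_total ** 2 * variance_sum / (p1 - p2) ** 2)

print(".40 vs. .45:", n_per_group_for_proportions(0.40, 0.45), "per group")
print(".05 vs. .10:", n_per_group_for_proportions(0.05, 0.10), "per group")
```

Both comparisons involve a .05 difference, but proportions near .5 are more variable than proportions near 0, so the .40 vs. .45 comparison needs far more participants.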

Copyright © 2005 by Michael Biderman, 10/02/18