Power
STT Consulting
April 2010
www.stt-consulting.com
Contents
Introduction
Power, sample size, and effect size for 2 variances (F-test); G*Power application
Power analysis with “R”
Power calculations with EXCEL (known σ)
-General (z-test)
-Internal Quality Control (IQC)
Confidence interval for power
Software and references
Statistical Power (http://www.power-analysis.com/power_analysis.htm)
In very basic terms, statistical power is the likelihood of achieving statistical significance. In other words, statistical power is the probability of obtaining a p-value less than 0.05, for example. Obtaining p < 0.05 is exactly what many studies strive for, making the understanding of power calculations “a must”.
A power analysis is typically performed while a study is being planned. It is used to anticipate the likelihood that the study will yield a significant effect. Specifically, the larger the effect size, the larger the sample size, and/or the more liberal the criterion required for significance (alpha), the higher the expectation that the study will yield a statistically significant effect (= the higher the power will be).
These three factors (effect-size, alpha, n), together with power, form a closed system - once any three are established, the fourth is completely determined. The goal of a power analysis is to find an appropriate balance among these factors by taking into account the substantive goals of the study, and the resources available to the researcher.
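For example, base R's power.t.test() illustrates this closed system for the two-sample t-test: leave any one of the four quantities unspecified, and it is solved for (a minimal sketch; the numbers are arbitrary).
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)       # solves for power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80) # solves for n per group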
Role of Effect Size
The term "effect size" refers to the magnitude of the effect under the alternate hypothesis. The nature of the effect size will vary from one statistical procedure to the next, but its function in power analysis is the same in all procedures.
The effect size should represent the smallest effect that would be of clinical, analytical, or other significance. In clinical trials for example, the selection of an effect size might take account of the severity of the illness being treated (a treatment effect that reduces mortality by one percent might be clinically important while a treatment effect that reduces transient asthma by 20% may be of little interest). It might take account of the existence of alternate treatments (if alternate treatments exist, a new treatment would need to surpass these other treatments to be important).
Role of Alpha
Traditionally, researchers in some fields have accepted the notion that alpha should be set at 0.05 and power at 80% (corresponding to a beta of 0.20). This notion is implicitly based on the assumption that a type I error is four times as harmful as a type II error (the ratio of alpha to beta is 0.05 to 0.20), an assumption that has no basis in fact. Rather, it falls to the researcher to strike a balance between alpha and beta as befits the issues at hand. For example, if the study will be used to screen a new drug for further testing, we might want to set alpha at 0.20 and power at 95%, to ensure that a potentially useful drug is not overlooked. On the other hand, if we were working with a drug that carried the risk of side effects and the study goal was to obtain FDA approval for use, we might want to set alpha at 0.01 while keeping power at 95%.
Role of Sample Size
For any given effect size and alpha, increasing the sample size will increase the power.
Variation in the data (imprecision)
As always, high variation gives poor estimates (here, of power), unless the sample size is high. Note also that the standard deviation used in a power analysis is often taken from a pilot study. Therefore, it may be appropriate to calculate confidence intervals for power (Taylor DJ, Muller KE. American Statistician 1995;49:43-47; Tarasinska J. Statistics & Probability Letters 2005;73:125-130).
Illustration of the power concept
As stated above, statistical power is the likelihood of achieving statistical significance, i.e., the probability of obtaining a p-value less than, for example, 0.05: we wish to confirm an effect!
When performing a power analysis, we first have to define the α-error (2-sided). Here, we set it at the 5%-level (p = 0.05, z = 1.96).
Under null-hypothesis conditions, we obtain p < 0.05 in 5% of cases; however, these are false positives (no effect introduced).
When we introduce an effect (here: a shift of the population by k standard errors), the frequency of p < 0.05 increases: 50% at effect 1.96, and 90% at effect 3.2415 (1.96 + 1.2816). This is our power.
We can graph power versus effect (here, the shift) in a so-called power function; the one shown here is the power function of the 2-sided z-test.
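A minimal R sketch of this power function (the shift k is expressed in standard errors; 1.96 is the 2-sided 5% critical value):
k=c(0, 1.96, 3.2415)                     # shift in standard errors
power=pnorm(k-1.96)+pnorm(-k-1.96)       # P(|z| > 1.96) given shift k
round(power, 3)                          # 0.050, 0.500, 0.900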
Applications
Analytical imprecision (variance, standard deviation)
Assume you have an analytical method that samples volumetrically and you are not satisfied with its precision. You want to switch to gravimetrically controlled sampling, i.e., you want to IMPROVE precision. Typically, you work with a 100 µl sample volume (approximately 100 mg for aqueous samples). You analyze 6 aliquots sampled volumetrically and 6 aliquots sampled with gravimetric control. The results are: SDgrav = 5, SDvol = 8 (note: this corresponds to typically observed imprecision). The means of the two sets of results are not relevant here.
Statistical investigation
Perform a 1-sided F-test (you want to demonstrate an improvement; you know gravimetric control should be better): n = 6; SDgrav = 5 (VARgrav = 25); SDvol = 8 (VARvol = 64).
www.stt-consulting.com >Tests with estimates >F-test: p Value = 0.1627.
You are disappointed; gravimetric control may be better, but you could not demonstrate it.
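The p-value can be reproduced in R (a quick check; the variable names are illustrative):
sdgrav=5; sdvol=8; n=6
F1=sdvol^2/sdgrav^2   # variance ratio = 2.56
1-pf(F1,n-1,n-1)      # 1-sided p-value = 0.1627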
Several questions arise about how to obtain a significant test (α = 0.05):
1. What was the power of the initial experiment?
2. How many aliquots should you have measured at a given power (e.g., 0.9)?
3. How small should SDgrav have been with n = 6 at a given power (e.g., 0.9)?
We address these with the free G*Power software (http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/).
F-tests
>Variance: Test of equality (2 sample case)
Determine the effect size (note: the software wants the variance ratio!)
More background material: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/how_to_use_gpower.html
G*Power print-screens
1. What was the power of the initial experiment?
The power was roughly 24%.
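This can be checked in R with the same logic as the F-test power script given further below:
F1=64/25          # variance ratio under H1
F2=qf(0.95,5,5)   # 1-sided critical F (df = 5, 5)
pf(F1/F2,5,5)     # power = 0.2367, i.e., roughly 24%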
2. How many aliquots should you have measured at a given power (e.g., 0.9)?
You should have measured 41 aliquots each. Again, you are disappointed.
3. How small should SDgrav have been with n = 6 at a given power (e.g., 0.9)?
Note: the variance RATIO should have been 0.0573. Assuming that the variance of the old method is representative (VARvol = 64), VARgrav should have been 64 x 0.0573 = 3.67, and SDgrav should have been SQRT(3.67) = 1.92.
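Both numbers can be reproduced in R (a sketch, using the logic of the F-test power script below):
r=1/(qf(0.90,5,5)*qf(0.95,5,5))   # required variance ratio = 0.0573
sqrt(64*r)                        # required SDgrav = 1.92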
Conclusion
It is very difficult to demonstrate an improvement in imprecision with a small number of measurements (n < 10)!
Power curve (power and sample size)
The graph below shows the relationship between the power of the test (y) and the sample size (x). NOTE: the total number of results is shown on x (for example, 6 for vol plus 6 for grav). The curve starts at power = 0.2367 (n = 6 each) and ends at power = 0.9 (n = 41 each).
The table can be copied into EXCEL with CTRL-C.
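A sketch of the same curve in R, with power as a function of the per-group sample size (variance ratio fixed at 64/25; logic as in the F-test power script below):
n=2:50
pow=pf((64/25)/qf(0.95,n-1,n-1),n-1,n-1)
plot(n, pow, type="l", xlab="n per group", ylab="Power")
pow[n==6]    # 0.2367
pow[n==41]   # approximately 0.90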
Power analysis with R-software
R has an amazing variety of power calculations; however, they are spread over many different packages. The most general one is the package {pwr}: http://www.statmethods.net/stats/power.html
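For example (assuming {pwr} is installed), the power of a 2-sided, two-sample t-test with effect size d = 0.8 and n = 6 per group:
library(pwr)
pwr.t.test(n = 6, d = 0.8, sig.level = 0.05, type = "two.sample")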
To the best of my knowledge, {pwr} and the other packages lack power calculations for variances (Chi2, F). Therefore, R-scripts for the power of these tests are given below, together with a script for the power curve of an F-test.
R-script Power of Chi2-test
df=4                # degrees of freedom, n - 1 (here n = 5)
sd1=1               # SD under H0
sd2=sqrt(10)        # SD under H1
F1=sd2^2/sd1^2      # variance ratio = 10
F2=qchisq(0.95,df)  # 1-sided critical chi-square
F3=F2/F1            # Scaling factor for H1 distribution (G*Power tutorial)
1-pchisq(F3,df)     # power
Result: [1] 0.9174616 G*Power (one-sided): 0.9174616
Note: the 2-sided result differs slightly from G*Power: 0.89199 vs. 0.89228 (critical chi-squares: the same); with df = 2: 0.6915 vs. 0.6940.
R-script Power of F-test
df=4               # degrees of freedom per group, n - 1 (here n = 5)
sd1=1              # SD of method 1 (H0)
sd2=sqrt(15)       # SD of method 2 (H1)
F1=sd2^2/sd1^2     # variance ratio = 15
F2=qf(0.95,df,df)  # 1-sided critical F
F3=F1/F2           # Scaling factor for H1 distribution (G*Power tutorial)
pf(F3,df,df)       # power
Result: [1] 0.7856614 G*Power (one-sided): 0.7856614
Note: the 2-sided result differs slightly from G*Power.
R-script Power curve F-test
x=seq(1,40,by=1)  # variance ratios from 1 to 40
F2=qf(0.95,4,4)   # 1-sided critical F (df = 4, 4; n = 5 per group)
y=pf(x/F2,4,4)    # power at each variance ratio
plot(x, y, main="Power curve: 1-sided F-test (alpha=0.05, n=5)", font.main=1, cex.main=1, xlab="Variance ratio", ylab="Power", ylim=c(0,1), lab=c(5, 10, 2), type="l")
#Note: type="l" is the letter "l" (line), not the digit one
Power curves with EXCEL
General
Power curves with known σ (for example, for the z-test) can be calculated with EXCEL: www.stt-consulting.com >Statistics >Power Tutorial; similarly, for sample sizes: www.stt-consulting.com >Statistics >Power.
Internal quality control
The EXCEL-file Power Tutorial also contains some power functions for internal quality control rules.
The figure shows power curves for detecting systematic error (expressed in fractions of the stable standard deviation) by the s-rules (n = 1) with control limits 1.96s, 2.5s, 3s, and 3.5s (from left to right).
The figure demonstrates that rules with smaller control limits are more powerful; however, they also have a higher probability of FALSE rejection (see: P at 0).
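A minimal R sketch of such curves (assumption: a single measurement per run, rejection when it falls outside ± L standard deviations):
se=seq(0,5,by=0.5)              # systematic error in fractions of the SD
for (L in c(1.96, 2.5, 3, 3.5)) {
  p=pnorm(se-L)+pnorm(-se-L)    # probability of rejection at each error
  cat("L =", L, ":", round(p,3), "\n")
}
# At se = 0 and L = 1.96, p = 0.05: the probability of false rejection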
Educative power curves
Some educative power curves for the 1-sided F-test (α = 0.05) and others, generated with G*Power (Table copied to EXCEL), can be found at www.stt-consulting.com >Statistics >Power Tutorial.
The figure shows that power curves for the F-test with low n are quite "flat" and reach a desirable power (e.g., 0.9) only at relatively high variance ratios. For sufficient power of F-tests, sample sizes >10 are desirable.
Don’t forget confidence intervals
Because the standard deviation is usually estimated (e.g., from a pilot study), the computed power is itself an estimate, and a confidence interval can be given for it.
Taylor DJ, Muller KE. Computing confidence bounds for power and sample size of the general linear univariate model. American Statistician 1995;49:43-47.
See also: Tarasinska J. Confidence intervals for the power of Student's t-test. Statistics & Probability Letters 2005;73:125-130.
Software and references
Free software
G*Power
http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/
Background/educational material: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/how_to_use_gpower.html
Others
http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize (no F-test)
http://www.cs.uiowa.edu/~rlenth/Power/ (also online; no 1-sample F-test)
Commercial software
http://www.power-analysis.com/software_overview.htm
http://www.ncss.com/pass.html
Education
http://www.statsoft.com/textbook/power-analysis/
Educational text
Hypothesis Testing and Statistical Power of a Test. Hun Myoung Park, Ph.D.
http://www.indiana.edu/~statmath/stat/all/power/power.pdf