Abstract Title Page
Title:
The Robustness of Inferences from Randomized and Quasi-Experiments
Author(s):
Kenneth A. Frank, Michigan State University; Minh Q. Duong, Michigan State University; Spiro Maroulis, Michigan State University; Ben Kelcey, University of Michigan
Abstract Body
Background/context:
Research in education is fundamentally a pragmatic enterprise (Raudenbush, 2005). If a new curriculum, pedagogy, or school structure contributes to learning, then we should adopt it to improve student knowledge and ability (Cook, 2002). This is the spirit in which the What Works Clearinghouse was conceived (Eisenhardt & Towne, 2008), and it motivates calls for evidence-based practice (as reviewed by Slavin, 2008, p. 5).¹ Policy-oriented pragmatism distinguishes education from the disciplines on which it draws (e.g., economics, psychology, sociology), in which basic research about the underpinnings of human behavior or description of the human condition is valuable even when it has no clear immediate implications for policy or practice.
Because education is pragmatic, educational researchers must pay careful attention to the basis for making inferences from research that will inform policy. In Holland's (1986) language, because educational policymakers are concerned with manipulation, educational researchers are concerned with causation. Toward that end, the US Department of Education's National Center for Education Research (NCER) has drawn on the model of medicine to call attention to, and promote the use of, randomized trials. The ultimate goal is to inform policy by providing a sound, scientifically rigorous basis for making causal inferences. Random assignment supports that goal because, in the long run, it eliminates preexisting differences between subjects who receive different treatments (Fisher, 1935), allowing researchers to obtain unbiased estimates of treatment effects. The result is as described by Burtless (1995): "They [inferences from randomized experiments] do not become entangled in a protracted and often inconclusive scientific debate about whether the findings of a particular study are statistically valid. Politicians are more likely to act on results they find convincing" (p. 276).
Purpose/objective/research question/focus of study: One of the main goals of this article is to ground debate about causal inferences in the procedures used to make statistical inferences, and to propose an approach for quantifying the robustness of inferences, linked to the following questions:
1) For a randomized study, how robust is the statistical inference to the inclusion of data from other populations to which one would like to generalize?
2) For an observational study, how robust is the statistical inference to the inclusion of unmeasured confounding variables in the analysis?
The answer to the first question concerns external validity, informing policy by contributing to scientific debate and discussion about the extension of inferences from a sample from one population to other populations and circumstances (Cronbach, 1982). The answer to the second question concerns internal validity, informing policy by contributing to scientific debate and discussion about the unobserved factors that could invalidate the inference (Cook & Campbell, 1979). Quantifying the robustness of statistical inferences moves debate about causal inference from abstract generalities to specifics, providing a sharp, quantitative language for that debate. Moreover, we will then use the same statistical apparatus to quantify the extent to which the statistical control achieved in an observational study approximates the conditions of a randomized experiment.
Setting: First, Borman et al. (in press) assessed the effects of the Open Court Reading (OCR) curriculum on reading achievement by randomly assigning 49 classrooms either to OCR or to continuing their current practices.² In the second study, Hong and Raudenbush (2005) assessed the effects of kindergarten retention on reading achievement using an observational study.
Population/Participants/Subjects: California (Borman et al., in press); a nationally representative sample (ECLS-K; Hong & Raudenbush, 2005)
Intervention/Program/Practice: Open Court Curriculum; Kindergarten retention
Research Design: Quantitative reanalysis
Data Collection and Analysis: Given the concerns about the external validity of the randomized experiment in Borman et al. (in press) and about the internal validity of Hong and Raudenbush's (2005) observational study, we turn to quantifying the robustness of the inferences in terms of the general linear model that underlies much of social science research (Hepple, 1998). First, we apply Frank and Min's (2007) recent results to quantify the robustness of inferences to concerns about external validity in Borman et al.'s study. We then use the technique developed in Frank (2000) to quantify the robustness of Hong and Raudenbush's (2005) inference of the effect of retention on achievement to concerns about internal validity. Because we have tried to make this presentation generally accessible, we invite the more technical reader to engage the footnotes for more detail.
Frank and Min (2007) conducted a thought experiment in which a sample is made more representative by replacing observed cases with cases representing a population other than that from which the data were originally sampled. For example, consider that OCR may not have as strong an effect in East Coast states known for their progressive teaching approaches as it did in Borman et al.'s (in press) sample. What, then, must the relationship between OCR and reading achievement be in East Coast states such that, if East Coast states constituted half of Borman et al.'s sample, the inference regarding OCR would not be valid? Frank and Min's (2007) calculations show the inference that OCR improves reading achievement would be invalidated only if OCR were correlated at about -.1 with reading achievement in the East Coast states constituting the replacement sample (assuming the means and variances of OCR participation and reading achievement are constant across samples; see Frank & Min, 2007). That is, OCR would actually have to have a negative effect on reading achievement in the replacement data to invalidate the statistical inference that OCR has a non-zero effect on reading achievement. While these calculations do not create a new inference, they shift the debate from the possibility of different effect sizes in different contexts to the possibility of effects in different directions in different contexts. Thus we concur with Borman et al.'s inference that OCR has a positive effect on achievement.
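To make the replacement-of-cases reasoning concrete, the sketch below implements the weighted-average logic under Frank and Min's (2007) constant-means-and-variances assumption: the combined-sample correlation is (1 - pi) times the observed correlation plus pi times the replacement-sample correlation, where pi is the proportion of cases replaced. All numeric inputs are hypothetical placeholders, not the values from Borman et al.'s analysis.

```python
# A minimal sketch of the Frank & Min (2007) replacement-of-cases logic,
# assuming constant means and variances across samples. The numbers are
# illustrative, not taken from Borman et al.
from scipy.stats import t as t_dist

def critical_r(n, alpha=0.05, q=1):
    """Smallest correlation still significant (two-tailed) with n cases
    and q predictors besides the intercept."""
    df = n - q - 1
    t_crit = t_dist.ppf(1 - alpha / 2, df)
    return t_crit / (t_crit**2 + df) ** 0.5

def invalidating_r(r_observed, n, pi=0.5, alpha=0.05):
    """Correlation the replacement cases would need so that the combined
    correlation (1 - pi) * r_observed + pi * r_replacement drops to the
    critical value, where pi is the proportion of cases replaced."""
    return (critical_r(n, alpha) - (1 - pi) * r_observed) / pi

# Hypothetical: an observed correlation of .6 in a study of 49 classrooms,
# with half the sample replaced.
print(invalidating_r(r_observed=0.6, n=49))  # approx -0.04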
Similar to Frank and Min's (2007) approach to quantifying the robustness of inferences with respect to concerns about external validity, Frank (2000) quantified the robustness of inferences to concerns about internal validity. Frank defines the impact of a confounding variable on a regression coefficient as in Figure 2, through the convergence of arrows relating a confounding variable (v) to a treatment (t) and to an outcome (y). As expressed in the figure, Frank defines the impact of a confounding variable on a regression coefficient as the product r_v•t × r_v•y, where r represents a correlation. The product thus simultaneously accounts for the two relations that define a confounding variable: its correlation with the predictor of interest (r_v•t) and its correlation with the outcome (r_v•y).
In the example of the effect of retention on achievement, the treatment is retention, the outcome is achievement, and the confounding variable might be some unmeasured aspect of a student's motivation. Thus the correlation between retention and achievement, r_retention•achievement, will be impacted by the correlation between motivation and retention and the correlation between motivation and achievement: r_motivation•retention × r_motivation•achievement.⁷ The arrow in Figure 2 from retention to reading achievement narrows once impacted by the confounding variable, because the correlation between the predictor of interest (retention) and the outcome (achievement) is altered by the product of the two correlations involving the confound (r_motivation•retention × r_motivation•achievement).⁸
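A small sketch of this attenuation, using the standard partial-correlation formula on which Frank's (2000) definition of impact builds; the correlations below are hypothetical.

```python
# How the impact r_vt * r_vy attenuates the treatment-outcome correlation,
# via the standard partial-correlation adjustment. Values are hypothetical.

def partial_r(r_ty, r_vt, r_vy):
    """Treatment-outcome correlation after partialling out a confound v."""
    impact = r_vt * r_vy
    return (r_ty - impact) / (((1 - r_vt**2) * (1 - r_vy**2)) ** 0.5)

# A confound correlated .3 with retention and .4 with achievement has
# impact .3 x .4 = .12 and shrinks an observed correlation of .25:
print(partial_r(r_ty=0.25, r_vt=0.3, r_vy=0.4))  # approx 0.149
```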
Frank (2000) uses the definition of impact to quantify the conditions necessary to invalidate a statistical inference. Applying Frank's calculations to Hong and Raudenbush's (2005) estimate of the effect of retention on reading achievement, the magnitude of the impact of an unobserved confounding variable must be greater than .12 to reduce the association between retention and achievement below the threshold for statistical significance.⁹ This implies that the confounding variable would have to be correlated with retention and with achievement at .35 (i.e., the square root of .12) or greater to invalidate the inference.
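The threshold follows from setting the adjusted correlation equal to the critical correlation. The sketch below assumes, as Frank (2000) does for the threshold case, that the confound is equally correlated with treatment and outcome (r_vt = r_vy = sqrt(k)), so the adjusted correlation is (r - k) / (1 - k); the observed and critical correlations are hypothetical values chosen only to mirror the magnitudes reported here, not Hong and Raudenbush's estimates.

```python
# A sketch of Frank's (2000) impact threshold for a confounding variable
# (ITCV): solve (r - k) / (1 - k) = r# for the impact k. Inputs are
# hypothetical, chosen to mirror the magnitudes in the text.

def itcv(r_observed, r_critical):
    """Impact k at which the adjusted correlation just equals r#."""
    return (r_observed - r_critical) / (1 - r_critical)

k = itcv(r_observed=0.20, r_critical=0.09)
print(k, k**0.5)  # approx .121 and .35: the impact threshold and the
                  # component correlations needed to reach it
```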
To place the magnitude of the impact threshold (.12) in a meaningful context, we compare the threshold with the impacts of observed covariates known to be theoretically and empirically significant (Frank, 2000; Frank, Sykes, Anagnostopoulos, Cannata, Chard, & Krause, 2008; Lin, Psaty, & Kronmal, 1998; Rosenbaum, 1986, 2002). In Hong and Raudenbush's (2005) analyses, the largest impact (in magnitude) is for a teacher's evaluation of a student's approach to learning (SAL) as measured in the spring of kindergarten (Hong & Raudenbush, 2006, identify SAL as the most important confounder to use for sensitivity analyses of retention effects in the ECLS-K data).¹⁰ The impact of SAL equals r_SAL•retention × r_SAL•achievement = -.1849 × .4442 = -.08213, with the negative sign indicating that controlling for SAL would decrease the retention effect.¹¹ Critically, the impact of SAL is about two thirds of the threshold value of .12. Thus, some unexplained factor would have to be 50% more important than the most important measured variable to invalidate the statistical inference that retention has a non-zero effect on achievement.¹² From this analysis, we concur with Hong and Raudenbush's inference that retention reduces achievement.
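Restated as arithmetic (the two correlations are those reported by Hong and Raudenbush; the ratio is our restatement):

```python
# Impact of SAL and its ratio to the .12 threshold.
impact_sal = -0.1849 * 0.4442
print(round(impact_sal, 5))              # -0.08213
print(round(0.12 / abs(impact_sal), 2))  # 1.46: an unmeasured confound would
                                         # need ~50% more impact than SAL
```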
We have quantified the robustness of Borman et al.'s (in press) inference in terms of the correlation between OCR and reading achievement in a replacement sample, and we have quantified the robustness of Hong and Raudenbush's (2005) inference in terms of the correlations associated with an omitted confounding variable. Quantifying robustness in terms of correlations grounds our sensitivity analysis in the statistical techniques used to support the inference. Thus it uses the statistical apparatus commonly employed to make inferences from quantitative analyses in the social sciences to characterize uncertainty about the causal inference. We now use the same apparatus to quantify the extent to which the statistical control employed in a quasi-experiment approximates the theoretical conditions of a randomized experiment.
Findings/Results: Having quantified the capacity of the pretests to absorb the impacts of other covariates, we now use the randomized experiment to theoretically quantify the level of statistical control achieved in a quasi-experiment. While the statistical implications of randomization are certainly unquestioned, we wish to emphasize that randomization eliminates differences between groups only in the long run. That is, when there are only a small number of subjects, some covariates may be associated with the treatment and with the outcome in a given sample.
Consider a small randomized study of retention, in which 15 kindergarteners are randomly assigned to be retained and 15 others are promoted. Even with randomization, it could easily happen that the 15 retained students come from lower socioeconomic backgrounds than those who were promoted, and that socioeconomic background is related to achievement in the sample. Therefore, any observed difference between retained and promoted students in the sample may be due to underlying differences in the students' socioeconomic status. More generally, some covariates may have impacts greater than zero in a sample, even when treatments are randomly assigned. While the standard errors calculated in typical statistical analyses of randomized experiments accurately reflect the potential for some factors to be associated with both the treatment and the outcome in a sample, here we use the idea of "imperfect control in a small randomized experiment" to establish a baseline for evaluating the statistical control achieved in a non-randomized study.
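A Monte Carlo sketch (our illustration, not a procedure from the paper) makes this concrete: with n = 30 and a covariate correlated .4 with the outcome in the population, the impact r_v•t × r_v•y of the covariate is zero only on average, and sizable impacts occur by chance.

```python
# Simulated impacts of a covariate under random assignment in a small
# experiment. The covariate (e.g., SES) is correlated r_vy with the
# outcome; treatment is randomized, so its sample correlation with the
# covariate is nonzero only by chance.
import numpy as np

rng = np.random.default_rng(0)

def simulated_impacts(n=30, r_vy=0.4, reps=10_000):
    impacts = np.empty(reps)
    for i in range(reps):
        v = rng.standard_normal(n)                                # covariate
        y = r_vy * v + (1 - r_vy**2) ** 0.5 * rng.standard_normal(n)
        t = rng.permutation([0] * (n // 2) + [1] * (n - n // 2))  # random assignment
        impacts[i] = np.corrcoef(v, t)[0, 1] * np.corrcoef(v, y)[0, 1]
    return impacts

imp = simulated_impacts()
print(imp.mean(), np.abs(imp).mean())  # mean near 0, but the typical
                                       # magnitude |impact| clearly is not
```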
Intuitively, larger impacts occur in smaller randomized experiments, as randomization reduces the relationship between treatment and any individual attribute to zero only in the long run. For purposes of drawing comparisons between randomized and quasi-experiments, the question then is: in a randomized experiment, what is the theoretical relationship between impact and sample size? If one could answer that question, keeping in mind that the impacts of observed covariates in quasi-experiments are easily calculated, then one could express the control achieved in a quasi-experiment in terms of the theoretical expectation in a randomized experiment. In fact, Pan's (Pan, 2002; Pan & Frank, 2004a, 2004b) highly accurate approximations of the distribution of impacts as a function of sample size provide the answer to our question.¹⁷ The approximation can be used to support statements such as "the impacts of the observed covariates in a particular observational study approximate the theoretical impacts of a randomized experiment of size X."
Returning to the estimated effect of retention on achievement, recall that we identified students' approaches to learning (SAL) as an important confounder and that, after controlling for the pretests, the magnitude of the impact of SAL was .00126. Based on Pan's approximations, a mean impact of .00126 would theoretically occur in a randomized experiment of size 171.¹⁸,¹⁹ Thus we say that the statistical control in this example sustains an effective n of 171.
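We do not reproduce Pan's analytic approximation here. As a rough stand-in, one can invert the simulation above (continuing that sketch and reusing simulated_impacts) to find the sample size at which the expected impact magnitude under randomization falls to a given observed, controlled impact; the target impact below is illustrative.

```python
# A crude, simulation-based substitute for Pan's approximation: the
# smallest candidate n whose mean |impact| under randomization is at or
# below the observed (controlled) impact magnitude.

def effective_n(target_impact, candidate_ns=range(10, 500, 10), **kw):
    for n in candidate_ns:
        if np.abs(simulated_impacts(n=n, **kw)).mean() <= target_impact:
            return n
    return None

print(effective_n(0.05))  # roughly 50 under these illustrative settings;
                          # the paper's .00126 -> n = 171 comes from Pan's
                          # approximation, not from this sketch
```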
Critical to our interpretation is the capacity of the pretest to reduce, or absorb, the impact of confounding variables. Prior to controlling for the pretest, the magnitude of the impact of SAL is .08213, translating to an effective n of 10 or less, considerably smaller than the effective n of 171 after controlling for the pretests. Without the pretest there is limited capacity to control for confounding variables, and so the study would approximate only a very weak experiment (of size 10). This underscores the importance of a quasi-experimental design with pretests that absorb the impacts of other predictors.
Conclusions:
The optimal research design would employ randomization both to select subjects from a population and to assign subjects to treatment conditions. But such a double use of randomization is rarely practical in educational research because of the need to recognize individuals' rights to choose whether to participate in a randomized trial or to choose educational treatments and experiences. As a result, a single design can rarely have unquestionable external and internal validity.
Are educational researchers destined to repeat the debates between Lee Cronbach and Donald Campbell of the 1980s with no clear progress in our thinking about the philosophy of educational science? At the very least, the debates of the 1980s spurred many advancements, such as effectiveness studies of randomized trials across heterogeneous populations, and regression discontinuity designs, propensity scores, and case matching for observational studies. No doubt the use of these advancements has moved us beyond the state of the technical art of the 1980s.