H 615 Week 3 Comments/Critiques

Prior to this week’s readings, I was unaware of the great variety of quasi-experimental designs that exist. The utility of quasi-experimental designs was supported across all of the readings, from slightly different perspectives. Shadish et al. (2002) argued that commonly used quasi-experimental designs can be strengthened with the inclusion of design elements such as pretests and control groups to reduce the plausibility of validity threats and, particularly when added to interrupted time series, perhaps rival the causal inferences produced by randomized experiments. Cook et al. (2008) identified three conditions under which causal estimates from quasi-experimental designs rival experimental estimates: regression discontinuity studies, matching of groups on the pretest, and a fully known selection process into treatment. Des Jarlais et al. (2004) cite multiple sources in support of including findings from quasi-experimental designs in the development of evidence-based recommendations to strengthen public health practice. All of the readings promote the use of nonrandomized designs together with, or as possible alternatives to, randomized experiments. Ultimately, the best study design depends on the research question, consideration of validity threats, feasibility, and the intent to evaluate efficacy vs. effectiveness. I was intrigued by the nonequivalent dependent variable posttest, and curious about how frequently it is used in study design.

While we would prefer to study human health and development using randomized experiments, this is sometimes impossible due to logistic, funding, and ethical constraints. Cook et al. (2008) argue that the social sciences will always need alternatives to randomized experiments, the key issue being how investigators can design strong nonrandomized studies. They also point out the important difference between a quasi-experimental and a non-experimental study, and indicate that comparing randomized experiments with poorly designed nonrandomized studies is unfair. Shadish et al. (2002) argue that quasi-experimental studies can sometimes support causal inference, but doing so requires collecting more data and making more analytical assumptions; they encourage researchers to place emphasis on the former rather than the latter, and caution them to tolerate the ambiguity of results despite their efforts to build stronger designs. One way to increase the relevance of non-experimental designs in the field of evidence-based public health is to improve reporting standards: being transparent and clear, and recognizing pitfalls and limitations as well as strengths, as stated by Des Jarlais et al. (2004). After all, a possible threat to validity is not always a plausible one, and we should not automatically dismiss the potential of nonrandomized studies.

When randomized controlled trials are not the appropriate choice, researchers must turn to quasi-experimental study designs. Shadish et al. (2002) provided useful guidelines for designing studies, starting with the most basic design, a posttest only, and moving to sounder designs that incorporate control groups with multiple pretests as well as posttests. Quasi-experimental designs have been accused of not being rigorous enough for their results to support causal statements. This idea was challenged by Cook, Shadish, and Wong (2008), who, comparing the results of randomized experiments with those of exemplary observational studies, found that both types of design produced similar patterns of statistical significance. These results held across multiple studies. The importance of exemplary study design and methods for interventions, as well as detailed descriptions of that design, was the focus of the Des Jarlais, Lyles, & Crepaz (2004) commentary. Des Jarlais et al. developed a most useful set of guidelines, TREND, to help researchers evaluate the soundness of an intervention study that uses a quasi-experimental approach. Researchers can use TREND to determine whether articles present enough information about their study and methods to be of actual use in a broader context.

In the undergraduate class I TA for, someone recently (erroneously) referred to a study thusly: “they gave some pregnant women more alcohol and some less, and looked at the effects on their kids.” Social scientists must often forgo experiments to avoid such obvious ethical concerns. Researchers may bemoan the fact that we cannot conduct experiments on every relationship of interest, and SCC certainly show that all non-experimental designs are subject to validity threats. However, the take-home message of these readings is that through careful, thoughtful design, quasi-experiments can produce valid outcomes. The issue with many quasi-experimental designs is that the researcher has not taken care to avoid threats to internal validity. This is the case with selection bias, as Cook, Shadish, and Wong point out. I can think of many studies in which this threat occurred, but researchers relied on “off-the-shelf” selection correlates like race, SES, and gender instead of exploring specific theoretical selection correlates. The TREND guidelines proposed by Des Jarlais et al. further help researchers to design studies that meet high standards of validity, and to publish findings in digestible and analyzable ways. These ideas will help me to design more effective intervention studies in the future.

These readings restored my faith in our collective ability to successfully (with practical means) measure change in people and their surrounding world (although Table 4.3 read like a synopsis of social psychology). They all highlighted the importance of careful thought and planning for designs other than RCTs, and emphasized how to phrase any implications about causal effects resulting from such studies. I admit that I’m far more comfortable with the designs that use both control groups and pretests, especially the double pretest (p. 145). Suppose there were two pretests of parent involvement, followed by the treatment of childcare setting (Head Start, daycare, parent care, or preschool), and then a posttest of parent involvement conducted near the end of the treatment. How much of a concern would it be if the posttests were given over a series of weeks, with some participants having nearly completed the treatment, others having just completed it, and others having completed it some time ago? Page 158 briefly mentions that in an ideal setting the treatment is temporally separated from the posttest, but it does not address any statistical approaches for coping with the varied timing of posttests.
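One option, offered here purely as an illustration and not drawn from the readings, is to record each participant’s time between the end of treatment and the posttest and adjust for it statistically. The sketch below uses simulated data and hypothetical variable names to show an ANCOVA-style regression in Python that includes weeks since treatment ended as a covariate alongside the two pretests.

# A minimal sketch (not from the readings) of handling varied posttest timing
# by adjusting for weeks since treatment completion; all names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pretest1": rng.normal(50, 10, n),        # first pretest of parent involvement
    "pretest2": rng.normal(50, 10, n),        # second pretest of parent involvement
    "treatment": rng.integers(0, 2, n),       # 1 = treatment setting, 0 = comparison
    "weeks_since_end": rng.uniform(0, 8, n),  # varied timing of the posttest
})
df["posttest"] = (0.5 * df["pretest2"] + 5 * df["treatment"]
                  - 0.3 * df["weeks_since_end"] + rng.normal(0, 5, n))

# Estimate the treatment effect while adjusting for both pretests and timing.
model = smf.ols("posttest ~ treatment + pretest1 + pretest2 + weeks_since_end",
                data=df).fit()
print(model.summary())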

Quasi-experiments are those that resemble randomized experiments but do not include random assignment in the research design because of various constraints. These experiments may limit the ability to draw descriptive causal inferences, especially when conducted without a pretest or control group. Careful consideration of alternative explanations and the minimization of threats to validity through a priori design elements can strengthen the descriptive causal inferences drawn from quasi-experiments.

If randomization is not possible, then in addition to research design elements, a researcher can use other methods of assignment, such as a cutoff point on a specified variable, as in the regression discontinuity design (RDD), in order to approximate randomization. RDD has been shown to be concordant with random assignment in a few studies that used within-study comparisons to assess the outcomes of both methodologies (Cook & Wong, 2013). Another strategy to improve causal inferences drawn from quasi-experiments has been suggested by the TREND group, who follow clinical trials researchers in proposing a transparent reporting system for evaluations that would standardize reporting procedures for behavioral interventions. Both of these approaches represent attempts to increase the validity of the quasi-experimental designs that are commonplace within the constraints of the social sciences.
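As a purely illustrative sketch of the RDD logic described above, and not something presented in the readings, the effect at the cutoff can be estimated by regressing the outcome on a treatment indicator, the centered assignment variable, and their interaction; the cutoff value and all variable names below are hypothetical.

# A minimal regression discontinuity sketch with simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
score = rng.uniform(0, 100, n)           # assignment variable (e.g., a needs-assessment score)
cutoff = 60.0
treated = (score >= cutoff).astype(int)  # assignment determined entirely by the cutoff
outcome = 20 + 0.2 * score + 8 * treated + rng.normal(0, 4, n)

# Treatment indicator, centered score, and their interaction allow
# different slopes on each side of the cutoff.
centered = score - cutoff
X = sm.add_constant(np.column_stack([treated, centered, treated * centered]))
fit = sm.OLS(outcome, X).fit()
print("Estimated effect at the cutoff:", fit.params[1])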

It seems clear to those in the social and behavioral sciences that the RCT, while long considered the holy grail of intervention research, is not always an ethically or practically feasible option as we seek to identify effective and efficacious interventions. As Cook, Shadish, and Wong discuss, there are cases in which RCTs are not necessary, where the causal inferences of nonrandomized experiments can take the place of RCT results, with the proper statistical adjustments in place as a precaution. Even within-study analysis of results provides little clarity about the level of agreement between studies without a systematic method of analyzing results and effect sizes from differing methodologies. The TREND guidelines for publication of nonrandomized studies offer a set of critical factors that authors should consistently include in manuscripts to facilitate comparison of results across studies. I wondered as I read these articles, given the design considerations raised by SC&C: why is the RCT so highly regarded in research if other methods of inquiry can offer insight into causation as well? And do we lose a valuable aspect of the causal relationship by clinging so tenaciously to the RCT?

The readings for this week suggest that though quasi-experiments contribute to the field, such studies are “weaker” than their experimental counterparts. As noted by Cook et al. (2008), “the randomized experiment reigns supreme institutionally supported through its privileged role in graduate training, research funding, and academic publishing” (725). However, the same piece suggests that quasi-experiments can contribute to the field when threats to validity are accounted for. Similarly, the book chapters suggest that quasi-experiments should be included in our methodological repertoire (134), but that with such studies researchers need to “tolerate ambiguity” (161), essentially perpetuating the prevailing scientific norms privileging experimental designs. This is echoed by Des Jarlais and colleagues (2004), who state that “when empirical evidence from RCTs is available, ‘weaker’ designs are often considered to be of little or no evidentiary value” (361). What are the consequences of this? Much like the tendency to overlook null findings and qualitative research, could being overly critical of quasi-experimental designs lead to the dismissal of potentially important contributions? Given real issues of practicality and restrictions, is it necessary to change these attitudes/norms in the field? If so, how can we go about doing that?

To be completely honest, navigating the many designs and factors needed to draw inferences free of validity threats appears to be more challenging than actually performing a study. I am also curious to know what exactly a journal is looking for when it reviews a study for publication. Are reviewers addressing issues in design and threats to validity? Does a study’s ability to support causal inference increase a journal’s impact factor? For instance, do studies that employ more appropriate study designs and methodology translate into a stronger overall impact of the journal on the field of research?

The resources available to a researcher and his or her overall ability to perform a randomized experiment influence, and often decide, the study design. Is the amount of attention quasi-experiments and nonexperimental studies require to reduce threats to validity and improve the chances for causal inference as valuable as being able to execute a randomized experiment? Do researchers get “lazy” in designing a quasi-experimental study because there are so many factors to consider?

Finally, how deep is the evidence for synthesizing results across randomized experimental, quasi-experimental, and nonexperimental study designs? Earlier reviews found discrepancies between randomized experiments and observational studies, but a more recent review found the opposite: observational research produced outcomes similar to those of experimental designs.

Quasi-experimental designs, whose nuances are discussed in detail in Shadish, Cook, & Campbell (2002), share many features with randomized experiments (RCTs) aside from randomization. Although RCTs are considered the “gold standard” for causal inference, Cook, Shadish, & Wong (2008) present within-study comparisons demonstrating that some quasi-experimental designs produce results that mirror findings from experiments: regression discontinuity, abbreviated interrupted time series, and simpler quasi-experimental designs, provided the criteria of population matching (e.g., Bloom et al., 2005) and careful measurement selection (e.g., Shadish et al., in press) are met. Thus policy makers should have reasonable confidence in their use of quasi-experimental studies in their efforts toward evidence-based policy creation, an especially valuable notion when random assignment is not feasible. The TREND checklist creates a framework to improve the quality of data reported for quasi-experimental studies and lends further support to policy makers’ decisions (Des Jarlais et al., 2004). Although this outlook is optimistic, questions still remain for quasi-experiments, especially regarding the contexts and conditions that produce unbiased results.

Campbell and Stanley, in 1963, state that there is no perfect study, and the general point of CS and CCS is to examine the validity-based strengths and weaknesses of different designs and the possible counterbalances of those weaknesses, all with the understanding that designs may be dictated by situational constraints. While within-study comparisons that examine which quasi-experimental methods produce results mimicking RCT results (such as those examined by Cook et al.) can provide support for stronger causal statements when RCTs are not possible, these studies exacerbate the largely arbitrary debate over which quasi-experimental method is “best.” Instead, researchers should recognize the utility, strengths, weaknesses, and situational constraints of each design, and take what can be learned from a study based on its strengths to help inform future studies on the relationship of interest. Fewer resources should be spent on deciding which design is “better” and more on the publication of more detailed study designs for RCTs, quasi-experiments, and non-experimental studies, as advocated by Des Jarlais et al. Such publications would allow researchers to make more informed conclusions about the results of single studies and thus strengthen their ability to use those results to inform future studies.

After reading Chapters 4 & 5 in addition to the Cook, Shadish, and Wong article, I am still fuzzy on when the use of propensity scoring versus a regression discontinuity design versus an instrumental variable is appropriate to strengthen what Cook et al. categorize as non-experimental study designs. I am specifically thinking of secondary data analysis projects using cross-sectional data or perhaps only two years of panel survey data; how would I determine which of these methods is the most appropriate for creating a comparable comparison group to minimize or detect selection bias? Further, should “off-the-shelf” covariates be used to create and apply a propensity score when matching treatment to comparison cases if those covariates are all that are available? In a similar vein, is the “shotgun” approach to selecting variables for a propensity score that bad if we don’t know exactly what combination of variables/factors is correlated with treatment and effect(s) (i.e., when theoretical/empirical literature on a given cause and its effect(s) is scant)? More broadly, how can we discern when adding one or more of these design elements to strengthen the causal inference from a given study is worth pursuing (and/or worth convincing others we should pursue)?
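For concreteness, here is a toy sketch, not taken from the readings, of the propensity-score step in question: estimating scores from a handful of hypothetical “off-the-shelf” covariates and then matching each treated case to its nearest comparison case (with replacement) on the estimated score.

# A toy propensity score sketch with simulated data; covariate names are
# hypothetical "off-the-shelf" examples, not a recommended variable set.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "age": rng.normal(35, 10, n),
    "income": rng.normal(40000, 12000, n),
    "female": rng.integers(0, 2, n),
})
# Simulated selection into treatment that depends on the covariates.
logit = -8 + 0.05 * df["age"] + 0.0001 * df["income"] + 0.3 * df["female"]
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# 1) Estimate propensity scores from the available covariates.
covs = ["age", "income", "female"]
ps_model = LogisticRegression(max_iter=1000).fit(df[covs], df["treated"])
df["pscore"] = ps_model.predict_proba(df[covs])[:, 1]

# 2) For each treated case, find the nearest untreated case on the score.
treated_df = df[df["treated"] == 1]
control_df = df[df["treated"] == 0]
matches = control_df.iloc[
    [(control_df["pscore"] - p).abs().values.argmin() for p in treated_df["pscore"]]
]
print("Matched comparison group size:", len(matches))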

The readings for this week raised questions in my mind about the assumption in randomized experiments (REs) that random assignment to treatment and control groups is indeed enough to eliminate bias. The chapters by Shadish, Cook, and Campbell talk about building designs that render threats to internal and external validity implausible. Cook, Shadish, and Wong, when comparing REs to QEDs, suggested that this is plausible outside of REs and that the key is greater precision in the selection and measurement of covariates to reduce the influence of bias; the more effort, generally, the more comparable the results. Where I am struggling is with how we can say an RE is superior because of random assignment. I understand the statistical concept, but the RE examples in the book and articles involve random assignment from a subset of a population (schools in a school district among a nation of school districts), and because random assignment was used the experiments were somehow better. Don’t we still have bias, hidden bias, and still need to face all of the threats to validity? I am not sure how REs are any different from or superior to QEDs in this situation. In either case researchers need to be attentive to all threats to validity and still face selection bias. I know I am oversimplifying this, but if a well-done QED can approximate an RE, why do we need REs? Why not consistently employ meta-analyses? [You are missing something critical – and I am glad you laid it out! Randomization provides for maximum internal validity – that is, causal inference. It does nothing for external validity (generalizability). So, when causal inference is most important, which it often is, then an RCT is the strongest approach. When generalizability is most important, which it is AFTER we know that something is efficacious, then other designs might be better. However, as we will see in later chapters, there are things we can do to enhance generalizability from RCTs.]

Various reasons can lead researchers to exclude randomization, control groups, or pretest observations, moving away from the long-standing, gold-standard tradition of RCTs. While causal inferences could be compromised without such features, the readings were compelling on how best to guard against validity threats in quasi-experimental designs. Questions for my own research arose: Could one-group posttest-only designs be appropriate for programs aiming to increase pro-social bystander behavior, given that this knowledge and these skills are not commonly taught in other areas of one’s life and thus could be attributed only to the intervention? For research on rare events, such as bystander intervention, finding suitable proxies is challenging. Could the TTI’s “Related Behaviors” serve as a proxy if the literature supports covariance between the related behavior and the behavior of interest? If found to be valid measures, students could potentially be matched based on their likelihood of intervening in other, related risk situations. This strategy would be more compelling if this construct were a near-perfect predictor of intent to intervene in dating violence and sexual assault situations, but it would not be impossible to support as valid if journal space allowed enough room for investigators to be transparent in their explanations for such design features.