Symposium Title: Issues in the Design, Implementation, and Analysis of Blocked Group-Randomized Trials
Chair: Beth Ann Griffin, Associate Statistician, RAND Corporation
Papers (structured abstracts below):
1) Intraclass Correlation Values for Student Achievement Tests in Oregon
*Michael T. Coe, Ph.D., and *Makoto Hanita, Ph.D., Northwest Regional Educational Laboratory
2) Evaluating the impact of blocking on power in group-randomized trials
*Beth Ann Griffin, Daniel McCaffrey, John Pane, RAND Corporation
3) Examples of Blocking in Group-Randomized Trials
*Daniel F. McCaffrey, John F. Pane, Mary Ellen Slaughter, and J. R. Lockwood, RAND Corporation, and Matthew G. Springer, Vanderbilt University
4) The Analysis of Matched-Pairs Group-Randomized Trials
*Andres Martínez, University of Michigan and Jessaca Spybrook, PhD, Western Michigan University
*Designates the presenters for each paper.
Discussant: Spyros Konstantopoulos, Assistant Professor, Department of Educational Research, Measurement, and Evaluation, Boston College
Rationale (550 word limit):
Group-randomized trials have become increasingly common in program evaluation in education research. In these studies, intact groups of individuals (e.g., schools or classrooms) are assigned at random to an intervention group or a control group, and the goal of the evaluation is to determine the causal effect of the intervention in question. The use of blocking, in which groups are classified into subclasses called blocks so that similar groups share a block and randomization occurs within blocks, provides several important advantages for group-randomized trials. First, it can improve the power of such designs by reducing the variability in the balance between the treatment and control arms, thereby increasing the precision of the treatment effect estimates. Second, blocking increases the face validity of the design and provides insurance against drop-out. However, care must be taken when designing, implementing, and analyzing data from these studies. First, questions remain about what intraclass correlation values should be assumed. Also, it is unclear how much power can actually be gained in real settings, where researchers have limited control over which schools enter the study and only general administrative data with which to form blocks. Moreover, the involvement of teachers in the delivery or receipt of an intervention creates challenges for designing group-randomized experiments that must be taken into account. Finally, many researchers remain unclear on how to properly account for blocking in the analysis phase when blocking is incorporated into the design.
The goal of this symposium is to discuss key issues in the design, implementation, and analysis of group-randomized trials. The first paper will present ICC values for Oregon achievement tests across subjects, grade levels, and key student and school subgroups, making these values widely available to researchers planning to use state assessment data as outcome measures for experimental and quasi-experimental designs. The second paper will examine how much power can be gained or lost by using matched pairs in the design of a group-randomized trial, and which factors (socio-demographic and/or scholastic variables) play a key role in reducing the variability of the outcome in a large group-randomized trial of a new technology-based Algebra I curriculum. The third paper will describe the challenges randomized trials face when teachers are involved in the delivery or receipt of an educational intervention, two distinct study designs that can address these challenges, and the role that blocking played in those designs. The fourth paper will review how to account for blocking in the analytic stage of a study designed as a blocked group-randomized trial and examine the consequences of ignoring the blocking in the analysis. Together, these four papers will advance attendees' knowledge of how best to design, implement, and analyze data from blocked group-randomized trials. The symposium brings together the work of well-respected statisticians and education researchers whose work has focused on improving study designs and methods for estimating causal effects in education research.
Abstract Title Page for Paper 1
Title:
Intraclass Correlation Values for Student Achievement Tests in Oregon
Author(s):
Michael T. Coe, Ph.D., and Makoto Hanita, Ph.D., Northwest Regional Educational Laboratory.
Abstract Body
Background/context: Description of prior research and/or its intellectual context and/or its policy context.
Group-randomized studies, also called cluster-randomized trials (CRTs), are an increasingly important category of research aimed at evaluating practices in field settings where groups of individuals are given the same treatment. This situation is common in many education, criminal justice, public health, and community social service settings.
In the past, statistical methods for studies in which individuals are randomly assigned to experimental conditions have been mistakenly applied to studies in which groups of individuals were randomly assigned to conditions. The great majority of studies involving group-randomized designs that have been published in recent decades were analyzed incorrectly.
Briefly, the problem is that outcome data from a cluster of individuals (such as students in the same classroom) tend to be somewhat similar, with lower natural variation than the more diverse outcome data obtained from a random sample of individuals from the same population. Since statistical methods for comparing treatment groups to control groups use random variation between individuals as the background against which to judge whether differences between groups are significant, this reduction in variation among individuals distorts the statistical tests.
Specifically, the reduction in background individual variance due to clustering causes differences between groups to appear “significant” in a statistical sense more often than they should. In technical terms, the statistical assumption that the individual observations are independent of one another is violated, the error variance estimated from the data set understates the true variance in the population, and the Type I error rate of the statistical tests is inflated. Many researchers in the past have designed group-randomized studies but then analyzed the clustered data as if they had come from a random sample of individuals from the population. This has resulted in poorly designed studies and incorrect conclusions about the effectiveness of interventions.
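To see the size of this distortion, a standard illustration (the notation here is ours, not part of the original analyses) is the design effect for a cluster of size m with intraclass correlation ρ (defined formally below):

\[
\mathrm{Deff} = 1 + (m - 1)\,\rho
\]

With classrooms of m = 25 students and ρ = 0.15, Deff = 1 + 24(0.15) = 4.6: an analysis that ignores clustering understates the variance of the group means by a factor of 4.6, and hence understates standard errors by a factor of √4.6 ≈ 2.1.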
In addition to incorrect analyses of data from these studies, group-randomized research designs present another problem. When planning future studies, researchers try to predict how many participants will be needed to maximize the efficiency of the study. A sample size that is too large wastes time and money by testing more participants than are needed to answer the question, while a sample size that is too small has little chance of finding stable, replicable results, also wasting time and money. A great deal of work has been done to aid researchers in designing studies with optimal efficiency. This planning phase is typically called an analysis of statistical power, with “power” in this context referring to the sensitivity of the design to detect meaningful differences between groups.
In group-randomized studies, there are various methods for properly analyzing existing data sets, as well as for conducting power analyses to plan future studies. The pace of development in this area is currently swift. New methods for conducting proper power analyses in cluster-randomized designs take into account the expected degree of similarity in the outcome data collected from a cluster of individuals. This parameter is known as the intraclass correlation coefficient (ICC). In education, the ICC is relevant to any study in which individual student outcomes are measured and clusters such as classrooms, schools, or districts are randomly assigned to a particular practice or policy condition.
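In its standard two-level form (a textbook definition, stated here for completeness), the ICC is the proportion of total outcome variance that lies between clusters:

\[
\rho = \frac{\tau^2}{\tau^2 + \sigma^2},
\]

where τ² is the between-cluster variance and σ² is the within-cluster (student-level) variance. When ρ = 0 the clustered sample behaves like a simple random sample; larger values signal stronger clustering and a larger design effect.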
Researchers have begun to plan many more group-randomized studies and conduct many more data analyses using appropriate designs (e.g., nested designs) and statistical models (e.g., multilevel models). However, because these methods are only now becoming widespread, some of the key information that makes this approach possible is still difficult to find. Data concerning expected ICC values for various instruments and within various populations are generally not available. Without a reasonably accurate estimate of the ICC in a given research design, researchers are at a disadvantage when attempting to perform a power analysis to determine a sample size and create a study budget.
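As a minimal sketch of why the assumed ICC matters so much at the planning stage, the snippet below computes the minimum detectable effect size (MDES) for a balanced two-arm cluster-randomized design under a normal approximation; the function name and the example numbers (40 schools of 60 students each) are ours, chosen only for illustration.

    from scipy.stats import norm

    def mdes(J, n, icc, alpha=0.05, power=0.80):
        # Variance of the standardized impact estimate for a balanced
        # two-arm design with J clusters of n students: 4*(icc + (1-icc)/n)/J.
        se = (4.0 * (icc + (1.0 - icc) / n) / J) ** 0.5
        return (norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)) * se

    for rho in (0.05, 0.15, 0.25):
        print(f"assumed ICC = {rho:.2f} -> MDES = {mdes(J=40, n=60, icc=rho):.3f}")

Moving the assumed ICC from 0.05 to 0.25 roughly doubles the MDES in this example, which is exactly why published ICC values for the planned outcome measure are so valuable.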
Several recent articles have attempted to fill this gap by providing ICC values for specific situations, as well as addressing other technical issues such as the value of using blocking variables during random assignment or covariates at the individual or group level during analysis to improve the efficiency of these designs. For example, Bloom, Richburg-Hayes & Black (2005, 2007) examined ICCs and minimum detectable effect sizes (MDEs) in data from five school districts under a variety of statistical models. Hedges & Hedberg (2007a) provided a set of ICC values for national probability samples, including reading and mathematics outcomes and ICC values from unconditional models as well as models with demographic and/or pretest covariates. Hedges & Hedberg (2007b) published ICCs from a sample of rural schools, finding that the ICCs tended to be smaller in these schools than in urban samples, and that ICCs tended to be smaller in upper grades than in lower grades. These findings will be a great help to researchers planning studies with similar outcomes, sampling frames, and data structures. (For additional recent work in this area, see also Bloom, 2004; Bloom, Bos & Lee, 1999; Murray, 1998, 2001; Murray & Hannan, 1990; Murray, Hannan & Baker, 1996; Murray & Blitstein, 2003; Murray, Varnell & Blitstein, 2004; Raudenbush, Martinez & Spybrook, 2007; Schochet, 2005; Varnell, Murray, Janega & Blitstein, 2004.)
Purpose/objective/research question/focus of study: Description of what the research focused on and why.
The purpose of this study is to determine the ICC values for Oregon achievement tests in various subjects, grade levels, and key student and school subgroups, and to make these values widely available for researchers who are planning to use state assessment data as outcome measures for experimental and quasi-experimental designs. This study does not duplicate the earlier efforts noted above, but extends knowledge of typical ICC values by focusing on designs in which existing state assessment data are used as outcome measures. As Hedges and Hedberg note, their 2007 study used national probability samples drawing on data from many states, but fit two-level models in which the variance related to state, district, and school was combined. As a result, their ICC values are likely to be larger than those found in studies conducted within a single state, using models in which there is no state-level variance.
Setting: Specific description of where the research took place.
The study is taking place in Oregon as a partnership between the Research Unit of the Center for Research, Evaluation and Assessment at the Northwest Regional Educational Laboratory and the Oregon Department of Education, funded by the W.T. Grant Foundation.
Population/Participants/Subjects: Description of participants in the study: who (or what), how many, and key features (or characteristics).
Statewide assessment data in mathematics and in reading have been obtained for all students in grades 3 through 8 and in grade 10 for the 2006-07 and 2007-08 school years. State assessment data in science were obtained for all students in grades 5, 8, and 10 for the 2007-08 school year. State assessment data in writing were obtained for all students in grades 4, 7, and 10 for the 2006-07 and 2007-08 school years.
Intervention/Program/Practice: Specific description of the intervention, including what it was, how it was administered, and its duration.
This study is not an effectiveness test of an intervention. The focus is on providing better psychometric data on state achievement tests, in order to enable researchers to be more precise when calculating statistical power during the planning of future effectiveness studies of interventions in schools.
Research Design: Description of research design (e.g., qualitative case study, quasi-experimental design, secondary analysis, analytic essay, randomized field trial).
This study involves a secondary analysis of statewide assessment data in order to produce a comprehensive set of intraclass correlation values across grade levels and academic subjects.
Data Collection and Analysis: Description of plan for collecting and analyzing data, including description of data.
We have statewide assessment data in mathematics, reading, writing, and science from the Oregon Department of Education. Student test scores for the 2007-08 school year will be the outcome measure for all analyses. Separate analyses will be performed for each grade and each subject for the entire population. These analyses will then be repeated for student subgroups such as English language learners (ELLs) and special education students, as well as for school subgroups such as high-poverty schools (high percentage of students eligible for free or reduced-price lunch) and high-minority schools. After an initial set of descriptive analyses to check distributions, the data will be subjected to a series of analyses using hierarchical linear modeling (HLM).
The first set of analyses will use one-way ANOVA (empty) models to calculate the unconditional ICC values. A two-level (students nested within schools) model will be fit to the data to calculate the school-level ICCs for each grade level and subject area. This will be followed by an exploratory three-level (students nested within schools nested within districts) analysis to partition the school-level variance from the two-level model into within-district between-school variance and between-district variance. From this three-level analysis, we will calculate the within-district school-level ICCs as well as the district-level ICCs. The three-level analysis is exploratory, since many rural districts in Oregon are single-school districts or have only a few schools per grade level, making the partitioning of school- and district-level variance difficult.
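A minimal two-level sketch of this first analysis in Python's statsmodels (the file name and column names are placeholders, not the study's actual data layout):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("oregon_scores.csv")  # hypothetical extract: one row per student

    # Empty (one-way ANOVA with random effects) model: students nested in schools.
    fit = smf.mixedlm("score ~ 1", data=df, groups="school_id").fit(reml=True)

    tau2 = fit.cov_re.iloc[0, 0]  # estimated between-school variance
    sigma2 = fit.scale            # estimated within-school (residual) variance
    print(f"unconditional school-level ICC = {tau2 / (tau2 + sigma2):.3f}")

The exploratory three-level model would add a district random effect (for example via variance components or dedicated HLM software) and split the between-school variance into its within-district and between-district parts.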
The second set of analyses will assess the degree to which the school-level ICC can be reduced by including the previous year's assessment score, aggregated at the school level, as a covariate. This covariate is readily and inexpensively available in all states, and it has often proven effective in reducing school-level ICCs in previous studies. Since Oregon students take mathematics and reading tests every year, an alternative analysis using individual students' previous assessment scores as a covariate will also be performed for these subjects.
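Continuing the sketch above, the conditional ICC simply re-estimates the variance components after adding the covariate; prior_school_mean is a hypothetical column holding each school's aggregated prior-year mean score, merged onto every student record:

    fit_c = smf.mixedlm("score ~ prior_school_mean", data=df,
                        groups="school_id").fit(reml=True)
    tau2_c, sigma2_c = fit_c.cov_re.iloc[0, 0], fit_c.scale
    print(f"conditional school-level ICC = {tau2_c / (tau2_c + sigma2_c):.3f}")

The student-level alternative would replace prior_school_mean with each student's own prior-year score.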