Online Workshop

Tips of the Trade: Developing High Quality Research Syntheses

Presenters: Chad Nye and Oliver Wendt

Part 4 - Quality Appraisal/Group Design

Text version of PowerPoint™ presentation for workshop sponsored by SEDL’s Center on Knowledge Translation for Disability and Rehabilitation Research.

Slide 62: Quality Appraisal

Slide 63: What is Quality Appraisal?

  • The process of assessing the methods and results of each study is often referred to as critical appraisal or assessing study quality
  • Aims to determine whether the study is adequate to answer the question
  • Research evidence can be produced by a wide range of methods and approaches, sometimes distorted by systematic error due to reasons of cost, practicality or accident

(Petticrew & Roberts, 2006)

Slide 64:

“Any research synthesis is only as strong as the data on which it is based.”

(Hedges & Olkin, 1985, p. 14)

Slide 65: What is Study Quality?

  • Can mean different things to different people depending on discipline
  • Many aspects of study design can affect the outcome of a study
  • Most important design criteria relate to its internal validity, i.e., the extent to which a study is free from the main methodological biases (such as selection bias, response bias, attrition bias, and observer bias)

(Petticrew & Roberts, 2006)

Slide 66: What is Study Quality? (cont.)

  • If a study is not internally valid, then observed effect sizes from that study may be incorrect
  • The study design with greatest internal validity is the true experiment using randomization to assemble comparable groups (Cook & Campbell, 1979)

–In medical and health sciences often called randomized controlled trial (RCT)

Slide 67: What is Study Quality? (cont.)

For randomized trials the following aspects are critical to assessing study quality (Jadad, 1998):

–Relevance of the research question

–Internal validity: degree to which trial design, conduct, analysis, and presentation have minimized or avoided biased comparisons of the interventions under investigation

–External validity: precision and extent to which results can be generalized to other settings

–Appropriateness of data analysis and presentation

–Ethical implications of the intervention

Slide 68: Importance of Study Quality

  • Issue of study quality has increased in importance in the fields of educational research and health care research
  • Methodological reviews indicate a relatively high prevalence of poor-quality studies (arrow to right) these studies can mislead health care practice and policy

(Schulz, Chalmers, Hayes, & Altman, 1995)

  • Therefore, including poor-quality studies in a systematic review/meta-analysis can be a source of bias

(Torgerson, 2003)

Slide 69: Assessing Study Quality

  • Quality assessment can be used to “weight” each study summarized in the review
  • “Weighting” is usually done narratively (arrow to right) differentiating clearly between higher and lower quality studies
  • Can be helpful for making appropriate methodological recommendations regarding future research
  • Another means of investigating impact of study quality is sensitivity analysis (arrow to right) testing how sensitive the review findings are to the inclusion and exclusion of studies of different quality

(Petticrew & Roberts, 2006)

Slide 70: Assessing Study Quality (cont.)

  • Researcher directs attention to all key aspects of the study – its design, methods, participants, setting, and any key measures or variables
  • Often involves using a checklist or scale to formalize process of appraising the study (arrow to right) main methodological issues are examined systematically

–Using same approach for each study makes it less likely that problems or biases will be overlooked

(Petticrew & Roberts, 2006; Torgerson, 2003)

Slide 71: Assessing Study Quality (cont.)

  • Critical appraisal is essential to sound systematic reviewing, but not an end in itself

–No study is perfect and it is easy to find flaws in every piece of research

–Do not take critical appraisal to extremes: Type I error if studies are accepted on face value, Type II error if valid evidence is rejected due to too much scrutiny (Petticrew & Roberts, 2006)

–Identify errors that are large enough to affect how the result of the study should be interpreted!

Slide 72: Tools for Quality Appraisal

  • Educational and health care research has become aware of potential problems of poor-quality research evidence (arrow to right) many institutions have produced sets of quality criteria to classify studies as being rigorous or not (e.g., EPPI-Centre, Scottish Intercollegiate Guidelines Network)

Slide 73: Simeonsson & Bailey (1991)

  • Mainly an appraisal of internal validity
  • Based on three dimensions: (a) research design, (b) interobserver agreement of dependent variable, and (c) treatment integrity
  • Classifies the certainty of evidence into four groupings:
  • Conclusive
  • Preponderant
  • Suggestive
  • Inconclusive

Slide 74: Simeonsson & Bailey (1991) (cont.)

  • Conclusive: outcomes are undoubtedly the results of the intervention based on a sound design and adequate or better interobserver agreement and treatment integrity
  • Preponderant: outcomes are not only plausible but they are also more likely to have occurred than not, despite minor design flaws and with adequate or better interobserver agreement and treatment integrity

Slide 75: Simeonsson & Bailey (1991) (cont.)

  • Suggestive: outcomes are plausible and within the realm of possibility due to a strong design but inadequate interobserver agreement and/or treatment integrity, or due to minor design flaws and inadequate interobserver agreement and/or treatment integrity
  • Inconclusive: outcomes are not plausible due to fatal flaws in design

Slide 76: Example - Simeonsson & Bailey (1991)

Table with 7 columns.

Column 1: Study - Buffington, Krantz, McClannahan, & Poulson (1998).

Column 2: Purpose - To demonstrate the effects of modeling, prompting, & reinforcement on the acquisition of gestures.

Column 3: Participants (n, CAa, Speechb) - 4 children (Anne [6-5], Oscar [6-4], Kevin [4-5], Nick [4-5]) with some spoken language.

Column 4: Design - MPD across behaviors.

Column 5: Outcomes - (1) Acquisition: Pointing and saying “look” to indicate; (2) Generalization across settings and stimuli.

Column 6: Effect Size Estimate PNDc (percent) or ESd - (1) Acquisition: Anne (94) Oscar (96) Kevin (96) Nick (95); (2) Generalization: The generalization is unclear based on the results reported.

Column 7: Appraisal – Conclusive-Acquisition: Excellent design, IOA, and TI. Inconclusive-Generalization: The generalization is unclear based on the results reported.

Slide 77: Simeonsson & Bailey (1991) (cont.)

Translation into practice:

  • Only evidence evaluated as being suggestive or better should be considered re: implications for practice
  • Inconclusive studies are not appropriate for informing practice due to their fatal design flaws! They may be considered only in terms of directions for future research.

Slide 78: PEDro Scale

  • Group designs

–10-point PEDro scale

–10 points maximum for RCTs; 8 points max for non-RCTs

Slide 79: Adapted PEDro Scale

When to use this scale? Please use this scale for any group studies involving (a) randomized controlled trials (RCTs), (b) non-RCTs, and (c) case series.

How to use this scale? Answer all questions with yes or no. Count the number of Yes responses to arrive at the total score.

Origin of the scale: This scale is based on the PEDro scale, which, in turn, is based on the Delphi list developed by Verhagen and colleagues (Journal of Clinical Epidemiology, 51, 1235-1241, 1998). The Delphi list is a list of trial characteristics that was thought to be related to trial “quality” by a group of clinical trial experts. The PEDro scale contains additional items on adequacy of follow-up and between-group statistical comparisons. The first item relates to external validity (specifically the participant selection criteria) and thus does not apply to EVIDAAC, which is strictly examining the quality of evidence in terms of its validity. Therefore, the first item was eliminated for EVIDAAC purposes. Additionally, one item each was added concerning reliability and treatment integrity. The remaining 12 items assess the internal validity of each trial and whether the trial contains sufficient statistical information to make it interpretable. Thus, the internal validity of each trial is ranked based on a total score out of 12. The items are as follows:

Slide 80:

Table with 2 columns: Appraisal Item and Rating (Yes / No), and 13 rows.

Appraisal Items:

1. The participants were randomly allocated to interventions (in a crossover study, participants were randomly allocated to an order in which treatments were received). (Yes / No)

2. Allocation was concealed. (Yes / No)

3. The intervention groups were similar at baseline regarding the most important prognostic indicators. (Yes / No)

4. There was blinding of all participants. (Yes / No)

5. There was blinding of all therapists who administered therapy. (Yes / No)

6. There was blinding of all assessors who measured at least one key outcome. (Yes / No)

7. Inter-observer agreement for the dependent measure/s meets minimal standards (i.e., IOA ≥ 80%; Kappa ≥ 60%) and is based on ≥ 20% of all sessions during each phase/condition. (Yes / No)

8. Treatment integrity is at an appropriate level given the complexity of the treatment, independently verified, and based on relevant procedural steps in ≥ 20% of sessions during each phase/condition. (Yes / No)

9. Measures of at least one key outcome were obtained from more than 85% of participants originally allocated to groups. (Yes / No)

10. All participants for whom outcome measures were available received the treatment or control condition as allocated or, where this was not the case, data for at least one key outcome was analyzed by “intention to treat”. (Yes / No)

11. The results of between-intervention-group statistical comparisons are reported for at least one key outcome. (Yes / No)

12. The study provides both point measures and measures of variability for at least one key outcome. (Yes / No)

(Row 13) Total # of Yes responses at bottom of far right yes/no column
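The scoring rule (count the Yes responses, maximum 12) can be sketched as follows; the item keys and ratings below are hypothetical, for illustration only:

```python
# Adapted PEDro scoring: each of the 12 scored items is answered Yes/No,
# and the total quality score is the number of Yes responses.
# The ratings below are invented for illustration, not a real appraisal.
ratings = {
    "random_allocation": True,
    "concealed_allocation": False,
    "baseline_similarity": True,
    "participant_blinding": False,
    "therapist_blinding": False,
    "assessor_blinding": True,
    "interobserver_agreement": True,
    "treatment_integrity": True,
    "adequate_follow_up": True,
    "intention_to_treat": False,
    "between_group_comparison": True,
    "point_and_variability_measures": True,
}

score = sum(ratings.values())  # True counts as 1, False as 0
print(f"Adapted PEDro score: {score}/12")  # Adapted PEDro score: 8/12
```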

Slide 81: EVIDAAC Scale

  • Appraisal scale developed for EVIDAAC: A database of appraised evidence in augmentative and alternative communication (Schlosser, R. W., Sigafoos, J., Eysenbach, G., & Dowden, P., 2007)
  • 10 items derived from recent SSED literature as well as quality criteria outlined by Horner and colleagues (2005)
  • Each item is answered yes or no, and given 1 or 0 points for a maximum final quality score of 10.
  • Higher scores indicate better methodological quality

Slide 82:

Table with 2 columns, Appraisal Item and Rating (Yes / No), and 11 rows.

Appraisal items:

1. Participants, and participant selection, are described with sufficient detail to allow other researchers to select similar participants in future studies. (Yes / No)

2. Critical features of the physical setting are described with sufficient precision to allow for replication. (Yes / No)

3. The dependent variable is sufficiently operationalized. (Yes / No)

4. The dependent variable is being measured repeatedly using sufficient assessment occasions to allow for identification of performance patterns prior to intervention and comparison of performance patterns across conditions/phases (level, trend, variability). (Yes / No)

5. Inter-observer agreement meets minimal standards (i.e., IOA ≥ 80%; Kappa ≥ 60%) and is based on ≥ 20% of all sessions during each phase/condition. (Yes / No)

6. Experimental control is demonstrated via three demonstrations of the experimental effect (predicted change in the dependent variable varies with the manipulation of the independent variable) at different points in time (a) within a single participant (within-subject replication) or (b) across different participants (between-subject replication). (Yes / No)

7. Baseline data are sufficiently consistent before intervention is introduced to allow prediction of future performance. (Yes / No)

8. Experimental control is demonstrated via three demonstrations of the experimental effect (predicted change in the dependent variable varies with the manipulation of the independent variable) at different points in time (a) within a single participant (within-subject replication) or (b) across different participants (between-subject replication). (Yes / No)

9. The independent variable is defined with replicable precision. (Yes / No)

10. Treatment integrity is at an appropriate level given the complexity of the treatment, independently verified, and based on relevant procedural steps in ≥ 20% of sessions during each phase/condition. (Yes / No)

(Row 11) Total # of Yes responses at bottom of far right yes/no column

Slide 83: Further Reading

  • Wendt, O., & Miller, B. (2012). Quality appraisal of single-subject experimental designs: An overview and comparison of different appraisal tools. Education and Treatment of Children, 35(2), 109-142.
  • Heyvaert, M., Wendt, O., Van Den Noortgate, W., & Onghena, P. (under review). Randomization and data-analysis items in tools for reporting and evaluating single-case experimental studies.

Slide 84: Exercises

  • 1. Coding of group design article
  • 2. Quality appraisal of single-subject design article

Slide 85: References

Cook, T., & Campbell, D. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.

Jadad, A. (1998). Randomized controlled trials: A user’s guide. London: BMJ Books.

Petticrew, M., & Roberts, H. (2006). Systematic Reviews in the Social Sciences: A Practical Guide. Malden, MA: Blackwell Publishing.

Schulz, K. F., Chalmers, I., Hayes, R., & Altman, D. G. (1995). Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA, 273, 408-412.

Simeonsson, R., & Bailey, D. (1991). Evaluating programme impact: Levels of certainty. In D. Mitchell & R. Brown (Eds.), Early intervention studies for young children with special needs (pp. 280-296). London: Chapman and Hall.

Torgerson, C. (2003). Systematic reviews. London, UK: Continuum.

Slide 86: Synthesis of Group Design Studies

Slide 87: Basic Statistic of Meta-Analysis

Standardized Mean Difference

d = (Xe – Xc) / Sp

Commonly called “d” statistic or “effect size”

Xe = mean of the experimental group

Xc = mean of the control group

Sp = pooled standard deviation of Xe and Xc combined
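This computation can be sketched in Python; the sample data below are invented for illustration, and the Hedges’s g small-sample correction (the statistic used in the forest plots that follow) is included for reference:

```python
import math

def standardized_mean_difference(experimental, control):
    """Compute d = (Xe - Xc) / Sp, with Sp the pooled standard deviation."""
    ne, nc = len(experimental), len(control)
    xe = sum(experimental) / ne  # mean of the experimental group
    xc = sum(control) / nc       # mean of the control group
    # Sample variances (denominator n - 1)
    var_e = sum((x - xe) ** 2 for x in experimental) / (ne - 1)
    var_c = sum((x - xc) ** 2 for x in control) / (nc - 1)
    # Pooled standard deviation of the two groups combined
    sp = math.sqrt(((ne - 1) * var_e + (nc - 1) * var_c) / (ne + nc - 2))
    d = (xe - xc) / sp
    # Hedges's g: d with the small-sample correction J = 1 - 3 / (4N - 9)
    g = d * (1 - 3 / (4 * (ne + nc) - 9))
    return d, g

# Invented scores for two small groups
d, g = standardized_mean_difference([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(f"d = {d:.3f}")  # d ≈ 1.265
```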

Slide 88: Figure 1. Overall Effect for Individual Studies: Tx vs Ctl

These data are the overall effect size, upper and lower limits of the 95% confidence interval, group n, and the forest plot representing those values for the included studies.

Study / Year / Hedges’s g / Lower Limit / Upper Limit / Tx n / Ctl n
Amerikaner / 1982 / 0.421 / -0.282 / 1.124 / 15 / 16
Byham* / 1983 / 0.097 / -0.885 / 1.079 / 7 / 7
Stark / 1983 / 0.688 / -0.363 / 1.738 / 7 / 6
Straub / 1983 / 0.809 / 0.115 / 1.503 / 16 / 17
Wanat / 1983 / 0.489 / -0.218 / 1.196 / 15 / 15
Similon* / 1984 / 0.737 / 0.093 / 1.382 / 21 / 18
Lo / 1985 / 0.791 / 0.216 / 1.367 / 23 / 27
Merz / 1985 / 0.268 / -0.344 / 0.880 / 20 / 20
Northcutt* / 1985 / 0.168 / -0.653 / 0.988 / 10 / 16
Omizo / 1985 / 0.747 / 0.194 / 1.301 / 24 / 30
Omizo / 1988 / 0.514 / -0.031 / 1.060 / 23 / 30
Trapani / 1989 / 0.930 / -0.216 / 2.076 / 7 / 6
Margalit** / 1995 / -0.255 / -1.011 / 0.501 / 25 / 32
Hart / 1996 / 0.021 / -1.240 / 1.281 / 4 / 3
Wiener / 1997 / 0.619 / -0.121 / 1.296 / 7 / 19
Wiener / 1997a / 0.587 / -0.200 / 1.375 / 13 / 19

Fixed Effects (with **): Hedges’s g = 0.514, Lower Limit = 0.334, Upper Limit = 0.695

Fixed Effects (without **): Hedges’s g = 0.555, Lower Limit = 0.370, Upper Limit = 0.741
The Hedges’s g & 95% CI values are also displayed in a forest plot.

The Hedges’s g column is highlighted in purple.
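The fixed-effects pooled estimate in a table like this is an inverse-variance weighted average of the study effects. A minimal sketch, assuming symmetric 95% confidence intervals so each study’s standard error can be recovered as (upper – lower) / (2 × 1.96):

```python
import math

def fixed_effects_pool(effects):
    """Inverse-variance fixed-effects pooling.

    effects: list of (g, lower, upper) tuples with symmetric 95% CIs.
    Returns (pooled_g, pooled_lower, pooled_upper).
    """
    weights, weighted = [], []
    for g, lower, upper in effects:
        se = (upper - lower) / (2 * 1.96)  # recover SE from the 95% CI
        w = 1 / se ** 2                    # weight = inverse variance
        weights.append(w)
        weighted.append(w * g)
    pooled = sum(weighted) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

# Two hypothetical studies: the more precise (smaller-SE) study dominates
pooled, low, high = fixed_effects_pool([(0.5, 0.304, 0.696),
                                        (1.0, 0.608, 1.392)])
print(f"pooled g = {pooled:.2f}")  # pooled g = 0.60
```

Applied to the tabled Hedges’s g values and confidence limits, this should approximately reproduce the overall fixed-effects estimates shown; small discrepancies are expected from rounding in the table.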

Slide 89: Figure 2. Overall Effect for RCT vs QED Studies: Tx vs Ctl

Study / Year / Hedges’s g / Lower Limit / Upper Limit / Tx n / Ctl n
RCT / Amerikaner / 1982 / 0.421 / -0.282 / 1.124 / 15 / 16
RCT / Hart / 1996 / 0.021 / -1.240 / 1.281 / 4 / 3
RCT / Lo / 1985 / 0.791 / 0.216 / 1.367 / 23 / 27
RCT / Omizo / 1985 / 0.747 / 0.194 / 1.301 / 24 / 30
RCT / Omizo / 1988 / 0.514 / -0.031 / 1.060 / 23 / 30
RCT / Stark / 1983 / 0.688 / -0.363 / 1.738 / 7 / 6
RCT / Straub / 1983 / 0.809 / 0.115 / 1.503 / 16 / 17
RCT / Trapani / 1989 / 0.930 / -0.216 / 2.076 / 7 / 6
RCT / Wanat / 1983 / 0.489 / -0.218 / 1.196 / 15 / 15
RCT / Wiener / 1997 / 0.619 / -0.121 / 1.296 / 7 / 19
RCT / Wiener / 1997a / 0.587 / -0.200 / 1.375 / 13 / 19
Fixed Model / 0.620 / 0.404 / 0.836
QED / Byham / 1983 / 0.097 / -0.885 / 1.079 / 7 / 7
QED / Merz / 1985 / 0.268 / -0.344 / 0.880 / 20 / 20
QED / Northcutt / 1985 / 0.168 / -0.653 / 0.988 / 10 / 16
QED / Similon / 1984 / 0.737 / 0.093 / 1.382 / 21 / 18
Fixed Model / 0.374 / 0.011 / 0.736

Fixed Overall: Hedges’s g = 0.571, Lower Limit = 0.385, Upper Limit = 0.757

The Hedges’s g & 95% CI values are also displayed in a forest plot

The Hedges’s g column is highlighted in purple.

Slide 90: Figure 3. Overall Effect for RCT vs QED Studies: Tx vs Ctl

Table: 2 Rows, 3 Columns.
Row 1 - RCT Studies: Effect Size: 0.620, Lower Limit: 0.404, Upper Limit: 0.836

Row 2 - QED Studies: Effect Size: 0.374, Lower Limit: 0.011, Upper Limit: 0.736

Question: Are these effect sizes significantly different?

NO!

QBet = 1.304, p = .253

The Hedges’s g & 95% CI plot is displayed to the right of the table.
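The Q statistic behind this comparison can be sketched in Python. Each subgroup’s standard error is recovered from its 95% CI (assuming a symmetric interval, SE = (upper – lower) / (2 × 1.96)); the RCT upper limit is taken as 0.836, matching the Fixed Model row on Slide 89. This approximately reproduces QBet = 1.304 and p = .253:

```python
import math

def q_between(subgroups):
    """Heterogeneity between subgroup estimates.

    subgroups: list of (g, lower, upper) pooled subgroup effects with 95% CIs.
    Returns (Q, p), where Q follows a chi-square distribution with
    (k - 1) degrees of freedom; here k = 2, so df = 1.
    """
    ws = [1 / ((u - l) / (2 * 1.96)) ** 2 for _, l, u in subgroups]
    gs = [g for g, _, _ in subgroups]
    overall = sum(w * g for w, g in zip(ws, gs)) / sum(ws)
    q = sum(w * (g - overall) ** 2 for w, g in zip(ws, gs))
    # For df = 1, the chi-square upper-tail p-value is erfc(sqrt(Q / 2))
    p = math.erfc(math.sqrt(q / 2))
    return q, p

# RCT and QED subgroup estimates (effect, lower, upper)
q, p = q_between([(0.620, 0.404, 0.836), (0.374, 0.011, 0.736)])
print(f"Q = {q:.2f}, p = {p:.3f}")  # Q ≈ 1.31, p ≈ 0.253
```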

Slide 91: Dilemma???

To Lump or Split??? Do we combine all of the studies from the two different designs to answer the question of the effectiveness of Social Skill Training for LD children?

With no statistically significant difference between the design types, the question for further data investigation is: should the effect sizes for each study be combined (lump), or should the RCT and QED studies be presented independently (split)?

Slide 92: Figure 3. Overall Effect for RCT vs QED Studies: Tx vs Ctl

Table: 2 Rows, 3 Columns.
Row 1 - RCT Studies: Effect Size: 0.620, Lower Limit: 0.404, Upper Limit: 0.836

Row 2 - QED Studies: Effect Size: 0.374, Lower Limit: 0.011, Upper Limit: 0.736

The Hedges’s g & 95% CI plot is displayed to the right of the table.

Arrow pointing to the Effect Size column reads “RCTs are almost 2x as large!!”

Slide 93: Dilemma???

AND These are 2 different types of scientific evidence??

(Image of an apple and an orange) Isn’t this the problem??

Slide 94: Figure 5. Study Effect for Tx vs Ctl Measuring Social Behavior Skills

These are the studies that measured the effect of SST on social behavior skills and present the effect size, 95% CI, and associated forest plot for each study and the overall result.

Study / Year / Hedges’s g / Lower Limit / Upper Limit / Tx n / Ctl n
Amerikaner / 1982 / 0.593 / 0.238 / 0.947 / 15 / 16
Straub / 1983 / 0.808 / 0.371 / 1.299 / 16 / 17
Similon* / 1984 / 1.143 / 0.476 / 1.810 / 21 / 18
Merz* / 1985 / 0.301 / -0.053 / 0.655 / 20 / 20
Omizo / 1988 / 0.509 / 0.265 / 0.793 / 23 / 30
Wiener / 1997 / 0.571 / 0.312 / 0.830 / 7 / 19
Wiener / 1997a / 0.571 / 0.359 / 0.785 / 13 / 19

Fixed Effects: Hedges’s g = 0.556, Lower Limit = 0.438, Upper Limit = 0.673

The Hedges’s g & 95% CI values are also displayed in a forest plot

The Hedges’s g column is highlighted in purple.

Slide 95: Figure 6. Study Effect for Tx vs Ctl Measuring Social Cognitive Skills

These are the studies that measured the effect of SST on social cognitive skills and present the effect size, 95% CI, and associated forest plot for each study and the overall result.

Study / Year / Hedges’s g / Lower Limit / Upper Limit / Tx n / Ctl n
Amerikaner / 1982 / 0.292 / -0.407 / 0.990 / 15 / 16
Byham* / 1983 / 0.180 / -0.803 / 1.163 / 7 / 7
Wanat / 1983 / 0.489 / -0.218 / 1.196 / 15 / 15
Lo / 1985 / 0.791 / 0.216 / 1.367 / 23 / 27
Merz* / 1985 / 0.212 / -0.219 / 0.644 / 20 / 20
Northcutt* / 1985 / 0.158 / -0.077 / 0.395 / 10 / 16
Similon* / 1984 / 0.744 / 0.373 / 1.115 / 21 / 18
Omizo / 1985 / 0.747 / 0.194 / 1.301 / 24 / 30
Omizo / 1988 / 0.684 / 0.134 / 1.235 / 23 / 30
Wiener / 1997 / 0.808 / 0.194 / 1.423 / 10 / 19
Wiener / 1997a / 0.520 / 0.025 / 1.015 / 13 / 19

Fixed Effects: Hedges’s g = 0.506, Lower Limit = 0.362, Upper Limit = 0.622

The Hedges’s g & 95% CI values are also displayed in a forest plot

The Hedges’s g column is highlighted in purple.

Slide 96: Figure 7. Study Effect for Tx vs Ctl Measuring Social Communication Skills

These are the studies that measured the effect of SST on social communication skills and present the effect size, 95% CI, and associated forest plot for each study and the overall result.

Study / Year / Hedges’s g / Lower Limit / Upper Limit / Tx n / Ctl n
Stark / 1983 / 0.688 / -0.363 / 1.738 / 7 / 6
Trapani / 1989 / 0.930 / -0.216 / 2.076 / 7 / 6
Hart / 1996 / 0.021 / -1.240 / 1.281 / 4 / 3

Fixed Effects: Hedges’s g = 0.596, Lower Limit = 0.288, Upper Limit = 0.925

The Hedges’s g & 95% CI values are also displayed in a forest plot

The Hedges’s g column is highlighted in purple.

Slide 97: Figure 8. Study Effect for Tx vs Ctl Measuring Educational Related Outcomes

Three row table with 8 columns.

Column headings are Study, Year, Outcome, Hedges’s g, Lower Limit, Upper Limit, Tx n, Ctl n.

Row 1 – Study: Byham*, Year: 1983, Outcome: Study Skills, Hedges’s g: 0.014, Lower Limit: -0.967, Upper Limit: 0.995, Tx n: 7, Ctl n: 7.

Row 2 – Study: Similon*, Year: 1984, Outcome: Access to Tch, Hedges’s g: 0.234, Lower Limit: -0.385, Upper Limit: 0.853, Tx n: 21, Ctl n: 18.

Row 3 – Fixed Effects, Hedges’s g: 0.171, Lower Limit: -0.352, Upper Limit: 0.695.

These are the studies that measured the effect of SST on educational outcomes and present the effect size, 95% CI, and associated forest plot for each study and the overall result.