eAppendix
Appendix describing multiple imputation methods used in the paper
Rationale for imputation of missing gross motor function scores for children with CP
One challenge in classifying functional limitations among children with CP in a population-based study is incomplete ascertainment of information related to level of functioning. In previous population-based studies, information on functioning was missing for a substantial proportion of cases, up to 20%. A “complete-case” approach, in which CP cases with missing information are simply excluded from the analysis, is often employed in population-based studies. Research on missing data has demonstrated that the complete-case approach can result in bias if the outcome of interest (in this case, functional limitation) differs depending on whether information about the outcome is present or missing (Schafer and Graham 2002).
In the case of CP, it has been suggested in the literature that children with missing gross motor function information are less likely to be observed in registries and surveillance systems because they have fewer interactions with health care providers.
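The direction of the complete-case bias described above can be illustrated with a small worked example. This is only a sketch: the counts and the missingness rates below are hypothetical and are not taken from the study, and the assumed pattern (more missing information among milder cases) is one possibility consistent with the argument above, not an established fact.

```python
# Hypothetical illustration of complete-case bias.
# All counts and missingness rates are invented for this sketch.

# Assumed true distribution of functional level in the population
true_counts = {"mild": 600, "moderate": 150, "severe": 250}

# Assumed probability that functional information is missing, by level
p_missing = {"mild": 0.40, "moderate": 0.20, "severe": 0.10}


def proportions(counts):
    """Convert a dict of counts into a dict of proportions."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Expected number of cases whose functional level is actually recorded
observed_counts = {
    level: n * (1 - p_missing[level]) for level, n in true_counts.items()
}

true_props = proportions(true_counts)
complete_case_props = proportions(observed_counts)
overall_missing = 1 - sum(observed_counts.values()) / sum(true_counts.values())

for level in true_counts:
    print(f"{level:9s} true={true_props[level]:.3f} "
          f"complete-case={complete_case_props[level]:.3f}")
print(f"overall proportion missing = {overall_missing:.3f}")
```

Under these assumed rates the complete-case analysis understates the share of mild cases and overstates the share of severe cases, even though every observed value is correct.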
Imputation Methods
Multiple imputation and statistical analysis were performed with SAS 9.1.3 (SAS Institute, Cary, NC) and the SAS-callable implementation of IVEware (Raghunathan et al 2002), which has been shown to outperform the default multiple imputation function in SAS (Yu et al 2007). To evaluate the robustness of the imputation models, different models were run with and without the inclusion of race variables. All imputations used 10 iterations and 10 multiples to produce 10 datasets (100 total cycles). We also performed stratified imputations (such as separate imputations among non-Hispanic black and non-Hispanic white cases) to ensure that undetected patterns in the data were not creating artifacts in the imputed results. The imputation model produced similar results regardless of whether or how racial or income variables were included. Imputed values for both GMFCS levels (treated as either a continuous or categorical variable) and the 3-category walking ability variable (as an unordered categorical variable) were also similar.
IVEware performs polytomous regression for categorical variables, which does not recognize any ordering among the categories. We therefore ran a confirmatory imputation using the mi package in R (R Foundation for Statistical Computing, Vienna, Austria), which uses multinomial log-linear models to impute ordered categorical factors. The overall distribution of motor limitations imputed with the mi package was very similar to that obtained from IVEware for our final model.
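Estimates from the 10 imputed datasets must be combined into a single estimate with an appropriate variance. The appendix does not state which combining rules were used; the sketch below assumes Rubin's rules, the conventional approach, and applies them to a proportion. The per-dataset proportions are hypothetical values chosen for illustration, and the function name `pool_rubin` is likewise an assumption of this sketch.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Combine per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance, where
    total variance = mean within-imputation variance
                     + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, denominator m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


n_cases = 476  # number of CP cases in the analysis

# Hypothetical proportion walking independently in each of 10 imputed datasets
imputed_props = [0.62, 0.64, 0.63, 0.65, 0.61, 0.64, 0.63, 0.62, 0.66, 0.64]

# Binomial sampling variance of a proportion within each imputed dataset
within_vars = [p * (1 - p) / n_cases for p in imputed_props]

pooled_prop, total_var = pool_rubin(imputed_props, within_vars)
pooled_se = math.sqrt(total_var)
print(f"pooled proportion = {pooled_prop:.3f}, SE = {pooled_se:.4f}")
```

The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single imputed dataset would omit it and understate the standard error.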
As stated in the manuscript, a number of variables were significantly associated with both severity of gross motor limitations and whether or not this information was missing. These associations formed the basis of the multiple imputation models, which included covariates of interest to this analysis as well as other variables previously identified in the literature as being related to functional limitations among children with CP. These associations are displayed in eTable 1.
We also investigated the robustness of the imputation model and attempted to estimate its performance under a range of assumptions. eTable 2 displays a basic sensitivity analysis, examining the consequences of different assumptions about the proportion of missing data that we cannot recover. eTable 3 shows the results of the imputation model when different variables of interest are included or excluded. We examined different combinations of covariates to gauge the relative importance of different variables and to avoid possible overfitting from the inclusion of too many covariates.
Comment and Limitations of Imputation Approach
The differences between the complete-case and imputed results diminished when stratifying by factors that are strongly associated with function. For example, the vast majority of children with spastic unilateral CP (Table 1 in paper) were found to have mild functional limitations. The multiple imputation procedure made use of this information and assigned nearly all of the spastic unilateral cases with missing information to the mildest category, resulting in proportions similar to the complete-case analysis within this group (data not shown).
One limitation of this approach is that the imputation model assumes the data are “missing at random”—that is, this approach cannot account for the possibility that children missing functional information differed in a way that cannot be predicted with the other available variables. Knowing whether case children with missing information are different would preclude the need for imputation; however, it has been argued that, in practice, imputation often comes nearer to the “true” answer than simply restricting the analysis to cases with complete information.
References (cited here or generally useful)
Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009;60:549–576.
Graham JW, Hofer SM, Donaldson SI, MacKinnon DP, Schafer JL. Analysis with missing data in prevention research. In: Bryant K, Windle M, West S, eds. The science of prevention: methodological advances from alcohol and substance abuse research. Washington, DC: American Psychological Association; 1997:325–366.
Klebanoff MA, Cole SR. Use of multiple imputation in the epidemiologic literature. Am J Epidemiol. 2008;168:355–357.
Raghunathan TE, Solenberger PW, Van Hoewyk J. IVEware: Imputation and Variance Estimation Software User Guide. University of Michigan. 2002.
Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychological Methods. 2002;7:147–177.
Sterne JA, White IR, Carlin JB, Spratt M, Royston P, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.
Stuart E, Azur M, Frangakis C, et al. Multiple imputation with large data sets: a case study of the children’s mental health initiative. Am J Epidemiol. 2009;169(9):1133–1139.
Su Y, Gelman A, Hill J, Yajima M. Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box. Journal of Statistical Software. (forthcoming)
Yu L-M, Burton A, Rivero-Arias O. Evaluation of software for multiple imputation of semi-continuous data. Stat Methods Med Res. 2007;16:243–258.
eTable 1. Variable Selection for the Imputation Model
Spearman Correlation Coefficient (r)
Factor / Missing GMF score / Sig / Value of GMFCS score (when known) / Sig
Using All Cases (N=476)
# Abstracted Evaluations / -0.299 / **** / 0.355 / ****
Born in-state / 0.025 / / 0.112 / *
White Race / -0.057 / / -0.131 / *
Black Race / 0.041 / / 0.131 / *
Higher Poverty (Tertiles) / -0.002 / / -0.121 / *
Age at first eval (months) / -0.020 / / -0.287 / ****
Age at most recent eval (months) / -0.573 / **** / 0.153 / **
Previous CP Diagnosis / -0.108 / * / 0.085 /
Site: Alabama / -0.033 / / 0.127 / *
Site: Georgia / 0.133 / ** / -0.027 /
Site: Missouri / -0.095 / * / -0.063 /
Site: Wisconsin / -0.034 / / -0.043 /
Isolated CP / 0.217 / **** / -0.331 / ****
Epilepsy / -0.230 / / 0.399 / ****
Source of evaluations: School / -0.019 / / -0.069 /
Source of evaluations: Non-School / 0.111 / * / -0.082 /
Source of evaluations: Both / -0.109 / * / 0.109 / *
Male Gender / 0.033 / / 0.025 /
CP Subtype: Ataxic Hypertonic Dyskinetic / -0.024 / / -0.011 /
CP Subtype: Other, NOS, Unknown / 0.126 / ** / 0.057 /
CP Subtype: Spastic Mono/Hemi/Unilateral / 0.067 / / -0.383 / ****
CP Subtype: Spastic Di/Tri/Quadriplegic / -0.131 / ** / 0.292 / ****
Intellectual Disability / 0.000 / / 0.248 / ****
Postnatal etiology / 0.056 / / 0.107 / *
Birth certificate only variables (N=366)
Term Birth / -0.034 / / 0.076 /
Normal birthweight / -0.025 / / 0.081 /
Multiple Birth / 0.168 / ** / -0.071 /
Mother is married / 0.007 / / -0.089 /
Maternal education (categorical) / -0.051 / / -0.154 / *
Maternal age (years) / 0.048 / / -0.060 /
Apgar Score (5 minutes) / -0.105 / * / -0.119 /
Symbol / P-values
**** / <0.0001
*** / <0.001
** / <0.01
* / <0.05
The first column displays the correlation between each variable and whether gross motor function status is missing. A positive coefficient in the first column indicates that CP cases with this characteristic are more likely to be missing gross motor function information. The second column shows the (rank-order) correlation between each variable and gross motor function level among cases where gross motor function is known, which indicates how strongly the variable could be used to predict gross motor function. A positive coefficient in the second column suggests the characteristic is associated with more severe functional limitations. Variables that are associated with missing data and also with gross motor function are particularly important to include in the imputation model.
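The two screening correlations described above can be sketched as follows. This is an illustrative reimplementation in plain Python, not the code used for the analysis: the case records are hypothetical, the helper names (`average_ranks`, `pearson`, `spearman`) are assumptions of this sketch, and significance testing is omitted.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical cases: (number of abstracted evaluations, GMFCS level or None)
cases = [(1, None), (2, None), (2, 1), (3, 1), (3, None),
         (4, 2), (5, 3), (6, 3), (7, 4), (8, 5)]

n_evals = [c[0] for c in cases]
missing = [1 if c[1] is None else 0 for c in cases]

# First column: correlation of the covariate with the missingness indicator
r_missing = spearman(n_evals, missing)

# Second column: correlation with GMFCS level among cases where it is known
known = [c for c in cases if c[1] is not None]
r_severity = spearman([c[0] for c in known], [c[1] for c in known])

print(f"r (missing) = {r_missing:.3f}, r (GMFCS when known) = {r_severity:.3f}")
```

In this toy data the covariate is negatively correlated with missingness and positively correlated with severity, which is the pattern that would mark a variable as a priority for the imputation model.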
eTable 2. Evaluating the sensitivity and robustness for the multiple imputation model (n=476 CP Cases)
Approach adapted from “Analysis with Missing Data in Prevention Research”, Graham et al. 1997.
We performed a sensitivity analysis for the imputation model, similar to Table 9 in the Graham et al chapter on estimating inaccessible missingness. We used the R-squared value of linear models (containing the imputation variables) to estimate how much of the missingness we could explain with the variables in the model, and we then made different assumptions about how much of the remaining variability was not missing at random.
Sensitivity analysis for imputation
Quantity / Highly conservative: proportion / Highly conservative: cumulative proportion / Less conservative: proportion / Less conservative: cumulative proportion
Missing GMF among cases (observed) / 0.256 / 0.256 / 0.256 / 0.256
Proportion of missingness due in any part to GMF (estimated proportion that is not MCAR) / 0.800 / 0.205 / 0.333 / 0.085
Within those, proportion of missingness due entirely to the independent effect of GMF apart from other correlated variables / 0.800 / 0.096 / 0.500 / 0.040
Within those, proportion of GMF variability not accounted for by other variables (remaining inaccessible data; observed linear model predicting motor function, R-squared = 0.664) / 0.336 / 0.055 / 0.336 / 0.014
Estimated bias in imputation (if there is a residual MNAR component to the missing data, how much do the real values differ from the imputed values?) / Highly conservative: 1 ambulation level / Less conservative: 1 GMFCS level
Estimated cumulative impact of inaccessible (MNAR) mechanism on entire sample (n=476) / Highly conservative: 5.5% of the cases (n=26) should be 1 ambulation level different from their imputed value. / Less conservative: 1.4% of the cases (n=5) should be 1 Gross Motor Function Classification System level different from their imputed value.
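The cumulative proportions in eTable 2 can be read as running products of the step proportions. The sketch below assumes that reading, which is not stated explicitly in the table, and uses the step proportions exactly as tabulated; the function name `cumulative_proportions` is hypothetical.

```python
def cumulative_proportions(step_proportions):
    """Return the running product of a sequence of conditional proportions."""
    running = 1.0
    results = []
    for p in step_proportions:
        running *= p
        results.append(running)
    return results


# Step proportions as tabulated: observed missingness, share not MCAR,
# share due to the independent effect of GMF, share of GMF variability
# not explained by the other variables (1 - R-squared = 1 - 0.664)
highly_conservative = [0.256, 0.800, 0.800, 0.336]
less_conservative = [0.256, 0.333, 0.500, 0.336]

hc = cumulative_proportions(highly_conservative)
lc = cumulative_proportions(less_conservative)

print("highly conservative:", [round(v, 3) for v in hc])
print("less conservative:  ", [round(v, 3) for v in lc])
```

Under this reading, the second-row and final products round to the tabulated values (0.205 and 0.055 under the highly conservative assumptions; 0.085 and 0.014 under the less conservative assumptions).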
eTable 3. Robustness of Imputation using different variables in the model.
Race-specific proportions: imputation results based on race and social class variable inclusion
Imputation approach / Race / Independent / With device / Non-ambulatory
Complete case analysis / White / 63 / 10 / 28
Complete case analysis / Black / 46 / 13 / 42
Imputed with race and poverty as covariates / White / 66 / 8 / 26
Imputed with race and poverty as covariates / Black / 53 / 10 / 37
Stratified by race (separate imputations for white/black) / White / 68 / 9 / 23
Stratified by race (separate imputations for white/black) / Black / 51 / 10 / 39
Race ignored as a covariate in imputation model / White / 65 / 8 / 26
Race ignored as a covariate in imputation model / Black / 53 / 12 / 35
Race and poverty ignored in imputation model / White / 67 / 9 / 25
Race and poverty ignored in imputation model / Black / 53 / 11 / 37
Poverty ignored as a covariate in imputation model / White / 67 / 9 / 24
Poverty ignored as a covariate in imputation model / Black / 54 / 11 / 35
Cases with birth certificate (BC) data
Imputation approach / Race / Independent / With device / Non-ambulatory
Complete case analysis / White / 62 / 10 / 28
Complete case analysis / Black / 43 / 11 / 46
Imputed with race and maternal education as covariates / White / 65 / 10 / 25
Imputed with race and maternal education as covariates / Black / 50 / 11 / 39
Maternal education ignored / White / 63 / 11 / 26
Maternal education ignored / Black / 46 / 12 / 42
Cases restricted to those with complete BC data / White / 63 / 10 / 27
Cases restricted to those with complete BC data / Black / 50 / 11 / 39
We examined whether the imputation model would differ depending on whether certain variables were included or left out, or whether white and black children were run together or separately. The values in bold use the full imputation model, and were used in the paper. The strongest predictors were CP subtype, co-occurring epilepsy or intellectual disability, and number of evaluations found by the surveillance system. Race and perinatal variables did not seem to make large contributions to predicting GMF for the missing cases.