More Multiple Imputation Detail

Appendix

Multiple imputation is a technique initially proposed by Rubin [10,11]. The basic procedure is simple: we created m complete data sets by filling in the missing observations in a principled way, analyzed each completed data set, and combined the results. Multiple imputation incorporates the uncertainty due to the missing data into the imputation process [12]. Harel and Zhou [12] provide a good summary of multiple imputation and of the available software.

Multiple imputation combines aspects of both the Bayesian and frequentist statistical paradigms. The imputed data sets are often created using Markov chain Monte Carlo (MCMC) simulations, whereas the complete-data analysis often uses frequentist statistical methods. This research implements the multiple imputation methodology of Raghunathan, Lepkowski, Van Hoewyk, and Solenberger [15], which is well suited to survey data with many different variable types and a data structure complicated by skip patterns.

Software to implement the method is available as a SAS macro called IVEware [13]. This macro produced imputed values for each individual in the data set, conditional on all the values observed for that individual. Imputations were created using a sequence of multiple regressions, with the type of regression model depending on the type of variable being imputed: linear, logistic, Poisson, generalized logit, or a mixture of these. The method also allowed the imputations to be restricted to relevant subpopulations or to satisfy bounds on the variables. Covariates included all other variables observed or imputed for that individual. The imputations were drawn from the posterior predictive distribution specified by the regression model, with a flat or non-informative prior distribution for the parameters in the regression model. Subsequent imputations of a variable overwrote previously drawn values. This builds in dependencies among imputed values and uses the correlation structure among the covariates. To generate multiple imputations, the same procedure can be applied with different random starting seeds, or every pth imputed set of values in the cycles mentioned above can be taken. This research used 100 multiple imputations based on a recommendation in Harel [14].
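The cycle of regressions described above can be sketched in a few lines of Python. This is a minimal illustration, not IVEware's implementation: it assumes every variable is continuous and imputed with a linear regression, it ignores skip patterns and bounds, and the function name `sequential_regression_impute` and its parameters are hypothetical.

```python
import numpy as np

def sequential_regression_impute(data, n_cycles=10, seed=0):
    """Return one completed copy of `data` (2-D float array, NaN = missing).

    Assumption: all variables are continuous and modeled by linear regression.
    """
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)
    missing = np.isnan(X)
    # Start from column means so every regression has complete covariates.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    n, p = X.shape
    for _ in range(n_cycles):
        for j in range(p):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j
            # Covariates: intercept plus all other variables,
            # whether observed or currently imputed.
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            Z_obs, y_obs = Z[obs_j], X[obs_j, j]
            # Least-squares fit on the rows where variable j was observed.
            beta_hat, *_ = np.linalg.lstsq(Z_obs, y_obs, rcond=None)
            resid = y_obs - Z_obs @ beta_hat
            df = max(len(y_obs) - Z.shape[1], 1)
            # Draw the residual variance and coefficients from their
            # posterior under a flat prior.
            sigma2 = resid @ resid / rng.chisquare(df)
            cov = sigma2 * np.linalg.pinv(Z_obs.T @ Z_obs)
            cov = (cov + cov.T) / 2.0  # guard against round-off asymmetry
            beta = rng.multivariate_normal(beta_hat, cov)
            # Draw from the predictive distribution; this overwrites
            # the values imputed for variable j in the previous cycle.
            noise = rng.normal(0.0, np.sqrt(sigma2), miss_j.sum())
            X[miss_j, j] = Z[miss_j] @ beta + noise
    return X
```

Calling the function m times with different seeds would give m completed data sets, which corresponds to the "different random starting seeds" option mentioned above.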

The study questionnaire had two types of questions. The regression model used to multiply impute the Yes/No questions (1 – 7) was a logistic regression model that included all of the other variables in the questionnaires as covariates. Both pre- and posttest variables were included in the imputation models. For example, for question 4:
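A logistic imputation model of this kind has the general form below. The notation is assumed for illustration and is not taken from the authors: $Q4_i$ is the Yes/No response of student $i$ to question 4, $x_{ij}$ is that student's value on the $j$th of the $p$ other pre- and posttest items, and the $\beta$'s are regression coefficients.

$$\operatorname{logit} \Pr(Q4_i = 1 \mid x_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$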

The remaining variables were Likert scale questions (questions 8 – 12F). Item Q8 (“Sometimes young people have so many personal problems they have no other options besides suicide.”) was such a question. Students were asked to what extent they agreed with the statement, on a scale from 1 to 5. Such variables were treated as continuous and were imputed as the response variable in a multiple linear regression model. Imputed values were restricted to lie between 1 and 5, and were later assigned to an integer between 1 and 5 based on the imputed value.
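The final step, mapping a continuous imputed value back to the 1 to 5 scale, might look like the sketch below. The source does not state the assignment rule, so rounding to the nearest integer is an assumption, and clipping here is only a stand-in for the bounded draws made during imputation. The helper name `to_likert` is hypothetical.

```python
import numpy as np

def to_likert(imputed, low=1, high=5):
    """Bound continuous imputed values to [low, high], then round.

    Assumption: nearest-integer rounding (np.rint rounds exact halves
    to the nearest even integer).
    """
    bounded = np.clip(np.asarray(imputed, dtype=float), low, high)
    return np.rint(bounded).astype(int)
```

For instance, `to_likert([0.3, 2.4, 3.6, 7.0])` maps the values to 1, 2, 4, and 5.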

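Analysis of the multiply imputed data consisted of running a complete-case analysis on each of the 100 multiply imputed data sets. We used the standard multiple imputation variance formula found in Rubin [11] to compute the multiply imputed estimate of the regression coefficients and the covariance matrix. Suppose that $\hat{\beta}_l$ is the estimate of the vector of regression coefficients $\beta$ in the logistic model, and $U_l$ its covariance matrix, based on imputed data set $l$, for $l = 1, \dots, m$ (here $m = 100$). The multiply imputed estimate of $\beta$ is

$$\bar{\beta} = \frac{1}{m} \sum_{l=1}^{m} \hat{\beta}_l$$

and its covariance matrix is

$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$$

where

$$\bar{U} = \frac{1}{m} \sum_{l=1}^{m} U_l, \qquad B = \frac{1}{m-1} \sum_{l=1}^{m} \left(\hat{\beta}_l - \bar{\beta}\right)\left(\hat{\beta}_l - \bar{\beta}\right)^{\top}.$$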

More specific details about our multiple imputation methodology may be found in Raghunathan et al. [15].
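The combining step based on Rubin [11] can be sketched as follows: the pooled estimate is the average of the per-data-set estimates, and the pooled covariance is the average within-imputation covariance plus the between-imputation covariance inflated by (1 + 1/m). The function name `pool_rubin` and its array layout are assumptions made for this illustration.

```python
import numpy as np

def pool_rubin(estimates, covariances):
    """Combine m complete-data results with Rubin's rules.

    estimates:   array-like of shape (m, k), one coefficient vector per data set
    covariances: array-like of shape (m, k, k), the matching covariance matrices
    """
    est = np.asarray(estimates, dtype=float)
    cov = np.asarray(covariances, dtype=float)
    m = est.shape[0]
    pooled = est.mean(axis=0)            # multiply imputed estimate
    within = cov.mean(axis=0)            # average within-imputation covariance
    dev = est - pooled
    between = dev.T @ dev / (m - 1)      # between-imputation covariance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

With m = 100 imputations, as used in this research, the inflation factor is 1.01, so the between-imputation term enters almost unscaled.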