The use of the bootstrap in the analysis of case-control studies with missing data

Volkert Siersma*, Christoffer Johansen**

*Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark

**Institute of Cancer Epidemiology, The Danish Cancer Society, Copenhagen, Denmark

Abstract

Background. Valid inference and efficiency are concerns when there are missing values in the risk factors of case-control studies. Complete-case analysis is inefficient and often biased for these studies. Multiple imputation is a more efficient approach, but valid confidence intervals require the complex assumption that the imputation is proper, which is sometimes hard to ascertain. Computationally intensive resampling methods assume less of the imputation method in exchange for computation time.

Methods. A practical bootstrap method is presented to conduct inference in multivariate case-control studies when risk factors have incomplete data. This is illustrated through two case studies with considerable missing data in some risk factors of interest. The first study illustrates the applicability of the bootstrap method compared to complete-case analysis and multiple imputation. The second study illustrates the limitations of the bootstrap method.

Results. The bootstrap approach gives very similar results to multiple imputation.

Conclusion. The bootstrap approach can be preferable when the imputation procedure cannot be shown to be fully proper, but only to result in unbiased estimates.

Key words: nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple imputation, matched case-control study


Missing values are common in epidemiological data, and even well-designed and well-executed studies can feature a considerable number of missing values. A study design often used to assess the effects of multiple factors on the risk for a relatively rare disease is the matched case-control design. With this design, individuals with the disease are sampled and, for each case, one or more controls are sampled that are similar to the case in certain characteristics but do not have the disease. The risk factors are often assessed using conditional logistic regression1,2. In these studies risk factors are sampled retrospectively and observations may be incomplete, often because of causes beyond the scope of the problem addressed in the study.

Many analysts exclude any subject with missing values from the data and proceed using methods for data without missing values. This so-called complete-case analysis is the default for many, if not all, statistical computer packages when faced with data containing missing values. This approach gives biased estimates for case-control studies if the occurrence of missing values depends on both the case-control identifier and the risk factors3,4. Additionally, complete-case analysis is inefficient because information is lost when subjects with incomplete observations are excluded. Several incompletely observed risk factors can easily leave only a few subjects with complete data.

Multiple imputation5,6 is a method that overcomes this inefficiency and claims valid inference when values are missing at random (MAR)5, i.e. when the fact that a value is missing is unrelated to the actual value that is missing, given the observed data. It is just as general as the complete-case approach and computationally inexpensive, and is therefore popular7. In this approach, several complete datasets are created by filling in incomplete observations using information from the existing data. These completed datasets are then analysed by standard methods. The results are combined with Rubin’s rule6,8, where the mean of the obtained estimates has a variance estimated by a sum of between- and within-repetition variances. Few repetitions are needed because the simulation error is relatively small compared to the overall uncertainty and Rubin’s rule accounts explicitly for this error6,9.
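As an illustration of the combination step, the following is a minimal sketch of Rubin’s rule for a single parameter in its usual textbook form. The function name `rubin_combine` and the example numbers are hypothetical, and the factor (1 + 1/M) on the between-repetition variance is the standard correction for the finite number of repetitions; it is not spelled out in the text above.

```python
from statistics import mean, variance

def rubin_combine(estimates, variances):
    """Combine M completed-data estimates of one parameter with Rubin's rule.

    estimates: the M point estimates (e.g. log odds ratios)
    variances: the M squared standard errors from the completed-data analyses
    """
    m = len(estimates)
    pooled = mean(estimates)                # mean of the M estimates
    within = mean(variances)                # within-repetition variance
    between = variance(estimates)           # between-repetition variance (divisor M - 1)
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return pooled, total

# Hypothetical numbers for M = 3 repetitions, for illustration only.
pooled, total = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The second term is what distinguishes the pooled variance from a simple average of the completed-data variances: it adds the extra uncertainty caused by the missing values.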

The variance estimate obtained by multiple imputation has been criticised for inconsistency, with possible progressive bias in certain settings10. Additionally, the condition that the imputations are proper6, a complex requirement for valid inference, is often hard to establish in practice9,11. Multiple imputation was originally devised for large public-use datasets, where trained statisticians create a number of completed datasets, limited for computational and logistical reasons, for public distribution to possibly many end-users with access only to standard statistical software. Case-control studies, in contrast, often have a single end-user and a single analysis, which on modern computers can use many more simulated datasets than originally envisaged. This opens the way for computationally intensive resampling methods that circumvent the above criticism.

The nonparametric bootstrap is a general method of inference for statistics with an in principle unknown distribution12-15. The data generation process is mimicked through sampling with replacement from the original sample to obtain a replica dataset of the same size. Assuming that the original data are representative of the total population, the parameter estimates from many resampled replica datasets construct an empirical distribution for that estimate, which is used for inference. The use of the nonparametric bootstrap for inference on imputation estimators has been acknowledged before7,16-18, but generally discarded as being too cumbersome computationally.
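The resampling idea can be sketched in a few lines. The example below is only an illustration for complete data and a simple statistic (the mean); the function name `bootstrap_replicas` and the data values are made up and do not come from the studies discussed here.

```python
import random
from statistics import mean, stdev

def bootstrap_replicas(sample, statistic, n_replicas=1000, seed=0):
    """Return the statistic computed on n_replicas resampled datasets."""
    rng = random.Random(seed)
    n = len(sample)
    replicas = []
    for _ in range(n_replicas):
        # Sample with replacement to obtain a replica of the same size.
        replica = [sample[rng.randrange(n)] for _ in range(n)]
        replicas.append(statistic(replica))
    return replicas

# Made-up observations, for illustration only.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
replicas = bootstrap_replicas(data, mean)

# The spread of the replica estimates serves as a bootstrap standard error.
bootstrap_se = stdev(replicas)
```

The list `replicas` plays the role of the empirical distribution of the estimate; standard errors or percentile intervals are read off from it.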

We present a practical bootstrap method for inference in multivariate case-control studies when risk factors have incomplete data. This is done through two case studies with considerable missing data in risk factors of interest. The first study illustrates the applicability of the bootstrap method compared to complete-case analysis and multiple imputation. The second study illustrates the limitations of the bootstrap method.

RSV infection – the nonparametric bootstrap

Respiratory Syncytial Virus (RSV) infection causes hospitalisation during the first two years of life for some 2 percent of children born each year. A matched case-control study19 features all 1272 hospitalisations for RSV infection in two Danish counties in the 5-year period from 1990 to 1995. Whenever possible, five controls are randomly chosen from the Danish central personal register matched on gender, age and municipality. There are 6075 controls in the study. Potential risk factors are gestational age, birth weight, household size and space, the mother’s smoking habits and level of maternal antibodies against RSV. Two factors have a sizable portion of missing observations. For the smoking factor 38% of the entries are missing since this information was only collected in the later part of the study. Maternal antibodies are thought to have a possible effect only in the first three months after birth. Therefore this measurement is ordered only for the 286 children that are hospitalised during the first three months of their lives and is available for 233 of these. Additionally, since this information is expensive, it is obtained for only one corresponding, but randomly chosen, control. As maternal antibodies have disappeared after three months, this value is assumed zero for children over three months of age.

A multivariate conditional logistic regression model is estimated for complete data using a Cox proportional hazards procedure present in many statistical software packages[#]. The estimated log(OR)s with corresponding confidence intervals are subsequently transformed to the OR scale. To accommodate non-linearities, continuous factors are made ordinal; if no natural categories exist, four categories approximately corresponding to the quartiles of the distribution of the risk factor are chosen. For these categories, Table 1 lists three sets of ORs relative to the hypothesised lowest-risk category of each risk factor, with corresponding confidence intervals; the three sets correspond to three approaches to incomplete data.
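The back-transformation to the OR scale amounts to exponentiating the estimate and both interval limits. The sketch below assumes a Wald-type interval on the log scale (estimate plus or minus 1.96 standard errors), which is the common default but is an assumption here; the function name `odds_ratio_ci` and the input values are hypothetical.

```python
import math

def odds_ratio_ci(log_or, se, z=1.96):
    """Transform a log odds ratio and its standard error to the OR scale.

    Returns the OR with the lower and upper limits of the interval,
    assuming a normal approximation on the log scale.
    """
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return math.exp(log_or), lower, upper

# Hypothetical estimate: log(OR) = 0.5 with standard error 0.2.
or_hat, lower, upper = odds_ratio_ci(0.5, 0.2)
```

Because the interval is symmetric on the log scale, it is asymmetric around the OR itself, as in the intervals of Table 1.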

Complete-case estimates are reported in the first column of Table 1 (CC). These are obtained by excluding from the data sample the children for whom either the smoking status of the mother or the antibody titer is missing, and applying the estimation procedure to these reduced data. This approach is consistent here because missing smoking information depends solely on calendar time, and missing blood tests are due to administrative irregularities considered random3,4. The complete-case method is the default of most statistical computer packages and thus requires no extra time to implement and only one call of the estimation procedure.

Complete-case analysis of the RSV data shows significant effects of many relevant risk factors and could support a final conclusion. Through efficiency gains, however, other methods could give more evidence for the effect of birth weight and better assess the influence of crowding and maternal antibody titer, the latter being the most costly information.

Table 1: Estimated Odds Ratios (OR) with corresponding 95% Confidence Intervals (CI) in a multivariate conditional logistic regression for the risk factors for hospitalisation for RSV infection. The estimates are constructed through Complete-case analysis (CC), Multiple Imputation (MI) and NonParametric Bootstrap (NPB), respectively.

Risk factor / Level / CC OR (95% CI) / MI OR (95% CI) / NPB OR (95% CI)
Gestational age / <33 weeks / 4.65 (2.44-8.85) / 3.88 (2.41-6.25) / 3.75 (2.74-7.75)
Gestational age / 33-35 weeks / 1.64 (0.96-2.81) / 1.73 (1.17-2.57) / 1.66 (1.20-2.82)
Gestational age / 36-37 weeks / 1.31 (0.91-1.89) / 1.43 (1.07-1.92) / 1.40 (1.10-1.97)
Gestational age / 38-39 weeks / 1.13 (0.92-1.39) / 1.18 (0.99-1.40) / 1.16 (1.00-1.40)
Gestational age / >39 weeks / 1.00 / 1.00 / 1.00
Birth weight / <3.0 kg / 1.61 (1.11-2.31) / 1.42 (1.06-1.91) / 1.46 (1.10-1.98)
Birth weight / 3.0-3.5 kg / 1.28 (0.94-1.75) / 1.15 (0.89-1.47) / 1.16 (0.90-1.51)
Birth weight / 3.5-4.0 kg / 1.12 (0.82-1.54) / 1.06 (0.83-1.26) / 1.07 (0.83-1.38)
Birth weight / >4.0 kg / 1.00 / 1.00 / 1.00
Space per member of household / <22 m² / 1.36 (1.01-1.82) / 1.10 (0.88-1.38) / 1.09 (0.87-1.42)
Space per member of household / 22-28 m² / 1.27 (0.97-1.67) / 1.14 (0.91-1.42) / 1.14 (0.92-1.48)
Space per member of household / 28-36 m² / 1.06 (0.81-1.39) / 1.03 (0.83-1.26) / 1.02 (0.82-1.29)
Space per member of household / >36 m² / 1.00 / 1.00 / 1.00
Age difference with next older sibling / 0-2 years / 1.70 (1.26-2.28) / 1.76 (1.40-2.20) / 1.74 (1.45-2.32)
Age difference with next older sibling / 2-4 years / 1.61 (1.26-2.05) / 1.64 (1.34-1.99) / 1.62 (1.40-2.07)
Age difference with next older sibling / >4 years / 1.45 (1.11-1.89) / 1.23 (0.99-1.52) / 1.22 (1.01-1.56)
Age difference with next older sibling / adult (no sibs) / 1.00 / 1.00 / 1.00
Maternal antibody titer / <210 / 1.35 (0.69-2.64) / 1.22 (0.71-2.11) / 1.57 (0.78-2.22)
Maternal antibody titer / 210-275 / 1.23 (0.63-2.40) / 1.68 (0.97-2.91) / 1.85 (1.08-2.74)
Maternal antibody titer / 275-330 / 1.65 (0.86-3.15) / 1.75 (1.04-2.95) / 1.92 (1.22-2.85)
Maternal antibody titer / >330 / 1.00 / 1.00 / 1.00
Smoking status of the mother / smoking / 1.64 (1.36-1.98) / 1.56 (1.19-2.05) / 1.57 (1.32-1.98)
Smoking status of the mother / non-smoking / 1.00 / 1.00 / 1.00

Multiple imputation estimates are reported in the second column of Table 1 (MI). In this approach M datasets are simulated by replacing missing values in the original data sample S by qualified guesses through an imputation procedure imp(S). These completed datasets are then analysed separately with the complete data procedure and the resulting M estimates are combined using Rubin’s rule6,8 to arrive at ORs and standard errors for the classes of the risk factors. The multiple imputation approach in Table 1 uses M=10 simulated datasets.

Imputation of the two missing factors is based on sequential random draws from the conditional probability distribution of the risk factor given all complete variable sets in the study20,21. First, the smoking factor is imputed by draws from a Bernoulli distribution for the probability of a smoking mother conditional on all complete information – i.e. the risk factors without missing values, the case-control identifier and the matching variables, but not the maternal antibody titer – using a logistic regression model estimated from the part of the data for which the smoking information is not missing. Thereafter, the antibody titer is imputed by a draw from a linear regression model, estimated on that part of the data for which the antibody titer is not missing, on all complete information, now including the newly imputed smoking factor.
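The two-step scheme just described can be sketched as follows. This is a simplified illustration on synthetic data, not the code used for the RSV study: the variable names (`case`, `low_weight`, `smoking`, `titer`), the gradient-ascent fitting routine and the data-generating numbers are all assumptions, the matching variables are left out, and the draw from the linear model is taken here as the model prediction plus normal noise with the residual standard deviation.

```python
import math
import random

def predict(beta, x):
    """Linear predictor: intercept plus weighted covariates."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def identity(eta):
    return eta

def fit(xs, ys, inverse_link, steps=2000, rate=0.1):
    """Fit a regression with intercept by plain gradient ascent.

    With expit this is logistic regression; with identity, least squares.
    """
    beta = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            resid = y - inverse_link(predict(beta, x))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + rate * g / len(xs) for b, g in zip(beta, grad)]
    return beta

def complete_part(row):
    """Variables without missing values (matching variables omitted here)."""
    return [row["case"], row["low_weight"]]

def impute_sequentially(rows, rng):
    rows = [dict(r) for r in rows]  # leave the input untouched

    # Step 1: smoking, Bernoulli draw from a logistic model on complete variables.
    seen = [r for r in rows if r["smoking"] is not None]
    beta = fit([complete_part(r) for r in seen], [r["smoking"] for r in seen], expit)
    for r in rows:
        if r["smoking"] is None:
            r["smoking"] = int(rng.random() < expit(predict(beta, complete_part(r))))

    # Step 2: titer, linear model that now also uses the imputed smoking factor.
    def covariates(row):
        return complete_part(row) + [row["smoking"]]

    seen = [r for r in rows if r["titer"] is not None]
    beta = fit([covariates(r) for r in seen], [r["titer"] for r in seen], identity)
    resid = [r["titer"] - predict(beta, covariates(r)) for r in seen]
    sd = math.sqrt(sum(e * e for e in resid) / (len(seen) - len(beta)))
    for r in rows:
        if r["titer"] is None:
            r["titer"] = predict(beta, covariates(r)) + rng.gauss(0.0, sd)
    return rows

# Synthetic data with missing smoking and titer values (None), illustration only.
rng = random.Random(1)
rows = []
for _ in range(120):
    case, low = rng.randint(0, 1), rng.randint(0, 1)
    smoking = int(rng.random() < 0.3 + 0.2 * case)
    titer = 2.5 - 0.3 * case + rng.gauss(0.0, 0.5)
    rows.append({"case": case, "low_weight": low,
                 "smoking": smoking if rng.random() > 0.38 else None,
                 "titer": titer if rng.random() > 0.3 else None})
completed = impute_sequentially(rows, rng)
```

Calling `impute_sequentially` repeatedly on the same incomplete data gives different completed datasets, which is what the multiple imputation approach requires.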

This imputation scheme focuses on easy implementation rather than on being proper6. Single-outcome procedures for linear and logistic regression are standard in statistical computer packages and usually produce model predictions for a missing outcome when all covariates are present. The procedure is constructed by iterating these standard procedures together with functions that produce random draws from probability distributions. This imputation scheme is proper for a single factor with missing values under an assumption of ignorability, i.e. missing values in a factor depend on the other variables in the study in a similar way as its observed values do9. The proposed sequential imputation algorithm cannot be proper, since not all available information is used to impute the smoking factor. Proper imputation would be approached by iterating the above imputation procedure between the factors with missing information, that is, by additionally using the imputed antibody titer in the logistic model for the smoking factor in the next iteration step20,21. The uniterated imputation procedure used here is rendered first-moment proper, i.e. it gives unbiased estimates, by an assumption of independence of smoking and antibody titer conditional on all other variables.

The multiple imputation procedure takes some time to implement. However, software that performs the iterations of the sequential method is available20,22. The estimation procedure and the imputation procedure are each called M=10 times, and the results are combined only once. This usually gives short computing times, even for more elaborate imputation schemes.

Multiple imputation is more efficient than complete-case analysis, as evidenced by narrower confidence intervals (Table 1). In particular, the evidence supporting a (non-linear) effect of maternal antibodies is due to efficiency gains. The effects of birth weight and crowding are slightly lower than in the complete-case analysis, which might be the result of incorrectly assumed independences.

Estimates obtained by a nonparametric bootstrap are reported in the last column of Table 1 (NPB). The bootstrap method for inference on imputation estimates from an incomplete dataset S is determined by a resampling procedure res(S) and an imputation procedure imp(S). Bootstrap replica estimates are obtained by applying the imputation procedure to datasets obtained through the resampling procedure, i.e. by estimating from imp(res(S)). A large number B of replicas constructs an empirical distribution that approximates the distribution of the parameter estimates.[&] A 95% confidence interval is estimated by the 2.5% and 97.5% percentiles of the replica estimates; improvements to this percentile method exist24. B=1000 bootstrap replicas are constructed in the RSV study to obtain confidence intervals15. A condition for the bootstrap approach to give valid inference is consistency of the estimate; here the estimate is obtained through the imputation procedure imp(S) described above, which is argued to provide consistent estimates.
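The combination of resampling and imputation can be sketched as below. The functions `res` and `imp` follow the notation of the text, but their bodies are toy stand-ins: the resampling ignores the matched design, the imputation simply draws a random observed value for each missing one, and the estimate is a mean rather than a conditional logistic regression. The data values are made up.

```python
import random
from statistics import mean, quantiles

def res(sample, rng):
    """Resample with replacement to a replica of the same size."""
    return [rng.choice(sample) for _ in sample]

def imp(sample, rng):
    """Toy imputation: replace each missing value by a random observed value."""
    observed = [v for v in sample if v is not None]
    return [v if v is not None else rng.choice(observed) for v in sample]

def bootstrap_ci(sample, estimate, n_replicas=1000, seed=0):
    """Percentile 95% interval from estimates on imp(res(S))."""
    rng = random.Random(seed)
    replicas = [estimate(imp(res(sample, rng), rng)) for _ in range(n_replicas)]
    cuts = quantiles(replicas, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]

# Made-up incomplete sample (None marks a missing value), illustration only.
sample = [3.1, None, 2.4, 4.0, None, 3.6, 2.9, 3.3, None, 4.4,
          2.7, 3.8, 3.0, None, 3.5, 2.6, 4.1, None, 3.2, 3.7]
lower, upper = bootstrap_ci(sample, mean)
```

Note that the imputation is redone inside every replica, so the interval reflects both the sampling variation and the uncertainty added by the imputation. This is also why the computation is far heavier than for multiple imputation with M=10.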