Example of Working with Missing Values

Alan C. Acock

February, 2007

Presented at the Department of Family and Child Studies

Florida State University

Supporting material is available at

Based on the PowerPoint presentation, the following is an example of working with missing values. These are notes to facilitate the presentation and are not intended to be in a format appropriate for publication.

There are many packages and commands for working with missing values. I will illustrate the process using a command written by Royston (2004) for Stata. It is useful to see this process even if you do not have access to Stata because it is one of the best approaches currently available and it allows us to illustrate several important issues that are problematic in various packages.

Model:

We will estimate the hours a person works. We think this depends on gender, race (white, black, other), age, education, number of children, and an interaction of number of children and gender.

Using the 2004 General Social Survey, we have data for 1,262 adults with no missing values.

. sum

Variable | Obs Mean Std. Dev. Min Max

------+------

id | 1262 1351.342 801.18 4 2812

hrs1 | 1262 42.31616 14.89532 1 89

childs | 1262 1.602219 1.436003 0 8

age | 1262 41.94295 12.85587 18 86

educ | 1262 14.36846 2.753653 0 20

------+------

paeduc | 1262 12.25832 3.742694 0 20

maeduc | 1262 12.22583 3.284868 0 20

income98 | 1262 18.49128 4.418744 1 24

attend | 1262 3.861331 2.666016 0 8

------+------

other | 1262 .0847861 .2786735 0 1

black | 1262 .1030111 .3040939 0 1

female | 1262 .4896989 .500092 0 1

femkid | 1262 .8312203 1.297545 0 8

------+------

If we do a regression (listwise deletion) we obtain:

. regress hrs1 female other black age educ childs femkid, beta

Source | SS df MS Number of obs = 1262

------+------F( 7, 1254) = 12.16

Model | 17784.362 7 2540.62314 Prob > F = 0.0000

Residual | 261994.488 1254 208.927024 R-squared = 0.0636

------+------Adj R-squared = 0.0583

Total | 279778.85 1261 221.870619 Root MSE = 14.454

------

hrs1 | Coef. Std. Err. t P>|t| Beta

------+------

female | -4.09762 1.233346 -3.32 0.001 -.1375725

other | -2.9565 1.478854 -2.00 0.046 -.0553126

black | .2853287 1.360049 0.21 0.834 .0058251

age | -.0603985 .0348098 -1.74 0.083 -.0521288

educ | .3840182 .1500599 2.56 0.011 .0709923

childs | 1.010823 .4098343 2.47 0.014 .0974497

femkid | -1.562961 .5720398 -2.73 0.006 -.136151

_cons | 41.23919 2.59819 15.87 0.000 .

------

I created a new dataset that has missing values that violate the MAR assumption.

  • I deleted values deliberately so that the resulting dataset, under listwise deletion, has only 680 observations.
  • We are missing between 2% and 19% of the values on each variable, yet listwise deletion drops almost half the observations because each has a missing value on at least one variable.
  • If I had deleted values randomly, multiple imputation would approximate the results for the full sample. Instead, I deliberately deleted values so as to violate the assumption that the data are missing randomly.
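A quick back-of-envelope calculation shows why modest per-variable missing rates gut listwise deletion. This sketch (not part of the original notes) assumes, hypothetically, that missingness is independent across variables; the rates are the ones reported by misschk later in these notes.

```python
# Per-variable missing rates from the misschk table below (as proportions).
rates = {"hrs1": 0.109, "female": 0.109, "other": 0.039, "black": 0.034,
         "age": 0.194, "educ": 0.021, "childs": 0.116, "femkid": 0.189}

complete_frac = 1.0
for r in rates.values():
    complete_frac *= 1 - r  # probability a case is observed on this variable

print(f"Expected complete cases: {complete_frac:.1%} of the sample")
```

Under independence only about 42% of cases would be complete; the dataset actually retains 680/1,262 ≈ 54% because the missingness patterns overlap across variables.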

The results for our new dataset, using listwise deletion, are quite different:

. sum

Variable | Obs Mean Std. Dev. Min Max

------+------

id | 1262 1351.342 801.18 4 2812

hrs1 | 1125 42.272 14.87459 1 89

childs | 1115 1.583857 1.43404 0 8

age | 1017 42.09636 12.93885 18 86

educ | 1236 14.36246 2.755579 0 20

------+------

paeduc | 1232 12.25812 3.70872 0 20

maeduc | 1223 12.21259 3.302125 0 20

race | 1209 1.267163 .6012586 1 3

income98 | 1208 18.40977 4.432703 1 24

attend | 1212 3.861386 2.669322 0 8

------+------

other | 1213 .0824402 .2751477 0 1

black | 1219 .1033634 .3045579 0 1

female | 1125 .4897778 .5001178 0 1

femkid | 1024 .8476563 1.305764 0 8

------+------

. regress hrs1 female other black age educ childs femkid, beta

Source | SS df MS Number of obs = 680

------+------F( 7, 672) = 8.47

Model | 11131.7958 7 1590.25655 Prob > F = 0.0000

Residual | 126184.756 672 187.774934 R-squared = 0.0811

------+------Adj R-squared = 0.0715

Total | 137316.551 679 202.233507 Root MSE = 13.703

------

hrs1 | Coef. Std. Err. t P>|t| Beta

------+------

female | -4.7175 1.630608 -2.89 0.004 -.1659522

other | -3.564597 2.055072 -1.73 0.083 -.0648645

black | .9089816 1.7347 0.52 0.600 .0198014

age | -.0786796 .0443309 -1.77 0.076 -.0719363

educ | .6051706 .1961412 3.09 0.002 .1161483

childs | .9381454 .5071599 1.85 0.065 .0975426

femkid | -1.194012 .7266066 -1.64 0.101 -.1132623

_cons | 38.65546 3.375819 11.45 0.000 .

------

The results using mean substitution are also quite different, with the explanatory power of the model attenuated:

. regress hrs1m femalem otherm blackm agem educm childsm femkidm, beta

Source | SS df MS Number of obs = 1262

------+------F( 7, 1254) = 9.22

Model | 12168.8102 7 1738.40145 Prob > F = 0.0000

Residual | 236519.958 1254 188.612407 R-squared = 0.0489

------+------Adj R-squared = 0.0436

Total | 248688.768 1261 197.215518 Root MSE = 13.734

------

hrs1m | Coef. Std. Err. t P>|t| Beta

------+------

femalem | -4.990874 1.15559 -4.32 0.000 -.1678047

otherm | -2.767918 1.448562 -1.91 0.056 -.053167

blackm | .5248508 1.312423 0.40 0.689 .0111867

agem | -.0360429 .0357436 -1.01 0.313 -.0298081

educm | .3560231 .1431787 2.49 0.013 .0691348

childsm | .2233433 .3823879 0.58 0.559 .0214362

femkidm | -.6406052 .5417609 -1.18 0.237 -.0536494

_cons | 41.48353 2.552413 16.25 0.000 .

------
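The attenuation under mean substitution is easy to demonstrate with a small simulation (this sketch is not the author's data; the variables and missingness pattern are invented). Replacing missing predictor values with the mean shrinks the predictor's variance and drags the correlation with the outcome toward zero.

```python
import random

random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# Knock out 30% of x completely at random, then substitute the observed mean.
observed = [xi for i, xi in enumerate(x) if i % 10 >= 3]
mean_x = sum(observed) / len(observed)
x_imp = [xi if i % 10 >= 3 else mean_x for i, xi in enumerate(x)]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

print(corr(x, y), corr(x_imp, y))  # the mean-substituted correlation is smaller
```

The imputed cells contribute nothing to the covariance, so the correlation, and with it the model's R2, is systematically understated.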

Checking for missing values

. misschk hrs1 female other black age educ childs femkid, gen(miss) replace dummy

Variables examined for missing values

# Variable # Missing % Missing

------

1 hrs1 137 10.9

2 female 137 10.9

3 other 49 3.9

4 black 43 3.4

5 age 245 19.4

6 educ 26 2.1

7 childs 147 11.6

8 femkid 238 18.9

The columns in the table below correspond to the # in the table above.

If a column is _, there were no missing cases for that variable.

Missing for |

which |

variables? | Freq. Percent Cum.

------+------

1___5 _7_ | 11 0.87 0.87

1___5 ___ | 102 8.08 8.95

1______| 24 1.90 10.86

_2__5 __8 | 1 0.08 10.94

_2___ 6_8 | 25 1.98 12.92

_2___ __8 | 111 8.80 21.71

__34_ ___ | 43 3.41 25.12

__3__ ___ | 6 0.48 25.59

____5 _7_ | 9 0.71 26.31

____5 ___ | 122 9.67 35.97

_____ 6__ | 1 0.08 36.05

______78 | 101 8.00 44.06

______7_ | 26 2.06 46.12

______| 680 53.88 100.00

------+------

Total | 1,262 100.00

  • The first pattern has missing values on three variables, numbers 1, 5, and 7 (hrs1, age, childs).
  • This table can tell us whether there is a variable or, more usefully, a combination of variables with a lot of missing values.

The next table tells us how many people have missing values on 0, 1, 2, … of the variables. Notice that all but 37 of the observations are missing values on two or fewer variables, and no observation is missing values on more than three variables.

Missing for |

how many |

variables? | Freq. Percent Cum.

------+------

0 | 680 53.88 53.88

1 | 179 14.18 68.07

2 | 366 29.00 97.07

3 | 37 2.93 100.00

------+------

Total | 1,262 100.00

Variables created:

miss<varnm> is a binary variable indicating missing data for <varnm>.

  • This command creates a dummy variable (miss<varnm>) for each variable to represent its missingness.
  • These are coded 0 if not missing and 1 if missing.
  • We can use these new variables to see whether other variables in the dataset predict them.
  • A variable correlated with one of these indicators is known as an auxiliary variable: it is part of the mechanism that explains the missingness.
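The dummy-variable idea behind misschk can be sketched outside Stata. This toy example (the records and values are invented, not from the GSS) builds a miss<varnm> indicator per variable so it can be cross-tabulated or correlated with candidate auxiliary variables.

```python
# Toy records; None marks a missing value. Variable names mirror the document.
records = [
    {"hrs1": 40,   "age": 35,   "income98": 20},
    {"hrs1": None, "age": 50,   "income98": 5},
    {"hrs1": None, "age": None, "income98": 4},
    {"hrs1": 38,   "age": 29,   "income98": 18},
]

# miss<varnm>: 1 if the value is missing, 0 otherwise.
for rec in records:
    for var in ("hrs1", "age"):
        rec[f"miss{var}"] = int(rec[var] is None)

# A candidate auxiliary variable is one that predicts the indicator; in this
# invented example low income goes with missing hrs1.
print([(r["income98"], r["misshrs1"]) for r in records])
```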

Here is one example:

. tab misshrs1

-> tabulation of misshrs1

Missing |

value for |

hrs1? | Freq. Percent Cum.

------+------

NotMissing | 1,125 89.14 89.14

Missing | 137 10.86 100.00

------+------

Total | 1,262 100.00

What are our auxiliary variables?

paeduc

maeduc

income98

attend

black

other

  • Normally, we would pick more.
  • These are variables that predict whether there is a missing value or not—they may or may not predict the score for the missing value.
  • Think of other auxiliary variables that are mechanisms explaining why a value is missing—race? Depression?

What are our covariates?

paeduc

maeduc

income98

attend

black

other

  • Normally, we would pick more candidates to be used as either auxiliary variables or covariates.
  • These happen to be the same variables, but here they are selected because we think they might be related to the scores on our primary variables (hrs1, childs, age, educ, and the interaction femkid).
  • Think of other covariates. Often these are the same as the auxiliary variables. For example, minorities may work fewer hours because of discrimination, and knowing minority status would help us predict the value of hours when it is missing.

Finding auxiliary variables and covariates

To evaluate candidate auxiliary variables, we explore for variables that are correlated with the missingness indicators. Normally, we would include far more variables. (Edited output follows.)

pwcorr misshrs1-missfemkid hrs1-female

| misshrs1 missfe~e missot~r missbl~k missage misseduc missch~s

------+------

misshrs1 | 1.0000

missfemale | -0.1218 1.0000

missother | -0.0701 -0.0701 1.0000

missblack | -0.0655 -0.0655 0.9345 1.0000

missage | 0.5564 -0.1648 -0.0986 -0.0922 1.0000

misseduc | -0.0506 0.3977 -0.0292 -0.0272 -0.0712 1.0000

misschilds | -0.0394 -0.1267 -0.0730 -0.0682 -0.0533 -0.0527 1.0000

missfemkid | -0.1682 0.7238 -0.0969 -0.0905 -0.2315 0.2866 0.4627

hrs1 | . 0.0007 -0.0423 -0.0442 -0.0202 -0.0040 0.0767

childs | 0.0424 -0.0857 -0.0507 -0.0426 -0.0099 -0.0505 .

age | -0.0202 -0.0691 0.0342 0.0362 . -0.0354 0.0439

educ | 0.0172 0.0444 -0.0132 -0.0106 0.0215 . -0.0765

paeduc | 0.0102 0.0547 0.0116 0.0273 0.0054 0.0250 -0.0864

maeduc | -0.0378 0.0761 0.0027 0.0189 -0.0248 0.0317 -0.0985

income98 | 0.0407 -0.0189 0.1013 0.0861 -0.0222 0.0120 -0.0221

attend | 0.0003 0.0049 -0.0506 -0.0552 0.0614 -0.0371 0.0472

other | -0.0123 0.0540 . . -0.0313 0.1833 0.0265

black | 0.0328 -0.0526 -0.0239 . 0.0449 -0.0315 -0.0595

female | -0.0114 . 0.0000 -0.0006 0.0151 0.0304 -0.0105

| missfe~d hrs1 childs age educ paeduc maeduc

------+------

missfemkid | 1.0000

hrs1 | 0.0624 1.0000

childs | -0.0857 -0.0392 1.0000

age | -0.0223 -0.0322 0.4021 1.0000

educ | -0.0748 0.0739 -0.0840 0.0151 1.0000

paeduc | -0.0434 0.0208 -0.1637 -0.2146 0.3701 1.0000

maeduc | -0.0196 0.0632 -0.1602 -0.1877 0.3754 0.6151 1.0000

income98 | -0.0364 0.1914 0.1103 0.2140 0.2925 0.1274 0.1283

attend | 0.0498 -0.0533 0.2204 0.0606 0.0669 -0.0269 -0.0104

other | -0.0047 -0.0484 -0.0335 -0.1016 0.0094 -0.0156 -0.1387

black | -0.0993 -0.0055 0.0200 -0.0747 -0.0699 -0.0732 -0.0583

female | 0.0095 -0.2277 0.0882 0.0296 0.0063 -0.0029 0.0035

| income98 attend other black female

------+------

income98 | 1.0000

attend | 0.0619 1.0000

other | -0.0529 0.0004 1.0000

black | -0.1349 0.1443 -0.1021 1.0000

female | -0.0823 0.1253 -0.0324 0.1020 1.0000

The correlation between income98 and age is .21; income98 is a covariate that predicts the value of age where age is missing. The correlation between other (neither black nor white) and misseduc is .18: people who are neither black nor white are more likely to have a missing value on education (not necessarily a higher or lower value, just a missing value). Thus other is an important auxiliary variable, a mechanism for missingness on education. MAR assumes we have included the relevant auxiliary variables.

  • We need to include any auxiliary variables and any covariates we identify.
  • Users of the full information maximum likelihood approach to missing values often include no additional variables even when they are available. I show how to include them in my JMF article (Acock, 2005).

Multiple Imputation

So much for preliminary analysis. We are now ready to do multiple imputation. The command we will use is one of the best currently available; if you use another command you should be aware of the issues we handle here. We will do this with a command called ice, written by Patrick Royston as a Stata implementation of S. van Buuren and C. G. M. Oudshoorn's MICE program, which is available in R and S-Plus.

First, do a dry run. This does nothing but show how Stata proposes to carry out the imputation. We will need to modify this as explained below:

. ice hrs1-femkid using imputed.dta, dryrun m(20)

#missing |

values | Freq. Percent Cum.

------+------

0 | 649 51.43 51.43

1 | 121 9.59 61.01

2 | 381 30.19 91.20

3 | 80 6.34 97.54

4 | 30 2.38 99.92

5 | 1 0.08 100.00

------+------

Total | 1,262 100.00

Variable | Command | Prediction equation

------+------+------

hrs1 | regress | childs age educ paeduc maeduc race income98 attend

| | other black female femkid

childs | regress | hrs1 age educ paeduc maeduc race income98 attend other

| | black female femkid

age | regress | hrs1 childs educ paeduc maeduc race income98 attend

| | other black female femkid

educ | regress | hrs1 childs age paeduc maeduc race income98 attend

| | other black female femkid

paeduc | regress | hrs1 childs age educ maeduc race income98 attend other

| | black female femkid

maeduc | regress | hrs1 childs age educ paeduc race income98 attend other

| | black female femkid

race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend

| | other black female femkid

income98 | regress | hrs1 childs age educ paeduc maeduc race attend other

| | black female femkid

attend | regress | hrs1 childs age educ paeduc maeduc race income98 other

| | black female femkid

other | logit | hrs1 childs age educ paeduc maeduc race income98

| | attend black female femkid

black | logit | hrs1 childs age educ paeduc maeduc race income98

| | attend other female femkid

female | logit | hrs1 childs age educ paeduc maeduc race income98

| | attend other black femkid

femkid | regress | hrs1 childs age educ paeduc maeduc race income98

| | attend other black female

End of dry run. No imputations were done, no files were created.

  • This shows the defaults Stata would use if we made no further specifications. If a variable had no missing values, the table would show that nothing will be done with that variable.
  • Notice we are using OLS regression (regress) for every variable except female, other, and black, where Stata wants to do logistic regression (logit), and race, for which Stata wants to use multinomial logistic regression (mlogit).
  • Because female, other, and black have just two values, Stata figured out that we should do a logistic regression.
  • Because race has three values (white, black, and other), Stata figured out that we should do a multinomial logistic regression.
  • Sometimes a variable with three values should be treated with ordinal logistic regression (ologit) and sometimes with OLS regression (regress), but with three categories Stata always guesses that we want multinomial logistic regression.
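The default rule the bullets describe can be sketched as a small function. This is an illustration of the logic, not ice's actual source code:

```python
def default_cmd(values):
    """Sketch of the default-model choice: 2 distinct values -> logit,
    3 -> mlogit, anything else -> regress (missings excluded)."""
    k = len(set(v for v in values if v is not None))
    if k == 2:
        return "logit"
    if k == 3:
        return "mlogit"
    return "regress"

print(default_cmd([0, 1, 1, 0]))      # female-like -> logit
print(default_cmd([1, 2, 3, 1]))      # race-like -> mlogit
print(default_cmd([40, 38, 51, 20]))  # hrs1-like -> regress
```

As the last bullet warns, a three-valued ordinal variable would also get mlogit by this rule, so the default sometimes needs to be overridden with cmd().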

Problems found with the dry run

  • There is a problem with femkid. It does not make sense to impute childs, female, and their interaction independently, because femkid, by definition, is the product of childs and female.
  • Therefore we impute childs and female but let the interaction be computed from the imputed values, femkid = childs × female. That is, we impute the interaction passively. The option is
  • passive(femkid:childs*female).
  • This option also makes sure that femkid is not used as a predictor when we are imputing either childs or female.
  • We have race, which is coded as white, black, or other. With three categories we need to create two dummy variables, black and other.
  • We let white be our reference group.
  • We should not impute black or other using logistic regression, as we do with female above, because black and other are interdependent (a person should not have an imputed value of 1 on both variables).
  • Therefore, we need to impute race using multinomial logistic regression and not impute black or other using logistic regression.
  • This means each missing value will be assigned to one and only one race. This approach gives an imputed value for every missing value on race.
  • Then ice goes back and translates the imputed race into the dummy variables black and other. These need to be passively imputed following the active imputation of race. We use the option
  • passive(black:race==2\other:race==3). (The double equal sign, ==, is pronounced “is” in Stata.)
  • To guarantee that multinomial logit is used to actively impute race, we could use the option
  • cmd(race:mlogit).
  • We do not have to include this option because mlogit is the default for race, but we do so to illustrate how to specify an estimator. Currently the only available estimators are regress, logit, mlogit, and ologit.

This gets more complicated because we cannot use race, with its 3 nominal levels, as a predictor; we must use black and other as predictors when imputing the other variables. So we need to add an option to make this substitution happen:

  • substitute(race:other black).
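The passive options can be mimicked outside Stata to see what they compute. In this sketch (the record and its values are invented) the actively imputed components come first, and the derived variables are then filled in deterministically rather than imputed on their own:

```python
# After active imputation of childs, female, and race (values invented):
row = {"childs": 2, "female": 1, "race": 3}

row["femkid"] = row["childs"] * row["female"]  # passive(femkid:childs*female)
row["black"] = int(row["race"] == 2)           # passive(black:race==2)
row["other"] = int(row["race"] == 3)           # passive(other:race==3)

print(row)
```

Because black and other are both derived from the single imputed race, no observation can end up coded 1 on both dummies.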

Here is what we do for our situation. I realize this is a complex command; Stata commands are rarely even remotely this long, except for some complex graphs. The three slashes at the end of a line indicate that the following line is still part of the same command. The m(20) option will impute 20 datasets. Other programs are much more difficult to implement and often impute just 5 datasets to keep them manageable.

. ice hrs1-femkid using impute.dta, m(20) ///

passive(femkid:childs*female\black:race==2\other:race==3) ///

substitute(race:other black) cmd(race:mlogit)

#missing |

values | Freq. Percent Cum.

------+------

0 | 649 51.43 51.43

1 | 121 9.59 61.01

2 | 381 30.19 91.20

3 | 80 6.34 97.54

4 | 30 2.38 99.92

5 | 1 0.08 100.00

------+------

Total | 1,262 100.00

Variable | Command | Prediction equation

------+------+------

hrs1 | regress | childs age educ paeduc maeduc income98 attend other

| | black female femkid

childs | regress | hrs1 age educ paeduc maeduc income98 attend other

| | black female

age | regress | hrs1 childs educ paeduc maeduc income98 attend other

| | black female femkid

educ | regress | hrs1 childs age paeduc maeduc income98 attend other

| | black female femkid

paeduc | regress | hrs1 childs age educ maeduc income98 attend other

| | black female femkid

maeduc | regress | hrs1 childs age educ paeduc income98 attend other

| | black female femkid

race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend

| | female femkid

income98 | regress | hrs1 childs age educ paeduc maeduc attend other black

| | female femkid

attend | regress | hrs1 childs age educ paeduc maeduc income98 other

| | black female femkid

other | | [Passively imputed from race==3]

black | | [Passively imputed from race==2]

female | logit | hrs1 childs age educ paeduc maeduc income98 attend

| | other black

femkid | | [Passively imputed from childs*female]

Imputing 1..2..3..4..5..6..7..8..9..10..11..12..13..14..15..16..17..18..19..20..file impute.dta saved

Stata is extremely fast. In some programs this would take a long time, but here it is all done in a few seconds. With a large number of covariates and auxiliary variables, however, even Stata can take a very long time.

What did this command accomplish?

  • We created 20 datasets, m(20), and put them into a single file in our default directory with the name impute.dta.
  • The 20 datasets are stacked in one big dataset: the first 1,262 observations are the first complete dataset, the second 1,262 are the second dataset, and so on. This file has all of our variables, and there are no missing values.
  • It has 25,240 observations altogether (1,262 × 20). These 20 stacked datasets are ready to go with all of our variable names, variable labels, and value labels.
  • Some other programs produce the multiple datasets as separate text files that need to be transformed into datasets by adding variable names, value labels, etc.
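The layout of the stacked file follows directly from the bullet above and can be sketched as simple index arithmetic (this is an illustration, not a Stata command):

```python
# With N original observations stacked M times, imputation j (1-based)
# occupies rows (j-1)*N through j*N-1 of the stacked file.
N, M = 1262, 20

def rows_for_imputation(j, n=N):
    """0-based half-open row range [start, stop) for imputation j."""
    return (j - 1) * n, j * n

print(rows_for_imputation(1))   # first complete dataset
print(rows_for_imputation(20))  # last complete dataset
print(N * M)                    # total stacked observations: 25240
```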

Here is a summary of the stacked file, impute.dta:

. sum

Variable | Obs Mean Std. Dev. Min Max

------+------

id | 25240 1351.342 800.8783 4 2812

hrs1 | 25240 42.29011 14.87211 -10.76075 94.99245

childs | 25240 1.599714 1.437436 -4.524939 8

age | 25240 42.05228 12.99656 -3.952193 91.75469

educ | 25240 14.37851 2.75387 0 22.91912

------+------

paeduc | 25240 12.27622 3.703488 0 23.45145

maeduc | 25240 12.21553 3.295468 0 22.31159

race | 25240 1.265491 .5999308 1 3

income98 | 25240 18.39968 4.431575 1 32.87609

attend | 25240 3.867593 2.666426 -5.25985 13.53112

------+------

other | 25240 .0824485 .2750522 0 1

black | 25240 .1005943 .3007967 0 1

female | 25240 .4880745 .4998677 0 1

femkid | 25240 .8467585 1.312306 -4.524939 8

------+------

missfemale | 25240 .1085578 .3110898 0 1

  • This summary of the 25,240 observations shows some minimum values that might seem problematic, such as −10.76 hours a week or −4.52 children.
  • We could change all “impossible” values to possible values: replace hrs1 = 0 if hrs1 < 0; replace childs = 0 if childs < 0; replace age = 18 if age < 18; replace attend = 0 if attend < 0; and replace femkid = 0 if childs == 0 & female == 0. We would run the last command last.
  • Notice that the categorical variables (race, other, black, female) take on only discrete values.

Alternatively, we can leave all of the imputed values alone. There are very few out-of-bounds values in any of our 20 samples; for example, there are no observations below zero on hrs1 in the first dataset of 1,262 observations.
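The clean-up step can be sketched as a clamping function (an illustration, not the author's Stata commands; it also recomputes femkid from its clamped components rather than using the replace-if rule above, which keeps the passive definition femkid = childs × female intact):

```python
def clamp(rec):
    """Clamp imputed values into admissible ranges; fix the interaction last."""
    rec["hrs1"] = max(rec["hrs1"], 0)
    rec["childs"] = max(rec["childs"], 0)
    rec["age"] = max(rec["age"], 18)
    rec["attend"] = max(rec["attend"], 0)
    rec["femkid"] = rec["childs"] * rec["female"]  # recompute last
    return rec

# Invented record with out-of-bounds imputed values:
rec = clamp({"hrs1": -10.8, "childs": -4.5, "age": 17.2, "attend": -5.3,
             "female": 1, "femkid": -4.5})
print(rec)
```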

Estimating our model 20 times and combining the 20 separate results.

You will recall that our model regresses hrs1 on female, other, black, age, educ, childs, and femkid.

To run this on a single dataset (single imputation, as is done in SPSS) our command would be

regress hrs1 female other black age educ childs femkid in 1/1262, beta

The problem with single imputation is that it does not give us as much information as multiple imputation and will tend to underestimate standard errors. To get unbiased standard errors and better parameter estimates, we need to run the regression on the multiple datasets and then pool the results into a combined solution. To run the regression twenty times on the twenty datasets and pool the twenty sets of estimates, we run this command:

micombine regress hrs1 female other black age educ childs femkid

Multiple imputation parameter estimates (20 imputations)

------

hrs1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]

------+------

female | -4.54755 1.433964 -3.17 0.002 -7.360783 -1.734317

other | -3.07658 1.62503 -1.89 0.059 -6.264658 .1114975

black | .7583257 1.536302 0.49 0.622 -2.25568 3.772331

age | -.0431244 .0397376 -1.09 0.278 -.1210839 .034835

educ | .3490039 .1591362 2.19 0.028 .0368013 .6612065

childs | .7887713 .4992205 1.58 0.114 -.1906281 1.768171

femkid | -1.477353 .6848294 -2.16 0.031 -2.820891 -.1338153

_cons | 41.47106 2.795495 14.83 0.000 35.98669 46.95542

------

1262 observations.

Because ice and micombine were written by a biostatistician with little interest in standardized measures like R2 and the β’s, we get only an unstandardized solution.

  • The parameter estimates are better than with single imputation because each is the average of the 20 values for that parameter.
  • Any one imputation might produce a coefficient, i.e., an unstandardized B, that is too big or too small, but averaging over 20 repetitions mitigates this error.
  • The standard errors for multiple imputation tend to be larger than those for single imputation, and the t-ratios tend to be smaller.
  • Although we normally want to minimize standard errors and hence maximize t-ratios and power, those from a single imputation are incorrect because they ignore the variability across our 20 datasets.
  • This variability is inherent in imputation and should not be ignored. If the first dataset gets very different results from the second, the second from the third, and so on, the pooled standard errors for the multiple imputations will be much larger, reflecting the uncertainty of the imputation process (see Appendix).
  • If the results are virtually the same across the 20 datasets, the pooled standard errors will not be much larger than those from a single imputation.
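The pooling the bullets describe is Rubin's rules, which can be sketched in a few lines. The coefficients and standard errors below are invented for illustration (five imputations of a female-like coefficient), not taken from the output above:

```python
coefs = [-4.61, -4.40, -4.72, -4.55, -4.49]  # per-imputation estimates
ses   = [1.41, 1.45, 1.43, 1.44, 1.42]       # per-imputation standard errors

m = len(coefs)
q_bar = sum(coefs) / m                              # pooled coefficient: the mean
w = sum(se ** 2 for se in ses) / m                  # within-imputation variance
b = sum((q - q_bar) ** 2 for q in coefs) / (m - 1)  # between-imputation variance
t = w + (1 + 1 / m) * b                             # total variance
pooled_se = t ** 0.5

print(q_bar, pooled_se)  # pooled SE exceeds the average per-imputation SE
```

The between-imputation term b is exactly the across-dataset variability the bullets refer to: when the 20 solutions disagree, b is large and the pooled standard error grows; when they agree, b is near zero and the pooled standard error is barely inflated.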

We want R2 and the β’s!