Theme: Design of imputation

0. General information

0.1 Module code

Theme-Design of Imputation

0.2 Version history

Version / Date / Description of changes / Author / Institute
1.0 / 13-12-2011 / First version / Andrzej Młodak / GUS (PL)
1.1 / 15-12-2011 / First corrected version / Andrzej Młodak / GUS (PL)
2.0 / 15-02-2012 / Revised version / Andrzej Młodak / GUS (PL)
3.0 / 15-05-2012 / Third version / Andrzej Młodak / GUS (PL)

0.3 Template version and print date

Template version used / 1.0 p 3 d.d. 28-6-2011
Print date / 15-5-2012 13:17

Contents

General description – Theme:

1. Summary

2. General description

2.1. Whether to use imputation

2.2. Preliminary verification of the target variable

2.3. The choice of auxiliary variables

2.4. The choice of the imputation method

2.5. Quality control

2.6. Disclosure of the output

3. Design issues

4. Available software tools

5. Decision tree of methods

6. Glossary

7. Literature

A.1 Relationship with other modules

A.1.1 Related themes described in other modules

A.1.2 Methods explicitly referred to in this module

A.1.3 Mathematical techniques explicitly referred to in this module

A.1.4 GSBPM phases explicitly referred to in this module

A.1.5 Tools explicitly referred to in this module

A.1.6 Process steps explicitly referred to in this module


General description – Theme:

1. Summary

As the main theme module of this chapter states, two reasons for imputation (rather than other ways of estimation) are convenience and quality. The convenience is connected with using complete data files – although some care must be taken (for instance, one has to consider how to estimate the variance). The quality of the imputation depends on the information available, e.g. on the auxiliary variables that can be used (and how well they can be used). When we impute one or more values for a unit, we have to consider the internal consistency of the unit. Imputation is often conducted together with editing. Imputation depends on a model and should generally not be used for very influential units. In such cases we need to re-contact the unit or possibly have an expert conduct the imputation. These problems, as well as the choice between re-weighting and imputation for unit non-response, also depend on convenience and quality. All these aspects should be taken into account when we design the imputation of statistical data.

In this module the particular stages of the design of an imputation process are described. Strictly speaking, when planning this process we must take into account several important aspects which affect the final quality of the imputed values and of the estimates of aggregated descriptive statistics for the population under investigation. That is, we have to address the following questions:

  • recommendations for the application of imputation,
  • the choice of auxiliary variables for a particular imputation model,
  • recommendations for the use of different methods,
  • assessment of the model fit and accuracy,
  • optimization of the cost, timeliness and formal quality components of the process,
  • the effect of adding or not adding disturbance terms to some models of imputation,
  • relationships between particular methods.

These topics will be discussed in more detail later in the module. We will investigate various aspects of imputation: the necessity of imputation, a preliminary review of the target variable, the choice of auxiliary variables, the selection of the class of imputation methods and of a particular method within that class, the predicted quality of imputed values in terms of their deviation from the 'true' values and – in terms of mean square error, coefficients of bias and components of total variance – the precision of estimation using available and imputed data, the identification of possible additional bias, and the output format and release of final results. In successive sections all these topics are described in detail. Together they provide a design scheme for the imputation process.

2. General description

2.1. Whether to use imputation

In the main theme module of this chapter, two main premises were presented as arguments for conducting imputation. The first one points out the convenience of completely filled data files: incomplete files are undesirable from the point of view of further data processing, analysis and presentation. Missing data can cause discrepancies in the distributions presented in relevant contingency tables or make it impossible to compute direct estimates. According to the second premise, imputation can be used to increase the quality of estimates of relevant population parameters, of the modeling of variable distributions, and of the microdata themselves. Access to high-quality population data is necessitated by one of the fundamental objectives of statistics: to provide reliable and consistent information for users.

It is worth noting that units for which data on the target variable are unavailable can differ significantly from the remaining ones. As a result, the means in both groups will also be clearly different. Thus, if this is the case, within the subset of units for which data on the target variable are available there may be no units 'similar' to the imputed ones. Recall that imputation should be performed mainly to improve the quality of estimates of basic descriptive statistics – mean, quantiles, variance, skewness, kurtosis, etc. – at a higher level of aggregation or for the whole population, or for convenience. In such situations, using only the available information can lead to seriously biased (practically false) results.

At this point we would like to repeat the opinion expressed in the main theme module that even if missing values do occur, a decision not to impute can be taken. Weighting is then used instead: rather than imputing, a researcher can choose to perform estimation or analysis on the available data. In this situation the researcher should be sure that this decision will have no significant consequences for the quality of the intended statistical output.

First of all, we should verify whether a given piece of information really has to be provided. In this context we should be aware of the important distinction between item non-response and unit non-response. Imputation is used more frequently in the former case than in the latter (for a non-responding unit we have practically no basis for estimating the data with satisfactory quality). It is essential to be sure that a given value is really missing, i.e. that a blank does not mean zero, for instance. The other parts of the survey process should therefore be designed so that data can be processed and stored while keeping the distinction between missing values and zeros.
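
As a minimal illustration of this point – assuming pandas and entirely hypothetical item names – missing values and genuine zeros can be kept distinct in storage as follows:

    import pandas as pd

    # Hypothetical item data: 0 is a reported zero, pd.NA is genuine item non-response.
    # The nullable Int64 dtype keeps the two cases distinct during processing.
    items = pd.DataFrame({
        "unit_id": [101, 102, 103, 104],
        "export_value": pd.array([0, pd.NA, 2500, pd.NA], dtype="Int64"),
    })

    print(items["export_value"].isna().sum())  # 2 -> candidates for imputation
    print((items["export_value"] == 0).sum())  # 1 -> reported zero, not to be imputed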

If the missing values concern only items which do not have to be filled (such as international exchange in the case of a company operating only within one country – see the main theme module of this chapter), imputation makes no sense (in this case blanks mean zeros). But it is seldom the case that a survey consists only of such questions. If the variables are necessary but have missing data, then to take a decision about imputation we must focus on their properties. More precisely, the use of several criteria is recommended. The first of them is, of course, the amount of missing data. If it is very small in relation to the total number of units (e.g. the sample size), one can expect that imputation will not significantly improve the population estimates. In other words, the distribution of the target variable is in this case so regular that imputing the missing data will not significantly improve the population statistics, and computing such statistics from the available data will be quite sufficient. Therefore, to avoid additional costs, it is better not to use imputation. At face value this statement may seem to contradict opinions presented in the main theme module, but the contradiction is only apparent. In most cases it is much more convenient to have a full dataset with imputed data than only a subset with full information. This is especially true if the distribution of the missing values is unknown and cannot be assessed. Otherwise, i.e. if we can suppose with high probability that the distribution of the unknown values will be strictly concentrated around 'typical' values (and hence that their impact on population statistics, like the mean, will be very small), we can drop the imputation. Such a situation can occur, for example, if we know the revenues of businesses and would like to know what amount of CIT (corporate income tax) they have paid: if the revenues of companies with missing CIT data are very close to one another (and concentrated around the overall mean), then – due to the fixed rates of this tax – we can suppose that the distribution of the amount of CIT paid will also be concentrated around the relevant mean. Hence, its impact on the mean for the entire population is minimal and can be neglected.
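
The CIT example can be turned into a rough screening rule. The sketch below – with hypothetical variable names and an arbitrary tolerance, not a value recommended by this module – checks whether the revenues of units with missing CIT data are concentrated around the overall mean:

    import numpy as np

    def concentrated_around_mean(revenue_all, revenue_missing_cit, tol=0.10):
        """Heuristic: True if all units with missing CIT have revenues within
        +/- tol of the overall mean revenue, in which case skipping
        imputation may be defensible."""
        overall_mean = np.mean(revenue_all)
        deviations = np.abs(np.asarray(revenue_missing_cit) - overall_mean)
        return bool(np.all(deviations <= tol * abs(overall_mean)))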

If we have any doubts whether the unknown distribution will really be insignificant, imputation is necessary. A second criterion, supplementary to the above, is the analysis of archival data from an analogous, previously conducted survey or of relevant data for earlier periods. That is, we can analyze the distribution of the target variable from a historical perspective: given a high probability of outliers, or of data gaps for units located in the 'tails' of these distributions, we can conclude that imputation will be necessary. The opposite recommendation can be formulated if we observe that the distribution of the available data was usually almost identical to that of the global values of the variable. This analysis is strictly connected with the verification of the target variable made at the start of the design of imputation, described in Section 2.2.
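
One way to operationalize such a historical comparison – a sketch only, assuming SciPy is available and that comparable historical data exist – is a two-sample Kolmogorov-Smirnov test between the currently available data and the global values from an earlier period:

    from scipy.stats import ks_2samp

    def distributions_similar(available_now, global_past, alpha=0.05):
        """If the hypothesis that both samples come from the same distribution
        cannot be rejected, the case for imputation is weaker."""
        statistic, p_value = ks_2samp(available_now, global_past)
        return p_value >= alpha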

Summing up, imputation should be conducted if at least one of the following conditions is satisfied (a rough sketch of how conditions 2 and 3 might be screened follows the list):

  1. it is subjectively convenient (i.e. it can make the rest of the survey process less complex) and useful to have a completely filled dataset,
  2. the amount of non-response is relatively large,
  3. the distribution of the available data is expected to be considerably different from that of the global variable,
  4. the unavailable data are expected (based on past experience) to belong to the 'tail' of the distribution of the target variable,
  5. we would like to have consistency within the survey and perhaps also in relation to other surveys,
  6. we would like to increase the quality of microdata.
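
A rough sketch of screening conditions 2 and 3 (the missing-data threshold and the shift criterion are arbitrary assumptions made for illustration):

    import numpy as np

    def imputation_advisable(y_available, n_total, y_global_past,
                             max_missing_share=0.05):
        """Condition 2: the share of missing items exceeds a chosen threshold.
        Condition 3: the available data are clearly shifted relative to the
        historical global distribution of the variable."""
        n_missing = n_total - len(y_available)
        large_nonresponse = n_missing / n_total > max_missing_share
        shifted = (abs(np.mean(y_available) - np.mean(y_global_past))
                   > np.std(y_global_past))
        return large_nonresponse or shifted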

Under the circumstances described in conditions two to four, imputation is necessary to guarantee the proper quality of estimates of population parameters. Another problem lies in the choice between re-weighting, applied in the sampling design, and imputation of unit non-response. As we noted at the beginning of this module, in the latter case the imputed values (and, in consequence, the final population estimates) can be seriously biased – especially if there are many non-responding units or if a non-responding unit is supposed to be dominant in the population. Re-weighting can reduce the impact of unit non-response, but it can simultaneously lead to a deformation of the distribution of the analyzed variable in the entire population. Hence, the decision in this matter should be made with great caution. It is advisable that re-weighting be based on the opinions and assistance of highly specialized experts.
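
For comparison, the simplest form of re-weighting for unit non-response – inflating respondents' design weights by the inverse (weighted) response rate within adjustment classes – can be sketched as follows (the class definitions are an input we assume is given):

    from collections import defaultdict

    def reweight(weights, responded, classes):
        """Within each adjustment class, multiply respondents' design weights
        by total_weight / respondent_weight; non-respondents get weight 0.
        Assumes every class contains at least one respondent."""
        total = defaultdict(float)
        resp = defaultdict(float)
        for w, r, c in zip(weights, responded, classes):
            total[c] += w
            if r:
                resp[c] += w
        return [w * total[c] / resp[c] if r else 0.0
                for w, r, c in zip(weights, responded, classes)]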

Experts can also be engaged to perform the entire imputation process. Their opinion can be useful when the database containing the variables to be imputed (and possibly also the variables which can be used as auxiliary ones) is very complex and the mutual dependencies between variables are hard for a non-specialist to recognize. If the choice of imputation method is ambiguous, the assistance of an experienced person is also useful.

One can also perform manual imputation. This approach consists in correcting the data after a necessary re-contact with the corresponding respondent, or in estimating a missing or erroneous value on the basis of basic subject-matter knowledge. The latter case is strictly connected with deductive imputation (see the relevant module), because the estimation is made using basic dependencies by the staff directly involved in data collection and processing. That is, if some dependencies between collected data hold by definition (e.g. if the number of employees is 0, then the labour cost is 0; or the sum of the numbers of employees in the different units of an enterprise must equal the total number of employees of that enterprise), then they should be used to impute missing data as early as possible, i.e. at the collection and editing stages of a statistical survey. Such an approach can be sufficient, or will at least significantly reduce the costs of further activities.
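
A minimal sketch of such a rule applied automatically at the editing stage (pandas, hypothetical field names):

    import pandas as pd

    # Hypothetical records; pd.NA marks item non-response.
    df = pd.DataFrame({
        "employees": [0, 12, 7],
        "labour_cost": pd.array([pd.NA, 350.0, pd.NA], dtype="Float64"),
    })

    # Deductive rule: no employees implies no labour cost, so this blank is a true zero.
    rule = (df["employees"] == 0) & df["labour_cost"].isna()
    df.loc[rule, "labour_cost"] = 0.0
    # The record with 7 employees keeps its missing labour cost for proper imputation.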

Of course, another option is to do nothing. As we noted earlier, in some cases this course of action can be motivated by a small number of gaps, or by an even distribution of the existing data together with a presumably even distribution of the missing values. In other situations it would be unfavourable because of the large bias of the estimated summary statistics.

At the end of this subsection it is important to consider the situation of a statistical office (as opposed to an individual researcher), where a survey is run repeatedly and where there are many variables and many units in the sample. In this case there is no time for detailed studies of missing values on every occasion. On the other hand, replicating such a survey can be a good opportunity to recognize the most important and regularly occurring problems, including non-response. On the basis of such experience and experts' opinions, one can choose a single good procedure which will be convenient for future survey rounds, efficient, and sufficiently robust to the occasional outliers which can occur in some replications.

2.2. Preliminary verification of the target variable

Once we have decided to conduct imputation, we have to perform a preliminary analysis of the target variable. This step is aimed at avoiding unnecessary effort on empty data cells which should not be imputed. First of all, we should check whether the population (or sample) contains units which should not be included in the survey. This task is necessary because in some situations only conducting the survey itself can verify the true status of a unit. For example, although an economic entity is registered in the business register, only a direct interview with its representative may reveal that the unit is no longer active: it might have forgotten to submit an update to the register, or the updated information may not have been entered into the database for personal or technical reasons.

The next stage of verification can involve items which logically should not be filled, or items whose values are implied by the data for other variables. For instance, if we analyze a one-person business (and this size is known, e.g., from the business register or from a declaration in the relevant section of the questionnaire), the cells concerning employees (e.g. the number of persons employed on the basis of a labour contract) have to be empty. Otherwise there is obviously a mistake (on the part of the interviewer or the respondent) and we must explain the problem and remove the false data. We will not develop this problem further here, because it belongs to the issues of editing presented in another chapter of this handbook. In the second situation, a concrete value of a given (and available) variable may imply the relevant value of the target variable (e.g. if income = 0, the corresponding income tax paid must also be equal to zero). The use of such dependencies leads to deductive imputation, which is described in the relevant module of this chapter.
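
The two situations can be expressed as simple edit checks; the sketch below uses hypothetical field names and returns flags for human review rather than imputing anything itself:

    def flag_record(record):
        """Return edit flags for the two verification situations described above."""
        flags = []
        # A one-person business should have no employee items filled at all.
        if record.get("firm_size") == 1 and record.get("contract_employees") is not None:
            flags.append("employee item filled for a one-person business")
        # Zero income implies zero income tax (a deductive relationship).
        if record.get("income") == 0 and record.get("income_tax") not in (None, 0):
            flags.append("non-zero income tax reported with zero income")
        return flags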

2.3. The choice of auxiliary variables

Many specialists in the field of imputation (e.g. de Waal et al., 2011) believe that the design of the imputation process depends on the kind of imputation procedure to be followed. As we have shown in the method modules within this chapter, there exist many methods of imputation relying on various computational algorithms and scopes of auxiliary data. To support the selection of the most effective algorithm for a given situation, we first have to analyze the available data for the target variable and recognize which useful auxiliary data we can collect.

The analysis of the target variable should concern the diversification of the available data and of complete records from previous periods or surveys. If the absolute value of the coefficient of variation is relatively small in all cases, then we can consider using quick and simple (but, in general, weakly effective) methods with significant random factors; in this case, even looking for additional variables might prove unnecessary. Otherwise, i.e. if the distribution of the currently available data is diversified or significantly skewed, or differs considerably from the distributions obtained in previous periods or surveys, then more sophisticated methods supported by auxiliary data are required.
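
A crude version of this screening step might look as follows (the CV threshold is an arbitrary assumption for illustration, not a value recommended by the module):

    import numpy as np

    def suggest_method_class(y_current, y_previous, cv_threshold=0.2):
        """Compare coefficients of variation (CV) of current and historical
        data to choose between simple and auxiliary-variable-based methods."""
        cv_now = np.std(y_current) / abs(np.mean(y_current))
        cv_past = np.std(y_previous) / abs(np.mean(y_previous))
        if cv_now < cv_threshold and cv_past < cv_threshold:
            return "a quick, simple method may suffice"
        return "use auxiliary variables and a more sophisticated method"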