Can Hidden Markov Models be applied to estimate error-free employment transition rates? A test of model sensitivity to linkage error
Keywords: labour market mobility; linkage error; measurement error; HMM (hidden Markov model); sensitivity analysis.
1. Introduction
National Statistical Offices (NSOs) can sometimes derive estimates of one variable from more than one source. Although this has considerable merits, it also introduces new challenges: the estimates of the same variable from different sources are often inconsistent. In this case, NSOs apply several solutions. The first solution is to choose the source that has the best quality. This quality is seldom quantified and is mostly evaluated at face value [1]. The second solution is to use meso- and macro-integration techniques to reconcile the sources [2, 3]. However, NSOs hardly ever try to reconcile the sources by correcting for measurement error in both sources. This error, even when small, may lead to severely overestimated transitions [4, 5].
An increasingly popular method to correct for classification error is Latent Class Analysis (LCA). More specifically, a group of latent class models - Hidden Markov Models (HMMs) - is often applied to longitudinal (or time series) categorical data for this purpose. In the context of correcting for measurement error, HMMs are applied to describe turnover or transitions in some characteristic, assuming that it is driven by a process without memory and that it is measured with error. HMMs probabilistically estimate the latent states at the individual level, which at the aggregate level makes it possible to determine the distributions of those states. Comparing the distributions of the latent states to those of the observed states for given data sources makes it possible to determine the overall magnitude of the measurement error in those data sources [6, 7].
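The way in which classification error inflates observed transitions can be illustrated with a small worked example. The sketch below is not the model used in this paper: it assumes a two-state latent Markov chain (temporary, permanent), a single indicator, and classification errors that are independent over time. All numerical values and function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: how independent classification error inflates
# observed transition rates. All numbers are hypothetical assumptions,
# not estimates from the LFS or the ER.

# State 0 = temporary contract, state 1 = permanent contract.
initial = [0.20, 0.80]            # assumed latent distribution at wave 1
transition = [[0.95, 0.05],       # temporary -> temporary / permanent
              [0.01, 0.99]]       # permanent -> temporary / permanent

# Assumed misclassification matrix, indexed as error[latent][observed].
error = [[0.95, 0.05],
         [0.02, 0.98]]


def observed_joint(initial, transition, error):
    """Joint distribution of the observed states at two consecutive waves."""
    n = len(initial)
    joint = [[0.0] * n for _ in range(n)]
    for x1 in range(n):
        for x2 in range(n):
            latent = initial[x1] * transition[x1][x2]
            for o1 in range(n):
                for o2 in range(n):
                    joint[o1][o2] += latent * error[x1][o1] * error[x2][o2]
    return joint


def observed_transition_rate(joint, origin, destination):
    """P(observed destination at wave 2 | observed origin at wave 1)."""
    return joint[origin][destination] / sum(joint[origin])


joint = observed_joint(initial, transition, error)
true_rate = transition[0][1]
observed_rate = observed_transition_rate(joint, 0, 1)
print(f"latent temporary -> permanent rate:   {true_rate:.3f}")
print(f"observed temporary -> permanent rate: {observed_rate:.3f}")
```

With these assumed values, the observed temporary-to-permanent rate is roughly three times the latent rate of 0.05, even though each wave is classified correctly at least 95% of the time.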
The availability of longitudinal data with two indicators for the same variable provides a considerable advantage in the application of HMMs. Specifically, the use of two indicators, as opposed to one, makes it possible to relax some of the key assumptions of Hidden Markov Models, which in turn results in more accurate estimates [8]. This paper tests the sensitivity of such an HMM to the violation of one of its key assumptions, namely the error-free linkage of the two data sources. Testing this assumption is crucial to ensure the robustness of the findings and, moreover, the applicability of the model to other settings. The test case of this paper is the estimation of the size of temporary employment in the Netherlands as well as the number of transitions from temporary to permanent employment and vice versa.
A similar approach to ours has been taken by [8]. In that study, in order to correct for measurement error in the contract type and provide more accurate estimates of transition rates from temporary to permanent employment, the authors applied a mixed hidden Markov model to a novel dataset which links the Dutch Labour Force Survey (LFS) with register data from the Employer Register (ER). Besides its methodological contribution, the study makes an important contribution to research on social inequality. Labour market flexibility is one of the cornerstones of the EU political agenda. This has led to a significant increase in temporary employment across Europe, with the Netherlands taking a leading position, and has sparked a heated economic and political debate about the associated social outcomes [8-12]. One of the most important aspects of this debate is the role of temporary employment from a life-course perspective. Specifically, this refers to whether temporary contracts serve as a “stepping stone” towards permanent employment or whether they are a “dead end” and provide no mobility opportunities. In order to analyse this dynamic aspect of temporary employment, highly accurate data on the contract type are needed. However, while there are two data sources that provide information on the contract type in the Netherlands (i.e. survey and register data), it has been shown that both are subject to measurement error, which, if unaccounted for, severely impedes the accurate estimation of mobility.
2. Methods
This analysis tests one of the main assumptions made in the mixed hidden Markov model used by [8], namely that the LFS and ER are perfectly linked. Previous research suggests that there are various reasons that can cause a mismatch when linking administrative and survey data. For example, when using a combination of variables such as birth date, gender and address, as is done when linking the LFS to the Population Register (PR), one runs the risk of a match not being unique. Furthermore, the variables related to the residential address used to link both the LFS and ER to the PR might be time sensitive due to residential mobility, which can lead to mismatches when data are collected at different points in time for the various sources. Such scenarios can lead to two types of linkage error: false-positive errors, where erroneous matches are included in the sample, and false-negative errors, where actual matches are considered non-matches and are therefore excluded from the sample [13, 14].
Overall, it has been shown that, while linkage error can be considered negligible when it is under 1%, the vast majority of linkages with ‘good’ and ‘high’ match scores are subject to linkage error ranging from approximately 2.5% to 10%, which therefore needs to be taken into account in the analysis [15].
In order to test whether and to what extent linkage error affects the accuracy of the results obtained by the mixed HMM, a sensitivity analysis needs to be conducted. Therefore, this analysis runs simulations in which various degrees of mismatch (including both false-positive and false-negative errors) are introduced to the linked LFS and ER dataset [15]. The measurement error in the contract type and the transition rates from temporary to permanent employment are then estimated for each of those scenarios (which are characterised by different extents of the two kinds of linkage error) using the mixed HMM applied by [8]. In this model, two measurements of the outcome variable are used: the contract type according to the LFS and according to the ER. The model assumes that the error in the register data is serially correlated, and that the latent initial state probabilities and the latent transitions depend on covariates and some unobserved characteristics. A non-parametric approach to these unobserved characteristics is adopted by assuming that individuals belong to different latent classes according to their latent probability of having a particular type of contract. The results obtained are compared to those reported when no linkage error is simulated in order to estimate the bias. The model is graphically illustrated in Figure 1.
*k- latent class membership; X(t)- latent contract type; E(t)- observed contract type in LFS; C(t)- observed contract type in ER.
Figure 1- Path diagram for the HMM with two indicators, serially correlated register errors and predictors for latent transitions and latent state probabilities
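The role of the two indicators in the likelihood can be sketched with the forward algorithm. The sketch below is a deliberate simplification and not the mixed HMM of [8]: it assumes that the LFS and ER indicators are conditionally independent given the latent contract type, and it omits the serially correlated register errors, the covariates and the latent classes. The parameter values and function names are hypothetical.

```python
# Simplified sketch of the likelihood of a two-indicator HMM, computed
# with the forward algorithm. Assumptions (not from the paper):
# conditionally independent indicators, no covariates, no latent
# classes, no serially correlated errors, hypothetical parameter values.

from itertools import product

# State 0 = temporary contract, state 1 = permanent contract.
initial = [0.20, 0.80]
transition = [[0.95, 0.05],
              [0.01, 0.99]]
error_lfs = [[0.93, 0.07],   # error_lfs[latent][observed in LFS]
             [0.03, 0.97]]
error_er = [[0.90, 0.10],    # error_er[latent][observed in ER]
            [0.02, 0.98]]


def sequence_likelihood(lfs, er):
    """Likelihood of one person's observed LFS and ER sequences."""
    n = len(initial)
    # Forward probabilities at the first wave: both indicators are
    # observed and each contributes its own classification probability.
    alpha = [initial[x] * error_lfs[x][lfs[0]] * error_er[x][er[0]]
             for x in range(n)]
    # Propagate through the latent transitions for the later waves.
    for e, c in zip(lfs[1:], er[1:]):
        alpha = [
            sum(alpha[prev] * transition[prev][x] for prev in range(n))
            * error_lfs[x][e] * error_er[x][c]
            for x in range(n)
        ]
    return sum(alpha)


# Example: LFS reports temporary, temporary, permanent while the ER
# reports temporary, permanent, permanent.
example = sequence_likelihood((0, 0, 1), (0, 1, 1))

# Sanity check: the likelihoods of all possible observed sequences of
# length three sum to one.
sequences = list(product(range(2), repeat=3))
total = sum(sequence_likelihood(l, r) for l in sequences for r in sequences)
print(f"example likelihood: {example:.6f}, total probability: {total:.6f}")
```

Because each latent state is informed by two measurements per wave, disagreement between the LFS and the ER carries information about which source is in error, which is what allows some assumptions of the single-indicator model to be relaxed.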
In terms of false-negative linkage error specifically, the analysis first simulates a set of scenarios with a high overall exclusion error of approximately 20%, which represents a considerable level of mismatch, and then two sets with medium and low linkage errors of approximately 10% and 5%, respectively. For each set, the scenarios analysed include three specifications in which the probability of exclusion depends on one of the following covariates, while the effect of the remaining covariates is omitted:
· Age, whereby younger individuals have much higher probabilities of being excluded;
· Country of origin, whereby foreign-born individuals are more likely than natives to be excluded;
· The level of education, whereby more highly educated individuals are more likely to be excluded.
The selection of these scenarios is justified by the fact that younger, non-native and higher educated individuals tend to experience more residential mobility, which, as mentioned above, results in higher probabilities of false negatives [16, 17].
Those covariates are not included in further analysis (i.e. when estimating the temporary to permanent employment transition rates using the mixed HMM). Therefore, while the covariates are allowed to determine the level of linkage error, they do not affect the latent transitions and initial state probabilities in the HMM. This allows generating data that is not missing at random with respect to the transition rates.
To illustrate, in the case of age the exclusion probability of young people (aged 25 to 34) amounts to 0.7, 0.32 and 0.15 for the high, medium and low case, respectively, while that of older people (aged 35 to 54) equals 0.01 in all three cases. Those probabilities are set in such a way that there is a substantial difference between the two categories of the age variable, so that the excluded group is dominated by young individuals, who have been shown to have higher rates of residential mobility and hence a higher propensity to be subject to false-negative linkage error. As can be deduced from those exclusion probabilities, the difference between the two age categories, and with it the domination of young individuals in the excluded group, is most prominent in the high case, followed by the medium and low cases. Furthermore, the probabilities are set in such a way that the overall proportion of excluded individuals is approximately 20%, 10% and 5% for the high, medium and low scenarios, respectively.
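The age-dependent exclusion mechanism described above can be sketched as follows. The exclusion probabilities are those given in the text; the share of young persons in the linked sample (0.275) is a hypothetical assumption, chosen here only because it yields overall exclusion rates close to 20%, 10% and 5%. The function names and the synthetic sample are likewise illustrative and not the authors' implementation.

```python
# Sketch of simulating age-dependent false-negative linkage error.
# Exclusion probabilities come from the text; SHARE_YOUNG and the
# synthetic sample are hypothetical assumptions.

import random

EXCLUSION_BY_AGE = {
    "high":   {"young": 0.70, "older": 0.01},
    "medium": {"young": 0.32, "older": 0.01},
    "low":    {"young": 0.15, "older": 0.01},
}

# Hypothetical share of persons aged 25 to 34 in the linked sample.
SHARE_YOUNG = 0.275


def expected_overall_exclusion(p_by_group, share_young):
    """Expected overall proportion of excluded records."""
    return (share_young * p_by_group["young"]
            + (1.0 - share_young) * p_by_group["older"])


def simulate_exclusion(age_groups, p_by_group, rng):
    """Return the indices of records kept after simulated exclusion."""
    return [i for i, group in enumerate(age_groups)
            if rng.random() >= p_by_group[group]]


rng = random.Random(2024)
n_records = 20_000
age_groups = ["young" if rng.random() < SHARE_YOUNG else "older"
              for _ in range(n_records)]

for scenario, probs in EXCLUSION_BY_AGE.items():
    kept = simulate_exclusion(age_groups, probs, rng)
    expected = expected_overall_exclusion(probs, SHARE_YOUNG)
    simulated = 1.0 - len(kept) / n_records
    print(f"{scenario:>6}: expected {expected:.3f}, simulated {simulated:.3f}")
```

The retained records would then be passed to the HMM estimation; the same pattern applies to the scenarios based on country of origin and education, with the group labels and probabilities changed accordingly.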
3. (Preliminary) Results
Table 1 provides a summary of the parameters in each of the scenarios as well as the results, which include the latent 3-month transition rates from temporary to permanent employment and the relative and absolute bias when compared to the results of the original model, which does not include simulated false-negative linkage error.[1] Overall, Table 1 shows that while in absolute terms the transition rates under the various scenarios do not differ greatly from the situation without false-negative linkage error, in relative terms these differences are quite substantial. This is particularly true for the scenario where the error amounts to 20% and depends on age (28%) and where it amounts to 10% and depends on nationality (32%). The impact of education-dependent exclusion error appears much more stable across the different sizes of the error. The scenarios in which an overall exclusion error of 5% is assumed are on average the most accurate ones. Therefore, it might be the case that the mixed HMM considered is robust to a 5% non-random linkage error. It is worth mentioning, however, that the bias in the scenario in which the exclusion probability equals 5% overall and depends on the level of education remains somewhat substantial at approximately 15%.
The fact that the bias is larger for the 10% exclusion error than for the 20% one when the error depends on age or nationality is an unexpected finding. However, as the reported results are based on single simulation runs using random draws, they might be subject to random noise. In order to ensure that the results are not driven by such noise, the simulations need to be repeated many more times (e.g. 100 replications) and the average transition rates for each of the scenarios need to be considered. This will be the next step in the analysis.
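The replication step could be organised along the following lines. This is a minimal sketch under stated assumptions: `run_replication` is a hypothetical placeholder that merely draws noise around an arbitrary value, standing in for one full cycle of simulating linkage error and re-estimating the mixed HMM; the summary function and its name are also illustrative.

```python
# Sketch of averaging transition-rate estimates over replications.
# `run_replication` is a hypothetical stand-in for one simulation run
# followed by HMM estimation; its numbers are arbitrary.

import random
import statistics


def run_replication(rng):
    """Placeholder: return one estimated transition rate (arbitrary noise)."""
    return 0.018 + rng.gauss(0.0, 0.003)


def monte_carlo_summary(run_once, n_replications, seed=0):
    """Mean estimate and its Monte Carlo standard error over replications."""
    rng = random.Random(seed)
    estimates = [run_once(rng) for _ in range(n_replications)]
    mean = statistics.fmean(estimates)
    std_error = statistics.stdev(estimates) / n_replications ** 0.5
    return mean, std_error


mean_rate, mc_error = monte_carlo_summary(run_replication, n_replications=100)
print(f"mean transition rate: {mean_rate:.4f} (Monte Carlo SE {mc_error:.4f})")
```

Reporting the Monte Carlo standard error alongside the mean indicates whether differences between scenarios, such as the 10% versus the 20% exclusion error, exceed what simulation noise alone would produce.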
Table 1- Exclusion-error simulation results
Scenario / Overall false-negative linkage error / High probability of exclusion / Low probability of exclusion / Temporary to permanent employment transition rate / Absolute bias / Relative percent bias
'Original' HMM (no simulated linkage error) / 0 / - / - / 0.016 / 0.000 / 0
False-negative linkage error dependent on age / 0.2 / 0.7 (young) / 0.01 (mid/old age) / 0.020 / -0.004 / -28
False-negative linkage error dependent on age / 0.1 / 0.32 (young) / 0.01 (mid/old age) / 0.021 / -0.005 / -32
False-negative linkage error dependent on age / 0.05 / 0.15 (young) / 0.01 (mid/old age) / 0.016 / 0.000 / -3
False-negative linkage error dependent on nationality / 0.2 / 0.8 (foreign-born) / 0.025 (native) / 0.017 / -0.002 / -10
False-negative linkage error dependent on nationality / 0.1 / 0.35 (foreign-born) / 0.025 (native) / 0.021 / -0.005 / -34
False-negative linkage error dependent on nationality / 0.05 / 0.12 (foreign-born) / 0.025 (native) / 0.016 / 0.000 / -2
False-negative linkage error dependent on education / 0.2 / 0.55 (higher edu) / 0.01 (low/mid edu) / 0.014 / 0.002 / 12
False-negative linkage error dependent on education / 0.1 / 0.22 (higher edu) / 0.01 (low/mid edu) / 0.013 / 0.003 / 16
False-negative linkage error dependent on education / 0.05 / 0.08 (higher edu) / 0.01 (low/mid edu) / 0.013 / 0.002 / 15
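The bias measures in Table 1 can be computed as sketched below. The sign convention (reference estimate minus scenario estimate) is inferred from the signs in the table and is an assumption, as are the function names. Because the inputs used here are the rounded rates from the table, the results will not match the tabulated bias exactly, which was presumably computed from unrounded estimates.

```python
# Sketch of the bias measures reported in Table 1. The sign convention
# is inferred from the table and is an assumption; inputs are the
# rounded transition rates from the table.

def absolute_bias(reference, estimate):
    """Reference (no simulated linkage error) minus scenario estimate."""
    return reference - estimate


def relative_percent_bias(reference, estimate):
    """Absolute bias expressed as a percentage of the reference."""
    return 100.0 * absolute_bias(reference, estimate) / reference


reference_rate = 0.016  # 'Original' HMM, no simulated linkage error

# Age-dependent exclusion, overall error 0.2 (rounded rate 0.020).
age_high = relative_percent_bias(reference_rate, 0.020)
# Education-dependent exclusion, overall error 0.2 (rounded rate 0.014).
edu_high = relative_percent_bias(reference_rate, 0.014)

print(f"age, 20% error:       {age_high:.1f}%")
print(f"education, 20% error: {edu_high:.1f}%")
```

A negative value indicates that the scenario overestimates the transition rate relative to the reference model, and a positive value indicates underestimation.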
4. (Preliminary) Conclusions and next steps
To conclude, the estimation of latent transition rates from temporary to permanent employment in the Netherlands using a mixed HMM appears to be sensitive only to very extreme and unrealistic violations of the assumption that the two data sources used, the LFS and the ER, are perfectly linked. Namely, when simulating a substantial non-random false-negative linkage error (i.e. one that amounts to 10% or 20% overall and strongly depends on covariates that affect the latent contract transitions and initial states), the results obtained appear substantially biased. When the linkage error amounts to 5% and depends to a lesser extent on covariates, on the other hand, the bias decreases considerably. However, the results obtained need to be validated through numerous replications of the simulations, which will be the next step of the analysis. Following that, the analysis will also look at scenarios that combine the effects of the covariates on the exclusion probabilities (which thus far have been considered separately).