Latent class binary regression models – identification and estimation

Anders Holm[I]

and

Morten Pedersen[II]

Abstract In this paper we analyze the identification of the latent class binary regression model, in which the latent classes represent unobserved heterogeneity. We show that ignoring unobserved heterogeneity can lead to severely biased results. We furthermore illustrate that the model is well identified with panel data, but that identification is fragile with cross-sectional data. Based on insight into the model and on simulations, we propose a simplified model that works almost as well for cross-sectional data. Finally, we illustrate the applicability and performance of the latent class logit regression model, as opposed to an ordinary logit regression model, in two different applications.

Keywords: Latent class, logit model, regression model, panel data, unobserved heterogeneity.

1.  Introduction

In many social science applications of regression analysis one does not observe all the relevant independent variables. In linear models, the problems caused by unobserved independent variables depend on whether these variables are correlated with the observed independent variables, see e.g. Ejrnæs and Holm (2006). In contrast, in non-linear regression models it is well known that unobserved independent variables potentially lead to bias in the estimated effects of the observed independent variables, even when the unobserved and observed variables are uncorrelated, see Cameron and Heckman (1998), Bretagnolle and Huber-Carol (1988) or Abramson et al. (2000). Consequently, ignoring omitted independent variables, even when they are uncorrelated with the observed independent variables, may lead to incorrect conclusions about the effects of the observed independent variables.

In order to illustrate how to take omitted independent variables into account in non-linear regression models, we propose a simple version of the latent class logit regression model. Furthermore, we illustrate how, with this model, the observed data can reveal information on the effects of both the observed and the unobserved independent variables on the dependent variable. We also illustrate why this model is sometimes hard to estimate, especially with cross-sectional data, due to weak identification, and we propose a simple strategy to improve identification in this case.

In our approach the latent classes need not represent any particular type of omitted variables, but can be seen as a non-parametric approximation to any unknown distribution of omitted variables. Hence, whether the omitted variables are continuous, discrete, or both, we can think of the latent classes as a non-parametric approximation to the unknown distribution of these variables.

The justification of our approach comes from Lindsay (1983a, 1983b), who showed that any mixing distribution representing unobserved heterogeneity can be approximated sufficiently well by a latent class distribution with a fixed number of classes. However, the number of classes is proportional to the number of observations. Although this argument makes sense intuitively, it leads to a breakdown of some of the regularity conditions of maximum likelihood theory (a finite parameter space with parameters in the interior of the parameter space), and hence it is impossible to use classical maximum likelihood inference for these types of models. However, by assuming that the number of latent classes is known in advance, one can use standard maximum likelihood inference for the parameters of the model. Furthermore, the exact number of latent classes can be determined by alternative goodness-of-fit measures, e.g. the Bayesian information criterion (BIC), see e.g. Dayton (1999). Although this strategy tends to understate the estimated standard errors of the parameter estimates, it has been shown to be a feasible solution in the applied literature, see Greene (2003). In practical applications, see Heckman and Singer (1984), Davies (1993) or Holm (2002), one often finds that a small number of latent classes is sufficient to capture the significant features of the distribution of the omitted variables.

The remainder of the paper is organized as follows: Section 2 introduces the model, Section 3 discusses identification, Section 4 presents simulation results, Section 5 contains the applications, and Section 6 concludes.

2.  The model

We analyze a latent class binary logit regression model. The dependent variable is Y and takes the values y = 0 and y = 1. We formulate the latent class logit regression model with J latent classes as:

P(Y = 1 | x) = ∑_{j=1}^{J} π_j · exp(α + βx + c_j) / (1 + exp(α + βx + c_j))    (1)

where α is a constant term, x is a vector of explanatory variables, β is a corresponding row vector of regression coefficients, c_j is the effect of the j'th latent class on the probability of observing Y = 1, and finally π_j is the frequency of the j'th latent class in the population. The parameters of the model to be estimated are α, β, c_j and π_j, j = 1,…,J, where J is the number of latent classes. This model takes into account unobserved heterogeneity arising from omitted independent variables. The unobserved heterogeneity may be thought of either as the representation of a true discrete distribution of unobserved heterogeneity or as an approximation to any unknown distribution of unobserved heterogeneity, discrete or continuous.

The latent class frequencies, π_j, must meet the restrictions 0 ≤ π_j ≤ 1 and ∑_{j=1}^{J} π_j = 1. Hence, the following re-parameterization is useful when estimating the frequencies:

π_j = exp(a_j) / ∑_{k=1}^{J} exp(a_k),

where now a_j, j = 1,…,J, are parameters to be estimated. Furthermore, we divide numerator and denominator by exp(a_J) to get:

π_j = exp(a_j − a_J) / (1 + ∑_{k=1}^{J−1} exp(a_k − a_J)).

It follows that the number of identifiable parameters for the latent class frequencies is J − 1. Furthermore, we also find that re-defining α* = α + d and c_j* = c_j − d, for an arbitrary constant d, leaves α + c_j = α* + c_j*, j = 1,…,J, unchanged; hence a normalization of the effects of the latent classes is warranted. We follow so-called dummy coding and normalize c_1 = 0.
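The re-parameterization and normalization above amount to a softmax mapping from unconstrained parameters to class frequencies; a minimal sketch (Python, purely illustrative):

```python
import numpy as np

def class_frequencies(a):
    """Latent class frequencies pi_1, ..., pi_J from the J - 1 free
    parameters a_1, ..., a_{J-1}, with a_J normalized to zero."""
    a_full = np.append(np.asarray(a, dtype=float), 0.0)
    e = np.exp(a_full - a_full.max())   # subtract max for numerical stability
    return e / e.sum()
```

For J = 2 this reduces to the logistic transform π_1 = exp(a_1)/(1 + exp(a_1)).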

As the purpose of this paper is to discuss the intuition behind identification rather than to give a fully rigorous proof of identification of the latent class model, we work with a simplified two-class model with one independent variable. The model is then written as:

P(Y = 1 | x) = p · exp(α + βx) / (1 + exp(α + βx)) + (1 − p) · exp(α + βx + c) / (1 + exp(α + βx + c))    (2)

where x is now a single continuous variable, β is a regression coefficient, and where p = π_1 (so that 1 − p = π_2) and c = c_2 (recalling the normalization c_1 = 0). From (2) we construct the log-likelihood function for a sample of n independent observations:

ℓ = ∑_{i=1}^{n} [ y_i log P_i + (1 − y_i) log(1 − P_i) ]    (3)

where

Λ_1i = exp(α + βx_i) / (1 + exp(α + βx_i))

and

Λ_2i = exp(α + βx_i + c) / (1 + exp(α + βx_i + c))

and finally where P_i = p Λ_1i + (1 − p) Λ_2i. Note that we now implicitly use the re-parameterization p = exp(a)/(1 + exp(a)), where a is the parameter actually estimated.
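The log-likelihood (3), including the transformed weight p = exp(a)/(1 + exp(a)), can be sketched as a direct transcription of the formulas above (Python, illustrative only):

```python
import numpy as np

def lam(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loglik(theta, y, x):
    """Log-likelihood (3); theta = (alpha, beta, c, a) with p = lam(a)."""
    alpha, beta, c, a = theta
    p = lam(a)
    P = p * lam(alpha + beta * x) + (1.0 - p) * lam(alpha + beta * x + c)
    return np.sum(y * np.log(P) + (1.0 - y) * np.log(1.0 - P))
```

Note that for c = 0 the two mixture components coincide, so the value of a (and hence p) no longer affects the likelihood; this foreshadows the identification problem discussed in section 3.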

In the following example we illustrate how the latent class logistic regression model and the standard logit regression model can lead to very different estimates of the effect of the observed independent variable. Consider the following two-way table:

- TABLE 1 HERE -

From the table we find the log-odds ratio to be, on average, roughly one. However, the table can be thought of as composed of the following two sub-tables:

- TABLE 2 HERE -

From table 2 it is evident that, in both sub-samples, the log-odds ratio is approximately two. Hence, ignoring the grouping, we estimate the log-odds ratio, β, with about 100% bias. This is confirmed by the following ML estimates from an ordinary logistic regression model and a latent class model with two classes.

- TABLE 3 HERE -

The likelihood values of the ordinary logistic regression model and the latent class model do not indicate dramatically different fits to the data; the ratio of the log-likelihoods is 1.003, even though the estimates of β differ dramatically between the two models. To illustrate this, consider the following figure:

- FIGURE 1 HERE -

The figure shows observed and predicted probabilities of Y = 1. From the figure it is clear that there are only small differences between the predicted probabilities of the logistic regression model and the latent class model (which in this case yields a perfect fit to the data because it is a saturated model). It is likely that the variation in x will only yield minor discrepancies in predicted probabilities between the logistic regression model and the latent class model. And often, from these variations it will be difficult to determine whether these discrepancies are due to non-linear effects of x on the log-odds of Y or the presence of latent classes.[1]
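The mechanism behind tables 1-3 can be reproduced with simulated data. The sketch below (Python; all parameter values are illustrative assumptions, not taken from the tables) generates a two-class population with a true log-odds ratio of two and then fits an ordinary logit by Newton-Raphson while ignoring the latent class; the estimated slope is roughly half the true value, mirroring the roughly 100% bias above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent classes of equal size; true slope beta = 2, class effect c = 4.
# All values are chosen for illustration only.
n = 20000
x = rng.binomial(1, 0.5, n).astype(float)      # binary regressor, as in table 1
cls = rng.binomial(1, 0.5, n)                  # unobserved class membership
eta = -2.0 + 2.0 * x + 4.0 * cls
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Ordinary logit fitted by Newton-Raphson, ignoring the latent class.
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    W = mu * (1.0 - mu)
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))

print(b[1])  # close to 1, i.e. about half of the true slope of 2
```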

3.  Identification

Going back to the log-likelihood function, we find the following log-likelihood equations:

∂ℓ/∂α = ∑_i (y_i − P_i) / [P_i(1 − P_i)] · [p Λ_1i(1 − Λ_1i) + (1 − p) Λ_2i(1 − Λ_2i)] = 0
∂ℓ/∂β = ∑_i (y_i − P_i) / [P_i(1 − P_i)] · [p Λ_1i(1 − Λ_1i) + (1 − p) Λ_2i(1 − Λ_2i)] · x_i = 0
∂ℓ/∂c = ∑_i (y_i − P_i) / [P_i(1 − P_i)] · (1 − p) Λ_2i(1 − Λ_2i) = 0
∂ℓ/∂p = ∑_i (y_i − P_i) / [P_i(1 − P_i)] · (Λ_1i − Λ_2i) = 0

where Λ_1i = exp(α + βx_i)/(1 + exp(α + βx_i)) and Λ_2i = exp(α + βx_i + c)/(1 + exp(α + βx_i + c)).

From the log-likelihood equations we find that ∂ℓ/∂p = 0 when c = 0. This means that whenever c = 0 we have Λ_1i = Λ_2i, i.e. there is no information on the value of p. In this case, the last equation becomes redundant and identification of p is not possible. In practice this means that when c is close to 0, the likelihood function might behave badly and identification might be problematic.
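This degeneracy is easy to verify numerically. The sketch below (Python; parameter values are illustrative) evaluates the log-likelihood (3) at c = 0 and shows that a numerical derivative with respect to p is zero regardless of p, whereas it is non-zero once c differs from zero:

```python
import numpy as np

def lam(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loglik(alpha, beta, c, p, y, x):
    """Log-likelihood (3) of the two-class latent class logit model."""
    P = p * lam(alpha + beta * x) + (1.0 - p) * lam(alpha + beta * x + c)
    return np.sum(y * np.log(P) + (1.0 - y) * np.log(1.0 - P))

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (rng.random(200) < lam(0.5 + x)).astype(float)

# Numerical score with respect to p: zero at c = 0, so the data carry
# no information on p when the latent class effect vanishes.
h = 1e-6
for c in (0.0, 2.0):
    score_p = (loglik(0.5, 1.0, c, 0.5 + h, y, x)
               - loglik(0.5, 1.0, c, 0.5 - h, y, x)) / (2.0 * h)
    print(c, score_p)
```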

In order to study how the observed information (the distribution of Y and X) may or may not lead to identification of the distribution of the latent classes, we find the posterior distribution of latent class membership conditional on Y and X. For Y = 1, the posterior probability of belonging to the first latent class is:

P(j = 1 | Y = 1, x) = p Λ_1 / [p Λ_1 + (1 − p) Λ_2]

where Λ_1 = exp(α + βx)/(1 + exp(α + βx)) and Λ_2 = exp(α + βx + c)/(1 + exp(α + βx + c)). Now differentiate with respect to x and equate to zero to obtain:

∂P(j = 1 | Y = 1, x)/∂x = β p(1 − p) Λ_1 Λ_2 (Λ_2 − Λ_1) / [p Λ_1 + (1 − p) Λ_2]² = 0 ⟺ c = 0 (given β ≠ 0 and 0 < p < 1),

as the denominator is always defined and positive. This means that whenever x varies so does the posterior probability of observing a latent class membership, except when the latent class membership effect is zero.
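The same point can be made numerically. A minimal sketch (Python; the default parameter values are purely illustrative) of the posterior probability of belonging to the first latent class:

```python
import numpy as np

def lam(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def posterior_class1(y, x, alpha=0.0, beta=1.0, c=2.0, p=0.5):
    """Posterior probability of membership of class 1 given (y, x)."""
    l1, l2 = lam(alpha + beta * x), lam(alpha + beta * x + c)
    if y == 1:
        return p * l1 / (p * l1 + (1.0 - p) * l2)
    return p * (1.0 - l1) / (p * (1.0 - l1) + (1.0 - p) * (1.0 - l2))

# Varying x moves the posterior when c != 0 ...
print(posterior_class1(1, -1.0), posterior_class1(1, 1.0))
# ... but leaves it flat at the prior p when c = 0.
print(posterior_class1(1, -1.0, c=0.0), posterior_class1(1, 1.0, c=0.0))
```

When c = 0 the two classes are observationally identical and the posterior simply returns the prior p, whatever the data.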

If we have panel data, i.e. repeated observations of both Y and X, the posterior probability becomes

P(j = 1 | y_1, y_2, x_1, x_2) = p ∏_t Λ_1t^(y_t) (1 − Λ_1t)^(1−y_t) / [ p ∏_t Λ_1t^(y_t) (1 − Λ_1t)^(1−y_t) + (1 − p) ∏_t Λ_2t^(y_t) (1 − Λ_2t)^(1−y_t) ],

where the subscript t = 1, 2 denotes which part of the panel the observation belongs to. Hence, changing values of not only x but also y along the panel might lead to information on the latent classes. This can be seen by comparing the posterior probabilities for the two outcomes of Y at a common value of x:

P(j = 1 | Y = 1, x) − P(j = 1 | Y = 0, x) = p(1 − p)(Λ_1 − Λ_2) / [P(1 − P)],

where P = p Λ_1 + (1 − p) Λ_2. Hence, whenever y varies so does also the posterior probability of observing a latent class membership, except when the latent class membership effect is zero. We also find that equating the two posterior probabilities requires:

p(1 − p)[Λ_1 − Λ_2] = 0.

As the term inside the bracket will always be negative (taking c > 0 without loss of generality), this is not a feasible solution. Hence, identification of c improves when both Y and X vary. Finally, note that y_1 = y_2 does not lead to any conclusion about the value of c. That is, the observations that only change the values of the independent variable do not contribute to the identification of the latent classes.
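With panel data the posterior can be computed from the joint likelihood of the repeated observations. The sketch below (Python; the default parameter values are illustrative) shows that, holding x fixed across waves, changing the observed y sequence shifts the posterior whenever c ≠ 0:

```python
import numpy as np

def lam(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def posterior_panel(ys, xs, alpha=0.0, beta=1.0, c=2.0, p=0.5):
    """Posterior of class 1 from T repeated observations of one unit."""
    ys, xs = np.asarray(ys, float), np.asarray(xs, float)
    l1, l2 = lam(alpha + beta * xs), lam(alpha + beta * xs + c)
    f1 = np.prod(l1 ** ys * (1.0 - l1) ** (1.0 - ys))   # class 1 likelihood
    f2 = np.prod(l2 ** ys * (1.0 - l2) ** (1.0 - ys))   # class 2 likelihood
    return p * f1 / (p * f1 + (1.0 - p) * f2)

# Same x in both waves; changing y shifts the posterior whenever c != 0.
print(posterior_panel([0, 0], [0.5, 0.5]))
print(posterior_panel([0, 1], [0.5, 0.5]))
print(posterior_panel([1, 1], [0.5, 0.5]))
```

With c > 0, class 2 has the higher propensity for y = 1, so more ones in the observed sequence push the posterior probability of class 1 down.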

We may summarize these findings in the following proposition:

Proposition 1: Two posterior probabilities of latent class membership, evaluated at different values of the observed information (x in the case of cross-sectional data; y and/or x in the case of panel data), are equal only if the distribution of the latent classes is degenerate, i.e. c = 0, p = 0 or p = 1.

Proof: See the appendix. The proposition states that if two different posterior probabilities are equal for different x (the case of cross-sectional data) or y and/or x (the case of panel data), this must be because the distribution of the latent classes is degenerate, at least for the observed information used in the comparison. Hence, this observed information is non-informative with respect to the distribution of the latent classes. Vice versa, if the posterior probabilities differ for different observed (non-redundant) information, this information is informative on the distribution of the latent classes.

4.  Some Simulations

In order to study the identification of the latent class model with cross-sectional and panel data, we run a number of simulations. We run 100 simulations, each on datasets with 500 observations, including repeated observations in panels. The simulations vary the degree of identification in terms of the number of panels and the variation in x. The results are shown in table 4 below.

- TABLE 4 HERE -

From table 4 it is evident that the latent class model (LCM) with continuous x (infinitely many outcomes) and five panels yields estimates which are close to the true values and have small root mean square error (RMSE). However, it is also clear that in the case of only one panel (i.e. cross-sectional data) and two outcomes of x, the LCM performs poorly, although it still estimates the slope coefficient of x, β, with much less bias than the logit model. As the slope of x is our parameter of interest, we may try to improve the fit of the model by reducing the number of nuisance parameters. Therefore we fix the parameter for the weight of the latent classes (the transformed probability of the latent classes, a) to arrive at the latent class model with fixed weights (LCMFW).[2] In order to assess the impact of this in real applications, we have fixed a at a value different from the true value.[3] From table 4 we find that in the weakly identified case this approach actually leads to better estimates, whereas it leads to worse results in the better identified cases.
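The LCMFW amounts to maximizing the two-class likelihood over (α, β, c) only, with a held fixed. A sketch (Python, using scipy.optimize; the data-generating values and the fixed weight a = 0 are illustrative assumptions, not the simulation design of table 4):

```python
import numpy as np
from scipy.optimize import minimize

def lam(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def negloglik(theta, y, x, a_fixed):
    """Negative log-likelihood of the two-class model with the
    transformed weight a held fixed (the LCMFW)."""
    alpha, beta, c = theta
    p = lam(a_fixed)
    P = p * lam(alpha + beta * x) + (1.0 - p) * lam(alpha + beta * x + c)
    P = np.clip(P, 1e-12, 1.0 - 1e-12)   # guard the logs numerically
    return -np.sum(y * np.log(P) + (1.0 - y) * np.log(1.0 - P))

# Simulated cross-sectional data (illustrative values): beta = 1.5, c = 2, p = 0.5.
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
cls = rng.binomial(1, 0.5, n)
y = (rng.random(n) < lam(-1.0 + 1.5 * x + 2.0 * cls)).astype(float)

# Fix a = 0 (p = 0.5) and estimate (alpha, beta, c); several starting
# values guard against local optima, as is usual for mixture likelihoods.
starts = [np.array([0.0, 1.0, 1.0]), np.array([0.0, 1.0, 3.0])]
fits = [minimize(negloglik, s, args=(y, x, 0.0), method="BFGS") for s in starts]
best = min(fits, key=lambda f: f.fun)
print(best.x)  # estimates of (alpha, beta, c)
```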

But why fix the weight in particular, and not any of the other parameters? First of all, we found from the likelihood equations that the equation for the weight is redundant when the effect of the latent class approaches zero. Hence, for some values of the other parameters, there is no information on how to choose a particular value of a. Further, from principal component analysis (PCA) of the estimates in the simulations in table 4 we find the following eigenvalues and eigenvectors of the estimated parameters in the simulations:

- TABLE 5 HERE -

By comparing the two top panels of table 5, representing PCA of the simulations on panel data, with the three lower panels, representing PCA of the simulations on cross-sectional data, we find that the sum of the eigenvalues is much lower in the panel data simulations than in the cross-sectional simulations. This reflects the increased accuracy of panel data estimation compared to cross-sectional estimation.

The first and largest eigenvalue corresponds to an eigenvector with large loadings on the constant term, α, and especially the effect of the latent class, c. Hence, a large part of the RMSE on these two parameters is due to the fact that they are correlated. The second largest eigenvalue, which is still of considerable relative size in the cross-sectional simulations, pertains to an eigenvector with a large loading on the weight of the latent class, a. Therefore we conclude that a large part of the RMSE on this parameter is not correlated with any of the other parameters; in other words, on cross-sectional data a can take a wide range of values that leave the other parameters relatively unaffected. Hence, when identification is fragile, it seems relevant to fix a in estimation.