Ronald H. Heck / 2
EDEP 768 (S2011): Seminar in Structural Equation Modeling / March 28, 2011
University of Hawai‘i at Mānoa / Class Notes on Dichotomous and Ordinal Variables


Class Notes on Dichotomous and Ordinal Variables

There might be times when it is beneficial to conduct CFA with the indicators treated as categorical, meaning that they are either dichotomous (e.g., no = 0, yes = 1) or ordinal (e.g., 5-point scales). It is usually acceptable to treat ordinal data as continuous if there are at least 5 points on the scale, and acceptable solutions can be obtained if the ordinal variables do not depart too much from normality (Rigdon, 1998). However, we now have expanded options for dealing with these types of variables.

The major issue to overcome with ordinal data is the calculation of a proper set of covariance relationships for variables measured on ordinal scales. Strictly speaking, calculating the Pearson product-moment correlation (r) would be inefficient, since the individual items are not measured on interval scales. When indicators are not continuous, a set of probabilities must instead be modeled. Mplus uses a latent linear model with a set of thresholds to incorporate categorical variables into the general modeling framework. The commonly used estimators are ML and WLS (weighted least squares). WLS estimation is appropriate for calculating the necessary covariances properly when categorical variables are present. The other approach is ML estimation, which for categorical indicators uses the logit link function by default. When categorical data are estimated with ML in Mplus, a numerical integration algorithm must be used to provide estimates, and this can take considerable time as the number of factors, items, and cases increases. Under this approach, residual variances are not directly estimated for the items (i.e., there are no error terms).

The latent variable framework is based on the premise that the observed categorical values arise from an unobserved (latent) continuous variable and a set of thresholds corresponding to cutpoints between the observed categories. Two ordinal variables are therefore assumed to represent a pair of latent variables that have a bivariate normal distribution. This facilitates the estimation of the correlation between them.
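To make the idea of thresholds concrete, here is a minimal Python sketch that converts the category counts of one item into cutpoints on a standard normal latent variable. It assumes a probit (normal) link and no covariates, which is an assumption made for illustration and not something these notes state about the Mplus run. The function name `probit_thresholds` is hypothetical, and the counts are the ITEM1 counts reported in the Mplus output later in these notes.

```python
from statistics import NormalDist


def probit_thresholds(counts):
    """Return the C-1 thresholds implied by the counts of C ordered categories.

    Each threshold is the standard normal quantile of the cumulative
    proportion of cases at or below that category (assumed probit link).
    """
    total = sum(counts)
    thresholds = []
    cumulative = 0
    for count in counts[:-1]:  # the highest category needs no threshold
        cumulative += count
        thresholds.append(NormalDist().inv_cdf(cumulative / total))
    return thresholds


# ITEM1 counts for categories 1 to 5, copied from the Mplus output in these notes
item1_counts = [11, 32, 66, 134, 141]
item1_thresholds = probit_thresholds(item1_counts)
print([round(t, 2) for t in item1_thresholds])
```

Five categories yield four thresholds, and the thresholds increase because the cumulative proportions increase. The thresholds Mplus estimates within the full model may differ somewhat from these simple sample-based values.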

For ordinal variables in Mplus, the model for an ordinal variable with C ordered categories is based on cumulative probabilities, that is, the probability of being in category c or in any of the c-1 categories below it. There is one fewer threshold than the number of categories (C-1) of the dependent variable. The characteristic feature of the proportional odds model is that the cumulative expressions share the same slope. This means that the corresponding conditional probability curves, expressed as a function of x, are parallel and differ only because of the thresholds. So where the binary outcome has one threshold and one set of coefficients, the ordinal cumulative odds model will have C-1 thresholds but still only one set of slope estimates to predict membership across the C-1 cumulative splits.
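The shared-slope idea can be sketched in a few lines of Python. This is an illustration under one common parameterization of the cumulative logit model, logit P(y <= c | x) = threshold_c - slope * x; the threshold and slope values below are hypothetical and do not come from the assignment data.

```python
import math


def category_probabilities(thresholds, slope, x):
    """Category probabilities under a cumulative (proportional odds) logit model.

    P(y <= c | x) = 1 / (1 + exp(-(tau_c - slope * x))) for each threshold tau_c.
    The same slope is used for every threshold, so only the thresholds differ.
    """
    cumulative = [1.0 / (1.0 + math.exp(-(tau - slope * x))) for tau in thresholds]
    cumulative.append(1.0)  # P(y <= C) is always 1
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical values chosen only for illustration
example_thresholds = [-2.0, -1.0, 0.5, 2.0]  # C - 1 = 4 thresholds, 5 categories
example_slope = 0.8
probs_low = category_probabilities(example_thresholds, example_slope, x=-1.0)
probs_high = category_probabilities(example_thresholds, example_slope, x=1.0)
print([round(p, 3) for p in probs_low])
print([round(p, 3) for p in probs_high])
```

With a positive slope, moving from x = -1 to x = 1 shifts probability toward the higher categories, and the shift in cumulative log odds is the same at every threshold, which is what "parallel" means here.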

We can run the preliminary model in the first assignment as being defined by ordinal indicators. We will run the model with WLS estimation first. This is done as in the following file.


TITLE: Factor Model with Categorical Indicators;

DATA: FILE IS C:\Mplus\assign1FA.dat;

Format is free;

VARIABLE: Names are item1 item2 item3 item4 item5 item6 g;

Usevariables are item1-item6;

categorical are item1-item6;

ANALYSIS:

Estimator is WLS;

Model:

F1 by item1* item2 item3;

F2 by item4* item5 item6;

F1@1

F2@1;

OUTPUT: STANDARDIZED;

Output

First, we obtain the proportions in each of the categories of the ordinal items (five categories for most items; ITEM2 has only four observed categories).

UNIVARIATE PROPORTIONS AND COUNTS FOR CATEGORICAL VARIABLES

ITEM1

Category 1 0.029 11.000

Category 2 0.083 32.000

Category 3 0.172 66.000

Category 4 0.349 134.000

Category 5 0.367 141.000

ITEM2

Category 1 0.026 10.000

Category 2 0.148 57.000

Category 3 0.398 153.000

Category 4 0.427 164.000

ITEM3

Category 1 0.047 18.000

Category 2 0.135 52.000

Category 3 0.255 98.000

Category 4 0.354 136.000

Category 5 0.208 80.000

ITEM4

Category 1 0.039 15.000

Category 2 0.115 44.000

Category 3 0.211 81.000

Category 4 0.331 127.000

Category 5 0.305 117.000

ITEM5

Category 1 0.016 6.000

Category 2 0.081 31.000

Category 3 0.297 114.000

Category 4 0.393 151.000

Category 5 0.214 82.000

ITEM6

Category 1 0.013 5.000

Category 2 0.102 39.000

Category 3 0.188 72.000

Category 4 0.344 132.000

Category 5 0.354 136.000

MODEL FIT INFORMATION

Number of Free Parameters 30

Chi-Square Test of Model Fit

Value 9.199

Degrees of Freedom 8

P-Value 0.3258

RMSEA (Root Mean Square Error Of Approximation)

Estimate 0.020

90 Percent C.I. 0.000 0.065

Probability RMSEA <= .05 0.831

CFI/TLI

CFI 0.999

TLI 0.998

We can see that this model appears to fit the data very well. The fit indices are also quite consistent with those from ML estimation when the indicators were treated as continuous (I think chi-square was 9.47 for 8 df in assignment 1). Below are the standardized loadings of the items defining the factors. In this output the factors are labeled DECMAK (items 1 to 3) and EVAL (items 4 to 6), which correspond to F1 and F2 in the input above.

STDYX Standardization

Two-Tailed

Estimate S.E. Est./S.E. P-Value

DECMAK BY

ITEM1 0.755 0.034 22.107 0.000

ITEM2 0.683 0.041 16.781 0.000

ITEM3 0.748 0.034 22.324 0.000

EVAL BY

ITEM4 0.835 0.024 34.482 0.000

ITEM5 0.820 0.025 32.454 0.000

ITEM6 0.853 0.024 35.475 0.000

EVAL WITH

DECMAK 0.868 0.030 28.925 0.000

Thresholds

{I did not list these}

Variances

DECMAK 1.000 0.000 999.000 999.000

EVAL 1.000 0.000 999.000 999.000

R-SQUARE

Observed Two-Tailed Residual

Variable Estimate S.E. Est./S.E. P-Value Variance

ITEM1 0.570 0.052 11.053 0.000 0.430

ITEM2 0.467 0.056 8.391 0.000 0.533

ITEM3 0.560 0.050 11.162 0.000 0.440

ITEM4 0.697 0.040 17.241 0.000 0.303

ITEM5 0.672 0.041 16.227 0.000 0.328

ITEM6 0.727 0.041 17.737 0.000 0.273
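As a quick check on how the columns of the WLS output relate to one another, the following Python sketch recomputes each R-square as the squared standardized loading and each residual variance as one minus the R-square. This relationship holds here because each item loads on a single factor; the numbers are copied from the output above, and small discrepancies are expected because the printed loadings are rounded to three decimals.

```python
# Standardized loadings, R-square values, and residual variances
# copied from the WLS output above
loadings = {"ITEM1": 0.755, "ITEM2": 0.683, "ITEM3": 0.748,
            "ITEM4": 0.835, "ITEM5": 0.820, "ITEM6": 0.853}
reported_r2 = {"ITEM1": 0.570, "ITEM2": 0.467, "ITEM3": 0.560,
               "ITEM4": 0.697, "ITEM5": 0.672, "ITEM6": 0.727}
reported_residual = {"ITEM1": 0.430, "ITEM2": 0.533, "ITEM3": 0.440,
                     "ITEM4": 0.303, "ITEM5": 0.328, "ITEM6": 0.273}

for item, loading in loadings.items():
    implied_r2 = loading ** 2                     # squared standardized loading
    implied_residual = 1.0 - reported_r2[item]    # unexplained latent-response variance
    print(item, round(implied_r2, 3), reported_r2[item],
          round(implied_residual, 3), reported_residual[item])
```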

We can also run the model with maximum likelihood (ML) estimation while still defining the indicators as categorical. You just need to change the ANALYSIS command.

TITLE: Assignment 1 With Categorical Indicators;

DATA: FILE IS C:\Mplus\Examples EDEP 768\assign1fa.dat;

Format is free;

VARIABLE: Names are item1 item2 item3 item4 item5 item6;

Usevariables are item1-item6;

Categorical = item1-item6;

ANALYSIS: Estimator is ML;

Model:

DecMak by item1*1 item2 item3;

Eval by item4*1 item5 item6;

DecMak@1;

Eval@1;

OUTPUT: SAMPSTAT STANDARDIZED;

You can observe the item loadings on the factors if you run this on your own. We can see that this provides estimates of errors for items. The squared standardized factor loadings (the R-square values below) are a bit different from those obtained with WLS estimation.

R-SQUARE

Observed Two-Tailed

Variable Estimate S.E. Est./S.E. P-Value

ITEM1 0.545 0.057 9.531 0.000

ITEM2 0.419 0.058 7.186 0.000

ITEM3 0.511 0.055 9.297 0.000

ITEM4 0.639 0.048 13.378 0.000

ITEM5 0.666 0.047 14.219 0.000

ITEM6 0.713 0.045 15.679 0.000


MODEL FIT INFORMATION

Number of Free Parameters = 30

Loglikelihood

Chi-Square Test of Model Fit for the Binary and Ordered Categorical

(Ordinal) Outcomes**

Pearson Chi-Square

Value 2724.054

Degrees of Freedom 12450

P-Value 1.0000

Likelihood Ratio Chi-Square

Value 972.703

Degrees of Freedom 12450

P-Value 1.0000

Testing the Model Across Groups

One issue now, however, is that the model with categorical indicators cannot be tested across groups in the same manner as in our previous multiple-group analyses. With categorical indicators, Mplus requires a type of “mixture model” to be estimated. Mixture models are quantitative models that have a categorical latent variable representing a mixture of subpopulations, where subpopulation membership is typically not known but is instead inferred from the data. This type of analysis estimates the number and size of the latent classes in the mixture and assigns individuals in the population to the latent classes. When comparing the model across groups whose membership we already know, we can use the “known classes” (KNOWNCLASS) option in Mplus to conduct the multiple-group analysis with categorical indicators.

The default in Mplus is that the same model is fit across the latent classes specified. This works nicely for a comparison between groups, since the first model tests equal factors, item loadings, item thresholds (instead of intercepts), and factor correlations. Note that there are no errors to test, since the categorical items do not have error terms.

Hypothesis 1: Equal factors, item loadings, item thresholds (for ordinal indicators), and factor correlations. Note: You will not be able to run this with DEMO as there are too many variables for it. Indicators are defined as categorical.

TITLE: Comparing the Model Across Groups (Ordinal Indicators);

DATA: FILE IS E:\assign1FA.dat;

Format is free;

VARIABLE: Names are item1 item2 item3 item4 item5 item6 g;

Usevariables are item1-item6;

categorical are item1-item6;

classes = cg(2);

Knownclass = cg(g = 1 g = 2);

ANALYSIS:

Type = mixture;

Estimator is ML;

Algorithm = integration;

Model: %OVERALL%

F1 by item1* item2 item3;

F1@1;

F2 by item4* item5 item6;

F2@1;

OUTPUT: SAMPSTAT STANDARDIZED;

You might keep this as a reference for some rainy day.

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES

BASED ON THE ESTIMATED MODEL

Latent

Classes

1 179.00000 0.46615

2 205.00000 0.53385

Note: Above are the class counts which confirm that the two organization types are being compared.

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Loglikelihood

H0 Value -2901.860

Information Criteria

Number of Free Parameters 33

Akaike (AIC) 5869.721

Bayesian (BIC) 6000.092

Sample-Size Adjusted BIC 5895.387

(n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical

(Ordinal) Outcomes**

Pearson Chi-Square

Value 3692.445

Degrees of Freedom 24937

P-Value 1.0000

Likelihood Ratio Chi-Square

Value 1184.099

Degrees of Freedom 24937

P-Value 1.0000

This chi-square test suggests there is considerable evidence that the model fits well across the two classes (defined as organizational types), so we can accept measurement invariance. The model is tested for equal factors, item loadings, item thresholds, and factor correlation between groups.
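The information criteria in the output above can be reproduced from the loglikelihood, the number of free parameters, and the sample size. The Python sketch below uses the standard AIC and BIC formulas together with the sample-size adjustment n* = (n + 2) / 24 printed in the Mplus output; the function name `information_criteria` is hypothetical, and n is taken as the sum of the two class counts (179 + 205).

```python
import math


def information_criteria(loglik, n_params, n_cases):
    """Return (AIC, BIC, sample-size adjusted BIC) for a fitted model."""
    minus_2ll = -2.0 * loglik
    aic = minus_2ll + 2 * n_params
    bic = minus_2ll + n_params * math.log(n_cases)
    # Adjusted BIC replaces n with n* = (n + 2) / 24, as shown in the output
    adjusted_bic = minus_2ll + n_params * math.log((n_cases + 2) / 24)
    return aic, bic, adjusted_bic


# Values copied from the Mplus output above
aic, bic, adjusted_bic = information_criteria(
    loglik=-2901.860, n_params=33, n_cases=179 + 205)
print(round(aic, 3), round(bic, 3), round(adjusted_bic, 3))
```

The results agree with the printed values to within rounding of the loglikelihood, which is a useful sanity check when comparing competing models by hand.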

Here is the standardized solution so you can examine the loadings, etc.

STDYX Standardization

Two-Tailed

Estimate S.E. Est./S.E. P-Value

Latent Class 1

F1 BY

ITEM1 0.724 0.040 18.095 0.000

ITEM2 0.623 0.047 13.302 0.000

ITEM3 0.701 0.039 17.849 0.000

F2 BY

ITEM4 0.778 0.032 24.297 0.000

ITEM5 0.792 0.031 25.351 0.000

ITEM6 0.836 0.028 29.557 0.000

F2 WITH

F1 0.854 0.037 22.994 0.000

Means

F1 0.633 0.128 4.929 0.000

F2 0.716 0.120 5.987 0.000

Thresholds

{I did not list these as they are invariant and not usually interpreted}

Variances

F1 1.000 0.000 999.000 999.000

F2 1.000 0.000 999.000 999.000

Latent Class 2

F1 BY

ITEM1 0.724 0.040 18.095 0.000

ITEM2 0.623 0.047 13.302 0.000

ITEM3 0.701 0.039 17.849 0.000

F2 BY

ITEM4 0.778 0.032 24.297 0.000

ITEM5 0.792 0.031 25.351 0.000

ITEM6 0.836 0.028 29.557 0.000

F2 WITH

F1 0.854 0.037 22.994 0.000

Means

F1 0.000 0.000 999.000 999.000

F2 0.000 0.000 999.000 999.000

Variances

F1 1.000 0.000 999.000 999.000

F2 1.000 0.000 999.000 999.000

{Notice that in this formulation the reference group is the second latent class (type 2, or service), so the factor means are significantly higher in Group 1 (product).}

We can back out and test the equality of the factor correlation separately across the groups.

After the %OVERALL% Model Statements, you can add this model statement:

%cg#1%

f1 with f2;

RESULTS:

Model 1 Chi Square Value = 3692.445

Model 2 Chi Square Value = 3714.357

Delta Chi Square Value = 21.912 (1 df)

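Because the chi-square value increases rather than decreases when the constraint is freed, this suggests that relaxing the equality of the correlation between factors does not improve the model.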

Running the Path Model with Continuous Mediating and Dichotomous Outcome

I checked into the issue of standardizing logit coefficients and found that there have been at least 6 ways to obtain standardized logit coefficients. The problem is in finding a standard deviation for the Y variable. If you notice in the Mplus output, an estimate of the covariance with the dependent variable (readprof) is not included in the preliminary statistics. I did manage to discover the meaning of the standardized coefficients in Mplus, however.

How Do You Standardize the Predictors?

Typically, for continuous variables we standardize a coefficient by multiplying the unstandardized beta of the predictor by the ratio of the predictor's standard deviation to the standard deviation of the dependent variable:

β(σx/σy).

We start with the fact that the variance of the standard logistic distribution is π²/3, or approximately 3.29. To estimate the standard deviation of y, save the predicted logit values for each case from the logistic regression. The predicted values of η have a normal distribution, so we can obtain the variance from the descriptive statistics. We then add this variance to the variance of the error term (remember there are two values in a binomial distribution for each level of x), which is defined as 3.29 in logistic regression (it is always the same number in a logistic distribution). We then take the square root of the sum to obtain a measure of the standard deviation of the dependent variable (Pampel, 2000). This estimate of the standard deviation for y depends on the other variables in the model, so it will change when new independent variables are added.

I obtained the variance of the predicted logged odds of readprof in SPSS. You can save the predicted log odds for each person.

From descriptive statistics, I also obtained the variance and standard deviations of the other variables affecting readprof.

To calculate the standard deviation of readprof then:

σy = √(variance of predicted logits + 3.29) = 3.17

Lang: standardized coefficient = b(sdx/sdy) = .079(31.64/3.17) = .789

Age: standardized coefficient = -0.13(4.04/3.17) = -.166

These estimates are very close to the Mplus standardized coefficients, so they provide a good approximation for how the program is standardizing dichotomous variables.
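The hand calculation above can be written as a short Python sketch. The function names are hypothetical; the coefficients, standard deviations, and the value 3.17 are taken from these notes. The variance of the predicted logits is not reported in the notes, so the sketch starts from the reported standard deviation of y and does not recompute it.

```python
import math

# Error variance of the standard logistic distribution, pi^2 / 3 (about 3.29)
LOGISTIC_ERROR_VARIANCE = math.pi ** 2 / 3


def sd_of_latent_y(variance_of_predicted_logits):
    """Standard deviation of y: sqrt(variance of predicted logits + pi^2 / 3)."""
    return math.sqrt(variance_of_predicted_logits + LOGISTIC_ERROR_VARIANCE)


def standardized_coefficient(b, sd_x, sd_y):
    """Standardize a logit coefficient as b * (sd_x / sd_y)."""
    return b * sd_x / sd_y


sd_y = 3.17  # standard deviation of readprof as reported in the notes
lang_std = standardized_coefficient(0.079, 31.64, sd_y)
age_std = standardized_coefficient(-0.130, 4.04, sd_y)
print(round(lang_std, 3), round(age_std, 3))
```

If you save the predicted log odds from your own run, passing their variance to `sd_of_latent_y` gives the denominator used here, and it will change whenever predictors are added or removed.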

MODEL RESULTS LOGIT RESULTS

Two-Tailed

Estimate S.E. Est./S.E. P-Value

READPROF ON

LANG 0.079 0.014 5.739 0.000

LOWSES -0.340 0.543 -0.625 0.532

AGE -0.130 0.070 -1.861 0.063