EDEP 768 (S2011): Seminar in Structural Equation Modeling / March 28, 2011
University of Hawai‘i at Mānoa / Class Notes on Dichotomous and Ordinal Variables
There are times when it is beneficial to conduct a CFA with the indicators treated as categorical, meaning they are either dichotomous (e.g., no = 0, yes = 1) or ordinal (e.g., 5-point scales). It is usually acceptable to treat ordinal data as continuous when there are at least 5 points on the scale, and acceptable solutions can be obtained as long as the ordinal variables do not depart too much from normality (Rigdon, 1998). However, we now have expanded options for dealing with these types of variables.
The major issue to overcome with ordinal data is obtaining a proper set of covariance relationships among variables measured on ordinal scales. Strictly speaking, calculating the Pearson product-moment correlation (r) is not appropriate, since the individual items are not measured on interval scales. When indicators are not continuous, a set of probabilities must instead be modeled. Mplus uses a latent variable model with a set of thresholds to incorporate categorical variables into the general modeling framework. The most commonly used estimators are ML and WLS (weighted least squares). WLS estimation is appropriate for calculating the necessary covariances when categorical variables are present. ML estimation, by contrast, uses the logit link function as its default for categorical outcomes. When categorical data are estimated with ML in Mplus, a numerical integration algorithm must be used, which can take considerable time as the number of factors, items, and the sample size increase. With this approach, residual variances are not directly estimated for the items (i.e., there are no error terms).
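To make the logit link concrete, one common way to write the model for a single binary indicator is shown below. This is a generic sketch rather than output from our model; u denotes the item, η the factor, λ the loading, and τ the threshold:

P(u = 1 | η) = 1 / (1 + exp(τ − λη))

A higher threshold τ implies a lower probability of endorsing the item at any given level of the factor.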
The latent response variable framework is based on the premise that the observed categorical values arise from an unobserved (latent) continuous variable together with a set of thresholds corresponding to cutpoints between the observed categories. Two ordinal variables are therefore assumed to represent a pair of underlying latent response variables that have a bivariate normal distribution, which facilitates the estimation of the correlation between them (the polychoric correlation).
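In symbols, as a general sketch of the threshold idea rather than anything specific to our items, an observed ordinal response y with C categories is linked to its latent response y* by:

y = c if τ(c−1) < y* ≤ τ(c), for c = 1, ..., C, with τ(0) = −∞ and τ(C) = +∞

so that only the C − 1 interior thresholds τ(1), ..., τ(C−1) need to be estimated.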
For ordinal variables in Mplus, the model for a variable with C ordered categories is based on cumulative probabilities, that is, the probability of being in category c or any lower category. There is one less threshold than the number of categories (C − 1) of the dependent variable. The defining feature of the proportional odds model is that the C − 1 cumulative expressions share the same slope; the corresponding conditional probability curves, expressed as a function of x, are therefore parallel and differ only in their thresholds. So where a binary outcome has one threshold and one set of coefficients, the ordinal cumulative odds model has one set of slope estimates along with C − 1 thresholds to predict membership in the ordered categories.
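A compact way to write this cumulative (proportional odds) model, again as a generic sketch with x standing in for any predictor, is:

logit[P(y ≤ c | x)] = τ(c) − βx, for c = 1, ..., C − 1

The thresholds τ(1) < τ(2) < ... < τ(C−1) differ across the cumulative comparisons, but the slope β is the same for all of them, which is exactly the parallel-curves property described above.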
We can rerun the preliminary model from the first assignment with the indicators defined as ordinal. We will run the model with WLS estimation first, using the following input file.
TITLE: Factor Model with Categorical Indicators;
DATA: FILE IS C:\Mplus\assign1FA.dat;
Format is free;
VARIABLE: Names are item1 item2 item3 item4 item5 item6 g;
Usevariables are item1-item6;
categorical are item1-item6;
ANALYSIS:
Estimator is WLS;
Model:
DecMak by item1* item2 item3;
Eval by item4* item5 item6;
DecMak@1;
Eval@1;
OUTPUT: STANDARDIZED;
Output
First, we obtain the proportions and counts in each category of the ordinal items (note that item2 has only four observed categories).
UNIVARIATE PROPORTIONS AND COUNTS FOR CATEGORICAL VARIABLES
ITEM1
Category 1 0.029 11.000
Category 2 0.083 32.000
Category 3 0.172 66.000
Category 4 0.349 134.000
Category 5 0.367 141.000
ITEM2
Category 1 0.026 10.000
Category 2 0.148 57.000
Category 3 0.398 153.000
Category 4 0.427 164.000
ITEM3
Category 1 0.047 18.000
Category 2 0.135 52.000
Category 3 0.255 98.000
Category 4 0.354 136.000
Category 5 0.208 80.000
ITEM4
Category 1 0.039 15.000
Category 2 0.115 44.000
Category 3 0.211 81.000
Category 4 0.331 127.000
Category 5 0.305 117.000
ITEM5
Category 1 0.016 6.000
Category 2 0.081 31.000
Category 3 0.297 114.000
Category 4 0.393 151.000
Category 5 0.214 82.000
ITEM6
Category 1 0.013 5.000
Category 2 0.102 39.000
Category 3 0.188 72.000
Category 4 0.344 132.000
Category 5 0.354 136.000
MODEL FIT INFORMATION
Number of Free Parameters 30
Chi-Square Test of Model Fit
Value 9.199
Degrees of Freedom 8
P-Value 0.3258
RMSEA (Root Mean Square Error Of Approximation)
Estimate 0.020
90 Percent C.I. 0.000 0.065
Probability RMSEA <= .05 0.831
CFI/TLI
CFI 0.999
TLI 0.998
We can see that this model appears to fit the data very well. The fit indices are also fairly consistent with those from ML estimation when the indicators were treated as continuous (I think the chi-square was 9.47 with 8 df in assignment 1). Below are the standardized loadings of the items defining the factors.
STDYX Standardization
Two-Tailed
Estimate S.E. Est./S.E. P-Value
DECMAK BY
ITEM1 0.755 0.034 22.107 0.000
ITEM2 0.683 0.041 16.781 0.000
ITEM3 0.748 0.034 22.324 0.000
EVAL BY
ITEM4 0.835 0.024 34.482 0.000
ITEM5 0.820 0.025 32.454 0.000
ITEM6 0.853 0.024 35.475 0.000
EVAL WITH
DECMAK 0.868 0.030 28.925 0.000
Thresholds
{I did not list these}
Variances
DECMAK 1.000 0.000 999.000 999.000
EVAL 1.000 0.000 999.000 999.000
R-SQUARE
Observed Two-Tailed Residual
Variable Estimate S.E. Est./S.E. P-Value Variance
ITEM1 0.570 0.052 11.053 0.000 0.430
ITEM2 0.467 0.056 8.391 0.000 0.533
ITEM3 0.560 0.050 11.162 0.000 0.440
ITEM4 0.697 0.040 17.241 0.000 0.303
ITEM5 0.672 0.041 16.227 0.000 0.328
ITEM6 0.727 0.041 17.737 0.000 0.273
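As a quick arithmetic check on how these R-square values relate to the loadings: because each item loads on only one factor, its R-square is simply the squared standardized loading. For example, for item1:

R-square = (0.755)^2 ≈ 0.570, so the residual variance = 1 − 0.570 = 0.430

which matches the values reported above.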
We can also run the model with maximum likelihood (ML) estimation, keeping the indicators defined as categorical. You just need to change the estimator in the ANALYSIS command.
TITLE: Assignment 1 With Categorical Indicators;
DATA: FILE IS "C:\Mplus\Examples EDEP 768\assign1fa.dat";
Format is free;
VARIABLE: Names are item1 item2 item3 item4 item5 item6 g;
Usevariables are item1-item6;
Categorical = item1-item6;
ANALYSIS: Estimator is ML;
Model:
DecMak by item1*1 item2 item3;
Eval by item4*1 item5 item6;
DecMak@1;
Eval@1;
OUTPUT: SAMPSTAT STANDARDIZED;
You can examine the item loadings on the factors if you run this on your own. Note that, consistent with the earlier discussion, the ML solution does not provide direct estimates of residual variances (error terms) for the items. The squared standardized loadings (R-square values) are a bit different from those obtained with WLS estimation.
R-SQUARE
Observed Two-Tailed
Variable Estimate S.E. Est./S.E. P-Value
ITEM1 0.545 0.057 9.531 0.000
ITEM2 0.419 0.058 7.186 0.000
ITEM3 0.511 0.055 9.297 0.000
ITEM4 0.639 0.048 13.378 0.000
ITEM5 0.666 0.047 14.219 0.000
ITEM6 0.713 0.045 15.679 0.000
MODEL FIT INFORMATION
Number of Free Parameters = 30
Loglikelihood
Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes**
Pearson Chi-Square
Value 2724.054
Degrees of Freedom 12450
P-Value 1.0000
Likelihood Ratio Chi-Square
Value 972.703
Degrees of Freedom 12450
P-Value 1.0000
Testing the Model Across Groups
One issue, however, is that a model with categorical indicators cannot be tested across groups in the same manner we used previously. With categorical indicators, Mplus requires a type of "mixture model" to be estimated. Mixture models are models that include a categorical latent variable representing a mixture of subpopulations, where population membership is typically not known but is instead inferred from the data. This type of analysis estimates the number and size of the latent classes in the mixture and assigns individuals in the population to the latent classes. When comparing the model across groups whose membership is known, we can use the KNOWNCLASS option in Mplus to conduct the multiple-group analysis with categorical indicators.
The default in Mplus is that the same model is fit in each of the latent classes specified. This works nicely for conducting a comparison between groups, since the first model will test equal factors, item loadings, item thresholds (instead of intercepts), and factor correlations. Note that there are no error variances to test, since the categorical items do not have error terms.
Hypothesis 1: Equal factors, item loadings, item thresholds (for ordinal indicators), and factor correlations. Note: You will not be able to run this with the Mplus demo version, as there are too many variables for it. The indicators are defined as categorical.
TITLE: Comparing the Model Across Groups (Ordinal Indicators);
DATA: FILE IS E:\assign1FA.dat;
Format is free;
VARIABLE: Names are item1 item2 item3 item4 item5 item6 g;
Usevariables are item1-item6;
categorical are item1-item6;
classes = cg(2);
Knownclass = cg(g = 1 g = 2);
ANALYSIS:
Type = mixture;
Estimator is ML;
Algorithm = integration;
Model: %OVERALL%
F1 by item1* item2 item3;
F1@1;
F2 by item4* item5 item6;
F2@1;
OUTPUT: SAMPSTAT STANDARDIZED;
You might keep this as a reference for some rainy day.
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL
Latent
Classes
1 179.00000 0.46615
2 205.00000 0.53385
Note: Above are the class counts which confirm that the two organization types are being compared.
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -2901.860
Information Criteria
Number of Free Parameters 33
Akaike (AIC) 5869.721
Bayesian (BIC) 6000.092
Sample-Size Adjusted BIC 5895.387
(n* = (n + 2) / 24)
Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes**
Pearson Chi-Square
Value 3692.445
Degrees of Freedom 24937
P-Value 1.0000
Likelihood Ratio Chi-Square
Value 1184.099
Degrees of Freedom 24937
P-Value 1.0000
This chi-square test suggests the model fits well across the two classes (defined by organizational type), so we can accept measurement invariance. The model is tested for equal factors, item loadings, item thresholds, and factor correlations between groups.
Here is the standardized solution so you can examine the loadings, etc.
STDYX Standardization
Two-Tailed
Estimate S.E. Est./S.E. P-Value
Latent Class 1
F1 BY
ITEM1 0.724 0.040 18.095 0.000
ITEM2 0.623 0.047 13.302 0.000
ITEM3 0.701 0.039 17.849 0.000
F2 BY
ITEM4 0.778 0.032 24.297 0.000
ITEM5 0.792 0.031 25.351 0.000
ITEM6 0.836 0.028 29.557 0.000
F2 WITH
F1 0.854 0.037 22.994 0.000
Means
F1 0.633 0.128 4.929 0.000
F2 0.716 0.120 5.987 0.000
Thresholds
{I did not list these as they are invariant and not usually interpreted}
Variances
F1 1.000 0.000 999.000 999.000
F2 1.000 0.000 999.000 999.000
Latent Class 2
F1 BY
ITEM1 0.724 0.040 18.095 0.000
ITEM2 0.623 0.047 13.302 0.000
ITEM3 0.701 0.039 17.849 0.000
F2 BY
ITEM4 0.778 0.032 24.297 0.000
ITEM5 0.792 0.031 25.351 0.000
ITEM6 0.836 0.028 29.557 0.000
F2 WITH
F1 0.854 0.037 22.994 0.000
Means
F1 0.000 0.000 999.000 999.000
F2 0.000 0.000 999.000 999.000
Variances
F1 1.000 0.000 999.000 999.000
F2 1.000 0.000 999.000 999.000
(Notice that in this formulation the reference group is the second latent class, type 2, or service organizations. The factor means are therefore significantly higher in Group 1, the product organizations.)
We can then back out of the fully constrained model and test whether specific parameters, such as the factor covariance, are equal across the groups.
After the %OVERALL% model statements, you can add this class-specific model statement, which frees the factor covariance in the first class:
%cg#1%
f1 with f2;
RESULTS:
Model 1 Chi Square Value = 3692.445
Model 2 Chi Square Value = 3714.357
Delta Chi Square Value = 21.912 (1 df)
This suggests that allowing the factor correlation to differ across the groups does not improve the model.
Running the Path Model with a Continuous Mediating Variable and a Dichotomous Outcome
I checked into the issue of standardizing logit coefficients and found that at least six different ways of obtaining standardized logit coefficients have been proposed. The problem lies in finding a standard deviation for the Y variable. If you look at the Mplus output, an estimate of the covariance with the dependent variable (readprof) is not included in the preliminary statistics. I did manage to work out what the standardized coefficients in Mplus represent, however.
How Do You Standardize the Predictors?
Typically, for continuous variables we standardize a coefficient by multiplying the unstandardized coefficient of the predictor by the ratio of its standard deviation to the standard deviation of the dependent variable:
β* = β(σx / σy).
We start with the fact that the error variance in a logistic distribution is fixed at approximately 3.29 (π²/3). To estimate the standard deviation of y, save the predicted logit values for each case from the logistic regression. The predicted values of η are approximately normally distributed, so we can obtain their variance from the descriptive statistics. We then add this variance to the variance of the error term (remember that the observed outcome takes only two values at each level of x), which is defined as 3.29 in logistic regression (it is always the same value in a logistic distribution). Taking the square root of the sum gives a measure of the standard deviation of the dependent variable (Pampel, 2000). Because this estimate of the standard deviation of y depends on the other variables in the model, it will change when new independent variables are added.
I obtained the variance of the predicted logged odds of readprof in SPSS. You can save the predicted log odds for each person.
From descriptive statistics, I also obtained the variance and standard deviations of the other variables affecting readprof.
To calculate the standard deviation of readprof, then:
sd(readprof) = √(variance of the predicted logits + 3.29) = 3.17
Lang standardized coefficient = b(sdx/sdy) = .079(31.64/3.17) = .789
Age standardized coefficient = b(sdx/sdy) = -.130(4.04/3.17) = -.166
These estimates are very close to the Mplus standardized coefficients, so they provide a good approximation for how the program is standardizing dichotomous variables.
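For anyone who wants to reproduce the calculation, here is a minimal sketch in Python (this is not the author's code; the predicted_logits array is a hypothetical placeholder for the values you would save from SPSS or Mplus):

import numpy as np

# Fixed error variance of the standard logistic distribution (pi^2/3, about 3.29)
LOGISTIC_ERROR_VAR = np.pi ** 2 / 3

def standardized_logit_coef(b, sd_x, predicted_logits):
    # sd of the outcome is sqrt(variance of the saved predicted logits + 3.29)
    sd_y = np.sqrt(np.var(predicted_logits, ddof=1) + LOGISTIC_ERROR_VAR)
    return b * sd_x / sd_y

# Using the standard deviation reported above (sd_y = 3.17) directly:
sd_y = 3.17
print(round(0.079 * 31.64 / sd_y, 3))   # LANG: about  .789
print(round(-0.130 * 4.04 / sd_y, 3))   # AGE:  about -.166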
MODEL RESULTS (LOGIT COEFFICIENTS)
Two-Tailed
Estimate S.E. Est./S.E. P-Value
READPROF ON
LANG 0.079 0.014 5.739 0.000
LOWSES -0.340 0.543 -0.625 0.532
AGE -0.130 0.070 -1.861 0.063