Analyses Involving Categorical Dependent Variables
When Dependent Variables are Categorical
Chi-square analysis is frequently used.
Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?
Dependent variable is Death: No (0) vs. Yes (1).
So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.
Comments on Chi-square analyses
What’s good?
1. The analysis is appropriate. It hasn’t been supplanted by something else.
2. The results are usually easy to communicate, especially to lay audiences.
3. A DV with a few more than 2 categories can be easily analyzed.
4. An IV with only a few more than 2 categories can be easily analyzed.
What’s bad?
1. Incorporating more than one independent variable is awkward, requiring multiple tables.
2. Certain tests, such as tests of interactions, can’t be performed when you have more than one IV.
3. Chi-square analyses can’t be done with continuous IVs unless you categorize them, which goes against the recommendation NOT to categorize continuous variables because doing so loses power.
Alternatives to the Chi-square test. We’ll focus on Dichotomous (two-valued) DVs.
1. Techniques based on linear regression
a. Multiple Linear Regression. Regress the dichotomous DV onto the mix of IVs.
b. Discriminant Analysis (equivalent to MR when DV is dichotomous)
Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.
1. Assumption is that underlying relationship between Y and X is linear.
But when Y has only two values, how can that be?
2. Y-hats when Y is continuous are typically realizable values of Y. But when Y has only two values, most of the Y-hats will be values that are not either of those two values. In that case, what are they?
3. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.
4. Residuals will probably not be normally distributed.
5. Regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction.
2. Logistic Regression
3. Probit analysis
Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.
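The problems with regression-based methods listed above can be seen in a small sketch. The ten data points below are hypothetical (they are not from any study in this handout), and the least-squares formulas are the ordinary ones for Predicted Y = B0 + B1*X. The sketch is only meant to show that the Y-hats are neither 0 nor 1 and that the line runs below 0 and above 1.

```python
# Hypothetical data: 10 cases, X = 0..9, Y switches from 0 to 1 at X = 5.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares slope and intercept for Predicted Y = B0 + B1*X.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Y-hats and residuals from the fitted straight line.
y_hats = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]

for x, y, y_hat, r in zip(xs, ys, y_hats, residuals):
    print(f"X={x}  Y={y}  Y-hat={y_hat:6.3f}  residual={r:6.3f}")
```

In this sketch the Y-hat at the smallest X is negative and the Y-hat at the largest X exceeds 1, which is problem 5 in the list above.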
The Logistic Regression Equation
Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.
When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We’ll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.
The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1).
The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression)
$$P(Y=1) = \frac{1}{1 + e^{-(B_0 + B_1 X)}} = \frac{e^{(B_0 + B_1 X)}}{e^{(B_0 + B_1 X)} + 1}$$
The logistic regression equation defines an S-shaped (Ogive) curve, that rises from 0 to 1. P(Y=1) is never negative and never larger than 1.
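A quick numerical check of the equation may help. The sketch below evaluates both forms of the logistic equation and confirms that they agree and that P(Y=1) stays between 0 and 1. The values of B0 and B1 and the X values are hypothetical, chosen only for illustration.

```python
import math

def p_form1(x, b0, b1):
    """P(Y=1) written as 1 / (1 + e^-(B0 + B1*X))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def p_form2(x, b0, b1):
    """P(Y=1) written as e^(B0 + B1*X) / (e^(B0 + B1*X) + 1)."""
    z = math.exp(b0 + b1 * x)
    return z / (z + 1.0)

# Hypothetical parameter values and X values, for illustration only.
B0, B1 = -2.0, 0.5
xs = [-10, -5, 0, 4, 5, 10, 20]

probs = [p_form1(x, B0, B1) for x in xs]
for x, p in zip(xs, probs):
    print(f"X={x:4d}  P(Y=1)={p:.4f}  (second form: {p_form2(x, B0, B1):.4f})")
```

With these hypothetical values, B0 + B1*X is 0 at X = 4, so P(Y=1) is exactly .5 there.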
The curve of the equation . . .
B0: B0 is analogous to the linear regression “constant”, i.e., the intercept parameter. B0 defines the "height" of the curve. B0 is an elevation parameter. It is also called a difficulty parameter in some applications.
B1: B1 is analogous to the slope of the linear regression line. B1 defines the “steepness” of the curve. It is sometimes called a discrimination parameter.
The larger the value of B1, the “steeper” the curve, the more quickly it goes from 0 to 1.
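The roles of B0 and B1 can be illustrated with a short sketch. It relies on one fact that follows directly from the equation: P(Y=1) equals .5 where B0 + B1*X = 0, that is, at X = -B0/B1. The three parameter pairs and the helper names are hypothetical, used only to compare a baseline curve with one that has a higher B0 and one that has a larger B1.

```python
import math

def logistic_p(x, b0, b1):
    """P(Y=1) from the simple logistic regression equation."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def midpoint(b0, b1):
    """X value at which P(Y=1) = .5, i.e., where B0 + B1*X = 0."""
    return -b0 / b1

def rise_near_midpoint(b0, b1, half_width=0.5):
    """Change in P(Y=1) across an X interval centered on the midpoint."""
    m = midpoint(b0, b1)
    return logistic_p(m + half_width, b0, b1) - logistic_p(m - half_width, b0, b1)

# Hypothetical parameter pairs (B0, B1), for illustration only.
curves = {"baseline": (0.0, 1.0), "higher B0": (2.0, 1.0), "larger B1": (0.0, 3.0)}

for label, (b0, b1) in curves.items():
    print(f"{label:10s} P at X=0: {logistic_p(0, b0, b1):.3f}  "
          f"midpoint X: {midpoint(b0, b1):5.2f}  "
          f"rise over 1 unit of X: {rise_near_midpoint(b0, b1):.3f}")
```

Raising B0 with B1 fixed lifts the curve at every X (the midpoint moves left), while raising B1 makes the curve climb more over the same interval of X.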
Note that there is a MAJOR difference between the linear regression and logistic regression curves - - -
The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.
But the linear regression lines extend below 0 on the left and above 1 on the right.
If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.
Why we must fit ogival-shaped curves – the curse of categorization
Here’s a perfectly nice linear relationship between score values, from a recent study.
This relationship is of ACT Comp scores to Wonderlic scores.
[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav
Here’s the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.
When proportions of High scores are plotted vs. WPT value, we get the following.
So, to fit the above curve, we need a model that is ogival. This is where the logistic regression function comes into play.
This means that even if the “underlying” true values are linearly related, the dichotomized values that we may have to work with may not be linearly related to the independent variable.
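The point can be sketched numerically. Assume, hypothetically, that the continuous score equals A + B*X plus normally distributed error with standard deviation SD, and that it is dichotomized at CUT. Only the cut of 23 comes from the example above; A, B, and SD are made-up values, not estimates from the ACT/Wonderlic data. Under that assumption the proportion of High scores at each X is a normal ogive, not a straight line.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical linear model for the continuous score: Y = A + B*X + error,
# with error ~ Normal(0, SD). "High" means Y is at or above CUT.
A, B, SD, CUT = 8.0, 0.6, 2.5, 23.0

def prob_high(x):
    """P(Y >= CUT) at a given X under the assumed linear model."""
    return 1.0 - normal_cdf((CUT - (A + B * x)) / SD)

xs = list(range(10, 41, 5))
probs = [prob_high(x) for x in xs]
diffs = [b - a for a, b in zip(probs, probs[1:])]

for x, p in zip(xs, probs):
    print(f"X={x:2d}  proportion High={p:.4f}")
```

The successive differences in `diffs` grow and then shrink, which is the S shape: flat near 0, steep in the middle, flat near 1. The normal ogive that appears here is the curve used in probit analysis, and it is very similar in shape to the logistic curve.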
Crosstabs and Logistic Regression
Applied to the same 2x2 situation
The FFROSH data.
The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester.
The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who were retained to the immediately following spring semester and 0 for those who were not.
The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.
After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.)
So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.
The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.
First, univariate analyses . . .
GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.
Fre var=retained earlireg.
RETAINED
Valid / Frequency / Percent / Valid Percent / Cumulative Percent
.00 / 552 / 11.6 / 11.6 / 11.6
1.00 / 4201 / 88.4 / 88.4 / 100.0
Total / 4753 / 100.0 / 100.0

EARLIREG
Valid / Frequency / Percent / Valid Percent / Cumulative Percent
.00 / 2316 / 48.7 / 48.7 / 48.7
1.00 / 2437 / 51.3 / 51.3 / 100.0
Total / 4753 / 100.0 / 100.0
crosstabs retained by earlireg /cells=cou col /sta=chisq.
The same analysis using Logistic Regression
Analyze -> Regression -> Binary Logistic
logistic regression retained WITH earlireg.
Logistic Regression
The Logistic Regression procedure fits the logistic regression model to the data. It estimates the parameters of the logistic regression equation.
That equation is
$$P(Y) = \frac{1}{1 + e^{-(B_0 + B_1 X)}}$$
It performs the estimation in two stages. The first stage estimates only B0. So the model fit to the data in the first stage is simply
$$P(Y) = \frac{1}{1 + e^{-B_0}}$$
SPSS labels the various stages of the estimation procedure “Blocks”. In Block 0, a model with only B0 is estimated.
Block 0: Beginning Block (estimating only B0)
Explanation of the above table:
The program computes Y-hat for each case using the logistic regression formula with the estimate of B0. If Y-hat is <= a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is > 0.5, the program records that case as a predicted 1. It then creates the above table of number of actual 1’s and 0’s vs. predicted 1’s and 0’s.
The prediction equation for Block 0 is Y-hat = $1/(1 + e^{-2.030})$. (The value 2.030 is shown in the “Variables in the Equation” table below.) Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is > 0.5, all predictions are 1, which is why the above table has only predicted 1’s. Sometimes this table is more useful than it was in this case.
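The Block 0 numbers can be reproduced from the frequencies of RETAINED reported in the univariate output above. The sketch below assumes the standard result that, with only a constant in the model, the estimate of B0 is the log of the odds of Y = 1. The variable names are my own; this is a check of the arithmetic, not a description of how SPSS computes its estimates.

```python
import math

# Frequencies of RETAINED from the univariate output above.
n_not_retained = 552
n_retained = 4201
n_total = n_not_retained + n_retained

# Assumption: with only a constant in the model, the estimate of B0
# is the log odds of Y = 1.
b0 = math.log(n_retained / n_not_retained)

# Y-hat is the same constant for every case in Block 0.
y_hat = 1.0 / (1.0 + math.exp(-b0))

# Classification with the default cut value of .5.
CUT = 0.5
predicted = 1 if y_hat > CUT else 0
n_correct = n_retained if predicted == 1 else n_not_retained
percent_correct = 100.0 * n_correct / n_total

print(f"B0 = {b0:.3f}")
print(f"Y-hat = {y_hat:.4f}")
print(f"Predicted value for every case = {predicted}")
print(f"Percent classified correctly = {percent_correct:.1f}")
```

Because every case is predicted to be a 1, the percent classified correctly in Block 0 is simply the percent of cases that actually are 1’s.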
The above “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients Box” in regular regression analysis.
The test statistic is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is $(B/SE)^2$. So $(2.030/.045)^2$ = 2,035, which would be 2009.624 if the two values were represented with greater precision.
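The gap between 2,035 and 2009.624 is purely a rounding effect, as the sketch below suggests. It computes the Wald statistic twice: once from the rounded B and SE, and once from unrounded values rebuilt from the RETAINED frequencies (552 and 4201). The rebuild assumes the standard results for a constant-only model, B0 = ln(n1/n0) and SE = sqrt(1/n1 + 1/n0); those formulas are my assumption about what lies behind the printed values, not something shown in the SPSS output.

```python
import math

# Wald statistic from the rounded values printed in the table.
b_rounded, se_rounded = 2.030, 0.045
wald_rounded = (b_rounded / se_rounded) ** 2

# Unrounded values rebuilt from the RETAINED frequencies.
# Assumption: for a constant-only model, B0 = ln(n1/n0) and
# SE(B0) = sqrt(1/n1 + 1/n0).
n_not_retained, n_retained = 552, 4201
b0 = math.log(n_retained / n_not_retained)
se_b0 = math.sqrt(1.0 / n_retained + 1.0 / n_not_retained)
wald_full = (b0 / se_b0) ** 2

print(f"Wald from rounded values:   {wald_rounded:.3f}")
print(f"B0 = {b0:.5f}, SE = {se_b0:.5f}")
print(f"Wald from unrounded values: {wald_full:.3f}")
```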
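Exp(B) is $e^{B}$, here $e^{2.030} \approx 7.61$. For a predictor, Exp(B) is the odds ratio; for the constant alone, it is the odds that Y = 1 (4201/552 ≈ 7.61). More on it later.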
The “Variables not in the Equation” gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be “significant” if it were added to the equation. In this case, it’s telling us that EARLIREG would contribute significantly to the equation if it were added to the equation, which is what SPSS does next . . .
Block 1: Method = Enter (Adding B1*X to the equation)
Whew – three chi-square statistics.
“Step”: Compared to previous step in a stepwise regression. Ignore for now.
“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.
“Model”: Ignore for now
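The identity between the “Block” chi-square and the Likelihood ratio chi-square from CROSSTABS can be checked with a small sketch. The 2x2 counts below are hypothetical (they are NOT the FFROSH cell counts). The sketch uses the fact that with one dichotomous predictor the fitted probabilities equal the observed proportion of 1’s in each group, so no iterative fitting is needed.

```python
import math

# Hypothetical 2x2 counts, for illustration only.
# Rows: X = 0, X = 1. Columns: Y = 0, Y = 1.
table = [[30, 70],
         [15, 85]]

def neg2_log_likelihood(groups):
    """-2 * log likelihood for groups given as (n_y0, n_y1, P(Y=1))."""
    return -2.0 * sum(n0 * math.log(1.0 - p) + n1 * math.log(p)
                      for n0, n1, p in groups)

n_y0 = table[0][0] + table[1][0]
n_y1 = table[0][1] + table[1][1]
n = n_y0 + n_y1

# Block 0: constant only, so every case gets the overall proportion of 1's.
p_overall = n_y1 / n
neg2ll_block0 = neg2_log_likelihood([(row[0], row[1], p_overall) for row in table])

# Block 1: one dichotomous predictor, so each group gets its own proportion.
neg2ll_block1 = neg2_log_likelihood(
    [(row[0], row[1], row[1] / sum(row)) for row in table])

# "Block" chi-square: improvement in -2 log likelihood.
block_chi_square = neg2ll_block0 - neg2ll_block1

# Likelihood ratio chi-square from the table: 2 * sum(O * ln(O / E)).
g2 = 0.0
for row in table:
    for j, observed in enumerate(row):
        expected = sum(row) * (n_y0 if j == 0 else n_y1) / n
        g2 += 2.0 * observed * math.log(observed / expected)

print(f"Block chi-square (difference in -2LL): {block_chi_square:.4f}")
print(f"Likelihood ratio chi-square:           {g2:.4f}")
```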
The value under “-2 Log likelihood” is a measure of how well the model fits the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to “percent of variance accounted for”. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.
The above table is the revised version of the table presented in Block 0.
Note that since X is a dichotomous variable here, there are only two y-hat values. They are
$$P(Y) = \frac{1}{1 + e^{-(B_0 + B_1 \cdot 0)}} = .842 \quad \text{(see below)}$$
$$P(Y) = \frac{1}{1 + e^{-(B_0 + B_1 \cdot 1)}} = .924 \quad \text{(see below)}$$
As we’ll see below, in both cases, the y-hat was > .5, so predicted Y in the table was 1 for all cases.
The prediction equation is Y-hat = $1 / (1 + e^{-(1.670 + .830 \cdot \text{EARLIREG})})$.
Since EARLIREG has only two values, those students who registered early will have a predicted RETAINED value of $1/(1+e^{-(1.670+.830 \cdot 1)})$ = .924. Those who registered late will have a predicted RETAINED value of $1/(1+e^{-(1.670+.830 \cdot 0)}) = 1/(1+e^{-1.670})$ = .842. Since both predicted values are above .5, all the cases were predicted to be retained in the classification table above.
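The two predicted values can be checked with a few lines of code. The coefficients 1.670 and .830 are the ones given in the prediction equation above; the function name and the use of a .5 cut value as a constant are my own choices for this sketch.

```python
import math

# Coefficients from the prediction equation above.
B0, B1 = 1.670, 0.830
CUT = 0.5

def predicted_retained(earlireg):
    """Y-hat: predicted probability that RETAINED = 1."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * earlireg)))

for earlireg in (0, 1):
    y_hat = predicted_retained(earlireg)
    predicted_group = 1 if y_hat > CUT else 0
    print(f"EARLIREG={earlireg}  Y-hat={y_hat:.3f}  predicted RETAINED={predicted_group}")
```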
Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0.
Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is
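$$\text{Odds ratio} = \frac{\text{Odds when } X=1}{\text{Odds when } X=0} = \frac{.924/(1-.924)}{.842/(1-.842)} = \frac{12.158}{5.329} \approx 2.28$$
The rounded probabilities give 2.28; with unrounded values the ratio is Exp(B) = $e^{.830}$ = 2.29.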
So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained.
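The odds ratio arithmetic can be verified as follows. The sketch computes the odds ratio from the unrounded predicted probabilities, compares it with $e^{B_1}$, and also repeats the calculation with the probabilities rounded to three decimals. The coefficients are those given above; the function names are my own.

```python
import math

# Coefficients from the prediction equation above.
B0, B1 = 1.670, 0.830

def prob(x):
    """P(Y=1) for a given value of EARLIREG."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

odds_late = odds(prob(0))    # EARLIREG = 0
odds_early = odds(prob(1))   # EARLIREG = 1
odds_ratio = odds_early / odds_late

# Same calculation using probabilities rounded to three decimals.
odds_ratio_rounded = odds(0.924) / odds(0.842)

print(f"Odds when X=1: {odds_early:.3f}")
print(f"Odds when X=0: {odds_late:.3f}")
print(f"Odds ratio: {odds_ratio:.3f}   exp(B1): {math.exp(B1):.3f}")
print(f"Odds ratio from rounded probabilities: {odds_ratio_rounded:.3f}")
```

The small difference between the two results comes only from rounding the probabilities before forming the odds.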
Graphical representation of what we’ve just found.
The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.
1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.
BUT as mentioned above . . .
2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.
3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X’s, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.
4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.
5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.
6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.
Logistic Regression with one Continuous Independent Variable
The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.
The objective here is to determine which alone is the better predictor of the condition and to determine if both are needed.
Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts.
This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to only Amylase.
The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise.
Distributions of logamy and loglip – still somewhat positively skewed even though logarithms were taken.
The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model. We’ll test that hypothesis later.
1. Scatterplots with individual cases.
Relationship of Pancreatitis Diagnosis to log(Amylase)
This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy. It is difficult to see the relationship that may very well be represented by the data. One can see, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis).
The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted.
But the logistic regression analysis assumes that the relationship of probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive.
Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed individual case points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.
2. Grouping cases to show a relationship when the DV is a dichotomy.
The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person’s IV value (log amylase value). The problem was that the plot didn’t really show the relationship because the DV could take on only two values - 0 and 1.
When the DV is a dichotomy, it may be profitable to form groups of cases with similar IV values and plot the proportion of 1’s within each group vs. the IV value for that group.
To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group mid points. Each case was assigned to a group based on how close that case’s log amylase value was to the group midpoint. So, for example, all cases between 1.5 and 1.7 were assigned to the 1.6 group.
Syntax: compute logamygp = rnd(logamy,.2).
Then the proportion of 1’s within each group was computed. The figure below is a plot of the proportion of 1’s within each group vs. the group midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown above.
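The grouping procedure can be sketched outside SPSS as well. The twelve (logamy, pancgrp) pairs below are hypothetical and are not from the Pancreatitis data. The helper `round_to_nearest` is my own stand-in for the SPSS expression rnd(logamy,.2); it rounds to the nearest multiple of .2 and rounds exact halves upward, which may differ from SPSS in tie cases.

```python
import math
from collections import defaultdict

def round_to_nearest(value, width):
    """Round value to the nearest multiple of width (halves rounded up)."""
    return round(math.floor(value / width + 0.5) * width, 10)

# Hypothetical (logamy, pancgrp) pairs, for illustration only.
cases = [(1.52, 0), (1.61, 0), (1.68, 0), (2.15, 0), (2.22, 1), (2.27, 0),
         (2.95, 1), (3.04, 1), (3.08, 0), (3.55, 1), (3.62, 1), (3.66, 1)]

# Assign each case to the group whose midpoint is closest to its logamy value.
groups = defaultdict(list)
for logamy, pancgrp in cases:
    groups[round_to_nearest(logamy, 0.2)].append(pancgrp)

# Proportion of 1's within each group, ordered by group midpoint.
proportions = {mid: sum(values) / len(values)
               for mid, values in sorted(groups.items())}

for mid, prop in proportions.items():
    print(f"group midpoint {mid:.1f}: n={len(groups[mid])}  proportion of 1's={prop:.2f}")
```

Plotting the proportions against the group midpoints gives the kind of grouped plot described above.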
The plot of proportions above suggests that the S-shaped curve of the logistic regression model may better represent the increase in probability of Pancreatitis than the straight line curve of the linear regression model.
The analyses that follow illustrate the application of both analyses to the data.
3. Linear Regression analysis of the logamy data, just for old time’s sake.
/DEPENDENT pancgrp