Stat 701 Handout on
Binary Logistic Regression
The Study Of Interest (Example on page 575 of text): The data below come from a study assessing programmers' ability to complete a complex programming task within a specified time, and relating this ability to the experience level of the programmer. Twenty-five programmers took part in the study, and all were given the same task. The data set from the study is given below.
X = Months of Programming Experience;
Y = Success in Task (1 = Successful, 0 = Failure).
Note that X, the predictor variable, is a quantitative variable, while Y, the response variable, is a dichotomous, qualitative variable.
The scatterplot of the data is given below.
The problem is to obtain a model relating the response variable (Y) to the predictor variable (X). The model used is the logistic regression model, described as follows:
Let $\pi(x) = P(Y = 1 \mid X = x)$ be the conditional probability of observing a successful outcome in performing the task when the level of programming experience of the subject is x. In the logistic regression model it is assumed that
$$\ln\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \beta_1 x.$$
This is equivalent to assuming that
$$\pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}.$$
Here are two graphs of this logistic function corresponding to two sets of values of $(\beta_0, \beta_1)$. Note that one of the graphs would be a very bad model for the data above, while the other might be a good model for the success probability in the programming data.
Interpretation of the Coefficients (discussed in more detail in class):
$\beta_0$ = intercept term for the linear model of the log-odds.
First, the ODDS corresponding to the probability $\pi(x)$ is given by
$$\text{odds}(x) = \frac{\pi(x)}{1-\pi(x)} = \exp(\beta_0 + \beta_1 x).$$
The coefficient $\beta_1$ can be interpreted in several ways.
- It can be viewed as the change in the value of the log-odds when the value of the predictor variable is increased by one unit.
- $\exp(\beta_1)$ can also be interpreted as the ODDS RATIO (OR), which is the ratio of the odds when the predictor value is (x+1) to the odds when the predictor value is x. Symbolically,
$$\text{OR} = \frac{\text{odds}(x+1)}{\text{odds}(x)} = \exp(\beta_1).$$
Thus, $\beta_1$ can also be interpreted as the LOGARITHM of the ODDS RATIO, that is, $\beta_1 = \ln(\text{OR})$.
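The interpretation above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the coefficient values are borrowed from the SAS iteration history shown later in this handout (Intercept = -3.059696, MonExp = 0.161486). It evaluates the logistic function, the odds, and the ratio of odds at x+1 and x, which should equal exp(b1) whatever x is.

```python
import math

# Assumed values: fitted coefficients copied from the SAS output later in
# this handout. Any other (b0, b1) pair could be substituted.
B0, B1 = -3.059696, 0.161486

def success_probability(x, b0=B0, b1=B1):
    """Logistic function: exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(x, b0=B0, b1=B1):
    """Odds of success, pi(x) / (1 - pi(x))."""
    p = success_probability(x, b0, b1)
    return p / (1.0 - p)

if __name__ == "__main__":
    for x in (5, 14, 25):
        ratio = odds(x + 1) / odds(x)
        # The odds ratio column should be the same on every row.
        print(x, round(success_probability(x), 4), round(ratio, 4))
    print("exp(b1) =", round(math.exp(B1), 4))
```

With these coefficients, the probability at x = 14 is about 0.3103, which agrees with the Est value listed for that programmer in the SAS data step.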
Estimation and Testing when Dealing with Logistic Model
- Maximum Likelihood Estimation Procedure.
- Hypothesis testing is via likelihood ratio tests.
We will not go into any detail about these methods of inference, but simply illustrate them using the results from the logistic regression analysis in Minitab. It should be noted that there are no closed-form expressions for the regression coefficient estimates. They are obtained iteratively, and the object of this iterative procedure is to find the regression coefficients that maximize the likelihood function. As such, the estimation procedure is computer-intensive.
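The iterative search described above can be sketched in a few lines. What follows is a minimal Newton-Raphson illustration (for the logit link this coincides with the Fisher scoring method named in the SAS output later in this handout); it is not the actual Minitab or SAS implementation. The names `fit_logistic` and `log_likelihood` are hypothetical, and the data are the 25 (MonExp, TskSucc) pairs from the SAS program in this handout.

```python
import math

# (months of experience, task success) for the 25 programmers
DATA = [(14, 0), (29, 0), (6, 0), (25, 1), (18, 1), (4, 0), (18, 0),
        (12, 0), (22, 1), (6, 0), (30, 1), (11, 0), (30, 1), (5, 0),
        (20, 1), (13, 0), (9, 0), (32, 1), (24, 0), (13, 1), (19, 0),
        (4, 0), (28, 1), (22, 1), (8, 1)]

def log_likelihood(b0, b1, data):
    """Bernoulli log-likelihood under the logistic model."""
    total = 0.0
    for x, y in data:
        eta = b0 + b1 * x
        total += y * eta - math.log1p(math.exp(eta))
    return total

def fit_logistic(data, tol=1e-10, max_iter=25):
    """Newton-Raphson for one predictor; returns (b0, b1)."""
    ybar = sum(y for _, y in data) / len(data)
    b0, b1 = math.log(ybar / (1.0 - ybar)), 0.0  # intercept-only start
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score (gradient) components
            g1 += (y - p) * x
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 system
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

if __name__ == "__main__":
    b0, b1 = fit_logistic(DATA)
    print(b0, b1, log_likelihood(b0, b1, DATA))
```

Starting from the intercept-only fit, the iterations should settle near the estimates reported in the output of this handout (about -3.0597 and 0.1615), with a log-likelihood near -12.712.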
We now illustrate the results of the Minitab Analysis.
Binary Logistic Regression
(Minitab Output)
Step Log-Likelihood
0 -17.148
1 -12.866
2 -12.714
3 -12.712
4 -12.712
5 -12.712
Link Function: Logit
Response Information
Variable Value Count
TaskSucc 1 11 (Event)
0 14
Total 25
Logistic Regression Table
                                                   Odds      95% CI
Predictor      Coef     StDev      Z      P       Ratio   Lower   Upper
Constant     -3.060     1.259  -2.43  0.015
MonOfExp    0.16149   0.06498   2.49  0.013        1.18    1.03    1.33
Log-Likelihood = -12.712
Test that all slopes are zero: G = 8.872, DF = 1, P-Value = 0.003
Goodness-of-Fit Tests
Method Chi-Square DF P
Pearson 19.623 17 0.294
Deviance 19.879 17 0.280
Hosmer-Lemeshow 5.946 8 0.653
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
                                 Group
Value      1     2     3     4     5     6     7     8     9    10  Total
1
  Obs      0     0     1     1     1     1     2     1     1     3     11
  Exp    0.2   0.3   0.3   1.0   1.2   1.0   1.2   1.4   1.6   2.6
0
  Obs      2     3     1     3     2     1     0     1     1     0     14
  Exp    1.8   2.7   1.7   3.0   1.8   1.0   0.8   0.6   0.4   0.4
Total      2     3     2     4     3     2     2     2     2     3     25
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs Number Percent Summary Measures
Concordant 127 82.5% Somers' D 0.66
Discordant 25 16.2% Goodman-Kruskal Gamma 0.67
Ties 2 1.3% Kendall's Tau-a 0.34
Total 154 100.0%
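Several entries in the Minitab output above can be reproduced by hand from the coefficient, its standard error, and the two log-likelihoods. The sketch below is an illustration only: the variable names are hypothetical, and the interval is a Wald interval using the normal multiplier 1.96. The p-values use the standard library function `math.erfc`, which gives normal and chi-square (1 DF) tail areas directly.

```python
import math

def two_sided_normal_p(z):
    """P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_sf_1df(g):
    """P(chi-square with 1 DF > g); equals P(|Z| > sqrt(g))."""
    return math.erfc(math.sqrt(g / 2.0))

# Values read from the Minitab output above.
coef, se = 0.16149, 0.06498          # MonOfExp row
ll_null, ll_fit = -17.148, -12.712   # Step 0 and final log-likelihood

z = coef / se                              # Z column
p_wald = two_sided_normal_p(z)             # P column
odds_ratio = math.exp(coef)                # Odds Ratio column
ci = (math.exp(coef - 1.96 * se),          # 95% CI for the odds ratio
      math.exp(coef + 1.96 * se))
G = -2.0 * (ll_null - ll_fit)              # "Test that all slopes are zero"
p_G = chi2_sf_1df(G)

if __name__ == "__main__":
    print(round(z, 2), round(p_wald, 3))
    print(round(odds_ratio, 3), [round(v, 3) for v in ci])
    print(round(G, 3), round(p_G, 3))
```

The likelihood ratio statistic G compares the intercept-only model (Step 0) with the fitted model, so it has one degree of freedom here, one for each slope tested.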
A Goodness-Of-Fit Criterion
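Model Deviance: compares the log-likelihood of the fitted logistic model with that of the perfectly fitting model (called the saturated model). The smaller the value of this deviance, the better the fit. The DEVIANCE statistic is given by:
$$\text{DEV}(X) = -2\sum_{i=1}^{n}\left[Y_i \ln \hat{\pi}(X_i) + (1-Y_i)\ln\left(1-\hat{\pi}(X_i)\right)\right].$$
Here $\hat{\pi}(X_i)$ is the estimate of the success probability for the predictor value $X_i$. Under the hypothesis that the logistic model is correct, the statistic DEV(X) approximately follows a chi-square distribution with n - 2 degrees of freedom (in general, n - p, where p - 1 is the number of predictor variables). Minitab computes its version over the 19 distinct values of X rather than the 25 individual observations, which is why its output reports 19 - 2 = 17 degrees of freedom.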
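To make the deviance concrete, the sketch below evaluates it from the fitted probabilities (the Est column of the SAS data step in this handout) in two ways: observation by observation, and pooled over distinct values of X. The function names are hypothetical, and the pooled version is an assumption about how a covariate-pattern deviance is usually formed, offered as a plausible reading of the Minitab figure rather than a statement of Minitab's internals.

```python
import math
from collections import defaultdict

# (MonExp, TskSucc, Est) rows copied from the SAS data step in this handout.
ROWS = [(14, 0, 0.310262), (29, 0, 0.835263), (6, 0, 0.109996),
        (25, 1, 0.726602), (18, 1, 0.461837), (4, 0, 0.082130),
        (18, 0, 0.461837), (12, 0, 0.245666), (22, 1, 0.620812),
        (6, 0, 0.109996), (30, 1, 0.856299), (11, 0, 0.216980),
        (30, 1, 0.856299), (5, 0, 0.095154), (20, 1, 0.542404),
        (13, 0, 0.276802), (9, 0, 0.167100), (32, 1, 0.891664),
        (24, 0, 0.693379), (13, 1, 0.276802), (19, 0, 0.502134),
        (4, 0, 0.082130), (28, 1, 0.811825), (22, 1, 0.620812),
        (8, 1, 0.145815)]

def binary_deviance(rows):
    """-2 * sum of y*ln(p) + (1-y)*ln(1-p) over individual observations."""
    return -2.0 * sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                      for _, y, p in rows)

def grouped_deviance(rows):
    """Deviance pooled over distinct x values; returns (deviance, patterns)."""
    groups = defaultdict(lambda: [0, 0, 0.0])   # x -> [n, successes, p]
    for x, y, p in rows:
        g = groups[x]
        g[0] += 1
        g[1] += y
        g[2] = p                                # same x gives the same p
    dev = 0.0
    for n, s, p in groups.values():
        for count, fitted in ((s, p), (n - s, 1.0 - p)):
            if count > 0:                       # 0*ln(0) is taken as 0
                dev += 2.0 * count * math.log((count / n) / fitted)
    return dev, len(groups)

if __name__ == "__main__":
    print(round(binary_deviance(ROWS), 3))
    dev, patterns = grouped_deviance(ROWS)
    print(round(dev, 3), patterns - 2)          # deviance and its DF
```

The observation-level value should agree with the "-2 Log L" entry for the fitted model in the SAS output (about 25.425), while the pooled value should be close to the Deviance line of the Minitab goodness-of-fit table (19.879 with 17 DF).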
Chi-Square Statistic: The data are grouped into classes according to their fitted logit values. Let there be c groups. For each group, determine the number of observed successes (denoted by $O_{j1}$) and the number of observed failures (denoted by $O_{j0}$). Also, for each group, obtain the expected successes and failures (denoted by $E_{j1}$ and $E_{j0}$). If the logistic regression model is appropriate, then the observed and expected frequencies for each of the cells/groupings will tend to be close to each other. This closeness, or lack thereof, is measured by the chi-square statistic given by:
$$X^2 = \sum_{j=1}^{c}\sum_{k=0}^{1}\frac{(O_{jk}-E_{jk})^2}{E_{jk}}.$$
If the model is appropriate, then this statistic approximately follows a chi-square distribution with c - 2 degrees of freedom, so to test the model, the statistic is compared to the $100(1-\alpha)$th percentile of the chi-square distribution with c - 2 degrees of freedom.
Some Diagnostic Plots
These diagnostic plots are obtained by computing the above statistics when a given observation is deleted.
Implementation Using SAS
THE PROGRAM
/* Logistic Regression Illustration */
data prgtask;
input MonExp TskSucc Est;
cards;
14 0 0.310262
29 0 0.835263
6 0 0.109996
25 1 0.726602
18 1 0.461837
4 0 0.082130
18 0 0.461837
12 0 0.245666
22 1 0.620812
6 0 0.109996
30 1 0.856299
11 0 0.216980
30 1 0.856299
5 0 0.095154
20 1 0.542404
13 0 0.276802
9 0 0.167100
32 1 0.891664
24 0 0.693379
13 1 0.276802
19 0 0.502134
4 0 0.082130
28 1 0.811825
22 1 0.620812
8 1 0.145815
;
proc print;
run;
proc logistic descending;
/* DESCENDING makes SAS model the probability that TskSucc = 1 (success) */
model TskSucc = MonExp / waldcl corrb covb itprint lackfit plcl plrl rsquare;
run;
The OUTPUT
Mon Tsk
Obs Exp Succ Est
1 14 0 0.31026
2 29 0 0.83526
3 6 0 0.11000
4 25 1 0.72660
5 18 1 0.46184
6 4 0 0.08213
7 18 0 0.46184
8 12 0 0.24567
9 22 1 0.62081
10 6 0 0.11000
11 30 1 0.85630
12 11 0 0.21698
13 30 1 0.85630
14 5 0 0.09515
15 20 1 0.54240
16 13 0 0.27680
17 9 0 0.16710
18 32 1 0.89166
19 24 0 0.69338
20 13 1 0.27680
21 19 0 0.50213
22 4 0 0.08213
23 28 1 0.81183
24 22 1 0.62081
25 8 1 0.14582
The LOGISTIC Procedure
Model Information
Data Set WORK.PRGTASK
Response Variable TskSucc
Number of Response Levels 2
Number of Observations 25
Link Function Logit
Optimization Technique Fisher's scoring
Response Profile
Ordered Total
Value TskSucc Frequency
1 1 11
2 0 14
Maximum Likelihood Iteration History
Iter Ridge -2 Log L Intercept MonExp
0 0 34.296490 -0.241162 0
1 0 25.732187 -2.401052 0.127956
2 0 25.428428 -2.982504 0.157626
3 0 25.424575 -3.058497 0.161427
4 0 25.424574 -3.059696 0.161486
Last Change in -2 Log L 9.1283891E-7
Last Evaluation of Gradient
Intercept MonExp
-1.577658E-7 5.635832E-7
Convergence criterion (GCONV=1E-8) satisfied.
The LOGISTIC Procedure
Model Fit Statistics
Intercept
Intercept and
Criterion Only Covariates
AIC 36.296 29.425
SC 37.515 31.862
-2 Log L 34.296 25.425
R-Square 0.2987 Max-rescaled R-Square 0.4003
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 8.8719 1 0.0029
Score 7.9742 1 0.0047
Wald 6.1760 1 0.0129
Analysis of Maximum Likelihood Estimates
Standard
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -3.0597 1.2594 5.9029 0.0151
MonExp 1 0.1615 0.0650 6.1760 0.0129
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
MonExp 1.175 1.035 1.335
Association of Predicted Probabilities and Observed Responses
Percent Concordant 82.5 Somers' D 0.662
Percent Discordant 16.2 Gamma 0.671
Percent Tied 1.3 Tau-a 0.340
Pairs 154 c 0.831
The LOGISTIC Procedure
Profile Likelihood Confidence
Interval for Parameters
Parameter Estimate 95% Confidence Limits
Intercept -3.0597 -6.0369 -0.9159
MonExp 0.1615 0.0500 0.3140
Wald Confidence Interval for Parameters
Parameter Estimate 95% Confidence Limits
Intercept -3.0597 -5.5280 -0.5914
MonExp 0.1615 0.0341 0.2888
Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect Unit Estimate 95% Confidence Limits
MonExp 1.0000 1.175 1.051 1.369
Estimated Covariance Matrix
Variable Intercept MonExp
Intercept 1.585967 -0.0754
MonExp -0.0754 0.004222
Estimated Correlation Matrix
Variable Intercept MonExp
Intercept 1.0000 -0.9214
MonExp -0.9214 1.0000
The LOGISTIC Procedure
Partition for the Hosmer and Lemeshow Test
TskSucc = 1 TskSucc = 0
Group Total Observed Expected Observed Expected
1 3 0 0.26 3 2.74
2 3 1 0.37 2 2.63
3 3 0 0.63 3 2.37
4 3 1 0.86 2 2.14
5 3 1 1.43 2 1.57
6 3 3 1.78 0 1.22
7 3 2 2.23 1 0.77
8 4 3 3.44 1 0.56
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
5.1453 6 0.5253
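The chi-square statistic described earlier under "Chi-Square Statistic" can be recomputed from the partition table above. This is a rough check only: the expected counts are the two-decimal values printed by SAS, so the result agrees with the reported 5.1453 only up to rounding. The function names are hypothetical, and the tail probability uses the closed form that exists for chi-square distributions with an even number of degrees of freedom.

```python
import math

# (observed 1s, expected 1s, observed 0s, expected 0s) for the 8 groups
PARTITION = [(0, 0.26, 3, 2.74), (1, 0.37, 2, 2.63), (0, 0.63, 3, 2.37),
             (1, 0.86, 2, 2.14), (1, 1.43, 2, 1.57), (3, 1.78, 0, 1.22),
             (2, 2.23, 1, 0.77), (3, 3.44, 1, 0.56)]

def grouped_chi_square(partition):
    """Sum of (O - E)^2 / E over both outcome columns of every group."""
    total = 0.0
    for o1, e1, o0, e0 in partition:
        total += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return total

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only for even df > 0."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total

if __name__ == "__main__":
    x2 = grouped_chi_square(PARTITION)
    df = len(PARTITION) - 2       # c - 2 degrees of freedom
    print(round(x2, 2), df, round(chi2_sf_even_df(x2, df), 2))
```

With c = 8 groups the reference distribution has 8 - 2 = 6 degrees of freedom, matching the DF column of the SAS test.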
Another Example
Multiple Logistic Regression
Study Considered (Example on page 582 but using the whole data set): To investigate an epidemic outbreak of a disease that is spread by mosquitoes, individuals were randomly sampled within two sectors in a city to determine whether each person had recently contracted the disease under study. The response variable was coded 1 = Yes, 0 = No. The predictor variables considered are:
- Age, a quantitative variable;
- SocioEconomic status, a qualitative variable taking the values Upper, Middle, and Lower, coded using two dummy variables as follows: (0, 0) = Upper, (1, 0) = Middle, and (0, 1) = Lower.
- CitySector, a qualitative variable taking the values Sector 1 (coded 0) and Sector 2 (coded 1), consistent with the variable label in the SAS program below.
To give you an idea of the data set, the plot below is a scatterplot of Disease Status versus Age.
Using Minitab, we fit a multiple logistic regression model. The results of this analysis are summarized next.
Binary Logistic Regression
Link Function: Logit
Response Information
Variable Value Count
DiseaseS 1 107 (Event)
0 89
Total 196
Logistic Regression Table
                                                   Odds      95% CI
Predictor      Coef     StDev      Z      P       Ratio   Lower   Upper
Constant     0.1963    0.7011   0.28  0.780
Age         0.03596   0.01001   3.59  0.000        1.04    1.02    1.06
SocEcoSt    -0.9768    0.2012  -4.85  0.000        0.38    0.25    0.56
SocEcoSt     0.7751    0.3584   2.16  0.031        2.17    1.08    4.38
CitySect    -0.0213    0.3927  -0.05  0.957        0.98    0.45    2.11
Log-Likelihood = -107.826
Test that all slopes are zero: G = 54.406, DF = 4, P-Value = 0.000
Goodness-of-Fit Tests
Method Chi-Square DF P
Pearson 165.767 165 0.469
Deviance 185.154 165 0.135
Hosmer-Lemeshow 9.343 8 0.314
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
                                 Group
Value      1     2     3     4     5     6     7     8     9    10  Total
1
  Obs      6     3     3     9     9    13    14    15    18    17    107
  Exp    2.9   4.5   6.0   8.2  10.5  11.5  15.0  14.4  16.5  17.6
0
  Obs     13    17    16    11    11     6     7     4     2     2     89
  Exp   16.1  15.5  13.0  11.8   9.5   7.5   6.0   4.6   3.5   1.4
Total     19    20    19    20    20    19    21    19    20    19    196
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs Number Percent Summary Measures
Concordant 7560 79.4% Somers' D 0.59
Discordant 1930 20.3% Goodman-Kruskal Gamma 0.59
Ties 33 0.3% Kendall's Tau-a 0.29
Total 9523 100.0%
CONCLUSIONS??
Question: Suppose now that we want to see the effect of SocioEconomic Status on Disease Outbreak, given that the predictors of AGE and CITY SECTOR are already in the model. To answer this question, we need to fit the reduced model which only contains AGE and CITY SECTOR as predictors in order to be able to compute the DEVIANCE statistic for SOCIOECONOMIC STATUS after accounting for AGE and CITY SECTOR. This statistic will be denoted by
DEV(SocEconStat | Age, City Sector) = DEV(Age, City Sector) - DEV(Age, SocEconStat, CitySect).
This is called the partial deviance and is analogous to the extra sum of squares idea in multiple linear regression.
The results of fitting the reduced model are given below:
Binary Logistic Regression
Link Function: Logit
Response Information
Variable Value Count
DiseaseS 1 107 (Event)
0 89
Total 196
Logistic Regression Table
                                                   Odds      95% CI
Predictor      Coef     StDev      Z      P       Ratio   Lower   Upper
Constant    -0.6875    0.2599  -2.65  0.008
Age        0.034064  0.009345   3.65  0.000        1.03    1.02    1.05
CitySect     0.1739    0.3449   0.50  0.614        1.19    0.61    2.34
Log-Likelihood = -126.065
Test that all slopes are zero: G = 17.928, DF = 2, P-Value = 0.000
Goodness-of-Fit Tests
Method Chi-Square DF P
Pearson 93.182 91 0.417
Deviance 119.708 91 0.023
Hosmer-Lemeshow 13.116 8 0.108
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
                                 Group
Value      1     2     3     4     5     6     7     8     9    10  Total
1
  Obs     11     8     9     5    10    12    11     9    13    19    107
  Exp    7.5   7.4   8.9   8.9   9.6  10.9  11.2  12.9  13.7  16.0
0
  Obs     10    11    12    14     9     8     8    11     6     0     89
  Exp   13.5  11.6  12.1  10.1   9.4   9.1   7.8   7.1   5.3   3.0
Total     21    19    21    19    19    20    19    20    19    19    196
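The partial deviance for SocioEconomic Status can now be worked out from the two fitted models. The sketch below is a hedged illustration with hypothetical names: it uses the Log-Likelihood lines of the full and reduced Minitab fits, since the difference in model deviances equals the difference in -2 log-likelihood. Socioeconomic status enters through two dummy variables, so the reference distribution has 2 degrees of freedom, for which the chi-square upper tail is simply exp(-x/2).

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 DF."""
    return math.exp(-x / 2.0)

def partial_deviance(ll_reduced, ll_full):
    """Difference in -2 log-likelihood between reduced and full models."""
    return -2.0 * (ll_reduced - ll_full)

# Log-likelihoods read from the two Minitab outputs in this handout.
LL_FULL = -107.826      # Age, SocEcoSt (two dummies), CitySect
LL_REDUCED = -126.065   # Age, CitySect only

dev_socio = partial_deviance(LL_REDUCED, LL_FULL)
p_value = chi2_sf_2df(dev_socio)

if __name__ == "__main__":
    print(round(dev_socio, 3), p_value)
```

The same number is obtained by subtracting the two "Test that all slopes are zero" statistics, 54.406 - 17.928 = 36.478. A value this large on 2 degrees of freedom gives a very small p-value, which indicates that socioeconomic status adds information beyond Age and City Sector.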
SAS IMPLEMENTATION
/* Multiple Logistic Regression */
data DisOut;
input ObsNum Age SocEcD1 SocEcD2 CitySect DisSta;
label SocEcD1 = "Indicator for Middle SocioEcon Status"
SocEcD2 = "Indicator for Lower SocioEcon Status"
CitySect = "City Sector (0 = Sector 1)"
DisSta = "Disease Status (1=Diseased)";
Cards;
(Data Set to be Inserted here)
run;
proc print;
run;
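proc logistic;
/* No DESCENDING option here, so SAS models the probability that DisSta = 0; */
/* the signs of the estimates below are therefore opposite to Minitab's.     */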
model DisSta = Age SocEcD1 SocEcD2 CitySect / itprint plcl plrl rsquare lackfit;
run;
THE OUTPUT
Data Set: WORK.DISOUT
Response Variable: DISSTA Disease Status (1=Diseased)
Response Levels: 2
Number of Observations: 196
Link Function: Logit
Response Profile
Ordered
Value DISSTA Count
1 0 89
2 1 107
Maximum Likelihood Iterative Phase
Iter Step -2 Log L INTERCPT AGE SOCECD1 SOCECD2 CITYSECT
0 INITIAL 270.058302 -0.184192 0 0 0 0
1 IRLS 217.769394 -0.316255 -0.025919 0.811906 -0.580761 0.016157
2 IRLS 215.679039 -0.213630 -0.034772 0.958907 -0.751783 0.021381
3 IRLS 215.652532 -0.196544 -0.035939 0.976515 -0.774711 0.021312
4 IRLS 215.652526 -0.196262 -0.035956 0.976768 -0.775062 0.021305
5 IRLS 215.652526 -0.196262 -0.035956 0.976768 -0.775062 0.021305
Last Change in -2 Log L: 1.136868E-13
Last Evaluation of Gradient
INTERCPT AGE SOCECD1 SOCECD2 CITYSECT
-8.223913E-7 -0.000054821 -4.526762E-7 -1.795719E-6 -5.701351E-7
The LOGISTIC Procedure
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 272.058 225.653 .
SC 275.336 242.043 .
-2 LOG L 270.058 215.653 54.406 with 4 DF (p=0.0001)
Score . . 48.404 with 4 DF (p=0.0001)
RSquare = 0.2424 Max-rescaled RSquare = 0.3241
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 -0.1963 0.7011 0.0784 0.7795 . .
AGE 1 -0.0360 0.0100 12.9026 0.0003 -0.374763 0.965
SOCECD1 1 0.9768 0.2012 23.5669 0.0001 0.467169 2.656
SOCECD2 1 -0.7751 0.3584 4.6760 0.0306 -0.210140 0.461
CITYSECT 1 0.0213 0.3927 0.0029 0.9567 0.005348 1.022
Association of Predicted Probabilities and Observed Responses
Concordant = 79.4% Somers' D = 0.590
Discordant = 20.4% Gamma = 0.591
Tied = 0.2% Tau-a = 0.294
(9523 pairs) c = 0.795
Parameter Estimates and 95% Confidence Intervals
Profile Likelihood
Confidence Limits
Parameter
Variable Estimate Lower Upper
INTERCPT -0.1963 -1.5875 1.1740
AGE -0.0360 -0.0565 -0.0170
SOCECD1 0.9768 0.5926 1.3843
SOCECD2 -0.7751 -1.4879 -0.0772
CITYSECT 0.0213 -0.7506 0.7961
Conditional Odds Ratios and 95% Confidence Intervals
Profile Likelihood
Confidence Limits
Odds
Variable Unit Ratio Lower Upper
AGE 1.0000 0.965 0.945 0.983
SOCECD1 1.0000 2.656 1.809 3.992
SOCECD2 1.0000 0.461 0.226 0.926
CITYSECT 1.0000 1.022 0.472 2.217
Hosmer and Lemeshow Goodness-of-Fit Test
DISSTA = 0 DISSTA = 1
-------------------- --------------------
Group Total Observed Expected Observed Expected
1 21 2 1.71 19 19.29
2 20 2 3.71 18 16.29
3 20 5 4.95 15 15.05
4 20 7 5.90 13 14.10
5 21 6 8.62 15 12.38
6 20 12 9.78 8 10.22
7 21 14 13.05 7 7.95
8 20 15 14.11 5 5.89
9 20 17 16.03 3 3.97
10 13 9 11.14 4 1.86
Goodness-of-fit Statistic = 7.1833 with 8 DF (p=0.5170)
______
SELECTING BEST VARIABLES
You may also use SAS to select the variables to include in your model. You do this by using the INCLUDE = p and SELECTION = STEPWISE options in the MODEL statement. The value of p tells SAS to keep the first p variables listed in every model. Thus, for the above data set, we could use the command
proc logistic;
model DisSta =SocEcD1 SocEcD2 CitySect Age / include = 2 selection=stepwise;
run;
The relevant part of the output is given below:
Stepwise Selection Procedure
The following variables will be included in each model:
INTERCPT SOCECD1 SOCECD2
Step 0. The INCLUDE variables were entered.
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 272.058 236.851 .
SC 275.336 246.685 .
-2 LOG L 270.058 230.851 39.207 with 2 DF (p=0.0001)
Score . . 37.067 with 2 DF (p=0.0001)
Residual Chi-Square = 14.7090 with 2 DF (p=0.0006)
Step 1. Variable AGE entered:
Model Fitting Information and Testing Global Null Hypothesis BETA=0
Intercept
Intercept and
Criterion Only Covariates Chi-Square for Covariates
AIC 272.058 223.655 .
SC 275.336 236.768 .
-2 LOG L 270.058 215.655 54.403 with 3 DF (p=0.0001)
Score . . 48.402 with 3 DF (p=0.0001)
Residual Chi-Square = 0.0029 with 1 DF (p=0.9567)
NOTE: No (additional) variables met the 0.05 significance level for entry into the model.
Summary of Stepwise Procedure
Variable Number Score Wald Pr >
Step Entered Removed In Chi-Square Chi-Square Chi-Square
1 AGE 3 14.7000 . 0.0001
Analysis of Maximum Likelihood Estimates
Parameter Standard Wald Pr > Standardized Odds
Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio
INTERCPT 1 -0.2009 0.6960 0.0833 0.7729 . .
SOCECD1 1 0.9772 0.2010 23.6272 0.0001 0.467385 2.657
SOCECD2 1 -0.7700 0.3463 4.9459 0.0262 -0.208780 0.463
AGE 1 -0.0358 0.00978 13.4347 0.0002 -0.373561 0.965
Association of Predicted Probabilities and Observed Responses
Concordant = 79.4% Somers' D = 0.591
Discordant = 20.3% Gamma = 0.592
Tied = 0.3% Tau-a = 0.294
(9523 pairs) c = 0.795
Conclusions: The stepwise procedure determined that the variable City Sector is not an important predictor.
Note: If you omit the option INCLUDE = 2, the procedure will also check whether the SocioEconomic variables are important. Below is the program and output:
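The fit criteria in the stepwise output can be reproduced from the "-2 LOG L" values. The sketch below assumes the usual definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), where k counts the estimated parameters including the intercept; these definitions are not stated in the handout, and the function names are hypothetical. It also compares the Step 1 model with the full four-predictor model fitted earlier to gauge what City Sector adds.

```python
import math

N = 196  # number of observations

def aic(neg2_log_l, k):
    """Akaike criterion: -2 Log L plus 2 per estimated parameter."""
    return neg2_log_l + 2 * k

def sc(neg2_log_l, k, n=N):
    """Schwarz criterion: -2 Log L plus ln(n) per estimated parameter."""
    return neg2_log_l + k * math.log(n)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 DF."""
    return math.erfc(math.sqrt(x / 2.0))

M2LL_STEP0 = 230.851  # intercept + SocEcD1 + SocEcD2        (k = 3)
M2LL_STEP1 = 215.655  # Step 0 model + Age                   (k = 4)
M2LL_FULL = 215.653   # Step 1 model + CitySect, fitted earlier (k = 5)

city_change = M2LL_STEP1 - M2LL_FULL   # drop in -2 Log L from CitySect
p_city = chi2_sf_1df(city_change)

if __name__ == "__main__":
    print(round(aic(M2LL_STEP0, 3), 3), round(sc(M2LL_STEP0, 3), 3))
    print(round(aic(M2LL_STEP1, 4), 3), round(sc(M2LL_STEP1, 4), 3))
    print(round(city_change, 3), round(p_city, 2))
```

Because the printed -2 Log L values are rounded to three decimals, the City Sector change of about 0.002 is only approximate; it is of the same order as the residual chi-square of 0.0029 that SAS reports at Step 1, and both point to the same conclusion.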
Relevant Program Portion:
proc logistic;
model DisSta = SocEcD1 SocEcD2 CitySect Age / selection=stepwise;
run;