Bios 523 Handout On

Stat 701 Handout on

Binary Logistic Regression

The Study of Interest (Example on page 575 of text): The data below come from a study assessing programmers' ability to complete a complex programming task within a specified time, and relating this ability to the programmer's experience level. Twenty-five programmers took part in the study, and all were given the same task. The data set from the study is given below.

X = Months of Programming Experience;

Y = Success in Task (1 = Successful, 0 = Failure).

Note that X, the predictor variable, is a quantitative variable, while Y, the response variable, is a dichotomous qualitative variable.

The scatterplot of the data is given below.

The problem is to obtain a model for relating the response variable (Y) to the predictor variable (X). The model utilized is called the logistic regression model described as follows:

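Let $\pi(x)$ be the conditional probability of observing a successful outcome in performing the task when the level of programming experience of the subject is $x$. In the logistic regression model it is assumed that

$$\pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}.$$

This is equivalent to assuming that

$$\ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x.$$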

Here are two graphs of this logistic function corresponding to two sets of values of $(\beta_0, \beta_1)$. Note that one of the graphs would be a very bad model for the data above, while the other might be a good model for the success probability in the programming data.
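To make the shape of the logistic function concrete, here is a minimal Python sketch that evaluates the success probability for two coefficient pairs. The two pairs are hypothetical values chosen for illustration; they are not the values behind the handout's graphs. The function names `logistic` and `logit` are likewise our own.

```python
import math

def logistic(x, beta0, beta1):
    """Success probability: exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log-odds ln(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Two hypothetical coefficient pairs (assumed for illustration only).
increasing = (-3.0, 0.16)   # success probability rises with experience
decreasing = (3.0, -0.16)   # success probability falls with experience

for months in (5, 15, 25):
    p_up = logistic(months, *increasing)
    p_down = logistic(months, *decreasing)
    print(months, round(p_up, 3), round(p_down, 3))
```

Since more experienced programmers would be expected to succeed more often, only a curve with a positive slope coefficient is a plausible candidate for the programming data.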

Interpretation of the Coefficients (discussed in more detail in class):

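$\beta_0$ = intercept term for the linear model of the log-odds.

First, the ODDS associated with the probability $\pi(x)$ is given by

$$\mathrm{ODDS}(x) = \frac{\pi(x)}{1 - \pi(x)} = \exp(\beta_0 + \beta_1 x).$$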

The coefficient $\beta_1$ could be interpreted in several ways.

It could be viewed as the change in the value of the log-odds when the value of the predictor variable is increased by one unit.
$\exp(\beta_1)$ could also be interpreted as the ODDS RATIO (OR), which is the ratio of the odds when the predictor value is $(x+1)$ to the odds when the predictor value is $x$. Symbolically,

$$\mathrm{OR} = \frac{\mathrm{ODDS}(x+1)}{\mathrm{ODDS}(x)} = \frac{\exp(\beta_0 + \beta_1 (x+1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1).$$

Thus, $\beta_1$ could also be interpreted as the LOGARITHM of the ODDS RATIO, that is, $\beta_1 = \ln(\mathrm{OR})$.
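The odds-ratio interpretation can be checked numerically. The short Python sketch below uses hypothetical coefficient values (assumed for illustration) and confirms that the ratio of the odds at x + 1 to the odds at x does not depend on x and equals the exponential of the slope coefficient.

```python
import math

def odds(x, beta0, beta1):
    """Odds of success at x under the logistic model: p / (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)

beta0, beta1 = -3.0, 0.16   # hypothetical coefficients, for illustration only

# The ratio odds(x + 1) / odds(x) is the same at every x.
ratios = [odds(x + 1, beta0, beta1) / odds(x, beta0, beta1) for x in (0, 10, 20)]
odds_ratio = math.exp(beta1)

print([round(r, 6) for r in ratios], round(odds_ratio, 6))
```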

Estimation and Testing when Dealing with Logistic Model

Maximum Likelihood Estimation Procedure.
Hypothesis testing is via likelihood ratio tests.

We will not go into any detail about these methods of inference, but simply illustrate them using the results from the logistic regression analysis in Minitab. It should be noted that there are no closed-form expressions for the regression coefficient estimates. They are obtained iteratively, with the iterations seeking the coefficient values that maximize the likelihood function. As such, the estimation procedure is computer-intensive.
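To give a feel for the iterative procedure, here is a minimal Python sketch of Newton-Raphson iterations for the one-predictor logistic model, applied to the programming data (copied from the SAS program later in this handout). It is a sketch under stated assumptions, not the exact algorithm inside Minitab or SAS: the starting values (intercept-only fit, zero slope) and the stopping rule are our own choices.

```python
import math

# Programming-task data: months of experience (X) and task success (Y).
X = [14, 29, 6, 25, 18, 4, 18, 12, 22, 6, 30, 11, 30,
     5, 20, 13, 9, 32, 24, 13, 19, 4, 28, 22, 8]
Y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
     0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

def prob(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1):
    return sum(y * math.log(prob(x, b0, b1))
               + (1 - y) * math.log(1 - prob(x, b0, b1))
               for x, y in zip(X, Y))

# Assumed start: intercept-only fit, b0 = ln(successes / failures), b1 = 0.
b0, b1 = math.log(sum(Y) / (len(Y) - sum(Y))), 0.0
history = [log_likelihood(b0, b1)]

for _ in range(25):
    # Gradient (g) and information matrix (i) of the log-likelihood.
    g0 = g1 = i00 = i01 = i11 = 0.0
    for x, y in zip(X, Y):
        p = prob(x, b0, b1)
        w = p * (1 - p)
        g0 += y - p
        g1 += (y - p) * x
        i00 += w
        i01 += w * x
        i11 += w * x * x
    # Newton step: solve the 2x2 system (information) * delta = gradient.
    det = i00 * i11 - i01 * i01
    d0 = (i11 * g0 - i01 * g1) / det
    d1 = (i00 * g1 - i01 * g0) / det
    b0, b1 = b0 + d0, b1 + d1
    history.append(log_likelihood(b0, b1))
    if abs(history[-1] - history[-2]) < 1e-10:
        break

print(round(b0, 4), round(b1, 4), [round(ll, 3) for ll in history])
```

The log-likelihood starts at about -17.148 and settles at about -12.712, in line with the log-likelihood values reported in the software output in this handout.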

We now illustrate the results of the Minitab Analysis.

Binary Logistic Regression

(Minitab Output)

Step Log-Likelihood

0 -17.148

1 -12.866

2 -12.714

3 -12.712

4 -12.712

5 -12.712

Link Function: Logit

Response Information

Variable Value Count

TaskSucc 1 11 (Event)

0 14

Total 25

Logistic Regression Table

Predictor      Coef     StDev      Z      P   Odds Ratio   95% CI Lower   95% CI Upper

Constant     -3.060     1.259  -2.43  0.015

MonOfExp    0.16149   0.06498   2.49  0.013         1.18           1.03           1.33

Log-Likelihood = -12.712

Test that all slopes are zero: G = 8.872, DF = 1, P-Value = 0.003

Goodness-of-Fit Tests

Method Chi-Square DF P

Pearson 19.623 17 0.294

Deviance 19.879 17 0.280

Hosmer-Lemeshow 5.946 8 0.653

Table of Observed and Expected Frequencies:

(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                               Group
Value        1     2     3     4     5     6     7     8     9    10  Total
1   Obs      0     0     1     1     1     1     2     1     1     3     11
    Exp    0.2   0.3   0.3   1.0   1.2   1.0   1.2   1.4   1.6   2.6
0   Obs      2     3     1     3     2     1     0     1     1     0     14
    Exp    1.8   2.7   1.7   3.0   1.8   1.0   0.8   0.6   0.4   0.4
Total        2     3     2     4     3     2     2     2     2     3     25

Measures of Association:

(Between the Response Variable and Predicted Probabilities)

Pairs Number Percent Summary Measures

Concordant 127 82.5% Somers' D 0.66

Discordant 25 16.2% Goodman-Kruskal Gamma 0.67

Ties 2 1.3% Kendall's Tau-a 0.34

Total 154 100.0%
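Several entries in the Minitab logistic regression table above can be reproduced by hand from the coefficient, its standard error, and the two log-likelihood values. The Python sketch below does this. The 1.96 multiplier assumes a normal-approximation 95% interval, and the p-value uses the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal.

```python
import math

# Values copied from the Minitab output above.
coef, se = 0.16149, 0.06498          # MonOfExp coefficient and standard error
ll_null, ll_fit = -17.148, -12.712   # log-likelihood at step 0 and at convergence

z = coef / se                        # Z statistic
odds_ratio = math.exp(coef)          # estimated odds ratio
ci = (math.exp(coef - 1.96 * se),    # 95% confidence limits for the odds ratio
      math.exp(coef + 1.96 * se))

g = 2 * (ll_fit - ll_null)           # likelihood ratio statistic G
# For 1 degree of freedom, P(chi-square > g) = erfc(sqrt(g / 2)).
p_value = math.erfc(math.sqrt(g / 2))

print(round(z, 2), round(odds_ratio, 2), [round(v, 2) for v in ci],
      round(g, 3), round(p_value, 3))
```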

A Goodness-Of-Fit Criterion

Model Deviance: compares the log-likelihood of the fitted logistic model with that of the perfectly fitting model (called the saturated model). The smaller the value of this deviance, the better the fit. The DEVIANCE statistic is given by:

$$\mathrm{DEV}(X) = -2 \sum_{i=1}^{n} \left[ Y_i \ln \hat{\pi}(X_i) + (1 - Y_i) \ln\left(1 - \hat{\pi}(X_i)\right) \right].$$

Here $\hat{\pi}(X_i)$ is the estimate of the success probability for the predictor value $X_i$. Under the hypothesis that the logistic model is correct, the statistic DEV(X) follows approximately a chi-square distribution with $n - 2$ degrees of freedom (in general, $n - p$, where $p - 1$ is the number of predictor variables).
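As a worked illustration, the Python sketch below computes the deviance for the programming data in two ways. The coefficient values are the converged estimates from the SAS iteration history later in this handout. The ungrouped version treats each of the 25 programmers separately, so the saturated model fits perfectly and contributes nothing. The grouped version pools programmers with the same months of experience and subtracts the saturated model's contribution. The grouping step is our own reconstruction; it appears to be what lies behind the deviance of 19.879 on 17 degrees of freedom in the Minitab output.

```python
import math
from collections import defaultdict

X = [14, 29, 6, 25, 18, 4, 18, 12, 22, 6, 30, 11, 30,
     5, 20, 13, 9, 32, 24, 13, 19, 4, 28, 22, 8]
Y = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,
     0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
b0, b1 = -3.059696, 0.161486   # converged estimates (SAS iteration history)

def prob(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Ungrouped deviance: -2 * sum[ y ln(p) + (1 - y) ln(1 - p) ].
dev_ungrouped = -2 * sum(y * math.log(prob(x)) + (1 - y) * math.log(1 - prob(x))
                         for x, y in zip(X, Y))

# Pool observations that share the same X value: x -> [successes, trials].
groups = defaultdict(lambda: [0, 0])
for x, y in zip(X, Y):
    groups[x][0] += y
    groups[x][1] += 1

def xlogy(a, b):
    """a * ln(b), with the convention 0 * ln(0) = 0."""
    return 0.0 if a == 0 else a * math.log(b)

# Saturated log-likelihood: each group's probability is its observed proportion.
ll_saturated = sum(xlogy(s, s / n) + xlogy(n - s, (n - s) / n)
                   for s, n in groups.values())
dev_grouped = dev_ungrouped + 2 * ll_saturated
df_grouped = len(groups) - 2   # distinct X values minus estimated parameters

print(round(dev_ungrouped, 3), round(dev_grouped, 3), df_grouped)
```

The ungrouped value agrees with the -2 Log L of 25.425 in the SAS output, while the grouped value and its degrees of freedom agree with the Minitab deviance line.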

Chi-Square Statistic: The data are grouped into classes according to their fitted logit values. Let there be $c$ groups. For each group, determine the number of observed successes (denoted by $O_{j1}$) and the number of observed failures (denoted by $O_{j0}$). Also, for each group, obtain the expected numbers of successes and failures (denoted by $E_{j1}$ and $E_{j0}$). If the logistic regression model is appropriate, then the observed and expected frequencies in each cell will tend to be close to each other. This closeness, or lack thereof, is measured by the chi-square statistic given by:

$$\chi^2 = \sum_{j=1}^{c} \sum_{k=0}^{1} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}.$$

If the model is appropriate, then this statistic follows approximately a chi-square distribution with $c - 2$ degrees of freedom, so to test the model it is compared to the $100(1-\alpha)$th percentile of the chi-square distribution with $c - 2$ degrees of freedom.

Some Diagnostic Plots

These diagnostic plots are obtained by computing the above statistics when a given observation is deleted.

Implementation Using SAS

THE PROGRAM

/* Logistic Regression Illustration */

data prgtask;

input MonExp TskSucc Est;

cards;

14 0 0.310262

29 0 0.835263

6 0 0.109996

25 1 0.726602

18 1 0.461837

4 0 0.082130

18 0 0.461837

12 0 0.245666

22 1 0.620812

6 0 0.109996

30 1 0.856299

11 0 0.216980

30 1 0.856299

5 0 0.095154

20 1 0.542404

13 0 0.276802

9 0 0.167100

32 1 0.891664

24 0 0.693379

13 1 0.276802

19 0 0.502134

4 0 0.082130

28 1 0.811825

22 1 0.620812

8 1 0.145815

;

proc print;

run;

proc logistic descending;

/* The keyword DESCENDING is to indicate that 1=Success */

model TskSucc = MonExp / waldcl corrb covb itprint lackfit plcl plrl rsquare;

run;

The OUTPUT

Mon Tsk

Obs Exp Succ Est

1 14 0 0.31026

2 29 0 0.83526

3 6 0 0.11000

4 25 1 0.72660

5 18 1 0.46184

6 4 0 0.08213

7 18 0 0.46184

8 12 0 0.24567

9 22 1 0.62081

10 6 0 0.11000

11 30 1 0.85630

12 11 0 0.21698

13 30 1 0.85630

14 5 0 0.09515

15 20 1 0.54240

16 13 0 0.27680

17 9 0 0.16710

18 32 1 0.89166

19 24 0 0.69338

20 13 1 0.27680

21 19 0 0.50213

22 4 0 0.08213

23 28 1 0.81183

24 22 1 0.62081

25 8 1 0.14582

The LOGISTIC Procedure

Model Information

Data Set WORK.PRGTASK

Response Variable TskSucc

Number of Response Levels 2

Number of Observations 25

Link Function Logit

Optimization Technique Fisher's scoring

Response Profile

Ordered Total

Value TskSucc Frequency

1 1 11

2 0 14

Maximum Likelihood Iteration History

Iter Ridge -2 Log L Intercept MonExp

0 0 34.296490 -0.241162 0

1 0 25.732187 -2.401052 0.127956

2 0 25.428428 -2.982504 0.157626

3 0 25.424575 -3.058497 0.161427

4 0 25.424574 -3.059696 0.161486

Last Change in -2 Log L 9.1283891E-7

Last Evaluation of Gradient

Intercept MonExp

-1.577658E-7 5.635832E-7

Convergence criterion (GCONV=1E-8) satisfied.

The LOGISTIC Procedure

Model Fit Statistics

Criterion    Intercept Only    Intercept and Covariates

AIC                  36.296                      29.425

SC                   37.515                      31.862

-2 Log L             34.296                      25.425

R-Square 0.2987 Max-rescaled R-Square 0.4003
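The fit statistics above all derive from the two -2 Log L values. The Python sketch below reproduces them using the standard definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), where k is the number of estimated parameters, together with the generalized R-square and its max-rescaled version. The more precise -2 Log L values are taken from the iteration history; the variable names are our own.

```python
import math

n = 25                         # number of observations
neg2ll_null = 34.296490        # -2 Log L, intercept only (iteration 0)
neg2ll_fit = 25.424574         # -2 Log L, intercept and covariates (iteration 4)
k_null, k_fit = 1, 2           # number of estimated parameters in each model

aic_null = neg2ll_null + 2 * k_null
aic_fit = neg2ll_fit + 2 * k_fit
sc_null = neg2ll_null + k_null * math.log(n)
sc_fit = neg2ll_fit + k_fit * math.log(n)

# Generalized R-square, and its rescaling by the largest attainable value.
g = neg2ll_null - neg2ll_fit            # likelihood ratio chi-square
r2 = 1 - math.exp(-g / n)
r2_max = 1 - math.exp(-neg2ll_null / n)
r2_rescaled = r2 / r2_max

print(round(aic_null, 3), round(aic_fit, 3), round(sc_null, 3), round(sc_fit, 3))
print(round(g, 4), round(r2, 4), round(r2_rescaled, 4))
```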

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 8.8719 1 0.0029

Score 7.9742 1 0.0047

Wald 6.1760 1 0.0129

Analysis of Maximum Likelihood Estimates

Standard

Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -3.0597 1.2594 5.9029 0.0151

MonExp 1 0.1615 0.0650 6.1760 0.0129

Odds Ratio Estimates

Point 95% Wald

Effect Estimate Confidence Limits

MonExp 1.175 1.035 1.335

Association of Predicted Probabilities and Observed Responses

Percent Concordant 82.5 Somers' D 0.662

Percent Discordant 16.2 Gamma 0.671

Percent Tied 1.3 Tau-a 0.340

Pairs 154 c 0.831
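The association measures above are simple functions of the pair counts. Every pair consists of one success and one failure, so there are 11 x 14 = 154 pairs. The Python sketch below uses the concordant, discordant, and tied counts reported in the Minitab output for the same data and applies the usual definitions of these measures.

```python
n_events, n_nonevents = 11, 14             # successes and failures
concordant, discordant, tied = 127, 25, 2  # pair counts from the Minitab output

pairs = n_events * n_nonevents             # one success paired with one failure
n = n_events + n_nonevents

somers_d = (concordant - discordant) / pairs
gamma = (concordant - discordant) / (concordant + discordant)
tau_a = (concordant - discordant) / (n * (n - 1) / 2)
c_stat = (concordant + 0.5 * tied) / pairs

print(pairs, round(somers_d, 3), round(gamma, 3), round(tau_a, 3), round(c_stat, 3))
```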

The LOGISTIC Procedure

Profile Likelihood Confidence

Interval for Parameters

Parameter Estimate 95% Confidence Limits

Intercept -3.0597 -6.0369 -0.9159

MonExp 0.1615 0.0500 0.3140

Wald Confidence Interval for Parameters

Parameter Estimate 95% Confidence Limits

Intercept -3.0597 -5.5280 -0.5914

MonExp 0.1615 0.0341 0.2888

Profile Likelihood Confidence Interval for Adjusted Odds Ratios

Effect Unit Estimate 95% Confidence Limits

MonExp 1.0000 1.175 1.051 1.369

Estimated Covariance Matrix

Variable Intercept MonExp

Intercept 1.585967 -0.0754

MonExp -0.0754 0.004222

Estimated Correlation Matrix

Variable Intercept MonExp

Intercept 1.0000 -0.9214

MonExp -0.9214 1.0000
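The covariance and correlation matrices above are tied to the standard errors reported earlier: the square roots of the diagonal covariance entries are the standard errors, and the correlation is the covariance divided by their product. A short Python check follows; small discrepancies are expected because the printed covariance is rounded.

```python
import math

# Entries of the estimated covariance matrix printed above.
var_intercept, var_slope, cov = 1.585967, 0.004222, -0.0754

se_intercept = math.sqrt(var_intercept)   # standard error of the intercept
se_slope = math.sqrt(var_slope)           # standard error of the MonExp coefficient
corr = cov / (se_intercept * se_slope)    # correlation between the two estimates

print(round(se_intercept, 4), round(se_slope, 4), round(corr, 3))
```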

The LOGISTIC Procedure

Partition for the Hosmer and Lemeshow Test

TskSucc = 1 TskSucc = 0

Group Total Observed Expected Observed Expected

1 3 0 0.26 3 2.74

2 3 1 0.37 2 2.63

3 3 0 0.63 3 2.37

4 3 1 0.86 2 2.14

5 3 1 1.43 2 1.57

6 3 3 1.78 0 1.22

7 3 2 2.23 1 0.77

8 4 3 3.44 1 0.56

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

5.1453 6 0.5253
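The chi-square goodness-of-fit statistic described earlier can be recomputed from the partition table above. The Python sketch below sums the squared differences between observed and expected counts, divided by the expected counts, over all cells. Because the expected counts are printed to two decimals, the result agrees with the SAS value only approximately. The helper `chi2_sf_even` is our own; it uses a closed-form upper-tail probability that holds only for an even number of degrees of freedom.

```python
import math

# (observed, expected) counts for the 8 groups in the SAS partition table.
successes = [(0, 0.26), (1, 0.37), (0, 0.63), (1, 0.86),
             (1, 1.43), (3, 1.78), (2, 2.23), (3, 3.44)]
failures = [(3, 2.74), (2, 2.63), (3, 2.37), (2, 2.14),
            (2, 1.57), (0, 1.22), (1, 0.77), (1, 0.56)]

chi_sq = sum((o - e) ** 2 / e for o, e in successes + failures)
df = len(successes) - 2          # number of groups minus 2

def chi2_sf_even(x, df):
    """P(chi-square > x); closed form valid only when df is even."""
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

p_value = chi2_sf_even(chi_sq, df)
print(round(chi_sq, 3), df, round(p_value, 3))
```

A large p-value such as this one gives no evidence against the fitted logistic model.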

Another Example

Multiple Logistic Regression

Study Considered (Example on page 582 but using the whole data set): To investigate an epidemic outbreak of a disease that is spread by mosquitoes, individuals were randomly sampled within two sectors of a city to determine whether each person had recently contracted the disease under study. The response variable was coded 1 = Yes, 0 = No. The predictor variables considered are:

Age, a quantitative variable;
SocioEconomic status, a qualitative variable taking the values Upper, Middle, and Lower, which was coded using two dummy variables as follows: (0, 0) = Upper, (1, 0) = Middle, and (0, 1) = Lower.
CitySector, which is a qualitative variable taking values Sector 1 (coded 1) and Sector 2 (coded 2).

To give you an idea of the data set, the plot below is a scatterplot of Disease Status versus Age.

Using Minitab, we fit a multiple logistic regression model. The results of this analysis are summarized next.

Binary Logistic Regression

Link Function: Logit

Response Information

Variable Value Count

DiseaseS 1 107 (Event)

0 89

Total 196

Logistic Regression Table

Predictor      Coef     StDev      Z      P   Odds Ratio   95% CI Lower   95% CI Upper

Constant     0.1963    0.7011   0.28  0.780

Age         0.03596   0.01001   3.59  0.000         1.04           1.02           1.06

SocEcoSt    -0.9768    0.2012  -4.85  0.000         0.38           0.25           0.56

SocEcoSt     0.7751    0.3584   2.16  0.031         2.17           1.08           4.38

CitySect    -0.0213    0.3927  -0.05  0.957         0.98           0.45           2.11

Log-Likelihood = -107.826

Test that all slopes are zero: G = 54.406, DF = 4, P-Value = 0.000

Goodness-of-Fit Tests

Method Chi-Square DF P

Pearson 165.767 165 0.469

Deviance 185.154 165 0.135

Hosmer-Lemeshow 9.343 8 0.314

Table of Observed and Expected Frequencies:

(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                               Group
Value        1     2     3     4     5     6     7     8     9    10  Total
1   Obs      6     3     3     9     9    13    14    15    18    17    107
    Exp    2.9   4.5   6.0   8.2  10.5  11.5  15.0  14.4  16.5  17.6
0   Obs     13    17    16    11    11     6     7     4     2     2     89
    Exp   16.1  15.5  13.0  11.8   9.5   7.5   6.0   4.6   3.5   1.4
Total       19    20    19    20    20    19    21    19    20    19    196

Measures of Association:

(Between the Response Variable and Predicted Probabilities)

Pairs Number Percent Summary Measures

Concordant 7560 79.4% Somers' D 0.59

Discordant 1930 20.3% Goodman-Kruskal Gamma 0.59

Ties 33 0.3% Kendall's Tau-a 0.29

Total 9523 100.0%

CONCLUSIONS??

Question: Suppose now that we want to see the effect of SocioEconomic Status on Disease Outbreak, given that the predictors of AGE and CITY SECTOR are already in the model. To answer this question, we need to fit the reduced model which only contains AGE and CITY SECTOR as predictors in order to be able to compute the DEVIANCE statistic for SOCIOECONOMIC STATUS after accounting for AGE and CITY SECTOR. This statistic will be denoted by

DEV(SocEconStat | Age, City Sector) = DEV(Age, City Sector) - DEV(Age, SocEconStat, CitySect).

This is called the partial deviance and is analogous to the extra sum of squares idea in multiple linear regression.

The results of fitting the reduced model are given below:

Binary Logistic Regression

Link Function: Logit

Response Information

Variable Value Count

DiseaseS 1 107 (Event)

0 89

Total 196

Logistic Regression Table

Odds 95% CI

Predictor Coef StDev Z P Ratio Lower Upper

Constant -0.6875 0.2599 -2.65 0.008

Age 0.034064 0.009345 3.65 0.000 1.03 1.02 1.05

CitySect 0.1739 0.3449 0.50 0.614 1.19 0.61 2.34

Log-Likelihood = -126.065

Test that all slopes are zero: G = 17.928, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests

Method Chi-Square DF P

Pearson 93.182 91 0.417

Deviance 119.708 91 0.023

Hosmer-Lemeshow 13.116 8 0.108

Table of Observed and Expected Frequencies:

(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                               Group
Value        1     2     3     4     5     6     7     8     9    10  Total
1   Obs     11     8     9     5    10    12    11     9    13    19    107
    Exp    7.5   7.4   8.9   8.9   9.6  10.9  11.2  12.9  13.7  16.0
0   Obs     10    11    12    14     9     8     8    11     6     0     89
    Exp   13.5  11.6  12.1  10.1   9.4   9.1   7.8   7.1   5.3   3.0
Total       21    19    21    19    19    20    19    20    19    19    196
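With both fits in hand, the partial deviance for SocioEconomic Status can be computed from the two reported log-likelihoods. The Python sketch below treats each model's deviance as -2 times its log-likelihood, so any saturated-model term cancels in the difference. Dropping the two socioeconomic dummy variables gives 2 degrees of freedom, for which the chi-square upper-tail probability has the closed form exp(-x/2).

```python
import math

# Log-likelihoods reported by Minitab for the two fitted models.
ll_full = -107.826      # Age, SocEcoSt (two dummy variables), CitySect
ll_reduced = -126.065   # Age and CitySect only

# DEV(SocEconStat | Age, CitySect) = DEV(reduced) - DEV(full).
partial_dev = -2 * ll_reduced - (-2 * ll_full)
df = 2                  # two socioeconomic dummy variables were dropped

# For 2 degrees of freedom, P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-partial_dev / 2)

print(round(partial_dev, 3), df, p_value)
```

The same value is obtained as the difference of the two G statistics, 54.406 - 17.928. The very small p-value indicates that SocioEconomic Status contributes substantially once Age and City Sector are in the model.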

SAS IMPLEMENTATION

/* Multiple Logistic Regression */

data DisOut;

input ObsNum Age SocEcD1 SocEcD2 CitySect DisSta;

label SocEcD1 = "Indicator for Middle SocioEcon Status"

SocEcD2 = "Indicator for Lower SocioEcon Status"

CitySect = "City Sector (0 = Sector 1)"

DisSta = "Disease Status (1=Diseased)";

Cards;

(Data Set to be Inserted here)

;

run;

proc print;

run;

proc logistic;

model DisSta = Age SocEcD1 SocEcD2 CitySect / itprint plcl plrl rsquare lackfit;

run;

THE OUTPUT

Data Set: WORK.DISOUT

Response Variable: DISSTA Disease Status (1=Diseased)

Response Levels: 2

Number of Observations: 196

Link Function: Logit

Response Profile

Ordered

Value DISSTA Count

1 0 89

2 1 107

Maximum Likelihood Iterative Phase

Iter Step -2 Log L INTERCPT AGE SOCECD1 SOCECD2 CITYSECT

0 INITIAL 270.058302 -0.184192 0 0 0 0

1 IRLS 217.769394 -0.316255 -0.025919 0.811906 -0.580761 0.016157

2 IRLS 215.679039 -0.213630 -0.034772 0.958907 -0.751783 0.021381

3 IRLS 215.652532 -0.196544 -0.035939 0.976515 -0.774711 0.021312

4 IRLS 215.652526 -0.196262 -0.035956 0.976768 -0.775062 0.021305

5 IRLS 215.652526 -0.196262 -0.035956 0.976768 -0.775062 0.021305

Last Change in -2 Log L: 1.136868E-13

Last Evaluation of Gradient

INTERCPT AGE SOCECD1 SOCECD2 CITYSECT

-8.223913E-7 -0.000054821 -4.526762E-7 -1.795719E-6 -5.701351E-7

The LOGISTIC Procedure

Model Fitting Information and Testing Global Null Hypothesis BETA=0

Intercept

Intercept and

Criterion Only Covariates Chi-Square for Covariates

AIC 272.058 225.653 .

SC 275.336 242.043 .

-2 LOG L 270.058 215.653 54.406 with 4 DF (p=0.0001)

Score . . 48.404 with 4 DF (p=0.0001)

RSquare = 0.2424 Max-rescaled RSquare = 0.3241

Analysis of Maximum Likelihood Estimates

Parameter Standard Wald Pr > Standardized Odds

Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio

INTERCPT 1 -0.1963 0.7011 0.0784 0.7795 . .

AGE 1 -0.0360 0.0100 12.9026 0.0003 -0.374763 0.965

SOCECD1 1 0.9768 0.2012 23.5669 0.0001 0.467169 2.656

SOCECD2 1 -0.7751 0.3584 4.6760 0.0306 -0.210140 0.461

CITYSECT 1 0.0213 0.3927 0.0029 0.9567 0.005348 1.022
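Note that the SAS estimates above have the opposite signs to the Minitab estimates for the same data. This program calls PROC LOGISTIC without the DESCENDING keyword, so SAS models the probability of the response level with Ordered Value 1, which the Response Profile shows to be DISSTA = 0, whereas Minitab modeled the event Disease = 1. The Python sketch below confirms that the two sets of odds ratios are reciprocals of each other, up to rounding. The dictionary keys are our own labels.

```python
import math

# Coefficients from the Minitab table (event: disease = 1) ...
minitab = {"Age": 0.03596, "SocEc1": -0.9768, "SocEc2": 0.7751, "CitySect": -0.0213}
# ... and from the SAS table (modeled level: DISSTA = 0).
sas = {"Age": -0.0360, "SocEc1": 0.9768, "SocEc2": -0.7751, "CitySect": 0.0213}

products = {}
for name in minitab:
    or_minitab = math.exp(minitab[name])
    or_sas = math.exp(sas[name])
    products[name] = or_minitab * or_sas   # close to 1 when reciprocal
    print(name, round(or_minitab, 3), round(or_sas, 3), round(products[name], 4))
```

To obtain SAS estimates on the same scale as the Minitab output, the DESCENDING keyword can be added as in the first example.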

Association of Predicted Probabilities and Observed Responses

Concordant = 79.4% Somers' D = 0.590

Discordant = 20.4% Gamma = 0.591

Tied = 0.2% Tau-a = 0.294

(9523 pairs) c = 0.795

Parameter Estimates and 95% Confidence Intervals

Profile Likelihood

Confidence Limits

Parameter

Variable Estimate Lower Upper

INTERCPT -0.1963 -1.5875 1.1740

AGE -0.0360 -0.0565 -0.0170

SOCECD1 0.9768 0.5926 1.3843

SOCECD2 -0.7751 -1.4879 -0.0772

CITYSECT 0.0213 -0.7506 0.7961

Conditional Odds Ratios and 95% Confidence Intervals

Profile Likelihood

Confidence Limits

Odds

Variable Unit Ratio Lower Upper

AGE 1.0000 0.965 0.945 0.983

SOCECD1 1.0000 2.656 1.809 3.992

SOCECD2 1.0000 0.461 0.226 0.926

CITYSECT 1.0000 1.022 0.472 2.217

Hosmer and Lemeshow Goodness-of-Fit Test

DISSTA = 0 DISSTA = 1

-------------------- --------------------

Group Total Observed Expected Observed Expected

1 21 2 1.71 19 19.29

2 20 2 3.71 18 16.29

3 20 5 4.95 15 15.05

4 20 7 5.90 13 14.10

5 21 6 8.62 15 12.38

6 20 12 9.78 8 10.22

7 21 14 13.05 7 7.95

8 20 15 14.11 5 5.89

9 20 17 16.03 3 3.97

10 13 9 11.14 4 1.86

Goodness-of-fit Statistic = 7.1833 with 8 DF (p=0.5170)

______

SELECTING BEST VARIABLES

You may also use SAS to select the appropriate variables to include in your model. You do this by using the INCLUDE = p and SELECTION = STEPWISE options in the MODEL statement. The value of p tells SAS to always include in the model the first p variables listed. Thus, for the above data set, we could use the command

proc logistic;

model DisSta =SocEcD1 SocEcD2 CitySect Age / include = 2 selection=stepwise;

run;

The relevant part of the output is given below:

Stepwise Selection Procedure

The following variables will be included in each model:

INTERCPT SOCECD1 SOCECD2

Step 0. The INCLUDE variables were entered.

Model Fitting Information and Testing Global Null Hypothesis BETA=0

Intercept

Intercept and

Criterion Only Covariates Chi-Square for Covariates

AIC 272.058 236.851 .

SC 275.336 246.685 .

-2 LOG L 270.058 230.851 39.207 with 2 DF (p=0.0001)

Score . . 37.067 with 2 DF (p=0.0001)

Residual Chi-Square = 14.7090 with 2 DF (p=0.0006)

Step 1. Variable AGE entered:

Model Fitting Information and Testing Global Null Hypothesis BETA=0

Intercept

Intercept and

Criterion Only Covariates Chi-Square for Covariates

AIC 272.058 223.655 .

SC 275.336 236.768 .

-2 LOG L 270.058 215.655 54.403 with 3 DF (p=0.0001)

Score . . 48.402 with 3 DF (p=0.0001)

Residual Chi-Square = 0.0029 with 1 DF (p=0.9567)

NOTE: No (additional) variables met the 0.05 significance level for entry into the model.

Summary of Stepwise Procedure

Variable Number Score Wald Pr >

Step Entered Removed In Chi-Square Chi-Square Chi-Square

1 AGE 3 14.7000 . 0.0001

Analysis of Maximum Likelihood Estimates

Parameter Standard Wald Pr > Standardized Odds

Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio

INTERCPT 1 -0.2009 0.6960 0.0833 0.7729 . .

SOCECD1 1 0.9772 0.2010 23.6272 0.0001 0.467385 2.657

SOCECD2 1 -0.7700 0.3463 4.9459 0.0262 -0.208780 0.463

AGE 1 -0.0358 0.00978 13.4347 0.0002 -0.373561 0.965

Association of Predicted Probabilities and Observed Responses

Concordant = 79.4% Somers' D = 0.591

Discordant = 20.3% Gamma = 0.592

Tied = 0.3% Tau-a = 0.294

(9523 pairs) c = 0.795

Conclusions: Using this procedure, SAS determined that the variable City Sector is not an important predictor once the socioeconomic indicators and Age are in the model.
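As a supplementary check on this conclusion, the AIC and SC values for the candidate models can be recomputed from their -2 LOG L values and compared. The Python sketch below does this using AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n). This comparison is our own addition: the stepwise procedure itself relies on chi-square tests at the 0.05 level, not on AIC or SC. The model labels are our own.

```python
import math

n = 196   # number of observations

# Model label -> (-2 Log L, number of estimated parameters incl. intercept).
models = {
    "SocEc only": (230.851, 3),
    "SocEc + Age": (215.655, 4),
    "SocEc + Age + CitySect": (215.653, 5),
}

# AIC = -2 Log L + 2k ; SC = -2 Log L + k ln(n).
table = {name: (neg2ll + 2 * k, neg2ll + k * math.log(n))
         for name, (neg2ll, k) in models.items()}

best_aic = min(table, key=lambda name: table[name][0])
best_sc = min(table, key=lambda name: table[name][1])

for name, (aic, sc) in table.items():
    print(name, round(aic, 3), round(sc, 3))
print(best_aic, best_sc)
```

Both criteria are smallest for the model with the socioeconomic indicators and Age, which agrees with the stepwise result that City Sector adds essentially nothing.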

Note: If you do not include the option INCLUDE = 2, then the procedure will also check whether the SocioEconomic variables are important. Below is the program and output:

Relevant Program Portion:

proc logistic;

model DisSta = SocEcD1 SocEcD2 CitySect Age / selection=stepwise;

run;