Logistic Regression in R
In this section of the notes we examine logistic regression in R. There are several functions that I wrote for plotting diagnostics similar to those produced by SAS; the inspiration for them came from work Prof. Malone and I did for OLS as part of his senior project.
Example 1: Oral Contraceptive Use and Myocardial Infarctions
Set up a text file with the data in columns and variable names at the top. The case (MI) and control (NoMI) counts are in separate columns, followed by the stratification variable Age and the risk factor OC use.
> OCMI.data = read.table(file.choose(),header=T) # read in text file
> OCMI.data
MI NoMI Age OCuse
1 4 62 1 Yes
2 2 224 1 No
3 9 33 2 Yes
4 12 390 2 No
5 4 26 3 Yes
6 33 330 3 No
7 6 9 4 Yes
8 65 362 4 No
9 6 5 5 Yes
10 93 301 5 No
> attach(OCMI.data)
> OC.glm <- glm(cbind(MI,NoMI)~Age+OCuse,family=binomial) # fit model
> summary(OC.glm)
Call:
glm(formula = cbind(MI, NoMI) ~ Age + OCuse, family = binomial)
Deviance Residuals:
[1] 0.456248 -0.520517 1.377693 -0.886710 -1.685521 0.714695 -0.130922 0.033643
[9] -0.045061 0.008822
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3698 0.4347 -10.054 < 2e-16 ***
Age2 1.1384 0.4768 2.388 0.0170 *
Age3 1.9344 0.4582 4.221 2.43e-05 ***
Age4 2.6481 0.4496 5.889 3.88e-09 ***
Age5 3.1943 0.4474 7.140 9.36e-13 ***
OCuseYes 1.3852 0.2505 5.530 3.19e-08 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 158.0085 on 9 degrees of freedom
Residual deviance: 6.5355 on 4 degrees of freedom
AIC: 58.825
Number of Fisher Scoring iterations: 3
Find OR associated with oral contraceptive use ADJUSTED for age.
Recall: CMH procedure gave 3.97.
> exp(1.3852)
[1] 3.995625
Find a 95% CI for OR associated with OC use.
> exp(1.3852-1.96*.2505)
[1] 2.445428
> exp(1.3852+1.96*.2505)
[1] 6.528518
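The same interval can be computed without retyping the rounded numbers by pulling the estimate and standard error directly from the fitted model; a minimal sketch using coef() and vcov() (the coefficient name OCuseYes matches the summary above):
> b <- coef(OC.glm)["OCuseYes"]
> se <- sqrt(vcov(OC.glm)["OCuseYes","OCuseYes"])
> exp(b + c(-1,1)*1.96*se) # 95% Wald CI for the age-adjusted OR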
Interpreting the age effect in terms of OR’s ADJUSTING for OC use.
Note: The reference group is Age = 1 which was women 25 – 29 years of age.
> OC.glm$coefficients
(Intercept) Age2 Age3 Age4 Age5 OCuseYes
-4.369850 1.138363 1.934401 2.648059 3.194292 1.385176
> Age.coefs <- OC.glm$coefficients[2:5]
> exp(Age.coefs)
Age2 Age3 Age4 Age5
3.121653 6.919896 14.126585 24.392906
Find 95% CI for age = 5 group.
> exp(3.1943-1.96*.4474)
[1] 10.14921
> exp(3.1943+1.96*.4474)
[1] 58.62751
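All of the adjusted ORs and their Wald-type 95% limits can also be obtained in one step; a minimal sketch using coef() and confint.default(), which computes estimate +/- 1.96 SE intervals on the log-odds scale:
> exp(cbind(OR = coef(OC.glm), confint.default(OC.glm))) # OR and 95% limits for each term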
Example 2: Coffee Drinking and Myocardial Infarctions
CoffeeMI.data = read.table(file.choose(),header=T)
> CoffeeMI.data
Smoking Coffee MI NoMI
1 Never > 5 7 31
2 Never < 5 55 269
3 Former > 5 7 18
4 Former < 5 20 112
5 1-14 Cigs > 5 7 24
6 1-14 Cigs < 5 33 114
7 15-25 Cigs > 5 40 45
8 15-25 Cigs < 5 88 172
9 25-34 Cigs > 5 34 24
10 25-34 Cigs < 5 50 55
11 35-44 Cigs > 5 27 24
12 35-44 Cigs < 5 55 58
13 45+ Cigs > 5 30 17
14 45+ Cigs < 5 34 17
> attach(CoffeeMI.data)
> Coffee.glm = glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)
> summary(Coffee.glm)
Call:
glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7650 -0.4510 -0.0232 0.2999 0.7917
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2981 0.1819 -7.136 9.60e-13 ***
Smoking15-25 Cigs 0.6892 0.2119 3.253 0.00114 **
Smoking25-34 Cigs 1.2462 0.2398 5.197 2.02e-07 ***
Smoking35-44 Cigs 1.1988 0.2389 5.017 5.24e-07 ***
Smoking45+ Cigs 1.7811 0.2808 6.342 2.27e-10 ***
SmokingFormer -0.3291 0.2778 -1.185 0.23616
SmokingNever -0.3153 0.2279 -1.384 0.16646
Coffee> 5 0.3200 0.1377 2.324 0.02012 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 173.7899 on 13 degrees of freedom
Residual deviance: 3.7622 on 6 degrees of freedom
AIC: 84.311
Number of Fisher Scoring iterations: 3
OR for drinking 5 or more cups of coffee per day.
Note: CMH procedure gave OR = 1.375
> exp(.3200)
[1] 1.377128
95% CI for OR associated with heavy coffee drinking
> exp(.3200 - 1.96*.1377)
[1] 1.051385
> exp(.3200 + 1.96*.1377)
[1] 1.803794
Reordering a Factor
To examine the effect of smoking we might want to “reorder” the levels of smoking status so that individuals who have never smoked are used as the reference group. To do this in R you must do the following:
Smoking =factor(Smoking,levels=c("Never","Former","1-14 Cigs","15-25 Cigs","25-34 Cigs","35-44 Cigs","45+ Cigs"))
The first level specified in the levels argument will be used as the reference group, “Never” in this case. Refitting the model with the reordered smoking status factor gives the following:
> Coffee.glm2 <-glm(cbind(MI,NoMI)~Smoking+Coffee,family=binomial)
> summary(Coffee.glm2)
Call:
glm(formula = cbind(MI, NoMI) ~ Smoking + Coffee, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7650 -0.4510 -0.0232 0.2999 0.7917
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.61344 0.14068 -11.469 < 2e-16 ***
SmokingFormer -0.01376 0.25376 -0.054 0.9568
Smoking1-14 Cigs 0.31533 0.22789 1.384 0.1665
Smoking15-25 Cigs 1.00451 0.17976 5.588 2.30e-08 ***
Smoking25-34 Cigs 1.56150 0.21254 7.347 2.03e-13 ***
Smoking35-44 Cigs 1.51417 0.21132 7.165 7.77e-13 ***
Smoking45+ Cigs 2.09646 0.25855 8.108 5.13e-16 ***
Coffee> 5 0.31995 0.13766 2.324 0.0201 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 173.7899 on 13 degrees of freedom
Residual deviance: 3.7622 on 6 degrees of freedom
AIC: 84.311
Number of Fisher Scoring iterations: 3
Notice that “SmokingNever” is now absent from the output so we know it is being used as the reference group. The OR’s associated with the various levels of smoking are computed below.
> Smoke.coefs = Coffee.glm2$coefficients[2:7]
> exp(Smoke.coefs)
SmokingFormer Smoking1-14 Cigs Smoking15-25 Cigs Smoking25-34 Cigs
0.986338 1.370715 2.730561 4.765984
Smoking35-44 Cigs Smoking45+ Cigs
4.545632 8.137279
Confidence intervals for each could be computed in the standard way.
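For example, Wald-type 95% intervals for all of the smoking ORs can be computed from the coefficient table of the refit model; a minimal sketch using coef(summary()) (columns named Estimate and Std. Error as in the output above):
> ctab <- coef(summary(Coffee.glm2))[2:7,] # the six smoking rows
> exp(cbind(OR = ctab[,"Estimate"],
+ lower = ctab[,"Estimate"] - 1.96*ctab[,"Std. Error"],
+ upper = ctab[,"Estimate"] + 1.96*ctab[,"Std. Error"]))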
Some Details for Categorical Predictors with More Than Two Levels
Consider the coffee drinking/MI study above. The stratification variable smoking has seven levels. Thus it requires six dummy variables to define it. The level that is not defined using a dichotomous dummy variable serves as the reference group. The table below shows how the dummy variables are defined:

Level                     D2   D3   D4   D5   D6   D7
Never (Reference Group)    0    0    0    0    0    0
Former                     1    0    0    0    0    0
1-14 Cigs                  0    1    0    0    0    0
15-25 Cigs                 0    0    1    0    0    0
25-34 Cigs                 0    0    0    1    0    0
35-44 Cigs                 0    0    0    0    1    0
45+ Cigs                   0    0    0    0    0    1
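The coding R actually uses can be verified with contrasts(); a minimal sketch (assuming Smoking has been reordered as above so that "Never" is the first level):
> contrasts(Smoking) # rows = levels, columns = the dummy variables D2-D7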
Example: Coffee Drinking and Myocardial Infarctions
CoffeeMI.data = read.table(file.choose(),header=T)
> CoffeeMI.data
Smoking Coffee MI NoMI
1 Never > 5 7 31
2 Never < 5 55 269
3 Former > 5 7 18
4 Former < 5 20 112
5 1-14 Cigs > 5 7 24
6 1-14 Cigs < 5 33 114
7 15-25 Cigs > 5 40 45
8 15-25 Cigs < 5 88 172
9 25-34 Cigs > 5 34 24
10 25-34 Cigs < 5 50 55
11 35-44 Cigs > 5 27 24
12 35-44 Cigs < 5 55 58
13 45+ Cigs > 5 30 17
14 45+ Cigs < 5 34 17
The Logistic Model
logit(π) = ln(π/(1−π)) = β0 + β2 D2 + β3 D3 + β4 D4 + β5 D5 + β6 D6 + β7 D7 + β8 Coffee

where D2, ..., D7 are the smoking dummy variables defined above and Coffee is a dichotomous predictor equal to 1 if the individual drinks 5 or more cups of coffee per day and 0 otherwise.
Comparing the log-odds of a heavy coffee drinker who smokes 15-25 cigarettes per day to a heavy coffee drinker who has never smoked we have

logit(π | D4 = 1, Coffee = 1) = β0 + β4 + β8
logit(π | Never smoked, Coffee = 1) = β0 + β8

Taking the difference gives

β4,

thus

OR = exp(β4),

the odds ratio associated with smoking 15-25 cigarettes per day compared to individuals who have never smoked, amongst heavy coffee drinkers. Because β8 is not involved in the odds ratio the result is the same for non-heavy coffee drinkers as well!
You can also consider combinations of factors, e.g. if we compare heavy coffee drinkers who smoke 15-25 cigarettes per day to non-heavy coffee drinkers who have never smoked the associated OR is given by

OR = exp(β4 + β8).

Using our fitted model the ORs discussed above are computed below.
> summary(Coffee.glm2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.61344 0.14068 -11.469 < 2e-16 ***
SmokingFormer -0.01376 0.25376 -0.054 0.9568
Smoking1-14 Cigs 0.31533 0.22789 1.384 0.1665
Smoking15-25 Cigs 1.00451 0.17976 5.588 2.30e-08 ***
Smoking25-34 Cigs 1.56150 0.21254 7.347 2.03e-13 ***
Smoking35-44 Cigs 1.51417 0.21132 7.165 7.77e-13 ***
Smoking45+ Cigs 2.09646 0.25855 8.108 5.13e-16 ***
Coffee> 5 0.31995 0.13766 2.324 0.0201 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
OR for 15-25 cigarette smokers vs. never smokers (regardless of coffee drinking status)
> exp(1.00451)
[1] 2.730569
OR for 15-25 cigarette smokers who are also heavy coffee drinkers vs. never smokers who are not heavy coffee drinkers
> exp(.31995 + 1.00451)
[1] 3.760154
Similar calculations could be done for other combinations of coffee and cigarette use.
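A confidence interval for a combined OR like this needs the variance of the sum of the two coefficients, which involves their covariance; a minimal sketch using vcov() and the coefficient names from the output above:
> nm <- c("Smoking15-25 Cigs","Coffee> 5")
> b <- coef(Coffee.glm2)[nm]
> V <- vcov(Coffee.glm2)[nm,nm]
> est <- sum(b) # estimated log OR for the combination
> se <- sqrt(sum(V)) # Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1,b2)
> exp(est + c(-1,1)*1.96*se) # 95% CI for the combined OR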
Example 3: Risk Factors for Low Birth Weight
Response
Y = low birth weight, i.e. birth weight < 2500 grams (1 = yes, 0 = no)
Set of potential predictors
X1 = previous history of premature labor (1 = yes, 0 = no)
X2 = hypertension during pregnancy (1 = yes, 0 = no)
X3 = smoker (1 = yes, 0 = no)
X4 = uterine irritability (1 = yes, 0 = no)
X5 = minority (1 = yes, 0 = no)
X6 = mother’s age in years
X7 = mother’s weight at last menstrual cycle
Analysis in R
> Lowbirth = read.table(file.choose(),header=T)
> Lowbirth[1:5,] # print first 5 rows of the data set
Low Prev Hyper Smoke Uterine Minority Age Lwt race bwt
1 0 0 0 0 1 1 19 182 2 2523
2 0 0 0 0 0 1 33 155 3 2551
3 0 0 0 1 0 0 20 105 1 2557
4 0 0 0 1 1 0 21 108 1 2594
5 0 0 0 1 1 0 18 107 1 2600
Make sure categorical variables are interpreted as factors by using the factor command
> Low = factor(Low)
> Prev = factor(Prev)
> Hyper = factor(Hyper)
> Smoke = factor(Smoke)
> Uterine = factor(Uterine)
> Minority = factor(Minority)
Note: This is not really necessary for dichotomous variables that are coded (0,1).
Fit a preliminary model using all available covariates
> low.glm = glm(Low~Prev+Hyper+Smoke+Uterine+Minority+Age+Lwt,family=binomial)
> summary(low.glm)
Call:
glm(formula = Low ~ Prev + Hyper + Smoke + Uterine + Minority +
Age + Lwt, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6010 -0.8149 -0.5128 1.0188 2.1977
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.378479 1.170627 0.323 0.74646
Prev1 1.196011 0.461534 2.591 0.00956 **
Hyper1 1.452236 0.652085 2.227 0.02594 *
Smoke1 0.959406 0.405302 2.367 0.01793 *
Uterine1 0.647498 0.466468 1.388 0.16511
Minority1 0.990929 0.404969 2.447 0.01441 *
Age -0.043221 0.037493 -1.153 0.24900
Lwt -0.012047 0.006422 -1.876 0.06066 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Null deviance: 232.40 on 185 degrees of freedom
Residual deviance: 196.71 on 178 degrees of freedom
AIC: 212.71
Number of Fisher Scoring iterations: 3
It appears that both uterine irritability and mother’s age are not significant. We can fit the reduced model eliminating both terms and test whether the model is significantly degraded by using the general chi-square test (see the SAS example).
> low.reduced = glm(Low~Prev+Hyper+Smoke+Minority+Lwt,family=binomial)
> summary(low.reduced)
Call:
glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7277 -0.8219 -0.5368 0.9867 2.1517
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.261274 0.885803 -0.295 0.76803
Prev1 1.181940 0.444254 2.661 0.00780 **
Hyper1 1.397219 0.656271 2.129 0.03325 *
Smoke1 0.981849 0.398300 2.465 0.01370 *
Minority1 1.044804 0.394956 2.645 0.00816 **
Lwt -0.014127 0.006387 -2.212 0.02697 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 232.40 on 185 degrees of freedom
Residual deviance: 200.32 on 180 degrees of freedom
AIC: 212.32
Number of Fisher Scoring iterations: 3
* Recall:
Residual Deviance for the Null Hypothesis (reduced) Model: 200.32 with df = 180
Residual Deviance for the Alternative Hypothesis (full) Model: 196.71 with df = 178
General Chi-Square Test
χ² = 200.32 − 196.71 = 3.61 with df = 180 − 178 = 2, p-value ≈ 0.164 > 0.05.
Fail to reject the null; the reduced model is adequate.
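The same general chi-square (likelihood ratio) test can be obtained directly in R; a minimal sketch using anova() on the two nested fits:
> anova(low.reduced, low.glm, test="Chisq") # compares reduced vs. full model
> 1 - pchisq(200.32 - 196.71, df = 2)       # the p-value computed by hand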
Interpretation of Model Parameters
OR’s Associated with Categorical Predictors
> low.reduced
Call: glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)
Coefficients:
(Intercept) Prev1 Hyper1 Smoke1 Minority1 Lwt
-0.26127 1.18194 1.39722 0.98185 1.04480 -0.01413
Degrees of Freedom: 185 Total (i.e. Null); 180 Residual
Null Deviance: 232.4
Residual Deviance: 200.3 AIC: 212.3
Estimated OR’s
> exp(low.reduced$coefficients[2:5])
Prev1 Hyper1 Smoke1 Minority1
3.260693 4.043938 2.669388 2.842841
95% CI for OR Associated with History of Premature Labor
> exp(1.182-1.96*.444)
[1] 1.365827
> exp(1.182+1.96*.444)
[1] 7.78532
Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.366 and 7.785 times larger for mothers with a history of premature labor.
95% CI for OR Associated with Hypertension
> exp(1.397 - 1.96*.6563)
[1] 1.117006
> exp(1.397 + 1.96*.6563)
[1] 14.63401
Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.117 and 14.63 times larger for mothers with hypertension during pregnancy.
95% CI for OR Associated with Smoking
> exp(.981849 - 1.96*.3983)
[1] 1.222846
> exp(.981849 + 1.96*.3983)
[1] 5.827086
Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.223 and 5.827 times larger for mothers who smoked during pregnancy.
95% CI for OR Associated with Minority Status
> exp(1.0448 - 1.96*.3950)
[1] 1.310751
> exp(1.0448 + 1.96*.3950)
[1] 6.16569
Holding everything else constant we estimate that the odds of having an infant with low birth weight are between 1.311 and 6.166 times larger for non-white mothers.
OR Associated with Mother’s Weight at Last Menstrual Cycle
Because this is a continuous predictor with values over 100 we should use an increment larger than one when considering the effect of mother’s weight on birth weight. Here we will use an increment of c = 10 lbs., although certainly there are other possibilities.
> exp(-10*.014127)
[1] 0.8682549
i.e. an estimated 13.2% decrease in the odds of low birth weight for each additional 10 lbs. of the mother's weight at her last menstrual cycle.
A 95% CI for this OR is:
> exp(10*(-.014127) - 1.96*10*.006387)
[1] 0.7660903
> exp(10*(-.014127) + 1.96*10*.006387)
[1] 0.9840439
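The same interval can be produced for any increment c directly from the stored coefficient and its variance; a minimal sketch with c = 10, using the coefficient name Lwt from the fit above:
> c.inc <- 10
> b <- coef(low.reduced)["Lwt"]
> se <- sqrt(vcov(low.reduced)["Lwt","Lwt"])
> exp(c.inc*b + c(-1,1)*1.96*c.inc*se) # 95% CI for the OR per 10 lb. increase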
Create a sequence of weights from smallest observed weight to the largest observed weight by ½ pound increments.
x = seq(min(Lwt),max(Lwt),.5)
Here I have set the other covariates as follows: previous history (1 = yes), hypertension (0 = no), smoking status (1 = yes), and minority (0 = no).
fit = predict(low.reduced,data.frame(Prev=factor(rep(1,length(x))),
Hyper=factor(rep(0,length(x))),Smoke=factor(rep(1,length(x))),Minority=factor(rep(0,length(x))),Lwt=x),type="response")
plot(x,fit,xlab="Mother's Weight",ylab="P(Low|Prev=1,Smoke=1,Lwt)")
Case Diagnostics (Delta Deviance and Cook's Distance)
As in the case of ordinary least squares (OLS) regression we need to be wary of cases that may have unduly high influence on our results and those that are poorly fit. The most common influence measure is Cook’s Distance and a good measure of poorly fit cases is the Delta Deviance.
Essentially Cook's Distance (CD_i) measures the change in the estimated parameters when the ith observation is deleted. This change is measured for each of the observations and can be plotted versus the observation number (index) to aid in the identification of high influence cases. Several cut-offs have been proposed for Cook's Distance, the most common being to classify an observation as having large influence if CD_i > 1 or, in the case of a large sample size n, CD_i > 4/n.
Cook's Distance

CD_i = (1/k) * r_Pi^2 * h_i / (1 - h_i)^2

where r_Pi is the Pearson residual defined above, h_i is the leverage of the ith case, and k is the number of estimated parameters.
Delta deviance measures the change in the deviance (D) when the ith case is deleted,

ΔD_i = d_i^2 / (1 - h_i)

where d_i is the deviance residual. Values around 4 or larger are considered to indicate cases that are poorly fit.
These poorly fit cases correspond to individuals where y_i = 1 but the estimated probability is small, or cases where y_i = 0 but the estimated probability is large.
In cases of both high influence and poor fit it is good to look at the covariate values for these individuals and we can begin to address the role they play in the analysis. In many cases there will be several individuals with the same covariate pattern, especially if most or all of the predictors are categorical in nature.
> Diagplot.glm(low.reduced)
> Diagplot.log(low.reduced)
Cases 11 and 13 have the highest Cook’s distances although they are not that large. It should be noted also that they are also somewhat poorly fit. Cases 129, 144, 152, and 180 appear to be poorly fit. The information on all of these cases is shown below.
> Lowbirth[c(11,13,129,144,152,180),]
Low Prev Hyper Smoke Uterine Minority Age Lwt race bwt
11 0 0 1 0 0 1 19 95 3 2722
13 0 0 1 0 0 1 22 95 3 2750
129 1 0 0 0 1 0 29 130 1 1021
144 1 0 0 0 1 1 21 200 2 1928
152 1 0 0 0 0 0 24 138 1 2100
180 1 0 0 1 0 0 26 190 1 2466
Case 152 had a low birth weight infant even in the absence of the identified potential risk factors. The fitted values for all four of the poorly fit cases are quite small.
> fitted(low.reduced)[c(11,13,129,144,152,180)]
11 13 129 144 152 180
0.69818500 0.69818500 0.10930602 0.11486743 0.09877858 0.12307383
Cases 11 and 13 have high predicted probabilities despite the fact that they had babies with normal birth weight. Their relatively high leverage might come from the fact that there were very few hypertensive minority women in the study. These two facts combined lead to the relatively large Cook’s Distances for these two cases.
Plotting Estimated Conditional Probabilities P(Low = 1 | X)
A summary of the reduced model is given below:
> low.reduced
Call: glm(formula = Low ~ Prev + Hyper + Smoke + Minority + Lwt, family = binomial)
Coefficients:
(Intercept) Prev1 Hyper1 Smoke1 Minority1 Lwt
-0.26127 1.18194 1.39722 0.98185 1.04480 -0.01413
Degrees of Freedom: 185 Total (i.e. Null); 180 Residual
Null Deviance: 232.4
Residual Deviance: 200.3 AIC: 212.3
To easily plot probabilities in R we can write a function that takes covariate values and computes the desired conditional probability.
> x <- seq(min(Lwt),max(Lwt),.5)
> PrLwt <- function(x,Prev,Hyper,Smoke,Minority) {
+ L <- -.26127 + 1.18194*Prev + 1.39722*Hyper + .98185*Smoke +
+ 1.0448*Minority - .01413*x
+ exp(L)/(1 + exp(L))
+ }
> plot(x,PrLwt(x,1,1,1,1),xlab="Mother's Weight",ylab="P(Low=1|x)",
+ ylim=c(0,1),type="l")
> title(main="Plot of P(Low=1|X) vs. Mother's Weight")
> lines(x,PrLwt(x,0,0,0,0),lty=2,col="red")
> lines(x,PrLwt(x,1,1,0,0),lty=3,col="blue")
> lines(x,PrLwt(x,0,0,1,1),lty=4,col="green")
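Rather than hard-coding the estimated coefficients, the same curves can be produced with predict(), which stays in sync with the fitted model. A minimal sketch reproducing the first curve above (Prev = 1, Hyper = 1, Smoke = 1, Minority = 1); the data frame name newd is just illustrative:
> newd <- data.frame(Prev=factor(rep(1,length(x)),levels=c(0,1)),
+ Hyper=factor(rep(1,length(x)),levels=c(0,1)),
+ Smoke=factor(rep(1,length(x)),levels=c(0,1)),
+ Minority=factor(rep(1,length(x)),levels=c(0,1)),Lwt=x)
> lines(x,predict(low.reduced,newd,type="response"),lty=1,col="gray") # should overlay the solid curve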
R Function – Diagplot.log
Plot Cook’s Distance and Delta Deviance for Logistic Regression Models
Diagplot.log = function(glm1)
{
k <- length(glm1$coef) # number of estimated parameters
h <- lm.influence(glm1)$hat # leverages (hat values)
fv <- fitted(glm1) # fitted probabilities
pr <- resid(glm1, type = "pearson") # Pearson residuals
dr <- resid(glm1, type = "deviance") # deviance residuals
par(mfrow = c(2, 1))
n <- length(fv)
index <- seq(1, n, 1)
Ck <- (1/k)*((pr^2) * h)/((1 - h)^2) # Cook's Distances
Cd <- dr^2/(1 - h) # Delta Deviances
plot(index, Ck, type = "n", xlab = "Index", ylab =
"Cook's Distance", cex = 0.7, main =
"Plot of Cook's Distance vs. Index", col = 1)
points(index, Ck, col = 2)
identify(index, Ck)
plot(index, Cd, type = "n", xlab = "Index", ylab =
"Delta Deviance", cex = 0.7, main =
"Plot of Delta Deviance vs. Index")
points(index, Cd, col = 2)
identify(index, Cd)
par(mfrow = c(1, 1))
invisible()
}
Interactions and Higher Order Terms (Note: uses data frame Lowbwt)
Here we work with a slightly different version of the low birth weight data, which includes an additional predictor, ftv, a factor indicating the number of first trimester doctor visits the woman made (coded as: 0, 1, or 2+). We will examine how the model below was developed in the next section where we discuss model development.
In the model below we have added an interaction between age and the number of first trimester visits, as well as an interaction between smoking and uterine irritability (discussed further below). The logistic model is:

logit(π) = β0 + β1 age + β2 lwt + β3 smoke + β4 ptd + β5 ht + β6 ui + β7 ftv1 + β8 ftv(2+) + β9 age·ftv1 + β10 age·ftv(2+) + β11 smoke·ui
> summary(bigmodel)
Call:
glm(formula = low ~ age + lwt + smoke + ptd + ht + ui + ftv +
age:ftv + smoke:ui, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8945 -0.7128 -0.4817 0.7841 2.3418
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.582389 1.420834 -0.410 0.681885
age 0.075538 0.053945 1.400 0.161428
lwt -0.020372 0.007488 -2.721 0.006513 **
smoke1 0.780047 0.420043 1.857 0.063302 .
ptd1 1.560304 0.496626 3.142 0.001679 **
ht1 2.065680 0.748330 2.760 0.005773 **
ui1 1.818496 0.666670 2.728 0.006377 **
ftv1 2.921068 2.284093 1.279 0.200941
ftv2+ 9.244460 2.650495 3.488 0.000487 ***
age:ftv1 -0.161823 0.096736 -1.673 0.094360 .
age:ftv2+ -0.411011 0.118553 -3.467 0.000527 ***
smoke1:ui1 -1.916644 0.972366 -1.971 0.048711 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 183.07 on 177 degrees of freedom
AIC: 207.07
Number of Fisher Scoring iterations: 4
> bigmodel$coefficients
(Intercept) age lwt smoke1 ptd1 ht1
-0.58238913 0.07553844 -0.02037234 0.78004747 1.56030401 2.06567991
ui1 ftv1 ftv2+ age:ftv1 age:ftv2+ smoke1:ui1
1.81849631 2.92106773 9.24445985 -0.16182328 -0.41101103 -1.91664380
Calculate P(Low|Age,FTV) for women of average pre-pregnancy weight with all other risk factors absent. Similar calculations could be done if we wanted to add in other factors as well.
First we define a grid of ages (one reasonable choice is shown) and calculate the logits as a function of age for the three levels of FTV: 0, 1, and 2+ respectively.
> agex <- seq(min(age),max(age),1) # grid of ages spanning the observed range
> L <- -.5824 + .0755*agex - .02037*mean(lwt)
> L1 <- -.5824 + .0755*agex - .02037*mean(lwt) + 2.9211 - .16182*agex
> L2 <- -.5824 + .0755*agex - .02037*mean(lwt) + 9.2445 - .4110*agex
Next we calculate the associated conditional probabilities.
> P <- exp(L)/(1+exp(L))
> P1 <- exp(L1)/(1+exp(L1))
> P2 <- exp(L2)/(1+exp(L2))
Finally we plot the probability curves as a function of age and FTV.
> plot(agex,P,type="l",xlab="Age",ylab="P(Low|Age,FTV)",ylim=c(0,1))
> lines(agex,P1,lty=2,col="blue")
> lines(agex,P2,lty=3,col="red")
> title(main="Interaction Between Age and First Trimester Visits",cex=.6)
We also have an interaction between smoking and uterine irritability added to the model. This will affect how we interpret the two in terms of odds ratios. We need to consider the OR associated with smoking for women without uterine irritability, the OR associated with uterine irritability for nonsmokers, and finally the OR associated with smoking and having uterine irritability during pregnancy.
These estimated odds ratios are given below:
OR for Smoking with No Uterine Irritability
> exp(.7800)
[1] 2.181472
OR for Uterine Irritability with No Smoking
> exp(1.8185)
[1] 6.162608
OR for Smoking and Uterine Irritability
> exp(.7800+1.8185-1.91664)
[1] 1.977553
This result is hard to explain physiologically and so this interaction term might be removed from the model.
Model Selection Methods
Stepwise methods used in logistic regression are the same as those used in ordinary least squares regression; however, the criterion used is AIC (Akaike Information Criterion) rather than Mallows' Cp statistic. Like Mallows' statistic, AIC balances the residual deviance against the number of parameters in the model.

AIC = D/φ + 2k

where D = residual deviance, k = total number of estimated parameters, and φ is an estimate of the dispersion parameter, which is taken to be 1 in models where overdispersion is not present. Overdispersion occurs when the data consist of the number of successes out of n_i > 1 trials and the trials are not independent (e.g. the male birth data from your last homework).
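In R, stepwise selection based on AIC is carried out with the step() function; a minimal sketch starting from the full main-effects model for the low birth weight data and searching in both directions:
> low.step <- step(low.glm, direction="both") # prints the AIC at each step of the search
> summary(low.step)                           # the model chosen by the search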