Chapter 5-5. Deriving the Logistic Regression Model
In this chapter, we will derive the logistic regression model, or in other words, explain why it has the form that it does. We will see why the model produces odds ratios (which is not at all intuitive).
First, however, we will discuss the criticisms of using linear regression for a dichotomous outcome. Because of these criticisms, linear regression is generally rejected as an acceptable model for a dichotomous outcome. Logistic regression is not subject to these criticisms and is the most widely used model for a dichotomous outcome.
What Linear Regression with a Dichotomous Outcome Gives You
As an example, Wright et al. (1983) were interested in risk factors for giving birth to a baby with low birthweight. Out of 900 births, 10.9% were low birthweight outcomes.
Reading the data in,
File > Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on wright_lowbw.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\wright_lowbw.dta",
clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use wright_lowbw, clear
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
Examining the outcome variable using a frequency table,
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: lowbw
OK
tabulate lowbw
* <or>
tab lowbw // abbreviating the "tabulate" command
      lowbw |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        802       89.11       89.11
          1 |         98       10.89      100.00
------------+-----------------------------------
      Total |        900      100.00
We see that the outcome variable is in the correct 0-1 coding scheme.
We wish to know if the data support the conjecture that maternal smoking is a risk factor for a low birthweight delivery. The dataset contains a variable “smo”, with the values
1 = no, mother does not smoke
2 = yes, mother smokes
tab smo

        smo |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        664       73.78       73.78
          2 |        236       26.22      100.00
------------+-----------------------------------
      Total |        900      100.00
Recoding this to a 0-1 variable,
Create or change data
Other variable-transformation commands
Recode categorical variable
Main tab: Variables: smo
Required: (1=0)(2=1)
< You could click on “Examples” to see some examples
on how to specify this part. >
Options tab: Generate new variables: smokes
OK
recode smo (1=0)(2=1), gen(smokes)
* <or>
recode smo 1=0 2=1, gen(smokes)
Checking our work,
Statistics > Summaries, tables & tests
Tables
Two-way tables with measures of association
Row variable: smo
Column variable: smokes
OK
tab smo smokes
           |    RECODE of smo
       smo |         0          1 |     Total
-----------+----------------------+----------
         1 |       664          0 |       664
         2 |         0        236 |       236
-----------+----------------------+----------
     Total |       664        236 |       900
We see that the new variable was created correctly.
We can test the smoking/low birthweight association using an ordinary chi-square test.
Summaries, tables & tests
Tables
Two-way tables with measures of association
Row variable: lowbw
Column variable: smokes
Test statistics: Pearson chi-squared
Cell contents: Within column relative frequencies
OK
tab lowbw smokes, chi2 col
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |        smokes
     lowbw |         0          1 |     Total
-----------+----------------------+----------
         0 |       605        197 |       802
           |     91.11      83.47 |     89.11
-----------+----------------------+----------
         1 |        59         39 |        98
           |      8.89      16.53 |     10.89
-----------+----------------------+----------
     Total |       664        236 |       900
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  10.4736   Pr = 0.001
If these are cohort study data (see box), then we can interpret these data using a “risk difference” measure of effect:
R(exposed)   = Risk(low birthweight | mother smokes)         = 39 / 236 = 0.1653, or 16.5%
R(unexposed) = Risk(low birthweight | mother does not smoke) = 59 / 664 = 0.0889, or 8.9%
Risk Difference = RD = R(exposed) - R(unexposed) = 0.1653 - 0.0889 = 0.0764, or 7.6%
Similarly, if these are cross-sectional data, then we can interpret these data using a “prevalence difference” measure of effect:
P(exposed)   = Prevalence(low birthweight | mother smokes)         = 39 / 236 = 0.1653, or 16.5%
P(unexposed) = Prevalence(low birthweight | mother does not smoke) = 59 / 664 = 0.0889, or 8.9%
Prevalence Difference = PD = P(exposed) - P(unexposed) = 0.1653 - 0.0889 = 0.0764, or 7.6%
If these are case-control data, then the risk cannot be computed (see box).
Study Design Data Layouts

Cohort Study

                          Exposure
Disease             Exposed (1)    Unexposed (0)    Totals*
cases (1)           a (col %)      b (col %)        n1
noncases (0)        c              d                n0
Totals*             N1             N0

Cross-Sectional Study

                          Exposure
Disease             Exposed (1)    Unexposed (0)    Totals*
cases (1)           a (col %)      b (col %)        n1
noncases (0)        c              d                n0
Totals*             n1             n0

Case-Control Study

                          Exposure
Disease             Exposed (1)    Unexposed (0)    Totals*
cases (1)           a (col %)      b (col %)        N1
noncases (0)        c              d                N0
Totals*             n1             n0

*The uppercase N’s (sample sizes) are fixed by the researcher,
and the lowercase n’s are observed totals.
In the case-control study design, risk estimates (such as the proportion of the exposed with disease) cannot be estimated because we fixed the row totals. That is, the number of cases among the exposed was not allowed to vary, by sampling, in proportion to what exists in the population. The column %’s do not reflect the proportion of cases among the exposed in the population. (To recognize this intuitively: we frequently select 50% cases and 50% controls, and it is not likely that the disease incidence is 50% in the population.)
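As a purely hypothetical illustration (these numbers are invented, not from the Wright data): suppose the true risk of disease among the exposed in some population is only 2%, but we deliberately sample 100 cases and 100 controls. If 50 of the cases and 20 of the controls turn out to be exposed, the column percentage for the exposed is 50 / (50 + 20) = 71%, which bears no resemblance to the true 2% risk. That percentage is driven by our 50/50 sampling of cases and controls, not by the frequency of disease in the population.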
The risk and risk difference estimates computed above,
           |        smokes
     lowbw |         0          1 |     Total
-----------+----------------------+----------
         0 |       605        197 |       802
           |     91.11      83.47 |     89.11
-----------+----------------------+----------
         1 |        59         39 |        98
           |      8.89      16.53 |     10.89
-----------+----------------------+----------
     Total |       664        236 |       900
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  10.4736   Pr = 0.001
along with confidence intervals and the chi-square test, can be obtained using the “cs” (cohort study) command, which is part of the collection of commands called “epitab” (if you search for “cs” in Stata’s help, you will see “epitab”), which stands for Tables for Epidemiologists.
Statistics > Epidemiology and related
Tables for epidemiologists
Cohort study risk ratio, etc.
Main tab: Case variable: lowbw
Exposed variable: smokes
OK
cs lowbw smokes
                 |  RECODE of smo         |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |        39          59  |         98
        Noncases |       197         605  |        802
-----------------+------------------------+-----------
           Total |       236         664  |        900
                 |                        |
            Risk |  .1652542    .0888554  |   .1088889
                 |                        |
                 |      Point estimate    |   [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        .0763988        |    .024305     .1284926
      Risk ratio |         1.85981        |   1.276662     2.709327
 Attr. frac. ex. |        .4623108        |   .2167074     .6309046
 Attr. frac. pop |        .1839808        |
                 +-------------------------------------------------
                              chi2(1) =    10.47  Pr>chi2 = 0.0012
We see that these results agree exactly with our crosstabulation analysis performed using the tabulate (tab) command.
From Chapter 5-4, we know that linear regression is a mean difference model, and with covariates is an adjusted mean difference model.
Since a proportion is simply the mean of a 0-1 scored variable (the mean of such a variable is the number of 1’s divided by the number of observations, which is exactly the proportion of 1’s), a linear regression with lowbw as the outcome variable and smokes as the predictor variable should give us the mean difference, or more specifically in this case, the risk difference.
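As a quick check of this idea before running the regression (a sketch; these particular commands are not part of the original steps), the group means of lowbw reproduce the two risks we computed above:

tabstat lowbw, by(smokes) statistics(mean)
* <or>
summarize lowbw if smokes==0    // mean = 0.0889, the nonsmoker risk
summarize lowbw if smokes==1    // mean = 0.1653, the smoker risk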
Statistics > Linear model and related
Linear regression
Model tab: Dependent variable: lowbw
Independent variables: smokes
OK
regress lowbw smokes
      Source |       SS       df       MS              Number of obs =     900
-------------+------------------------------           F(  1,   898) =   10.57
       Model |  1.01627402     1  1.01627402           Prob > F      =  0.0012
    Residual |  86.3126149   898  .096116498           R-squared     =  0.0116
-------------+------------------------------           Adj R-squared =  0.0105
       Total |  87.3288889   899  .097140032           Root MSE      =  .31003

------------------------------------------------------------------------------
       lowbw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   .0763988   .0234953     3.25   0.001     .0302868    .1225108
       _cons |   .0888554   .0120314     7.39   0.000     .0652426    .1124683
------------------------------------------------------------------------------
We see that the regression coefficient for smokes is identical to the risk difference from the cs command, with a very similar p value (the two p values become exactly the same when the sample is very large).
                 |  RECODE of smo         |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |        39          59  |         98
        Noncases |       197         605  |        802
-----------------+------------------------+-----------
           Total |       236         664  |        900
                 |                        |
            Risk |  .1652542    .0888554  |   .1088889
                 |                        |
                 |      Point estimate    |   [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |        .0763988        |    .024305     .1284926
      Risk ratio |         1.85981        |   1.276662     2.709327
 Attr. frac. ex. |        .4623108        |   .2167074     .6309046
 Attr. frac. pop |        .1839808        |
                 +-------------------------------------------------
                              chi2(1) =    10.47  Pr>chi2 = 0.0012
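Another way to see the correspondence (a sketch; lincom is a standard postestimation command, though it is not used elsewhere in this section) is to recover the two fitted risks directly from the regression coefficients, after the regress command above:

lincom _cons              // risk in nonsmokers: 0.0889
lincom _cons + smokes     // risk in smokers:    0.1653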
Major Criticism of Linear Regression for a Dichotomous Outcome
Although we just observed that linear regression for a dichotomous outcome appears to be perfectly reasonable, you will rarely see such a linear regression reported. Many researchers and statisticians are not even aware that it is a feasible approach.
This is because the linear regression model has been criticized in two ways.
The minor criticism is that the residuals are not normally distributed, which is an assumption of linear regression. This is not so bad, and the criticism can basically be ignored.
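If you want to see this for yourself (a sketch; these diagnostic commands are standard Stata but are not part of the original steps), you can examine the residuals from the regression above:

predict resid_lowbw, residuals
histogram resid_lowbw, normal
* With a 0-1 outcome the residuals can take only a few distinct values
* (here, two within each smoking group), so they cannot be normally distributed.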
The major criticism is that it is possible to get predicted risk probabilities less than 0 or greater than 1 (risk is only defined to be between 0 and 1, since risk is a proportion). This is a “big” problem to statisticians, who prefer statistical methods to be consistent, always providing a reasonable solution.
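To see how this can happen, consider a hypothetical prediction equation with a continuous predictor X (the numbers are invented for illustration): predicted risk = -0.04 + 0.02X. At X = 1 the predicted “risk” is -0.04 + 0.02(1) = -0.02, which is below 0, and at X = 60 it is -0.04 + 0.02(60) = 1.16, which is above 1. A straight line is unbounded, so for sufficiently extreme values of X it must eventually leave the 0-1 interval.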
Another shortcoming: Since risk cannot be estimated in a case-control study, the risk difference estimate would be meaningless. Therefore, a different type of regression model is required for that study design.
Let’s verify that linear regression predicts the risk probabilities (the proportions). To see this, we obtain predicted values from the model we just ran.
Statistics > Postestimation
Predictions, residuals, etc.
Main tab: New variable name: pred_lowbw
Produce: Linear prediction (xb)
OK
predict pred_lowbw, xb
* <or>
predict pred_lowbw // since “xb” is default option
Requesting a frequency table for each value of “smokes”,
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: pred_lowbw
by/if/in tab: Repeat command by groups:
Variables that define groups: smokes
OK
* Note: To use “by”, the data must be sorted on the “by” variable.
* This can be done before using “by” (which actually sorts the data in the
* data editor), or on the fly (which leaves the data in the data editor as is).
*
sort smokes
by smokes: tabulate pred_lowbw
* <or>
by smokes, sort : tab pred_lowbw
* <or>
bysort smokes: tab pred_lowbw
-> smokes = 0

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
   .0888554 |        664      100.00      100.00
------------+-----------------------------------
      Total |        664      100.00

-> smokes = 1

     Fitted |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
   .1652542 |        236      100.00      100.00
------------+-----------------------------------
      Total |        236      100.00
We see all nonsmokers were predicted as the nonsmoker risk (0.0889) and all smokers were predicted as the smoker risk (0.1653), consistent with the regression coefficients (the prediction equation).
------------------------------------------------------------------------------
       lowbw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   .0763988   .0234953     3.25   0.001     .0302868    .1225108
       _cons |   .0888554   .0120314     7.39   0.000     .0652426    .1124683
------------------------------------------------------------------------------
display .0888554+.0763988
.1652542
This happens because the regression line goes through the two proportions and the only two values on the x-axis are the two smoking categories.
#delimit ;
twoway (scatter lowbw smokes, mlabel(lowbw))
       (scatter pred_lowbw smokes)
       (lfit lowbw smokes)
       , text(0.129 0 "0.089", size(medium) placement(ne))
         text(0.205 1 "0.165", size(medium) placement(nw))
         legend(off)
         ytitle("Proportion With Low Birthweight")
         xtitle("Mother Smokes")
       ;
#delimit cr
For these data, the linear regression did not predict outside of the 0-1 range.
Jewell (2004, pp 181-183) admits that linear regression can be used to model risk difference (also called excess risk). However, he immediately follows this with a discussion of the potential for predictions outside of the 0-1 range and then goes on to present logistic regression as the model of choice in general (see box).
Jewell (2004, p.183),
“As the interpretation of b reveals, the linear model is most useful for modeling Excess Risk. It is difficult to translate the parameters directly into a Relative Risk or Odds Ratio. A further consequence is that the linear model cannot be directly applied to case-control data since Excess Risk cannot be estimated from such designs without additional information.
There is an additional structural drawback to use of the linear model with binary outcome data. Whatever the value of the parameters a and b (≠ 0), at some values in the range of X, either low values or high values, the model (Equation 12.1) predicts values of px < 0 or px > 1, not permissible for risks. This may not be a practical concern when this occurs for X values that are far from those observed in the population; for example, if the risk of infant mortality is predicted to become negative for birthweights less than 10 g. However, in cases where the risks are either very low or very high, the true values of risk are already very close to these boundaries, and it may be safer to use a model that does not allow for negative risks or risks greater than one.”
Jewell then proposes logistic regression as such a model.
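As a preview of where this chapter is headed: the logistic model replaces the straight line with

p = exp(a + bX) / [1 + exp(a + bX)]

and because exp(a + bX) is always positive, this ratio always lies strictly between 0 and 1, no matter what value a + bX takes.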
Example of a Dataset Where Predicted Probabilities Go Outside 0-1 Interval
We will use the Vaso dataset (vaso.dta). These data, originally published by Finney (1947), were obtained in a carefully controlled study of the effect of the RATE and VOLume of air inspired by human subjects on the occurrence (coded 1) or non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of the fingers.
File > Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on vaso.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\vaso.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use vaso, clear
Fitting a linear regression