AGR206Ch17LOGISTIC.doc Revised: 6/8/04

Chapter 17. Logistic Regression.

17:1  What is Logistic Regression?

In general, logistic regression is a method to classify objects, plots, observations, cases, or individuals (all are synonyms in this subject) into pre-existing non-overlapping classes, categories, or groups. From this point of view, logistic regression has exactly the same goals as discriminant analysis.

Example 1. You want to predict a discrete outcome of operating a farm: the farm is or is not economically sustainable. For this prediction you can use a number of characteristics measured on other farms that operated for a while and either failed or remained in operation, and from them develop a prediction equation that describes the probability of success or failure.

Example 2. A population of animals that live about 4 years is to be described by its age structure. An equation to classify each individual into an age class can be developed from a random sample of individuals tagged at birth over the last 10 years. Length, weight, and other continuous and categorical variables can be screened through logistic regression to produce an optimal age-predicting equation that can be applied to any animal trapped.

17:2  When and why to use Logistic Regression?

As indicated before, logistic regression has the same uses as discriminant analysis, but there are some differences.

1.  The response variable has to be binary or ordinal.

2.  Logistic regression is a non-parametric method in the sense that it requires no specific distribution of the errors or the predictor variables.

3.  Predictors can be continuous, discrete, or combinations of variables.

4.  Non-linear relationships between the response and predictors are accommodated.

5.  Because of its similarity with regression, logistic regression offers easy model-building or variable selection procedures.

6.  The logistic model directly predicts the probability that each object belongs to each group as a function of the values of the predictors.

7.  Parameter estimates are obtained by maximum likelihood methods that require computationally intensive numerical solutions.

These differences suggest that logistic regression is a better choice than discriminant analysis when there are categorical predictors, when the assumption of multivariate normality is not met, when the effects of a predictor on the outcome are not linear, and when a large number of predictors have to be screened for predictive power.

When the assumptions of multivariate normality and linearity are met, discriminant analysis, if applicable, is more efficient than logistic regression.

17:3  Model and assumptions.

Consider a binary response, for example the sex (Y) of an individual before any obvious external dimorphism has developed. Suppose that females tend to have a slightly different shape from males, as evinced by the weight/length ratio (X). Given its shape, the probability that an individual is male is p, so the probability that it is female is q = 1 − p. For practical purposes let Y = 1 when male and Y = 0 when female. Logistic regression determines if and how p varies with shape.

17:3.1  Model

The expected value of Y given X is calculated as usual:

E{Y | X} = 1 · p + 0 · (1 − p) = p

so modeling the expected value of Y as a function of X is the same as modeling the probability p.

Theoretical and practical considerations indicate that the effects of the predictor X on the expected value of Y, if any, can be represented by the logistic model:

p = exp(β0 + β1 X) / [1 + exp(β0 + β1 X)],   or equivalently   logit(p) = ln[p / (1 − p)] = β0 + β1 X

This model is flexible and can represent curves ranging from almost straight and horizontal to almost vertical, step-like transitions, always within the interval [0,1] of valid probability values.

The estimated parameters b0 and b1 can be used to obtain a predicted logit(p), and then a value of p, for any level of X. The linear predictor inside the logit can be quite general, because it can include continuous and discrete variables, and no assumptions are imposed on their distributions.
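
As a minimal sketch of the back-transformation from logit to probability, the following DATA step uses the estimates reported in section 17:5 (b0 = 9.683 and b1 = -1.1448, where the modeled event is female) and a hypothetical range of shape values:

proc print data=predicted; run; /* see the step below */

data predicted;
   b0 = 9.683;                        /* intercept estimate from section 17:5  */
   b1 = -1.1448;                      /* slope estimate for SHAPE              */
   do shape = 4 to 14 by 0.5;         /* hypothetical range of the shape index */
      logit = b0 + b1*shape;          /* predicted logit(p)                    */
      p_female = exp(logit)/(1 + exp(logit));  /* back-transform to p          */
      output;
   end;
run;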

In this example of simple logistic regression, the "back-transformed" values can be plotted against X, and a simple decision rule emerges, as seen in Figure 17-1.

Figure 17-1. Logistic regression of sex on an index of shape. Fictitious data. The figure can be interpreted as showing the proportion of females and males in the population for each value of X.

Based on the model and the figure, a rule is established to classify individuals with shapeX lower than 8.4 as females. The model also gives a continuous measure of error rate as a function of X. A steep curve reflects good discrimination, whereas a flat line shows that the predictor contains no information about the response. Note that the meaning of the scatter of points in these plots is limited, because there is only one predicted probability associated with each point, but no observed probability. Each point is placed horizontally at its observed value of X and vertically at a height chosen at random within the correct region. In Figure 17-1, the points above the blue line are all males, and those below the line are females. Because the heights are randomly chosen, each run of the logistic regression for a given dataset produces a different scatter of points.
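
The 8.4 cutoff can be checked against the estimates reported in section 17:5: the predicted probability equals 0.5 where the logit is zero, so the classification boundary solves 9.683 − 1.1448 X = 0, giving X = 9.683/1.1448 ≈ 8.46.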

Figure 17-2. Simulation of situations where shapeX is a good (left) and a poor (right) predictor of sex. The adequacy of the model is given by the Rsquare or U value, which is 0.75 for the left panel and 0.03 for the right.

17:3.2  Assumptions and limitations

As usual, the training sample has to be a random sample of the population for which the equation will be used. Logistic regression requires no additional assumptions about the distributions of the predictors or predicted variables, so it is quite useful from this point of view. However, if the usual assumptions of multivariate normality are met, discriminant analysis is usually a more efficient and more powerful method.

Some limitations or cautions must be considered for logistic regression.

17:3.2.1  Ratio of observations to predictors

If too many predictors are included relative to the number of cases or observations, the analysis can produce extremely large values for the parameter estimates and their standard errors. This is particularly problematic when there are several nominal predictors that generate a large number of cells or "dummy" variables in the linear model. This situation can be corrected by merging categories and by obtaining more observations, such that all possible cells are represented in the sample. As a guideline, have a minimum of 30 observations per continuous predictor, plus 6-10 for each combination of values of each nominal predictor.

17:3.2.2  Observations or cases per cell

Because the analysis is based on a test of goodness of fit, the presence of cells with expected frequencies smaller than 1, or with fewer than 5 observations, significantly reduces the power of the test. Check all pairs of nominal variables and merge categories as necessary so that all cells have expected frequencies greater than 1 and fewer than 20% of cells have observed frequencies less than 5.

17:3.2.3  Collinearity among predictors

A linear model in the predictors, fitted by maximum likelihood, is at the core of logistic regression. Thus, the method is subject to exactly the same collinearity problems described for multiple linear regression. This can be addressed by a process of backward elimination, whereby all variables and interactions are included in the model at first, and then the least significant interaction is eliminated and the reduced model is run again. Proceed deleting one effect at a time until the model contains only significant interactions, significant simple effects, and non-significant simple effects involved in significant interactions, as sketched below.
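
As a sketch of this backward elimination in PROC LOGISTIC, assuming a hypothetical dataset mydata with a binary response y and predictors x1 and x2 (recent SAS releases accept interaction terms directly in the MODEL statement):

proc logistic data=mydata;
   model y = x1 x2 x1*x2          /* start with all effects and interactions     */
         / selection=backward     /* remove the least significant effect in turn */
           slstay=0.05;           /* significance level required to stay         */
run;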

17:3.2.4  Extreme values of predictors

The equation is sensitive to extreme values of continuous predictors. Although no distribution is assumed for the predictors, they should be explored by standardizing them and flagging observations with absolute standardized values greater than 3, as in the sketch below. Multivariate outliers can also be studied by standard techniques.
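
A minimal sketch of this screening, assuming the sex dataset with the continuous predictor shape:

proc standard data=sex mean=0 std=1 out=zscores;
   var shape;                     /* standardize the predictor to mean 0, sd 1 */
run;

data flagged;
   set zscores;
   extreme = (abs(shape) > 3);    /* flag standardized values beyond +/- 3     */
run;

proc print data=flagged;
   where extreme = 1;             /* list only the flagged observations        */
run;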

17:4  Detection as classification.

This type of analysis is frequent in the health sciences, where individuals have to be "classified" as having or not having a condition or disease based on the result of a test (the X variable or predictor). Frequently, tests involve titration or quantitative measurement of antibodies or chemicals that exist both in individuals with and without the disease.

17:4.1  Structure of the problem.

The population of individuals can be exhaustively partitioned into those who fall in, say, the "infected" class and those who fall in the "not infected" class. Based on more expensive tests or on tracking the evolution of patients, a test has been developed to determine whether people are infected. The test yields a value X, for example the concentration of a certain protein in the blood, which is related to the infection. The application of a logistic regression to assign individuals to "positive" or "negative" groups results in individuals in each of the four possible classes shown in the table below, which contains fictional data.

                                Test result
True state           Positive                 Negative
Infected             Correct: 52              False negative: 8
Not infected         False positive: 20       Correct: 100

In assessing the classification procedure, it is important to take into account the false positives and false negatives, both together and separately. Consider, in the sex example above, how the numbers of females classified as males and of males classified as females change as the "critical" shapeX is varied from the minimum to the maximum. In diagnosis and signal detection, this relationship is important for assessing the consequences of mistakes and for judging the overall performance of the test.

Because the subjects tested are usually not a random sample of the population (people who feel well are less likely to be tested), one has to correct the probabilities to assess how the test would do in the general population. When the training sample does not represent the prevalence of infection in the general population, it still yields correct conditional probabilities within rows: given that a subject is infected (or not), the row frequencies are the probabilities of positive and negative test results. The row frequencies for the table above are presented below; they add up to 1 across the columns of each row.

The correction to determine what proportion of the positives are actually infected in the general population is done on the basis of an a priori estimate of the probability that any subject from the population is infected (the prevalence of the infection). One is interested in finding, among random individuals who test positive, the proportion that is actually infected.

                     Test result
True state           Positive       Negative
Infected             52/60          8/60
Not infected         20/120         100/120

Assuming that the prevalence in the general population is 15%, the proportion of individuals that test positive who are actually infected is:

P(infected | positive) = [0.15 × (52/60)] / [0.15 × (52/60) + 0.85 × (20/120)] = 0.130 / (0.130 + 0.142) ≈ 0.48

The problem is that because the prevalence is usually a small number, the total number of positives becomes highly “contaminated” by false positives, because most of the subjects in the population are not infected.
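
To see this contamination numerically, the following sketch (using the sensitivity and specificity from the table above) applies the same correction over a range of prevalences:

data corrected;
   sens = 52/60;                    /* sensitivity from the table above */
   spec = 100/120;                  /* specificity from the table above */
   do prev = 0.01 to 0.50 by 0.01;  /* hypothetical prevalence values   */
      p_inf_pos = prev*sens /
                  (prev*sens + (1 - prev)*(1 - spec));  /* P(infected | positive) */
      output;
   end;
run;

For a prevalence near 0.01 the corrected proportion drops to about 0.05, even though the test itself is reasonably accurate.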

17:4.2  Measures of usefulness of the classification function.

Two measures, sensitivity and specificity, are calculated for 2x2 tables.

17:4.2.1  Sensitivity

Sensitivity is the probability that the test correctly identifies the presence of infection. In the table above, sensitivity is the proportion of correct positives within the infected individuals, P(positive | infected) or probability of positive given infected (52/60).

17:4.2.2  Specificity

Specificity is the proportion of individuals correctly identified as not being infected. This is the probability of negative given not infected (100/120).

Note that the correction to determine the proportion of true positives in the population depends on both the sensitivity and the specificity: the sensitivity enters the numerator, and the specificity enters the denominator through the false positive rate, 1 − specificity.

17:5  Obtaining and interpreting output with SAS.

17:5.1  SAS code.

proc logistic data=sex;            /* dataset with one row per individual         */
   model sex = shape               /* binary response modeled on the shape index  */
         / corrb ctable;           /* corrb: correlation matrix of the estimates; */
run;                               /* ctable: the classification table below      */
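
Note that, as the Response Profile in the output shows, SAS sorts the response levels and models the probability of the first ordered level, female in this case. In recent SAS releases the modeled event can be chosen explicitly, for example with model sex(event='male') = shape;, if the probability of being male is preferred.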

17:5.2  SAS output.

The LOGISTIC Procedure

Data Set: WORK.SEX

Response Variable: SEX

Response Levels: 2

Number of Observations: 100

Link Function: Logit

Response Profile

 Ordered
   Value    SEX        Count

       1    female        50
       2    male          50

Model Fitting Information and Testing Global Null Hypothesis BETA=0

                             Intercept
              Intercept            and
Criterion          Only     Covariates    Chi-Square for Covariates

AIC             140.629        116.517    .
SC              143.235        121.727    .
-2 LOG L        138.629        112.517    26.112 with 1 DF (p=0.0001)
Score                 .              .    23.136 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                 Parameter   Standard        Wald        Pr >   Standardized     Odds
Variable    DF    Estimate      Error  Chi-Square  Chi-Square       Estimate    Ratio

INTERCPT     1      9.6830     2.2634     18.3013      0.0001              .        .
SHAPE        1     -1.1448     0.2668     18.4147      0.0001      -0.686124    0.318

Association of Predicted Probabilities and Observed Responses

Concordant = 77.6% Somers' D = 0.553

Discordant = 22.3% Gamma = 0.554

Tied = 0.1% Tau-a = 0.279

(2500 pairs) c = 0.777

Estimated Correlation Matrix

Variable INTERCPT SHAPE

INTERCPT 1.00000 -0.99487

SHAPE -0.99487 1.00000

Classification Table

             Correct         Incorrect               Percentages
         --------------   --------------   ----------------------------------
 Prob            Non-             Non-              Sensi-  Speci-  False  False
Level    Event   Event    Event   Event    Correct  tivity  ficity    POS    NEG
------------------------------------------------------------------------------

0.040 50 0 50 0 50.0 100.0 0.0 50.0 .

0.060 50 1 49 0 51.0 100.0 2.0 49.5 0.0

0.080 50 2 48 0 52.0 100.0 4.0 49.0 0.0

0.100 50 4 46 0 54.0 100.0 8.0 47.9 0.0

0.120 50 6 44 0 56.0 100.0 12.0 46.8 0.0

0.140 49 9 41 1 58.0 98.0 18.0 45.6 10.0

0.160 49 11 39 1 60.0 98.0 22.0 44.3 8.3

0.180 49 12 38 1 61.0 98.0 24.0 43.7 7.7

0.200 48 12 38 2 60.0 96.0 24.0 44.2 14.3

0.220 47 14 36 3 61.0 94.0 28.0 43.4 17.6

0.240 47 15 35 3 62.0 94.0 30.0 42.7 16.7

0.260 45 17 33 5 62.0 90.0 34.0 42.3 22.7

0.280 44 20 30 6 64.0 88.0 40.0 40.5 23.1

0.300 44 20 30 6 64.0 88.0 40.0 40.5 23.1

0.320 44 20 30 6 64.0 88.0 40.0 40.5 23.1

0.340 43 20 30 7 63.0 86.0 40.0 41.1 25.9

0.360 43 21 29 7 64.0 86.0 42.0 40.3 25.0

0.380 41 25 25 9 66.0 82.0 50.0 37.9 26.5

0.400 40 26 24 10 66.0 80.0 52.0 37.5 27.8

0.420 36 26 24 14 62.0 72.0 52.0 40.0 35.0

0.440 36 28 22 14 64.0 72.0 56.0 37.9 33.3

0.460 35 30 20 15 65.0 70.0 60.0 36.4 33.3

0.480 35 32 18 15 67.0 70.0 64.0 34.0 31.9

0.500 34 33 17 16 67.0 68.0 66.0 33.3 32.7

0.520 34 33 17 16 67.0 68.0 66.0 33.3 32.7

0.540 31 37 13 19 68.0 62.0 74.0 29.5 33.9

0.560 31 37 13 19 68.0 62.0 74.0 29.5 33.9

0.580 30 39 11 20 69.0 60.0 78.0 26.8 33.9

0.600 29 41 9 21 70.0 58.0 82.0 23.7 33.9

0.620 28 42 8 22 70.0 56.0 84.0 22.2 34.4