STAT 460 Lab 10 Turn in Sheet 11/22/2004

To receive credit for this lab, turn this sheet in before leaving the lab.

Name: ______

Lab Section: ____

1. What is the p-value? What null hypothesis is rejected?

2. What is a one unit change for each of the two variables?

3. Interpret the cross-tabulation.

4. List one or two things that are still unclear to you.


STAT 460 Lab 10 Instructions 11/22/2004

Goals: In this lab you will learn to perform chi square tests and learn about logistic regression.

Part I. Chi-square test of independence

Summary:

Chi-square tests are used to test whether two categorical variables measured on a group of subjects are independent. A table of cell counts for each combination of the levels of X and Y, called a contingency table, is usually produced first. If X and Y are independent, then the probability distribution of X is the same for each level of Y (and vice versa). The chi-square statistic is the sum over all cells of (observed – expected)^2/expected, where “expected” is the expected number of subjects in each cell if X and Y are independent. The null distribution of the chi-square statistic is approximately the chi-square distribution with (r-1)*(c-1) df, where r is the number of rows and c is the number of columns in the contingency table.

a)  Categorical explanatory and categorical outcome (response) variables or two categorical outcome variables

b)  Null hypothesis: X is independent of Y or Pr(Y=1|X=0)=Pr(Y=1|X=1)=…

c)  Construct a contingency table which counts numbers of subjects for each combination of levels of variable X and variable Y:

d)  Expected value (under the null hypothesis) = (row count * column count)/(total count)

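e)  Chi-square statistic is X^2 = Σ (observed – expected)^2 / expected, where the sum runs over all cells of the contingency table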

f)  Under H0, X^2 follows a chi-square (χ^2) distribution with (r-1)*(c-1) d.f.

g)  SAS (Statistics/Table Analysis with Statistics:ChiSquare):
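Steps (c) through (f) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what SAS runs internally: the function name `chi_square_independence` and the 2x3 table of counts are made up for the example.

```python
def chi_square_independence(table):
    """Return (statistic, df, expected) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: (row total * column total) / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Sum (observed - expected)^2 / expected over every cell
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical counts, for illustration only (not the fiber data)
observed = [[10, 20, 30],
            [20, 20, 20]]
statistic, df, expected = chi_square_independence(observed)
print(round(statistic, 3), df)
print(expected)
```

The p-value is then the upper-tail area of the chi-square distribution with `df` degrees of freedom beyond `statistic`, which SAS reports as “Prob”.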

Task 1: Chi-square for Fiber

A manufacturer was considering marketing crackers high in a certain kind of edible fiber as a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that they would feel less hungry and eat less. A laboratory studied whether people would in fact eat less in this way.

Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber, both, and a control cracker) and were then allowed to eat as much as they wished from a prepared menu. The amount of food they consumed and their weight were monitored, along with any side effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric upset from some of the fiber crackers. A contingency table of "Cracker" versus "Bloat" shows the relationship between the four types of cracker and the four levels of bloating severity reported by the subjects (1 = HIGH, 2 = LOW, 3 = MEDIUM, 4 = NO bloating).

We will use a Chi-Square test to see whether Bloating is independent of Cracker (the type of fiber eaten).


a.  Load fiber.txt into SAS.

b.  Create a contingency table and run the chi-square test as follows. From the menu choose Statistics/Table Analysis. Enter cracker as rows and bloat as columns. Under Statistics select Chi Square and Exact Test. Under Table add expected counts and row and column percents.

c.  In the cross-tabulation results, notice, e.g.,

1.  the observed count for no bloating with the bran cracker is 7

2.  the expected count for that cell under the null is 4.25 = (12 x 17)/48

3.  the percent of subjects with no bloat in the four different fiber types is (bran) 58.3, (combo) 16.7, (control) 50.0 and (gum) 16.7%.

4.  Row Pct gives the conditional distributions of BLOAT given CRACKER; e.g., the conditional distribution of BLOAT given CRACKER = bran is (0, 0.333, 0.083, 0.583), and these values sum to 1 (up to rounding). In particular, P(bloat=2|cracker=bran)=0.333.

i.  What is the conditional distribution P(bloat| cracker= control)?

5.  Col Pct gives the conditional distributions of CRACKER given BLOAT; e.g., the conditional distribution of CRACKER given BLOAT = 2 is (0.267, 0.333, 0.267, 0.133), and these values sum to 1. In particular, P(cracker=combo|bloat=2)=0.333.

i.  What is the conditional distribution P(cracker|bloat=1)?

6.  Odds of no bloating: the number of subjects with no bloating divided by the number of subjects with bloating, to 1; e.g., 17 to 31, or 17/31 to 1 (17 of the 48 subjects reported no bloating)

7.  Relative risk of being bloated based on whether you are eating a control cracker or not: (10/12)/(21/36)=1.43 to 1

i.  You can do the above calculation manually, or create a collapsed 2x2 table to get the summarized counts and then calculate the risk. Use Data/Transform/Recode Values to recode BLOAT levels 1, 2, 3 to 1 and level 4 to 0, and CRACKER control to 0 and the rest to 1.

ii. Under Reports/Tables, choose Row classes/Column classes.

8.  Increased risk is 10/12 – 21/36 = 0.25

9.  Odds Ratio = (2*21)/(10*15) = 7/25 = 0.28
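The arithmetic in items 7 through 9 can be checked with a short Python sketch. It takes the counts exactly as they appear in the expressions above (10 of 12 control subjects and 21 of 36 other subjects in the first outcome category, 2 and 15 in the second); the variable names are illustrative, and which outcome category each count refers to depends on the recoding you used for the collapsed 2x2 table.

```python
# Counts as (first outcome category, second outcome category),
# read off the expressions 10/12, 21/36 and (2*21)/(10*15)
control = (10, 2)    # control cracker: 12 subjects
other = (21, 15)     # all other crackers: 36 subjects

risk_control = control[0] / sum(control)   # 10/12
risk_other = other[0] / sum(other)         # 21/36

relative_risk = risk_control / risk_other
risk_difference = risk_control - risk_other
odds_ratio = (control[1] * other[0]) / (control[0] * other[1])   # (2*21)/(10*15)

print(round(relative_risk, 2), round(risk_difference, 2), round(odds_ratio, 2))
# prints: 1.43 0.25 0.28
```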

d.  In the Chi Square tests, the “Value” is the chi-square test statistic and the “Prob” is the asymptotic p-value. The word asymptotic indicates that the null sampling distribution used to convert the statistic to a p-value is only an approximation, and the approximation is poor when the sample size is small. Many people consider the approximation inappropriate when some cells have expected counts below 5, but somewhat smaller values are usually OK.

e.  (♠1) What is the p-value? What null hypothesis is rejected?

f.  One solution to the problem of low cell counts is to combine categories. Use Data/Transform/Recode Values. Create bloat2 (the New Column Name) from bloat. Note that 2 and 4 are the codes for “low” and “none”. Use Original and New Values to recode 2 and 4 to 0 (no bloat) and 1 and 3 to 1 (bloat). Rerun the analysis and re-interpret. This p-value is more reliable because few cells now have counts much below 5. What kind of cracker would you avoid?


Part II. Binary Logistic Regression

Summary:

Logistic regression is a powerful way to relate one or more explanatory variables to a binary (categorical) outcome. It mirrors linear regression, but the linear combination of coefficients and explanatory variables, β0+β1X1+…+βkXk, instead of representing the mean outcome, represents the log odds of “success”.

The odds equal probability/(1-probability). Odds run between 0 and infinity. Log odds run between –infinity and +infinity, so every possible linear combination corresponds to a valid probability between 0 and 1. To convert from the linear combination (η) to probability, the formula is exp(η)/(1+exp(η)).
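The conversions in the paragraph above can be written as three small Python functions. This is a minimal sketch; the function names are made up for illustration, and the probabilities 0.2, 0.5 and 0.75 are just convenient examples.

```python
import math

def odds_from_prob(p):
    """Odds = probability / (1 - probability)."""
    return p / (1 - p)

def log_odds_from_prob(p):
    """Natural log of the odds (the logit)."""
    return math.log(odds_from_prob(p))

def prob_from_log_odds(eta):
    """Logistic formula: exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))

for p in (0.2, 0.5, 0.75):
    eta = log_odds_from_prob(p)
    # Converting back from the log odds recovers the original probability
    print(p, round(odds_from_prob(p), 2), round(eta, 2), round(prob_from_log_odds(eta), 2))
```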

When making predictions with logistic regression equations, we have “additive” effects on the log odds scale. E.g., if b1=2, then we estimate that the log odds of success increase by 2 for each one-unit increase in x1. If we want to work on the odds scale, the properties of logs tell us that we now have multiplicative effects. Using the same example, exp(2)=7.39, so the odds of success get multiplied by 7.39 for each one-unit increase in x1. There is no easy way to express the change in probability of success for a one-unit increase in x1; the best you can do is to calculate the probability predicted by the model for several meaningful values of x1.
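To see why the change in probability has no single value, the following Python sketch applies the same one-unit increase (b1=2) at several starting points. The starting log odds are hypothetical, chosen only to span low, middle and high probabilities.

```python
import math

def prob_from_log_odds(eta):
    """Logistic formula, written in the equivalent form 1 / (1 + exp(-eta))."""
    return 1 / (1 + math.exp(-eta))

b1 = 2.0                          # slope from the example in the text
odds_multiplier = math.exp(b1)    # about 7.39, the same at every starting point

changes = []
for eta in (-4.0, -1.0, 0.0, 3.0):    # hypothetical starting log odds
    before = prob_from_log_odds(eta)
    after = prob_from_log_odds(eta + b1)
    changes.append(after - before)
    print(f"log odds {eta:+.1f} -> {eta + b1:+.1f}: probability {before:.3f} -> {after:.3f}")
```

The log odds always rise by 2 and the odds are always multiplied by about 7.39, but the change in probability differs from row to row.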

The assumptions of logistic regression include that for any group of subjects with some fixed combination of explanatory variables, the outcome follows the binomial distribution bin(n,p) where n is the group size and p is the predicted probability of “success”. (The binomial distribution is just that of flipping an unfair coin with heads probability equal to p.) Since we often don’t have any groups of subjects with identical explanatory variables, the Hosmer-Lemeshow goodness of fit test makes groups of similar subjects, then tests whether the groups are consistent with the binomial distribution. A low p-value suggests that the model is suspect, either through an inappropriate selection of explanatory variables (e.g., a missing interaction) or through an outcome probability process that is not inherently binomial.

a.  Generalized Linear Model approach looks like regression but the linear combination of coefficients and explanatory variables is related to the outcome through a “link function”, usually symbolized as g( ). Explanatory variables can be continuous, coded categorical, transformed for non-linearity, and multiplied across variables for interaction.

b.  Let η=β0+β1X1+…+βkXk where η is pronounced “eta”, and k is the number of explanatory variables (including expansions of categorical explanatory variables with more than two levels). Remember, in ordinary regression, μ(Y|X)= η.

c.  Let π=Pr(Y=1|X) be the probability of “success” for any combination of explanatory variables.

d.  The link function for logistic regression is g(π)=log(π/(1-π)) which is the log odds of success or “logit” of the success probability.

e.  Note: Odds(Y)=Pr(Y)/(1-Pr(Y)); e.g., for p=0.2, 0.5, 0.75 the odds are 0.25, 1, 3 and the log(odds) are -1.39, 0, 1.10.

f.  In logistic regression g(π)=η or log(π/(1-π)) = β0+β1X1+…+βkXk. Here is a plot of the relationship between one explanatory variable and the probability of success when all of the other explanatory variables are held constant:

g.  To estimate the probability of success for any combination of explanatory variables, calculate the log odds of success, η, then use the logistic formula, p = exp(η) / (1+ exp(η)). Note: We can break this down to odds=exp(log odds) and probability=odds/(1+odds).

h.  Assumptions:

i.  Binomial outcome,

ii. logistic relationship for Pr(Y=1) vs. each X,

iii.  variance(Y|X)=Pr(Y=1|X)*(1-Pr(Y=1|X)),

iv.  “fixed” X,

v. independent errors.

i.  Summary: Logistic regression handles binary outcomes by working on the scale of “log odds of success”.

j.  SAS results (Statistics/Regression/Logistic)

Task 2: Logistic regression

a.  Load donner.txt into SAS. Perform some appropriate EDA.

b.  Scientific hypothesis: women are better able to survive harsh conditions than men (children excluded from analysis)

c.  Statistical model: LogOdds(survival | age, gender) = β0 + βage·Age + βfemale·Female,

H0: βfemale=0

βfemale is the change in log odds of survival when comparing a female to a male of the same age.

βage is the change in log odds of survival when comparing a person to another person of the same gender who is 1 year older.

d.  Perform the logistic regression to model the log odds of survival on the age and gender of the adult pioneers. Use Regression/Logistic. Set Survived as the Dependent variable and enter both covariates, Female and Age, as Quantitative. Set Model Probability to 1.

e.  Output Analysis:

Model Information

Data Set                     _PROJ_.DONNER
Response Variable            SURVIVED
Number of Response Levels    2
Number of Observations       45
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile

Ordered                 Total
  Value   SURVIVED  Frequency
      1   1                20
      2   0                25

Probability modeled is SURVIVED='1'.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

                         Intercept
            Intercept       and
Criterion     Only      Covariates
AIC          63.827       57.256
SC           65.633       62.676
-2 Log L     61.827       51.256

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      10.5703    2       0.0051
Score                  9.0965    2       0.0106
Wald                   6.8627    2       0.0323
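The likelihood ratio test and the fit criteria in this output can be reproduced from the two -2 Log L values. The Python sketch below assumes the standard definitions AIC = -2 Log L + 2k and SC = -2 Log L + k·log(n), with k parameters and n observations; these reproduce the printed values. It also uses the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability is exp(-x/2). The variable names are illustrative.

```python
import math

neg2logL_intercept_only = 61.827   # -2 Log L, Intercept Only column
neg2logL_full = 51.256             # -2 Log L, Intercept and Covariates column
n_obs = 45
n_params_full = 3                  # intercept, AGE, FEMALE

# Likelihood ratio statistic: drop in -2 Log L when the covariates are added
lr_statistic = neg2logL_intercept_only - neg2logL_full
lr_df = n_params_full - 1          # two coefficients are tested (AGE and FEMALE)
# Closed form valid only for 2 df: P(chi-square > x) = exp(-x / 2)
lr_p_value = math.exp(-lr_statistic / 2)

aic_full = neg2logL_full + 2 * n_params_full
sc_full = neg2logL_full + n_params_full * math.log(n_obs)

print(round(lr_statistic, 3), lr_df, round(lr_p_value, 4))
print(round(aic_full, 3), round(sc_full, 3))
```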

f.  The Analysis of Maximum Likelihood Estimates table has a line for the intercept (constant) and for each explanatory variable. Check the columns for the coefficient (Estimate), its standard error, and the Wald chi-square statistic, (B/SE(B))^2. Note that two of the p-values are significant at the 0.05 level. The intercept represents the log odds of survival for zero-year-old males. Why is this not a worthwhile quantity to interpret?

Analysis of Maximum Likelihood Estimates

                          Standard        Wald
Parameter  DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept   1    1.6331     1.1102      2.1637      0.1413
AGE         1   -0.0782     0.0373      4.3988      0.0360
FEMALE      1    1.5973     0.7555      4.4699      0.0345

The LOGISTIC Procedure

Odds Ratio Estimates

           Point        95% Wald
Effect  Estimate   Confidence Limits
AGE        0.925     0.860     0.995
FEMALE     4.940     1.124    21.716

Association of Predicted Probabilities and Observed Responses

Percent Concordant   73.0    Somers' D   0.492
Percent Discordant   23.8    Gamma       0.508
Percent Tied          3.2    Tau-a       0.248
Pairs                 500    c           0.746
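The Wald chi-square values and the odds ratio table follow directly from the estimates and standard errors. The Python sketch below recomputes them, assuming the 95% limits are exp(estimate ± 1.96 × standard error). Because it starts from the rounded values printed in the output, the results can differ from SAS in the last digit or two.

```python
import math

# (estimate, standard error) as printed in the output
estimates = {
    "Intercept": (1.6331, 1.1102),
    "AGE": (-0.0782, 0.0373),
    "FEMALE": (1.5973, 0.7555),
}
z = 1.96  # approximate 97.5th percentile of the standard normal

results = {}
for name, (b, se) in estimates.items():
    wald = (b / se) ** 2             # Wald chi-square, 1 df
    odds_ratio = math.exp(b)         # point estimate of the odds ratio
    lower = math.exp(b - z * se)     # lower 95% Wald limit
    upper = math.exp(b + z * se)     # upper 95% Wald limit
    results[name] = (wald, odds_ratio, lower, upper)
    print(f"{name:9s} Wald={wald:.3f} OR={odds_ratio:.3f} 95% CI=({lower:.3f}, {upper:.3f})")
```

Neither interval for AGE or FEMALE contains 1, which agrees with the two Wald p-values below 0.05.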

g.  The odds ratio point estimate, Exp(B), is the factor by which the odds are multiplied when the explanatory variable goes up one unit. (♠2) What is a one unit change for each of the two variables?

1.  Prediction example: 21 year old male

Eta: η = 1.633 - 0.078(21) = -0.005

Odds(survival) = exp(-0.005)=0.995 (1 survives for every 1 that dies)

Pr(survival) = 0.995 / (1+0.995) = 0.499

2.  Prediction example: 21 year old female

Eta: η = 1.633 - 0.078(21) +1.597 = 1.592

Odds(survival) = exp(1.592)=4.914 (5 survive for every 1 that dies)

Pr(survival) = 4.914 / (1+4.914) = 0.831

3.  Comparing odds (gender): exp(1.597)=4.938, so the odds for a female are 4.938 times the odds for a male of the same age: 0.995*4.938≈4.91.

4.  Comparing odds (per year): exp(-0.078)=0.925, so the odds are 0.925 times as large (for either gender) for each 1 year increase in age.

5.  Comparing odds (per decade): exp(-0.78)=0.458, so the odds are 0.458 times as large (for either gender) for each 10 year increase in age.
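The prediction examples above can be reproduced with a short Python sketch. It uses the coefficients rounded to three decimals, as in the worked examples; the function name `predict` is made up for illustration.

```python
import math

# Coefficients rounded to three decimals, as in the worked examples
b0, b_age, b_female = 1.633, -0.078, 1.597

def predict(age, female):
    """Return (eta, odds, probability) of survival; female is 1 or 0."""
    eta = b0 + b_age * age + b_female * female   # log odds
    odds = math.exp(eta)
    prob = odds / (1 + odds)
    return eta, odds, prob

male = predict(21, 0)
female = predict(21, 1)
for label, (eta, odds, prob) in (("male", male), ("female", female)):
    print(f"21-year-old {label}: eta={eta:.3f} odds={odds:.3f} Pr(survival)={prob:.3f}")

# Multiplicative change in odds for a 10-year increase in age
odds_ratio_per_decade = math.exp(10 * b_age)
print(round(odds_ratio_per_decade, 3))
```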

h.  Now rerun the logistic regression with the recoded FEMALE as a class variable. Be sure to recode the values so that the “Female” coefficient reflects female relative to male, not vice versa. Under Model select Backward Elimination (automatic selection of a best model). Under Statistics choose Goodness-of-fit. Under Prediction, choose Predict original sample and save the predictions.

i.  First look at the Class Level Information. The design variable is a dummy-type code: 1 for females and -1 for males. Now look in Analysis of Maximum Likelihood Estimates for the coefficients of the model. Compare them to the coefficients from the first model we ran. How does the interpretation change?

1.  For example: Prediction example: 21 year old male

Eta: η = 2.4318 - 0.078(21) + 0.799(-1) = -0.005
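The two codings describe the same fitted model. The Python sketch below assumes the only change is the coding of gender (+1 for female, -1 for male, instead of 1 and 0). In that case the new gender coefficient is half the old one and the intercept shifts up by the same half, while the age coefficient is unchanged. The names are illustrative.

```python
# Coefficients from the first fit, where FEMALE is coded 1 (female) / 0 (male)
b0, b_age, b_female = 1.6331, -0.0782, 1.5973

# Equivalent coefficients when the design variable is +1 (female) / -1 (male)
b_female_effect = b_female / 2     # about 0.7987
b0_effect = b0 + b_female / 2      # about 2.4318, the intercept used above

def eta_dummy(age, female01):
    """Log odds under 0/1 coding."""
    return b0 + b_age * age + b_female * female01

def eta_effect(age, female_pm1):
    """Log odds under +1/-1 coding."""
    return b0_effect + b_age * age + b_female_effect * female_pm1

for age in (21, 50):
    # Each pair of values agrees: (male, male) then (female, female)
    print(age,
          round(eta_dummy(age, 0), 4), round(eta_effect(age, -1), 4),
          round(eta_dummy(age, 1), 4), round(eta_effect(age, +1), 4))
```

Under the +1/-1 coding, the female versus male difference in log odds is twice the gender coefficient, so the odds ratio is exp(2 × coefficient) rather than exp(coefficient).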

j.  The classification table is worthwhile if classification of new cases is your goal (but future cases tend to be classified less accurately than the cases used to fit the model). Here our main interest is interpretation of the coefficients to test the hypothesis that females survive better than males after correcting for age.

k.  Graphical summary of model results

l.  Calculate the eta (η) values, which are the log odds of survival. Then calculate exp(η) to get the odds of survival. Then calculate odds/(1+odds) to find the probability of survival.

Age / Female / LogOdds / Odds / Probability
25 / 0
25 / 1
50 / 0
50 / 1
