Count Data – Generalized Linear Models –Poisson Regression

Poisson Regression

Data:

Units: NASCAR Winston Cup Races (1975-1979 = n=151 races)

Dependent Variable: Y = # of Caution Flags (e.g. crashes) (CAUTIONS)

Independent Variables:

  • X1 = # of Drivers in the Race (DRIVERS)
  • X2 = Circumference of the Track (TRKLENGTH)
  • X3 = # of Laps in the Race (LAPS)

Random Component: Poisson Distribution for the # of accidents

Density Function:

Link Function: g() = log()

Systematic Component:

Testing for the overall model:

H0: 1 = 2 = 3 = 0 (# of crashes is independent of all predictors)

HA: Not all i = 0

Test Statistic: Xobs2 = -2(L0-L1)

Rejection Region: Xobs2

Where L0 is the maximized log likelihood function under H0 and L1 is the maximized log likelihood under HA.

  • H0: g( L0 = 410.8784 (SAS Code and output below)
  • HA: g() = 1X1 + 2X2 + 3X3 L1 = 433.0160
  • Xobs2 = -2(410.8784-433.0160) = 44.28
  • Strong evidence that number of crashes is associated with predictors

Testing for individual (partial) regression coefficients:

  • H0: j = 0 HA: j 0 (can clearly do > or < as well)
  • Test Statistic: (some software packages report z2)
  • Rejection Region: |zobs|  z/2 (obvious adjustment for 1-sided tests)

Predictor

/ bj / SE(bj) / Zobs
Drivers / 0.0365 / 0.0125 / 2.92
Trklength / 0.1145 / 0.1684 / 0.68
Laps / 0.0026 / 0.0008 / 3.25

Note that as the number of drivers increases, number of crashes increases, similarly, as the number of laps increases, crashes increase. Controlling for Drivers and Laps, Trklength is not associated with crashes. Simpler Model:

Testing for Goodness of Fit for the Poisson Model

If only a few categories (combinations of levels are such that exceeds 5 in most cells) of the independent variables, the goodness of fit can be computed directly from the and likelihood ratio goodness of fit statistics as:

These are approximately chi-square for fixed number of categories under the restriction on fitted values above. The degrees of freedom are the total number of cstegories minus the number of model parameters.

When there are many distinct levels of the independent variable(s) with few observations within each category, data need to be collapsed into 10 (say) groups, ordered based on their fitted values. Then the observed and fitted counts can be compared within each category.

Yhat (Races) / Obs Crashes / Fit Crashes / Pearson Res / Chi-Square
<3.50 (15) / 37 / 46.16 / -1.35 / 1.82
3.50-3.80 (15) / 67 / 54.27 / 1.73 / 2.99
3.80-4.08 (16) / 62 / 63.78 / -0.22 / 0.05
4.08-4.25 (15) / 53 / 62.97 / -1.26 / 1.58
4.25-4.42 (18) / 69 / 78.12 / -1.03 / 1.06
4.42-5.15 (17) / 100 / 81.77 / 2.02 / 4.07
5.15-5.50 (15) / 88 / 78.74 / 1.04 / 1.09
5.50-6.25 (15) / 91 / 87.84 / 0.34 / 0.11
6.25-6.70 (14) / 94 / 91.36 / 0.28 / 0.08
>6.70 (11) / 63 / 79.01 / -1.80 / 3.24
Total (151) / 724 / 724.00 / ---- / 16.08

The critical chi-square value (with 10-3 = 7 df) is approximately 14.07.

Since the chi-square statistic exceeds that, we do not feel confident the model provides a good fit.

Note: See Agresti (1996) Section 4.4 for details.

SAS Code and Output:

options ps=54 ls=76;

data one;

input serrace 6-8 year 13-16 searace 23-24 drivers 31-32 trklength 34-40 laps 46-48 road 56 cautions 63-64 leadchng 71-72;

cards;

1 1975 1 35 2.54 191 1 5 13

...

151 1979 31 37 2.5 200 0 6 35

;

run;

/* Data set one contains the data for analysis. Variable names and

column specs are given in INPUT statement. I have included ony

first and last observations */

/* The following model fits a Generalized Linear model,

with poisson random component, and a constant mean:

g(mu)=alpha is systematic component,

g(mu)=log(mu) is the link function:

mu=e**alpha */

proc genmod;

model Cautions = / dist=poi link=log;

run;

/* The following model fits a Generalized Linear model,

with poisson random component,

g(mu)=alpha + beta1*drivers + beta2*trkength + beta3*laps is systematic component,

g(mu)=log(mu) is the link function:

mu=e**alpha + beta1*drivers + beta2*trkength + beta3*laps */

proc genmod;

model Cautions = drivers trklength laps / dist=poi link=log;

run;

quit;

Poisson Model – g(mu)=log(mu) = ea

The GENMOD Procedure

Model Information

Data Set WORK.ONE

Distribution Poisson

Link Function Log

Dependent Variable cautions

Observations Used 151

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 150 215.4915 1.4366

Scaled Deviance 150 215.4915 1.4366

Pearson Chi-Square 150 201.6050 1.3440

Scaled Pearson X2 150 201.6050 1.3440

Log Likelihood 410.8784

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Chi-

Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq

Intercept 1 1.5675 0.0372 1.4947 1.6404 1778.93 <.0001

Scale 0 1.0000 0.0000 1.0000 1.0000

Poisson Model – g(mu)=log(mu) = ea+b1X1+b2X2=B3x3

The GENMOD Procedure

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 147 171.2162 1.1647

Scaled Deviance 147 171.2162 1.1647

Pearson Chi-Square 147 158.8281 1.0805

Scaled Pearson X2 147 158.8281 1.0805

Log Likelihood 433.0160

Algorithm converged.

Analysis Of Parameter Estimates

Standard Wald 95% Chi-

Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq

Intercept 1 -0.7963 0.4117 -1.6032 0.0107 3.74 0.0531

drivers 1 0.0365 0.0125 0.0120 0.0610 8.55 0.0035

trklength 1 0.1145 0.1684 -0.2156 0.4446 0.46 0.4966

laps 1 0.0026 0.0008 0.0010 0.0041 10.82 0.0010

Scale 0 1.0000 0.0000 1.0000 1.0000

NOTE: The scale parameter was held fixed.

Sources:

Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley, NY.

Computational Approach to Obtaining Poisson Regression Analysis

Poisson probability density function:

Sytematic Component:

Link Function: g() = log()  = eg(

For subject i, g(i) = xi’

Where: X = xi =  =

With Likelihood Function:

and log-likelihood function and derivative wrt :

Setting the vector of first derivatives to 0 gives the equations:

where Y and  are nx1 vectors and 0 is a 4x1 vector of 0’s (its length is the number of parameters in the regression model).

The second derivative matrix of the log likelihood and its negative (Hessian matrix) are obtained:

where W is a nxn diagonal matrix with elements i

Defining g as the negative of the first derivative of the log likelihood function, we can iteratively estimate  via the Newton-Raphson Algorithm. In matrix form:

Beginning with a coefficient vector of [5 0 0 0]’

The algorithm converges to those given above in 8 steps (where convergence is determined by all elements of the estimated coefficient matrix changing by less than 0.0001 in absolute value between iterations.

The (approximate) large-sample estimated variance-covariance matrix is obtained by plugging the final estimated regression coefficients into W and computing

Estimated standard errors are obtained by taking square roots of the diagonal elements.

Results from running analysis in SAS PROC/IML (the log likelihood evaluation does not include the log(yi!) terms:

BETA_NEW SEBETA

-0.79627 0.4117015

0.0365253 0.0124934

0.1144986 0.1684265

0.0025963 0.0007893

VBETA LOGLIKE

0.1694981 -0.001558 -0.027334 -0.000207 433.01601

-0.001558 0.0001561 -0.001549 -5.142E-6

-0.027334 -0.001549 0.0283675 0.0001198

-0.000207 -5.142E-6 0.0001198 6.23E-7