Count Data – Generalized Linear Models –Poisson Regression
Poisson Regression
Data:
Units: NASCAR Winston Cup Races (1975-1979 = n=151 races)
Dependent Variable: Y = # of Caution Flags (e.g. crashes) (CAUTIONS)
Independent Variables:
- X1 = # of Drivers in the Race (DRIVERS)
- X2 = Circumference of the Track (TRKLENGTH)
- X3 = # of Laps in the Race (LAPS)
Random Component: Poisson Distribution for the # of accidents
Density Function:
Link Function: g() = log()
Systematic Component:
Testing for the overall model:
H0: 1 = 2 = 3 = 0 (# of crashes is independent of all predictors)
HA: Not all i = 0
Test Statistic: Xobs2 = -2(L0-L1)
Rejection Region: Xobs2
Where L0 is the maximized log likelihood function under H0 and L1 is the maximized log likelihood under HA.
- H0: g( L0 = 410.8784 (SAS Code and output below)
- HA: g() = 1X1 + 2X2 + 3X3 L1 = 433.0160
- Xobs2 = -2(410.8784-433.0160) = 44.28
- Strong evidence that number of crashes is associated with predictors
Testing for individual (partial) regression coefficients:
- H0: j = 0 HA: j 0 (can clearly do > or < as well)
- Test Statistic: (some software packages report z2)
- Rejection Region: |zobs| z/2 (obvious adjustment for 1-sided tests)
Predictor
/ bj / SE(bj) / ZobsDrivers / 0.0365 / 0.0125 / 2.92
Trklength / 0.1145 / 0.1684 / 0.68
Laps / 0.0026 / 0.0008 / 3.25
Note that as the number of drivers increases, number of crashes increases, similarly, as the number of laps increases, crashes increase. Controlling for Drivers and Laps, Trklength is not associated with crashes. Simpler Model:
Testing for Goodness of Fit for the Poisson Model
If only a few categories (combinations of levels are such that exceeds 5 in most cells) of the independent variables, the goodness of fit can be computed directly from the and likelihood ratio goodness of fit statistics as:
These are approximately chi-square for fixed number of categories under the restriction on fitted values above. The degrees of freedom are the total number of cstegories minus the number of model parameters.
When there are many distinct levels of the independent variable(s) with few observations within each category, data need to be collapsed into 10 (say) groups, ordered based on their fitted values. Then the observed and fitted counts can be compared within each category.
Yhat (Races) / Obs Crashes / Fit Crashes / Pearson Res / Chi-Square<3.50 (15) / 37 / 46.16 / -1.35 / 1.82
3.50-3.80 (15) / 67 / 54.27 / 1.73 / 2.99
3.80-4.08 (16) / 62 / 63.78 / -0.22 / 0.05
4.08-4.25 (15) / 53 / 62.97 / -1.26 / 1.58
4.25-4.42 (18) / 69 / 78.12 / -1.03 / 1.06
4.42-5.15 (17) / 100 / 81.77 / 2.02 / 4.07
5.15-5.50 (15) / 88 / 78.74 / 1.04 / 1.09
5.50-6.25 (15) / 91 / 87.84 / 0.34 / 0.11
6.25-6.70 (14) / 94 / 91.36 / 0.28 / 0.08
>6.70 (11) / 63 / 79.01 / -1.80 / 3.24
Total (151) / 724 / 724.00 / ---- / 16.08
The critical chi-square value (with 10-3 = 7 df) is approximately 14.07.
Since the chi-square statistic exceeds that, we do not feel confident the model provides a good fit.
Note: See Agresti (1996) Section 4.4 for details.
SAS Code and Output:
options ps=54 ls=76;
data one;
input serrace 6-8 year 13-16 searace 23-24 drivers 31-32 trklength 34-40 laps 46-48 road 56 cautions 63-64 leadchng 71-72;
cards;
1 1975 1 35 2.54 191 1 5 13
...
151 1979 31 37 2.5 200 0 6 35
;
run;
/* Data set one contains the data for analysis. Variable names and
column specs are given in INPUT statement. I have included ony
first and last observations */
/* The following model fits a Generalized Linear model,
with poisson random component, and a constant mean:
g(mu)=alpha is systematic component,
g(mu)=log(mu) is the link function:
mu=e**alpha */
proc genmod;
model Cautions = / dist=poi link=log;
run;
/* The following model fits a Generalized Linear model,
with poisson random component,
g(mu)=alpha + beta1*drivers + beta2*trkength + beta3*laps is systematic component,
g(mu)=log(mu) is the link function:
mu=e**alpha + beta1*drivers + beta2*trkength + beta3*laps */
proc genmod;
model Cautions = drivers trklength laps / dist=poi link=log;
run;
quit;
Poisson Model – g(mu)=log(mu) = ea
The GENMOD Procedure
Model Information
Data Set WORK.ONE
Distribution Poisson
Link Function Log
Dependent Variable cautions
Observations Used 151
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 150 215.4915 1.4366
Scaled Deviance 150 215.4915 1.4366
Pearson Chi-Square 150 201.6050 1.3440
Scaled Pearson X2 150 201.6050 1.3440
Log Likelihood 410.8784
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 1.5675 0.0372 1.4947 1.6404 1778.93 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
Poisson Model – g(mu)=log(mu) = ea+b1X1+b2X2=B3x3
The GENMOD Procedure
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 147 171.2162 1.1647
Scaled Deviance 147 171.2162 1.1647
Pearson Chi-Square 147 158.8281 1.0805
Scaled Pearson X2 147 158.8281 1.0805
Log Likelihood 433.0160
Algorithm converged.
Analysis Of Parameter Estimates
Standard Wald 95% Chi-
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 -0.7963 0.4117 -1.6032 0.0107 3.74 0.0531
drivers 1 0.0365 0.0125 0.0120 0.0610 8.55 0.0035
trklength 1 0.1145 0.1684 -0.2156 0.4446 0.46 0.4966
laps 1 0.0026 0.0008 0.0010 0.0041 10.82 0.0010
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.
Sources:
Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley, NY.
Computational Approach to Obtaining Poisson Regression Analysis
Poisson probability density function:
Sytematic Component:
Link Function: g() = log() = eg(
For subject i, g(i) = xi’
Where: X = xi = =
With Likelihood Function:
and log-likelihood function and derivative wrt :
Setting the vector of first derivatives to 0 gives the equations:
where Y and are nx1 vectors and 0 is a 4x1 vector of 0’s (its length is the number of parameters in the regression model).
The second derivative matrix of the log likelihood and its negative (Hessian matrix) are obtained:
where W is a nxn diagonal matrix with elements i
Defining g as the negative of the first derivative of the log likelihood function, we can iteratively estimate via the Newton-Raphson Algorithm. In matrix form:
Beginning with a coefficient vector of [5 0 0 0]’
The algorithm converges to those given above in 8 steps (where convergence is determined by all elements of the estimated coefficient matrix changing by less than 0.0001 in absolute value between iterations.
The (approximate) large-sample estimated variance-covariance matrix is obtained by plugging the final estimated regression coefficients into W and computing
Estimated standard errors are obtained by taking square roots of the diagonal elements.
Results from running analysis in SAS PROC/IML (the log likelihood evaluation does not include the log(yi!) terms:
BETA_NEW SEBETA
-0.79627 0.4117015
0.0365253 0.0124934
0.1144986 0.1684265
0.0025963 0.0007893
VBETA LOGLIKE
0.1694981 -0.001558 -0.027334 -0.000207 433.01601
-0.001558 0.0001561 -0.001549 -5.142E-6
-0.027334 -0.001549 0.0283675 0.0001198
-0.000207 -5.142E-6 0.0001198 6.23E-7