
Working With Count Data

Theory Construction and Research Methodology Workshop

November 16th, 2005

Phoenix, AZ

Working with Count Data:

Practical demonstration of Poisson, Negative Binomial

and Zero-Inflated Regression Models

Yoshie Sano, Ph.D.

Yu-Jin Jeong, M.A.

Alan C. Acock, Ph.D

Oregon State University

Human Development and Family Sciences

Anisa M. Zvonkovic, Ph.D.

Texas Tech University

Human Development and Family Sciences

Discussant: Dr. Manfred van Dulmen, Kent State University

Office phone: (330) 672-2504

Email address:

A Gentle Guide to Working with Count Dependent Variables

Workshop Overview

Alan C. Acock

Department of HDFS

Oregon State University

TCRM Meetings, Phoenix, November 2005

I will give a brief overview of the issues and methods for working with dependent variables that involve counts. Then we will have two examples of applications of these techniques followed by an open question and answer session. I can also illustrate the software capabilities.

Why Should I Worry About This?

Many count variables involving relatively rare events are important in family studies. Here are some examples:

  1. Number of marriages
  2. Number of children
  3. Number of times a parent is arrested
  4. Number of times/month a spouse is violent
  5. Number of days a mother has job related overnight travel
  6. Number of times a couple had an argument in the last month
  7. Number of times an adolescent skipped school in last year
  8. Number of positive statements one partner makes about his/her partner in a half hour discussion

You can probably think of others in your own area of research. Many national surveys include count variables, and observational research is often based on them. Often these distributions violate normality so flagrantly that statistical models assuming normal outcome variables make little sense. We will illustrate this with the number of days an adult travels. To start, we will examine just men. The mean number of days men travel is 3.45 and the standard deviation is 10.14.

This is so skewed (skewness = 4.61, p < .001) that we have a serious problem, but the kurtosis is even more of a problem (kurtosis = 27.60, p < .001): 71.8% of the men report not being away from home even a single night during the last year. How can we treat such a dependent variable to do a meaningful analysis?
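These statistics are easy to obtain in Stata. A minimal sketch, assuming the men's travel count is stored in a hypothetical variable named travel:

* mean, standard deviation, and percentiles of the count
summarize travel, detail
* formal tests of skewness and kurtosis against normality
sktest travel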

Traditional Solutions

Ignore the variation

One of the most widely used approaches collapses everybody who scored 1 or more into a single category. In our example we would compare men who travel, ignoring the amount of travel, to those who never travel. If this is your only interest, then this is fine, but it ignores potentially rich variation among those who travel.
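A minimal Stata sketch of this approach, again assuming a hypothetical count variable travel with hypothetical predictors x1 and x2:

* collapse the count into ever/never traveled
generate anytravel = travel > 0 if !missing(travel)
* model occurrence only; variation among travelers is discarded
logit anytravel x1 x2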

Here are some examples.

Variables that predict whether an adolescent ever had a drink may be different from variables that predict the frequency with which the adolescent drinks. Having ever had a drink is not deviant behavior for an adolescent, but drinking 30 days a month is a serious problem. What accounts for ever having a drink may be different than what accounts for having a serious drinking problem.

The idea is that there are two dependent variables.

  • First, one set of variables predicts whether a behavior occurs at all or does not.
  • Second, another set of variables predicts how often the behavior occurs.

Sometimes we are more interested in the first of these (Have you ever punched your spouse?) because ever having done this is a serious problem. Sometimes we are interested in the second of these (How many times do you praise your partner in a 30-minute discussion?) because how often this is done is important. Usually, we are interested in both questions and need a method to address both of them.

Transformations

There are two ways of implementing a transformation. The purpose of both is to reduce the amount of skewness in the distribution so that OLS regression works. (A minimal Stata sketch of both follows the list.)

  1. Log transformation. When we have a variable that is skewed right, we often take the log. We would have to add a small number, say 0.5, to each value because log(0) is undefined. This would pull the tail in, but would do nothing about the fact that 71.8% of the men do not travel at all.
  2. Censored regression. We could use censored regression saying the distribution is censored at the low end. This actually predicts a transformed variable in which the zero values are treated as negative values using an assumed underlying normal distribution. If the data were normally distributed and there was some reason we truncated the distribution from below, this would make sense. However, this would not make sense with count dependent variables because the zero values are absolutely meaningful and negative values would make no sense at all. Censored regression makes sense in many situations, but not here.
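Here is that sketch, once more with the hypothetical variable travel and predictors x1 and x2:

* log transformation: add .5 so zeros are defined, then OLS
generate logtravel = log(travel + .5)
regress logtravel x1 x2

* censored (tobit) regression, censored from below at zero
tobit travel x1 x2, ll(0)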

Neither log transformation nor censored regression addresses the fact that we have two questions: occurrence of a behavior vs. frequency of a behavior.

Poisson Regression

Poisson regression is available in better statistical packages such as Stata, SAS, and Mplus. Count variables that involve a rare event are often modeled using Poisson regression. A Poisson distribution is very simple. It has only a single parameter, lambda, which is both the mean and the variance. In other words, the mean and variance are equal. The probability of a particular outcome y under a Poisson distribution is:

Pr(Y = y) = (λ^y e^(−λ)) / y!,  for y = 0, 1, 2, …

How well does the Poisson assumption fit our data? The following graph shows the Poisson distribution of how often a man travels overnight when the mean is 3.45 nights:

[Figure: Poisson distribution with mean 3.45]

Although a Poisson distribution is widely used for counts of rare events and is clearly more appropriate than a normal distribution, it is still inappropriate for our situation. For a sample of men our size, we would expect none to travel more than 10 nights (our maximum is 90 nights) and would expect relatively few men to travel zero nights.
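Stata's built-in Poisson probability functions make this quick to check, using our observed mean of 3.45:

* probability of zero nights if travel were Poisson with mean 3.45
display poissonp(3.45, 0)      // about .03, versus the observed 71.8%
* probability of traveling more than 10 nights
display poissontail(3.45, 11)  // essentially zero, yet our maximum is 90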

This graph does not show an extreme skew because the mean is relatively large for a rare event, perhaps pulled higher by a few men who travel a lot. When the mean is smaller, the Poisson distribution is much more skewed, as the following graph based on Long and Freese (2003) illustrates:

[Figure: Poisson distributions for several smaller means, after Long and Freese (2003)]

Poisson models can be estimated with Stata, SAS, Mplus, and a few other statistical packages. These programs recognize that a count variable for a rare event will be skewed and use an estimation model based on this recognition. Software programs vary in how they estimate the model. Mplus, for example, defaults to robust maximum likelihood, and Stata defaults to maximum likelihood. These yield the same parameter estimates, but can yield different standard errors, sometimes very different standard errors. Mplus can be forced to use maximum likelihood estimation. Stata can be forced to use robust maximum likelihood estimation, robust clustered maximum likelihood estimation, or bootstrap methods to estimate standard errors. Robust maximum likelihood may be a reasonable solution, especially when the observed distribution is more skewed than a Poisson distribution would be.
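In current Stata syntax the alternatives look like this (variable names hypothetical):

* default: maximum likelihood standard errors
poisson travel x1 x2
* robust (sandwich) standard errors
poisson travel x1 x2, vce(robust)
* robust standard errors clustered on a grouping variable
poisson travel x1 x2, vce(cluster groupid)
* bootstrapped standard errors
bootstrap: poisson travel x1 x2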

Negative Binomial Distribution

Negative binomial regression models are available in Stata and standard errors can be estimated using maximum likelihood estimation, robust maximum likelihood estimation, and bootstrapping.

The problem with using the Poisson distribution on our data is that the Poisson distribution assumes the mean and variance are equal. Our mean is 3.45, but our variance is 102.82, not even close to 3.45. We call this overdispersion, and it is quite common for several reasons. First, when you have an excess of zeros you will have overdispersion. Second, people who travel tend to travel more often than a Poisson process would generate; overnight travel events are not independent.

One solution is to use a negative binomial distribution. The negative binomial distribution has two parameters: one for the mean and a second to represent the degree of overdispersion. This would work fine for our purposes if we wanted to model the count. Stata provides a test of whether the negative binomial model should be used in place of the Poisson model (Long & Freese, 2003). When we have a count variable for a rare event but also have overdispersion, where the variance is greater than the mean, the negative binomial distribution is an excellent way of answering our second question: how often does the event occur. By itself, it does not answer the question of what predicts whether the event occurs at all.
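In Stata, the test appears automatically at the bottom of the nbreg output (hypothetical variable names again):

* negative binomial regression; the output ends with a likelihood-
* ratio test of alpha = 0 (alpha = 0 reduces the model to a Poisson)
nbreg travel x1 x2
* a significant alpha favors the negative binomial over the Poisson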

A New Approach to Predicting Count Variables

Zero Inflated Models

It is often interesting to think of two possible outcome variables that may have the same or different covariates. This is especially important to adequate modeling when different covariates predict each outcome variable.

One outcome is whether a person travels or not, that is, whether a person has a count of zero or a count greater than zero. We can think of this as a latent class analysis where we are predicting which class or group a man is in: those who travel overnight one or more times vs. those who do not travel overnight at all as part of their job. We will call this variable u, where u = 1 if they traveled and u = 0 if they never traveled. Here is the observed distribution for the variable u:

[Figure: distribution of u (0 = never traveled, 1 = traveled)]

The second outcome is how often a person travels overnight because of their job. We can redraw our original observed distribution by dropping the people who do not travel at all. The distribution of those who do travel is as follows:

[Figure: distribution of nights traveled among men who traveled at least once]
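In Stata, the two outcomes can be separated like this (again assuming the hypothetical variable travel):

* occurrence indicator: u = 1 if any travel, 0 if none
generate u = travel > 0 if !missing(travel)
tabulate u
* frequency among those who travel at all
histogram travel if travel > 0, discrete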

Why would you want to make this distinction between the two variables? What predicts who will travel more rather than less may be different from what predicts who will travel at all or not at all. The following figure shows how we could draw a simple model with these two outcomes.

  • The variable, u#1, is in an oval because it is a latent class—a special type of latent variable. It represents whether the event occurs or not.
  • The variable, Y, is the observed number of nights away from home for those that do travel.
  • X1, X2, X3 directly influence the likelihood of any travel and, for those who travel, X1, X2, X3 directly influence how often they travel.
  • X4 is a covariate that directly influences whether the person travels or not, but does not directly influence how often they travel.
  • X5 is a covariate that directly influences how often a person travels, but not whether they travel at all or not.

When we have an excess of zeros, we call it a zero-inflated model. To the extent that the excess zeros are the source of the overdispersion, we can fit a zero-inflated Poisson regression. When this is not a sufficient adjustment for overdispersion, we can fit a zero-inflated negative binomial regression. Mplus can do a zero-inflated Poisson regression. Stata can do both a zero-inflated Poisson regression and a zero-inflated negative binomial model.

Estimating the Models with Stata

The commands to run these models in Stata are simple. Here is an example where we are predicting the number of days an adolescent had more than 5 drinks in the last month. The code opens the NLSY97 dataset and runs four models. Long and Freese (2003) provide additional detail. There is also a menu interface for these commands, and it offers maximum likelihood estimation, robust maximum likelihood estimation, clustered robust estimation, bootstrap estimation, and jackknife estimation.

* count dependent variable: days drank 5+ drinks last month
use "C:\StataBook\Data\nlsy97 selected variables working.dta", clear

* Poisson regression model:
poisson dr5day fun97 pdrink97 pcoll97
listcoef

* Negative binomial regression model:
nbreg dr5day fun97 pdrink97 pcoll97
listcoef

* Zero-inflated Poisson regression model:
zip dr5day fun97 pdrink97 pcoll97, inflate(fun97 pdrink97 pcoll97)
listcoef

* Zero-inflated negative binomial model:
zinb dr5day fun97 pdrink97 pcoll97, inflate(fun97 pdrink97 pcoll97)

Here is an example of the zero-inflated Poisson regression model results:

zip (N=626): Factor Change in Expected Count

Observed SD: 3.6211579

Count Equation: Factor Change in Expected Count for Those Not Always 0

--------------------------------------------------------------------
 dr5day97 |        b         z     P>|z|      e^b   e^bStdX    SDofX
----------+---------------------------------------------------------
    fun97 | -0.09368    -4.762     0.000   0.9106    0.8353   1.9216
 pdrink97 |  0.06599     2.567     0.010   1.0682    1.0862   1.2528
  pcoll97 | -0.11548    -3.759     0.000   0.8909    0.8817   1.0899
--------------------------------------------------------------------

Binary Equation: Factor Change in Odds of Always 0

--------------------------------------------------------------------
  Always0 |        b         z     P>|z|      e^b   e^bStdX    SDofX
----------+---------------------------------------------------------
    fun97 |  0.03102     0.666     0.505   1.0315    1.0614   1.9216
 pdrink97 | -0.37366    -5.291     0.000   0.6882    0.6262   1.2528
  pcoll97 | -0.01062    -0.134     0.893   0.9894    0.9885   1.0899
--------------------------------------------------------------------

It is interesting to see that only one of the variables predicts the binary outcome of whether they never drink (Always 0) or not. Having fun with your family and how many of your peers plan to go to college have little to do with whether you ever drank in the last 30 days.

By contrast, all three variables have strong relationships with how often you drink. Those who have fun more often with their family drink less often, as do those who have more peers who plan to attend college. Notice that peer drinking is much more important to whether adolescents drink at all than to how often they drink, while family and positive peers are much more important to how often they drink than to whether they drink at all.
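You can verify the factor-change columns with Stata's calculator. For example, for pdrink97 in the binary equation:

* e^b for pdrink97 in the binary (Always 0) equation
display exp(-0.37366)   // .6882, matching the e^b column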

Estimating the Models with Mplus

People think of Mplus as an SEM program, but it is much more general than that. Mplus can estimate Poisson models and zero-inflated Poisson models. It defaults to robust maximum likelihood estimation, but can be forced to do maximum likelihood estimation. Mplus cannot read files directly from other programs (SPSS, SAS, Stata). I use a user-written Stata command, stata2mplus, that creates data in the format Mplus likes and even writes out a basic Mplus program that you can modify to do what you want. Here is an Mplus program to do zero-inflated Poisson regression using robust maximum likelihood estimation.

The program for doing the zero inflated Poisson model is quite simple:

TITLE:
  Stata2Mplus conversion for f:\flash\NCFR\male.dta
  List of variables converted shown below (left out here)

DATA:
  File is i:\flash\NCFR\male.dat ;

VARIABLE:
  Names are
    qeb49 qpd1 red5 hrdif autonomy pressure jobsat
    neghspil cc_major qwc16r magsup sng_coh sng_leg
    qeb31br perinc;
  Usevariables are qeb49 qpd1 red5 cc_major qwc16r perinc
    magsup sng_coh sng_leg ;
  Count are qeb49 (i);
  Missing are all (-999) ;

MODEL:
  qeb49 ON qpd1 red5 cc_major qwc16r perinc magsup sng_coh
    sng_leg;
  qeb49#1 ON qpd1 red5 cc_major qwc16r perinc magsup
    sng_coh sng_leg;

OUTPUT: sampstat;

I apologize for the variable names. These were used by the people who collected the data. I have a listing of the names and variable labels at the end.

Mplus programs have commands grouped in sections. The TITLE section is a long comment describing what we are doing and the variables (not shown here). In the VARIABLE section we list the variables in the order they appear in the dataset and identify the dependent variables that are counts (rather than continuous or categorical variables). The only count dependent variable is qeb49. Notice the "(i)" following the variable name. The people who wrote Mplus must hate to type because they use shortcuts like this; the "(i)" stands for an inflation model. The MODEL section has two dependent variables: qeb49 is our count for those who travel, that is, the Y variable in our figure, and qeb49#1 is the inflation variable for whether the person travels or not.

Here is partial output:

SAMPLE STATISTICS

Means
           QPD1      RED5  CC_MAJOR    QWC16R    PERINC
         41.128     3.168     0.110     3.483    10.803

Means
         MAGSUP   SNG_COH   SNG_LEG
          3.233     0.043     0.846

MODEL RESULTS

                 Estimates      S.E.  Est./S.E.      e^B

 QEB49    ON
    QPD1            -0.009     0.014     -0.622      .99
    RED5            -0.203     0.097     -2.085      .82
    CC_MAJOR        -0.405     0.362     -1.119      .67
    QWC16R          -0.220     0.195     -1.128      .80
    PERINC           0.087     0.180      0.487     1.08
    MAGSUP           0.054     0.164      0.327     1.06
    SNG_COH         -0.622     0.612     -1.017      .54
    SNG_LEG         -0.426     0.359     -1.189      .65

 QEB49#1  ON
    QPD1             0.046     0.014      3.273     1.05
    RED5            -0.479     0.122     -3.925      .62
    CC_MAJOR         0.066     0.497      0.132     1.07
    QWC16R          -0.542     0.177     -3.061      .68
    PERINC          -0.847     0.249     -3.405      .43
    MAGSUP           0.279     0.159      1.758     1.32
    SNG_COH          0.040     0.813      0.049     1.04
    SNG_LEG         -0.448     0.501     -0.895      .64

 Intercepts
    QEB49#1         11.096     2.601      4.267
    QEB49            3.604     2.405      1.498

The estimates are the unstandardized regression coefficients. The Est./S.E. can be treated as a z-test. Because Mplus does not compute the odds ratio, e^B, I have added a column showing these values. Those familiar with logistic regression know it is difficult to interpret the estimates directly; we frequently compute odds ratios because these are easier to interpret. I exponentiate the B's using Stata's calculator command, e.g., display exp(.046) yields 1.05.
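A quick way to fill in such a column, again using Stata as a calculator:

* odds-ratio factors for the significant inflation predictors
display exp(.046)    // 1.05 for QPD1 (age)
display exp(-.479)   // .62 for RED5 (education)
* a 10-year age difference compounds the yearly factor
display 1.05^10      // about 1.63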

There are two questions: Do they stay at home? How often do they travel? All zero-inflated programs answer the first question by saying what predicts Always 0. Mplus labels this QEB49#1. Several variables are predictors of this. QPD1 is age in years. Each year older a person gets, the odds of not traveling at all increase by a factor of 1.05, or 5%, p < .01. A 10-year difference in age would increase the odds of not traveling at all by 1.05^10 = 1.63, or 63%!

By contrast, RED5 (education) has the opposite effect. For each unit change in education, the odds of not traveling decrease by a factor of .62, or 38%. Education is on a 5-point scale, so a 1-point higher score is a substantial difference.

These results show that we get different predictors when we are predicting how often people travel: only a single predictor, education (RED5), is significant.

Complications

  • Selecting the right model is not always easy. The papers will give more information on that.
  • Using the Poisson model or the negative binomial model can produce very different results.
  • Using different estimation procedures can produce very different results. When the Mplus model is estimated using maximum likelihood rather than robust maximum likelihood, you get the same parameter estimates but very different tests of significance. Using bootstrap estimates, you get yet another set of standard errors.
  • Extremely complex models, an extreme excess of zeros, and extreme overdispersion can easily defeat the estimation.

References

Long, J. S., & Freese, J. (2003). Regression models for categorical dependent variables using Stata (Rev. ed.). College Station, TX: Stata Press.