Duration (Survival) Models for Time to Event Data

INTRODUCTION

A relatively new area in econometrics is the analysis of duration data (also called time to event data). The econometrics literature on the analysis of duration data draws heavily from statistical methods that have been developed by industrial engineers and biomedical researchers, who use these methods to analyze such phenomena as the useful lives of machines and survival times of patients after a particular type of operation.

Dependent Variable

In duration analysis, the dependent variable being studied is a duration. Duration is defined as:

i) The amount of time that elapses until some event occurs,

ii) The amount of time that elapses until measurement is taken, when measurement is taken before the event actually occurs.

Duration is often called time to event (e.g., time to death, time to machine failure, time to employment). If an observed duration corresponds to i, it is said to be uncensored. If the observed duration corresponds to ii, it is said to be censored. The following points should be noted about a duration variable:

1. A duration variable is always measured in units of time (e.g. minutes, days, weeks, months).

2. A duration variable must be non-negative (you can’t have a negative time to event).

Censoring

It is usually the case that some of the observations on a duration variable are censored. An observation is said to be censored when it is measured from the beginning of the period of interest until some point before the event takes place. For example, suppose that the duration variable is time to death after a heart transplant. Suppose that this variable is measured for a sample of 30 persons. Suppose that when the measurement is taken 20 of these individuals have died, but 10 are still alive. The 10 observations for the individuals still alive are censored observations.

Duration Data

No Censoring

Let the variable duration be denoted by T. Duration is a random variable that measures time to event. Because T is a random variable, its behavior can be described by a probability distribution f(T). Let t1, t2, …, tn be a random sample of n observations on the random variable T. The sample will usually consist of a cross-section of n times to event (durations) on individuals, firms, machines, etc.

Censoring

Let T* be the value of duration in the absence of censoring. Let T be the observed value of duration. Let c be the censoring time, that is, the point at which measurement stops if the event has not yet occurred. The observed value of duration is given by

T = T* if T* < c

T = c if T* ≥ c

The censoring time, c, can either be a known constant or a random variable. If c is a random variable, then it must be independent of T*. To indicate whether an observation is censored, a censor status variable is usually created. This variable is an indicator variable that takes a value of 1 if the observation is not censored and 0 if the observation is censored.
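The censoring rule and the censor status variable can be sketched in a few lines of Python. The function name observe_durations and the sample values are hypothetical, chosen only for illustration, and the censoring time c = 80 is assumed to be a known constant.

```python
def observe_durations(latent, c):
    """Apply the censoring rule T = T* if T* < c, T = c otherwise.

    Returns the observed durations and a censor status list
    (1 = not censored, 0 = censored).
    """
    observed = [t_star if t_star < c else c for t_star in latent]
    status = [1 if t_star < c else 0 for t_star in latent]
    return observed, status


# Hypothetical latent durations T* (in weeks), censored at c = 80.
latent = [12.0, 95.5, 80.0, 3.2, 140.0]
observed, status = observe_durations(latent, c=80.0)
print(observed)  # [12.0, 80.0, 80.0, 3.2, 80.0]
print(status)    # [1, 0, 0, 1, 0]
```

Note that an observation with T* exactly equal to c is treated as censored, in line with the rule T = c if T* ≥ c.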

Approaches to Analyzing Duration Data

Three alternative approaches can be used to analyze duration data. These are:

1. Parametric approach

2. Semiparametric approach

3. Nonparametric approach

The parametric approach makes assumptions about the probability distribution of T. This allows you to analyze duration data using regression models or regression-like models. The semiparametric approach makes only minimal assumptions about the probability distribution of T. The nonparametric approach makes no assumptions about the probability distribution of T.

PARAMETRIC APPROACH

There are two major types of parametric models of duration. These are:

1. Regression models

2. Regression-like models

Regression Models

A regression model is the appropriate model to use when your objective is to better understand how a set of variables, X1, X2, …, Xk, influences the expected (average) time to event, E(T).

Example

Let T be the amount of time an individual is unemployed measured in weeks. Thus, the event of interest is finding a job. Let X1 be the level of unemployment benefits in hundreds of dollars per month, X2 be years of work experience, and X3 be marital status; X3 =1 if single, X3 = 0 if married. You have a sample of 200 individuals. Some of these observations are censored at T = 80 weeks. Suppose your objective is to better understand how the level of unemployment benefits, work experience, and marital status influence the average amount of time an individual is unemployed. One way to proceed is to estimate the following classical linear regression model,

T = β0 + β1X1 + β2X2 + β3X3 + μ

The coefficient β1 measures the effect of a one unit change ($100) in unemployment benefits on the average amount of time an individual is unemployed. The coefficients β2 and β3 have similar interpretations. However, there are 3 potential problems with this model.

1. Some of the observations on T are censored. As a result, the OLS estimator will yield estimates of the coefficients that are biased and inconsistent. Thus, the appropriate model would be the censored regression model, which accounts for censored observations in the estimation procedure.

2. The classical linear regression model assumes that the error term, and therefore T conditional on the X's, has a normal distribution. There are a number of reasons to believe that duration (such as length of unemployment) does not have a normal distribution (the most obvious reason is that T is non-negative by construction). Jeffrey Wooldridge suggests that one way to deal with this problem is to use the logarithm of duration as the dependent variable; that is, ln(T). This is because ln(T) usually has a distribution that is closer to a normal distribution than T itself. In this case, each slope coefficient (multiplied by 100) measures the approximate percentage change in T for a one unit change in the corresponding X.

3. The dependent variable, T, measures a process that takes place over the length of time (0, t). Regression analysis assumes that the value of X does not change during the period that T is being observed. For example, suppose that an individual is unemployed for 12 months. Regression analysis assumes the level of unemployment benefits he received (X1) and his marital status (X3) did not change during this period of time. If either of these variables does change during the time an individual is unemployed, this greatly complicates the analysis.
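The first problem, bias from censoring, can be illustrated with a small simulation. This is only a sketch under assumed values: the data-generating process T* = 40 + 8·X1 + error, the error standard deviation, the sample size, and the helper name ols_slope are all hypothetical and are not taken from the example above. It fits OLS with a single regressor, once on the uncensored durations and once on the durations censored at 80 weeks.

```python
import random


def ols_slope(x, y):
    """Slope of a one-regressor OLS fit of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


random.seed(0)
n = 2000
c = 80.0  # censoring time in weeks

# Assumed data-generating process (hypothetical): T* = 40 + 8*X1 + error.
x1 = [random.uniform(0.0, 5.0) for _ in range(n)]
t_star = [40.0 + 8.0 * x + random.gauss(0.0, 8.0) for x in x1]
t_obs = [min(t, c) for t in t_star]  # observed (censored) durations

slope_latent = ols_slope(x1, t_star)    # close to the assumed slope of 8
slope_censored = ols_slope(x1, t_obs)   # pulled toward zero by censoring
print(round(slope_latent, 2), round(slope_censored, 2))
```

Because censoring cuts off long durations, and long durations are more common at high values of X1 in this setup, the slope estimated from the censored data understates the assumed effect.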

Regression-Like Models

A regression-like model is the appropriate model to use when analyzing duration data if your objective is to estimate any of the following.

1. The probability that an event will occur before time t.

2. The probability that an event will occur after time t.

3. The probability that an event will occur between time t and time t+1.

4. The probability that an event will occur between time t and time t+1, given that it has not occurred up to time t.

Notice that here we are interested not in expected duration but in probabilities associated with duration. However, a regression-like model can also be used to analyze average duration or median duration.

Example

Let T be the amount of time an individual is unemployed. We might be interested in the following questions.

1. What is the probability that an individual will be unemployed for 6 months or less?

2. What is the probability that an individual will be unemployed for more than 6 months?

3. What is the probability that an individual will be unemployed between 6 and 7 months?

4. Given that an individual has been unemployed for 6 months, what is the probability that he will find a job within the next month?

5. Will the probability that an individual finds a job increase or decrease the longer he is unemployed?

6. What is the average or median amount of time an individual is unemployed?

Probability Distributions for a Duration Variable

A continuous duration random variable, T, can be described by four alternative functions, each of which fully characterizes its probability distribution. These are:

1. Probability density function

2. Cumulative distribution function

3. Survival function

4. Hazard function

Once you choose a particular type of probability density function (e.g., normal, exponential, Weibull, etc.) you can derive the other three functions. Thus, all four functions have the same parameters and are simply different ways of describing the same system of probabilities.
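As a worked illustration, suppose T is assumed to have an exponential density with rate parameter λ > 0. The exponential is chosen here only because the algebra is short. Applying the definitions of the cumulative distribution, survival, and hazard functions gives all four functions in terms of the single parameter λ:

```latex
\begin{align*}
f(t) &= \lambda e^{-\lambda t}, \qquad t \ge 0 \\
F(t) &= \int_0^t \lambda e^{-\lambda s}\,ds = 1 - e^{-\lambda t} \\
S(t) &= 1 - F(t) = e^{-\lambda t} \\
h(t) &= \frac{f(t)}{S(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda
\end{align*}
```

In this case the hazard does not depend on t at all.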

Probability Density Function

Let T be a continuous duration random variable, and let t be a specific value of T. Let T have a probability density function given by f(t). The probability density function f(t) allows you to calculate the probability that T will fall in the interval between t1 and t2; that is,

$$\Pr(t_1 \le T \le t_2) = \int_{t_1}^{t_2} f(t)\,dt$$

Thus, the probability that T will fall in the interval between t1 and t2 is equal to the area under the curve of f(t) between the values t1 and t2. For example, if T is length of unemployment in weeks and t1=40 and t2=42, then you can find the probability that an individual will be unemployed between 40 and 42 weeks.
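The area calculation can be sketched numerically. The example below assumes, purely for illustration, an exponential density with a mean unemployment spell of 30 weeks; the function names exp_pdf and integrate are hypothetical. It compares a midpoint-rule approximation of the area between 40 and 42 weeks with the closed-form answer for that density.

```python
import math


def exp_pdf(t, rate):
    """Exponential density f(t) = rate * exp(-rate * t) for t >= 0."""
    return rate * math.exp(-rate * t)


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


rate = 1 / 30  # assumed: mean unemployment spell of 30 weeks
t1, t2 = 40, 42

# Pr(40 <= T <= 42) as the area under f(t) between t1 and t2.
prob_numeric = integrate(lambda t: exp_pdf(t, rate), t1, t2)
# Closed form for the exponential: exp(-rate*t1) - exp(-rate*t2).
prob_exact = math.exp(-rate * t1) - math.exp(-rate * t2)
print(round(prob_numeric, 4), round(prob_exact, 4))
```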

Cumulative Distribution Function

Given the probability density function f(t), the cumulative distribution function F(t) can be derived as follows,

$$F(t) = \Pr(T \le t) = \int_{0}^{t} f(s)\,ds$$

Thus, the probability that T will take a value that is less than or equal to t is equal to the area under the curve of f(t) between 0 and t. For example, if T is the length of unemployment in weeks and t=52, then you can find the probability that an individual will be unemployed for 52 weeks or less.

Survival Function

Given the cumulative distribution function F(t), the survival function S(t) can be derived as follows,

$$S(t) = \Pr(T \ge t) = 1 - F(t)$$

Thus, the probability that T will take a value that is greater than or equal to t is equal to one minus the area under the curve of f(t) between 0 and t. This is equal to the area under the curve of f(t) between t and the maximum value of T. For example, if T is the length of unemployment in weeks and t=52, then you can find the probability that an individual will be unemployed for 52 weeks or more; that is, you can find the probability that an individual will be unemployed for at least 52 weeks.
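The relationship between the cumulative distribution function and the survival function can be checked with a short sketch. It assumes a Weibull distribution with survival function exp(-(λt)^p); the parameter values λ = 1/40 and p = 1.2 are hypothetical and chosen only to produce readable numbers.

```python
import math


def weibull_cdf(t, lam, p):
    """F(t) = Pr(T <= t) = 1 - exp(-(lam * t) ** p)."""
    return 1.0 - math.exp(-((lam * t) ** p))


def weibull_survival(t, lam, p):
    """S(t) = Pr(T >= t) = 1 - F(t)."""
    return 1.0 - weibull_cdf(t, lam, p)


lam, p = 1 / 40, 1.2  # hypothetical parameter values
t = 52                # weeks of unemployment

F_52 = weibull_cdf(t, lam, p)       # probability of 52 weeks or less
S_52 = weibull_survival(t, lam, p)  # probability of at least 52 weeks
print(f"Pr(T <= 52) = {F_52:.3f}, Pr(T >= 52) = {S_52:.3f}")
```

The two probabilities sum to one for any t, which is why knowing either function is enough to recover the other.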

Hazard Function

Given the probability density function f(t) and the survival function S(t), the hazard function h(t) can be derived as follows,

$$h(t) = \frac{f(t)}{S(t)}$$

The hazard function is a particular type of conditional probability function. Strictly, it is a rate rather than a probability: for a short interval of length Δt, h(t)Δt approximates the probability that the event will occur between t and t + Δt, given that it has not occurred up to time t. Roughly speaking, it tells you the rate at which the event will occur at time t. For example, if T is length of unemployment in weeks and t = 52, then you can find the approximate probability that an individual will find employment during the next week, given that he has been unemployed for 52 weeks. That is, the hazard function tells you the rate at which individuals who have been unemployed for 52 weeks are finding jobs. For example, a hazard rate of 0.05 per week at t = 52 implies that about 5 of every 100 individuals who have been unemployed for 52 weeks are expected to find a job during the following week.
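The link between the hazard rate and the conditional probability of exit can be sketched as follows. The Weibull form and the parameter values λ = 1/40 and p = 1.2 are assumptions made only for illustration. The code computes h(52) as the ratio of density to survival function and compares it with the exact probability of finding a job in the following week, given 52 weeks of unemployment.

```python
import math


def weibull_pdf(t, lam, p):
    """Weibull density f(t) = lam*p*(lam*t)**(p-1) * exp(-(lam*t)**p)."""
    return lam * p * (lam * t) ** (p - 1) * math.exp(-((lam * t) ** p))


def weibull_survival(t, lam, p):
    """Weibull survival function S(t) = exp(-(lam*t)**p)."""
    return math.exp(-((lam * t) ** p))


def hazard(t, lam, p):
    """Hazard as the ratio of density to survival function."""
    return weibull_pdf(t, lam, p) / weibull_survival(t, lam, p)


lam, p = 1 / 40, 1.2  # hypothetical parameter values
t, dt = 52, 1.0       # 52 weeks unemployed, one-week horizon

h_52 = hazard(t, lam, p)
# Exact Pr(t <= T < t + dt | T >= t) = 1 - S(t + dt) / S(t).
cond_prob = 1.0 - weibull_survival(t + dt, lam, p) / weibull_survival(t, lam, p)
print(round(h_52 * dt, 4), round(cond_prob, 4))
```

The two numbers are close but not identical, because the hazard is an instantaneous rate and one week is not an infinitesimally short interval.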

The Hazard Function and Duration Dependence

Often we are interested in questions like the following.

1. Is it more likely, less likely, or equally likely that an individual will find a job the longer he is unemployed?

2. Is it more likely, less likely, or equally likely that a strike will end the longer it lasts?

3. Is it more likely, less likely, or equally likely that a patient will die the longer he has survived after open heart surgery?

The answer to these questions depends upon the slope of the hazard function. We have the following definitions.

1. If the hazard function has a positive slope, then the distribution of the duration variable has positive duration dependence. In this case, the longer the duration (e.g., unemployment) the greater the probability the event will occur in the next short period (e.g., the greater the probability the individual will find employment).

2. If the hazard function has a negative slope, then the distribution of the duration variable has negative duration dependence. In this case, the longer the duration the smaller the probability the event will occur in the next short period.

3. If the hazard function is constant (has zero slope), then the distribution of the duration variable has no duration dependence: the probability that an event will occur in the next short period does not depend upon duration. In this case, the process is said to have no memory.
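The three cases can be sketched with a Weibull hazard, h(t) = λp(λt)^(p-1), whose slope is governed by the shape parameter p. The Weibull form, the value λ = 1/40, the three values of p, and the helper name duration_dependence are assumptions made for illustration. The helper classifies the slope of any hazard function at a given t using a central finite difference.

```python
def weibull_hazard(t, lam, p):
    """Weibull hazard h(t) = lam * p * (lam * t) ** (p - 1)."""
    return lam * p * (lam * t) ** (p - 1)


def duration_dependence(hazard, t, eps=1e-4, tol=1e-9):
    """Classify the slope of a hazard function at t by a central difference."""
    slope = (hazard(t + eps) - hazard(t - eps)) / (2 * eps)
    if slope > tol:
        return "positive"
    if slope < -tol:
        return "negative"
    return "none"


lam = 1 / 40  # hypothetical scale parameter
results = {}
for p in (0.8, 1.0, 1.5):
    results[p] = duration_dependence(lambda t: weibull_hazard(t, lam, p), t=52)

print(results)  # {0.8: 'negative', 1.0: 'none', 1.5: 'positive'}
```

With p = 1 the Weibull hazard reduces to the constant λ, which is the no-memory case.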

Duration Dependence and Functional Form

If economic theory unequivocally indicates that a duration variable has, for example, negative duration dependence, then you should choose a functional form for the probability distribution that imposes this structure on the data. However, if economic theory allows a duration variable to have, for example, positive or negative duration dependence, then you should not choose a functional form for the probability distribution that imposes either positive or negative duration dependence on the data. If you do, you create a fait accompli. In this case, you must choose a functional form for the probability distribution that is flexible enough to allow for both positive and negative duration dependence, and allow the data to determine the outcome.

Choosing a Functional Form for the Probability Distribution of a Duration Variable