Driver Fatalities in the United States

Statistics and Data Analysis

Introduction

As you’re driving past an automobile accident, do you ever consider your chances of being killed in one? Over the past twenty years, an average of 45,000 people per year lost their lives in vehicular accidents, approximately 23 for every 100,000 licensed drivers. Using state-level data, we would like to see whether we can predict an individual state’s driver fatality rate. In this analysis, we define the driver fatality rate as the number of driver fatalities per 100,000 licensed drivers.
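The rate defined above is a simple normalization. A minimal sketch follows; the function name is hypothetical and the counts are made-up illustration values, not figures from the NHTSA reports.

```python
def fatality_rate_per_100k(driver_fatalities: int, licensed_drivers: int) -> float:
    """Driver fatalities per 100,000 licensed drivers."""
    if licensed_drivers <= 0:
        raise ValueError("licensed_drivers must be positive")
    return driver_fatalities * 100_000 / licensed_drivers


# Illustrative numbers only: 300 driver deaths among 2,000,000 licensed drivers.
example_rate = fatality_rate_per_100k(300, 2_000_000)
print(example_rate)
```

Dividing by licensed drivers rather than by population is what makes a small state comparable to a large one.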

The federal government includes data on passenger and pedestrian fatalities when computing fatality rates. Because we are examining only driver fatalities, we remove these victims from our data. The source for our fatality data is the National Highway Traffic Safety Administration’s Traffic Safety Facts reports for the years 1997 and 1998 (www.nhtsa.dot.gov).

In addition, we are using two years of data for each state so that any relationship we find between variables reflects a sustained pattern, rather than a state having a particularly good or bad year in terms of driver fatalities. We have also made an effort to use state data per capita or per driver in order to make the data more comparable.

Predictor Data

✓  The percentage of licensed drivers aged 29 and younger, 30 to 59 years, and 60 and older (variable names: 29yrs under, 30-59yrs, 60+yrs). While examining the data, we observed that a large proportion of the drivers killed were either quite young or quite old. This makes sense if you consider the lack of skills and excessive risk-taking of younger drivers, and the loss of vision, hearing, and reflexes in the older generation. In fact, automobile accidents in which they are the driver are the number one killer of young people ages 15 to 20. Using data provided by the Federal Highway Administration (www.fhwa.dot.gov), we computed the percentage of drivers in each age group per state, and we would expect states with greater percentages of younger or older drivers to have a higher fatality rate. Unfortunately, these three variables must sum to 100%, so they will no doubt be correlated with each other (a higher percentage of youths results in a lower percentage of adults and/or seniors).

As we can see below, 29yrs and under seems to have a positive relationship with the fatality rate. However, there are two states whose data are leverage points: Mississippi and Utah (MS has the highest fatality rates of 36/100k and 31/100k, and UT has the largest percentage of younger drivers). MS is an outlier/leverage point in many of these scatter plots, so we will probably be forced to remove it. In 30-59 yrs and 60+ yrs, MS still remains an outlier and Alaska becomes a leverage point with the highest percentage of 30-59 yrs and the lowest percentage of 60+ yrs. Therefore, it may make sense to remove AK as well as MS.

✓  Number of drivers per square mile (drivers/sqmi). We expect states with a greater density of drivers to have a greater fatality rate. Licensed driver information is provided by NHTSA, and state area in square miles by the U.S. Census Bureau (www.census.gov). As you can see in the graph below, there is a huge outlier in the form of Washington DC, with a whopping 5,700-plus drivers per square mile (compare this to a mean of 120 drivers/sqmi without DC). This data point will be excluded from our final analysis.

✓  Average state temperature and annual state precipitation (avg temp, annual precip). Using data provided by the National Climatic Data Center (www.ncdc.noaa.gov), we expect to find that states with lower average temperatures (more snow/ice/sleet) and/or greater annual precipitation have a higher driver fatality rate. The two data points for DC are missing from both graphs (the data were not available), but this will not be an issue since DC will be removed, as previously mentioned.

✓  State highway spending (total) and state highway safety spending, both per driver (funds/driver, safety funds/driver). We would expect states with greater highway spending per driver and/or greater highway safety spending per driver to have lower fatality rates. The data come from the Federal Highway Administration’s state highway budgets for 1997 and 1998. Judging from the scatter plots, there does not seem to be much of a relationship between spending and the fatality rate. MS and AK continue to be outliers. AK has the highest highway spending per driver of all the states, most likely due to its low absolute number of drivers, low temperatures, and the cost of snow/ice removal, road repair, etc.

✓  State alcohol consumption per capita, in gallons per year (alcohol consumption). To get a sense of DWI/DUI risks, we decided to look at states’ per capita alcohol consumption. Using DWI/DUI arrest rates would be rather meaningless, since laws vary from state to state (and since roughly 40% of all fatalities involve alcohol, when it is too late to arrest those drivers). Alcohol consumption data are provided by the National Institute on Alcohol Abuse and Alcoholism (www.niaaa.nih.gov). MS and DC continue to be outliers, and DC is joined by Nevada and New Hampshire in high alcohol consumption (NV is probably skewed higher due to Las Vegas and its drinking gamblers, and there is probably little else to do in the NH winters except drink). There does not seem to be a clear relationship between fatalities and per capita alcohol consumption, so we will probably leave NV and NH in the dataset.

✓  Per capita personal income, by state (per capita pers income). Personal income data are provided by the U.S. Census Bureau. We were expecting to see a relationship between higher incomes and lower fatality rates: perhaps residents of states with higher per capita incomes are able to purchase more expensive (read: larger, safer?) automobiles, and therefore the state may have a lower fatality rate. Judging by the scatter plot, there does seem to be a negative relationship, but the pattern of scatter suggests logging the personal income data (the non-logged histogram shows a small right tail). The plot of the logged personal income data (log PI) does not look much improved, so we will continue to use the non-logged data.

✓  Sample year (year). Since we are using two years of data, we decided to add a zero-one variable to check that the two years’ data are statistically similar. We wanted to ensure that, by using two years’ data, we capture a sustained relationship between fatalities and the predictor variables, rather than having a state with a particularly good or bad year (in terms of fatality rate) show a false relationship. To get a quick idea of how comparable the response variables are, we’ll examine a box plot of fatalities for each year. Ignoring the outlier MS, the box plots look fairly comparable. The sample years are pretty close, so we should be able to exclude this variable from the final regression.
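Several of the predictors above are described as having leverage points, meaning states whose predictor value sits far from the rest. For a simple regression with an intercept, the leverage of observation i is 1/n + (x_i - mean)² / Σ(x_j - mean)². The sketch below applies that formula to hypothetical percentages (not our state data); the 2p/n cutoff is one common rule of thumb and is an assumption here, not something taken from this analysis.

```python
def leverages(x):
    """Leverage (hat) values for a simple regression of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    return [1 / n + (v - mean_x) ** 2 / sxx for v in x]


# Hypothetical percentages of drivers aged 29 and under, with one extreme state.
young_share = [22.3, 23.1, 23.8, 24.0, 24.6, 25.2, 22.9, 23.5, 34.7]
hat = leverages(young_share)

# Assumed rule of thumb: flag leverage above 2p/n, where p = 2 coefficients
# (intercept and slope) in a simple regression.
cutoff = 2 * 2 / len(young_share)
flagged = [x for x, h in zip(young_share, hat) if h > cutoff]
print(flagged)
```

Leverage depends only on the predictor values, which is why a state can be a leverage point without having an unusual fatality rate.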

Data Assessment

Let’s take a quick look at the overall fatality rate data. First, a histogram and a box plot to assess normality:

The fatality rate looks pretty normal with the exception of an outlier or two (MS). No log adjustment seems necessary. Now we’ll remove the outliers that were mentioned while examining the predictors. Those outliers are: Washington DC, Mississippi, Alaska, and Utah. Here are the scatter plots after this data is removed:

29 and under looks positively related, while 30-59 looks like it has a negative relationship.

There does not seem to be any pattern in the 60+yrs data. Drivers/sqmi unexpectedly appears to have a negative relationship with fatality rate. Additionally, it has a long right tail (expected, since there cannot be negative drivers/sqmi). We’ll log drivers/sqmi later in this section.
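The report does not state which base the log uses. The descriptive statistics reported later list a maximum of 5797.1 for drivers/sqmi alongside 3.7632 for its log (and 751.6 alongside 2.8760 once outliers are removed), which is consistent with a base-10 logarithm. A quick check, offered as a sketch:

```python
import math

# (raw maximum of drivers/sqmi, reported maximum of its log), from the summary tables
reported_pairs = [
    (5797.1, 3.7632),  # all states and DC
    (751.6, 2.8760),   # after removing AK, DC, MS, and UT
]

for raw, reported_log in reported_pairs:
    computed = math.log10(raw)
    print(f"log10({raw}) = {computed:.4f} (reported {reported_log})")
```

The minimums do not reproduce as cleanly, presumably because the raw values (0.8 and 3.6) are shown rounded to one decimal place.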

Average temperature and annual precipitation look fairly random.

No apparent pattern in either highway spending variable. We wouldn’t expect either variable to add predictive value to the regression.

We may have to remove NV and NH from the alcohol data to get a better look at any pattern. Personal income definitely has a negative relationship, which we’ll explore later.

None of the scatter plots show much in the way of non-constant variance – very good. Now, we can address the problems in drivers/sqmi and alcohol consumption that were mentioned above. First, we need to normalize the drivers/sqmi data. Here are the histograms before and after the log adjustment:

Clearly, logging drivers/sqmi makes the data more normal. This also results in a more distinct scatter plot versus fatality rate:

The logged plot confirms what the earlier scatter plot suggested: there is a negative relationship between driver population density and the fatality rate, contrary to our initial expectation. There might be a small amount of non-constant variance, since the spread looks wider where the log of drivers/sqmi is low. Now, we need to check the alcohol consumption relationship without the leverage points from NV and NH:

There is no apparent pattern in the alcohol consumption data, so we will include NV and NH in our final dataset.

Regression Analysis

Now that we’ve analyzed the patterns and normality of our dataset, let’s start to examine some regression equations. First, we’ll examine the descriptive statistics for each variable before removing the outliers.

Variable           N     Mean   Median   TrMean    StDev  SE Mean
Fatality         102   15.015   14.260   14.759    5.643    0.559
29 yrs u         102   23.801   23.800   23.677    2.471    0.245
30-59 yr         102   56.722   56.700   56.702    2.777    0.275
60+ yrs          102   19.481   19.650   19.596    2.473    0.245
Drivers/sqmi     102    230.9     62.1    107.5    800.9     79.3
log driv         102   1.7760   1.7933   1.7771   0.6647   0.0658
Avg. Tem         100   52.375   51.350   52.027    8.492    0.849
Annual precip    100    38.37    37.64    38.38    16.31     1.63
Funds/dr         102    603.8    562.6    587.6    198.9     19.7
Safety f         102    49.08    45.60    47.00    25.44     2.52
Alcohol          102   2.2771   2.2100   2.2232   0.5331   0.0528
Per capi         102    25766    25512    25524     3962      392

Variable         Minimum  Maximum       Q1       Q3
Fatality           5.143   36.007   11.121   19.021  ← MS outlier
29 yrs u          18.800   34.700   22.300   25.150  ← UT outlier
30-59 yr          50.100   65.100   54.475   58.825  ← AK outlier
60+ yrs            9.900   24.600   18.200   21.125  ← AK outlier
Drivers/sqmi         0.8   5797.1     27.4    130.9  ← DC outlier
log driv         -0.1068   3.7632   1.4382   2.1169
Avg. Tem          37.300   77.200   45.850   56.850
Annual precip       4.67    74.62    27.38    52.05
Funds/dr           315.6   1468.9    461.2    693.0
Safety f           13.78   129.55    29.09    57.50
Alcohol           1.2900   4.1600   1.9300   2.4100
Per capi           18885    37714    22711    27916
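The quartiles above are enough to see why a box plot marks the top fatality rate as an outlier. The sketch below assumes the usual 1.5 × IQR fences; the report does not say which rule its box plots use, and the function name is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper outlier fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Q1 and Q3 of the fatality rate, taken from the table above (all 102 observations).
low, high = tukey_fences(11.121, 19.021)
print(round(low, 3), round(high, 3))

# The maximum fatality rate in the table (36.007) lies beyond the upper fence.
print(36.007 > high)
```

Under this rule, any rate above roughly 30.9 per 100,000 drivers would be drawn as an individual point beyond the whisker.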

After removing AK, DC, MS, and UT:

Variable           N     Mean   Median   TrMean    StDev  SE Mean
Fatality          94   14.955   14.719   14.886    4.972    0.513
29 yrs u          94   23.505   23.650   23.524    1.971    0.203
30-59 yr          94   56.657   56.700   56.689    2.369    0.244
60+ yrs           94   19.840   19.750   19.854    1.991    0.205
Drivers/sqmi      94    127.1     65.6    101.9    167.9     17.3
log driv          94   1.7899   1.8170   1.7961   0.5584   0.0576
Avg. Tem          94   52.447   51.350   52.080    8.429    0.869
Annual precip     94    37.80    37.19    37.89    15.69     1.62
Funds/dr          94    591.5    558.3    583.3    167.8     17.3
Safety f          94    49.56    45.60    47.34    26.03     2.68
Alcohol           94   2.2602   2.2100   2.2026   0.4766   0.0492
Per capi          94    25713    25512    25519     3616      373

Variable         Minimum  Maximum       Q1       Q3
Fatality           5.143   26.741   11.481   19.021
29 yrs u          18.800   27.500   22.175   24.725
30-59 yr          51.600   61.500   54.575   58.625
60+ yrs           15.500   24.600   18.500   21.300
Drivers/sqmi         3.6    751.6     28.3    130.9
log driv          0.5605   2.8760   1.4523   2.1169
Avg. Tem          37.300   77.200   45.950   56.550
Annual precip       4.67    68.27    27.39    49.99
Funds/dr           315.6   1002.9    465.4    693.0
Safety f           13.78   129.55    29.56    57.50
Alcohol           1.6000   4.1600   1.9300   2.4000
Per capi           19388    37452    23188    27693

Comparing the two sets of summary statistics, the overall mean and standard deviation of the fatality rate have dropped, and the exclusion of outliers has lowered the standard deviation of nearly every predictor variable (safety funds per driver is the exception, rising slightly from 25.44 to 26.03). In particular, the mean of drivers/sqmi declines from 231 to 127, and its standard deviation drops from 801 to 168! The change in the log of drivers/sqmi is not as dramatic, but the range between minimum and maximum shrinks considerably.
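As a sanity check on the summary output, the SE Mean column should equal the standard deviation divided by the square root of N. The sketch below recomputes it for a few rows copied from the tables above; the row labels are ours.

```python
import math


def se_mean(stdev, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return stdev / math.sqrt(n)


# (label, N, StDev, reported SE Mean), copied from the descriptive statistics
rows = [
    ("Fatality, all data", 102, 5.643, 0.559),
    ("Fatality, outliers removed", 94, 4.972, 0.513),
    ("Drivers/sqmi, all data", 102, 800.9, 79.3),
    ("Drivers/sqmi, outliers removed", 94, 167.9, 17.3),
]

for label, n, stdev, reported in rows:
    print(f"{label}: computed {se_mean(stdev, n):.3f}, reported {reported}")
```

The computed values agree with the reported ones to the precision shown, so the drop in standard error for drivers/sqmi comes almost entirely from the smaller standard deviation, not from the change in N.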

For our first regression, we will include all predictor variables with the outliers removed:

Fatality/100k drivers = 1360 - 13.1 (29 yrs under) - 13.4 (30-59 yrs) - 13.2 (60+ yrs)
    - 5.38 (log drivers/sqmi) + 0.0818 (Avg. Temp) + 0.0637 (Annual precip)
    - 0.00413 (Funds/drivers) + 0.0095 (Safety funds/drivers)
    - 1.26 (Alcohol consumption) - 0.000338 (Per capita pers income) + 0.648 (Year)

Predictor         Coef    SE Coef       T       P      VIF
Constant        1360.1      770.4    1.77   0.081
29 yrs u       -13.078      7.688   -1.70   0.093   2352.3
30-59 yr       -13.393      7.709   -1.74   0.086   3417.3
60+ yrs        -13.246      7.718   -1.72   0.090   2418.6
log driv        -5.379      1.187   -4.53   0.000      4.5
Avg. Tem       0.08180    0.04547    1.80   0.076      1.5
Annual p       0.06375    0.02718    2.35   0.021      1.9
Funds/dr     -0.004127   0.002767   -1.49   0.140      2.2
Safety f       0.00949    0.01513    0.63   0.532      1.6
Alcohol        -1.2634     0.7840   -1.61   0.111      1.4
Per capi    -0.0003377  0.0001506   -2.24   0.028      3.0
Year            0.6480     0.7046    0.92   0.360      1.3

S = 3.013   R-Sq = 67.6%   R-Sq(adj) = 63.3%

Analysis of Variance

Source            DF        SS       MS       F       P
Regression        11   1554.30   141.30   15.57   0.000
Residual Error    82    744.27     9.08
Total             93   2298.57
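The summary figures reported with this regression (S, R-Sq, R-Sq(adj), and F) all follow from the sums of squares and degrees of freedom in the analysis of variance table. A short sketch that recomputes them, with variable names of our own choosing:

```python
import math

# Values copied from the analysis of variance table
ss_regression, df_regression = 1554.30, 11
ss_error, df_error = 744.27, 82
ss_total, df_total = 2298.57, 93

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_squared = ss_regression / ss_total
# Adjusted R-squared compares mean squares, so it penalizes extra predictors.
adj_r_squared = 1 - ms_error / (ss_total / df_total)
f_statistic = ms_regression / ms_error
s = math.sqrt(ms_error)

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {adj_r_squared:.1%}")
print(f"F = {f_statistic:.2f}, S = {s:.3f}")
```

The recomputed values match the reported output to rounding.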

Well, that’s a pretty good R² and a significant F-value: 67.6% of the variation in the response variable is explained by the predictor variables. The adjusted R², which penalizes the number of predictors, is lower at 63.3%, suggesting that some of the predictors contribute little beyond noise. Also, as we expected, there is a tremendous collinearity problem among 29yrs and under, 30-59yrs and 60+yrs (VIFs above 2,000), and none of these three predictors is significant at the .05 level. We will try removing them one at a time.
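To put the VIF values in perspective: the variance inflation factor for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Inverting the formula shows how much of each age variable is explained by the rest. The sketch below uses the VIFs reported in the first regression; the function name is hypothetical.

```python
def implied_r_squared(vif):
    """R-squared of a predictor regressed on the other predictors, given its VIF."""
    return 1.0 - 1.0 / vif


# VIFs copied from the first regression output
reported_vifs = {
    "29 yrs under": 2352.3,
    "30-59 yrs": 3417.3,
    "60+ yrs": 2418.6,
    "log drivers/sqmi": 4.5,
}

for name, vif in reported_vifs.items():
    print(f"{name}: VIF {vif} implies R-squared of {implied_r_squared(vif):.4f}")
```

Each age variable is more than 99.9% determined by the other predictors, which is what we would expect when three percentages sum to 100. The VIF of 4.5 for log drivers/sqmi, by comparison, corresponds to an R² of about 0.78.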

Predictor         Coef    SE Coef       T       P      VIF
Constant         49.96      16.79    2.98   0.004
30-59 yr       -0.2882     0.2516   -1.15   0.255      3.6
60+ yrs        -0.1248     0.2414   -0.52   0.607      2.3
log driv        -5.568      1.195   -4.66   0.000      4.5
Avg. Tem       0.08282    0.04598    1.80   0.075      1.5
Annual p       0.07347    0.02687    2.73   0.008      1.8
Funds/dr     -0.004490   0.002790   -1.61   0.111      2.2
Safety f       0.00862    0.01530    0.56   0.575      1.6