December 13, 2004
Violent Crime in America
Introduction
Violent crime in the United States is an important subject, particularly in New York City, where people perceive the risk of being victimized by crime to be relatively high. As residents of New York City, we find that the risk of violent crime affects the way we live our lives, whether or not we actually become victims of a crime. We have to think twice about traveling alone on the subway late at night, or jogging in Central Park after dark. Therefore, in thinking about the quality of our lives here, we wonder what societal factors must be in place in order to live in a more peaceful world, where the risk of being a violent crime victim would be lower (or maybe we should just move out of the city).
Our data analysis project analyzes violent crime in America. We will determine the most important statistical drivers of violent crime over the period 1970-2002. We are interested in other environmental and societal factors that fluctuate from year to year and may be correlated with the rate of violent crime. Are there factors that we assume are correlated but are really not? Are there factors that we assume to have no association with violent crime but that really do? We aim to draw conclusions about what factors must be in place in order for violent crime to be reduced over the next 30 years.
The Data
To analyze the violent crime rate, and its drivers, we have collected the following data:
Data / Source / Frequency / Timeframe
Violent crime rate (target variable) / Bureau of Justice Statistics / Annual / 1960-2002
Unemployment rate / Census Bureau / Monthly / 1960-2002
Federal Prison population / Federal Bureau of Prisons / Annual / 1970-2002
Poverty rate / Census Bureau / Annual / 1960-2002
Economic growth – GDP / Bureau of Economic Analysis / Annual / 1960-2002
In collecting the data, we have already faced several issues. First, we had expected to analyze data for 1964-2003; however, several of the data series are not available as far back as 1964, so we will limit the analysis to 1970-2002. Specifically, we were unable to find data on the total prison population going back to the 1960s; therefore, we have chosen to use the Federal Prison Population instead, since this data series extends back to 1970. Although less ideal than the total U.S. prison population, we believe the Federal data series may add to our understanding. The second issue we faced is that we have chosen to analyze annual observations; however, the unemployment data seems to be available only on a monthly basis, so we had to transform it to annual data. Since we don't have weightings, we have annualized it by calculating an unweighted mean of the monthly data. This transformation could potentially have a negative effect on the validity of our conclusions. We also expected an issue with data for 2001 if victims of 9/11 were counted as victims of violent crime, but upon further analysis, they were not. Had they been counted, our other variables would have been less relevant in 2001 than they are in other years. Finally, our data is in different units: some are rates (crime rate, unemployment rate) that fluctuate over time, while some are absolute numbers (prison population) that tend to grow over time. We may need to transform our data to make regression analysis more meaningful.
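The annualization step described above can be sketched as follows. This is a minimal illustration, assuming the monthly series is held as a mapping from (year, month) to rate; the function name `annualize` and the placeholder values are our own and are not the actual unemployment figures.

```python
from statistics import mean

def annualize(monthly_rates):
    """Collapse {(year, month): rate} into {year: unweighted mean of that year's months}."""
    by_year = {}
    for (year, _month), rate in monthly_rates.items():
        by_year.setdefault(year, []).append(rate)
    # Every month counts equally because no labor-force weights are available.
    return {year: mean(rates) for year, rates in sorted(by_year.items())}

# Placeholder values for illustration only; NOT the real unemployment data.
example_monthly = {(1970, m): 4.0 + 0.1 * m for m in range(1, 13)}
example_annual = annualize(example_monthly)
print(example_annual)
```

Because each month is weighted equally, a month with a larger labor force has no more influence than a smaller one, which is the limitation noted above.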
Expected Outcome
Through an analysis of the data we expect to find that the unemployment rate is correlated to the violent crime rate and that higher unemployment produces higher violent crime. This is because unemployment produces lower income which may drive crime related to robbery. We expect that a higher poverty rate will be associated with higher crime for the same reason. We expect that when GDP is lower or falling, violent crime will rise. We expect higher prison population to be associated with lower violent crime because those most likely to commit violent crime are incarcerated.
We believe that by statistically analyzing the violent crime rate and its potential drivers, we can increase our understanding of crime and what factors are associated with a lower incidence of it.
General Observation of the Variables
We will begin our analysis with an examination of the descriptive statistics as well as a histogram for each of our variables. This will enable us to determine whether or not the data is normally distributed and to see if there are any variables that may cause problems when we go deeper into our statistical analysis of the data. The descriptive statistics are as follows:
Descriptive Statistics
Variable Mean SE Mean StDev Minimum Q1 Median Q3
Violent Crime ra 565.9 18.8 107.9 363.5 491.2 556.6 636.9
Total Prison Pop 48598 5937 34105 19023 21654 30104 75453
Avg Annual Unemp 6.285 0.244 1.404 4.000 5.350 6.000 7.200
Poverty for Fami 10.361 0.190 1.091 8.700 9.300 10.300 11.300
GDP in billions 4883 511 2937 1039 2163 4463 7235
Variable Maximum
Violent Crime ra 758.1
Total Prison Pop 128090
Avg Annual Unemp 9.700
Poverty for Fami 12.300
As is apparent in the data above, some of the variables seem to be fairly normally distributed, as the mean and median for these variables are similar to each other. This is supported by each of the histograms we looked at as well. The exceptions are the variables Total Prison Population and GDP, which both have a higher mean relative to the median. This lack of normality is apparent in the histograms of each of these variables as seen below.
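The figures in the descriptive statistics table can be cross-checked with a short sketch. It assumes 33 annual observations (1970-2002 with no gaps), which is our reading of the timeframe rather than something the output states; the helper names are our own.

```python
import math

N_YEARS = 2002 - 1970 + 1  # 33 observations, assuming one per year with no gaps

# (mean, standard deviation, median) copied from the descriptive statistics table,
# with variable labels as they are printed there.
stats = {
    "Violent Crime ra": (565.9, 107.9, 556.6),
    "Total Prison Pop": (48598, 34105, 30104),
    "Avg Annual Unemp": (6.285, 1.404, 6.000),
    "Poverty for Fami": (10.361, 1.091, 10.300),
    "GDP in billions": (4883, 2937, 4463),
}

def se_mean(stdev, n=N_YEARS):
    """Standard error of the mean: StDev / sqrt(n)."""
    return stdev / math.sqrt(n)

def mean_median_gap(mean, stdev, median):
    """Distance of the mean above the median, in standard deviations."""
    return (mean - median) / stdev

for name, (mean, stdev, median) in stats.items():
    print(f"{name:18s} SE Mean {se_mean(stdev):9.3f}   gap {mean_median_gap(mean, stdev, median):6.3f}")

gaps = {name: mean_median_gap(*values) for name, values in stats.items()}
```

With n = 33 the computed standard errors agree with the SE Mean column, and the prison population shows by far the largest gap between mean and median.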
Because Total Prison Population has a long right tail, we decided to perform a transformation by taking a log base 10 of the data in order to see if that would help create a more normal distribution. We also logged the GDP data, since it is money data. As is apparent from the histograms of the logged data, this transformation did not seem to sufficiently affect the distribution of the data.
This may have to do with the fact that these are time series data, fixing which is beyond the scope of this project. While taking the logs for Prison Population and GDP did not make them normally distributed, we decided to continue using this logged data in the rest of our analysis.
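The effect of the log transformation can be illustrated with a rough sketch. Since the full series is not reproduced in this report, the sketch uses the five-number summary of Total Prison Population from the table as a stand-in for the data; that substitution, and the `relative_gap` measure, are our own assumptions.

```python
import math
from statistics import mean, median

# Min, Q1, median, Q3, max of Total Prison Pop from the descriptive statistics table.
# These five points stand in for the 33-year series, which is not shown here.
prison_summary = [19023, 21654, 30104, 75453, 128090]
logged = [math.log10(x) for x in prison_summary]

def relative_gap(values):
    """(mean - median) expressed as a fraction of the range; 0 for a symmetric sample."""
    return (mean(values) - median(values)) / (max(values) - min(values))

raw_gap = relative_gap(prison_summary)
log_gap = relative_gap(logged)
print(f"raw gap:    {raw_gap:.3f}")
print(f"logged gap: {log_gap:.3f}")
```

On these stand-in points the log shrinks the gap between mean and median but does not remove it, which is in line with what the histograms of the logged data showed.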
We also examined correlations among our variables, substituting our two transformed variables for their original variables. The best regressions arise when the predictor variables are highly correlated with the target variable but not with each other. In our data, the poverty rate and log of GDP are highly correlated with the violent crime rate; however, several pairs of predictor variables are highly correlated with one another.
Correlations
Violent Crim Avg Annual U Poverty for LogT Prison
Avg Annual U 0.129
Poverty for 0.656 0.596
LogT Prison 0.395 -0.516 0.017
LogT GDP 0.647 -0.277 0.284 0.888
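One way to gauge how much the correlated predictor pairs overlap is to square the correlation. The sketch below does this for the pairs in the matrix above; the variance inflation factor shown is a diagnostic we did not run in Minitab, and the formula 1 / (1 - r^2) applies only when the two predictors in question are the only ones in the model.

```python
# Pairwise predictor correlations copied from the correlation matrix above.
predictor_pairs = {
    ("LogT Prison", "LogT GDP"): 0.888,
    ("Avg Annual U", "Poverty for"): 0.596,
    ("Avg Annual U", "LogT Prison"): -0.516,
}

def shared_variance(r):
    """Fraction of one variable's variance linearly explained by the other."""
    return r ** 2

def two_predictor_vif(r):
    """Variance inflation factor for a model containing only these two predictors."""
    return 1 / (1 - r ** 2)

for pair, r in predictor_pairs.items():
    print(pair, f"shared variance {shared_variance(r):.3f}", f"VIF {two_predictor_vif(r):.2f}")

vif_prison_gdp = two_predictor_vif(0.888)
```

The log of prison population and the log of GDP share close to 79% of their variance, so their individual coefficients should be read with caution when both appear in the same model.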
Single Variable Regressions
While we are ultimately concerned with how all the variables together predict Violent Crime, we are first going to examine how each one, on its own, relates to our target. To do this, we created a scatter plot with a fitted regression line for each of the predictor variables against the target of violent crime rate, as displayed below.
In looking at the slope of the fitted line, all of the variables appear to have a positive relationship with the target, indicating that as each variable increases, the violent crime rate increases as well. That being said, however, it seems that no one variable alone has a very strong correlation with the violent crime rate. For instance, the variability between the violent crime rate and the log of GDP is increasing over time. We can therefore conclude at this point that each variable on its own is not a good predictor of violent crime. It is our hope that when these variables are acting together, the relationship will be stronger and as a group perhaps they will be better predictors of violent crime. In order to determine this, we will move on to our next step in analyzing the data, that of a multiple regression model.
Initial Multiple Regression
Next we ran a multiple regression of the violent crime rate on our four predictor variables (Avg Annual Unemployment Rate, Poverty for Families, log of GDP Current Dollars, and log of Federal Prison Population). The regression equation is given below.
Regression Analysis
The regression equation is
Violent Crime rate = - 96 - 16.0 Avg Annual Unemployment Rate
+ 52.8 Poverty for Families - 200 LogT Prison Pop
+ 316 LogT GDP
Predictor Coef SE Coef T P
Constant -96.2 302.4 -0.32 0.753
Avg Annual Unemployment Rate -16.05 13.16 -1.22 0.233
Poverty for Families 52.77 15.87 3.32 0.002
LogT Prison Pop -199.9 110.7 -1.81 0.082
LogT GDP 315.51 97.71 3.23 0.003
S = 63.1435 R-Sq = 70.0% R-Sq(adj) = 65.7%
In looking at the coefficients of this regression equation, we learn for example that, holding all else fixed, a one point increase in the poverty rate is associated with a 52.77 point increase in the violent crime rate. Similarly, the coefficient of the log of the prison population tells us that every one point increase in the log of the prison population is associated with a 199.9 point decrease in the violent crime rate. Interestingly, an increase in the unemployment rate is associated with a decrease in the violent crime rate, and an increase in the logged GDP is associated with an increase in the violent crime rate. Next, the regression model succeeded in reducing the noise in the violent crime rate from 107.9 before the regression to a standard error of regression of 63.1. This means that roughly 95% of the time our regression model can predict the crime rate to within 2*63.1, or about 126 points. This is an indication that a prediction of violent crime using this regression equation would be much more accurate than an estimate based solely on its historical mean and variance.
In addition to looking at the standard error, it is also important to examine the degree to which these four variables explain the variance in the violent crime rate. To do this we looked at the adjusted R-Sq. The adjusted R-Sq indicates that the four predictor variables account for 65.7% of the variance in the violent crime rate. It is difficult for us to tell at this time whether this R-Sq is better or worse than other models that attempt to explain crime.
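The adjusted R-Sq and the rough 95% prediction band can be reproduced from the regression output with a short sketch. It assumes 33 annual observations (1970-2002), which is our reading of the timeframe; the function name is our own.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-Sq: penalizes R-Sq for the number of predictors used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 33               # assumed: one observation per year, 1970-2002
S_REGRESSION = 63.1435   # standard error of regression from the output
STDEV_TARGET = 107.9     # standard deviation of the violent crime rate

adj = adjusted_r_squared(0.700, N_OBS, 4)
band = 2 * S_REGRESSION                        # half-width of the rough 95% band
noise_reduction = 1 - S_REGRESSION / STDEV_TARGET

print(f"adjusted R-Sq:   {adj:.1%}")
print(f"95% band:        +/- {band:.1f} points")
print(f"noise reduction: {noise_reduction:.1%}")
```

Under that assumption the formula returns the 65.7% reported by Minitab, and the standard error of regression is a little over 40% smaller than the original standard deviation of the target.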
Finally we considered the T and P values of the predictor variables to determine if each is significant to the regression equation. There are two variables for which the P-value is above 0.05 (the log of the prison population and the unemployment rate); therefore, these variables appear statistically insignificant to the model. This indicates that perhaps these variables could be removed without much reduction in model power.
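The T column is simply each coefficient divided by its standard error, and the significance screen compares each P-value to 0.05. The sketch below recomputes both from the printed output; small differences from the printed T values can arise because Minitab divides unrounded figures.

```python
# name: (Coef, SE Coef, T as printed, P as printed), copied from the regression output
rows = {
    "Constant": (-96.2, 302.4, -0.32, 0.753),
    "Avg Annual Unemployment Rate": (-16.05, 13.16, -1.22, 0.233),
    "Poverty for Families": (52.77, 15.87, 3.32, 0.002),
    "LogT Prison Pop": (-199.9, 110.7, -1.81, 0.082),
    "LogT GDP": (315.51, 97.71, 3.23, 0.003),
}

ALPHA = 0.05  # significance threshold used in the report

def t_stat(coef, se_coef):
    """T statistic for the null hypothesis that the coefficient is zero."""
    return coef / se_coef

significant = [name for name, (_, _, _, p) in rows.items() if p < ALPHA]

for name, (coef, se, t_printed, p) in rows.items():
    print(f"{name:30s} T = {t_stat(coef, se):6.2f} (printed {t_printed:5.2f})  P = {p:.3f}")
print("significant at 0.05:", significant)
```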
Assumptions
Linear regression involves four major assumptions, and this regression violates two of the four. The first assumption is that the expected value of the error terms for all observations is equal to zero. Judging by the Residuals Versus the Fitted Values plot below, the expected value of the error terms appears approximately equal to zero. Also, there are no known subgroups whose fitted values are systematically above or below the regression line. We believe this first assumption holds. The second assumption is homoscedasticity, that the variance of the error terms is constant throughout the population. That assumption does not hold in this regression. The Residuals Versus the Fitted Values plot shows that the variance is not constant: the variance is larger for larger fitted values. The third assumption is that the residual of one term tells us nothing about the residual of another term. This assumption is violated in this regression, as it is in many regressions of time series data. The Residuals Versus the Order of the Data plot shows that each residual is related to the residual of the prior observation. The fourth assumption of linear regression is that the residuals are normally distributed. The plots Normal Probability Plot of the Residuals and Histogram of the Residuals show that the residuals are approximately normal; therefore this assumption holds for this regression.
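We judged the third assumption by eye from the Residuals Versus the Order of the Data plot. A numerical companion to that plot is the Durbin-Watson statistic, which we did not compute in this project; the sketch below shows how it works on made-up residuals, not on the residuals of our regression.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals.

    Values near 2 suggest no first-order autocorrelation; values well below 2
    suggest that each residual resembles the one before it.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals for illustration only.
drifting = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]   # neighbors look alike
alternating = [1, -1, 1, -1]                      # neighbors have opposite signs

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
print(f"drifting:    {dw_drifting:.3f}")
print(f"alternating: {dw_alternating:.3f}")
```

Residuals that drift slowly above and below zero, as ours appear to do in the order plot, would produce a value well below 2.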
In addition to considering the four assumptions, we also looked for any outliers in the data by more closely examining the Normal Probability Plot of the Residuals. We noticed a couple of outliers toward the very top of the graph. Upon analysis of these outliers, we believe they occurred due to the relative increase in the crime rate during the early 1990s and do not feel it necessary to remove the data points from our model at this time.
Improving the Model
Several factors indicate that our initial model may not be the optimal model possible with our predictor variables. First, two variables, the unemployment rate and the log of prison population, have p-values above 0.05. Second, our model violates two of the four assumptions of linear regression. To improve the model, we ran a “best subsets” regression, the output of which follows.
Best Subsets Regression
Response is Violent Crime rate
A=Avg Annual Unemployment Rate
B=Poverty for Families
C=LogT Prison Population
D=LogT GDP
Mallows
Vars R-Sq R-Sq(adj) C-p S A B C D
1 43.1 41.2 24.2 82.686 X
2 66.2 63.9 4.6 64.792 X X
3 68.4 65.2 4.5 63.672 X X X
4 70.0 65.7 5.0 63.143 X X X X
The best subsets analysis indicates that only two variables are necessary to have an adjusted R-Sq of 63.9%, whereas our four-variable equation had an adjusted R-Sq of 65.7%, a very small difference. The two variables that add so little power to the model are the unemployment rate and the log of the prison population; these are the same two variables with high p-values in our initial regression. We believe that by eliminating these two variables, the model will strike the best balance between model power and complexity. Our optimal model then is as follows.
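The Mallows C-p column can be recovered from the S column alone, which helps show what the statistic measures. The sketch assumes 33 annual observations (1970-2002) and uses the standard definition C-p = SSE_k / MSE_full - (n - 2(k + 1)), where k is the number of predictors; the function names are our own.

```python
N_OBS = 33  # assumed: one observation per year, 1970-2002

# Number of predictors -> S (standard error of regression) from the best subsets table.
s_by_size = {1: 82.686, 2: 64.792, 3: 63.672, 4: 63.143}

def sse(s, n_predictors, n_obs=N_OBS):
    """Recover the error sum of squares from S: SSE = S^2 * (n - k - 1)."""
    return s ** 2 * (n_obs - n_predictors - 1)

MSE_FULL = s_by_size[4] ** 2  # error variance estimated from the full four-variable model

def mallows_cp(n_predictors):
    """Mallows C-p; values close to k + 1 indicate little bias from the omitted variables."""
    k = n_predictors
    return sse(s_by_size[k], k) / MSE_FULL - (N_OBS - 2 * (k + 1))

cp_values = [round(mallows_cp(k), 1) for k in (1, 2, 3, 4)]
print(cp_values)
```

The one-variable model has a C-p far above its parameter count, while the two-variable model is already close to the full model, which supports dropping the two weak predictors.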
Regression Analysis
The regression equation is
Violent Crime rate = - 592 + 50.8 Poverty for Families + 176 LogT GDP
Predictor Coef SE Coef T P
Constant -591.8 153.2 -3.86 0.001
Poverty for Families 50.81 10.94 4.64 0.000
LogT GDP 175.59 38.79 4.53 0.000
S = 64.7915 R-Sq = 66.2% R-Sq(adj) = 63.9%
This new model explains 63.9% of the variance in the violent crime rate (as indicated by the adjusted R-Sq). The original noise in our target variable was 107.9; our model reduces noise in the target variable to 64.8 (the standard error of regression). Both predictor variables are significant to the model (as indicated by p-values less than 0.05). The equation tells us that, all else held constant, a one point increase in the poverty rate is associated with a 50.81 point increase in the violent crime rate. Similarly, a one point increase in the log of GDP is associated with a 175.59 point increase in the violent crime rate.
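To make the coefficients concrete, the fitted equation can be evaluated directly. The sketch assumes that the log was taken (base 10) on GDP expressed in billions of current dollars, as in the descriptive statistics table; the median values used as inputs are for illustration and do not correspond to any single year.

```python
import math

def predict_violent_crime(poverty_rate, gdp_billions):
    """Fitted two-variable model: -591.8 + 50.81 * poverty + 175.59 * log10(GDP)."""
    return -591.8 + 50.81 * poverty_rate + 175.59 * math.log10(gdp_billions)

# Median poverty rate (10.3) and median GDP (4463 billion) from the descriptive statistics.
at_medians = predict_violent_crime(10.3, 4463)

# A one point rise in log10(GDP) is a tenfold rise in GDP, so a doubling of GDP
# corresponds to 175.59 * log10(2) points, all else held constant.
doubling_effect = 175.59 * math.log10(2)

print(f"prediction at median inputs: {at_medians:.1f}")
print(f"effect of doubling GDP:      {doubling_effect:.1f}")
```

Reading the GDP coefficient this way is often easier than thinking in log points: under this model a doubling of GDP is associated with an increase of roughly 53 points in the violent crime rate.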
This new model conforms to the four assumptions of linear regression better than our initial model did. It does not violate the first assumption (expected value of error terms equal to zero), as seen in the plot below. This regression does violate the second assumption (homoscedasticity), since the variance of the residuals is higher for larger fitted values, but the variance is more constant than in our initial model. This regression also violates the third assumption (residuals tell us nothing about one another), since it is a time series. The fourth assumption (normality of residuals) is not violated by this regression equation. While not exactly normal, the residuals are approximately normal and certainly more normal than the residuals of our initial regression equation. In sum, our improved model violates the same two linear regression assumptions as our initial model (homoscedasticity and independence of residuals), but it does so less severely.
Initial Conclusion and Original Expectations
First let us take a look at the nature of the relationship of the national violent crime rate with each of the predictor variables, based on the multiple regression model we ran. In half of the cases the direction of the relationship matched our expectations, and in the other half the relationship was the opposite of what we had expected. As stated earlier, we had assumed that an increase in GDP would be associated with a decrease in the crime rate; this does not seem to be the case based on the positive coefficient for the logged GDP. It seems that there is actually a positive rather than negative relationship between the two: an increase in GDP is associated with an increase in the violent crime rate. Additionally, we had expected that an increase in the unemployment rate would be associated with an increase in the violent crime rate. However, based on the negative coefficient for unemployment, it seems that an increase in unemployment, in our model, is actually associated with a decrease in violent crime. The other two variables do in fact have the relationships we assumed they would have. An increase in the poverty rate is associated with an increase in the violent crime rate, as indicated by the positive coefficient for the poverty rate. In addition, as we had assumed, an increase in the prison population is associated with a decrease in the crime rate. These associations, of course, assume all other variables are held constant.
More importantly perhaps, we chose these four variables under the assumption, prior to statistically analyzing the data, that all four variables together would serve as a fairly good predictor of the national violent crime rate. After looking at the multiple regression model for the data, the results do not fully support our original expectations. To begin with, in order to strengthen our analysis we had to make the choice to completely remove two of the four variables, the unemployment rate and the prison population. We now believe that the national rate of violent crime for the period 1970-2002 is best explained by the poverty rate and the level of GDP. That said, violent crime is quite difficult to predict using the data we have analyzed thus far. Therefore, we decided to try one last thing in our effort to predict the national violent crime rate.
Incorporating a Lagged Variable
We considered the fact that the best predictor of the violent crime rate may be the violent crime rate of the prior year. To examine this we first ran a correlation between the violent crime rate and the lag (by one period) of the violent crime rate.
Correlations: Violent Crime rate, Lag of Violent Crime Rate
Pearson correlation of Violent Crime rate and Lag of Violent Crime Rate = 0.957
This very high correlation of 0.957 tells us that the violent crime rate in one period is likely to help predict the violent crime rate of the next period. We next constructed a second best subsets regression but this time included the lag variable.
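The lag correlation above pairs each year's rate with the rate of the year before. A minimal sketch of that calculation follows; the toy series is made up to show the mechanics and is not our violent crime data, and the function names are our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def lag_one_correlation(series):
    """Correlate each observation with the one before it.

    The first observation has no predecessor, so one data point is lost.
    """
    current = series[1:]
    lagged = series[:-1]
    return pearson(current, lagged)

# Made-up, steadily rising series for illustration only.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_lag_corr = lag_one_correlation(toy_series)
print(f"lag-one correlation of toy series: {toy_lag_corr:.3f}")
```

Any series that trends smoothly will show a high lag-one correlation, so part of the 0.957 reflects the trend in the crime rate rather than new information about its drivers.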