
I. Introduction

Despite a payroll smaller than those of all but four other major league teams, the Oakland Athletics (commonly referred to as the A’s) won the fourth most games in Major League Baseball in 2003. This year was no fluke: the A’s have now made the playoffs in four consecutive years, while teams like the New York Mets and Texas Rangers have generally floundered despite spending over $50 million more on their players. Recently, thanks to the publication of the best-selling book Moneyball, much has been made of the methods of the team’s general manager, Billy Beane, who focuses on new and non-traditional statistics when evaluating talent rather than on traditional ones and scouting reports. In particular, Beane focuses on a player’s on-base percentage (OBP) and slugging percentage (SLG) when measuring offensive capability rather than on batting average (AVG), which has always been baseball’s most famous and widely published statistic. In addition, Beane eschews sacrifice bunts (SH) and stolen bases (SB), as he believes they actually decrease the chances of scoring runs (R) rather than increase them, as is commonly believed. While there are certain circumstances where he feels these two moves are appropriate, as a general rule he avoids attempting to sacrifice bunt or steal bases, and the A’s during Beane’s tenure (1998-present) are known to be near the bottom of the league in these categories[1].

While the A’s also have unconventional ways of measuring defense and a pitcher’s capability (and many fairly argue that the A’s pitching has more to do with their recent success than their offensive makeup), the offensive side has become the most controversial as it clearly conflicts with the traditional way of looking at things.

Based on my love of baseball and interest in the subject matter, I have decided to examine Billy Beane’s methods for my data analysis project. I intend to focus on offense from a team perspective. Since the sole purpose of a team offense is to score as many runs as possible, my target variable will be team runs scored per game. I will then try to predict this variable from the following statistics: OBP, SLG, AVG, SH (per game) and SB (per game).[2] Since my instincts agreed with many of the ideas in Moneyball, I initially took many of its statistical findings at face value. However, I will now examine for myself whether OBP and SLG are more highly correlated with runs scored than the traditional AVG and whether sacrifice bunting and stealing bases are indeed poor strategies.

II. Thinking about the Data

A question that arose before collecting the data was how far back in time to look. Since statistics are available dating back to the beginning of the 20th century, I had many potential observations to examine. However, to keep the project manageable, I took data from American League teams for the past eight seasons (1996-2003). I excluded National League teams since, without the designated hitter, the sacrifice bunt would (and should) be used more frequently when the pitcher is batting. Including National League teams could therefore distort the estimated relationship between sacrifice bunts and team runs. Since the American League fielded 14 teams in each of these seasons (one team switched leagues in 1998, but was replaced with an expansion team), this approach provides 14 x 8 = 112 observations.

III. Data Collection

While many sports sites can provide the data I need, Major League Baseball’s official website (www.mlb.com) had perhaps the easiest-to-use statistical interface, and I was able to obtain from it all the statistics I needed for my analysis. In fact, I gathered much more data than just team totals for R, OBP, SLG, AVG, SH, SB and games played (the last to convert the cumulative statistics to per-game numbers). I realized when collecting the data that other statistics could provide some interesting side analyses. While they may not factor into my completed project, for my own curiosity I would like to see how well some other offensive statistics predict runs scored.

I am fortunate that I had no real problems collecting the data. The only problem I had, albeit a minor one, was that there was no easy way to download each year’s data. A simple “copy and paste” technique created some formatting errors, but these were easily corrected.

IV. First Look at the Data

I will start by analyzing the descriptive statistics for each of the variables. Since OBP, SLG and AVG are all reported as decimals between 0 and 1 (for example, .340), I have decided to multiply each observation by 1000. This rescaling has no effect on the fit or the significance of the regression; it only divides the corresponding coefficients by 1000, which makes them easier to interpret, especially for baseball fans who are used to speaking of these statistics in “points.” Having said that, let’s look at our data:

Descriptive Statistics:

Variable N Mean Median TrMean StDev SE Mean

R per G 112 5.0437 5.0216 5.0527 0.5474 0.0517

OBP 112 340.36 341.00 340.65 15.36 1.45

SLG 112 433.21 435.50 433.43 27.00 2.55

AVG 112 270.80 271.00 271.11 11.59 1.09

SB per G 112 0.6370 0.6285 0.6333 0.2003 0.0189

SH per G 112 0.23698 0.24383 0.23600 0.07297 0.00690

Variable Minimum Maximum Q1 Q3

R per G 3.5714 6.2671 4.6588 5.4228

OBP 300.00 374.00 329.25 352.00

SLG 375.00 491.00 411.25 453.75

AVG 240.00 293.00 263.00 279.00

SB per G 0.2284 1.2112 0.4884 0.7731

SH per G 0.06790 0.40994 0.18519 0.28571
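The OBP, SLG and AVG rows above are on the x1000 scale. As a minimal sketch of why that rescaling does not change the quality of a fit, the example below runs a one-predictor least-squares fit twice, once on the raw decimals and once on the values multiplied by 1000. The numbers are made up for illustration and are not the 112 team seasons; the helper name simple_ols is likewise hypothetical.

```python
# Illustrative only: made-up (OBP, runs per game) pairs, NOT the team data.
obp = [0.310, 0.325, 0.333, 0.341, 0.352, 0.360, 0.371]
runs = [4.1, 4.6, 4.8, 5.1, 5.3, 5.7, 6.0]


def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a one-predictor least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


raw = simple_ols(obp, runs)
scaled = simple_ols([1000 * v for v in obp], runs)

# Same intercept and R^2; only the slope shrinks by a factor of 1000.
print("raw:    intercept=%.4f slope=%.4f R^2=%.4f" % raw)
print("scaled: intercept=%.4f slope=%.6f R^2=%.4f" % scaled)
```

On the scaled version the slope reads as runs per game per one "point" of OBP, which is the interpretation used for the coefficients later on.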

An initial examination of the data does not reveal anything alarming. The means and medians for each variable are very close to being equal, indicating that no variable appears to have a distribution that is skewed in either direction. To confirm, I examined a histogram for each variable and, as expected, all had roughly normal distributions. I also examined box plots for each variable, which revealed only one clear outlier: the maximum value for SB per game of 1.21 (the 1996 Kansas City Royals). Based on the above, I feel all the data look relatively normal and there is no need to take logs or perform similar adjustments.
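As a rough cross-check of the table and of the box-plot finding, the sketch below recomputes SE Mean as StDev / sqrt(N) and applies the common 1.5 x IQR fence to the reported quartiles. The 1.5 x IQR rule is an assumption about how the box plots flag outliers. Only the minimum and maximum are available here, so the check can say whether a variable has at least one flagged point, not how many.

```python
import math

N = 112  # team seasons

# (st_dev, se_mean, minimum, maximum, q1, q3) copied from the descriptive statistics
summary = {
    "R per G":  (0.5474, 0.0517, 3.5714, 6.2671, 4.6588, 5.4228),
    "OBP":      (15.36, 1.45, 300.00, 374.00, 329.25, 352.00),
    "SLG":      (27.00, 2.55, 375.00, 491.00, 411.25, 453.75),
    "AVG":      (11.59, 1.09, 240.00, 293.00, 263.00, 279.00),
    "SB per G": (0.2003, 0.0189, 0.2284, 1.2112, 0.4884, 0.7731),
    "SH per G": (0.07297, 0.00690, 0.06790, 0.40994, 0.18519, 0.28571),
}

se_recomputed = {}
flagged = []
for name, (st_dev, se_mean, minimum, maximum, q1, q3) in summary.items():
    # Standard error of the mean from the reported standard deviation.
    se_recomputed[name] = st_dev / math.sqrt(N)

    # Box-plot fences (assumed 1.5 x IQR convention).
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    if minimum < lower_fence or maximum > upper_fence:
        flagged.append(name)

    print(f"{name:9s} SE={se_recomputed[name]:.5f} "
          f"fences=({lower_fence:.4f}, {upper_fence:.4f})")

print("Variables with a point beyond the fences:", flagged)
```

With the reported quartiles, only SB per G has an extreme value beyond its fence, which agrees with the single outlier noted above.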

The next step is to find fitted line plots of each of my predictors against R per game. While these plots will not illustrate how these variables work together to predict my target variable, they can be useful in determining the general relationship between the predictors and the dependent variable:

While it is clear that OBP, SLG and AVG have a strong positive correlation with R per game, it is interesting to note the low correlation of SB and SH per game with R per game. In fact, each has a slightly negative correlation, indicating that increases in these variables actually are correlated with fewer runs scored (score one for Billy Beane!). However, the correlations for these two variables are so low that I do not put too much weight into the signs but rather just emphasize how poor they are in predicting R per game by themselves. Perhaps they add more value when working with the other variables.

V. Preliminary Multiple Regression Model

As mentioned, these plots only tell us part of the story. We now need to examine a regression output to see how these variables relate together in predicting our target variable.

Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is

R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG

+ 0.0725 SB per Game + 0.082 SH per Game

Predictor Coef SE Coef T P

Constant -5.8078 0.3530 -16.45 0.000

OBP 0.022958 0.002146 10.70 0.000

SLG 0.0086819 0.0009370 9.27 0.000

AVG -0.002914 0.002593 -1.12 0.264

SB per G 0.07245 0.07767 0.93 0.353

SH per G 0.0822 0.2053 0.40 0.690

S = 0.1518 R-Sq = 92.7% R-Sq(adj) = 92.3%

Analysis of Variance

Source DF SS MS F P

Regression 5 30.8164 6.1633 267.34 0.000

Residual Error 106 2.4437 0.0231

Total 111 33.2601
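The summary line of the output (S, R-Sq, R-Sq(adj)) and the F statistic all follow from the Analysis of Variance table. The short sketch below recomputes them from the reported sums of squares using the standard formulas; the variable names are my own, and small differences from the printed values are due to rounding in the output.

```python
import math

n = 112  # observations
k = 5    # predictors in the preliminary model

# Sums of squares copied from the Analysis of Variance table
ss_regression = 30.8164
ss_error = 2.4437
ss_total = 33.2601

ms_regression = ss_regression / k           # regression mean square
ms_error = ss_error / (n - k - 1)           # residual mean square (106 DF)

r_sq = ss_regression / ss_total             # share of variability explained
r_sq_adj = 1 - ms_error / (ss_total / (n - 1))
s = math.sqrt(ms_error)                     # standard error of the estimate
f_stat = ms_regression / ms_error

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"S         = {s:.4f}")
print(f"F         = {f_stat:.2f}")
```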

The adjusted R² value of 92.3% indicates that the model accounts for much of the variability in runs scored per game. As we can see, holding all else constant, a one “point” increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is associated with an increase in the expected R per game for the team of 0.022958. Of course, a .001 increase is not very meaningful, and thus such a small effect on R per game is not surprising. However, say a team is able to raise its OBP by 50 points. This would result in an expected increase of over a run a game (1.1479 to be exact). Needless to say, an additional run per game over the course of a season could easily be the difference between making the playoffs and finishing in the middle of the pack.
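To make the arithmetic concrete, the fitted equation can be evaluated directly. The sketch below is illustrative only: the function name predict_runs_per_game and the "average team" built from the sample means are my own constructions, and the 50-point change holds every other predictor fixed, as in the interpretation above.

```python
# Coefficients copied from the preliminary regression output.
COEF = {
    "Constant": -5.8078,
    "OBP": 0.022958,
    "SLG": 0.0086819,
    "AVG": -0.002914,
    "SB per G": 0.07245,
    "SH per G": 0.0822,
}


def predict_runs_per_game(team):
    """Evaluate the fitted equation; OBP, SLG and AVG are on the x1000 scale."""
    return COEF["Constant"] + sum(COEF[name] * value for name, value in team.items())


# Sample means from the descriptive statistics, used as a hypothetical average team.
average_team = {
    "OBP": 340.36,
    "SLG": 433.21,
    "AVG": 270.80,
    "SB per G": 0.6370,
    "SH per G": 0.23698,
}

baseline = predict_runs_per_game(average_team)
improved = predict_runs_per_game({**average_team, "OBP": average_team["OBP"] + 50})

print(f"Average team:          {baseline:.4f} runs per game")
print(f"With OBP up 50 points: {improved:.4f} runs per game")
print(f"Difference:            {improved - baseline:.4f}")  # 50 * 0.022958
```

The prediction for the average team lands very close to the sample mean of 5.0437 runs per game, as expected, since a least-squares fit passes through the point of means.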

The standard error of the estimate of 0.1518 implies the model can predict R per game to within .3036 (2 x .1518) about 95% of the time. Considering that the range of R per game was about 2.7 and the interquartile range was about 0.8, the model seems to be a highly useful predictor, which we would expect with such a high R² value.

VI. Residual Plots and Checking Assumptions

We must examine the behavior of the residuals in order to identify any unusual observations and to determine whether the assumptions made on the terms in the model are appropriate.

The distribution of the residuals seems fairly normal. I next will plot the residuals versus each of the predicting variables to see if there is any apparent structure:

I do not see any apparent structures. I next will examine the normal plot of the residuals:

As the plot indicates, the residuals are roughly normally distributed. There appears to be an outlier or two (circled), but nothing to cause an assumption to be violated. We will address the outliers later in this paper, in section VIII, after we perform the “best” model selection process.

Now I will examine the residuals versus the fitted values plot to check the next two assumptions, namely that the errors show no systematic structure and that their variance is constant:

Fortunately, there does not appear to be any structure to the errors as no apparent patterns are seen. In addition, the plot shows constant variance and thus the homoscedasticity assumption appears fine.

Although not really time-sequenced, the data does come from eight different baseball seasons. Thus, it is still useful to plot the residuals versus the order of the data to check that the error terms are not related to one another, that is, that the independence assumption is not violated:

Clearly, there do not appear to be any patterns. Each baseball season comprises 14 observations, and moving from left to right in the chart (from 2003 to 1996) does not show any distinct changes in the residuals. This makes sense, as while baseball has gone through “deadball” and offensive periods over time (and stadium size changes, etc.), the game has not fundamentally changed over the course of the previous 8 seasons. Thus, we can conclude that this assumption has not been violated.
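A numeric companion to the residuals-versus-order plot is the Durbin-Watson statistic. It is not part of the analysis above and is offered only as a sketch of how the visual check could be backed up. Values near 2 suggest no serial correlation, values well below 2 suggest positive correlation, and values well above 2 suggest negative correlation. The residual sequences below are hypothetical, not taken from the team data.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in observation order (NOT the model's residuals).
alternating = [1.0, -1.0, 1.0, -1.0]           # sign flips every step
drifting = [1.0, 0.9, 0.8, -0.8, -0.9, -1.0]   # slow drift from high to low

print(durbin_watson(alternating))  # 3.0, well above 2
print(durbin_watson(drifting))     # well below 2
```

Applied to the actual residuals in the order plotted (2003 back to 1996), a value close to 2 would be consistent with the conclusion drawn from the plot.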

VII. Model Improvement

As stated earlier, there are a few potential outliers and leverage points that need to be examined to see if their removal could improve the model. However, before doing so, I will first look at whether the model can be improved by eliminating variables that add little predictive value. Following that, I will revisit the outliers. Lastly, I will have a final look at the residual plots to see if the regression assumptions hold. Let’s examine again the preliminary regression results:

Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is

R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG

+ 0.0725 SB per Game + 0.082 SH per Game

Predictor Coef SE Coef T P

Constant -5.8078 0.3530 -16.45 0.000

OBP 0.022958 0.002146 10.70 0.000

SLG 0.0086819 0.0009370 9.27 0.000

AVG -0.002914 0.002593 -1.12 0.264

SB per G 0.07245 0.07767 0.93 0.353

SH per G 0.0822 0.2053 0.40 0.690

S = 0.1518 R-Sq = 92.7% R-Sq(adj) = 92.3%

Analysis of Variance

Source DF SS MS F P

Regression 5 30.8164 6.1633 267.34 0.000

Residual Error 106 2.4437 0.0231

Total 111 33.2601
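Each entry in the T column is simply the coefficient divided by its standard error. The sketch below recomputes the column from the reported values and ranks the predictors by the size of the t-statistic. As a rough guide, with 106 residual degrees of freedom a value of about 2 in absolute terms corresponds to the usual 5% significance level. Exact p-values are not recomputed here.

```python
# (coefficient, standard error) pairs copied from the regression output.
terms = {
    "Constant": (-5.8078, 0.3530),
    "OBP": (0.022958, 0.002146),
    "SLG": (0.0086819, 0.0009370),
    "AVG": (-0.002914, 0.002593),
    "SB per G": (0.07245, 0.07767),
    "SH per G": (0.0822, 0.2053),
}

# t-statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in terms.items()}

# Rank the predictors (not the constant) from largest to smallest |t|.
ranked = sorted(
    (name for name in t_stats if name != "Constant"),
    key=lambda name: abs(t_stats[name]),
    reverse=True,
)

for name in ranked:
    print(f"{name:9s} t = {t_stats[name]:6.2f}")
```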

The t-statistics for the variables indicate which variables add the most given all the other variables. In this case, it appears that OBP and SLG clearly add the most while SB and SH per game, given the other variables, add the least, with AVG somewhere in the middle. The high p-values for SB, SH per game and AVG indicate that there may be potential to simplify the model by eliminating variables without losing much predictive power for my dependent variable of R per game. We can get a sneak preview of what simplified models may look like by using the “Best Subsets” functionality in Minitab:

Best Subsets Regression: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

Response is R per Game

Vars R-Sq R-Sq(adj) C-p S OBP SLG AVG SB per G SH per G

1 86.4 86.3 87.7 0.20252 X

1 78.9 78.7 196.3 0.25252 X

2 92.5 92.4 2.2 0.15125 X X

2 86.6 86.4 86.7 0.20188 X X

3 92.6 92.4 3.2 0.15129 X X X

3 92.6 92.3 3.5 0.15146 X X X

4 92.6 92.4 4.2 0.15124 X X X X

4 92.6 92.3 4.9 0.15174 X X X X

5 92.7 92.3 6.0 0.15184 X X X X X
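The R-Sq(adj) and C-p columns above can be recomputed from the S column together with the ANOVA table of the five-predictor model. The sketch below uses the standard definitions of adjusted R² and Mallows' C-p; the function names are my own, and the results match the printed columns only up to rounding.

```python
n = 112                  # observations
ss_total = 33.2601       # Total SS from the ANOVA table
mse_full = 2.4437 / 106  # residual mean square of the five-predictor model

# (number of predictors, S) for each row of the Best Subsets output
rows = [
    (1, 0.20252), (1, 0.25252),
    (2, 0.15125), (2, 0.20188),
    (3, 0.15129), (3, 0.15146),
    (4, 0.15124), (4, 0.15174),
    (5, 0.15184),
]


def adjusted_r_sq(s):
    """Adjusted R-squared (in percent) from the standard error of the estimate."""
    return 100 * (1 - s ** 2 / (ss_total / (n - 1)))


def mallows_cp(p, s):
    """Mallows' C-p for a subset with p predictors plus a constant."""
    sse = s ** 2 * (n - p - 1)  # residual sum of squares of the subset model
    return sse / mse_full - (n - 2 * (p + 1))


results = [(p, adjusted_r_sq(s), mallows_cp(p, s)) for p, s in rows]
for p, adj, cp in results:
    print(f"{p} vars: R-Sq(adj) = {adj:5.2f}%  C-p = {cp:6.1f}")
```

A subset whose C-p is close to its number of parameters (predictors plus one) loses little relative to the full model. The best two-variable row, with C-p of about 2.2 against 3 parameters, meets that standard.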

As suggested earlier, it definitely appears that OBP and SLG alone can provide a model with excellent predictive power for R per game. In fact, a model with just these two variables provides an adjusted R² of 92.4%. The fact that the adjusted R² remains virtually unchanged (in fact, it even rose slightly, from 92.3% to 92.4%) with the elimination of the other three variables indicates that these three add very little to the multiple regression fit. Thus, let us re-run the regression using only OBP and SLG: