Regression

The Problem:

Electrical Use:

Month / KWH
1 / 3,616
2 / 2,361
3 / 3,353
4 / 4,561
5 / 5,768
6 / 3,056
7 / 2,434
8 / 1,471
9 / 1,443
10 / 2,775
11 / 3,233
12 / 3,531
13 / 2,277
14 / 3,060
15 / 4,638
16 / 6,020
17 / 7,452
18 / 3,372
19 / 2,079
20 / 1,230
21 / 1,703
22 / 2,773
23 / 3,199

Degree Days is a measure of the need to heat or cool a home and a surrogate for weather.

Formula for Degree Days (DD):

Compute average of high and low for a day.

If average is between 65 and 70 DD = 0

If average is less than 65, DD = 65 – average

If average is above 65, DD = average – 65

Month / DD
1 / 248
2 / 120
3 / 263
4 / 481
5 / 603
6 / 226
7 / 212
8 / 75
9 / 82
10 / 273
11 / 369
12 / 476
13 / 213
14 / 248
15 / 455
16 / 676
17 / 858
18 / 382
19 / 203
20 / 68
21 / 250
22 / 422
23 / 503
Month / DD / KWH
1 / 248 / 3,616
2 / 120 / 2,361
3 / 263 / 3,353
4 / 481 / 4,561
5 / 603 / 5,768
6 / 226 / 3,056
7 / 212 / 2,434
8 / 75 / 1,471
9 / 82 / 1,443
10 / 273 / 2,775
11 / 369 / 3,233
12 / 476 / 3,531
13 / 213 / 2,277
14 / 248 / 3,060
15 / 455 / 4,638
16 / 676 / 6,020
17 / 858 / 7,452
18 / 382 / 3,372
19 / 203 / 2,079
20 / 68 / 1,230
21 / 250 / 1,703
22 / 422 / 2,773
23 / 503 / 3,199


Step 1 -- Decide what is x (the so called independent variable) and what is y (the so called dependent variable).

In business situations y is the variable you are trying to predict (or explain) and x (or x’s) is the variable you are using to predict or explain.

In our case we are trying to predict KWH usage (so this is y) using degree days as a predictor (so this is x).

The resulting graph (remember to use the X-Y plot option) is:


Step 2 -- Decide if there is a relationship between x and y.

It is complex to define a relationship. It is easier to describe no relationship.

Three forms:

a) Random Scatter


b) Flat Lines:


c) No predictive relationship:


Step 3 -- Decide if the relationship is a straight line

Mathematical Straight Line

Statistical Straight Line:


Mathematical Equation for a Straight Line:

Statistical Equation for a Straight Line:

where ei is a random quantity with mean 0 and standard deviation se.

Fitting a straight line means getting estimates of b0, b1, and se.


What if the relationship is not straight? Many relationships can be “straightened” out by generalizing the concept of a linear relationship to a “linearizable” relationship. This is done by defining a “linearizable” relationship as:

where h( ) and g( ) are functions. This is also called “transforming” the data.

For example, consider the following data plot:

If we take h(y) to be 1/y, and g(x) to be 1/x, and plot h(y) vs g(x), the resulting plot looks like:

The EXCEL file “transform.xls” contains several other examples of this situation.

If the original data is not straight, you should try one of the sixteen combinations of h(y) and g(x) obtained by picking one transform from each of the following columns:

Transform of x Transform of y

x y

log(x) log(y)

1/x 1/y

If none of these work, it is time to call a professional statistician.

Step 4 – Find the “Best” Fitting Equation

What does “Best” mean?

Consider the following plotted data:

One possible way of measuring “best” is to draw a line, measure the vertical distance of each point from the line, and add these up. Below is an example of a good fit:

Notice that the vertical distances are sometimes small and sometimes large but the large and small values tend not to be concentrated in any one part of the line.

As an example of a poor fit, consider the following line:

Here the biggest differences tend to be on opposite ends of the line. Also the total length of the lines is large than the previous graph.

Least Squares Criteria

We will utilize the Least Squares Criteria which picks that line which minimizes the squared vertical distances. Specifically, we wish to find b0, and b1 to minimize the square vertical distances.

Formally,

over all possible b’s.

By using standard maximization techniques, the solution is given by two equations:

The algebraic solution to these equations is given in the textbook. EXCEL simply gives you the numerical values of the coefficients.

The following data is found in the EXCEL file “regress.xls” located in the folder MBA Part 1.

In order to estimate the coefficients using EXCEL, you need to use the Analysis ToolPak by clicking on “Tools”, “Data Analysis”, and then click on Regression. Your screen should look like this:


Move your cursor to the “Input Y Range” box and click till you see the cursor. Then highlight the KWH column (including the column heading). Then move your cursor to the “Input X Range” box and click till you see the cursor. Then highlight the DD column (including the column heading). Finally, check the following options: Labels, Residuals, Residual Plots, and Line Fit Plots. Your screen should look like the following:


Then hit OK. You then should see the following screen:


The values of the coefficients are given in the column labeled “Coefficients”. In particular:

The variability around the regression line, se is obtained by taking the square root of the entry in the ANOVA table in the row labeled “Residual” and the column labeled “MS”. In particular:


These values are highlighted below:

The last stage of a regression analysis is to assess the adequacy of the model.

In order to do this, we need to examine the components of the model which are purely statistical, i.e. the residual terms ei. To see why this is necessary, consider the four data sets in the EXCEL worksheet “Tufte.xls” which you will find in the EXCEL folder MBA Part 1. All four of these data sets lead to exactly the same values of the regression coefficients and se. In addition the y values and x values all have the same means and standard deviations. However when you do the regression and examine the automatic residual graph you get very different results. Consider the residual plot below:


Now consider the next residual plot:


Now consider the third plot:


Finally consider the fourth residual plot:


For our electrical data, the residual plot looks like the following:


EXCEL also provides a plot of the actual values of y and the predicted values for each value of x. This is illustrated below:


Since our model looks like it fits the data, we can try to use it to predict kilowatt hour usage using degree days as a predictor.

The basic equation would be:

Predicted KWH usage = 903.515 + 7.089 x (degree days).

Thus for a billing period with 100 degree days (possibly a period in the spring or autumn in Dallas), the predicted KWH usage would be 1,612 KWH.

This “point” estimate ignores variability. However we can use the results discussed earlier about the “mound” rule and Chebyshev to incorporate variability into our forecasts to get what is called an “interval” forecast. In this case, the formula is given by:

In the case discussed above for a billing period of 100 degree days, the interval forecast would be:

1612 +/- 2*(605.9)

or approximately 400 KWH to 2,824 KWH.


Although the above method is the most practical way to assess the adequacy of a model for forecasting purposes, it is common to use a single descriptive measure called the “correlation coefficient” to describe the adequacy of the regression fit.

The correlation coefficient attempts to quantify how useful x is as a predictor of y.

If one were not to use x in the forecasting of y, then one would guess the mean value of y as the forecast of kilowatt hours for each period. If one defines the error made as:

and,

SST is a measure of the total errors made not using x as a predictor.

If we use x, then we can define the error made as:

and,


SSE is a measure of the total errors made using x as a predictor.


One can show that,


Therefore,


If we define,


then,


By rewriting the definition of R2 as:


One can interpret R2 as

the proportion of the variability in y explained (or eliminated) by using x as a predictor

As we shall see, all of the above generalizes to the case where one has more than one x as a predictor. In the case however of a single x predictor, one usually encounters the correlation coefficient r defined as:


In our case R2=.8538, and r=.9240. Both of these values can be found on the EXCEL as highlighted below:


A natural question is how big does R2 have to be in order for the regression analysis to be useful? I would suggest that the important measure is the usefulness of the interval forecast.

31