
252regr 2/26/07 (Open this document in 'Outline' view!) Roger Even Bove

G. LINEAR REGRESSION-Curve Fitting

1. Exact vs. Inexact Relations

2. The Ordinary Least Squares Formula

We wish to estimate the coefficients in $y = \beta_0 + \beta_1 x + \varepsilon$. Our 'prediction' will be $\hat y = b_0 + b_1 x$ and our error will be $e = y - \hat y$, so that $y = b_0 + b_1 x + e$. The least squares estimates are $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - n\bar x \bar y}{\sum x^2 - n\bar x^2}$ and $b_0 = \bar y - b_1 \bar x$.

(See appendix for derivation)
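A minimal sketch of these formulas in Python (the helper name `ols_fit` is our own, not from any package):

```python
# Ordinary least squares for one explanatory variable,
# computed through the 'spare parts' S_xy and S_xx.
def ols_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    Sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    Sxx = sum(a * a for a in x) - n * xbar ** 2
    b1 = Sxy / Sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1
```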

3. Example

 obs    x    y   xy   y²   x²
   1    0    0    0    0    0
   2    2    1    2    1    4
   3    1    2    2    4    1
   4    3    1    3    1    9
   5    1    0    0    0    1
   6    3    3    9    9    9
   7    4    4   16   16   16
   8    2    2    4    4    4
   9    1    2    2    4    1
  10    2    1    2    1    4
 sum   19   16   40   40   49

First copy $n = 10$, $\sum x = 19$, $\sum y = 16$, $\sum xy = 40$, $\sum y^2 = 40$ and $\sum x^2 = 49$ from the table.

Then compute means: $\bar x = \frac{\sum x}{n} = \frac{19}{10} = 1.9$ and $\bar y = \frac{\sum y}{n} = \frac{16}{10} = 1.6$.

Use these to compute 'Spare Parts': $S_{xx} = \sum x^2 - n\bar x^2 = 49 - 10(1.9)^2 = 12.9$, $S_{xy} = \sum xy - n\bar x \bar y = 40 - 10(1.9)(1.6) = 9.6$, and $S_{yy} = \sum y^2 - n\bar y^2 = 40 - 10(1.6)^2 = 14.4$ (the Total Sum of Squares, $SST$).

Note that $S_{xx}$ and $S_{yy}$ must be positive, while $S_{xy}$ can be either positive or negative.

We can compute the coefficients: $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{9.6}{12.9} = 0.7442$ and $b_0 = \bar y - b_1 \bar x = 1.6 - 0.7442(1.9) = 0.1860$.

So our regression equation is $\hat y = b_0 + b_1 x$ or $\hat y = 0.1860 + 0.7442x$.
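To verify the arithmetic, here is the same computation in Python, using the data from the table (`Sxx`, `Sxy`, `Syy` are our shorthand for the spare parts):

```python
x = [0, 2, 1, 3, 1, 3, 4, 2, 1, 2]
y = [0, 1, 2, 1, 0, 3, 4, 2, 2, 1]
n = len(x)                                                 # 10
xbar, ybar = sum(x) / n, sum(y) / n                        # 1.9, 1.6
Sxx = sum(a * a for a in x) - n * xbar**2                  # 49 - 36.1 = 12.9
Sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar   # 40 - 30.4 = 9.6
Syy = sum(b * b for b in y) - n * ybar**2                  # 40 - 25.6 = 14.4 (SST)
b1 = Sxy / Sxx             # 0.7442
b0 = ybar - b1 * xbar      # 0.1860
print(f"yhat = {b0:.4f} + {b1:.4f}x")
```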

4. $R^2$, the Coefficient of Determination

$SSR = b_1 S_{xy} = \frac{S_{xy}^2}{S_{xx}}$ is the Regression (Explained) Sum of Squares.

$SSE = SST - SSR$ is the Error (Unexplained or Residual) Sum of Squares, and is defined as $\sum (y - \hat y)^2$, a formula that should never be used for computation.

The coefficient of determination is $R^2 = \frac{SSR}{SST} = \frac{b_1 S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}}$. An alternate formula, if no spare parts have been computed, is $R^2 = \frac{\left(\sum xy - n\bar x\bar y\right)^2}{\left(\sum x^2 - n\bar x^2\right)\left(\sum y^2 - n\bar y^2\right)}$. The coefficient of determination is the square of the correlation $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$. Note that $r$, $S_{xy}$ and $b_1$ all have the same sign. For the G3 data, $R^2 = \frac{(9.6)^2}{(12.9)(14.4)} = 0.4961$.
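A quick numerical check of these sums of squares, continuing from the spare parts above:

```python
Sxx, Sxy, Syy = 12.9, 9.6, 14.4   # spare parts from G3
b1 = Sxy / Sxx
SSR = b1 * Sxy                    # explained SS: 7.1442
SSE = Syy - SSR                   # error SS: 7.2558 (not computed from residuals)
R2 = SSR / Syy                    # 0.4961
r = Sxy / (Sxx * Syy) ** 0.5      # correlation 0.7043, same sign as Sxy and b1
assert abs(R2 - r**2) < 1e-12     # R^2 is the square of the correlation
```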

H. LINEAR REGRESSION-Simple Regression

1. Fitting a Line

2. The Gauss-Markov Theorem

OLS is BLUE: among all linear unbiased estimators of $\beta_0$ and $\beta_1$, the least squares estimators are Best in the sense of having the smallest variance.

3. Standard Errors – The standard error is defined as $s^2 = \frac{SSE}{n-2} = \frac{\sum (y - \hat y)^2}{n-2}$.

$s^2 = \frac{SST - SSR}{n-2} = \frac{S_{yy} - b_1 S_{xy}}{n-2}$.

Or, if no spare parts are available, $s^2 = \frac{\left(\sum y^2 - n\bar y^2\right) - b_1\left(\sum xy - n\bar x\bar y\right)}{n-2}$.

Note also that if $R^2$ is available, $s^2 = \frac{S_{yy}\left(1 - R^2\right)}{n-2}$.

Using data from G3, and using our spare parts, $s^2 = \frac{14.4 - 0.7442(9.6)}{10-2} = \frac{7.2558}{8} = 0.9070$, so that $s = 0.9524$.
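The same computation in Python, checking that the spare-parts and $R^2$ routes agree:

```python
n, Sxx, Sxy, Syy = 10, 12.9, 9.6, 14.4
b1 = Sxy / Sxx
s2 = (Syy - b1 * Sxy) / (n - 2)    # 7.2558/8 = 0.9070
s = s2 ** 0.5                      # 0.9524
R2 = Sxy**2 / (Sxx * Syy)
assert abs(s2 - Syy * (1 - R2) / (n - 2)) < 1e-12   # same answer via R^2
```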

4. The Variance of $b_0$ and $b_1$.

$s_{b_1}^2 = \frac{s^2}{S_{xx}}$ and $s_{b_0}^2 = s^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)$
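Evaluated for the G3 data in a small Python check ($s^2 = 0.9070$ comes from H3):

```python
n, Sxx, xbar, s2 = 10, 12.9, 1.9, 0.9070
var_b1 = s2 / Sxx                          # 0.0703
var_b0 = s2 * (1 / n + xbar**2 / Sxx)      # 0.3445
sb1, sb0 = var_b1 ** 0.5, var_b0 ** 0.5    # 0.2652, 0.5870
```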

I. LINEAR REGRESSION-Confidence Intervals and Tests

1. Confidence Intervals for $\beta_1$.

$\beta_1 = b_1 \pm t_{\alpha/2}^{(n-2)} s_{b_1}$. The interval can be made smaller by increasing either $n$ or the amount of variation in $x$.

2. Tests for $\beta_1$.

To test $H_0\!: \beta_1 = \beta_{10}$ use $t = \frac{b_1 - \beta_{10}}{s_{b_1}}$. Remember $\beta_{10}$ is most often zero, and if the null hypothesis is false in that case we say that $b_1$ is significant.

To continue the example in G3: $H_0\!: \beta_1 = 0$ against $H_1\!: \beta_1 \neq 0$. We have already computed $s^2 = 0.9070$, which implies that $s_{b_1}^2 = \frac{s^2}{S_{xx}} = \frac{0.9070}{12.9} = 0.0703$ and $s_{b_1} = 0.2652$.

The significance test is now $t = \frac{b_1 - 0}{s_{b_1}} = \frac{0.7442}{0.2652} = 2.81$. Assume that $\alpha = .05$, so that for a 2-sided test $t_{.025}^{(8)} = 2.306$ and we reject the null hypothesis if $t$ is below –2.306 or above 2.306. Since $t = 2.81$ is in the rejection region, we say that $b_1$ is significant. A further test, $t = \frac{0.7442 - 1}{0.2652} = -0.96$, says that $b_1$ is not significantly different from 1.

If we want a confidence interval, $\beta_1 = b_1 \pm t_{.025}^{(8)} s_{b_1} = 0.7442 \pm 2.306(0.2652) = 0.7442 \pm 0.6116$, or 0.133 to 1.356. Note that this includes 1, but not zero.

Note that since $s_{b_1}^2 = \frac{s^2}{S_{xx}} = \frac{s^2}{\sum x^2 - n\bar x^2}$, both a large sample size, $n$, and a large variance of $x$ will tend to make $s_{b_1}$ smaller and thus decrease the size of a confidence interval for $\beta_1$ or increase the size (and significance) of the t-ratio. To put it more negatively, small amounts of variation in $x$ or small sample sizes will tend to produce values of $b_1$ that are not significant. The common sense interpretation of this statement is that we need a lot of experience with what happens to $y$ when we vary $x$ to be able to put any confidence in our estimate of the slope of the equation that relates them.
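The tests and the interval in this section take only a few lines of Python; scipy supplies the t table value:

```python
from scipy.stats import t

b1, sb1, df = 0.7442, 0.2652, 8
t_crit = t.ppf(0.975, df)            # 2.306 for a 2-sided test at alpha = .05
t0 = (b1 - 0) / sb1                  # 2.81: in the rejection region, b1 significant
t1 = (b1 - 1) / sb1                  # -0.96: not significantly different from 1
ci = (b1 - t_crit * sb1, b1 + t_crit * sb1)   # about (0.133, 1.356)
```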

3. Confidence Intervals and Tests for $\beta_0$

We are now testing $H_0\!: \beta_0 = \beta_{00}$ with $t = \frac{b_0 - \beta_{00}}{s_{b_0}}$.

$s_{b_0}^2 = s^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right) = 0.9070\left(\frac{1}{10} + \frac{(1.9)^2}{12.9}\right) = 0.3445$. So $s_{b_0} = 0.5870$. If $\beta_{00} = 0$ we are testing $H_0\!: \beta_0 = 0$, and $t = \frac{0.1860 - 0}{0.5870} = 0.32$. Since the rejection region is the same as in I2, we accept the null hypothesis and say that $b_0$ is not significant. A confidence interval would be $\beta_0 = b_0 \pm t_{.025}^{(8)} s_{b_0} = 0.1860 \pm 2.306(0.5870) = 0.1860 \pm 1.3536$, or –1.168 to 1.540.

A common way to summarize our results is $\hat y = \underset{(0.5870)}{0.1860} + \underset{(0.2652)}{0.7442}\,x$. The equation is written with the standard deviations below the equation. For a Minitab printout example of a simple regression problem, see 252regrex1.
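As a cross-check, a standard package reproduces all of these numbers at once. This sketch uses Python's statsmodels library in place of the Minitab run mentioned above:

```python
import statsmodels.api as sm

x = [0, 2, 1, 3, 1, 3, 4, 2, 1, 2]
y = [0, 1, 2, 1, 0, 3, 4, 2, 2, 1]
res = sm.OLS(y, sm.add_constant(x)).fit()   # add_constant supplies the intercept
print(res.params)     # approx [0.1860, 0.7442]
print(res.bse)        # approx [0.5870, 0.2652]
print(res.summary())  # t-ratios, p-values, R-squared = 0.496, etc.
```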

4. Prediction and Confidence Intervals for $y_0$

The Confidence Interval is $\mu_{y_0} = \hat y_0 \pm t_{\alpha/2}^{(n-2)} s_{\hat y}$, where $s_{\hat y}^2 = s^2\left[\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right]$, and the Prediction Interval is $y_0 = \hat y_0 \pm t_{\alpha/2}^{(n-2)} s_{\hat y}$, where $s_{\hat y}^2 = s^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right]$. In these two formulas, for some specific $x_0$, $\hat y_0 = b_0 + b_1 x_0$. For example, assume that $x_0 = 2$ (an illustrative value), so that for the results in G3, $\hat y_0 = 0.1860 + 0.7442(2) = 1.674$. Then $s_{\hat y}^2 = 0.9070\left[\frac{1}{10} + \frac{(2 - 1.9)^2}{12.9}\right] = 0.0914$ and $s_{\hat y} = 0.3023$, so that the confidence interval is $\mu_{y_0} = 1.674 \pm 2.306(0.3023) = 1.674 \pm 0.697$. This represents a confidence interval for the average value that $y$ will take when $x = 2$. For the same data, $s_{\hat y}^2 = 0.9070\left[1 + \frac{1}{10} + \frac{(2 - 1.9)^2}{12.9}\right] = 0.9984$ and $s_{\hat y} = 0.9992$, so that the prediction interval is $y_0 = 1.674 \pm 2.306(0.9992) = 1.674 \pm 2.304$. This is a confidence interval for the value that $y$ will take in a particular instance when $x = 2$.
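Both intervals in Python, under the same assumed $x_0 = 2$:

```python
from scipy.stats import t

n, xbar, Sxx, s2 = 10, 1.9, 12.9, 0.9070
b0, b1, x0 = 0.1860, 0.7442, 2
t_crit = t.ppf(0.975, n - 2)                               # 2.306
y0_hat = b0 + b1 * x0                                      # 1.674
se_mean = (s2 * (1/n + (x0 - xbar)**2 / Sxx)) ** 0.5       # 0.3023
se_pred = (s2 * (1 + 1/n + (x0 - xbar)**2 / Sxx)) ** 0.5   # 0.9992
ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)  # mean of y at x0
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)  # one new y at x0
```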

Ignore the remainder of this document unless you have had calculus!

Appendix to G2 – Explanation of OLS Formula

Assume that we have three points: $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$. We wish to fit a regression line to these points, with the equation $\hat y = b_0 + b_1 x$ and the characteristic that the sum of squares, $SSE = \sum (y - \hat y)^2$, is a minimum. If we imagine that there is a 'true' regression line $y = \beta_0 + \beta_1 x + \varepsilon$, we can consider $b_0$ and $b_1$ to be estimates of $\beta_0$ and $\beta_1$.

Let us make the definition $e = y - \hat y$. Note that if we substitute our equation for $\hat y$, we find that $e = y - (b_0 + b_1 x)$, or $y = b_0 + b_1 x + e$. This has two consequences: First, the sum of squares can be written as $SSE = \sum e^2$; and second, if we fit the line so that $\sum e = 0$, or the mean of $y$ and $\hat y$ is the same, we have $\bar y = b_0 + b_1 \bar x$. Now if we subtract the equation for $\bar y$ from the equation for $y$, we find $y - \bar y = b_1(x - \bar x) + e$. Now let us measure $x$ and $y$ as deviations from the mean, replacing $x - \bar x$ with $x^*$ and $y - \bar y$ with $y^*$. This means that $y^* = b_1 x^* + e$ or $e = y^* - b_1 x^*$. If we substitute this expression in our sum of squares, we find that $SSE = \sum \left(y^* - b_1 x^*\right)^2$.

Now write this expression out in terms of our three points and differentiate it to minimize with respect to $b_1$. To do this, recall that $b_1$ is our unknown and that the $x^*$s and $y^*$s are numbers (constants!), so that $\frac{d}{db_1}\left(b_1 x^*\right) = x^*$ and $\frac{d}{db_1}\left(y^* - b_1 x^*\right)^2 = -2x^*\left(y^* - b_1 x^*\right)$.

$SSE = \left(y_1^* - b_1 x_1^*\right)^2 + \left(y_2^* - b_1 x_2^*\right)^2 + \left(y_3^* - b_1 x_3^*\right)^2$.

If we now take a derivative of this expression with respect to $b_1$ and set it equal to zero to find a minimum, we find that:

$\frac{d(SSE)}{db_1} = -2x_1^*\left(y_1^* - b_1 x_1^*\right) - 2x_2^*\left(y_2^* - b_1 x_2^*\right) - 2x_3^*\left(y_3^* - b_1 x_3^*\right) = -2\sum x^*\left(y^* - b_1 x^*\right) = 0$.

But if $\sum x^*\left(y^* - b_1 x^*\right) = 0$, then $\sum x^* y^* - b_1 \sum x^{*2} = 0$ or $\sum x^* y^* = b_1 \sum x^{*2}$, so that if we solve for $b_1$, we find $b_1 = \frac{\sum x^* y^*}{\sum x^{*2}}$. But if we remember that $\sum x^* y^* = \sum (x - \bar x)(y - \bar y) = \sum xy - n\bar x\bar y = S_{xy}$ and $\sum x^{*2} = \sum (x - \bar x)^2 = \sum x^2 - n\bar x^2 = S_{xx}$, we can write this as $b_1 = \frac{S_{xy}}{S_{xx}}$ or $b_1 = \frac{\sum xy - n\bar x\bar y}{\sum x^2 - n\bar x^2}$.

Of course, we still need $b_0$, but remember that $\bar y = b_0 + b_1 \bar x$, so that $b_0 = \bar y - b_1 \bar x$.
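The derivation can also be confirmed numerically: minimize the sum of squares over $b_1$ and compare with $S_{xy}/S_{xx}$ (a sketch in which scipy's one-dimensional minimizer stands in for the calculus):

```python
from scipy.optimize import minimize_scalar

x = [0, 2, 1, 3, 1, 3, 4, 2, 1, 2]
y = [0, 1, 2, 1, 0, 3, 4, 2, 2, 1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
xd = [a - xbar for a in x]    # deviations x*
yd = [b - ybar for b in y]    # deviations y*

def sse(b1):   # sum of squares in deviation form
    return sum((yi - b1 * xi) ** 2 for xi, yi in zip(xd, yd))

best = minimize_scalar(sse).x         # numerical minimizer
assert abs(best - 9.6 / 12.9) < 1e-6  # matches b1 = Sxy/Sxx = 0.7442
```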