Statistics 312 – Dr. Uebersax

31 – Linear regression

1. Simple Linear Regression

Example

In an automated assembly line, a machine drills a hole in a certain location of each new part being made. Over time, the accuracy of the machine decreases. You have data measured at seven timepoints: hours of machine use and degree of error (mm from target). You want to know whether the data in the x–y scatter plot (left) can be fitted with a straight line (right).


Why do this?

– to test a hypothesis (is error a linear function of hours of machine use?)

– to predict error for usage times not observed (interpolation or extrapolation)

Regression equation

At its simplest level, linear regression is a method for fitting a straight line through an x–y scatter plot.

Recall from other math courses (e.g., high school analytic geometry) that a straight line is described by the following formula:

ŷ = a + bx

(or, equivalently, ŷ = b0 + b1x)

where:

x = a value on the x axis

b = slope parameter

a = intercept parameter (i.e., value on y axis where x = 0 [not shown above])

ŷ = a predicted value of y

We can fit infinitely many lines through the points. Which is the 'best-fitting' line? The criterion will be to choose those values of a and b for which our squared prediction errors are minimized. In other words, we will minimize this function:

Badness of fit = SSE = Σ(y − ŷ)²

The difference y − ŷ is called a residual, and the sum above is called the residual sum of squares or sum of squared errors (SSE).
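The badness-of-fit criterion can be computed directly for any candidate line. Below is a minimal Python sketch using the drilling-error data from the example; the helper name `sse` is illustrative, not part of the lecture:

```python
# Hours of machine use and drilling error (mm), from the example data
hours = [30, 33, 34, 35, 39, 44, 45]
error = [1.10, 1.21, 1.25, 1.23, 1.30, 1.40, 1.42]

def sse(a, b, xs, ys):
    """Residual sum of squares for the candidate line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
```

Trying different values of a and b and comparing `sse` values shows why "best fit" means "smallest SSE".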

Note that some texts (and some software output) label the parameters b0 and b1 instead of a and b.

Okay, we have our criterion for 'best fit'. How do we estimate a and b? It turns out that we can use calculus to find the values of a and b that minimize SSE. When we do so, we discover the following:

b = r (s_y / s_x)

where s_x and s_y are the sample standard deviations of x and y.

NOTE: as explained in Lecture 33, the original version I gave for this formula was incorrect. The above is the correct version (5 Dec 2013).

where r is the Pearson correlation coefficient (which we calculated in the preceding lecture). Once we know b, we can find a:

a = ȳ − b x̄

where x̄ and ȳ are the means of x and y, respectively.

Prediction

We now have our linear regression equation. One thing we can do with it is predict the y value for some new value of x. For example, what would be the predicted drilling error for a machine after 40 hours of use?

ŷ = a + b(40)

where a and b are the regression equation coefficients that we've estimated.

2. Linear Regression with JMP

The results are in the Parameter Estimates area:

a = Intercept

b = name of variable (e.g., lot size)

Homework

Here are the data from our original example:

Hours   30     33     34     35     39     44     45
Error   1.10   1.21   1.25   1.23   1.30   1.40   1.42

1. Download these data (machine.xls) from the course website.

2. Perform a simple linear regression using JMP (be sure that Hours is the X variable).

3. Identify the parameter estimates (a and b).

4. Use the equation to predict drilling error when machine use = 40 hours.

Video (optional)

http://www.youtube.com/watch?v=xIDjj6ZyFuw (first 3 minutes only)