Unit 5 – 2 Variable Quantitative Data Notes #2

Linear Regression

The Big Mac has been one of McDonald’s signature sandwiches. One Big Mac provides 25 grams of protein, half the protein you would need in a day. It also supplies 550 calories and 29 grams of fat, which is 45% of the recommended daily intake of fat grams. So after eating a Big Mac, the rest of your calories that day will need to be low fat!!!

Of course the Big Mac isn’t the only item McDonald’s sells. How are fat and protein related for the entire McDonald’s menu? The scatterplot below shows the Fat (in grams) versus the Protein (in grams) for foods sold at McDonald’s.

It shows a positive, moderate (r = .61), linear relationship. If you want to consume only 15 grams of fat in your McDonald’s lunch, how much protein would you consume? Now we need to model the relationship with a line and give its equation. The equation will let us predict the protein content of any McDonald’s sandwich given the amount of fat it has. We just want the equation of a straight line that goes through the points, but no single line can go through all the points. So what equation should we use?

Back in Algebra II you learned how to find the line of best fit by placing a line down the “middle” of the points and finding the equation of that line. The problem with this method is that everyone will probably get a different equation. We want a method that gives us an accurate equation to make predictions with, and one where everyone using the same data gets the same equation. Before we can discuss this type of equation, we need to talk about residuals.

Residuals

If we were to construct a line of best fit, some of the points would be above this line and some below. The vertical distance from a point to the line of best fit is called the residual. Below you can see some examples of that distance on our McDonald’s scatterplot.

If we were to look at the Big Mac, which has 29 grams of fat, our best fit line says it should have 31 grams of protein. We call this estimate the predicted value and write it as ŷ (called “y-hat”). But the actual protein value of the Big Mac is 25 grams; we call this the observed value and write it as y. The difference between the observed value and the predicted value is the residual. The Big Mac’s residual would be y - ŷ = 25 - 31 = -6.

residual = observed value - predicted value = y - ŷ
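To make that arithmetic concrete, here is a tiny Python sketch (the variable names are ours, chosen just for illustration) that computes the Big Mac’s residual:

```python
# residual = observed value - predicted value (y - y-hat)
observed_protein = 25   # actual protein in a Big Mac (grams)
predicted_protein = 31  # protein the best fit line predicts for 29 g of fat

residual = observed_protein - predicted_protein
print(residual)  # -6: the line over-predicts the Big Mac's protein by 6 grams
```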

Linear Regression Part 2

When we draw our line of best fit, some residuals are positive and some are negative. We can’t see how well the line fits by just adding them up, because the negative and positive residuals would cancel each other out. We had this same problem when calculating standard deviation, and we will deal with it the same way: by squaring the residuals. Squaring makes them all positive so we can add them up. So we want the line that has the smallest sum of all the squared residuals, as the sketch below illustrates.
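As a rough illustration, the Python snippet below computes the sum of squared residuals for one candidate line. The data points and the line’s slope and intercept are made up, not the actual McDonald’s numbers; the least squares line is whichever line makes this sum as small as possible.

```python
# Hypothetical points: x = fat (grams), y = protein (grams)
xs = [10, 20, 29, 40]
ys = [12, 18, 25, 35]

# One candidate line, y-hat = a + b*x, with slope and intercept chosen by eye
a, b = 3.0, 0.75

# Sum of the squared residuals for this candidate line
ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(ssr)  # the least squares line is the choice of a and b minimizing this
```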

The line of best fit is the line where the sum of the squared residuals is the smallest. We call this the least squares regression line (LSR). You might think finding this line is hard; surprisingly, it is not. The LSR equation is

ŷ = a + bx, where the slope is b = r(Sy/Sx) and the y-intercept is a = ȳ - b·x̄

(Sx and Sy are the standard deviations of the x’s and y’s, and x̄ and ȳ are their means).

All you need to find is the correlation coefficient and the means and standard deviations of the x’s and y’s. You might ask yourself why the equation isn’t y = mx + b; it is a simple line, after all. Well, it is because we have entered a new statistical universe!! Really, there is a reason, but it is beyond the scope of this course. What you need to remember is that in this formula a is your y-intercept and b is your slope, but once you find the slope and y-intercept you can write the equation any way you prefer.
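Here is a minimal Python sketch of that recipe. The summary statistics below are invented for illustration (only r = .61 comes from our scatterplot); they are not the actual McDonald’s values.

```python
# Hypothetical summary statistics for the menu (only r = 0.61 is from the notes)
r = 0.61                   # correlation between fat and protein
x_bar, s_x = 22.0, 10.0    # mean and standard deviation of fat (grams)
y_bar, s_y = 18.0, 9.0     # mean and standard deviation of protein (grams)

b = r * (s_y / s_x)        # slope:       b = r(Sy/Sx)
a = y_bar - b * x_bar      # y-intercept: a = y-bar - b * x-bar

print(f"y-hat = {a:.2f} + {b:.3f}x")  # the LSR equation in y = a + bx form
```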

When we use technology to find the LSR, it will let you choose between y = a + bx and y = mx + b. Below you will see the McDonald’s data with the LSR drawn in and the LSR equation.

Residual Plot

When we want to know if a line of best fit is a good model, we can ask instead what the model missed.

We look at the residuals to see this. Residuals help us to see whether the linear model makes sense. So after we find the LSR, we usually plot the residuals and hope to find …. nothing!!!

Is the association between your explanatory variable and response variable really straight? Could the underlying association be curved? Even if your r-value is strong and your scatterplot looks linear, you need to check the residuals to make sure your data is straight enough to use the LSR as a prediction model.

A residual plot is a scatterplot with the points (x, residual) plotted. Below you will see the McDonald’s data with its residual plot underneath.

A residual plot should be the most boring scatterplot you’ve ever seen. It shouldn’t have any interesting features, like direction or shape. It should have about the same amount of scatter throughout, show no bends, and show no outliers. Our residual plot above is a good, boring scatterplot!!
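If you want to see how such a plot is built with technology, here is a rough Python sketch using matplotlib. The data points, slope, and intercept are invented for illustration, not the actual McDonald’s figures.

```python
import matplotlib.pyplot as plt

# Hypothetical (fat, protein) data, not the real McDonald's menu
xs = [5, 10, 15, 20, 25, 29, 35, 40]
ys = [8, 10, 15, 16, 22, 25, 28, 33]

# A least squares line found beforehand (made-up a and b)
a, b = 4.0, 0.7

# Residuals: observed value minus predicted value
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Residual plot: x on the horizontal axis, residuals on the vertical axis
plt.scatter(xs, residuals)
plt.axhline(0, color="gray")   # reference line at residual = 0
plt.xlabel("Fat (grams)")
plt.ylabel("Residual (grams of protein)")
plt.title("Residual plot: hope to see nothing interesting")
plt.show()
```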

Extrapolation

LSR models can be very useful, but don’t try to push them too far. With our McDonald’s data, making a prediction about the protein content of a sandwich with 31 grams of fat seems reasonable because our data encompasses that amount of fat. But making a prediction about the protein content of a sandwich with 60 grams of fat is far outside any data we collected, so the prediction might not be accurate. This is called extrapolation. Extrapolation is dangerous: you are assuming that the pattern of the data stays the same far outside the range of any data collected.

Here is an example. You collect data on the height of a baby from birth to the age of 2 years, create a scatterplot of age vs. height, and calculate a LSR equation. You then try to use that equation to predict someone’s height at 25 years old, and you would probably predict that they will be 10 feet tall!! Why did this happen? Because growth is not linear over a person’s lifetime; it might be linear from birth to age 2, but at some age height levels off. The moral of the example: beware of extrapolation!!
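To see the arithmetic behind that 10-foot prediction, here is a small Python sketch with a made-up growth line (the intercept and slope below are invented, not fitted to real data):

```python
# Hypothetical LSR from birth to age 2: height (inches) = 20 + 4 * age (years)
a, b = 20.0, 4.0  # made-up intercept and slope, not from real growth data

def predicted_height(age_years):
    """Predicted height in inches from the (hypothetical) LSR equation."""
    return a + b * age_years

print(predicted_height(2))   # 28 inches: inside the data's range, believable
print(predicted_height(25))  # 120 inches = 10 feet: extrapolation gone wrong
```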

Association vs. Causation

Previously we have tried to make it clear that no matter how strong the correlation between two variables is, correlation cannot show that one variable causes the other. Being able to find a LSR line does not change that.

No matter how strong the correlation, no matter how straight the line, there is no way to conclude from a regression alone that one variable causes the other. Unless you design an experiment, there is no way to be sure that a lurking variable is not the source of the apparent association.