Practice Problems on Correlation & Simple Regression

1. Suppose that, across a sample of stores, the correlation coefficient between beer prices and beer sales is -0.65. What does this number indicate?

(a)  There is almost no variability in beer sales that is unexplained by beer price.

(b)  More beer sales tend to go along with lower beer prices.

(c)  As price increases by $1, beer sales decrease by 65%.

(d)  All of the above are true.

Answer: (b)

The correlation is negative, so (b) is correct: higher sales go with lower prices (basic economics also tells us this!).

Option (a) is false: 0.65² ≈ 0.42, so about 42% of the variability in sales is explained by price, and the remaining 58% is not. (We’ll discuss this interpretation of the squared correlation next week.)

Option (c) is a kind of distorted interpretation of the regression slope, not the correlation coefficient.
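The arithmetic behind option (a) can be checked in a couple of lines. This is only a sketch; the variable names are made up for illustration.

```python
# Squared correlation = share of variability in sales explained by price.
r = -0.65                      # correlation given in the problem
r_squared = r ** 2             # about 0.42
unexplained = 1 - r_squared    # about 0.58

print(f"explained: {r_squared:.0%}, unexplained: {unexplained:.0%}")
```

Note that the sign of r drops out when squaring, so r = +0.65 would leave the same share unexplained.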

2. The purpose of a scatterplot is:

(a)  To test for the significance of association in bivariate data.

(b)  To calculate the correlation coefficient.

(c)  To provide a visual picture of the relationship in bivariate data.

(d)  To determine a confidence interval for the regression slope.

Answer: (c)

The scatterplot is a visual display of bivariate data.

3. The standard error of the sample regression slope tells you:

(a)  Approximately how different the slope coefficient will be in different samples.

(b)  Approximately how large the prediction errors are.

(c)  Approximately how spread out the Y scores are.

(d)  Approximately how much of the variability of Y is explained by X.

Answer: (a)

The standard error of any quantity is a measure of how different that quantity will be in different samples. So (a) is the correct interpretation of the standard error of the sample regression slope.

4. The correlation coefficient describes the ______ between 2 variables.

(a)  strength of curved association

(b)  strength of random association

(c)  strength of linear association

(d)  American Marketing Association

Answer: (c)

The correlation coefficient measures linear (straight-line) association: how close the points in a scatterplot fall to a straight line.

5. R² is a measure used to describe the overall fit of the regression line. Which of the following statements is/are correct about R²?

(a)  In general, the closer R² is to 1, the better the fit of the regression line to the points in the scatterplot.

(b)  R² tells you the proportion of the points in the scatterplot that fall right on the regression line.

(c)  R² will always decrease as you add new observations to your regression.

(d)  All of the above are true statements about R².

Answer: (a)

A larger R² means a closer fit between the points and the regression line, so (a) is correct.

Option (b) is not true: for example, R² could be large even if no points fall right on the regression line, so long as most of the points are close to the line.

Option (c) is also not true: there is no consistent relation between R² and the number of points; new points can either increase or decrease R².

6. A cost accountant is developing a regression model to predict the total cost of producing a batch of circuit boards as a function of the batch size. The independent and dependent variables for this regression would be:

(a)  IV: circuit board DV: batch size

(b)  IV: batch size DV: total cost

(c)  IV: average cost DV: circuit board

(d)  IV: total cost DV: average cost

Answer: (b)

(The next 9 questions are based on the following information.)

Pete Estrian is looking to buy a used Honda Civic. He checks the Internet and finds a huge list of Civics for sale in his area. He selects a random sample of 10 cars, ranging in age from 2 years old to 15 years old. For each car, he enters the age (in years) and the offered sales price (in thousands) into Excel. He runs a regression predicting price from age, and gets the following (edited) output:

ANOVA
              df     SS      MS      F       Significance F
Regression    1      93.5    93.51   117.0   0.000005
Residual      8      6.4     0.80
Total         9      99.9

              Coefficients   Standard Error   t Stat   P-value
Intercept     12.10          0.60             20.2     0.00000004
Age           -0.80          0.07             -10.8    0.000005

7. What is the equation for the regression line?

(a)  Predicted price = $12,100 - $800 * Age

(b)  Predicted price = $12,100 - $600 * Age

(c)  Predicted price = $20,200 - $10,800 * Age

(d)  Predicted price = $12,100 - $70 * Age

Answer: (a)

Slope = -.80 thousands = -$800

Intercept = 12.10 thousands = $12,100

Predicted Price = $12,100 - $800 * Age

8. Car #5 in the sample was 10 years old and cost $4,000. Determine the predicted price and the residual for this car.

(a)  Predicted price = $11,300; residual = -$7,300

(b)  Predicted price = $11,300; residual = $7,300

(c)  Predicted price = $4,100; residual = $100

(d)  Predicted price = $4,100; residual = -$100

Answer: (d)

Predicted price = $12,100 - $800*10 = $4,100

residual = $4,000 - $4,100 = -$100
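The same calculation as a short sketch, working in whole dollars. The function names here are just for illustration.

```python
# Regression line from the output: intercept 12.10 and slope -0.80 (in thousands).
INTERCEPT_DOLLARS = 12_100
SLOPE_DOLLARS_PER_YEAR = -800

def predicted_price(age_years):
    """Price predicted by the regression line for a car of the given age."""
    return INTERCEPT_DOLLARS + SLOPE_DOLLARS_PER_YEAR * age_years

def residual(actual_price, age_years):
    """Residual = actual price minus predicted price."""
    return actual_price - predicted_price(age_years)

# Car #5: 10 years old, offered at $4,000.
print(predicted_price(10), residual(4_000, 10))
```

A negative residual means the car is priced below what the line predicts for its age.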

9. Construct a 95% confidence interval for the drop in price associated with an additional year of age.

(a)  ($639, $961)

(b)  ($667, $933)

(c)  ($749, $851)

(d)  Cannot be determined from the information given

Answer: (a)

This problem is asking for a confidence interval for the slope (or more accurately for a confidence interval for “how negative” the slope is.)

For 95% confidence with df = 8 (the residual df), the t-table gives t = 2.306.

To calculate the interval: slope +/- (t from table) * (standard error of the slope)

-0.80 +/- 2.306 * .07 = -0.80 +/- .161 → (-.961, -.639)

Converting to dollars from thousands, and since the problem asks for the drop in price, we can convert from negatives to positives, and thus we get ($639, $961)
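Here is a minimal sketch of the interval calculation. The critical value 2.306 is typed in from the t-table rather than computed, and the variable names are illustrative.

```python
# 95% confidence interval for the slope, then restated as a "drop in price".
slope = -0.80          # thousands of dollars per year of age
std_error = 0.07       # standard error of the slope from the output
t_critical = 2.306     # t-table value for 95% confidence, df = 8

margin = t_critical * std_error
lower, upper = slope - margin, slope + margin   # interval for the slope itself

# Flip signs and convert thousands to dollars to describe the drop in price.
drop_low = round(-upper * 1000)
drop_high = round(-lower * 1000)
print(f"slope CI: ({lower:.3f}, {upper:.3f}); drop per year: (${drop_low}, ${drop_high})")
```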

10. What is the correlation between Age and Price?

(a)  r= 0.97

(b)  r= -0.97

(c)  r= 0.80

(d)  r= -0.80

Answer: (b)

We can get R² from the ANOVA table, and then use it to determine the correlation.

R² = SSRegression / SSTotal = 93.5 / 99.9 = .936

Taking the square root gives +/- .97. The sign of the correlation must match the sign of the slope, and the slope here is negative.

So the correlation is -0.97.
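As a sketch, the two steps (R² from the sums of squares, then the sign from the slope) look like this. The variable names are illustrative.

```python
import math

# Values read from the regression output.
ss_regression = 93.5
ss_total = 99.9
slope = -0.80

r_squared = ss_regression / ss_total
# The square root only gives the magnitude; the slope supplies the sign.
r = math.copysign(math.sqrt(r_squared), slope)

print(f"R^2 = {r_squared:.3f}, r = {r:.2f}")
```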


(Honda Civics Prices and Ages, continued.)

11. What is the typical difference between the predicted prices (based on the regression line) and the actual prices for these cars?

(a)  about $70

(b)  about $600

(c)  about $890

(d)  about $2,530

Answer: (c)

The question asks how far the points typically fall from the regression line. This is best measured by the SD of the residuals, which is the square root of MSResidual from the ANOVA table: sqrt(.80) = .894 thousand, or about $890.
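A quick sketch of that calculation, starting from the residual SS and df so the link to MSResidual is visible (variable names are illustrative):

```python
import math

# Residual row of the ANOVA table (prices are in thousands of dollars).
ss_residual = 6.4
df_residual = 8

ms_residual = ss_residual / df_residual     # 0.80, matches the MS column
residual_sd = math.sqrt(ms_residual)        # typical prediction error, in thousands

print(f"typical prediction error: about ${residual_sd * 1000:.0f}")
```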

12. What does the p-value of 0.000005 tell us?

(a)  It is not very plausible that the population regression line relating Price to Age is flat.

(b)  There is strong evidence that the slope of the population regression line is not 0.

(c)  There is strong evidence that knowing a Civic’s age would improve our prediction of its price.

(d)  All of the statements above are implied by the low p-value.

Answer: (d)

The low p-value is strong evidence that the population regression line is not flat (its slope is not 0), and therefore that knowing Age improves our predictions of Price. So (d) is the best answer: all three statements are true.
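A side check on the output: with a single predictor, the F statistic equals the square of the slope's t statistic, which is why "Significance F" and the p-value for Age are both 0.000005. The sketch below uses the rounded table values, so the two numbers agree only approximately.

```python
# Values read from the (rounded) regression output.
ms_regression = 93.51
ms_residual = 0.80
t_stat_age = -10.8

f_from_anova = ms_regression / ms_residual   # close to the reported F of 117.0
f_from_t = t_stat_age ** 2                   # t squared, also close to 117

print(f"F from ANOVA: {f_from_anova:.1f}, t squared: {f_from_t:.1f}")
```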

13. The average age of the cars in the sample is 7.1 years. What is the average price of the cars in the sample?

(a)  $5,000

(b)  $6,420

(c)  $7,190

(d)  Cannot be determined from the information given.

Answer: (b)

The regression line always goes through the middle of the scatterplot, through the point defined by the mean of X and the mean of Y. So if we plug the mean of X into the regression equation, we’ll get the mean of Y:

Average price = $12,100 - $800 * average age = $12,100 - $800 * 7.1 = $6,420
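To see why this works, here is a small sketch that fits a least-squares line by hand and confirms it passes through the point (mean of X, mean of Y). The ages and prices below are hypothetical, made up for illustration; they are not Pete's actual data.

```python
# Hypothetical data (NOT the sample from the problem): ages in years, prices in thousands.
ages = [2, 3, 4, 5, 6, 7, 8, 9, 12, 15]
prices = [10.5, 9.8, 9.0, 8.3, 7.0, 6.5, 5.9, 4.6, 2.8, 0.9]

n = len(ages)
mean_age = sum(ages) / n
mean_price = sum(prices) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_age) * (y - mean_price) for x, y in zip(ages, prices))
sxx = sum((x - mean_age) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_price - slope * mean_age

# Plugging in the mean age returns the mean price.
prediction_at_mean = intercept + slope * mean_age
print(round(prediction_at_mean, 6), round(mean_price, 6))

# The same idea applied to the problem's own numbers (in dollars).
average_price = round(12_100 - 800 * 7.1)
print(average_price)
```

The intercept of a least-squares line is defined as mean(Y) - slope * mean(X), so the line has to go through the point of means.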

14. What is the best conclusion we can draw about Honda Civics that are 5 years old, based on the information we have?

(a)  We conclude that the average price of 5-year old Civics is about $8,100, but we expect to see some differences in prices for different 5-year Civics.

(b)  We conclude that all 5-year-old Civics should cost the same, about $8,100.

(c)  We conclude that all 5-year-old Civics should cost more than all 6-year-old Civics, although we can’t be completely sure by how much.

(d)  All of the above are equally valid conclusions.

Answer: (a)

The regression line tells us the average price for each level of Age. Thus, (a) is correct, because the average price for Age=5 is $12,100 - $800 * 5 = $8,100. But there’s still variability of the points around the regression line. In other words, the prices of 5-year-old Civics will vary around the average of $8,100.

Options (b) and (c) are not correct because of this variability around the regression line. The regression line tells us about average prices (for cars of different Ages), not about the exact prices of ALL cars of any particular Age.

15. Suppose instead that Pete had taken a second sample, consisting of 6 cars that ranged in age from 4 to 8 years old, and suppose that he regressed Price on Age for this second sample. How would the standard error of the regression slope be different for this second sample (compared to the first sample described above)?

(a)  The standard error of the regression slope would probably be larger for the second sample.

(b)  The standard error of the regression slope would probably be about the same for both samples.

(c)  The standard error of the regression slope would probably be smaller for the second sample.

Answer: (a)

The second sample is smaller: 6 cars versus 10 cars.

The second sample involves less spread out X-values than the first sample (X ranging from 4 to 8 rather than from 2 to 15).

The standard error of the regression slope gets larger as the sample size gets smaller, and as the SD of the X’s gets smaller. (Remember, more data points are better, and wider data points are better – they lead to smaller standard errors and thus more precise estimates.) So the standard error of the slope in the second sample will be larger than before. The second sample wouldn’t estimate the population regression slope as accurately as the first sample.
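Both effects show up in the usual formula for the standard error of the slope: (SD of the residuals) / sqrt(sum of squared deviations of X from its mean). The sketch below uses hypothetical ages for both samples (the actual ages are not given) and assumes, for comparison only, that the residual SD is the same in both.

```python
import math

def slope_standard_error(residual_sd, x_values):
    """SE of the slope = residual SD / sqrt(sum of squared deviations of X)."""
    x_mean = sum(x_values) / len(x_values)
    sxx = sum((x - x_mean) ** 2 for x in x_values)
    return residual_sd / math.sqrt(sxx)

# Assumption: same residual SD in both samples (sqrt of MSResidual = 0.80).
residual_sd = math.sqrt(0.80)

# Hypothetical ages: 10 cars from 2 to 15 years, and 6 cars from 4 to 8 years.
first_sample_ages = [2, 3, 4, 5, 6, 7, 8, 9, 12, 15]
second_sample_ages = [4, 5, 6, 6, 7, 8]

se_first = slope_standard_error(residual_sd, first_sample_ages)
se_second = slope_standard_error(residual_sd, second_sample_ages)
print(f"first sample SE: {se_first:.3f}, second sample SE: {se_second:.3f}")
```

With fewer cars and a narrower range of ages, the sum of squared deviations shrinks, so the standard error of the slope grows.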


(The next 5 questions deal with the following information.)

Below is partial output from a regression predicting consumption of beef (called BeefConsumption, and measured in pounds of beef per person annually) from the price of beef (called BeefPrice, and measured in cents per pound). [The data are from the United States from 1925 to 1941. During this period, the price of beef ranged from about 55 to 80 cents per pound.]

ANOVA
              df     SS       MS       F       Significance F
Regression    1      166.0    166.0    19.6    0.0005
Residual      15     127.2    8.5
Total         16     293.1

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     85.24          7.30             11.67    6.3E-09   69.68       100.80
BeefPrice     -0.47          0.11             -4.42    0.0005    -0.69       -0.24

16. In 1941, beef cost 56 cents per pound, and annual consumption of beef was 60.0 pounds per person. Determine the predicted consumption of beef in 1941, and say whether the data point for 1941 is above or below the regression line.

(a)  Predicted consumption of beef = 57.0; data point is above the regression line

(b)  Predicted consumption of beef = 57.0; data point is below the regression line

(c)  Predicted consumption of beef = 58.9; data point is above the regression line

(d)  Predicted consumption of beef = 58.9; data point is below the regression line

Answer: (c)

Predicted consumption of beef = 85.24 - .47 * 56 = 58.9

Actual consumption is 60.0, which is above the prediction of 58.9. So the data point is above the regression line (in other words, the residual is positive).
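The above/below check comes down to the sign of the residual, as in this short sketch (variable names are illustrative):

```python
# Coefficients from the beef regression output.
intercept = 85.24        # pounds per person per year
slope = -0.47            # pounds per person per 1-cent increase in price

price_1941 = 56          # cents per pound
actual_1941 = 60.0       # pounds per person

predicted_1941 = intercept + slope * price_1941
residual_1941 = actual_1941 - predicted_1941

# Positive residual: the point sits above the line; negative: below.
position = "above" if residual_1941 > 0 else "below"
print(f"predicted: {predicted_1941:.1f}, residual: {residual_1941:.1f}, point is {position} the line")
```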

17. What is the best interpretation of the y-intercept (85.24) for this regression?

(a)  When beef was free in 1941, everyone consumed about 85 pounds of beef per year.

(b)  The intercept doesn’t tell us much, because 0 isn’t a reasonable value to plug in for the price of beef.

(c)  The intercept doesn’t tell us much, because the p-value is too low.

(d)  Both (a) and (c).

Answer: (b)

The y-intercept tells us the prediction for Y (BeefConsumption) when X (BeefPrice) is 0. But because beef was never free in the sample (the lowest price of 55 cents isn’t even close to 0), we don’t want to extrapolate the regression line by plugging in 0 for BeefPrice.

18. Determine the correlation between BeefPrice and BeefConsumption.

(a)  r = +0.75

(b)  r = -0.75

(c)  r = +0.47

(d)  r = -0.47

Answer: (b)

R² = r² = SSRegression / SSTotal = 166.0 / 293.1 = .566