Chapter 10 - Simple Linear Regression and Correlation


(The template for this chapter is: Simple Regression.xls.)

10-1. A statistical model is a set of mathematical formulas and assumptions that describe some real-world situation.

10-2. Steps in statistical model building: 1) Hypothesize a statistical model; 2) Estimate the model parameters; 3) Test the validity of the model; and 4) Use the model.

10-3. Assumptions of the simple linear regression model: 1) A straight-line relationship between X and Y; 2) The values of X are fixed (nonrandom); 3) The regression errors, e, are identically normally distributed random variables with mean zero, uncorrelated with each other through time.

10-4. β0 is the Y-intercept of the regression line, and β1 is the slope of the line.

10-5. The conditional mean of Y, E(Y | X), is the population regression line.

10-6. The regression model is used for understanding the relationship between the two variables, X and Y; for prediction of Y for given values of X; and for possible control of the variable Y, using the variable X.

10-7. The error term captures the randomness in the process. Since X is assumed nonrandom, the addition of e makes the result (Y) a random variable. The error term captures the effects on Y of a host of unknown random components not accounted for by the simple linear regression model.

10-8. The equation Y = β1X + e represents a simple linear regression model without an intercept (constant) term.

10-9. The least-squares procedure produces the best estimated regression line in the sense that the line lies “inside” the data set. The line is the best unbiased linear estimator of the true regression line, as the estimators b0 and b1 have the smallest variance of all linear unbiased estimators of the line parameters. The least-squares line is obtained by minimizing the sum of the squared deviations of the data points from the line.

10-10. Least squares is less useful when outliers exist. Outliers tend to have a greater influence on the determination of the estimators of the line parameters because the procedure is based on minimizing the squared distances from the line. Since outliers have large squared distances, they exert undue influence on the line. A more robust procedure may be appropriate when outliers exist, as the sketch below illustrates.
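A minimal sketch of this sensitivity (hypothetical data, assuming Python with NumPy): moving a single point far off the line shifts the least-squares slope substantially, because that point's squared distance dominates the criterion.

    import numpy as np

    x = np.arange(10, dtype=float)
    y = 2.0 * x + 1.0                       # points lying exactly on y = 1 + 2x

    slope_clean, _ = np.polyfit(x, y, 1)    # polyfit returns [slope, intercept]

    y_out = y.copy()
    y_out[-1] = 100.0                       # one extreme outlier at x = 9
    slope_out, _ = np.polyfit(x, y_out, 1)

    print(slope_clean)                      # 2.0
    print(slope_out)                        # about 6.4: one point moved the slope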


10-11. (Template: Simple Regression.xls, sheet: Regression)

Simple Regression: Income (X) and Wealth (Y)

 #   X     Y     Error   Quantile      Z
 1   1    17.3    0.80     0.667    0.431
 2   2    23.6   -3.02     0.167   -0.967
 3   3    40.2    3.46     0.833    0.967
 4   4    45.8   -1.06     0.333   -0.431
 5   5    56.8   -0.18     0.500    0.000

95% C.I. for the slope b1:      10.12 ± 2.77974
95% C.I. for the intercept b0:   6.38 ± 9.21937

Regression Equation: Wealth Growth = 6.38 + 10.12 Income Quantile
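As a check on the template's output, a short sketch (assuming Python with NumPy) that applies the least-squares formulas b1 = SSXY/SSX and b0 = ȳ - b1·x̄ to the data in the table above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([17.3, 23.6, 40.2, 45.8, 56.8])

    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))   # sum of cross deviations
    ss_x = np.sum((x - x.mean()) ** 2)                # sum of squared X deviations

    b1 = ss_xy / ss_x               # 10.12, the slope in the template
    b0 = y.mean() - b1 * x.mean()   # 6.38, the intercept in the template
    print(b0, b1)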

10-12. b1 = SSXY/SSX = 934.49/765.98 = 1.22

10-13. (Template: Simple Regression.xls, sheet: Regression)

Thus, b0 = -3.057 and b1 = 0.187.

r² = 0.9217 (coefficient of determination); r = 0.9601 (coefficient of correlation)

95% C.I. for the slope b1:       0.18663 ± 0.03609    s(b1) = 0.0164
95% C.I. for the intercept b0:  -3.05658 ± 2.1372     s(b0) = 0.97102
95% P.I. for Y given X = 10:    -1.19025 ± 2.8317     s = 0.99538 (standard error of prediction)

ANOVA Table
Source    SS        df    MS        F         F critical    p-value
Regn.     128.332    1    128.332   129.525   4.84434       0.0000
Error     10.8987   11    0.99079
Total     139.231   12


10-14. b1 = SSXY/SSX = 2.11

b0 = ȳ - b1x̄ = 165.3 - (2.11)(88.9) = -22.279

10-15.

Simple Regression: Inflation (X) and Return (Y)

 #     X       Y     Error
 1    1.00    -3    -20.0642
 2    2.00    36     17.9677
 3   12.60    12    -16.294
 4  -10.30    -8    -14.1247
 5    0.51    53     36.4102
 6    2.03    -2    -20.0613
 7   -1.80    18      3.64648
 8    5.79    32     10.2987
 9    5.87    24      2.22121

Inflation & return on stocks:
r² = 0.0873 (coefficient of determination); r = 0.2955 (coefficient of correlation)

95% C.I. for the slope b1:      0.96809 ± 2.7972     s(b1) = 1.18294
95% C.I. for the intercept b0:  16.0961 ± 17.3299    s(b0) = 7.32883
s = 20.8493 (standard error of prediction)

ANOVA Table
Source    SS        df    MS        F         F critical    p-value
Regn.     291.134    1    291.134   0.66974   5.59146       0.4401
Error     3042.87    7    434.695
Total     3334       8

There is a weak linear relationship (r = 0.2955), and the regression is not significant (r² = 0.0873; F = 0.67 with a p-value of 0.44).
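The r and p-value above can be reproduced outside the template; a sketch assuming Python with SciPy, using the inflation/return data from the table:

    from scipy.stats import linregress

    inflation = [1, 2, 12.6, -10.3, 0.51, 2.03, -1.8, 5.79, 5.87]
    returns = [-3, 36, 12, -8, 53, -2, 18, 32, 24]

    res = linregress(inflation, returns)
    print(res.intercept, res.slope)   # about 16.10 and 0.968
    print(res.rvalue, res.pvalue)     # r about 0.2955, p about 0.44: not significant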

10-16.

Simple Regression: Year (X) and Value (Y)

 #     X        Y       Error
 1   1960    180000     84000
 2   1970     40000    -72000
 3   1980     60000    -68000
 4   1990    160000     16000
 5   2000    200000     40000

Average value of Aston Martin:
r² = 0.1203 (coefficient of determination); r = 0.3468 (coefficient of correlation)

95% C.I. for the slope b1:      1600 ± 7949.76        s(b1) = 2498
95% C.I. for the intercept b0:  -3040000 ± 1.6E+07    s(b0) = 4946165
s = 78993.7 (standard error of prediction)

ANOVA Table
Source    SS        df    MS        F         F critical    p-value
Regn.     2.6E+09    1    2.6E+09   0.41026   10.128        0.5674
Error     1.9E+10    3    6.2E+09
Total     2.1E+10    4


There is a weak linear relationship (r = 0.3468), and the regression is not significant (r² = 0.1203; F = 0.41 with a p-value of 0.57).

Limitations: sample size is very small.

Hidden variables: the 1970s and 1980s cars are valued differently from those of other decades, possibly because of a different model or style.

10-17. The regression equation is:

Credit Card Transactions = 177.641 + 0.6202 Debit Card Transactions

r² = 0.9624 (coefficient of determination); r = 0.9810 (coefficient of correlation)

95% C.I. for the slope b1:      0.6202 ± 0.17018     s(b1) = 0.06129
95% C.I. for the intercept b0:  177.641 ± 110.147    s(b0) = 39.6717
s = 56.9747 (standard error of prediction)

ANOVA Table
Source    SS        df    MS        F         F critical    p-value
Regn.     332366     1    332366    102.389   7.70865       0.0005
Error     12984.5    4    3246.12
Total     345351     5

There is no implication of causality. A third-variable influence could be “increases in per capita income” or “GDP growth”.
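The p-values in these ANOVA tables come from the F distribution with 1 and n - 2 degrees of freedom; a sketch (assuming SciPy) reproducing the 10-17 entry:

    from scipy.stats import f

    F = 332366 / 3246.12        # MSR / MSE = 102.389
    p_value = f.sf(F, 1, 4)     # upper-tail area of F(1, 4); about 0.0005
    print(F, p_value)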

10-18. SSE = Σ(yi - b0 - b1xi)². Take partial derivatives with respect to b0 and b1:

∂SSE/∂b0 = -2Σ(yi - b0 - b1xi)

∂SSE/∂b1 = -2Σxi(yi - b0 - b1xi)

Setting the two partial derivatives to zero and simplifying, we get:

Σ(yi - b0 - b1xi) = 0 and Σxi(yi - b0 - b1xi) = 0. Expanding, we get:

Σyi - nb0 - b1Σxi = 0 and Σxiyi - b0Σxi - b1Σxi² = 0

Solving the above two equations simultaneously for b0 and b1 gives the required results.
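The same derivation can be checked symbolically; a sketch assuming SymPy, with three generic data points:

    import sympy as sp

    b0, b1 = sp.symbols('b0 b1')
    xs = sp.symbols('x0:3')   # three generic data points
    ys = sp.symbols('y0:3')

    # SSE as a function of b0 and b1
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))

    # Set both partial derivatives to zero and solve the normal equations
    sol = sp.solve([sp.diff(sse, b0), sp.diff(sse, b1)], [b0, b1])
    print(sp.simplify(sol[b1]))   # should simplify to SSXY/SSX for these points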

10-19. 99% C.I. for β1: 1.25533 ± 2.807(0.04972) = [1.1158, 1.3949].

The confidence interval does not contain zero, so the slope is significantly different from zero at the 0.01 level.

10-20. MSE = 7.629

From the ANOVA table for Problem 10-11:

ANOVA Table
Source    SS        df    MS
Regn.     1024.14    1    1024.14
Error     22.888     3    7.62933
Total     1047.03    4

10-21. From the regression results for problem 10-11

s(b0) = 2.89694 (standard error of intercept)
s(b1) = 0.87346 (standard error of slope)

10-22. From the regression results for problem 10-11

95% C.I. for the slope b1:      10.12 ± 2.77974
95% C.I. for the intercept b0:   6.38 ± 9.21937

95% C.I. for the slope: 10.12 ± 2.77974 = [7.34026, 12.89974]

95% C.I. for the intercept: 6.38 ± 9.21937 = [-2.83937, 15.59937]


10-23. s(b0) = 0.971, s(b1) = 0.016; the estimate of the error variance is MSE = 0.991. 95% C.I. for β1: 0.187 ± 2.201(0.016) = [0.1518, 0.2222]. Zero is not a plausible value at α = 0.05.

95% C.I. for the slope b1:       0.18663 ± 0.03609    s(b1) = 0.0164
95% C.I. for the intercept b0:  -3.05658 ± 2.1372     s(b0) = 0.97102

10-24. s(b0) = 85.44, s(b1) = 0.1534

Estimate of the regression variance is MSE = 8122

95% C.I. for b1: 1.5518 ± 2.776 (0.1534) = [1.126, 1.978]

Zero is not in the range.

95% C.I. for the slope b1:      1.55176 ± 0.42578    s(b1) = 0.15336
95% C.I. for the intercept b0:  -255.943 ± 237.219   s(b0) = 85.4395

10-25. s² gives us information about the variation of the data points about the computed regression line.

10-26. In correlation analysis, the two variables, X and Y, are viewed in a symmetric way: neither one is designated “dependent” and the other “independent,” as is the case in regression analysis. In correlation analysis we are interested in the relation between two random variables, both assumed normally distributed.

10-27. From the regression results for problem 10-11:

r = 0.9890 (coefficient of correlation)

10-28. r = 0.9601 (coefficient of correlation)


10-29. t(5) = r√(n - 2)/√(1 - r²) = 0.640

Accept H0. The two variables are not linearly correlated.

10-30. Yes. For example, suppose n = 5 and r = 0.51; then:

t = r√(n - 2)/√(1 - r²) = 0.51√3/√(1 - 0.26) = 1.02, and we do not reject H0. But if we take n = 10,000 and r = 0.04, we get t = 0.04√9998/√(1 - 0.0016) ≈ 4.0, which leads to strong rejection of H0.

10-31. We have r = 0.875 and n = 10. Conducting the test:

t(8) = r√(n - 2)/√(1 - r²) = 0.875√8/√(1 - 0.7656) = 5.11

There is statistical evidence of a correlation between the prices of gold and of copper. Limitations: the data are time-series data, hence not independent random samples. Also, the data set contains only 10 points.

10-34. n = 65, r = 0.37: t(63) = r√(n - 2)/√(1 - r²) = 0.37√63/√(1 - 0.1369) = 3.16

Yes. Significant. There is a correlation between the two variables.
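Problems 10-29 through 10-34 all use the same statistic, t = r√(n - 2)/√(1 - r²) with n - 2 degrees of freedom; a small helper in plain Python (a sketch, not part of the template):

    import math

    def corr_t(r: float, n: int) -> float:
        # t statistic for testing H0: rho = 0, with n - 2 degrees of freedom
        return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

    print(corr_t(0.875, 10))   # about 5.11, as in 10-31
    print(corr_t(0.37, 65))    # about 3.16, as in 10-34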

10-35. z' = ½ ln[(1 + r)/(1 - r)] = ½ ln(1.37/0.63) = 0.3884

ζ0 = ½ ln[(1 + ρ0)/(1 - ρ0)] = ½ ln(1.22/0.78) = 0.2237

σz' = 1/√(n - 3) = 1/√62 = 0.127

z = (z' - ζ0)/σz' = (0.3884 - 0.2237)/0.127 = 1.297. Cannot reject H0.
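A sketch of the same Fisher z computation in Python; the inputs r = 0.37, ρ0 = 0.22, and n = 65 are the values implied by the ratios and the 1/√62 term above:

    import math

    def fisher_z(r: float) -> float:
        # Fisher transformation: z' = (1/2) ln((1 + r) / (1 - r))
        return 0.5 * math.log((1 + r) / (1 - r))

    r, rho0, n = 0.37, 0.22, 65   # values implied by the solution above
    sigma = 1 / math.sqrt(n - 3)
    z = (fisher_z(r) - fisher_z(rho0)) / sigma
    print(z)                      # about 1.297; cannot reject H0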

10-36. Using the TINV(α, df) function in Excel, with df = n - 2 = 52: TINV(0.05, 52) = 2.006645

and TINV(0.01, 52) = 2.6737

Reject H0 at 0.05 but not at 0.01. There is evidence of a linear relationship at α = 0.05 only.
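For readers without Excel, the same two-tailed critical values can be obtained from SciPy's inverse t CDF; TINV(α, df) corresponds to t.ppf(1 - α/2, df):

    from scipy.stats import t

    print(t.ppf(1 - 0.05 / 2, 52))   # about 2.0066, matching TINV(0.05, 52)
    print(t.ppf(1 - 0.01 / 2, 52))   # about 2.6737, matching TINV(0.01, 52)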

10-37. t(16) = b1/s(b1) = 3.1/2.89 = 1.0727.

Do not reject H0. There is no evidence of a linear relationship at any α.

10-38. Using the regression results for problem 10-11:

critical value of t is: t(0.05, 3) = 3.182

computed value of t is: t = b1/s(b1) = 10.12 / 0.87346 = 11.586

Reject H0. There is strong evidence of a linear relationship.


10-39. t (11) = b1/s(b1) = 0.187/0.016 = 11.69

Reject H0. There is strong evidence of a linear relationship between the two variables.

10-40. b1/s(b1) = 1600/2498 = 0.641

Do not reject H0. There is no evidence of a linear relationship.

10-41. t (58) = b1/s(b1) = 1.24/0.21 = 5.90

Yes, there is evidence of a linear relationship.

10-42. Using the Excel function TDIST(x, df, #tails) to estimate the p-value for the t-test results, where x = 1.51, df = 585692 - 2 = 585690, and #tails = 2 for a two-tailed test:

TDIST(1.51, 585690, 2) = 0.131

The corresponding p-value is 0.131. The regression is not significant even at the 0.10 level of significance.
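The SciPy equivalent of Excel's two-tailed TDIST(x, df, 2) is twice the upper-tail area of the t distribution:

    from scipy.stats import t

    print(2 * t.sf(1.51, 585690))   # about 0.131, matching TDIST(1.51, 585690, 2)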

10-43. t(211) ≈ z = b1/s(b1) = 0.68/12.03 = 0.0565

Do not reject H0. There is no evidence of a linear relationship at any α. (Why report such results?)

10-44. b1 = 5.49, s(b1) = 1.21, t(26) = 4.537

Yes, there is evidence of a linear relationship.

10-45.  The coefficient of determination indicates that 9% of the variation in customer satisfaction can be explained by the changes in a customer’s materialism measurement.