Regression Analysis - Lecture Notes by Dr. C. Christopher Lee

Overview of Regression Model (regression equation):

Y = B0 + B1 * X1
Where Y = dependent variable
X1 = independent variable (factor, determinant, field, attribute)
  • How does Regression Analysis build the model (regression equation)?
    = How does Regression Analysis determine B0 and B1?
     Regression Analysis uses calculus to find the B0 and B1 that minimize the sum of squared residuals (the squared unexplained variation).
  • What is a good statistical measure of how reliable the regression model is?
     R2 (Coefficient of Determination)
  • Suppose R2 = 0.971. What does it mean? Interpret R2.
     This regression model accounts for 97.1% of the variation in Y (DV).
  • The higher R2, the more reliable the regression model.
  • How do we know that the regression model is reliable enough?
    = How do we know that the regression model is statistically significant?
    We conduct Hypothesis Testing to answer this question.
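
The calculus step described above has a closed-form result for one independent variable: B1 = Sxy ÷ Sxx and B0 = mean(Y) - B1 × mean(X). The sketch below applies those formulas to the energy cost sample used in these notes. The function name fit_simple_ols is an illustrative choice, not part of SPSS or any library.

```python
# Energy cost sample from these notes: X = year index (1..15), Y = energy cost.
x = list(range(1, 16))
y = [15355.38, 15412.91, 15926.64, 16614.18, 16918.69, 16837.14, 16812.51,
     17102.45, 17461.89, 17846.76, 18187.93, 18782.19, 18863.18, 18914.00,
     19319.15]


def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sxy and Sxx come from setting the derivatives of the squared-residual sum to zero.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


b0, b1 = fit_simple_ols(x, y)
print(round(b0, 2), round(b1, 2))  # 15111.74 280.66
```

These values agree with the B column of the SPSS Coefficients table (15111.743 and 280.657).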

Hypothesis Testing
  • Hypothesis Testing:
  • Set up the hypotheses:
  • H0 (null hypothesis): There exists no statistical significance.
  • H1 (research hypothesis): There exists statistical significance.
  • Compute the sample statistic and find the p-value. Statistically, the p-value is the probability, assuming H0 is true, of obtaining a sample statistic at least as extreme as the computed value; for an upper-tail test, P(X ≥ Sample Statistic).
  • Compare the sample statistic (computed value) with the critical (table) value for the chosen α, or equivalently compare the p-value with α.
  • Conclude which hypothesis (H0 or H1) is supported.
    = Which hypothesis does the evidence support?
    = Which hypothesis do the research findings support?
  • α value = level of significance
    = 1 - Confidence Coefficient (C.C.)
    = Probability of Rejection Area

Suppose you demand 95% statistical confidence on the hypothesis testing:
C.C. = 0.95. Therefore, α = 1 - 0.95 = 0.05.

  • How do we make a decision on the hypothesis testing?
    = Which hypothesis is true? Null or Research Hypothesis?

If the sample statistic (computed value) falls into the rejection area, we reject H0.
When the sample statistic falls into the rejection area, p-value < α.
Thus, we simply follow the Decision Rule for Hypothesis Testing:

DECISION RULE: If p-value < α, reject H0 and accept H1.

In other words, if the p-value is less than the α value, we find statistical significance. We conclude that the regression model is reliable and that the R2 is reliable. We can use the regression model to estimate the population (the future, in the forecasting business).
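
The decision rule is a single comparison, so it can be sketched in a few lines. The function name decide and the p-values passed to it are hypothetical, chosen only to show both outcomes.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 when p-value < alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical p-values, for illustration only.
print(decide(0.0004, alpha=0.001))  # reject H0
print(decide(0.03, alpha=0.05))     # reject H0
print(decide(0.03, alpha=0.001))    # fail to reject H0
```

The last two calls show that the same p-value can lead to different conclusions depending on the α chosen before testing.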

  • Hypothesis Testing for Simple Regression Model:

H0 (null hypothesis): β1 = 0, where β1 is the coefficient (population slope) for X1
There exists no relationship between X1 and Y.
The model is not reliable.
The model is statistically insignificant.

H1 (research hypothesis): β1 ≠ 0
There exists a significant relationship between X1 and Y.
The model is reliable.
The model is statistically significant.

  • Class Example - Energy Cost Model:
  • From the SPSS output, p-value = 0.000 (the Sig. column of the ANOVA table).
  • 0.000 (p-value) < 0.001 (α value). Thus, we reject H0.
  • Evidence shows that the regression model (Y = 15111.74 + 280.66 X) is statistically significant (p<0.001).
  • In conclusion, chances are 999 out of 1000 that the regression model is reliable
  • Practical meaning: It is very safe to use this equation to analyze the population (in this case, predict the future).
  • Question: Here, we used α=0.001 instead of α = 0.05. Why?
  • Answer: The lower the α value, the higher the confidence coefficient, and therefore the higher the statistical confidence. Here, the p-value is extremely low, so we can afford the low α value and still reject H0 (null hypothesis).

Sample Data: Energy Cost by Year where DV = Energy Cost, IV = Year
Year / X / Y
1998 / 1 / 15355.38
1999 / 2 / 15412.91
2000 / 3 / 15926.64
2001 / 4 / 16614.18
2002 / 5 / 16918.69
2003 / 6 / 16837.14
2004 / 7 / 16812.51
2005 / 8 / 17102.45
2006 / 9 / 17461.89
2007 / 10 / 17846.76
2008 / 11 / 18187.93
2009 / 12 / 18782.19
2010 / 13 / 18863.18
2011 / 14 / 18914.00
2012 / 15 / 19319.15
SPSS Regression Output: DV = Energy Cost, IV = Year

Variables Entered/Removed (b)
Model / Variables Entered / Variables Removed / Method
1 / Year (a) / . / Enter
a. All requested variables entered.
b. Dependent Variable: Energy Cost

Model Summary
Model / R / R Square / Adjusted R Square / Std. Error of the Estimate
1 / .985 (a) / .971 / .968 / 226.59456
a. Predictors: (Constant), Year

ANOVA (b)
Model / Sum of Squares / df / Mean Square / F / Sig.
1 Regression / 2.206E7 / 1 / 2.206E7 / 429.548 / .000 (a)
1 Residual / 667486.231 / 13 / 51345.095
1 Total / 2.272E7 / 14
a. Predictors: (Constant), Year
b. Dependent Variable: Energy Cost

Coefficients (a)
Model / B (Unstandardized) / Std. Error (Unstandardized) / Beta (Standardized) / t / Sig.
1 (Constant) / 15111.743 / 123.122 / (blank) / 122.738 / .000
1 Year / 280.657 / 13.542 / .985 / 20.726 / .000
a. Dependent Variable: Energy Cost

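
The Mean Square, F, and Std. Error of the Estimate values in the output follow from the sums of squares and degrees of freedom by plain arithmetic. The sketch below redoes that arithmetic as a check. The regression sum of squares is entered at full precision (22055166.53, from the sums-of-squares table in these notes) because the ANOVA table shows it only as 2.206E7. All variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
regression_ss, regression_df = 22055166.53, 1   # full-precision value of 2.206E7
residual_ss, residual_df = 667486.231, 13

# Mean Square = Sum of Squares / df
msr = regression_ss / regression_df
mse = residual_ss / residual_df

# F compares explained variation per df with unexplained variation per df.
f_stat = msr / mse

# Std. Error of the Estimate is the square root of the residual mean square.
se_estimate = math.sqrt(mse)

# With one predictor, the slope's t statistic squared equals F (up to rounding).
t_slope = 280.657 / 13.542

print(round(mse, 3), round(f_stat, 2), round(se_estimate, 3), round(t_slope ** 2, 1))
```

The printed values should line up with the Mean Square (51345.095), F (429.548), and Std. Error of the Estimate (226.59456) entries, with small differences due to the rounded inputs.
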
Forecasting using Regression Analysis
  • Point Forecast vs. Interval Forecasts (Estimates):
  • Forecasting with Point Estimate (Forecast) - risky, because it ignores sampling error
  • Plug a value in the independent variable; compute the regression equation.
  • Forecasting with Interval Estimate (Forecast) - safe move
  • Upper limit = Point estimate + SAMPLING ERROR
  • Lower limit = Point estimate - SAMPLING ERROR
  • SAMPLING ERROR = [Table value of Z] x [Standard Error of Y estimate (SE)]
    = Z x SE
  • Z (= table value of Z = critical value of Z) comes from your own choice of α value or confidence level:
  • Example: At α = 0.05, CC = 0.95; Z = 1.96 from the Z table.
  • Task: Make a forecast for Year = 20, with 95% statistical confidence.
  • CC = 0.95 Thus, α = 1 - 0.95 = 0.05; therefore, Z = 1.96
  • Point estimate: Y = 15111.74 + 280.66 x (20) = 20724.94
  • Interval estimate:
  • SAMPLING ERROR = Z x SE
    = 1.96 x 226.59
    = 444.13
    where (1) Z = 1.96 based on α = 0.05
    (2) 226.59 is the Std. Error of the Estimate from the regression output (Model Summary).
  • Upper limit = Point estimate + ERROR
    = 20724.94 + 444.13
    = 21169.07
  • Lower limit = Point estimate - ERROR
    = 20724.94 - 444.13
    = 20280.81
  • Results: 20280.81 ≤ Yf ≤ 21169.07
  • Interpretation: Chances are 19 of 20 that the energy cost at Year 20 will fall into a range between 20280.81 and 21169.07.
  • “19 of 20” comes from 95% statistical confidence; 0.95 = 95/100 = 19/20
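
The forecast steps above can be sketched as one small function. It follows the Z × SE formula used in these notes, which is a simplified interval; the function name interval_forecast is illustrative. Z is computed with Python's standard-library NormalDist instead of being read from a table, so it is 1.959964 rather than the rounded 1.96, and the limits can differ from the hand calculation in the second decimal place.

```python
from statistics import NormalDist


def interval_forecast(x_new, b0, b1, se, confidence=0.95):
    """Return (lower, point, upper) using point estimate +/- Z * SE."""
    alpha = 1.0 - confidence
    # Two-tailed critical value: alpha / 2 in each tail.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    point = b0 + b1 * x_new
    sampling_error = z * se
    return point - sampling_error, point, point + sampling_error


# Coefficients and Std. Error of the Estimate from the regression output.
lower, point, upper = interval_forecast(20, 15111.74, 280.66, 226.59456)
print(round(lower, 2), round(point, 2), round(upper, 2))
```

Raising the confidence argument (for example to 0.99) widens the interval, which is the price of higher statistical confidence.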

R2 - Coefficient of Determination - Model Reliability

R2 = Explained Variation ÷ Total Variation
   = 1 - (Unexplained Variation ÷ Total Variation)

As shown in the equation, if Unexplained Variation is reduced, R2 goes up mathematically.

That is why Regression Analysis attempts to minimize the total unexplained variation using calculus, which will produce the highest possible R2 from the sample data.

If R2 is closer to 1, the model is more reliable.

0 ≤ R2 ≤ 1

Factory Energy Costs
X / Y / Yf / Y - Yf / (Y - Yf)2 / Yf - Ym / (Yf - Ym)2
Year / Energy Costs / Predicted, Forecast / Residual, Error, Unexplained Variation / Residual2 / Regression, Explained Variation / Regression2
1 / 15355.38 / 15392.40 / -37.02 / 1370.46 / -1964.60 / 3859654.14
2 / 15412.91 / 15673.06 / -260.15 / 67676.42 / -1683.94 / 2835664.27
3 / 15926.64 / 15953.71 / -27.07 / 733.01 / -1403.29 / 1969211.30
4 / 16614.18 / 16234.37 / 379.81 / 144254.66 / -1122.63 / 1260295.23
5 / 16918.69 / 16515.03 / 403.66 / 162942.64 / -841.97 / 708916.07
6 / 16837.14 / 16795.69 / 41.45 / 1718.46 / -561.31 / 315073.81
7 / 16812.51 / 17076.34 / -263.83 / 69607.76 / -280.66 / 78768.45
8 / 17102.45 / 17357.00 / -254.55 / 64795.70 / 0.00 / 0.00
9 / 17461.89 / 17637.66 / -175.77 / 30894.10 / 280.66 / 78768.45
10 / 17846.76 / 17918.31 / -71.55 / 5120.03 / 561.31 / 315073.81
11 / 18187.93 / 18198.97 / -11.04 / 121.92 / 841.97 / 708916.07
12 / 18782.19 / 18479.63 / 302.56 / 91543.33 / 1122.63 / 1260295.23
13 / 18863.18 / 18760.29 / 102.89 / 10587.20 / 1403.29 / 1969211.30
14 / 18914.00 / 19040.94 / -126.94 / 16114.54 / 1683.94 / 2835664.27
15 / 19319.15 / 19321.60 / -2.45 / 6.00 / 1964.60 / 3859654.14
Sum / 260355.00 / (blank) / 0.00 / 667486.23 / (blank) / 22055166.53
Average (Ym) / 17357.00 / MSE w/ df=13 / 51345.0947
Regression SS (Sum of Squares) = Explained Variation / 22055166.53
Residual SS (Sum of Squares) = Unexplained Variation / 667486.23
Total SS = Total Variation / 22722652.76
R Square (R2) = Regression SS ÷ Total SS
= Explained Variation ÷ Total Variation
= 22055166.53 ÷ 22722652.76 = 0.971
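
The table above can be rebuilt from the raw data as a check on the sums of squares and on R2. The sketch below refits the line, then accumulates the unexplained (residual), explained (regression), and total variation. Variable names are illustrative.

```python
x = list(range(1, 16))
y = [15355.38, 15412.91, 15926.64, 16614.18, 16918.69, 16837.14, 16812.51,
     17102.45, 17461.89, 17846.76, 18187.93, 18782.19, 18863.18, 18914.00,
     19319.15]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n  # Ym in the table

# Least-squares slope and intercept.
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean

y_hat = [b0 + b1 * xi for xi in x]  # Yf in the table

residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
regression_ss = sum((fi - y_mean) ** 2 for fi in y_hat)        # explained variation
total_ss = sum((yi - y_mean) ** 2 for yi in y)                 # total variation

r_squared = regression_ss / total_ss
print(round(r_squared, 3))  # 0.971
```

Because explained and unexplained variation add up to total variation, computing 1 - residual_ss / total_ss gives the same R2.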

Note: The basic regression model is also called the OLS model. OLS stands for Ordinary Least Squares.

  • There are many more exotic regression models, which you can learn in advanced Business Analytics courses in graduate school. In contrast, the basic regression model is called the “ordinary” model.
  • “Least” refers to minimization: the basic regression model minimizes the total unexplained variation.
  • “Squares” refers to the sum of squared residuals (= total unexplained variation).
