LINEAR REGRESSION ANALYSIS

Introduction

Regression was first used in 1877 by Francis Galton while studying the relationship between the heights of fathers and sons. His study of about one thousand fathers and sons revealed a very interesting relationship. One would assume that tall fathers tend to have tall sons and short fathers short sons. But the study revealed that the sons of tall fathers tend to be shorter than their fathers, while the sons of short fathers tend to be taller than their fathers. The heights of sons regressed toward the mean. The term "regression" is now used for many sorts of curve fitting.

Linear Regression

Linear regression analyzes the relationship between two variables, X and Y. For each subject (or experimental unit), we know both X and Y and want to find the best straight line through the data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, we use the linear regression line as a standard curve to find new values of X from Y, or Y from X.

Goal

In general, the goal of linear regression is to find the line that best predicts Y from X. Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line. An important point is that linear regression does not test whether the data are linear (except via the runs test). It assumes that the data are linear, and finds the slope and intercept that make a straight line best fit the data.

Regression or Correlation?

Linear regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:

  • We measured both X and Y in each subject and wish to measure how well they are associated.
  • Calculate the Pearson (parametric) correlation coefficient if we can assume that both X and Y are sampled from normally distributed populations.
  • Otherwise calculate the Spearman (nonparametric) correlation coefficient.

Calculate linear regression only if:

  • One of the variables, X, is likely to precede or cause the other variable, Y.
  • Choose linear regression if we manipulated the X variable, e.g. in an experiment. It makes a difference which variable is called X and which is called Y, as linear regression calculations are not symmetrical with respect to X and Y. If we swap the two variables, we will obtain a different regression line.

In contrast, correlation calculations are symmetrical with respect to X and Y. If we swap the labels X and Y, we will still get the same correlation coefficient.
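The asymmetry described above can be checked numerically. The sketch below, using hypothetical data and the standard formulas (Pearson r, and least-squares slope b = Sxy / Sxx), shows that swapping X and Y leaves the correlation coefficient unchanged but produces a different regression slope:

```python
# Sketch: correlation is symmetric in X and Y, regression is not.
# The data are illustrative; the formulas are the standard Pearson r
# and the least-squares slope b = S_xy / S_xx.

def mean(v):
    return sum(v) / len(v)

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5

def slope(x, y):
    mx, my = mean(x), mean(y)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    return s_xy / s_xx

x = [5, 4, 9, 6, 7, 5]
y = [8, 8, 12, 9, 10, 6]

print(pearson_r(x, y) == pearson_r(y, x))  # True: correlation is symmetric
print(slope(x, y), slope(y, x))            # different slopes: regression is not
```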

Regression analysis helps in three important ways:

  • It provides estimates of the value of the dependent variable from values of the independent variable.
  • It helps to obtain a measure of the error involved in using the regression line as a basis for estimation.
  • It helps to obtain a measure of the degree of association or correlation that exists between the two variables.

Difference between correlation and regression analysis

Correlation / Regression
It measures the degree of relationship between X and Y. / It measures the nature of the relationship between the variables.
It does not indicate a cause-and-effect relation between the variables. / It indicates a cause-and-effect relation between the variables.
Example: there is a strong correlation between a rooster's crowing and the rising of the sun, but the rooster does not cause the sun to rise. / Example: there is a cause-and-effect relationship between sales price and demand; if the price decreases, demand increases.

Assumption for simple linear regression model

  • The value of the dependent variable, Y, is dependent in some degree upon the value of the independent variable, X.
  • The average relationship between X and Y can be adequately described by a linear equation Y = a + bX, whose geometric representation is a straight line.
  • Associated with each value of X there is a sub-population of Y.
  • The mean of each sub-population of Y is called the expected value of Y for a given X, written μ(Y|X) = a + bX.
  • An individual value in each sub-population of Y may be expressed as Y = μ(Y|X) + e, where "e" is the deviation of a particular value of Y from μ(Y|X).
  • It is assumed that the variances of all sub-populations, called variances of the regression, are identical.
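The assumptions above can be illustrated by simulation. In this sketch the parameter values a = 2, b = 3 and σ = 1.5 are assumptions chosen for illustration: for each fixed X we draw a sub-population of Y values with mean a + bX and a common standard deviation, and check that the sample mean is close to the expected value.

```python
# Sketch of the model the assumptions describe: for each X there is a
# sub-population of Y with mean a + bX and a common variance.
# a, b and sigma below are illustrative assumptions, not values from the text.
import random

random.seed(42)
a, b, sigma = 2.0, 3.0, 1.5

def draw_subpopulation(x, n=10_000):
    """Draw n values of Y for a fixed X: Y = a + bX + e, with e ~ N(0, sigma)."""
    return [a + b * x + random.gauss(0, sigma) for _ in range(n)]

for x in (1, 2, 3):
    ys = draw_subpopulation(x)
    mean_y = sum(ys) / len(ys)
    # The sample mean of each sub-population should be close to a + bX.
    print(x, round(mean_y, 2), a + b * x)
```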

Uses

  • Regression analysis is widely used in almost all the scientific disciplines. In economics it is the basic technique for measuring the relationship among economic variables that constitute the essence of economic theory and economic life.
  • Sales forecasting based on different variables like advertising, technology, etc.

Linear Regression Analysis

The linear equation is Y = a + bX

Here, Y = dependent variable

a = Y intercept / fixed value

b = Slope

X = independent variable

Least square method

This method mathematically determines the best fitting regression line for the observed data. It assumes that scatter of points around the best-fit line has the same standard deviation all along the curve. The assumption is violated if the points with high or low X values tend to be further from the best-fit line. The assumption that the standard deviation is the same everywhere is termed homoscedasticity.

The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X. More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line. Why minimize the sum of the squares of the distances? Why not simply minimize the sum of the actual distances? From the following data, a best-fit line is drawn:

x / y
5 / 8
4 / 8
9 / 12
6 / 9
7 / 10
5 / 6

If we take the sum of the deviations or absolute deviations, it does not stress the magnitude of the error. So, if we want to penalize large absolute errors so that we can avoid them, we can accomplish this by squaring the individual errors before we add them. Squaring each term accomplishes two goals:

  • It magnifies or penalizes the larger errors.
  • It prevents positive and negative errors from cancelling each other out.

As we are looking for the estimating line that minimizes the sum of the squares of the errors, this is called the least squares method.
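The least squares method can be sketched in a few lines of code. The function below applies the standard formulas b = Sxy / Sxx and a = ȳ − b·x̄ to the x/y table above; for that data the fitted slope works out to 1 and the intercept to 17/6 ≈ 2.8333:

```python
# A minimal least-squares fit of the x/y table above, using the
# standard formulas b = S_xy / S_xx and a = y_bar - b * x_bar.

def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

x = [5, 4, 9, 6, 7, 5]
y = [8, 8, 12, 9, 10, 6]
a, b = least_squares(x, y)
print(round(a, 4), round(b, 4))  # fitted line: y = a + b*x
```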

Application of Linear Regression

Example:

The Vice President for research and development of M.M. Ispahani Limited, a large tea marketing company, believes that the firm's annual profits depend on the amount spent on R&D. The new chairman does not agree and has asked for evidence. Here are the data for six years:

Year / R&D expenses (in Lakhs) / Annual Profit (in lakhs)
2005 / 2 / 20
2006 / 3 / 25
2007 / 5 / 34
2008 / 4 / 30
2009 / 11 / 40
2010 / 5 / 31

The vice president wants to establish the relationship and predict the profit for the year 2011 if the R&D expenditure is 9 lakhs.

Solution

Year / X / Y / X² / XY
2005 / 2 / 20 / 4 / 40
2006 / 3 / 25 / 9 / 75
2007 / 5 / 34 / 25 / 170
2008 / 4 / 30 / 16 / 120
2009 / 11 / 40 / 121 / 440
2010 / 5 / 31 / 25 / 155
Total / ∑X = 30 / ∑Y = 180 / ∑X² = 200 / ∑XY = 1000

Now, putting the values into the least-squares formulas:

b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²) = (6 × 1000 − 30 × 180) / (6 × 200 − 30²) = 600 / 300 = 2

a = (∑Y − b∑X) / n = (180 − 2 × 30) / 6 = 20

so, Y = 20 + 2X

Now, to forecast the profit for the year 2011 if the R&D expenditure is 9 lakhs:

Y = 20 + 2 (9)

= 38 lakhs

So, it is expected that about 38 lakh taka of profit will come if the R&D expense is 9 lakhs for the year 2011.
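The worked example above can be verified with a short script. Because all the inputs are whole numbers, the same normal-equation formulas reproduce a = 20, b = 2 and the prediction of 38 exactly:

```python
# Check of the worked example: X = R&D spend (lakhs), Y = annual profit (lakhs).
def least_squares(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

x = [2, 3, 5, 4, 11, 5]
y = [20, 25, 34, 30, 40, 31]
a, b = least_squares(x, y)
print(a, b)       # 20.0 2.0, matching the solution above
print(a + b * 9)  # 38.0, the predicted profit for an R&D spend of 9 lakhs
```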

Demonstration: