Topics in Multiple Regression

In this set of notes we present several extensions of the basic multiple regression model formulated earlier. First we discuss the dummy variable, which deals with categorical (qualitative) data; then we add the interaction term; finally we discuss nonlinear regression models.

Regression with Categorical Data

The dummy variable is a mathematical tool that makes it possible to include non-numerical information in the regression model. This makes the model more useful in decision-making settings, where observations may or may not belong to a certain category.

Example 1
One would like to know the effects on assessed house value of the house size, and of whether or not there is a fireplace in the house. Data from a sample of 15 houses were recorded and are provided below:

Value / Size (1000 ft²) / Fireplace
84.4 / 2 / Yes
77.4 / 1.71 / No
75.7 / 1.45 / No
85.9 / 1.76 / Yes
79.1 / 1.93 / No
70.4 / 1.2 / Yes
75.8 / 1.55 / Yes
85.9 / 1.93 / Yes
78.5 / 1.59 / Yes
79.2 / 1.5 / Yes
86.7 / 1.9 / Yes
79.3 / 1.39 / Yes
74.5 / 1.54 / No
83.8 / 1.89 / Yes
76.8 / 1.59 / No

The independent variable Size is quantitative (1000 ft²), but the variable Fireplace is qualitative (Yes, No). We define a dummy variable "Fireplace" and let it have the value 1 when there is a fireplace and 0 when there is none. The data set becomes:

Value / Size (1000 ft²) / Fireplace
84.4 / 2 / 1
77.4 / 1.71 / 0
75.7 / 1.45 / 0
85.9 / 1.76 / 1
79.1 / 1.93 / 0
70.4 / 1.2 / 1
75.8 / 1.55 / 1
85.9 / 1.93 / 1
78.5 / 1.59 / 1
79.2 / 1.5 / 1
86.7 / 1.9 / 1
79.3 / 1.39 / 1
74.5 / 1.54 / 0
83.8 / 1.89 / 1
76.8 / 1.59 / 0
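
The 0/1 coding can be produced mechanically. Here is a minimal Python sketch (the list literal is simply the Fireplace column from the table above):

```python
# Map the qualitative Fireplace values to a 0/1 dummy variable
fireplace_raw = ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes",
                 "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No"]
fireplace = [1 if v == "Yes" else 0 for v in fireplace_raw]
print(fireplace)  # [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
```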

Now we can formulate a multiple regression model of the form:

Value = β0 + β1Size + β2Fireplace + ε

The house value for a house with a fireplace (Fireplace = 1) is described by the equation Value = β0 + β1Size + β2(1) + ε, which reduces to Value = (β0 + β2) + β1Size + ε, while the house value for a house without a fireplace (Fireplace = 0) is described by Value = β0 + β1Size + β2(0) + ε, which reduces to Value = β0 + β1Size + ε. Comparing the two equations, the difference is in the intercept (β0 + β2 vs. β0), while the size contribution to the house value (β1) is the same for both houses. Since β1 is the slope, we have two parallel lines (see the graph) that describe the linear relationship for the two types of houses.

Such a formulation fits the case where we assume the house value changes with size at the same rate whether or not the house has a fireplace. When this is not the case, an additional term (an interaction term) should be added to the model; this will be discussed later.
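
As a check on the Excel run shown next, here is a minimal Python sketch that fits the same model by least squares, using the 15 observations above (NumPy only, no Excel needed):

```python
import numpy as np

# Data from the table above: Value (in $1000s), Size (in 1000 ft^2), Fireplace dummy
value = np.array([84.4, 77.4, 75.7, 85.9, 79.1, 70.4, 75.8, 85.9,
                  78.5, 79.2, 86.7, 79.3, 74.5, 83.8, 76.8])
size = np.array([2.00, 1.71, 1.45, 1.76, 1.93, 1.20, 1.55, 1.93,
                 1.59, 1.50, 1.90, 1.39, 1.54, 1.89, 1.59])
fireplace = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0])

# Design matrix with an intercept column; solve the least-squares problem
X = np.column_stack([np.ones(len(value)), size, fireplace])
b, *_ = np.linalg.lstsq(X, value, rcond=None)
print(b)  # approx [50.09, 16.19, 3.85]: intercept, Size, Fireplace
```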

After running the model in Excel we get the following output:

SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.900587
R Square / 0.811057
Adjusted R Square / 0.779567
Standard Error / 2.262596
Observations / 15
ANOVA
df / SS / MS / F / Significance F
Regression / 2 / 263.703915 / 131.852 / 25.75565 / 4.55E-05
Residual / 12 / 61.4320854 / 5.11934
Total / 14 / 325.136
Coefficients / Standard Error / t Stat / P-value / Lower 95% / Upper 95%
Intercept / 50.0905 / 4.35165794 / 11.5107 / 7.7E-08 / 40.609 / 59.5719
Size / 16.18583 / 2.57444171 / 6.287124 / 4.02E-05 / 10.57661 / 21.79506
Fireplace / 3.852982 / 1.24122269 / 3.104183 / 0.009119 / 1.148591 / 6.557374

The estimated regression equation is: Value = 50.0905 + 16.186(Size) + 3.853(Fireplace).

For houses with a fireplace the equation is: Value = (50.0905 + 3.853) + 16.186(Size) = 53.944 + 16.186(Size);

For houses without a fireplace the equation is: Value = 50.0905 + 16.186(Size)

Observing the two equations, when distinction is made between houses with and without a fireplace, the assessed house value increases on average by $16,186 for each 1,000 ft² increase in house size. When comparing two houses of equal size, on average a house with a fireplace is assessed $3,853 more than a house without one.

In validating the model we see that 81% of the variability in house value assessment is explained by this model, and from the very small significance F of 4.55×10⁻⁵ we know there is strong evidence in the data that at least one of the variables is linearly related to the house value. We conclude that the model is very useful.

A linear regression with a categorical variable that has more than two levels

In the previous example we dealt with a categorical variable that had two possible levels (values): Fireplace = Yes or No. There are cases where a categorical variable has three or more levels. This is handled by adding more dummy variables: for k levels we define k − 1 dummy variables, which is enough to represent all the levels. For example, three levels require two dummy variables, because level 1 = (1,0), level 2 = (0,1), and level 3 = (0,0). That is, level 3 is defined as "not level 1" and "not level 2". The following example explains the concept.
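
For instance, here is a minimal sketch of this coding in Python (using pandas; the level names follow Example 2 below, and the data values are hypothetical):

```python
import pandas as pd

# Three levels -> two dummy variables; the omitted level ("Traditional") is the baseline
method = pd.Series(["Traditional", "CD", "Web", "CD", "Traditional", "Web"])
dummies = pd.get_dummies(method)[["CD", "Web"]].astype(int)
print(dummies)
# A row of (0, 0) means "not CD" and "not Web", i.e., Traditional
```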

Example 2

The director of a training program for a large insurance company is evaluating three different methods of training underwriters. The three methods are "Traditional", CD-ROM based, and Web-based. She divided 30 trainees into 3 randomly assigned groups of 10. Before the start of the training, each trainee is given a proficiency test in mathematics and computer skills. At the end of the training, all students take the same end-of-training exam. The results are stored in the file UNDERWRITERS.

Develop a multiple regression model that helps predict the score on the end-of-training exam, based on the score on the proficiency test and the method of training used.

Solution
The dependent variable is "End Score" and the independent variables are "Proficiency" (a quantitative variable) and "Method" (a categorical variable). Method has 3 values (Traditional, CD, Web), therefore we use 2 dummy variables. We selected (arbitrarily!) the variables CD and WEB to appear in the equation explicitly, which makes Traditional the baseline level. This choice obviously affects the equation, but not the predictions obtained from it. The multiple regression model becomes End Score = β0 + β1Proficiency + β2CD + β3WEB + ε. An excerpt from the data file is provided below:

End-Score / Proficiency / Method / End-Score / Proficiency / CD / WEB
14 / 94 / Traditional / 14 / 94 / 0 / 0
19 / 96 / Traditional / 19 / 96 / 0 / 0
17 / 98 / Traditional / 17 / 98 / 0 / 0
*** / *** / *** / *** / *** / *** / ***
38 / 80 / CD / 38 / 80 / 1 / 0
34 / 84 / CD / 34 / 84 / 1 / 0
43 / 90 / CD / 43 / 90 / 1 / 0
*** / *** / *** / *** / *** / *** / ***
55 / 92 / Web / 55 / 92 / 0 / 1
53 / 96 / Web / 53 / 96 / 0 / 1
55 / 99 / Web / 55 / 99 / 0 / 1

After running the data in Excel we get the following output:

SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.886397
R Square / 0.785699
Adjusted R Square / 0.760972
Standard Error / 9.634874
Observations / 30
ANOVA
df / SS / MS / F / Significance F
Regression / 3 / 8849.066 / 2949.689 / 31.77489 / 7.53E-09
Residual / 26 / 2413.601 / 92.8308
Total / 29 / 11262.67
Coefficients / Standard Error / t Stat / P-value / Lower 95% / Upper 95%
Intercept / -86.27 / 17.03405 / -5.06456 / 2.83E-05 / -121.284 / -51.256
Proficiency / 1.125782 / 0.158856 / 7.086787 / 1.59E-07 / 0.799248 / 1.452316
CD-ROM / 30.37672 / 4.322997 / 7.026774 / 1.84E-07 / 21.49067 / 39.26277
WEB / 22.28867 / 4.31543 / 5.164878 / 2.18E-05 / 13.41818 / 31.15917

The estimated multiple regression equation is End Score = -86.27 + 1.125Proficiency + 30.377CD + 22.289WEB.

The prediction equation for a trainee who uses the traditional method (CD = 0, WEB = 0) is End Score = -86.27 + 1.125Proficiency.
The prediction equation for a trainee who uses the CD-ROM method (CD = 1, WEB = 0) is End Score = (-86.27 + 30.377) + 1.125Proficiency.
The prediction equation for a trainee who uses the Web method (CD = 0, WEB = 1) is End Score = (-86.27 + 22.289) + 1.125Proficiency.

The model explains more than 78% of the variability in the end-of-training test scores. The model is very useful (significance F = 7.53×10⁻⁹). Each of the variables contributes to the predictive power of the model, as reflected by the very small p-values of the t-tests.

Let us now interpret the coefficients of this equation:

  • b1 = 1.125: For every one-point increase in the proficiency test, the end score is expected to increase by 1.125 points, keeping the method of training the same. Comment: note that this rate of increase is common to all the methods.
  • b2 = 30.377: The end score of a trainee who uses the CD-ROM method is on average 30.377 points higher than that of a trainee who uses the traditional method, if both trainees had the same proficiency score. To understand this statement, note that the equation describing the end score of a trainee who uses a CD can be written End Score (CD) = End Score (Traditional) + 30.377, provided the two trainees have the same proficiency score (compare the equations above).
  • b3 = 22.289: The end score of a trainee who uses the Web method is on average 22.289 points higher than that of a trainee who uses the traditional method, if both trainees had the same proficiency score.

Predicting with the model:

  • Suppose we want to predict the end score of a trainee who scored 100 on the proficiency exam and is enrolled in the Web-based training. The prediction is End Score = -86.27 + 1.125(100) + 30.377(0) + 22.289(1) = 48.519 (see the sketch after this list).
  • Usually we predict by means of a prediction interval or a confidence interval for the mean. Suppose we use a 95% confidence level.
  • To provide an interval for the individual trainee mentioned above (called a prediction interval), we use Data Analysis Plus and get the following result: Lower Limit = 27.78; Upper Limit = 69.41. In other words, the 95% prediction interval for this trainee's end score is (27.78, 69.41).
  • To provide a confidence interval for the mean end score of all trainees whose proficiency score is 100 and who take the Web-based course, we use Data Analysis Plus again and get: Lower Limit = 42.20; Upper Limit = 55. Note that this interval is narrower than the interval for the single individual.
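
The point prediction in the first bullet can be reproduced with a few lines of Python (coefficients rounded as in the text):

```python
# Point prediction: Proficiency = 100, Web-based training (CD = 0, WEB = 1)
b0, b1, b2, b3 = -86.27, 1.125, 30.377, 22.289  # rounded coefficients from the output
end_score = b0 + b1 * 100 + b2 * 0 + b3 * 1
print(f"{end_score:.3f}")  # 48.519
```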

A linear regression with more than one categorical variable
When the data can be classified by more than one type of categorization, more than one set of dummy variables is needed, one set for each category type. The following example demonstrates this case.

Example 3

A study was made at Obrien Electronics Ltd. about salary and its relationship to different groups of employees in the firm. The study needed to cover a variety of factors that influence salary, such as years of experience, age, gender, and education. The data collected are recorded in the file SALARY.

(a) Set up a linear regression model that relates salary to the variables mentioned above.

(b) Is there either gender or age pay discrimination in the company?

(c) Based on your answer in part (b), use the better model (applying the parsimony principle if necessary) to predict the salary of a new 35-year-old female employee who has been working for 10 years with another company, and who earned an MBA degree.

Solution

(a) The model: Salary = β0 + β1Exp + β2MBA + β3Male + β4Over50 + ε
Variable definitions: Exp = years of experience
MBA = Yes/No (1, 0)
Male = Yes/No (1, 0)
Over50 = Yes/No (1, 0)

(b) To answer the question about any pay discrimination with respect to either age or gender, we need to run the regression model and study the relevant β coefficients. The hypotheses we need to set up are:
H0: β3 = β4 = 0
H1: At least one of these betas is not equal to zero.

****************************************

Let us pause in solving the current problem, in order to introduce a procedure used to analyze groups of variables. The procedure is called the "Partial F-test". In short, it compares nested models, one called the complete model and the other called the reduced model:

The Partial F-Test

Complete model: y = β0 + β1x1 + β2x2 + … + βgxg + βg+1xg+1 + … + βkxk + ε
Reduced model: y = β0 + β1x1 + β2x2 + … + βgxg + ε

Note that the reduced model is "nested" in the complete model. The procedure tests whether or not the addition of the group of variables xg+1, …, xk substantially improves the reduced model. A summary of the partial F-test analysis is provided next.

Definitions:

SSEC = the sum of squared errors for the complete model (when all the variables are included). This value is obtained from the Excel output of the regression run.

SSER = the sum of squared errors for the reduced (partial) model. This value is obtained from the Excel output of the regression run.

k = the number of variables in the complete model

g = the number of variables in the reduced model

n = total sample size

The hypotheses tested are:

H0: g+1 = g+2 = … = k = 0

H1: At least one beta is not equal to zero

Intuition: If all the tested betas are equal to zero, the reduced model and the complete model are identical, and no improvement was achieved by adding the group of variables. If at least one βi is not equal to zero, then there is a linear relationship between the corresponding xi and y; the group should be added, because the complete model is better.

F-statistic:

F = [(SSER - SSEC) / (k - g)] / [SSEC / (n - (k + 1))]

Rejection region: F > Fα, k-g, n-(k+1)
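
The whole procedure is easy to compute directly. Here is a minimal sketch in Python (SciPy provides the F distribution; the function name partial_f_test is ours):

```python
from scipy import stats

def partial_f_test(sse_c, sse_r, k, g, n, alpha=0.05):
    """Partial F-test: does adding variables g+1..k improve the reduced model?"""
    df1 = k - g                # number of variables being tested
    df2 = n - (k + 1)          # residual degrees of freedom of the complete model
    f_stat = ((sse_r - sse_c) / df1) / (sse_c / df2)
    f_crit = stats.f.ppf(1 - alpha, df1, df2)   # rejection region: F > f_crit
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, f_crit, p_value
```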

**********************************************

Example 3 - Continued

The complete model here is: Salary = β0 + β1Exp + β2MBA + β3Male + β4Over50 + ε

The reduced model is: Salary = β0 + β1Exp + β2MBA + ε

To run the complete model we use the entire data set. To run the reduced model we use only the columns Salary (for the y range) and Exp, MBA (for the x range). The relevant information in the complete model output is:

ANOVA
df / SS / MS / F / Significance F
Regression / 4 / 10601.55 / 2650.387 / 45.89149 / 8.39E-10
Residual / 20 / 1155.067 / 57.75334
Total / 24 / 11756.61

From the reduced model output we have:

ANOVA
df / SS / MS / F / Significance F
Regression / 2 / 10262 / 5131.002 / 75.52607 / 1.4E-10
Residual / 22 / 1494.61 / 67.93683
Total / 24 / 11756.61

F = [(SSER - SSEC) / (k - g)] / [SSEC / (n - (k + 1))] = [(1494.61 - 1155.067) / 2] / [1155.067 / 20] = 169.77 / 57.75 = 2.94

F.05,2,20 = 3.493

Since 2.94 < 3.493, we have insufficient evidence to support H1. This means that β3 and β4 could be equal to zero, so adding the two variables Male and Over50 does not add information to the reduced model that was not already there. In terms of the specific variables, there is insufficient evidence to infer that there are differences in average pay between male and female employees, or between older and younger workers, when the two employees compared have the same experience and the same education.
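
Plugging the two ANOVA tables into the partial_f_test sketch given earlier (n = 25, since the total df is 24):

```python
# SSE values from the two ANOVA tables above
f_stat, f_crit, p_value = partial_f_test(sse_c=1155.067, sse_r=1494.61,
                                         k=4, g=2, n=25)
print(f_stat, f_crit)  # approx 2.94 and 3.49: F < F_crit, so do not reject H0
```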

Comments:

(i) The equation for the complete model is as follows:
Salary = 29.186 + 2.815Exp + 14.167MBA + 8.438Male - 6.085Over50
We can learn that, keeping the rest of the variables unchanged:

  1. For each additional year of experience the salary increases by $2,815 on average (p-value = 8.94×10⁻¹¹, so the variable is a very strong predictor).
  2. An employee with an MBA degree earns on average $14,167 more than an employee without this degree, if they are equal in all other characteristics (p-value = 0.00732).
  3. From the equation (but see the discussion in comment (ii)) it seems a male employee earns $8,438 more than a female employee with similar characteristics (p-value = 0.028).
  4. From the equation (but see the discussion in comment (ii)) it seems an employee over 50 years old earns on average $6,085 less than a similar employee under 50.

(ii) It is interesting to observe the complete model and examine the significance of the individual variables Male and Over50. When testing the significance of the variable Male, the p-value is .0286. This value is smaller than .05, so at the 5% significance level we can argue that the variable Male contributes to improving a model that contains all the variables but Male (which means the variable Over50 is already 'in'). Similarly, the p-value for the test of Over50 is .17, so this variable does not improve a model that already includes the variable Male. The partial F-test answered a different question: should we include either of these variables when neither has been added yet? The answer was no.

(c) Based on the parsimony principle, we use the smallest model that is still adequate; thus we select the reduced model. The equation is:

Salary = 33.47 + 2.79Exp + 11.65MBA

For Exp = 10 and MBA = 1: Salary = 33.47 + 2.79(10) + 11.65(1) = 73.02 (73.062 when the unrounded coefficients are used), i.e., a predicted salary of about $73,000.
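
The arithmetic as a quick Python check (rounded coefficients from the reduced-model output):

```python
# Reduced-model prediction for Exp = 10 years and MBA = 1
b0, b_exp, b_mba = 33.47, 2.79, 11.65
salary = b0 + b_exp * 10 + b_mba * 1
print(f"{salary:.2f}")  # 73.02 (in $1000s, i.e., about $73,000)
```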

Regression Model with interaction

There are cases where the rate at which y changes when one independent variable, say x1, increases by one unit depends on the particular value of another independent variable, say x2. In such cases we say there is interaction between x1 and x2, and we need to add a new variable to the regression model to capture this change-of-slope effect. We do it by adding the term x1x2 to the model, which in general becomes

y = β0 + β1x1 + β2x2 + … + βkxk + βk+1x1x2 + βk+2x1x3 + … + ε

Note that not all the possible products need to be included, only those relevant to the case studied. Higher-level products such as x1x2x3 may be included too, but are omitted here.

Let us demonstrate the concept and the effects of such an interaction on the linear relationship with two examples: one includes quantitative variables only, and the other includes a categorical variable as well.

Example 4

A consumer organization wants to develop a regression model to predict MPG (miles per gallon) based on the horsepower of the car's engine and the car's weight (in pounds). A sample of 50 recent car models was selected, and the relevant data were recorded and saved in the file MPG. Here is an excerpt of this data file:

MPG / Horsepower / Weight
43.1 / 48 / 1985
19.9 / 110 / 3365
19.2 / 105 / 3535
17.7 / 165 / 3445

Initially the model formulated was MPG = β0 + β1HP + β2W + ε. Then an analyst working for the organization suggested that horsepower and weight might interact in their effect on MPG.

A new model was formulated: MPG = β0 + β1HP + β2W + β3(HP)(W) + ε, and a new column was added to the data file to hold the interaction values in the sample. Here is an excerpt:

MPG / Horsepower / Weight / HP*W
43.1 / 48 / 1985 / 95280
19.9 / 110 / 3365 / 370150
19.2 / 105 / 3535 / 371175
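
Building the interaction column is a one-liner; here is a minimal sketch, assuming the data sit in a CSV file named MPG.csv with the column names shown above (both names are assumptions):

```python
import pandas as pd

# Load the data and add the interaction column HP*W
df = pd.read_csv("MPG.csv")                   # columns: MPG, Horsepower, Weight
df["HP*W"] = df["Horsepower"] * df["Weight"]  # elementwise product
print(df.head(3))
```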

After the data set was run on Excel again the following output was obtained:

SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.894047
R2 / 0.799319
Adjusted R2 / 0.786232
Standard Error / 3.778069
Observations / 50
ANOVA
df / SS / MS / F / Significance F
Regression / 3 / 2615.247 / 871.748 / 61.0733 / 4.48E-16
Residual / 46 / 656.5952 / 14.2738
Total / 49 / 3271.842
Coefficients / Standard Error / t Stat / P-value / Lower 95% / Upper 95%
Intercept / 85.07138 / 8.313213 / 10.2332 / 1.94E-13 / 68.33775 / 101.805
Horsepower / -0.45077 / 0.102861 / -4.382 / 6.75E-05 / -0.65782 / -0.24372
Weight / -0.01524 / 0.00278 / -5.481 / 1.72E-06 / -0.02083 / -0.00964
HP*W / 0.0001 / 2.97E-05 / 3.38210 / 0.00147 / 4.07E-05 / 0.00016

The resulting equation is MPG = 85.07 - .45HP - .015W + .0001(HP)(W). Almost 80% of the variation in MPG among the cars is explained by this model (R² = .799); the model is very useful because significance F = 4.48×10⁻¹⁶, an extremely small p-value. This means that at least one of the independent variables is linearly related to MPG. There is still an unanswered question about the contribution of the interaction to the model. This question can be answered by testing the coefficient β3 of the interaction term. Clearly, if β3 = 0 there is no interaction effect, so we set up the two hypotheses:

H0: 3 0
H1: 30

The p-value of this t-test is .00147. At the 5% significance level the null hypothesis is easily rejected: there is overwhelming evidence to infer that β3 is not equal to zero. This translates to the conclusion that including the interaction term reduces the errors and thus contributes significantly to the precision of the model, in the presence of the other variables!
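
The t-test can be reproduced from the output values; here is a minimal sketch using SciPy:

```python
from scipy import stats

# t-test of H0: beta3 = 0, using the coefficient and standard error from the output
b3, se3, df_resid = 0.0001, 2.97e-5, 46
t_stat = b3 / se3
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-tailed p-value
print(t_stat, p_value)  # approx 3.37 and 0.0015 (the output shows 3.382 and .00147,
                        # computed from the unrounded coefficient)
```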

Interpreting the coefficients in the presence of interaction

Caution must be employed when interpreting the coefficients of the regression when interaction is involved. Although the main-effect variables HP and W are present, one should not interpret their coefficients as the rate of change of the dependent variable with respect to each independent variable. More specifically, the coefficient b1 = -.45 is not the amount of reduction in MPG per one-unit increase of HP. To understand why, consider the equation when W = C (C is some constant).

MPG = 85.07 - .45HP - .015C + .0001(HP)(C) = (85.07 - .015C) + (-.45 + .0001C)HP. So the rate of change of MPG with respect to HP is -.45 + .0001C (and not -.45)! The relationship between HP and MPG is linear, but both the intercept and the slope of the line change with C (that is, both depend on W). As an illustration, observe the relationships for two values of W: 2500 and 3500.

(i) MPG = 85.07 - .015*2500 + (-.45 + .0001*2500)HP = 47.57 - .2HP

(ii) MPG = 85.07 - .015*3500 + (-.45 + .0001*3500)HP = 32.57 - .1HP
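
The intercept and slope of the conditional line can be computed for any weight; here is a small Python sketch (coefficients from the output above; the helper name mpg_line is ours):

```python
# Conditional line of MPG on HP at a fixed weight W = c
def mpg_line(c):
    intercept = 85.07 - 0.015 * c
    slope = -0.45 + 0.0001 * c
    return intercept, slope

print(mpg_line(2500))  # approx (47.57, -0.20)
print(mpg_line(3500))  # approx (32.57, -0.10)
```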

The following is a graphical description of the two lines. As you can see, both the intercept and the slope change when the car weight changes.

We’ll revisit this model later as one of the summary problems.

Example 5