Chapter 14:
Additional Topics in Regression Analysis
14.1 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + εi
where Yi = College GPA
X1 = SAT score
X2 = 1 for sophomore, 0 otherwise
X3 = 1 for junior, 0 otherwise
X4 = 1 for senior, 0 otherwise
The excluded category is first year
14.2 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + εi
where Yi = wages
X1 = Years of experience
X2 = 1 for Germany, 0 otherwise
X3 = 1 for Great Britain, 0 otherwise
X4 = 1 for Japan, 0 otherwise
X5 = 1 for Turkey, 0 otherwise
The excluded category consists of wages in the United States
14.3 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + εi
where Yi = cost per unit
X1 = 1 for computer controlled machines, 0 otherwise
X2 = 1 for computer controlled machines & computer controlled material handling, 0 otherwise
X3 = 1 for South Africa, 0 otherwise
X4 = 1 for Japan, 0 otherwise
The excluded category is Colombia
14.4 a. For any observation, the values of the four seasonal dummy variables sum to one. Since the equation also contains an intercept term, there is perfect multicollinearity: the "dummy variable trap" (a numerical sketch follows part b).
b. β1 measures the expected difference between demand in the first and fourth quarters, all else equal. β2 measures the expected difference between demand in the second and fourth quarters, all else equal. β3 measures the expected difference between demand in the third and fourth quarters, all else equal.
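The rank deficiency described in part (a) can be illustrated with a minimal Python sketch; the two years of quarterly data below are invented purely for illustration:

import numpy as np

# Two years of invented quarterly observations
quarters = np.tile([1, 2, 3, 4], 2)
dummies = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3, 4)])
intercept = np.ones((len(quarters), 1))

X_trap = np.hstack([intercept, dummies])       # intercept plus all four dummies
X_ok = np.hstack([intercept, dummies[:, :3]])  # fourth-quarter dummy excluded

# The four dummy columns sum to the intercept column, so X_trap is rank deficient
print(np.linalg.matrix_rank(X_trap), "independent columns out of", X_trap.shape[1])  # 4 out of 5
print(np.linalg.matrix_rank(X_ok), "independent columns out of", X_ok.shape[1])      # 4 out of 4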
14.5 Analyze the correlation matrix first:
Correlations: Sales Pizza1, Price Pizza1, Promotion Pizza1, Sales B2, Price B2, Sales B3, Price B3, Sales B4, Price B4
(each cell shows the Pearson correlation with its p-value beneath it)

          Sales Pi  Price Pi  Promotio  Sales B2  Price B2  Sales B3  Price B3  Sales B4
Price Pi    -0.263
             0.001
Promotio     0.570    -0.203
             0.000     0.011
Sales B2     0.136     0.170     0.031
             0.090     0.034     0.700
Price B2     0.118     0.507     0.117    -0.370
             0.143     0.000     0.146     0.000
Sales B3     0.014     0.174     0.045     0.103     0.199
             0.862     0.029     0.581     0.199     0.013
Price B3     0.179     0.579     0.034     0.162     0.446    -0.316
             0.026     0.000     0.675     0.043     0.000     0.000
Sales B4     0.248     0.102     0.123     0.310     0.136     0.232     0.081
             0.002     0.205     0.127     0.000     0.091     0.004     0.313
Price B4     0.177     0.509     0.124     0.229     0.500     0.117     0.523    -0.158
             0.027     0.000     0.124     0.004     0.000     0.147     0.000     0.049
The strongest correlation with sales of Pizza1 is with the promotion variable. The price of Pizza1 has the expected negative association with sales. The prices of the competing brands are expected to be positively related to sales of Pizza1, since those brands are substitutes; the sales of the competing brands are expected to be negatively related to the sales of Pizza1.
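A correlation matrix of this form could be reproduced along the following lines in Python; the file name and column names here are assumptions, not part of the exercise:

import pandas as pd
from scipy import stats

df = pd.read_csv("pizza_sales.csv")   # hypothetical data file
cols = ["Sales_Pizza1", "Price_Pizza1", "Promotion_Pizza1", "Sales_B2",
        "Price_B2", "Sales_B3", "Price_B3", "Sales_B4", "Price_B4"]

# Pearson correlation and p-value for every pair, mirroring the table above
for i, x in enumerate(cols):
    for y in cols[:i]:
        r, p = stats.pearsonr(df[x], df[y])
        print(f"{x:18s} {y:18s} r = {r:6.3f}  p = {p:.3f}")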
Regression Analysis: Sales Pizza1 versus Price Pizza1, Promotion Pi, ...
The regression equation is
Sales Pizza1 = - 6406 - 24097 Price Pizza1 + 1675 Promotion Pizza1
+ 0.0737 Sales B2 + 4204 Price B2 + 0.177 Sales B3 + 18003 Price B3
+ 0.345 Sales B4 + 11813 Price B4
Predictor Coef SE Coef T P VIF
Constant -6406 2753 -2.33 0.021
Price Pi -24097 3360 -7.17 0.000 2.5
Promotio 1674.6 283.9 5.90 0.000 1.2
Sales B2 0.07370 0.08281 0.89 0.375 3.1
Price B2 4204 4860 0.87 0.388 4.3
Sales B3 0.17726 0.09578 1.85 0.066 1.9
Price B3 18003 4253 4.23 0.000 3.0
Sales B4 0.3453 0.1392 2.48 0.014 1.9
Price B4 11813 6151 1.92 0.057 3.0
S = 3700 R-Sq = 54.9% R-Sq(adj) = 52.4%
Analysis of Variance
Source DF SS MS F P
Regression 8 2447394873 305924359 22.35 0.000
Residual Error 147 2012350019 13689456
Total 155 4459744891
The multiple regression with all of the independent variables indicates that 54.9% of the variation in the sales of Pizza1 can be explained by the model. However, not all of the independent variables are significantly different from zero: neither the price nor the sales of Brand 2 has a statistically significant effect on the sales of Pizza1. Eliminating the insignificant variables yields:
Regression Analysis: Sales Pizza1 versus Price Pizza1, Promotion Pi, ...
The regression equation is
Sales Pizza1 = - 6546 - 23294 Price Pizza1 + 1701 Promotion Pizza1
+ 0.197 Sales B3 + 18922 Price B3 + 0.418 Sales B4 + 15152 Price B4
Predictor Coef SE Coef T P VIF
Constant -6546 2676 -2.45 0.016
Price Pi -23294 3210 -7.26 0.000 2.3
Promotio 1701.0 279.9 6.08 0.000 1.2
Sales B3 0.19737 0.09234 2.14 0.034 1.8
Price B3 18922 4092 4.62 0.000 2.8
Sales B4 0.4183 0.1137 3.68 0.000 1.3
Price B4 15152 4978 3.04 0.003 2.0
S = 3686 R-Sq = 54.6% R-Sq(adj) = 52.8%
Analysis of Variance
Source DF SS MS F P
Regression 6 2435527670 405921278 29.88 0.000
Residual Error 149 2024217221 13585350
Total 155 4459744891
All of the variables are now significant at the .05 level, and all have the expected sign except the sales of Brands 3 and 4.
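The full and reduced fits above could be reproduced along these lines in Python; the file and column names are assumptions:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pizza_sales.csv")   # hypothetical data file

# Full model with all eight explanatory variables
full = smf.ols("Sales_Pizza1 ~ Price_Pizza1 + Promotion_Pizza1 + Sales_B2 + Price_B2"
               " + Sales_B3 + Price_B3 + Sales_B4 + Price_B4", data=df).fit()
print(full.summary())

# Reduced model after dropping the insignificant Brand 2 price and sales
reduced = smf.ols("Sales_Pizza1 ~ Price_Pizza1 + Promotion_Pizza1"
                  " + Sales_B3 + Price_B3 + Sales_B4 + Price_B4", data=df).fit()
print(reduced.summary())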
14.6 Yi = β0 + β1X1i + β2X2i + … + β12X12i + β13X13i + εi
where Yi = per capita cereal sales
X1 = cereal price
X2 = price of competing cereals
X3 = mean per capita income
X4 = % college graduates
X5 = mean annual temperature
X6 = mean annual rainfall
X7 = 1 for cities east of the Mississippi, 0 otherwise
X8 = 1 for high per capita income, 0 otherwise
X9 = 1 for intermediate per capita income, 0 otherwise
X10 = 1 for northwest, 0 otherwise
X11 = 1 for southwest, 0 otherwise
X12 = 1 for northeast, 0 otherwise
X13 = X1X7 – interaction term between price and cities east of the Mississippi
The model specification includes continuous independent variables, dichotomous indicator variables, and a slope dummy (interaction) variable. Based on demand theory, we would expect the coefficient on cereal price to be negative, by the law of demand. Prices of substitutes are expected to have a positive impact on per capita cereal sales. If cereal is a normal good, mean per capita income will have a positive impact on sales. The signs and sizes of the other coefficients must be determined empirically. While the functional form can be linear, non-linearity could be introduced based on an initial analysis of scatterplots of the relationships. High correlation among the independent variables could also be a concern; for example, per capita income and the percentage of college graduates may well be collinear. Several iterations of the model could be estimated to find the best combination of variables.
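One way to write this specification, including the price-by-region interaction X13, is as a regression formula; the variable names below are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cereal_cities.csv")   # hypothetical city-level data set

model = smf.ols(
    "per_capita_sales ~ cereal_price + competitor_price + income + pct_college"
    " + temperature + rainfall + east + high_income + mid_income"
    " + northwest + southwest + northeast"
    " + cereal_price:east",              # slope dummy: price x east of the Mississippi
    data=df).fit()
print(model.summary())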
14.7 Define the following variables for the experiment
Y = the number of defective parts per 8 hour work shift
X1 = Shift
1. Day shift
2. Afternoon shift
3. Night shift
X2 = Material suppliers
1. Supplier 1
2. Supplier 2
3. Supplier 3
4. Supplier 4
X3 = production level
X4 = number of shift workers
Two series of dummy variables are required to analyze the impact of shift and materials supplier on the number of defective parts. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap." Interaction terms may be appropriate between production level and shift.
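A sketch of the (k - 1) dummy coding and a shift-by-production interaction, with assumed file and column names:

import pandas as pd

df = pd.read_csv("defects.csv")   # assumed columns: defects, shift, supplier, production, workers

# drop_first=True keeps (k - 1) indicator columns for each categorical variable,
# which avoids the dummy variable trap when an intercept is included
dummies = pd.get_dummies(df[["shift", "supplier"]].astype(str), drop_first=True)
X = pd.concat([df[["production", "workers"]], dummies], axis=1)

# Interaction between production level and each shift indicator
for col in [c for c in dummies.columns if c.startswith("shift")]:
    X[f"production_x_{col}"] = df["production"] * dummies[col]
# X, plus an added constant, could then be passed to an OLS routine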
14.8 Define the following variables for the experiment
Y = worker compensation
X1 = years of experience
X2 = job classification level
1. Apprentice
2. Professional
3. Master
X3 = individual ability
X4 = gender
1. male
2. female
X5 = race
1. White
2. Black
3. Latino
Two different dependent variables can be developed from the salary data: base compensation can be analyzed in one model, and incremental salary increases in another. Dummy variables are required to analyze the impact of job classification on salary, and discrimination can be measured by the size and significance of the coefficients on the gender and race dummies. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap." The F-test for the significance of the overall regression will be used to determine whether the model has significant explanatory power, and t-tests on the individual regression slope coefficients will be used to assess the impact of each independent variable. Model diagnostics will be based on R-square and the behavior of the residuals.
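A sketch, with hypothetical file and column names, of how the overall F-test, the individual t-tests, and R-square could be obtained for such a compensation model:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("compensation.csv")   # hypothetical salary data

model = smf.ols("compensation ~ experience + C(job_class) + ability"
                " + C(gender) + C(race)", data=df).fit()

print(model.fvalue, model.f_pvalue)    # overall significance of the regression
print(model.tvalues)                   # t statistics on the individual coefficients
print(model.pvalues)
print(model.rsquared)                  # one of the model diagnostics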
14.9 a. Define the following variables for the experiment
Y = worker compensation – annual average rate of wage increase
X1 = years of experience
X2 = job classification group
1. Administrative
2. Analytical
3. Managerial
X3 = 1 for MBA, 0 otherwise
X4 = gender
1. male
2. female
X5 = race
1. White
2. Black
3. Latino
Average annual rate of wage increase can be analyzed with a combination of continuous independent variables and a series of dummy variables. Dummy variables are required to analyze the impact of job classification on wage growth, and discrimination can be measured by the size and significance of the coefficients on the gender and race dummies. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap."
b. Key points would include interpretation of the coefficients on the dichotomous variables and of any interaction terms. Tests of significance of the overall regression, t-tests on the individual coefficients, and model diagnostics would be presented as statistical evidence on whether wage discrimination exists.
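Evidence of wage discrimination could also be framed as a partial F-test on the gender and race terms jointly, as in this sketch (file and column names assumed):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("wage_growth.csv")    # hypothetical data set

restricted = smf.ols("wage_growth ~ experience + C(job_group) + mba",
                     data=df).fit()
unrestricted = smf.ols("wage_growth ~ experience + C(job_group) + mba"
                       " + C(gender) + C(race)", data=df).fit()

# Partial F-test: do the gender and race dummies jointly improve the model?
print(anova_lm(restricted, unrestricted))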
14.10 What is the long-term effect of a one-unit increase in x in period t? For a model with a lagged dependent variable, yt = β0 + β1xt + γyt-1 + εt, the long-run (total) effect is β1(1 + γ + γ² + …) = β1/(1 - γ):
a. β1/(1 - γ) = 3.03
b. β1/(1 - γ) = 3.289
c. β1/(1 - γ) = 5.556
d. β1/(1 - γ) = 6.515
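A quick numerical check of the long-run multiplier formula; the coefficient values below are illustrative only, not those of the exercise:

# Long-run effect of a one-unit increase in x: b1 * (1 + c + c^2 + ...) = b1 / (1 - c)
b1, c = 1.0, 0.67                              # illustrative values
effects = [b1 * c**t for t in range(200)]      # effect in each subsequent period
print(sum(effects), b1 / (1 - c))              # both are approximately 3.03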
14.11 a. The computed test statistic does not exceed the critical value; therefore, do not reject H0 at the 5% level.
b. 95% CI: .142 ± 2.08(.047) = (.0442, .2398) (checked numerically in the sketch after part c)
c. Total effect: in the long run, a one-unit increase in the explanatory variable is associated with a $.25 increase in clothing expenditures.
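A check of the interval arithmetic in part (b); 21 degrees of freedom are assumed here only because they reproduce the 2.08 multiplier:

from scipy import stats

b, se = 0.142, 0.047
t_crit = stats.t.ppf(0.975, 21)             # approximately 2.08
print(round(t_crit, 2))
print(b - t_crit * se, b + t_crit * se)     # approximately (.0442, .2398)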
14.12
Regression Analysis: Y Retail Sales versus X Income, Ylag1
The regression equation is
Y Retail Sales = 1752 + 0.367 X Income + 0.053 Ylag1
21 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 1751.6 500.0 3.50 0.003
X Incom 0.36734 0.08054 4.56 0.000
Ylag1 0.0533 0.2035 0.26 0.796
S = 153.4 R-Sq = 91.7% R-Sq(adj) = 90.7%
For the coefficient on the lagged dependent variable, t = 0.0533/0.2035 = 0.26 with a p-value of .796; therefore, do not reject H0 at the 20% level.
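A sketch of the lagged dependent variable regression in Python; the file and column names are assumptions. Creating the lag loses the first observation, which is why the output reports 21 cases used:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("retail_sales.csv")            # hypothetical annual data
df["sales_lag1"] = df["retail_sales"].shift(1)  # one-period lag of the dependent variable

model = smf.ols("retail_sales ~ income + sales_lag1", data=df.dropna()).fit()
print(model.summary())   # compare the t statistic on sales_lag1 with the output above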
14.13
Regression Analysis: Y_money versus X1_income, X2_ir, Y_lagmoney
The regression equation is
Y_money = - 2309 + 0.158 X1_income - 14126 X2_ir + 1.06 Y_lagmoney
27 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant -2309 1876 -1.23 0.231
X1_incom 0.1584 0.2263 0.70 0.491
X2_ir -14126 6372 -2.22 0.037
Y_lagmon 1.0631 0.1266 8.40 0.000
S = 456.1 R-Sq = 97.6% R-Sq(adj) = 97.3%
Analysis of Variance
Source DF SS MS F P
Regression 3 194108213 64702738 311.02 0.000
Residual Error 23 4784762 208033
Total 26 198892975
Source DF Seq SS
X1_incom 1 167714527
X2_ir 1 11728933
Y_lagmon 1 14664753
Unusual Observations
Obs X1_incom Y_money Fit SE Fit Residual St Resid
24 17455 24975.2 23990.1 186.1 985.1 2.37R
25 16620 24736.3 24663.3 322.8 73.0 0.23 X
26 17779 23407.3 24922.0 189.3 -1514.7 -3.6
Durbin-Watson statistic = 1.65
14.14
Regression Analysis: Y_%stocks versus X_Return, Y_lag%stocks
The regression equation is
Y_%stocks = 1.65 + 0.228 X_Return + 0.950 Y_lag%stocks
24 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 1.646 2.414 0.68 0.503
X_Return 0.22776 0.03015 7.55 0.000
Y_lag%st 0.94999 0.04306 22.06 0.000
S = 2.351 R-Sq = 95.9% R-Sq(adj) = 95.5%
Analysis of Variance
Source DF SS MS F P
Regression 2 2689.6 1344.8 243.38 0.000
Residual Error 21 116.0 5.5
Total 23 2805.6
Source DF Seq SS
X_Return 1 0.7
Y_lag%st 1 2688.9
Unusual Observations
Obs X_Return Y_%stock Fit SE Fit Residual St Resid
20 -26.5 56.000 60.210 1.160 -4.210 -2.06R
14.15
Regression Analysis: Y_income versus X_money, Y_lagincome
The regression equation is
Y_income = 11843 + 0.388 X_money + 0.807 Y_lagincome
19 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 11843 5666 2.09 0.053
X_money 0.3875 0.3778 1.03 0.320
Y_laginc 0.8068 0.1801 4.48 0.000
S = 1952 R-Sq = 99.6% R-Sq(adj) = 99.6%
Analysis of Variance
Source DF SS MS F P
Regression 2 15787845901 7893922950 2071.84 0.000
Residual Error 16 60961685 3810105
Total 18 15848807586
Source DF Seq SS
X_money 1 15711421835
Y_laginc 1 76424065
Unusual Observations
Obs X_money Y_income Fit SE Fit Residual St Resid
13 68694 182744 178826 521 3918 2.08R
14.16
Regression Analysis: Y_Birth versus X_1stmarriage, Y_lagBirth
The regression equation is
Y_Birth = 21262 + 0.485 X_1stmarriage + 0.192 Y_lagBirth
19 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 21262 5720 3.72 0.002
X_1stmar 0.4854 0.1230 3.94 0.001
Y_lagBir 0.1923 0.1898 1.01 0.326
S = 2513 R-Sq = 93.7% R-Sq(adj) = 93.0%
Analysis of Variance
Source DF SS MS F P
Regression 2 1515082551 757541276 119.93 0.000
Residual Error 16 101062160 6316385
Total 18 1616144711
Source DF Seq SS
X_1stmar 1 1508597348
Y_lagBir 1 6485203
Unusual Observations
Obs X_1stmar Y_Birth Fit SE Fit Residual St Resid
15 105235 95418 89340 982 6078 2.63R
14.17
Regression Analysis: Y_logSales versus X_logAdExp, Y_loglagSales
The regression equation is
Y_logSales = 0.492 + 0.746 X_logAdExp + 0.263 Y_loglagSales
24 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 0.4920 0.3913 1.26 0.222
X_logAdE 0.74569 0.09934 7.51 0.000
Y_loglag 0.26313 0.09136 2.88 0.009
S = 0.05506 R-Sq = 94.0% R-Sq(adj) = 93.4%
Analysis of Variance
Source DF SS MS F P
Regression 2 0.99860 0.49930 164.73 0.000
Residual Error 21 0.06365 0.00303
Total 23 1.06225
Source DF Seq SS
X_logAdE 1 0.97346
Y_loglag 1 0.02515
Unusual Observations
Obs X_logAdE Y_logSal Fit SE Fit Residual St Resid
15 6.88 7.4883 7.6214 0.0136 -0.1331 -2.49R
14.18
Regression Analysis: Y_logCons versus X_LogDI, Y_laglogCons
The regression equation is
Y_logCons = 0.405 + 0.373 X_LogDI + 0.558 Y_laglogCons