Chapter 14:
Additional Topics in Regression Analysis
14.1 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + εi
where Yi = College GPA
X1 = SAT score
X2 = 1 for sophomore, 0 otherwise
X3 = 1 for junior, 0 otherwise
X4 = 1 for senior, 0 otherwise
The excluded category is first year
14.2 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + εi
where Yi = wages
X1 = Years of experience
X2 = 1 for Germany, 0 otherwise
X3 = 1 for Great Britain, 0 otherwise
X4 = 1 for Japan, 0 otherwise
X5 = 1 for Turkey, 0 otherwise
The excluded category consists of wages in the United States
14.3 Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + εi
where Yi = cost per unit
X1 = 1 for computer controlled machines, 0 otherwise
X2 = 1 for computer controlled machines & computer controlled material handling, 0 otherwise
X3 = 1 for South Africa, 0 otherwise
X4 = 1 for Japan, 0 otherwise
The excluded category is Colombia
14.4 a. For any observation, the values of the four seasonal dummy variables sum to one. Since the equation also contains an intercept term, there is perfect multicollinearity: the "dummy variable trap" (a numerical sketch follows part b).
b. β1 measures the expected difference between demand in the first and fourth quarters, all else equal. β2 measures the expected difference between demand in the second and fourth quarters, all else equal. β3 measures the expected difference between demand in the third and fourth quarters, all else equal.
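The rank deficiency described in part (a) can be illustrated with a minimal Python sketch; the two years of quarterly data below are invented purely for illustration:

import numpy as np

# Two years of invented quarterly observations
quarters = np.tile([1, 2, 3, 4], 2)
dummies = np.column_stack([(quarters == q).astype(float) for q in (1, 2, 3, 4)])
intercept = np.ones((len(quarters), 1))

X_trap = np.hstack([intercept, dummies])       # intercept plus all four dummies
X_ok = np.hstack([intercept, dummies[:, :3]])  # fourth-quarter dummy excluded

# The four dummy columns sum to the intercept column, so X_trap is rank deficient
print(np.linalg.matrix_rank(X_trap), "independent columns out of", X_trap.shape[1])  # 4 out of 5
print(np.linalg.matrix_rank(X_ok), "independent columns out of", X_ok.shape[1])      # 4 out of 4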
14.5 Analyze the correlation matrix first:
Correlations: Sales Pizza1, Price Pizza1, Promotion Pizza1, Sales B2, Price B2, Sales B3, Price B3, Sales B4, Price B4
(each cell shows the Pearson correlation with its p-value beneath it)

          Sales Pi  Price Pi  Promotio  Sales B2  Price B2  Sales B3  Price B3  Sales B4
Price Pi    -0.263
             0.001
Promotio     0.570    -0.203
             0.000     0.011
Sales B2     0.136     0.170     0.031
             0.090     0.034     0.700
Price B2     0.118     0.507     0.117    -0.370
             0.143     0.000     0.146     0.000
Sales B3     0.014     0.174     0.045     0.103     0.199
             0.862     0.029     0.581     0.199     0.013
Price B3     0.179     0.579     0.034     0.162     0.446    -0.316
             0.026     0.000     0.675     0.043     0.000     0.000
Sales B4     0.248     0.102     0.123     0.310     0.136     0.232     0.081
             0.002     0.205     0.127     0.000     0.091     0.004     0.313
Price B4     0.177     0.509     0.124     0.229     0.500     0.117     0.523    -0.158
             0.027     0.000     0.124     0.004     0.000     0.147     0.000     0.049
The strongest correlation with sales of Pizza1 is with the promotion variable. The price of Pizza1 has the expected negative association with sales. The prices of the competing brands are expected to be positively related to sales of Pizza1, since those brands are substitutes; the sales of the competing brands are expected to be negatively related to the sales of Pizza1.
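A correlation matrix of this form could be reproduced along the following lines in Python; the file name and column names here are assumptions, not part of the exercise:

import pandas as pd
from scipy import stats

df = pd.read_csv("pizza_sales.csv")   # hypothetical data file
cols = ["Sales_Pizza1", "Price_Pizza1", "Promotion_Pizza1", "Sales_B2",
        "Price_B2", "Sales_B3", "Price_B3", "Sales_B4", "Price_B4"]

# Pearson correlation and p-value for every pair, mirroring the table above
for i, x in enumerate(cols):
    for y in cols[:i]:
        r, p = stats.pearsonr(df[x], df[y])
        print(f"{x:18s} {y:18s} r = {r:6.3f}  p = {p:.3f}")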
Regression Analysis: Sales Pizza1 versus Price Pizza1, Promotion Pi, ...
The regression equation is
Sales Pizza1 = - 6406 - 24097 Price Pizza1 + 1675 Promotion Pizza1
+ 0.0737 Sales B2 + 4204 Price B2 + 0.177 Sales B3 + 18003 Price B3
+ 0.345 Sales B4 + 11813 Price B4
Predictor Coef SE Coef T P VIF
Constant -6406 2753 -2.33 0.021
Price Pi -24097 3360 -7.17 0.000 2.5
Promotio 1674.6 283.9 5.90 0.000 1.2
Sales B2 0.07370 0.08281 0.89 0.375 3.1
Price B2 4204 4860 0.87 0.388 4.3
Sales B3 0.17726 0.09578 1.85 0.066 1.9
Price B3 18003 4253 4.23 0.000 3.0
Sales B4 0.3453 0.1392 2.48 0.014 1.9
Price B4 11813 6151 1.92 0.057 3.0
S = 3700 R-Sq = 54.9% R-Sq(adj) = 52.4%
Analysis of Variance
Source DF SS MS F P
Regression 8 2447394873 305924359 22.35 0.000
Residual Error 147 2012350019 13689456
Total 155 4459744891
The multiple regression with all of the independent variables indicates that 54.9% of the variation in the sales of Pizza1 can be explained by the model. However, not all of the independent variables are significantly different from zero: neither the price nor the sales of Brand 2 has a statistically significant effect on the sales of Pizza1. Eliminating the insignificant variables yields:
Regression Analysis: Sales Pizza1 versus Price Pizza1, Promotion Pi, ...
The regression equation is
Sales Pizza1 = - 6546 - 23294 Price Pizza1 + 1701 Promotion Pizza1
+ 0.197 Sales B3 + 18922 Price B3 + 0.418 Sales B4 + 15152 Price B4
Predictor Coef SE Coef T P VIF
Constant -6546 2676 -2.45 0.016
Price Pi -23294 3210 -7.26 0.000 2.3
Promotio 1701.0 279.9 6.08 0.000 1.2
Sales B3 0.19737 0.09234 2.14 0.034 1.8
Price B3 18922 4092 4.62 0.000 2.8
Sales B4 0.4183 0.1137 3.68 0.000 1.3
Price B4 15152 4978 3.04 0.003 2.0
S = 3686 R-Sq = 54.6% R-Sq(adj) = 52.8%
Analysis of Variance
Source DF SS MS F P
Regression 6 2435527670 405921278 29.88 0.000
Residual Error 149 2024217221 13585350
Total 155 4459744891
All of the variables are now significant at the .05 level, and all have the expected sign except the sales of Brands 3 and 4.
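The full and reduced fits above could be reproduced along these lines in Python; the file and column names are assumptions:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pizza_sales.csv")   # hypothetical data file

# Full model with all eight explanatory variables
full = smf.ols("Sales_Pizza1 ~ Price_Pizza1 + Promotion_Pizza1 + Sales_B2 + Price_B2"
               " + Sales_B3 + Price_B3 + Sales_B4 + Price_B4", data=df).fit()
print(full.summary())

# Reduced model after dropping the insignificant Brand 2 price and sales
reduced = smf.ols("Sales_Pizza1 ~ Price_Pizza1 + Promotion_Pizza1"
                  " + Sales_B3 + Price_B3 + Sales_B4 + Price_B4", data=df).fit()
print(reduced.summary())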
14.6 Yi = β0 + β1X1i + β2X2i + … + β12X12i + β13X13i + εi
where Yi = per capita cereal sales
X1 = cereal price
X2 = price of competing cereals
X3 = mean per capita income
X4 = % college graduates
X5 = mean annual temperature
X6 = mean annual rainfall
X7 = 1 for cities east of the Mississippi, 0 otherwise
X8 = 1 for high per capita income, 0 otherwise
X9 = 1 for intermediate per capita income, 0 otherwise
X10 = 1 for northwest, 0 otherwise
X11 = 1 for southwest, 0 otherwise
X12 = 1 for northeast, 0 otherwise
X13 = X1X7 – interaction term between price and cities east of the Mississippi
The model specification includes continuous independent variables, dichotomous indicator variables, and a slope dummy (interaction) variable. Based on demand theory, we would expect the coefficient on cereal price to be negative, by the law of demand. Prices of substitutes are expected to have a positive impact on per capita cereal sales. If cereal is a normal good, mean per capita income will have a positive impact on sales. The signs and sizes of the other coefficients must be determined empirically. While the functional form can be linear, non-linearity could be introduced based on an initial analysis of scatterplots of the relationships. High correlation among the independent variables could also be a concern; for example, per capita income and the percentage of college graduates may well be collinear. Several iterations of the model could be estimated to find the best combination of variables.
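One way to write this specification, including the price-by-region interaction X13, is as a regression formula; the variable names below are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cereal_cities.csv")   # hypothetical city-level data set

model = smf.ols(
    "per_capita_sales ~ cereal_price + competitor_price + income + pct_college"
    " + temperature + rainfall + east + high_income + mid_income"
    " + northwest + southwest + northeast"
    " + cereal_price:east",              # slope dummy: price x east of the Mississippi
    data=df).fit()
print(model.summary())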
14.7 Define the following variables for the experiment
Y = the number of defective parts per 8 hour work shift
X1 = Shift
1. Day shift
2. Afternoon shift
3. Night shift
X2 = Material suppliers
1. Supplier 1
2. Supplier 2
3. Supplier 3
4. Supplier 4
X3 = production level
X4 = number of shift workers
Two series of dummy variables are required to analyze the impact of shift and materials supplier on the number of defective parts. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap." Interaction terms may be appropriate between production level and shift.
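A sketch of the (k - 1) dummy coding and a shift-by-production interaction, with assumed file and column names:

import pandas as pd

df = pd.read_csv("defects.csv")   # assumed columns: defects, shift, supplier, production, workers

# drop_first=True keeps (k - 1) indicator columns for each categorical variable,
# which avoids the dummy variable trap when an intercept is included
dummies = pd.get_dummies(df[["shift", "supplier"]].astype(str), drop_first=True)
X = pd.concat([df[["production", "workers"]], dummies], axis=1)

# Interaction between production level and each shift indicator
for col in [c for c in dummies.columns if c.startswith("shift")]:
    X[f"production_x_{col}"] = df["production"] * dummies[col]
# X, plus an added constant, could then be passed to an OLS routine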
14.8 Define the following variables for the experiment
Y = worker compensation
X1 = years of experience
X2 = job classification level
1. Apprentice
2. Professional
3. Master
X3 = individual ability
X4 = gender
1. male
2. female
X5 = race
1. White
2. Black
3. Latino
Two different dependent variables can be developed from the salary data: base compensation can be analyzed in one model, and incremental salary increases in another. Dummy variables are required to analyze the impact of job classification on salary, and discrimination can be measured by the size and significance of the coefficients on the gender and race dummies. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap." The F-test for the significance of the overall regression will be used to determine whether the model has significant explanatory power, and t-tests on the individual regression slope coefficients will be used to assess the impact of each independent variable. Model diagnostics will be based on R-square and the behavior of the residuals.
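A sketch, with hypothetical file and column names, of how the overall F-test, the individual t-tests, and R-square could be obtained for such a compensation model:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("compensation.csv")   # hypothetical salary data

model = smf.ols("compensation ~ experience + C(job_class) + ability"
                " + C(gender) + C(race)", data=df).fit()

print(model.fvalue, model.f_pvalue)    # overall significance of the regression
print(model.tvalues)                   # t statistics on the individual coefficients
print(model.pvalues)
print(model.rsquared)                  # one of the model diagnostics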
14.9 a. Define the following variables for the experiment
Y = worker compensation – annual average rate of wage increase
X1 = years of experience
X2 = job classification group
1. Administrative
2. Analytical
3. Managerial
X3 = 1 for MBA, 0 otherwise
X4 = gender
1. male
2. female
X5 = race
1. White
2. Black
3. Latino
Average annual rate of wage increase can be analyzed with a combination of continuous independent variables and a series of dummy variables. Dummy variables are required to analyze the impact of job classification on wage growth, and discrimination can be measured by the size and significance of the coefficients on the gender and race dummies. For a categorical variable with k categories, (k - 1) dummy variables are required to avoid the "dummy variable trap."
b. Key points would include interpretation of the coefficients on the dichotomous variables and of any interaction terms. Tests of significance of the overall regression, t-tests on the individual coefficients, and model diagnostics would be presented as statistical evidence on whether wage discrimination exists.
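Evidence of wage discrimination could also be framed as a partial F-test on the gender and race terms jointly, as in this sketch (file and column names assumed):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("wage_growth.csv")    # hypothetical data set

restricted = smf.ols("wage_growth ~ experience + C(job_group) + mba",
                     data=df).fit()
unrestricted = smf.ols("wage_growth ~ experience + C(job_group) + mba"
                       " + C(gender) + C(race)", data=df).fit()

# Partial F-test: do the gender and race dummies jointly improve the model?
print(anova_lm(restricted, unrestricted))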
14.10 What is the long-term effect of a one-unit increase in x in period t? For a model with a lagged dependent variable, yt = β0 + β1xt + γyt-1 + εt, the long-run (total) effect is β1(1 + γ + γ² + …) = β1/(1 - γ):
a. β1/(1 - γ) = 3.03
b. β1/(1 - γ) = 3.289
c. β1/(1 - γ) = 5.556
d. β1/(1 - γ) = 6.515
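A quick numerical check of the long-run multiplier formula; the coefficient values below are illustrative only, not those of the exercise:

# Long-run effect of a one-unit increase in x: b1 * (1 + c + c^2 + ...) = b1 / (1 - c)
b1, c = 1.0, 0.67                              # illustrative values
effects = [b1 * c**t for t in range(200)]      # effect in each subsequent period
print(sum(effects), b1 / (1 - c))              # both are approximately 3.03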
14.11 a. The computed test statistic does not exceed the critical value; therefore, do not reject H0 at the 5% level.
b. 95% CI: .142 ± 2.08(.047) = (.0442, .2398) (checked numerically in the sketch after part c)
c. Total effect: in the long run, a one-unit increase in the explanatory variable is associated with a $.25 increase in clothing expenditures.
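A check of the interval arithmetic in part (b); 21 degrees of freedom are assumed here only because they reproduce the 2.08 multiplier:

from scipy import stats

b, se = 0.142, 0.047
t_crit = stats.t.ppf(0.975, 21)             # approximately 2.08
print(round(t_crit, 2))
print(b - t_crit * se, b + t_crit * se)     # approximately (.0442, .2398)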
14.12
Regression Analysis: Y Retail Sales versus X Income, Ylag1
The regression equation is
Y Retail Sales = 1752 + 0.367 X Income + 0.053 Ylag1
21 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 1751.6 500.0 3.50 0.003
X Incom 0.36734 0.08054 4.56 0.000
Ylag1 0.0533 0.2035 0.26 0.796
S = 153.4 R-Sq = 91.7% R-Sq(adj) = 90.7%
For the coefficient on the lagged dependent variable, t = 0.0533/0.2035 = 0.26 with a p-value of .796; therefore, do not reject H0 at the 20% level.
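A sketch of the lagged dependent variable regression in Python; the file and column names are assumptions. Creating the lag loses the first observation, which is why the output reports 21 cases used:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("retail_sales.csv")            # hypothetical annual data
df["sales_lag1"] = df["retail_sales"].shift(1)  # one-period lag of the dependent variable

model = smf.ols("retail_sales ~ income + sales_lag1", data=df.dropna()).fit()
print(model.summary())   # compare the t statistic on sales_lag1 with the output above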
14.13
Regression Analysis: Y_money versus X1_income, X2_ir, Y_lagmoney
The regression equation is
Y_money = - 2309 + 0.158 X1_income - 14126 X2_ir + 1.06 Y_lagmoney
27 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant -2309 1876 -1.23 0.231
X1_incom 0.1584 0.2263 0.70 0.491
X2_ir -14126 6372 -2.22 0.037
Y_lagmon 1.0631 0.1266 8.40 0.000
S = 456.1 R-Sq = 97.6% R-Sq(adj) = 97.3%
Analysis of Variance
Source DF SS MS F P
Regression 3 194108213 64702738 311.02 0.000
Residual Error 23 4784762 208033
Total 26 198892975
Source DF Seq SS
X1_incom 1 167714527
X2_ir 1 11728933
Y_lagmon 1 14664753
Unusual Observations
Obs X1_incom Y_money Fit SE Fit Residual St Resid
24 17455 24975.2 23990.1 186.1 985.1 2.37R
25 16620 24736.3 24663.3 322.8 73.0 0.23 X
26 17779 23407.3 24922.0 189.3 -1514.7 -3.6
Durbin-Watson statistic = 1.65
14.14
Regression Analysis: Y_%stocks versus X_Return, Y_lag%stocks
The regression equation is
Y_%stocks = 1.65 + 0.228 X_Return + 0.950 Y_lag%stocks
24 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 1.646 2.414 0.68 0.503
X_Return 0.22776 0.03015 7.55 0.000
Y_lag%st 0.94999 0.04306 22.06 0.000
S = 2.351 R-Sq = 95.9% R-Sq(adj) = 95.5%
Analysis of Variance
Source DF SS MS F P
Regression 2 2689.6 1344.8 243.38 0.000
Residual Error 21 116.0 5.5
Total 23 2805.6
Source DF Seq SS
X_Return 1 0.7
Y_lag%st 1 2688.9
Unusual Observations
Obs X_Return Y_%stock Fit SE Fit Residual St Resid
20 -26.5 56.000 60.210 1.160 -4.210 -2.06R
14.15
Regression Analysis: Y_income versus X_money, Y_lagincome
The regression equation is
Y_income = 11843 + 0.388 X_money + 0.807 Y_lagincome
19 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 11843 5666 2.09 0.053
X_money 0.3875 0.3778 1.03 0.320
Y_laginc 0.8068 0.1801 4.48 0.000
S = 1952 R-Sq = 99.6% R-Sq(adj) = 99.6%
Analysis of Variance
Source DF SS MS F P
Regression 2 15787845901 7893922950 2071.84 0.000
Residual Error 16 60961685 3810105
Total 18 15848807586
Source DF Seq SS
X_money 1 15711421835
Y_laginc 1 76424065
Unusual Observations
Obs X_money Y_income Fit SE Fit Residual St Resid
13 68694 182744 178826 521 3918 2.08R
14.16
Regression Analysis: Y_Birth versus X_1stmarriage, Y_lagBirth
The regression equation is
Y_Birth = 21262 + 0.485 X_1stmarriage + 0.192 Y_lagBirth
19 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 21262 5720 3.72 0.002
X_1stmar 0.4854 0.1230 3.94 0.001
Y_lagBir 0.1923 0.1898 1.01 0.326
S = 2513 R-Sq = 93.7% R-Sq(adj) = 93.0%
Analysis of Variance
Source DF SS MS F P
Regression 2 1515082551 757541276 119.93 0.000
Residual Error 16 101062160 6316385
Total 18 1616144711
Source DF Seq SS
X_1stmar 1 1508597348
Y_lagBir 1 6485203
Unusual Observations
Obs X_1stmar Y_Birth Fit SE Fit Residual St Resid
15 105235 95418 89340 982 6078 2.63R
14.17
Regression Analysis: Y_logSales versus X_logAdExp, Y_loglagSales
The regression equation is
Y_logSales = 0.492 + 0.746 X_logAdExp + 0.263 Y_loglagSales
24 cases used 1 cases contain missing values
Predictor Coef SE Coef T P
Constant 0.4920 0.3913 1.26 0.222
X_logAdE 0.74569 0.09934 7.51 0.000
Y_loglag 0.26313 0.09136 2.88 0.009
S = 0.05506 R-Sq = 94.0% R-Sq(adj) = 93.4%
Analysis of Variance
Source DF SS MS F P
Regression 2 0.99860 0.49930 164.73 0.000
Residual Error 21 0.06365 0.00303
Total 23 1.06225
Source DF Seq SS
X_logAdE 1 0.97346
Y_loglag 1 0.02515
Unusual Observations
Obs X_logAdE Y_logSal Fit SE Fit Residual St Resid
15 6.88 7.4883 7.6214 0.0136 -0.1331 -2.49R
14.18
Regression Analysis: Y_logCons versus X_LogDI, Y_laglogCons
The regression equation is
Y_logCons = 0.405 + 0.373 X_LogDI + 0.558 Y_laglogCons