Chapter 20.3: Nominal Independent Variables

ANOVA and Multiple Regression

The two-sample t-test and ANOVA can be viewed as special cases of multiple regression with nominal independent variables. Consider Example 15.1 in which the goal was to compare three advertising strategies (Convenience, Quality and Price) for an apple juice product. Suppose first that we just wanted to compare Convenience and Quality. We could do a two-sample t-test:

Oneway Analysis of Sales By Strategy

t-Test

Difference estimate / -75.45
Std Error / 30.01
Lower 95% / -136.20
Upper 95% / -14.70
t-Test / -2.514
DF / 38
Prob > |t| / 0.0163

Assuming equal variances

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Strategy / 1 / 56927.03 / 56927.0 / 6.3206 / 0.0163
Error / 38 / 342248.95 / 9006.6
C. Total / 39 / 399175.97

Means for Oneway Anova

Level / Number / Mean / Std Error / Lower 95% / Upper 95%
Convenience / 20 / 577.550 / 21.221 / 534.59 / 620.51
Quality / 20 / 653.000 / 21.221 / 610.04 / 695.96

Std Error uses a pooled estimate of error variance

We could also consider a regression model. Let I1 = 0 if Convenience and I1 = 1 if Quality. A regression model would look like this:

Sales = β0 + β1 I1 + ε.

Bivariate Fit of Sales By I1

Linear Fit

Sales = 577.55 + 75.45 I1

Summary of Fit

RSquare / 0.142611
RSquare Adj / 0.120048
Root Mean Square Error / 94.90285
Mean of Response / 615.275
Observations (or Sum Wgts) / 40

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio
Model / 1 / 56927.03 / 56927.0 / 6.3206
Error / 38 / 342248.95 / 9006.6 / Prob > F
C. Total / 39 / 399175.97 / 0.0163

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 577.55 / 21.22092 / 27.22 / <.0001
I1 / 75.45 / 30.01092 / 2.51 / 0.0163

Notice that the coefficient on I1 equals the difference between the mean of quality (I1 = 1) and the mean of convenience (I1 = 0). This makes sense because the coefficient on I1 is the average change in sales for a one-unit increase in I1, which is just the difference between the mean of quality and the mean of convenience. Also notice that the p-value for testing whether the coefficient on I1 is zero is the same as the p-value for the two-sided two-sample t-test.
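This equivalence can be checked numerically. The following Python sketch uses synthetic data (not the actual Example 15.1 data), with two groups of 20 observations whose population means are set near the Convenience and Quality sample means: regressing on a single 0/1 dummy gives an intercept equal to the baseline group's sample mean and a slope equal to the difference in sample means.

```python
import numpy as np

# Synthetic illustration (not the real Example 15.1 data): two groups of 20
# observations with means near the Convenience and Quality sample means.
rng = np.random.default_rng(0)
conv = rng.normal(577.55, 95, 20)   # I1 = 0 group
qual = rng.normal(653.00, 95, 20)   # I1 = 1 group

y = np.concatenate([conv, qual])
I1 = np.concatenate([np.zeros(20), np.ones(20)])

# Least-squares fit of y = b0 + b1 * I1
X = np.column_stack([np.ones_like(I1), I1])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# The OLS identity: b0 is the group-0 sample mean, b1 the difference in means
assert np.isclose(b0, conv.mean())
assert np.isclose(b1, qual.mean() - conv.mean())
```

The identity is exact (up to rounding) for any data, because a regression on a saturated set of dummies fits each group's sample mean perfectly.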

What happens if we also want to incorporate the price strategy into the analysis, i.e., use a one-way ANOVA with three levels for convenience, quality and price? The one-way ANOVA analysis is the following:

Oneway Analysis of Sales By Strategy

Oneway Anova

Summary of Fit

Rsquare / 0.101882
Adj Rsquare / 0.07037
Root Mean Square Error / 94.31038
Mean of Response / 613.0667
Observations (or Sum Wgts) / 60

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Strategy / 2 / 57512.23 / 28756.1 / 3.2330 / 0.0468
Error / 57 / 506983.50 / 8894.4
C. Total / 59 / 564495.73

Means for Oneway Anova

Level / Number / Mean / Std Error / Lower 95% / Upper 95%
Convenience / 20 / 577.550 / 21.088 / 535.32 / 619.78
Price / 20 / 608.650 / 21.088 / 566.42 / 650.88
Quality / 20 / 653.000 / 21.088 / 610.77 / 695.23

Std Error uses a pooled estimate of error variance

Means Comparisons

Dif=Mean[i]-Mean[j] / Quality / Price / Convenience
Quality / 0.000 / 44.350 / 75.450
Price / -44.350 / 0.000 / 31.100
Convenience / -75.450 / -31.100 / 0.000

Alpha = 0.05

Comparisons for each pair using Student's t

t = 2.00247
Abs(Dif)-LSD / Quality / Price / Convenience
Quality / -59.721 / -15.371 / 15.729
Price / -15.371 / -59.721 / -28.621
Convenience / 15.729 / -28.621 / -59.721

Positive values show pairs of means that are significantly different.

There is a multiple regression equivalent to the ANOVA but it involves creating two dummy variables: I1 = 1 if quality, I1 = 0 if convenience or price; and I2 = 1 if price, I2 = 0 if convenience or quality. The regression model looks like this:

Sales = β0 + β1 I1 + β2 I2 + ε

In this case the intercept β0 represents the mean for convenience, β0 + β1 represents the mean for quality and β0 + β2 represents the mean for price. β1 represents the difference between the means for quality and convenience and β2 represents the difference between the means for price and convenience.

Response Sales

Whole Model

Actual by Predicted Plot

Summary of Fit

RSquare / 0.101882
RSquare Adj / 0.07037
Root Mean Square Error / 94.31038
Mean of Response / 613.0667
Observations (or Sum Wgts) / 60

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio
Model / 2 / 57512.23 / 28756.1 / 3.2330
Error / 57 / 506983.50 / 8894.4 / Prob > F
C. Total / 59 / 564495.73 / 0.0468

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 577.55 / 21.08844 / 27.39 / <.0001
I1 / 75.45 / 29.82356 / 2.53 / 0.0142
I2 / 31.1 / 29.82356 / 1.04 / 0.3014

Effect Tests

Source / Nparm / DF / Sum of Squares / F Ratio / Prob > F
I1 / 1 / 1 / 56927.025 / 6.4003 / 0.0142
I2 / 1 / 1 / 9672.100 / 1.0874 / 0.3014

Residual by Predicted Plot

Notice that the coefficients on I1 and I2 are the differences in sample means between quality and convenience and between price and convenience, respectively. Also notice that the p-value for the F-test of H0: β1 = β2 = 0 is identical to the p-value of the F-test from the one-way ANOVA of the hypothesis that the means of convenience, quality and price are equal. This is the case because β1 = β2 = 0 is equivalent to the means of convenience, quality and price being equal.
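The same identity extends to three groups. A short sketch (again with synthetic data, group means set near the one-way ANOVA output above): with convenience as the baseline category, the two dummy coefficients recover the two differences in sample means.

```python
import numpy as np

# Synthetic data: three groups of 20 with means near the one-way ANOVA output
rng = np.random.default_rng(1)
conv = rng.normal(577.6, 94, 20)
qual = rng.normal(653.0, 94, 20)
price = rng.normal(608.7, 94, 20)

y = np.concatenate([conv, qual, price])
I1 = np.r_[np.zeros(20), np.ones(20), np.zeros(20)]   # 1 if quality
I2 = np.r_[np.zeros(20), np.zeros(20), np.ones(20)]   # 1 if price

X = np.column_stack([np.ones(60), I1, I2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(b0, conv.mean())                 # intercept = convenience mean
assert np.isclose(b1, qual.mean() - conv.mean())   # quality - convenience
assert np.isclose(b2, price.mean() - conv.mean())  # price - convenience
```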

Combining Nominal Independent Variables and Continuous Independent Variables

Multiple linear regression can accommodate both categorical independent variables and continuous independent variables. For example (Keller and Warrack, section 20.3), suppose you want to predict used-car prices from their odometer reading and their color (white, silver and other). To represent the situation of three possible colors, we need two indicator variables. In general, to represent a nominal variable with m possible categories, we must create m − 1 indicator variables. Here, we create two indicator variables.

I1 = 1 if the color is white

I1 = 0 if the color is not white

I2 = 1 if the color is silver

I2 = 0 if the color is not silver

The category “Other colors” is defined by I1 = 0 and I2 = 0.
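Creating the indicator columns is mechanical. A minimal Python sketch with a hypothetical five-car sample (the data are made up for illustration):

```python
import numpy as np

# Hypothetical colors for five cars; "other" is the baseline category
color = np.array(["white", "other", "silver", "white", "other"])

I1 = (color == "white").astype(int)    # 1 if white, 0 otherwise
I2 = (color == "silver").astype(int)   # 1 if silver, 0 otherwise

print(I1.tolist())  # [1, 0, 0, 1, 0]
print(I2.tolist())  # [0, 0, 1, 0, 0]
```

A car in the “Other colors” category is the one with I1 = 0 and I2 = 0, so no third column is needed.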

Our regression model is

Price = β0 + β1 x + β2 I1 + β3 I2 + ε

where x equals the odometer reading.

Response Price

Whole Model

Actual by Predicted Plot

Summary of Fit

RSquare / 0.69803
RSquare Adj / 0.688594
Root Mean Square Error / 284.5421
Mean of Response / 14822.82
Observations (or Sum Wgts) / 100

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio
Model / 3 / 17966997 / 5988999 / 73.9709
Error / 96 / 7772564 / 80964 / Prob > F
C. Total / 99 / 25739561 / <.0001

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 16700.646 / 184.3331 / 90.60 / <.0001
Odometer / -0.05554 / 0.004737 / -11.72 / <.0001
I-1 / 90.481959 / 68.16886 / 1.33 / 0.1876
I-2 / 295.47602 / 76.36998 / 3.87 / 0.0002

Effect Tests

Source / Nparm / DF / Sum of Squares / F Ratio / Prob > F
Odometer / 1 / 1 / 11129137 / 137.4575 / <.0001
I-1 / 1 / 1 / 142641 / 1.7618 / 0.1876
I-2 / 1 / 1 / 1211971 / 14.9692 / 0.0002

Residual by Predicted Plot

The interpretation of the coefficient 90.48 on I1 is that a white car sells, on average, for $90.48 more than a car of the “Other colors” category with the same odometer reading. The interpretation of the coefficient 295.48 on I2 is that a silver car sells, on average, for $295.48 more than a car of the “Other colors” category with the same odometer reading. The interpretation of the coefficient −0.05554 on odometer is that for two cars of the same color, the car with one additional mile on the odometer sells, on average, for 5.55 cents less.

For each car color, there is a regression line for how the average price depends on the odometer reading (x). The regression lines for the three colors are parallel. The indicator variables can be interpreted as shifting the intercepts of the regression lines. See the picture on page 708:

For white cars: price-hat = (16700.65 + 90.48) − 0.0555x = 16791.13 − 0.0555x

For silver cars: price-hat = (16700.65 + 295.48) − 0.0555x = 16996.13 − 0.0555x

For all other cars: price-hat = 16700.65 − 0.0555x.

Note: We will not consider it here, but it would be wise to investigate whether there is an interaction between color and odometer. A more general multiple regression model, which would allow for different slopes as well as different intercepts for each color of car, would be

Price = β0 + β1 x + β2 I1 + β3 I2 + β4 (x·I1) + β5 (x·I2) + ε.
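A sketch of how this interaction design matrix is built (the coefficients below are hypothetical, and the response is generated noiselessly so the least-squares fit recovers them exactly; odometer is in tens of thousands of miles to keep the design well scaled):

```python
import numpy as np

# Hypothetical interaction model: each color gets its own slope and intercept.
rng = np.random.default_rng(2)
n = 30
x = rng.uniform(2.0, 5.0, 3 * n)                 # odometer, 10,000-mile units
color = np.repeat(["white", "silver", "other"], n)
I1 = (color == "white").astype(float)
I2 = (color == "silver").astype(float)

# Columns: intercept, x, I1, I2, x*I1, x*I2
X = np.column_stack([np.ones_like(x), x, I1, I2, x * I1, x * I2])
true_b = np.array([16700.0, -550.0, 90.0, 295.0, -100.0, 80.0])
y = X @ true_b                                   # noiseless, for illustration

b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(b, true_b)

# Slope for white cars is b1 + b4; for "other" cars it is just b1
print(b[1] + b[4], b[1])
```

Testing H0: β4 = β5 = 0 would then ask whether the parallel-lines model of the text is adequate.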

Example: Problem 20.23

The president of a company that manufactures car seats has been concerned about the number and cost of machine breakdowns. The problem is that the machines are old and becoming quite unreliable. However, the cost of replacing them is quite high, and the president is not certain that the cost can be made up in today’s slow economy.

In Exercise 18.85, a simple linear regression model was used to analyze the relationship between welding machine breakdowns and the age of the machine. The analysis proved to be so useful to company management that they decided to expand the model to include other machines. Data were gathered for two other machines. These data as well as the original data are stored in file Xr20-23 in the following way:

Column 1: Cost of repairs

Column 2: Age of machine

Column 3: Machine (1= welding machine; 2= lathe; 3=stamping machine)

(a) Develop a multiple regression model

(b) Interpret the coefficients

(c) Can we conclude that welding machines cost more to repair than stamping machines if the machines are of the same age?

Solution:

(a) Y = cost of repairs. We want to include as independent variables both the continuous variable Age of Machine (x) and the nominal variable Machine. To include the nominal variable, we need to create two indicator variables (because the nominal variable takes on three values). Let

I1 = 1 if the machine is a welding machine

I1 = 0 if the machine is not a welding machine

I2 = 1 if the machine is a lathe

I2 = 0 if the machine is not a lathe

Our multiple regression model is

Y = β0 + β1 x + β2 I1 + β3 I2 + ε

Response Repairs

Whole Model

Actual by Predicted Plot

Summary of Fit

RSquare / 0.593778
RSquare Adj / 0.572016
Root Mean Square Error / 48.59141
Mean of Response / 340.6457
Observations (or Sum Wgts) / 60

Analysis of Variance

Source / DF / Sum of Squares / Mean Square / F Ratio
Model / 3 / 193271.36 / 64423.8 / 27.2852
Error / 56 / 132223.03 / 2361.1 / Prob > F
C. Total / 59 / 325494.39 / <.0001

Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / 119.25213 / 35.00037 / 3.41 / 0.0012
Age / 2.538233 / 0.402311 / 6.31 / <.0001
I-1 / -11.75534 / 19.70184 / -0.60 / 0.5531
I-2 / -199.3737 / 30.71301 / -6.49 / <.0001

Effect Tests

Source / Nparm / DF / Sum of Squares / F Ratio / Prob > F
Age / 1 / 1 / 93984.714 / 39.8050 / <.0001
I-1 / 1 / 1 / 840.574 / 0.3560 / 0.5531
I-2 / 1 / 1 / 99497.022 / 42.1397 / <.0001

Residual by Predicted Plot

The residual by predicted plot does not show any gross departures from a random scatter.

(b) For each additional month of age, repair costs increase on average by $2.54 for machines of the same type; welding machines cost on average $11.76 less to repair than stamping machines of the same age; lathes cost on average $199.37 less to repair than stamping machines of the same age.

(c) To test whether welding machines cost more on average to repair than stamping machines of the same age, we want to test whether the coefficient on I1 is greater than zero, i.e., we want to test H0: β2 = 0 against H1: β2 > 0. The estimated coefficient is negative, and the p-value for the two-sided test is 0.5531. Thus, there is not enough evidence to conclude that welding machines cost more on average to repair than stamping machines of the same age.
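The t-ratio and p-value used in part (c) can be reproduced from the reported estimate and standard error. A sketch using SciPy's t distribution (the numbers are taken from the output above, not recomputed from the raw Xr20-23 data):

```python
from scipy import stats

# Reported values: coefficient on I1, its standard error, and the error df
est, se, df = -11.75534, 19.70184, 56

t_ratio = est / se
p_two_sided = 2 * stats.t.sf(abs(t_ratio), df)

# Should match the output above: t ratio about -0.60, Prob>|t| about 0.5531
print(round(t_ratio, 2), round(p_two_sided, 4))
```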

Practice Problems from Chapter 20: 20.6, 20.8, 20.20, 20.24