SW983 - CODING OF CATEGORICAL VARIABLES IN MULTIPLE REGRESSION ANALYSIS

Overview

Multiple regression analysis (MR) and other correlation-based statistical techniques, such as factor analysis, are typically used with continuous variables. MR was developed as an extension of correlation (partial and semi-partial).

The techniques discussed today broaden its application to include categorical variables. The goal, then, is to learn how categorical variables can be introduced into MR analysis and how the results can be interpreted.

Recall that categorical variables can be used to measure group membership. Subjects differ from each other in type or kind rather than in degree (as they do on a continuous variable). The categorical (or discrete) variable may reflect assignment to a group (e.g. experimental or control group) or some attribute variable (e.g. sex of respondent).

Note: Methods for coding categorical variables are the same regardless of whether the data are experimental or nonexperimental, explanatory or predictive.

Coding

Coding of categorical variables is not simply a statistical, mathematical, or clerical function; to be useful, it depends on the validity of the underlying conceptualization of the groups (e.g. treatment vs. control).

It is important to remember that what we are doing throughout this chapter is comparing means across groups, just as we did using t-tests and ANOVA. The overall results will always be the same regardless of the coding method used (dummy, effect, or orthogonal), and these results will be identical to those of a t-test (two groups) or ANOVA (more than two groups).

Dummy Coding (most common)

Definition - Use of zeros and ones to denote group membership (called indicator coding in our text). 0 = not a member of the group, 1 = member of the group.

Rule: Number of dummy variables needed (k) = Number of groups (g) - 1
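
As an illustration of the rule, here is a minimal Python sketch (not from the text) that builds the k = g - 1 dummy variables for a small made-up sample. The function name dummy_code, the group labels A, B, and C, and the choice of C as the omitted group are all assumptions made for the example.

```python
def dummy_code(groups, omitted):
    """Return (names, rows): one 0/1 column per group except the omitted one.

    Hypothetical helper for illustration; not a library function.
    """
    levels = sorted(set(groups))
    if omitted not in levels:
        raise ValueError("omitted group must be one of the observed groups")
    represented = [g for g in levels if g != omitted]
    # Each subject gets a 1 in the column for its own group and 0 elsewhere;
    # members of the omitted group get 0 in every column.
    rows = [[1 if g == level else 0 for level in represented] for g in groups]
    return represented, rows


groups = ["A", "A", "B", "B", "C", "C"]  # made-up data, g = 3 groups
names, X = dummy_code(groups, omitted="C")

print(names)  # ['A', 'B']  -> k = g - 1 = 2 dummy variables
for g, row in zip(groups, X):
    print(g, row)
```

Members of the omitted group are identified by having zeros on every dummy variable, which is why only g - 1 variables are needed.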

Dummy coding allows multiple regression to be used to compare group means (i.e. a metric dependent variable and two or more groups). Results will be identical to t-tests (in the case of just two groups) or one-way ANOVA (in the case of more than two groups).

Consider the equation: Y' = b0 + b1X1 + b2X2 + ... + bkXk

Where,

Y' (predicted value for any individual) = mean for that individual's group

b0 = mean for the omitted group

bk = difference between the mean for the represented group and the mean for the omitted group = Ybar_k - Ybar_0 (where Ybar denotes a group mean)

Xk = 1 if member of the kth group, otherwise 0.
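
To make these definitions concrete, the following is a minimal sketch in Python using NumPy's least-squares routine. The scores, the group labels A, B, and C, and the choice of C as the omitted group are made up for illustration and do not come from the text.

```python
import numpy as np

# Made-up scores for three groups of three subjects each.
y = np.array([3, 4, 5, 7, 8, 9, 2, 3, 4], dtype=float)
group = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)

# Design matrix: a column of ones for b0, then one dummy variable per
# represented group. Group C is the omitted group (all zeros).
X = np.column_stack([
    np.ones(len(y)),
    (group == "A").astype(float),
    (group == "B").astype(float),
])

# Ordinary least squares estimates of b0, b1, b2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b

mean_A = y[group == "A"].mean()
mean_B = y[group == "B"].mean()
mean_C = y[group == "C"].mean()

print("b0:", round(b0, 3), "mean of omitted group:", mean_C)
print("b1:", round(b1, 3), "mean A - mean C:", mean_A - mean_C)
print("b2:", round(b2, 3), "mean B - mean C:", mean_B - mean_C)
```

With these scores the group means are 4, 8, and 3, so the intercept should equal 3 (the omitted group's mean) and the two slopes should equal 1 and 5 (the differences from the omitted group's mean).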

SSreg = SSbetween groups

SSresidual = SSwithin groups

R^2 = Eta^2 (ANOVA)

F = t^2 (two groups)

F = F from ANOVA (more than two groups)

b0 (intercept) = mean for omitted group

b1 = difference between group 1 and omitted group

b2 = difference between group 2 and omitted group etc.
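
The equivalences listed above can be checked numerically. The sketch below, which assumes made-up scores for three hypothetical groups (A, B, and C, with C omitted), computes the sums of squares once from the dummy-coded regression and once directly from the group means, as a one-way ANOVA would.

```python
import numpy as np

# Made-up scores for three groups of three subjects each.
y = np.array([3, 4, 5, 7, 8, 9, 2, 3, 4], dtype=float)
group = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)
labels = ["A", "B", "C"]

# Dummy-coded regression with C as the omitted group.
X = np.column_stack([
    np.ones(len(y)),
    (group == "A").astype(float),
    (group == "B").astype(float),
])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Regression side.
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

# ANOVA side, computed directly from the group means.
ss_between = sum(
    np.sum(group == g) * (y[group == g].mean() - y.mean()) ** 2 for g in labels
)
ss_within = sum(
    np.sum((y[group == g] - y[group == g].mean()) ** 2) for g in labels
)

k = X.shape[1] - 1          # number of dummy variables
n = len(y)
r2 = ss_reg / (ss_reg + ss_res)
eta2 = ss_between / (ss_between + ss_within)
F = (ss_reg / k) / (ss_res / (n - k - 1))

print("SSreg, SSbetween:", round(ss_reg, 3), round(ss_between, 3))
print("SSres, SSwithin:", round(ss_res, 3), round(ss_within, 3))
print("R^2, Eta^2:", round(r2, 3), round(eta2, 3))
print("F:", round(F, 3))
```

For these scores both routes should give a between-groups sum of squares of 42 and a within-groups sum of squares of 6, so R^2 = Eta^2 = 0.875 and F = 21 on 2 and 6 degrees of freedom.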

The t ratio (statistic) for each b coefficient tests the significance of the difference between the mean of the omitted group and the mean of the group represented by the dummy variable associated with that b. Note that this test is not available from one-way ANOVA unless you run contrasts.
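
For the two-group case, the t ratio for b1 can be compared with the ordinary pooled-variance t-test. The following is a minimal sketch under stated assumptions: the treatment and control scores are made up, and the standard error of b1 is obtained from the usual OLS formula (mean square residual times the inverse of X'X).

```python
import numpy as np

# Made-up scores: four treatment subjects (x = 1), four controls (x = 0).
y = np.array([6, 7, 8, 9, 3, 4, 5, 6], dtype=float)
x = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard error of b1 from the OLS covariance matrix of the coefficients.
resid = y - X @ b
df_res = len(y) - X.shape[1]
mse = resid @ resid / df_res
cov_b = mse * np.linalg.inv(X.T @ X)
t_b1 = b[1] / np.sqrt(cov_b[1, 1])

# Pooled-variance independent-samples t-test, computed by hand.
y1, y0 = y[x == 1], y[x == 0]
sp2 = (np.sum((y1 - y1.mean()) ** 2) + np.sum((y0 - y0.mean()) ** 2)) / (
    len(y1) + len(y0) - 2
)
t_pooled = (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / len(y1) + 1 / len(y0)))

# Overall F for the regression (one dummy variable, so 1 df in the numerator).
ss_reg = np.sum((X @ b - y.mean()) ** 2)
F = (ss_reg / 1) / mse

print("t for b1:", round(t_b1, 4))
print("pooled t:", round(t_pooled, 4))
print("t squared and F:", round(t_b1 ** 2, 4), round(F, 4))
```

The two t values should agree, and squaring either should reproduce the overall F (10.8 for these scores). With more than two groups, each coefficient's t ratio uses the residual mean square pooled over all groups, so it corresponds to a contrast rather than to a separate two-sample t-test.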

Effect Coding (Fixed Effects Linear Model)

Definition - Like dummy coding, except that the omitted group is assigned -1s rather than 0s in all the vectors (i.e. the coded group variables).

Remember, overall results will not differ with effect or dummy coding. However, interpretation of the regression coefficients is very different.

Regression coefficients now reflect the effects of treatments (i.e. deviations from the grand mean).

b0 = grand mean (the unweighted mean of the group means)

bk = deviation of the mean for group k from the grand mean

Y' = mean for the group

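To calculate the b for the omitted group, remember that the b's sum to zero across all groups (Σ bg = 0), so the b for the omitted group equals minus the sum of the other b's.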
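
A minimal Python sketch of effect coding follows, assuming made-up scores for three equal-sized hypothetical groups (A, B, and C, with C as the omitted group). The helper name effect_vector is an assumption for illustration, not a library function.

```python
import numpy as np

# Made-up scores for three groups of three subjects each.
y = np.array([3, 4, 5, 7, 8, 9, 2, 3, 4], dtype=float)
group = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)


def effect_vector(group, target, omitted):
    """Hypothetical helper: 1 for the target group, -1 for the omitted group, else 0."""
    return np.where(group == target, 1.0, np.where(group == omitted, -1.0, 0.0))


X = np.column_stack([
    np.ones(len(y)),
    effect_vector(group, "A", "C"),
    effect_vector(group, "B", "C"),
])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b_A, b_B = b

# The effects sum to zero, so the omitted group's effect is minus the rest.
b_C = -(b_A + b_B)

group_means = [y[group == g].mean() for g in ["A", "B", "C"]]
print("b0:", round(b0, 3), "mean of group means:", np.mean(group_means))
print("effects A, B, C:", round(b_A, 3), round(b_B, 3), round(b_C, 3))
print("predicted mean for C:", round(b0 + b_C, 3))
```

Here the group means are 4, 8, and 3, so b0 should equal 5 and the effects should be -1, 3, and -2. Adding the omitted group's effect to b0 recovers that group's mean of 3.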