Chapter 5-4. Linear Regression Adjusted Means, ANOVA, ANCOVA, Dummy Variables, and Interaction
In Stata version 11, there were some changes made on how to use categorical predictor variables, which made it easier. Both the current Stata 11 approach and the earlier approach, which will be called Stata 10, are given in this chapter, for the benefit of those with an earlier version.
For this discussion, we will use an example published by Nawata et al (2004). The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta Codebook
group urinary excretion of albumin group (U-Alb)
a = U-Alb < 30 mg/d
b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
c = 300 mg/d < U-Alb
lbm lean body mass (kg)
rmr resting metabolic rate (kJ/h/m2)
In Chapter 5, we saw that the t test is identically a simple linear regression (univariable regression, without covariates). The linear regression predicted the mean outcome (unadjusted mean) for the two groups.
A more convincing analysis, however, is to model the adjusted means, which are the group means adjusted for potential confounding variables.
Look at the Nawata (2004) article. Notice the unadjusted means for RMR are reported in their Table 1, although significance between the means is tested both in an unadjusted fashion and adjusted fashion (last two columns of Table 1). Next, notice the adjusted means are given in the legend to Figure 1.
In the authors statistical methods section, they stated they used analysis of variance (ANOVA) and analysis of covariance (ANCOVA).
Although there are special routines in Stata to fit them more easily, analysis of variance (ANOVA) is just a linear regression with categorical predictor variables, and analysis of covariance (ANCOVA) is just a linear regression with both categorical variables and continuous variables. In ANOVA terminology, categorical variables are called factors and continuous variables are called covariates.
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
Opening the rmr dataset in Stata,
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on rmr.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\rmr.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use rmr, clear
Click on the data browser icon, and we discover the group variable is alphabetic (called a string variable, for “string of characters”). Stata displays string variables with a red font.
Since arithmetic cannot be done on letters, we first need to convert the string variable group into a numeric variable,
DataCreate or change variables
Other variable transformation commands
Encode value labels from string variable
Main tab: String variable: group
Create a numeric variable: groupx
OK
encode group, generate(groupx)
To see what this did,
StatisticsSummaries, tables & tests
Tables
Twoway tables with measures of association
Row variable: group
Group variables: groupx
OK
tabulate group groupx
| groupx
group | a b c | Total
------+------+------
a | 10 0 0 | 10
b | 0 10 0 | 10
c | 0 0 12 | 12
------+------+------
Total | 10 10 12 | 32
It looks like nothing happened. Actually the variable groupx has values 1, 2, and 3. It is just that value labels were assigned that matched the original variable.
Actual value Label
1 a
2 b
3 c
To see this crosstabulation without the value labels, we use,
StatisticsSummaries, tables & tests
Tables
Twoway tables with measures of association
Row variable: group
Group variables: groupx
Check Suppress value labels
OK
tabulate group groupx, nolabel
| groupx
group | 1 2 3 | Total
------+------+------
a | 10 0 0 | 10
b | 0 10 0 | 10
c | 0 0 12 | 12
------+------+------
Total | 10 10 12 | 32
This is still a nominal scaled variable, but that is okay for the ANOVA type commands in Stata, which create indicator variables “behind the scenes”.
To test the hypothesis that the three unadjusted means are equal for rmr, similar to what Nawata did for his Table 1, we can compute a one-way ANOVA, where one-way implies one predictor variable, which is groupx.
StatisticsLinear models and related
ANOVA/MANOVA
One-way ANOVA
Response variable: rmr
Factor variable: group <- string variable OK for oneway
Output: produce summary table
OK
oneway rmr group, tabulate
| Summary of rmr
group | Mean Std. Dev. Freq.
------+------
a | 136.08333 16.940851 12
b | 132.7 19.032428 10
c | 166.5 19.019129 12
------+------
Total | 145.82353 23.604661 34
Analysis of Variance
Source SS df MS F Prob > F
------
Between groups 7990.92451 2 3995.46225 11.91 0.0001
Within groups 10396.0167 31 335.355376
------
Total 18386.9412 33 557.180036
Bartlett's test for equal variances: chi2(2) = 0.1787 Prob>chi2 = 0.915
The p value from the Analysis of Variance table, “Prob > F = 0.0001”, is what Nawata reported in Table 1 on the RMR row, P Value < 0.0001. (Nawata’s data were slightly different.)
This was a test of the hypothesis that the three group means are the same:
To get Nawata’s Adjusted P Value (for LBM), the last column of Table 1, we use an ANCOVA.
Stata 10:
Linear models and relatedANOVA/MANOVA
Analysis of variance and covariance
Dependent variable: rmr
Model: group lbm
Model variables: Categorical except the following
continuous variables: lbm
OK
anova rmr group lbm, continuous(lbm) // Stata version 10
which gives an error message:
. anova rmr group lbm, continuous(lbm) partial
no observations
r(2000);
Stata 11:
Linear models and relatedANOVA/MANOVA
Analysis of variance and covariance
Dependent variable: rmr
Model: group lbm
Model variables: Categorical except the following
continuous variables: lbm
OK
anova rmr group lbm // Stata version 11
which gives an error message:
. anova rmr group lbm
group: may not use factor variable operators on string variables
r(109);
When you see the error message “no observations”, it usually means that you tried to use a string variable where a numeric variable was required. (We used “group” instead of “groupx”, which the oneway command allows but the anova command does not.)
Stata 10: Going back to change this,
Linear models and relatedANOVA/MANOVA
Analysis of variance and covariance
Dependent variable: rmr
Model: groupx lbm
Model variables: Categorical except the following
continuous variables: lbm
OK
anova rmr groupx lbm, continuous(lbm) // Stata version 10
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
------+------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
groupx | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
------+------
Total | 18386.9412 33 557.180036
Stata 11: Going back to change this, we use our numeric variable for group, groupx, and we put a “c.” in front of our continuous variable, where the “c.” informs Stata it is a continuous variable.
Linear models and relatedANOVA/MANOVA
Analysis of variance and covariance
Dependent variable: rmr
Model: groupx c.lbm
OK
anova rmr groupx c.lbm // Stata version 11
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
------+------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
groupx | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
------+------
Total | 18386.9412 33 557.180036
We use the p value for the groupx row, which is “p=0.0001”, in approximate agreement with Nawata’s p=0.0008 (differing only because the datasets differ slightly).
In this model, we tested the null hypothesis that the three group means are the same:
versus
after adjustment for LBM (after controlling for LBM). In other words, we actually tested
versus
Since p=.0001 < 0.05, we reject H0 and accept HA , concluding that at least two of the group’s differ in their mean value.
Aside: Review of Hypothesis Testing
Recall, that the p value is the probability of obtaining the result that we did in our sample if the null hypothesis H0 is true. If H0 is true, then we observed at least two means that were different from each other just due to sampling variation. It can be attributed to sampling variation, since under H0, no differences in means existed in the population we sampled from, so the sample means differ simply because “they just happen to come out that way” by the process of taking a sample (by chance). H0, then, can be viewed as the mathematical expression consistent with sampling variation, which is what we want to eliminate as an explanation for the observed effect.
If this probability is small, p < 0.05, we conclude that sampling variation was too unlikely to be an explanation for the sample result. We then accept the opposite, the alternative hypothesis HA, that there is a difference somewhere among the means in the population we sampled from. We have to always keep in mind the dysjunctive syllogism of research (Ch 5-2, p.1) when accepting HA, however, recognizing that the p value cannot rule out bias or confounding by other confounders as alternative explanations for the observed effect.
The same results could be derived using a linear regression model.
Stata 10: We can get the anova procedure to show us the regression model that matches the ANOVA, using the regress option.
Linear models and relatedANOVA/MANOVA
Analysis of variance and covariance
Model tab: Dependent variable: rmr
Model: groupx lbm
Model variables: Categorical except the following
continuous variables: lbm
Reporting tab: Display anova and regression table
OK
anova rmr groupx lbm, continuous(lbm) regress anova // Stata 10
Source | SS df MS Number of obs = 34
------+------F( 3, 30) = 9.18
Model | 8798.55426 3 2932.85142 Prob > F = 0.0002
Residual | 9588.38691 30 319.612897 R-squared = 0.4785
------+------Adj R-squared = 0.4264
Total | 18386.9412 33 557.180036 Root MSE = 17.878
------
rmr Coef. Std. Err. t P>|t| [95% Conf. Interval]
------
_cons 142.1363 16.17229 8.79 0.000 109.1081 175.1645
groupx
1 -29.91667 7.305324 -4.10 0.000 -44.83613 -14.9972
2 -33.32727 7.660557 -4.35 0.000 -48.97222 -17.68233
3 (dropped)
lbm .5454562 .3431357 1.59 0.122 -.1553203 1.246233
------
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
------+------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
groupx | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
------+------
Total | 18386.9412 33 557.180036
Notice it reports that group 3 was dropped. What was actually dropped was the indicator variable (or dummy variable) for group 3. Group 3 became the referent and was absorbed into the intercept term (_cons). The anova command creates indicator variables for the categorical variables, uses them in the calculation, but does not add them to the variables in the data browser.
Stata 11: We can follow the anova procedure with a regress command to show us the regression model that matches the ANOVA,
regress
. anova rmr groupx c.lbm // Stata version 11
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
------+------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
groupx | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
------+------
Total | 18386.9412 33 557.180036
. regress
Source | SS df MS Number of obs = 34
------+------F( 3, 30) = 9.18
Model | 8798.55426 3 2932.85142 Prob > F = 0.0002
Residual | 9588.38691 30 319.612897 R-squared = 0.4785
------+------Adj R-squared = 0.4264
Total | 18386.9412 33 557.180036 Root MSE = 17.878
------
rmr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
groupx |
2 | -3.410606 7.654802 -0.45 0.659 -19.0438 12.22258
3 | 29.91667 7.305324 4.10 0.000 14.9972 44.83613
|
lbm | .5454562 .3431357 1.59 0.122 -.1553203 1.246233
_cons | 112.2196 15.87451 7.07 0.000 79.79954 144.6397
------
Notice it reports that group 1 was dropped. What was actually dropped was the indicator variable (or dummy variable) for group 1. Group 1 became the referent and was absorbed into the intercept term (_cons). The anova command creates indicator variables for the categorical variables, uses them in the calculation, but does not add them to the variables in the data browser.
Notice that the anova command provides a single p value to test the equality of the three groups, whereas the regression model uses two p values (one for each of two group indicators). We will see how to combine these into a single p value below.
When modeling a categorical predictor variable, one less indicator than the number of categories is used in the model (see box).
______
Why One Indicator Variable Must Be Left Out
Regression models use a data matrix which includes a column of 1’s for the constant term.
_cons group1 group2 group3 group
1 1 0 0 1
1 0 1 0 2
1 0 0 1 3
The combination of group1, group2, and group3 indicator variables predict the constant variable exactly. This is, we have perfect colinearity, where
Constant = group1 + group2 + group3
With all three indicator variables in the model, it is impossible for the constant term to have an “independent” contribution to the outcome when holding constant all of the indicator terms—the constant term is completely “dependent” upon the indicator terms.
Leaving one indicator variable out resolves this problem. That indicator becomes part of the constant (the constant being the combined referent group for all predictor variables in the model).