Chapter 2-10. Linear Regression
Just comparing groups, such as with a t test, is too simple an analysis, because the effect is likely to be confounded by one or more variables.
In the lab, we can perform experiments where we hold all conditions constant, and just vary the one condition we are interested in testing. Outside of the lab, we cannot hold these other conditions constant (although randomization approximates this). Fortunately, we can use regression models to statistically hold constant all other variables included in the model.
Regression models are merely extensions of simpler tests, such as t tests, to allow for other variables in the model besides the grouping variable. These other variables are generally called covariates.
What we need, then, is to extend the t test (or one-way ANOVA) to allow for covariates, so we can compare means while “controlling for” the covariates. That is, we hold the covariates constant (we can hold them at 0 or perhaps at their mean value).
For a continuous outcome variable, if the main predictor variable is just a group variable with two groups, then linear regression is nothing more than an equal variances independent groups t test, extended to include covariates.
The first thing we will do, then, is verify that linear regression is simply an extension of the equal variance t test.
The fev dataset is from a study of the relationship between several variables and pulmonary function (FEV). [source: example data set that accompanies Rosner (1995)]
File
Open
Find the directory where you copied the course CD:
Change to the subdirectory datasets & do-files
Single click on fev.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
BiostatsCourse", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "datasets & do-files"
use fev.dta, clear
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
Looking at the data labels,
Describe data
Describe variables in memory
OK
describe
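Contains data from fev.dta
  obs:           654
 vars:             6                          9 Aug 2004 22:31
 size:        12,426 (99.8% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              long   %12.0g                 ID
age             byte   %8.0g                  Age (years)
fev             float  %9.0g                  Forced Expiratory Volume (liters)
height          float  %9.0g                  Height (inches)
male            byte   %8.0g
smoker          byte   %18.0g      smokerlab  Smoking Status
-------------------------------------------------------------------------------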
First we compute a t test on fev by sex.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Main tab: Variable name: fev
Group variable name: male
OK
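ttest fev, by(male)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     318     2.45117    .0362111     .645736    2.379925    2.522414
       1 |     336    2.812446    .0547507    1.003598    2.704748    2.920145
---------+--------------------------------------------------------------------
combined |     654     2.63678    .0339047    .8670591    2.570204    2.703355
---------+--------------------------------------------------------------------
    diff |           -.3612766    .0663963                -.491653   -.2309002
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -5.4412
Ho: diff = 0                                     degrees of freedom =      652
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000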
Next, compute a linear regression.
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: fev
Independent variables: male
OK
regress fev male
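      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =   29.61
       Model |  21.3239848     1  21.3239848           Prob > F      =  0.0000
    Residual |  469.595849   652  .720239032           R-squared     =  0.0434
-------------+------------------------------           Adj R-squared =  0.0420
       Total |  490.919833   653  .751791475           Root MSE      =  .84867
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3612766   .0663963     5.44   0.000     .2309002     .491653
       _cons |    2.45117    .047591    51.50   0.000      2.35772     2.54462
------------------------------------------------------------------------------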
We see that the t test and the linear regression give identical results. In both analyses, the effect is a difference of 0.3612766; the signs differ only because the t test subtracts in the opposite direction (female minus male). Both have a t value of 5.44 in absolute value, with an identical p value. Notice that the Y-intercept regression coefficient (the “_cons”) has a value of 2.45117, which is the mean of the female group (the point where x = 0, that is, where the regression line crosses the y axis). Notice also that the regression coefficient for male has a value of 0.3612766, which is the mean difference in the t test output. This is the change in FEV as you change one unit on the x variable (going from 0 = female to 1 = male).
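The same check can be made from Stata's stored results instead of reading the two tables by eye. The following is only a sketch, assuming fev.dta is in memory; the scalar names (mean_f, mean_m, t_ttest) are made up for this illustration. It relies on the results that ttest leaves in r() and that regress leaves in e(), _b[], and _se[].

```stata
* t test: first group is male == 0 (female), second is male == 1
ttest fev, by(male)
scalar mean_f  = r(mu_1)
scalar mean_m  = r(mu_2)
scalar t_ttest = r(t)

* the same comparison as a regression
regress fev male

* intercept versus the female mean
display "intercept        = " _b[_cons]
display "female mean      = " mean_f

* slope versus the male-minus-female mean difference
display "slope for male   = " _b[male]
display "mean difference  = " (mean_m - mean_f)

* t statistics (opposite signs, because ttest computes female minus male)
display "t from regress   = " (_b[male]/_se[male])
display "t from ttest     = " t_ttest

* for a single two-group predictor, F should equal t squared
display "t squared        = " (t_ttest^2)
display "F from regress   = " e(F)
```

The last two lines reflect the fact that, for a two-group comparison, the regression F statistic is the square of the t statistic. This agrees with the output above: 5.4412 squared is 29.61.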
What linear regression does is fit a straight line through the points on a scatterplot. It attempts to go through the mean of the Y variable for each value of the X variable. It can do this exactly when the X variable only has two values. Thus it fits a line through the group means, 2.45 and 2.81.
Graphing this,
graph twoway (scatter fev male)(lfit fev male), yline(2.45) yline(2.81)
The regression coefficient (B) is the slope of the regression line. Recall from algebra that the slope of a straight line is computed by:
Slope = rise/run
= (y1 − y0)/(x1 − x0)
= (male mean − female mean)/(male score − female score)
= (2.812446 − 2.45117)/(1 − 0)
= (mean difference)/1
= mean difference
= 0.3612766
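The rise-over-run calculation can be reproduced directly from the data. This is a minimal sketch, assuming fev.dta is in memory; the scalar names y0 and y1 and the variable name fev_hat are chosen here for illustration only.

```stata
* group means of fev: the two points the fitted line passes through
summarize fev if male == 0
scalar y0 = r(mean)
summarize fev if male == 1
scalar y1 = r(mean)

* slope = rise/run, with x0 = 0 (female) and x1 = 1 (male)
display "slope by hand      = " ((y1 - y0)/(1 - 0))

* compare with the fitted coefficient
regress fev male
display "slope from regress = " _b[male]

* fitted values: there should be one distinct value per group,
* equal to that group's mean
predict fev_hat
tabulate male, summarize(fev_hat)
```

The final table is a quick way to confirm that, with a two-valued predictor, the fitted line passes exactly through the two group means.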
The regression coefficient, being a slope, is always interpreted as:
The amount of change in the dependent variable (outcome variable) for one unit change in the independent variable (predictor variable).
If we add other variables to the model, the regression coefficient for each variable is interpreted as:
The amount of change in the outcome variable for one unit change in the predictor variable, after controlling for (i.e., holding constant) all other predictor variables in the model.
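To make “holding constant” concrete, the sketch below fits a model with male and age, then computes predicted FEV by hand for a female and a male at the same age. It assumes fev.dta is in memory. The model with just male and age, the ages 10 and 15, and the scalar names are all chosen here for illustration; they are not one of the analyses in this chapter.

```stata
regress fev male age

* predicted FEV for a female and a male, both held at age 10
scalar f10 = _b[_cons] + _b[male]*0 + _b[age]*10
scalar m10 = _b[_cons] + _b[male]*1 + _b[age]*10

* the same comparison with both held at age 15
scalar f15 = _b[_cons] + _b[male]*0 + _b[age]*15
scalar m15 = _b[_cons] + _b[male]*1 + _b[age]*15

* the male-female difference should not depend on the age chosen
display "male - female at age 10 = " (m10 - f10)
display "male - female at age 15 = " (m15 - f15)
display "coefficient for male    = " _b[male]
```

Whatever value age is held at, its contribution cancels in the subtraction, so the difference between the two predictions should agree with the coefficient for male. That is what “controlling for age” means in this model.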
______
FEV dataset
Source: dataset that accompanies text: Rosner (1995).
Data are determinations of FEV in 654 children, ages 3-19, who were seen in the Childhood Respiratory Disease Study in East Boston, Massachusetts (Tager et al, 1979). See Appendix 1 “dataset descriptions” for further details.
______
We wish to use these data to show that smoking decreases pulmonary function, which can then be used in a public health message to convince teenagers that smoking will hurt their ability to perform well in sports.
We will compare “ever smoked” to “never smoked” on the forced expiratory volume (FEV1) outcome without controlling for covariates, and then repeat the analysis controlling for covariates, in the following order:
1) smoker
2) smoker + male
3) smoker + male + age
4) smoker + male + age + height
1) Fit a model with just smoker as a predictor,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: fev
Independent variables: smoker
OK
regress fev smoker
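      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  1,   652) =   41.79
       Model |   29.569683     1   29.569683           Prob > F      =  0.0000
    Residual |   461.35015   652  .707592255           R-squared     =  0.0602
-------------+------------------------------           Adj R-squared =  0.0588
       Total |  490.919833   653  .751791475           Root MSE      =  .84119
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoker |   .7107189   .1099426     6.46   0.000     .4948346    .9266033
       _cons |   2.566143   .0346604    74.04   0.000     2.498083    2.634202
------------------------------------------------------------------------------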
This is strange. Smoking is shown to increase FEV, contrary to what is expected.
2) add male to the model,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: fev
Independent variables: smoker male
OK
regress fev smoker male
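      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) =   41.07
       Model |  55.0054527     2  27.5027263           Prob > F      =  0.0000
    Residual |  435.914381   651  .669607344           R-squared     =  0.1120
-------------+------------------------------           Adj R-squared =  0.1093
       Total |  490.919833   653  .751791475           Root MSE      =   .8183
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoker |   .7607029    .107258     7.09   0.000     .5500895    .9713163
        male |   .3957065   .0642038     6.16   0.000     .2696349     .521778
       _cons |   2.357876   .0477359    49.39   0.000     2.264141    2.451611
------------------------------------------------------------------------------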
A popular variable selection strategy is to keep a variable in the model if it changes the coefficient of the primary exposure (smoker) by at least 10% (see box), or if the added variable is statistically significant (p < 0.05).
An NEJM example of an article that does this is Kulkarni et al (2006), who state in their Statistical Methods section,
“In the multiple regression models, confounders were included if they were significant at
a 0.05 level or they altered the coefficient of the main variable by more than 10 percent in
cases in which the main association was significant.”
“10% change in estimate” variable selection rule
Confounding is said to be present if the unadjusted effect differs from the effect adjusted for putative confounders [Rothman, 1998].
A variable selection rule consistent with this definition of confounding is the change-in-estimate method of variable selection. In this method, a potential confounder is included in the model if it changes the coefficient, or effect estimate, of the primary exposure variable (smoker in our example) by at least 10%. This method has been shown to produce more reliable models than variable selection methods based on statistical significance [Greenland, 1989].
Here, a 10% change in the coefficient (slope) for smoker from the previous model would be 0.71 × 1.1 = 0.78. We see that we almost achieved that, with a coefficient of 0.76 in the second model (a change of about 7%), so male appears to be a marginally important confounder of the smoker-FEV relationship. We could choose not to consider male a confounder if we wanted to, by using a strict 10% change criterion. On the other hand, we should leave it in to be cautious about confounding, particularly in the early stages of model development, where we have not yet decided on the final model.
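The change-in-estimate calculation can be done in Stata rather than by hand. The following is a minimal sketch, assuming fev.dta is in memory; the scalar names b_crude and b_adj are hypothetical names used only here.

```stata
* crude (unadjusted) coefficient for the primary exposure
regress fev smoker
scalar b_crude = _b[smoker]

* coefficient after adding the potential confounder
regress fev smoker male
scalar b_adj = _b[smoker]

* percent change in the exposure coefficient, relative to the crude model
display "percent change = " (100*(b_adj - b_crude)/b_crude)
```

The same three steps can be repeated for each candidate covariate by changing the variable list in the second regress command. The denominator used here is the crude coefficient; some authors divide by the adjusted coefficient instead, so the choice should be stated when the rule is reported.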
3) add age to the model,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: fev
Independent variables: smoker male age
OK
regress fev smoker male age
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  3,   650) =  337.95
       Model |  299.135337     3  99.7117789           Prob > F      =  0.0000
    Residual |  191.784497   650  .295053072           R-squared     =  0.6093
-------------+------------------------------           Adj R-squared =  0.6075
       Total |  490.919833   653  .751791475           Root MSE      =  .54319
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoker |  -.1539741   .0779766    -1.97   0.049    -.3070905   -.0008577
        male |   .3152733   .0427104     7.38   0.000     .2314063    .3991403
         age |   .2267942   .0078845    28.76   0.000     .2113121    .2422763
       _cons |   .2377708   .0802279     2.96   0.003     .0802337    .3953079
------------------------------------------------------------------------------
This is nice. After controlling for age, the smoker-FEV relationship is in the direction we originally expected, with smoking significantly reducing FEV (p = .049).
4) add height to the model,
Statistics
Linear models and related
Linear regression
Model tab: Dependent variable: fev
Independent variables: smoker male age height
OK
regress fev smoker male age height
      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  4,   649) =  560.02
       Model |   380.64028     4  95.1600701           Prob > F      =  0.0000
    Residual |  110.279553   649   .16992227           R-squared     =  0.7754
-------------+------------------------------           Adj R-squared =  0.7740
       Total |  490.919833   653  .751791475           Root MSE      =  .41222
------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smoker |  -.0872464   .0592535    -1.47   0.141    -.2035981    .0291054
        male |   .1571029   .0332071     4.73   0.000     .0918967    .2223092
         age |   .0655093   .0094886     6.90   0.000     .0468774    .0841413
      height |   .1041994   .0047577    21.90   0.000     .0948571    .1135418
       _cons |  -4.456974   .2228392   -20.00   0.000    -4.894547   -4.019401
------------------------------------------------------------------------------
After controlling for height, the smoker-FEV relationship is no longer significant, which kills our article, since our study hypothesis was a deleterious smoker-FEV relationship.
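To see how the smoker coefficient moves across the four models without scrolling through four separate outputs, the fits can be stored and tabulated side by side. This is a sketch, assuming fev.dta is in memory; the names m1 through m4 are arbitrary labels chosen here.

```stata
* fit each model quietly and store its results under a short name
quietly regress fev smoker
estimates store m1
quietly regress fev smoker male
estimates store m2
quietly regress fev smoker male age
estimates store m3
quietly regress fev smoker male age height
estimates store m4

* one column per model: coefficient, standard error, and p value,
* with the sample size and R-squared at the bottom
estimates table m1 m2 m3 m4, b(%9.4f) se(%9.4f) p(%6.3f) stats(N r2)
```

The smoker row should reproduce the coefficients reported above: 0.7107, 0.7607, -0.1540, and -0.0872. Reading across that row shows the sign change when age enters the model and the loss of significance when height enters.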
Before we accept that finding, we should get to know our data better (something you should do to begin with, actually).
Requesting a crosstabulation of age and smoker,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Main tab: Row variable: age
Column variable: smoker
Cell contents: within column relative frequencies
OK
tabulate age smoker, column
       Age |    Smoking Status
   (years) | not curre  current s |     Total
-----------+----------------------+----------
         3 |         2          0 |         2
           |      0.34       0.00 |      0.31
-----------+----------------------+----------
         4 |         9          0 |         9
           |      1.53       0.00 |      1.38
-----------+----------------------+----------
         5 |        28          0 |        28
           |      4.75       0.00 |      4.28
-----------+----------------------+----------
         6 |        37          0 |        37
           |      6.28       0.00 |      5.66
-----------+----------------------+----------
         7 |        54          0 |        54
           |      9.17       0.00 |      8.26
-----------+----------------------+----------
         8 |        85          0 |        85
           |     14.43       0.00 |     13.00
-----------+----------------------+----------
         9 |        93          1 |        94
           |     15.79       1.54 |     14.37
-----------+----------------------+----------
        10 |        76          5 |        81
           |     12.90       7.69 |     12.39
-----------+----------------------+----------
        11 |        81          9 |        90
           |     13.75      13.85 |     13.76
-----------+----------------------+----------
        12 |        50          7 |        57
           |      8.49      10.77 |      8.72
-----------+----------------------+----------
        13 |        30         13 |        43
           |      5.09      20.00 |      6.57
-----------+----------------------+----------
        14 |        18          7 |        25
           |      3.06      10.77 |      3.82
-----------+----------------------+----------
        15 |         9         10 |        19
           |      1.53      15.38 |      2.91
-----------+----------------------+----------
        16 |         6          7 |        13
           |      1.02      10.77 |      1.99
-----------+----------------------+----------
        17 |         6          2 |         8
           |      1.02       3.08 |      1.22
-----------+----------------------+----------
        18 |         4          2 |         6
           |      0.68       3.08 |      0.92
-----------+----------------------+----------
        19 |         1          2 |         3
           |      0.17       3.08 |      0.46
-----------+----------------------+----------
     Total |       589         65 |       654
           |    100.00     100.00 |    100.00
Can you see what happened in the last regression model? Why does the effect of smoking go away when we control for height?
Height is related to body size, and so to lung size. That is, teenagers can exhale more forcefully than pre-teenagers. (The smoking-FEV association is confounded by lung size, since smoking is also related to age, and thus to lung size.)
Exercise: Look at Figure 1 in the article by Stanojevic et al (2008). It is easy to see how lung size (height and age) in pediatric subjects can overwhelm any deleterious effect of secondhand smoke.
Since smoking occurs more frequently in teenage years, smoking is a surrogate for body size. That explains why we saw increased FEV with smoking in our first model. What can we do?
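One way to check this in the data is to compare smokers and nonsmokers on age and height directly. A minimal sketch, assuming fev.dta is in memory:

```stata
* sample size, mean, and standard deviation of age, height, and FEV
* within each smoking group
tabstat age height fev, by(smoker) statistics(n mean sd)

* pairwise correlations among the outcome, the exposure, and the covariates
correlate fev smoker age height
```

If smoking is acting as a surrogate for body size, the smoker group should show a higher mean age and mean height than the nonsmoker group, and smoker should be positively correlated with both age and height.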
Let’s try a restriction approach. Since the children younger than nine do not smoke in our sample, we can get a more homogeneous body size (lung capacity) if we limit the analysis to ages ≥ 9. (Restriction is also a common method to control for confounding, as all studies are careful to use inclusion-exclusion criteria when selecting a sample.)
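In Stata, restriction can be done with an if qualifier, which limits the estimation sample without dropping any observations from memory. The following is a sketch of that approach, assuming fev.dta is in memory:

```stata
* smoking status among children younger than 9
* (the crosstabulation above shows no current smokers at these ages)
tabulate smoker if age < 9

* refit the full model using only ages 9 and older
regress fev smoker male age height if age >= 9
```

According to the crosstabulation above, 215 of the 654 children are younger than 9, so the restricted model should report 439 observations.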