Stat 112 D. Small
Review of Multiple Regression (Lectures 22-27)
I. Multiple Regression Model
- Goal: Estimate the mean of Y for the subpopulation with explanatory variables X_1, ..., X_p, i.e., E(Y | X_1, ..., X_p).
- Applications: Prediction of Y given X_1, ..., X_p; estimation of the causal effect of a variable on Y controlling for confounding variables.
- Data: For each of n units, we observe a response variable Y and explanatory variables X_1, ..., X_p. So we observe (Y_i, X_i1, ..., X_ip) for i = 1, ..., n.
- Ideal Multiple Regression Model Assumptions:
- E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + ... + β_p X_p (linearity)
- Var(Y | X_1, ..., X_p) = σ² (constant variance)
- Distribution of Y for each subpopulation is normally distributed (normality)
- Observations are independent
- Estimation: The coefficients β_0, β_1, ..., β_p are estimated by choosing b_0, b_1, ..., b_p to make the sum of squared prediction errors Σ_i (Y_i - (b_0 + b_1 X_i1 + ... + b_p X_ip))² as small as possible. These are called the least squares estimates. σ is estimated by the root mean square error (RMSE) of the residuals (see the sketch at the end of this section).
- Predictions and Residuals: The predicted value of Y given X_1, ..., X_p is the estimated mean of Y for the subpopulation X_1, ..., X_p: Ŷ = b_0 + b_1 X_1 + ... + b_p X_p. The residual for observation i is the error in using Ŷ_i to predict Y_i: res_i = Y_i - Ŷ_i.
- Errors in Prediction and Residuals: Assuming the ideal multiple linear regression model holds:
- Approximately 68% of the residuals will be within one RMSE of 0. Approximately 68% of predictions of a future Y based on X_1, ..., X_p (i.e., predicting Y by Ŷ) will be off by at most one RMSE.
- Approximately 95% of the residuals will be within two RMSEs of 0. Approximately 95% of predictions of a future Y based on X_1, ..., X_p will be off by at most two RMSEs.
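A minimal sketch of the estimation and residual ideas above, assuming Python with numpy and statsmodels (the course notes themselves use JMP) and simulated data rather than a real data set:

```python
# A sketch, not from the notes: least squares fit, residuals, and RMSE,
# with simulated data standing in for a real data set.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                   # two explanatory variables X1, X2
y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=2, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()     # least squares estimates b0, b1, b2
resid = fit.resid                             # residuals Y_i - Yhat_i
rmse = np.sqrt(fit.mse_resid)                 # root mean square error (estimate of sigma)

print(fit.params, rmse)
# Roughly 68% of residuals within one RMSE of 0, roughly 95% within two RMSEs:
print(np.mean(np.abs(resid) <= rmse), np.mean(np.abs(resid) <= 2 * rmse))
```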
II. Interpreting Regression Coefficients
- Interpreting Regression Coefficients: Multiple Regression Model: E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + ... + β_p X_p. Interpretation of β_j: the increase in the mean of Y that is associated with a one unit increase in X_j (from x_j to x_j + 1), holding the other explanatory variables fixed. The interpretation of a multiple regression coefficient on a variable depends on what other explanatory variables are in the model. Example: In the Y = yield, X_1 = rainfall, X_2 = temperature example, the coefficient on rainfall was very different in the simple regression of yield on rainfall as compared to the multiple regression of yield on rainfall and temperature.
- Multiple Regression and Causal Inference: Suppose we want to figure out the causal effect of increasing X_j by one for all units in the population. If we include all of the confounding variables in the multiple regression model in addition to including X_j, the coefficient on X_j would be the causal effect of increasing X_j by one for all units in the population.
- Omitted Variable Bias formula: What happens if we omit a confounding variable from the regression? How biased will the coefficient be?
Suppose that E(Y | X_1, X_2) = β_0 + β_1 X_1 + β_2 X_2 and E(X_2 | X_1) = γ_0 + γ_1 X_1.
Omitted Variable Bias formula: the coefficient on X_1 in the regression of Y on X_1 alone (omitting X_2) is β_1* = β_1 + β_2 γ_1, so the bias from omitting X_2 is β_2 γ_1.
The formula also applies to the least squares estimates: b_1* = b_1 + b_2 g_1, where b_1* is the coefficient on X_1 in the regression of Y on X_1 alone, b_1 and b_2 are the coefficients in the regression of Y on X_1 and X_2, and g_1 is the slope of the least squares regression of X_2 on X_1 (checked numerically in the sketch below).
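A quick numerical check of the omitted variable bias formula; again a sketch with simulated data and Python/statsmodels assumed. The identity b_1* = b_1 + b_2 g_1 holds exactly for the least squares estimates:

```python
# A sketch (simulated data; Python/statsmodels assumed): the identity
# b1* = b1 + b2*g1 holds exactly for the least squares estimates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)          # x2 is correlated with x1 (a confounder)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

long = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()   # b0, b1, b2
short = sm.OLS(y, sm.add_constant(x1)).fit()                         # b0*, b1* (x2 omitted)
aux = sm.OLS(x2, sm.add_constant(x1)).fit()                          # g0, g1

b1, b2 = long.params[1], long.params[2]
g1 = aux.params[1]
print(short.params[1], b1 + b2 * g1)        # the two numbers agree
```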
III. Inference for Multiple Regression
- Hypothesis tests: Test of H_0: β_j = 0 vs. H_a: β_j ≠ 0 uses the t-statistic t = b_j / SE(b_j); reject H_0 for large |t|. Interpretation of the test: “Is there evidence that X_j is a useful predictor (improves predictions) once the other explanatory variables have been taken into account (held fixed)?” or “Is X_j associated with Y once the other explanatory variables have been taken into account?” (See the sketch at the end of this section.)
- Confidence intervals: An approximate 95% confidence interval for β_j is b_j ± 2·SE(b_j).
- Confidence interval for mean response: Range of plausible values for E(Y | X_1, ..., X_p), the mean of Y for the subpopulation with explanatory variables X_1, ..., X_p.
- Prediction interval: Range of values that is likely to contain the Y of a particular unit, not in the original sample, that has explanatory variables X_1, ..., X_p.
- Overall usefulness of predictors: For the multiple regression model E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + ... + β_p X_p, test whether any of the explanatory variables (predictors) are useful: H_0: β_1 = ... = β_p = 0 vs. H_a: at least one of β_1, ..., β_p does not equal zero. The test (called the overall F test) is carried out using the Analysis of Variance table in JMP. We reject H_0 for large values of the F statistic.
- R-squared statistic: R-squared is a measure of how good the predictions from the multiple regression model are compared to using the sample mean of Y, Ȳ (i.e., using none of the predictors), to predict Y. Similar interpretation to simple linear regression: the R-squared statistic is the proportion of the variation in Y explained by the multiple regression model.
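The inference quantities in this section as reported by a fitted least squares model; a sketch assuming Python/statsmodels and simulated data rather than JMP output:

```python
# A sketch (Python/statsmodels assumed instead of JMP; simulated data):
# t-tests, confidence intervals, the overall F test, R-squared, and intervals
# for the mean response and for a new observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
y = 1 + 0.5 * X[:, 0] + rng.normal(size=n)        # X2 is a useless predictor here

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.tvalues, fit.pvalues)                   # t-tests of H0: beta_j = 0
print(fit.conf_int())                             # 95% confidence intervals for beta_j
print(fit.fvalue, fit.f_pvalue)                   # overall F test
print(fit.rsquared)                               # R-squared

# Confidence interval for the mean response and prediction interval at a new point:
new = sm.add_constant(np.array([[0.5, -1.0]]), has_constant="add")
pred = fit.get_prediction(new)
print(pred.summary_frame(alpha=0.05))             # mean_ci_* and obs_ci_* columns
```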
IV. Diagnostics and Model Building
- Assumptions of Ideal Multiple Linear Regression model:
- E(Y | X_1, ..., X_p) = β_0 + β_1 X_1 + ... + β_p X_p (linearity)
- Var(Y | X_1, ..., X_p) = σ² (constant variance)
- Distribution of Y for each subpopulation is normally distributed (normality)
- Observations are independent
- Diagnostics for checking assumptions and remedies for violations of assumptions:
1. Tools for checking linearity: Residual plots versus predicted values and versus the explanatory variables X_1, ..., X_p. If the model is correct, there should be no pattern in these plots. A pattern in the mean of the residuals indicates a violation of linearity. Transformations of one or more of the X_j can be tried, or polynomial terms (e.g., squares) of one or more of the variables can be added, to try to remedy nonlinearity. (See the sketch after this list for the plots.)
2. Tools for checking constant variance: Residual plots versus predicted values and versus the explanatory variables X_1, ..., X_p. A pattern in the spread of the residuals (e.g., a fan or funnel pattern) indicates nonconstant variance. Transformations of the Y variable can be tried to correct nonconstant variance.
3. Tools for checking normality: Make histogram of the residuals and see if it is approximately bell shaped.
4. Tools for checking independence of observations: Plot the residuals versus the time order of the observations. If the observations are independent, there should be no pattern in the plot over time.
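A sketch of the four diagnostic plots described above, assuming Python with statsmodels and matplotlib and simulated data (the course itself produces these plots in JMP):

```python
# A sketch of the four diagnostic plots (Python with statsmodels and matplotlib
# assumed instead of JMP; simulated data).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 2))
y = 1 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(fit.fittedvalues, fit.resid)    # 1 & 2: pattern in mean or spread?
axes[0, 0].set_title("Residuals vs. predicted values")
axes[0, 1].scatter(X[:, 0], fit.resid)             # repeat for each explanatory variable
axes[0, 1].set_title("Residuals vs. X1")
axes[1, 0].hist(fit.resid, bins=20)                # 3: roughly bell shaped?
axes[1, 0].set_title("Histogram of residuals")
axes[1, 1].plot(fit.resid, marker="o")             # 4: residuals in time order
axes[1, 1].set_title("Residuals vs. time order")
plt.tight_layout()
plt.show()
```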
- Model Building:
1. Make a scatterplot matrix of the variables. Decide whether to transform any of the explanatory variables. Check for obvious outliers.
2. Fit a tentative model.
3. Check residual plots for whether the assumptions of the multiple regression model are satisfied. Look for outliers and influential points.
4. Consider fitting a richer model with interactions or curvature. See if the extra terms can be dropped.
5. Make changes to the model and repeat steps 2-4 until an adequate model is found.
- Transformations for Explanatory Variables in Step 1 of Model Building: In deciding whether to transform an explanatory variable x, we consider two features of the plot of the response y vs. the explanatory variable x:
1. Is there curvature in the relationship between y and x? This suggests that we transform x using a transformation chosen by Tukey’s Bulging Rule.
2. Are most of the x values “crunched” together and a few very spread apart? This will lead to several points being very influential. When this is the case, it is best to transform x to make the x values more evenly spaced and less influential. If the x values are positive, the log transformation is a good idea (illustrated in the sketch below).
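A small illustration of the second point, with simulated data and Python assumed:

```python
# A small illustration (simulated data, Python assumed): positive, right-skewed
# x values are far more evenly spread after a log transformation.
import numpy as np

rng = np.random.default_rng(4)
x = rng.lognormal(mean=0.0, sigma=1.5, size=100)   # "crunched" with a few huge values
log_x = np.log(x)

print(np.percentile(x, [25, 50, 75, 100]))         # a few points far from the rest
print(np.percentile(log_x, [25, 50, 75, 100]))     # much more evenly spaced
```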
V. Outliers and Influential Observations
- Residual Outliers (called outliers in the direction of the scatterplot for simple linear regression): These are observations with residuals of large magnitude (absolute value), i.e., the observation’s y value is unusual given its explanatory variable values.
- Outliers in the explanatory variables: If an observation has unusual explanatory variable values, it is said to have high leverage. Leverage is a quantity computed by JMP. When there are p explanatory variables, an observation is said to have high leverage if its leverage is greater than 2p/n, where n is the number of observations.
- Influential observations: The least squares method is not resistant to outliers. An observation is influential if removing it markedly changes the estimated coefficients of the regression model. There are two sources of an observation’s being influential: high leverage and a residual of large magnitude. In general, an observation that has high leverage and a residual that is not of small magnitude will often be influential. An observation that is a residual outlier but has small leverage will not usually be influential unless its residual is of very large magnitude. Cook’s Distance can be used to find observations that are influential. An observation has large influence if its Cook’s Distance is greater than 1. (See the sketch at the end of this section.)
- Strategy for dealing with influential observations: If an influential observation has high leverage, omit the observation and report the conclusions for a reduced range of the explanatory variables, not including the explanatory variable values of the influential observation. If an influential observation does not have high leverage, the observation cannot just be removed; we can report results with and without the influential observation.
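A sketch of the leverage and Cook’s Distance checks above, assuming Python/statsmodels and simulated data; the 2p/n and Cook’s Distance > 1 cutoffs are the ones given in these notes:

```python
# A sketch (simulated data; Python/statsmodels assumed, not JMP):
# leverage and Cook's Distance with the cutoffs used in these notes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, p = 100, 2
X = rng.normal(size=(n, p))
X[0, :] = [6.0, -6.0]                      # unusual explanatory variable values (high leverage)
y = 1 + X[:, 0] + X[:, 1] + rng.normal(size=n)
y[0] += 30.0                               # and an unusual y value, so the point is influential

fit = sm.OLS(y, sm.add_constant(X)).fit()
infl = fit.get_influence()
leverage = infl.hat_matrix_diag
cooks_d = infl.cooks_distance[0]

print(np.where(leverage > 2 * p / n)[0])   # high leverage: leverage > 2p/n
print(np.where(cooks_d > 1)[0])            # large influence: Cook's Distance > 1
```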
VI. Specially Constructed Explanatory Variables
- Interactions: There is an interaction between X_1 and X_2 if the impact of an increase in X_1 on Y depends on the level of X_2. To incorporate interaction in the multiple regression model, we add the explanatory variable X_1·X_2 (the product of X_1 and X_2) to the multiple regression model. There is evidence of an interaction between X_1 and X_2 if the coefficient on X_1·X_2 is significant (t-test has p-value < .05).
Example of an interaction model for the pollution data:
E(mortality | precip, educ, nonwhit, logHC) = β_0 + β_1 precip + β_2 educ + β_3 nonwhit + β_4 logHC + β_5 (precip·logHC).
The amount by which the mean of Y (mortality) increases for a one unit increase in logHC, holding precip, educ and nonwhit fixed, depends on the level of precip: it equals β_4 + β_5·precip (see the sketch below).
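A sketch of fitting an interaction model, assuming Python with pandas and the statsmodels formula interface; the data frame below is simulated as a stand-in for the pollution data, not the actual course data set:

```python
# A sketch (Python with pandas and the statsmodels formula interface assumed;
# the data below are simulated stand-ins, not the actual pollution data set).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 60
df = pd.DataFrame({
    "precip": rng.uniform(10, 60, n),
    "educ": rng.uniform(9, 13, n),
    "nonwhit": rng.uniform(0, 40, n),
    "logHC": rng.normal(1.0, 0.7, n),
})
df["mortality"] = (900 + 2 * df.precip - 10 * df.educ + 3 * df.nonwhit
                   + 20 * df.logHC - 0.5 * df.precip * df.logHC
                   + rng.normal(0, 20, n))

# precip:logHC is the product (interaction) term; its t-test checks for interaction.
fit = smf.ols("mortality ~ precip + educ + nonwhit + logHC + precip:logHC", data=df).fit()
b = fit.params
# Increase in mean mortality for a one unit increase in logHC depends on precip:
print(b["logHC"] + b["precip:logHC"] * 10)   # at precip = 10
print(b["logHC"] + b["precip:logHC"] * 40)   # at precip = 40
```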
- Polynomial terms for curvature: To model a curved relationship between y and x, we can add squared (and cubic or higher order) terms as explanatory variables: E(Y | X) = β_0 + β_1 X + β_2 X². Fit this as a multiple regression with two explanatory variables, X_1 = X and X_2 = X². The coefficients are not directly interpretable: the change in the mean of Y that is associated with a one unit increase in X depends on X; going from x to x + 1 changes the mean of Y by β_1 + β_2(2x + 1).
To test whether the multiple regression model with X and X² provides better predictions than the multiple regression model with just X, use the p-value of the t-test on the X² coefficient (see the sketch below).
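A sketch of fitting the quadratic model as a multiple regression with explanatory variables X and X², assuming Python/statsmodels and simulated data:

```python
# A sketch (Python/statsmodels assumed, simulated data): fit the quadratic model
# as a multiple regression with the two explanatory variables X and X**2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 5 + 2 * x - 0.3 * x**2 + rng.normal(0, 1, 100)

design = sm.add_constant(np.column_stack([x, x**2]))   # two explanatory variables
fit = sm.OLS(y, design).fit()
b0, b1, b2 = fit.params
print(fit.pvalues[2])            # t-test on the X**2 coefficient: is the curvature needed?

# Change in the mean of Y for a one unit increase in X, from x0 to x0 + 1:
x0 = 4.0
print(b1 + b2 * (2 * x0 + 1))    # b1*((x0+1) - x0) + b2*((x0+1)**2 - x0**2)
```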
- Second order model: A model that includes all squares and interactions of the original explanatory variables, e.g., E(Y | X_1, X_2) = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1² + β_4 X_2² + β_5 X_1 X_2.
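A sketch of fitting a second order model in two explanatory variables, assuming Python with pandas and the statsmodels formula interface and simulated data:

```python
# A sketch (Python with pandas and the statsmodels formula interface assumed;
# simulated data): the second order model in two explanatory variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80)})
df["y"] = (1 + df.x1 + df.x2 + 0.5 * df.x1**2 - 0.4 * df.x2**2
           + 0.8 * df.x1 * df.x2 + rng.normal(scale=0.5, size=80))

# All squares and interactions of the original explanatory variables:
fit = smf.ols("y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2", data=df).fit()
print(fit.params)
```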