Study Guide for Regression
Questions to Answer at the Beginning
1. What is the logical cause-and-effect relationship to be tested?
2. What are the important explanatory variables?
3. Looking at a scatterplot of Y on X, is the relationship curved?
4. Are any of the explanatory variables in categories?
5. Is the dependent variable continuous without clumps or limits?
Basic Steps
- Formulate the model to be tested as Y = a + b X + e
- Include all of the important explanatory variables. Exclude extraneous ones.
- Adjust the model to take account of curvature, for example, with logs or a polynomial.
- Create k – 1 binaries for each categorical explanatory variable with k groups.
- Estimate the regression including the binaries and polynomial terms.
- Evaluate the estimate considering the adjusted R², the ANOVA table, and the t-ratios on the coefficients.
- Interpret the coefficients in terms of the units of the X’s and Y and draw conclusions in the context of the problem.
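The binary and curvature steps above can be sketched as follows. This is a minimal illustration, not a prescribed method: the helper name make_binaries, the region labels, and the X values are all hypothetical, and the squared term stands in for whatever curvature adjustment the data call for.

```python
def make_binaries(values, baseline=None):
    """Return (kept_levels, rows) with k - 1 indicator columns.

    The baseline group gets no column of its own; its effect is
    absorbed by the intercept.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    kept = [lv for lv in levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, rows


# Hypothetical data: region has k = 3 groups, so 2 binaries are created.
region = ["north", "south", "west", "south", "north"]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

kept, binaries = make_binaries(region, baseline="north")

# Each design row: intercept, X, X squared (curvature), then the binaries.
design = [[1.0, xi, xi ** 2] + b for xi, b in zip(x, binaries)]
```

Each row of design is one observation ready for a regression routine. The coefficient on a binary is then read as the difference in Y between that group and the omitted baseline group, holding the other variables constant.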
The Estimate
Think of the estimated model in equation terms.
Y = a + b X1 + c X2 + d X3 + e
The coefficient, b, is the slope of the estimated equation with respect to X1, given the levels of the other explanatory variables.
The t-ratio tests the hypothesis that the associated slope coefficient is zero in the population, holding constant the other variables in the model.
The adjusted R² is the share of the variation in Y explained by the regression, adjusted for the number of explanatory variables in the model.
The ANOVA table tests the hypothesis that all of the slope coefficients are jointly zero in the population.
Forecast a value of Y for a given set of X’s.
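As a rough sketch of the quantities described above, the following computes a forecast from estimated coefficients, along with R², adjusted R², and the ANOVA F-statistic. The function names and all numbers are hypothetical, and the formulas are the standard textbook ones, which require more observations than estimated coefficients (n > k + 1).

```python
def forecast(coefs, xs):
    """Predicted Y = a + b*X1 + c*X2 + ...; coefs[0] is the intercept a."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], xs))


def fit_summary(y, fitted, k):
    """Return (R-squared, adjusted R-squared, F) for k slope coefficients."""
    n = len(y)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)                # total variation in Y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # unexplained variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
    return r2, adj_r2, f_stat


# Hypothetical estimate: Y = 2.0 + 0.5 X1 - 1.0 X2 + 3.0 X3
y_hat = forecast([2.0, 0.5, -1.0, 3.0], [4.0, 1.0, 2.0])

# Hypothetical observed and fitted values from a one-variable regression.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, adj_r2, f_stat = fit_summary(y, fitted, k=1)
```

The adjusted R² is never larger than the unadjusted R², and the gap widens as explanatory variables are added without a matching gain in fit.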
Advanced Steps (location in e.stat)
- If the model is quadratic, Y = a + b X + c X² + e, find the value of X at which Y is extreme: X* = -b/(2c). Judge whether the extreme value is plausible. (21.09)
- Does the model include all of the important explanatory variables? If not, find measures of the omitted variables and include them. (22.07)
- Plot the residuals and look for heteroscedasticity. As a test, regress the squared (or absolute) residuals on one or more of the X’s. If heteroscedasticity is present, adjust by recasting the model in per capita terms or in some other fashion. (22.06)
- If the data are time series, compute the Durbin-Watson d and look for autocorrelation. When autocorrelation is found, adjust the model by adding a trend or a lagged dependent variable, or by using another method. (25.05)
- Are the explanatory variables correlated among themselves? To test for multicollinearity, compute an auxiliary regression of one X on the others. If the explanatory variables are collinear, consider omitting an explanatory variable or making other adjustments. (22.08)
- Are the residuals distributed normally? Look at the scatterplot of the residuals or the normal probability plot to decide. If the residuals are not normal, consider recasting the model or using limited dependent variable techniques discussed in more advanced works. (22.09)
- Does causality run from the dependent variable to one or more explanatory variables as well as the other way? If so, the model is not identified. Consider a simultaneous equation estimation method discussed in more advanced works. (22.10)
- Examine the partial correlation plots and the studentized residuals, looking for outliers. If outliers are present, identify the observations and consider modifying the model. If partial correlation plots show curvature, design the model to allow for curvature. (2.11 and 22.17)
- Use a Chow test to determine whether observations from subgroups might be pooled to estimate a regression. (22.12)
- Use a test of linear restrictions to determine whether several coefficients are jointly zero in the population, whether variables might be combined, or other constraints. (22.14)
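Two of the checks in the list above reduce to short formulas, sketched here under stated assumptions. The function names and inputs are hypothetical, the quadratic is assumed to take the form Y = a + b X + c X², and the residuals are assumed to be in time order.

```python
def turning_point(b, c):
    """X* = -b / (2c), the X at which Y = a + b X + c X^2 is extreme."""
    if c == 0:
        raise ValueError("no turning point: the squared term is zero")
    return -b / (2 * c)


def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical quadratic estimate: Y = 10 + 4 X - 0.5 X^2
x_star = turning_point(4.0, -0.5)

# Hypothetical residuals that alternate in sign (negative autocorrelation).
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Hypothetical residuals that never change (extreme positive autocorrelation).
d_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The statistic d lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. For the turning point, a negative c makes X* a maximum and a positive c makes it a minimum.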