AGR206Ch07MLR2.doc Revised: 10/18/05
Chapter 7. Model development.
7:1 Situations that require variable selection.
In general, the explanatory variables used in multiple linear regression are not orthogonal, and the portions of the variance of Y explained by the different X's overlap. The pattern of overlap can be complex when many variables are involved, and the estimated values of the parameters may change considerably as variables are added or removed. When the covariance among explanatory variables is strong, little of the total variation of Y can be assigned exclusively to any one predictor, and the estimated effect of a unit change in any X depends on which other predictors are included in the model. This situation is typified by cases where the whole model is highly significant, but no single X-variable gets assigned a significant amount of type III sum of squares. When the goal of the analysis is to obtain meaningful regression coefficients and to build a model of how the various predictors affect the response variable, the model must be refined to address the negative impact of collinearity.
Developing the model involves determining which variables will be included, considering all X's as well as quadratic terms if necessary. Model development should be guided by knowledge of the process, which should set limits on the kinds of models and effects considered. A purely empirical approach that simply tries to explain as much variance in Y as possible will not be productive: it is usually possible to build models that explain almost all of the variance in Y, but such models are extremely complicated and cumbersome, and typically have no meaning. For example, a 4th-order polynomial will probably fit the response of plant yield to fertilization rate very well, but its parameters will have no biological meaning, whereas a nonlinear equation can be more revealing and can include parameters that estimate the soil's ability to supply the nutrient in question.
Although careful formulation and construction of the model is not always needed to achieve the goals of the analysis, as indicated below, it is common practice to eliminate poor predictors for parsimony and to focus on the factors that are important. Careful model construction is crucial in exploratory observational experiments where collinearity is a problem, and when superfluous variables that are costly to measure must be eliminated. In the following sections I present an overview of the situations in which variable selection is necessary, then explain the effects of collinearity. The final section presents some of the methods used to perform the variable selection.
7:1.1 Types of experiments.
In general, experiments can be classified into four categories:
1. Designed experiments without covariates measured.
2. Designed experiments where one or more covariates are measured.
3. Observational experiments to test predetermined hypotheses.
4. Exploratory observational experiments.
7:1.1.1 Designed experiments.
Designed experiments can have explanatory variables whose values are assigned to experimental units by the researcher, in a manipulative fashion. Values for the X variables are usually chosen such that effects of different factors are orthogonal and can be fully separated by analysis of variance. In this case, all types of sums of squares are equal. By adding the type III SS for all factors one obtains the total sum of squares of the model or regression.
A similar situation takes place in designed observational experiments, where all combinations of levels or values of the explanatory variables are observed. In this case, the sums of squares have the same properties as in a manipulative designed experiment. The only difference is that in an observational designed experiment one cannot make inferences about cause-and-effect relationships.
Neither type of designed experiment has collinearity problems. No variable selection procedure is necessary in this case, although researchers frequently drop the non-significant variables or terms from the model before reporting the results.
Example of designed manipulative experiment: factorial combinations of levels of N (50, 100, 150 kg/ha) and P (20, 40, 80 kg/ha) are randomly applied to field plots. Each combination is applied to 5 plots. Crop yield is measured as the response variable and regressed on the amount of P and N applied. As an exercise you can think about the difference between regressing yield on N and P as continuous variables, and performing an ANOVA with N and P as categorical variables. Write the models for both situations, count the number of parameters in each model, determine which model will always have better R2 and explain why. If you have an interest in this and want to know more, ask your instructor.
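The exercise can be sketched in code. The following fragment is a minimal illustration, not part of the original handout: it uses statsmodels, and the yield-generating equation and noise level are assumptions chosen only so the two fits can be compared. Because the categorical (cell-means) model contains the continuous linear model as a special case, its R2 can never be smaller.
    # Sketch: continuous regression vs. categorical ANOVA for the N x P factorial above.
    # The data are simulated; the response equation below is an assumption for illustration.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    N_levels, P_levels, reps = [50, 100, 150], [20, 40, 80], 5
    rows = [(n, p) for n in N_levels for p in P_levels for _ in range(reps)]
    d = pd.DataFrame(rows, columns=["N", "P"])
    # Hypothetical response: linear in N, diminishing returns in P, plus noise.
    d["yield_"] = 2 + 0.02 * d["N"] + 0.5 * np.sqrt(d["P"]) + rng.normal(0, 0.5, len(d))

    # N and P as continuous predictors: intercept + 2 slopes = 3 parameters.
    m_cont = smf.ols("yield_ ~ N + P", data=d).fit()
    # N and P as categorical factors with interaction: 9 cell means = 9 parameters,
    # so this model's R2 is always at least as large as the continuous model's.
    m_cat = smf.ols("yield_ ~ C(N) * C(P)", data=d).fit()
    print(m_cont.rsquared, m_cat.rsquared)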
Example of designed observational experiment: N and P levels are measured in a population of fields. All areas are classified into categories of N and P levels and 5 plots within each combination of N and P levels are selected, where yield is later measured.
7:1.1.2 Designed experiment with covariates.
In this type of experiment, additional explanatory variables are measured on each experimental unit. Thus, the final model can include both the explanatory variables whose values were set by the design and explanatory variables whose values were determined a posteriori, by measuring each plot. Because the values of the covariates were not set or selected by the experimenter, they are not orthogonal to the other effects or variables in the model. However, regular ANCOVA requires that the covariates be entered first and considered together by using the type I SS, so no variable selection is necessary.
7:1.1.3 Observational experiments to test hypotheses.
When observational experiments are performed to test predetermined hypotheses, no variable selection is necessary. Variables to include in the model are selected prior to the experiment, and are part of the hypothesis.
7:1.1.4 Exploratory observational experiments.
Exploratory observational experiments are those in which a series of response and explanatory variables are measured in a set of random and independent experimental units. When the goal of the experiment is to develop predictive equations to be used within the pattern of variance-covariance of X-variables in the sample, no variable selection is necessary. The model may, and probably will, have “incorrect” partial regression coefficients, but the differences between the actual value and the estimated value of the different coefficients tend to cancel out and the overall prediction for the response variable is still valid.
When MLR is applied for descriptive purposes, there is no need to reduce the number of variables. As indicated above, prediction and estimation of Yhat do not require a reduction in the number of variables, unless one wishes to drop superfluous ones to reduce the cost of using the model. When one is interested in estimating the b's and in understanding the system, careful model development and variable selection are necessary. The coefficients are sensitive to deletions, and removal of too many variables may bias the MSE and the estimated b's.
Thus, the main situation in which variable selection is very important is when one wants to generate, on the basis of a sample, models and hypotheses for further testing. The main goals are to identify the most important variables and to obtain good estimates of the corresponding partial regression coefficients. The interest is in gaining understanding of the process by which the values of the response variable are generated.
It is very important to consider that when building a model and selecting from a number of potential explanatory variables, the final model will typically have an MSE that underestimates the true variance of the error. This happens because, in the process of building a model, one can choose among the alternative models those that have a low MSE just due to random variation among samples. Because of the selection process, all the probability levels associated with the model are "tainted" and no longer have their apparent meaning. Correct probability levels have to be obtained from new data for which specific hypotheses about the values of the parameters have been formulated. In any case, the probability values obtained after variable selection are useful for internal comparisons within the same data set.
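The "tainting" of probability levels by selection is easy to demonstrate by simulation. The sketch below is an illustration added here, not part of the handout; the sample size, number of predictors, and number of repetitions are assumed values. It screens ten pure-noise predictors and keeps the one with the smallest p-value; even though no predictor is related to Y, the selected variable appears "significant" at the 0.05 level far more than 5% of the time.
    # Sketch: apparent significance of the best predictor selected from pure noise.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n, p, reps = 30, 10, 2000
    selected_significant = 0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)                       # y is pure noise, unrelated to every X
        pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
        selected_significant += (min(pvals) < 0.05)  # p-value of the selected "best" variable
    print("apparent type I error rate after selection:", selected_significant / reps)
    # With 10 nearly independent screens this is roughly 1 - 0.95**10, about 0.40, not 0.05.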
7:2 Effects of collinearity.
"To reiterate, severe collinearity means that there is no unique solution to the regression equation, and no amount of mathematical trickery can solve the problem."
(Philippi 1993)
Collinearity is a problem for two reasons. First, when some explanatory variables are very closely related to others, the determinant of the X'X matrix becomes very close to zero. Under this condition, minor changes in the calculations, even those introduced by finite computer precision, can cause large differences in the results. Thus, the regression results are numerically very unstable and may vary between operating systems that use different precision, or because different statistical packages perform the operations in a different order.
Second, the estimates obtained in the analysis become highly dependent on the specific values obtained in the sample. Minor changes in the random sample of data can lead to major changes in the values of the regression coefficients, with reversals of sign being common. This variability is reflected in the variance of the estimates, and it makes interpretation of the coefficients very difficult.
The variance inflation factor (VIF) is a statistic that represents the factor by which the variance of a regression coefficient estimate is increased by the near-linear dependence among the explanatory variables. The VIF for explanatory variable Xi is
VIFi = 1 / (1 - R2i)
where R2i is the coefficient of determination obtained when regressing Xi on the rest of the explanatory variables.
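A direct translation of this definition into code can serve as a check on packaged output. The sketch below is an illustration, assuming the predictors are held in a pandas DataFrame X; the example data are fabricated. Each VIFi is computed by regressing Xi on the remaining X's, exactly as in the formula above.
    # Sketch: VIF computed directly from its definition, VIFi = 1 / (1 - R2i).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def vif(X: pd.DataFrame) -> pd.Series:
        out = {}
        for col in X.columns:
            others = sm.add_constant(X.drop(columns=col))
            r2 = sm.OLS(X[col], others).fit().rsquared   # R2i from regressing Xi on the rest
            out[col] = 1.0 / (1.0 - r2)
        return pd.Series(out, name="VIF")

    # Example with two nearly collinear predictors (fabricated data):
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    X = pd.DataFrame({"X1": x1, "X2": x1 + rng.normal(scale=0.1, size=100)})
    print(vif(X))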
When the goal of the MLR is to predict expected values of the response variable Y, collinearity is not a serious problem if the following conditions are met:
- The correlations observed reflect the true correlation structure among the predictors in the real system being studied. This correlational structure is a property of the system, and will continue to exist in any random sample unless a stratified observational or a manipulative experiment is conducted.
- The system continues to operate in the same manner as when the data were collected, so the correlations among the X's remain the same.
- Predictions are restricted to the multivariate range spanned by the data.
Collinearity poses serious problems when the goal of the analysis is to understand the system, to identify important variables, and to produce meaningful estimates of the regression coefficients. Although remedial measures such as ridge and principal components regression can ameliorate the behavior of the regression coefficients, they will not eliminate the uncertainty caused by the collinearity. The only potentially complete solution is to get better data, for which it will probably be necessary to conduct controlled manipulative experiments where the ranges and combinations of values of the predictors are selected carefully.
The effects of collinearity are illustrated with two simulated data sets in which all characteristics of the samples are kept the same, except that in one case the predictors are highly correlated, whereas in the other they are almost orthogonal.
Figure 7-1. Characteristics of the variables in the data set without collinearity. The true model is Y= 0.2 X1 + 0.8 X2 +e.
In the example, the letter r in the variable names identifies the data set with collinearity. As shown in Figures 7-1 and 7-2, both sets of variables have almost exactly the same properties. The minor differences between the two sets of variables are due to random sampling.
Figure 7-2. Characteristics of the variables in the data set with collinearity. The true model is Y= 0.2 X1 + 0.8 X2 +e.
In spite of the nearly identical univariate properties, the regression equations obtained and the variances of the parameter estimates are very different (Figure 7-3). As an exercise, you can verify that the variances of the parameters in the data set with collinearity equal those of the data set without collinearity multiplied by the corresponding VIF. The greater variance from sample to sample in the correlated data set can be demonstrated by simulated repeated random sampling. This process is implemented in the file xmpl_ColinDemo.xls.
By using the Custom Test … menu one can calculate Yhat for the following values of (X1,X2) and (X1r,X2r): (10,10) and (10,-10). Although the parameters of the two estimated models are very different, and their variances are also very different, predictions of Y for values of the predictors within the observed combinations (e.g. X1r=X2r=10) are equally good in both models. However, for values of the predictors outside the observed combinations (10,-10), the model with collinearity produces very poor predictions. The point (10,-10) constitutes an extrapolation outside the scope of the model.
Figure 7-3. Effects of collinearity on the estimation of regression coefficients. In both cases the MSE = 0.61, and the standard deviation of the expected response at X1 (and X1r) = 0.5 and X2 (and X2r) = 0.5 is 0.02.
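The repeated-sampling demonstration in xmpl_ColinDemo.xls can also be sketched in Python. In the fragment below the sample size, predictor spread, error SD, and the two correlation levels (0 vs. 0.98) are assumptions chosen only to reproduce the qualitative behavior, not values taken from the handout's data set. It reports how much the coefficients, and the predictions at (10,10) and (10,-10), vary from sample to sample: with collinearity the coefficients and the extrapolated prediction become very unstable, while the prediction within the observed pattern of the predictors remains stable.
    # Sketch: repeated random sampling under orthogonal (rho = 0) and collinear (rho = 0.98) predictors.
    import numpy as np

    rng = np.random.default_rng(7)
    true_b = np.array([0.2, 0.8])                     # true model: Y = 0.2 X1 + 0.8 X2 + e

    def simulate(rho, n=30, reps=2000, sd_x=5.0, sd_e=0.8):
        cov = (sd_x ** 2) * np.array([[1.0, rho], [rho, 1.0]])
        coefs = np.empty((reps, 2))
        preds = np.empty((reps, 2))                   # Yhat at (10, 10) and at (10, -10)
        for i in range(reps):
            X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
            y = X @ true_b + rng.normal(0.0, sd_e, size=n)
            b = np.linalg.lstsq(X, y, rcond=None)[0]  # no-intercept fit, matching the true model
            coefs[i] = b
            preds[i] = [b @ np.array([10.0, 10.0]), b @ np.array([10.0, -10.0])]
        return coefs, preds

    for rho in (0.0, 0.98):
        coefs, preds = simulate(rho)
        print(f"rho = {rho}: SD(b1) = {coefs[:, 0].std():.2f}, SD(b2) = {coefs[:, 1].std():.2f}, "
              f"SD(Yhat at (10,10)) = {preds[:, 0].std():.2f}, SD(Yhat at (10,-10)) = {preds[:, 1].std():.2f}")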
7:3 All possible regressions.
The process of selecting variables or building the model by analyzing all possible regressions consists of choosing a selection criterion and then applying it to all possible models that can be built from the available data. For example, if we have 5 predictors and we choose the adjusted coefficient of determination as the selection criterion, we regress Y on each of the five X's separately, then on all possible pairs, triplets, etc., until we have an Radj2 for each possible model. A blind selection process will simply pick the model with the highest Radj2. A better selection process involves inspecting the models whose values of the selection criterion are very close, because one of them may make much more biological or mechanistic sense than the others. Variable selection and model building is an inherently subjective process in which information external to the data set can carry a lot of weight. This is illustrated by the example of the mad scientist who studied hearing in spiders. Ask Emilio to tell you about it. This is a mandatory exercise.
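A small sketch of the all-possible-regressions approach is given below. It is an illustration only: the five predictors and the data-generating model are fabricated, and statsmodels is assumed to be available. The code enumerates all 31 subsets of the five X's, computes Radj2 for each, and lists the top candidates, which should then be screened for mechanistic sense rather than picking the single highest Radj2 blindly.
    # Sketch: all possible regressions with adjusted R2 as the selection criterion.
    from itertools import combinations
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    X = pd.DataFrame(rng.normal(size=(60, 5)), columns=["X1", "X2", "X3", "X4", "X5"])
    y = 1 + 2 * X["X1"] - 1.5 * X["X3"] + rng.normal(0, 1, 60)   # hypothetical true model

    results = []
    for k in range(1, 6):
        for subset in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            results.append((subset, fit.rsquared_adj))

    # Show the five best candidate models by Radj2.
    for subset, r2adj in sorted(results, key=lambda t: -t[1])[:5]:
        print(subset, round(r2adj, 3))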
One alternative to performing all possible regressions is stepwise regression, which is no longer used as much as in the past. Stepwise regression approximates the result of all possible regressions without calculating every possible model. This technique is useful when very many variables are involved in the modeling.