Chapter 9 Building the Regression Model I: Model Selection and Validation
9.3 Criteria for Model Selection:
Two opposing criteria for selecting a model:
- Include as many covariates as possible so that the fitted values are reliable.
- Include as few covariates as possible so that the costs of obtaining information and of monitoring are kept low.
Note:
There is no unique statistical procedure for selecting the best regression model.
Note:
Common sense, basic knowledge of the data being analyzed, and considerations related to the invariance principle (shift and scale invariance) can never be set aside.
Motivating example:
The “Hald" regression data
y / x1 / x2 / x3 / x4
78.5 / 7 / 26 / 6 / 60
74.3 / 1 / 29 / 15 / 52
104.3 / 11 / 56 / 8 / 20
87.6 / 11 / 31 / 8 / 47
95.9 / 7 / 52 / 6 / 33
109.2 / 11 / 55 / 9 / 22
102.7 / 3 / 71 / 17 / 6
72.5 / 1 / 31 / 22 / 44
93.1 / 2 / 54 / 18 / 22
115.9 / 21 / 47 / 4 / 26
83.8 / 1 / 40 / 23 / 34
113.3 / 11 / 66 / 9 / 12
109.4 / 10 / 68 / 8 / 12
There are 13 observations in total.
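For readers who wish to reproduce the computations that follow, here is a minimal sketch (assuming numpy is available) that stores the data in the arrays y, X, and the sample size n; the later sketches assume these names.

```python
# The "Hald" data: y is the response, the columns of X are x1, x2, x3, x4.
import numpy as np

y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([[ 7, 26,  6, 60],
              [ 1, 29, 15, 52],
              [11, 56,  8, 20],
              [11, 31,  8, 47],
              [ 7, 52,  6, 33],
              [11, 55,  9, 22],
              [ 3, 71, 17,  6],
              [ 1, 31, 22, 44],
              [ 2, 54, 18, 22],
              [21, 47,  4, 26],
              [ 1, 40, 23, 34],
              [11, 66,  9, 12],
              [10, 68,  8, 12]], dtype=float)
n = len(y)  # 13 observations
```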
Several methods which can be used are:
(a) using the value of $R_p^2$;
(b) using the value of $MS_{Res,p}$, the mean residual sum of squares;
(c) using Mallows' $C_p$ statistic;
(d) using the $AIC_p$ and $SBC_p$ criteria;
(e) using the $PRESS_p$ criterion.
(a) $R_p^2$:
Example (continued):
In the “Hald” data, there are 4 covariates, $x_1$, $x_2$, $x_3$, and $x_4$. All possible models are divided into 5 sets:
Set A: $\binom{4}{0} = 1$ possible model (no covariates).
Set B: $\binom{4}{1} = 4$ possible models (one covariate).
Set C: $\binom{4}{2} = 6$ possible models (two covariates).
Set D: $\binom{4}{3} = 4$ possible models (three covariates).
Set E: $\binom{4}{4} = 1$ possible model (all four covariates).
Total: $1 + 4 + 6 + 4 + 1 = 2^4 = 16$ models.
For every set, the one or two models with the largest $R_p^2$ are picked. They are the following:
Sets / Models / $R_p^2$
Set B / $y = \beta_0 + \beta_2 x_2 + \varepsilon$ / 0.666
Set B / $y = \beta_0 + \beta_4 x_4 + \varepsilon$ / 0.675
Set C / $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ / 0.979
Set C / $y = \beta_0 + \beta_1 x_1 + \beta_4 x_4 + \varepsilon$ / 0.972
Set D / $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$ / 0.982
Set D / $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \varepsilon$ / 0.982
Set E / $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$ / 0.982
Principle based on $R_p^2$:
A model with a large $R_p^2$ and a small number of covariates should be a good choice, since a large $R_p^2$ implies reliable fitted values and a small number of covariates reduces the costs of obtaining information and monitoring.
Example (continued):
Based on the above principle, two models,
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ and $y = \beta_0 + \beta_1 x_1 + \beta_4 x_4 + \varepsilon$,
are sensible choices!!
Note:
$x_2$ and $x_4$ are highly correlated; the correlation coefficient is $-0.973$. Therefore, it is not surprising that the two models have very close values of $R_p^2$.
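The $R_p^2$ values above can be reproduced with a short all-subsets loop. This is a sketch only, assuming the arrays y, X, and n defined earlier; the helper fit_sse is a name introduced here for illustration and is reused in the later sketches.

```python
# All-subsets R^2: R_p^2 = 1 - SSE_p / SST for each of the 16 models.
from itertools import combinations
import numpy as np

def fit_sse(cols):
    """Residual sum of squares of the least-squares fit with an intercept
    and the covariates X[:, j] for j in cols."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    return resid @ resid

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
for k in range(5):                       # sets A..E: 0, 1, 2, 3, 4 covariates
    for cols in combinations(range(4), k):
        print([c + 1 for c in cols], round(1 - fit_sse(cols) / sst, 3))

# The note above can also be checked directly:
print(np.corrcoef(X[:, 1], X[:, 3])[0, 1])   # corr(x2, x4), about -0.973
```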
(b) Mean residual sum of squares $MS_{Res,p}$:
For a model containing $p$ parameters (the intercept plus $p-1$ covariates), $MS_{Res,p} = SSE_p/(n-p)$, where $SSE_p$ is the residual sum of squares.
A useful result:
As more and more covariates are added to an already overfitted model, the mean residual sum of squares will tend to stabilize and approach the true value of $\sigma^2$, provided that all important variables have been included. That is, $E(MS_{Res,p}) = \sigma^2$ whenever the model contains all important variables.
Example (continued):
Again, we compute $MS_{Res,p}$ for all 16 possible models. We have the following table:
Sets / $MS_{Res,p}$ / Average
Set B / 115.06 ($x_1$), 82.39 ($x_2$), 176.31 ($x_3$), 80.35 ($x_4$) / 113.53
Set C / 5.79 ($x_1,x_2$), 122.71 ($x_1,x_3$), 7.48 ($x_1,x_4$), 41.54 ($x_2,x_3$), 86.89 ($x_2,x_4$), 17.59 ($x_3,x_4$) / 47.00
Set D / 5.35 ($x_1,x_2,x_3$), 5.33 ($x_1,x_2,x_4$), 5.65 ($x_1,x_3,x_4$), 8.20 ($x_2,x_3,x_4$) / 6.13
Set E / 5.98 ($x_1,x_2,x_3,x_4$) / 5.98
[Figure: plot of the average $MS_{Res,p}$ within each set against $p$, the number of parameters (including the intercept); the averages fall from 113.53 to 47.00 and then stabilize near 6 as $p$ increases.]
Principle based on $MS_{Res,p}$:
A model whose mean residual sum of squares is close to the estimate of $\sigma^2$ (the horizontal line in the plot) and which has the fewest covariates might be a sensible model.
Example:
The estimate of $\sigma^2$ could be taken to be 6. The model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
is sensible since its mean residual sum of squares is 5.79 (close to 6) and its number of covariates is small compared with the other models whose $MS_{Res,p}$ is close to 6.
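As a check on the table and the set averages above, here is a sketch of the $MS_{Res,p}$ computation (reusing y, X, n, and fit_sse from the earlier sketches):

```python
# MS_Res,p = SSE_p / (n - p), with p = (number of covariates) + 1 parameters.
from itertools import combinations

for k in range(1, 5):
    values = {cols: fit_sse(cols) / (n - (k + 1))
              for cols in combinations(range(4), k)}
    avg = sum(values.values()) / len(values)
    print(f"{k} covariate(s): average MS_Res = {avg:.2f}")
    # k = 1..4 should reproduce the set averages 113.53, 47.00, 6.13, 5.98.
```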
(c) Mallows' $C_p$:
Suppose the full model (containing all possible covariates) is
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \varepsilon$.
Then,
$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n - 2p),$$
where $n$ is the sample size, $p$ is the number of parameters, $SSE_p$ is the residual sum of squares from a model containing $p$ parameters, and $\hat{\sigma}^2$ is the mean residual sum of squares from the model containing all possible covariates.
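As a quick arithmetic check of the formula, take the model with covariates $x_1$ and $x_2$ ($p = 3$): its $SSE_p$ is $5.79 \times (13 - 3) = 57.9$ (recovering $SSE_p$ from the $MS_{Res,p}$ table in (b)), and $\hat{\sigma}^2 = 5.98$ from the full model, so

$$C_p = \frac{57.9}{5.98} - (13 - 2 \cdot 3) \approx 9.68 - 7 = 2.7,$$

which matches the value reported in the table below.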
Intuition of Mallows' $C_p$:
Suppose $\hat{\sigma}^2$, the mean residual sum of squares from the full model (containing all possible covariates), estimates $\sigma^2$ accurately, and suppose the model containing $p$ parameters,
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon$,
is the true model. Then $MS_{Res,p} = SSE_p/(n-p)$, the mean residual sum of squares from the $p$-parameter model, should also estimate $\sigma^2$ accurately; that is, $E(SSE_p) = (n-p)\sigma^2$, so $SSE_p/\hat{\sigma}^2 \approx n - p$. Thus,
$$C_p \approx (n - p) - (n - 2p) = p.$$
Also, the mean residual sum of squares from an overfitted model (one containing the true model) still estimates $\sigma^2$, so its $C_p$ is close to $p$ as well.
Thus, for any model containing the true one, the point $(p, C_p)$ will fall close to the line $C_p = p$.
Note:
For the full model itself, $C_p = p$ exactly, since $SSE_p/\hat{\sigma}^2 = n - p$ by the definition of $\hat{\sigma}^2$.
Principle based on Mallows' $C_p$:
The principle for selecting the best regression equation is to plot $C_p$ versus $p$ for every possible model. Then, among the models whose points fall close to the line $C_p = p$, choose some with fewer covariates.
Example (continued):
For the motivating example, we calculate $C_p$ for all 16 possible models. We then have the following table:
Sets / $C_p$
Set A / 443.2 (intercept only)
Set B / 202.5 ($x_1$), 142.5 ($x_2$), 315.2 ($x_3$), 138.7 ($x_4$)
Set C / 2.7 ($x_1,x_2$), 198.1 ($x_1,x_3$), 5.5 ($x_1,x_4$), 62.4 ($x_2,x_3$), 138.2 ($x_2,x_4$), 22.4 ($x_3,x_4$)
Set D / 3.0 ($x_1,x_2,x_3$), 3.0 ($x_1,x_2,x_4$), 3.5 ($x_1,x_3,x_4$), 7.3 ($x_2,x_3,x_4$)
Set E / 5.0 ($x_1,x_2,x_3,x_4$)
The point for the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, $(p, C_p) = (3, 2.7)$, is close to the line $C_p = p$, and the model also has fewer parameters than the other candidates near the line. Therefore, we recommend this model as a sensible choice.
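A sketch of the $C_p$ computation for all 16 models (reusing y, X, n, and fit_sse from the earlier sketches):

```python
# Mallows' C_p = SSE_p / sigma2_hat - (n - 2p), with sigma2_hat taken from
# the full model (p = 5 parameters).
from itertools import combinations

sigma2_hat = fit_sse((0, 1, 2, 3)) / (n - 5)
for k in range(5):
    for cols in combinations(range(4), k):
        p = k + 1                          # parameters, including intercept
        cp = fit_sse(cols) / sigma2_hat - (n - 2 * p)
        print([c + 1 for c in cols], round(cp, 1))
# (x1, x2) gives C_p of about 2.7, close to the line C_p = p = 3.
```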
(d) $AIC_p$ and $SBC_p$ criteria:
$$AIC_p = n\ln(SSE_p) - n\ln(n) + 2p, \qquad SBC_p = n\ln(SSE_p) - n\ln(n) + p\ln(n).$$
Principle based on the $AIC_p$ and $SBC_p$ criteria:
Search for models with small values of $AIC_p$ and $SBC_p$.
Example (continued):
For the motivating example, we calculate $AIC_p$ and $SBC_p$ for all 16 possible models. We then have the following tables:
$AIC_p$:
Set B / 63.51 ($x_1$), 59.17 ($x_2$), 69.09 ($x_3$), 58.85 ($x_4$)
Set C / 25.41 ($x_1,x_2$), 65.11 ($x_1,x_3$), 28.74 ($x_1,x_4$), 51.03 ($x_2,x_3$), 60.62 ($x_2,x_4$), 39.86 ($x_3,x_4$)
Set D / 25.02 ($x_1,x_2,x_3$), 24.97 ($x_1,x_2,x_4$), 25.73 ($x_1,x_3,x_4$), 30.57 ($x_2,x_3,x_4$)
Set E / 26.93 ($x_1,x_2,x_3,x_4$)

$SBC_p$:
Set B / 64.64 ($x_1$), 60.30 ($x_2$), 70.22 ($x_3$), 59.98 ($x_4$)
Set C / 27.11 ($x_1,x_2$), 66.81 ($x_1,x_3$), 30.44 ($x_1,x_4$), 52.73 ($x_2,x_3$), 62.32 ($x_2,x_4$), 41.56 ($x_3,x_4$)
Set D / 27.28 ($x_1,x_2,x_3$), 27.23 ($x_1,x_2,x_4$), 27.99 ($x_1,x_3,x_4$), 32.83 ($x_2,x_3,x_4$)
Set E / 29.75 ($x_1,x_2,x_3,x_4$)
According to the two selection criteria, the models
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (smallest $SBC_p$)
and
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_4 x_4 + \varepsilon$ (smallest $AIC_p$)
are sensible models.
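A sketch of the $AIC_p$ and $SBC_p$ computations under the forms given above (reusing y, X, n, and fit_sse from the earlier sketches):

```python
# AIC_p = n ln(SSE_p) - n ln(n) + 2p;  SBC_p = n ln(SSE_p) - n ln(n) + p ln(n).
from itertools import combinations
import numpy as np

for k in range(1, 5):
    for cols in combinations(range(4), k):
        p = k + 1
        sse = fit_sse(cols)
        aic = n * np.log(sse) - n * np.log(n) + 2 * p
        sbc = n * np.log(sse) - n * np.log(n) + p * np.log(n)
        print([c + 1 for c in cols], round(aic, 2), round(sbc, 2))
# The minimum AIC is attained by (x1, x2, x4) and the minimum SBC by (x1, x2).
```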
(e) $PRESS_p$ (prediction sum of squares) criterion:
$$PRESS_p = \sum_{i=1}^{n} \left( y_i - \hat{y}_{i(i)} \right)^2,$$
where $\hat{y}_{i(i)}$ is the predicted value of $y_i$ from the current model (the model with $p$ parameters) fitted to the $n-1$ observations that exclude observation $i$. For example, in the “Hald” data, if the current model is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$,
then $\hat{y}_{1(1)}$ is the predicted value of $y_1 = 78.5$ obtained by using the remaining 12 observations (observations 2 through 13)
to fit the model.
Principle based on the $PRESS_p$ criterion:
Search for models with small values of $PRESS_p$.
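A sketch of $PRESS_p$ by literal leave-one-out refitting, exactly as in the definition above (reusing y, X, and n from the earlier sketches; the function name press is introduced here for illustration):

```python
# PRESS_p = sum over i of (y_i - yhat_{i(i)})^2, where yhat_{i(i)} comes from
# refitting the model with observation i left out.
import numpy as np

def press(cols):
    total = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]          # drop observation i
        Z = np.column_stack([np.ones(n - 1)] + [X[keep, c] for c in cols])
        beta = np.linalg.lstsq(Z, y[keep], rcond=None)[0]
        zi = np.concatenate(([1.0], X[i, list(cols)]))  # row for observation i
        total += (y[i] - zi @ beta) ** 2
    return total

print(press((0, 1)))   # PRESS for the model with covariates x1 and x2
```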