STAT 360: Regression Analysis
Handout #2: Reducing the Unexplained Variation through Conditioning

Example 2.1: Consider the following data that has been collected from my STAT 110 students over the past semesters.

Response Variable: Hair Length (mm)
Variables under investigation (i.e. independent variables):

Gender
Height (inches)

A snippet of the data is provided here.

A univariate analysis of the response ignores, or marginalizes over, the effect of all other variables that may be under study. In this case, it is said that the marginal distribution of the response is being summarized.

A summary of the marginal distribution ignores all other variables.

Marginal Distribution of Hair Length (mm)

Historically, the mean and variance have been considered to be sufficient in describing a distribution. [In fact, the mean and variance are said to be sufficient statistics for a wide class of distributions.] Certainly, information is lost when all data values are reduced to a single quantity that is supposedly describing a typical value (i.e. mean or average) and a single quantity being used to quantify the variation (i.e. standard deviation or variance).

In this class, we will use the E() or expectation notation to denote the mean and Var() to denote the variance. The following notation will be used to describe the mean and variance of the response Hair Length.

Mean = E(Hair Length)

Variance = Var(Hair Length)

When these quantities are estimated from the data, the commonly used “hat” notation will be used.

$\hat{E}(\text{Hair Length})$: Estimate of mean from data

$\widehat{Var}(\text{Hair Length})$: Estimate of variance from data
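As a rough illustration of estimating these two quantities from data, the sketch below uses Python's standard `statistics` module. The hair lengths are made-up values chosen for illustration (they are not the STAT 110 data), and the names `mean_hat` and `var_hat` are simply labels for the "hat" estimates.

```python
from statistics import mean, variance

# Hypothetical hair lengths in mm (made-up values, not the class data)
hair_length = [25, 40, 60, 150, 300, 35, 220, 80]

# Estimate of E(Hair Length): the sample average
mean_hat = mean(hair_length)

# Estimate of Var(Hair Length): the sample variance (divides by n - 1)
var_hat = variance(hair_length)

print(f"Estimated mean:     {mean_hat:.2f} mm")
print(f"Estimated variance: {var_hat:.2f} mm^2")
```

Note that `statistics.variance` uses the n - 1 denominator; `statistics.pvariance` would divide by n instead.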

Explained vs. Unexplained Variation

Consider the following Wiki entry for “Unexplained” Variation.

Consider next these concepts in the context of our previous example. There are several reasons why people have different hair lengths (e.g. specific hair styles, days since last haircut, gender, etc.). If such variables are known to have an effect on hair length, then some of the inherent variation in hair length can be explained. However, if the aforementioned variables have no effect on hair length or if the aforementioned variables are ignored all together, then all the variation in hair length is said to be unexplained.

Explained Variation: Variation in a response that can be attributed to one or more other variables
Unexplained Variation: Variation in a response that remains after considering other variables

Comment: When the marginal distribution of a response is being considered, then all the variation in the response is said to be unexplained.

Example 2.2 Consider the following data regarding the top baseball players over a particular season. This data has been sorted by Batting Average, which is one possible measure of a baseball player’s performance or value to a team.

Explained Variation in Context

In baseball, the batting average is computed as follows

Batting Average = Hits / At Bats

For example, for the first observation the batting average is computed as 159/459 = 0.346, and for Mauer the batting average is computed as 155/486 = 0.319. The differences in the batting average values can be completely explained by players having different numbers of hits and at bats.
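The two batting averages quoted here can be checked with a short sketch. The function name `batting_average` is a hypothetical helper; the hits and at-bats values are the ones given in the text.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Batting average is hits divided by at bats."""
    return hits / at_bats

# Values quoted in the handout
first_player = batting_average(159, 459)
mauer = batting_average(155, 486)

print(round(first_player, 3))  # 0.346
print(round(mauer, 3))         # 0.319
```

Because the batting average is a deterministic function of hits and at bats, knowing those two variables leaves no variation in batting average unexplained.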

Unexplained Variation in Context

Next, consider the variation in Salary. There appear to be factors other than those given here that have an impact on salary. For example, Mauer has the top salary of the players listed here, but his performance, as measured by batting average, is not the highest. The differences in salary cannot be completely explained by At Bats, Hits, and Batting Average. Thus, much of the variation in salary remains unexplained.

Measuring the Unexplained Variation

Consider again the Hair Length data presented in Example 2.1. The unexplained variation is the inherent variation that exists in hair lengths. Up to this point, we have only considered the marginal distribution of hair length (i.e. we have ignored other variables such as gender, hair style, etc.), thus all the variation in hair length is said to be unexplained.

[Figure panels: Dotplot of Hair Length; Adding the average to the display; Unexplained variation]

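The variance of the response is used to quantify the unexplained variation.

$\widehat{Var}(\text{Hair Length}) = \dfrac{\sum_{i=1}^{n} (\text{residual}_i)^2}{n-1}$

where

$\text{residual}_i = \text{Hair Length}_i - \hat{E}(\text{Hair Length})$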

Questions:

What does it mean if the residual value is positive? Negative?

What does it mean if the residual is small (i.e. close to zero)?

Suppose the residual value for the particular data point is positive. Is the mean an over-estimate or under-estimate for this particular data value? Explain.

The residual value for the 1st observation in the Hair Length dataset.

On the graphic below, identify the residuals for the first three observations.

All residual values are shown on the graphic below. Excel can be used to easily obtain the residual value for all observations in your data.

[Figure panels: Residuals shown graphically; Obtaining residual values in a simple spreadsheet]

Notice that when all the residual values are added up the total is 0. This happens whenever the average is used to estimate the mean function (i.e. the E(Hair Length) quantity). Visually, we can see this happens because the positive residuals cancel out the negative residuals.
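The cancelling-out effect can be seen numerically. The following is a minimal sketch with made-up hair lengths (not the class data); each residual is taken to be the observed value minus the sample average.

```python
from statistics import mean

# Hypothetical hair lengths in mm (made-up values, not the class data)
hair_length = [25, 40, 60, 150, 300, 35, 220, 80]

# The sample average is used as the estimate of E(Hair Length)
y_bar = mean(hair_length)

# Residual = observed value minus the estimated mean
residuals = [y - y_bar for y in hair_length]

# Positive residuals lie above the average, negative ones below it
total = sum(residuals)

for y, r in zip(hair_length, residuals):
    print(f"observed = {y:>3}  residual = {r:+.2f}")
print(f"Sum of residuals: {total}")
```

With other data the sum may differ from zero by a tiny floating-point rounding error, so comparing against a small tolerance is safer than testing for exact equality.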

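Task: Prove

$\sum_{i=1}^{n} \text{residual}_i = 0$

when the sample mean is used as an estimate of the mean function, i.e.

$\hat{E}(\text{Hair Length}) = \dfrac{1}{n}\sum_{i=1}^{n} \text{Hair Length}_i$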

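There are two mathematical options to avoid the cancelling-out effect when summing the residuals: square each residual, or take its absolute value.

Total Variation = $\sum_{i=1}^{n} (\text{residual}_i)^2$ or Total Variation = $\sum_{i=1}^{n} |\text{residual}_i|$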

The first method -- squaring the residuals -- is more generally known as squared error loss or L2 loss. The absolute or L1 loss function is the name given when an absolute value is used to get rid of the negatives.

Wiki entry for Loss Function

Comments:

Whenever an L2 loss function is used, the mean is the quantity that minimizes the total loss (the sum of squared residuals) across all observations.
If an L1 loss function is used, then the median minimizes the total loss (the sum of absolute residuals) across all observations.
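These two comments can be checked by brute force. The sketch below uses made-up hair lengths (not the class data) and tries every whole-millimetre candidate value as the "typical value", recording which one gives the smallest total loss. The helper names `l2_loss` and `l1_loss` are hypothetical.

```python
from statistics import mean, median

# Hypothetical hair lengths in mm; an odd number of values so the median is unique
hair_length = [25, 40, 60, 150, 300, 35, 230]

def l2_loss(center, data):
    """Total squared error loss when `center` is used as the typical value."""
    return sum((y - center) ** 2 for y in data)

def l1_loss(center, data):
    """Total absolute error loss when `center` is used as the typical value."""
    return sum(abs(y - center) for y in data)

# Search every whole-millimetre candidate between 0 and 300
candidates = range(0, 301)
best_l2 = min(candidates, key=lambda c: l2_loss(c, hair_length))
best_l1 = min(candidates, key=lambda c: l1_loss(c, hair_length))

print("Mean:", mean(hair_length), " best L2 candidate:", best_l2)
print("Median:", median(hair_length), " best L1 candidate:", best_l1)
```

A grid search is used only to keep the demonstration transparent; it does not replace the calculus argument for why the mean minimizes squared error.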

In the following, Excel was used to obtain the total variation using the L1 and L2 loss functions.

Total Variation = $\sum_{i=1}^{n} |\text{residual}_i|$ = 20106

Total Variation = $\sum_{i=1}^{n} (\text{residual}_i)^2$ = 3836975

Questions:

Explain how the total variation for squared error loss can be computed from the variance. Note that JMP can return the numerator of the variance, which it labels as the Corrected SS value.

Verify that when the median is used instead of the mean in the absolute loss, the total variation is reduced.

Concept of Conditioning

As stated previously, there are several variables that may influence a person’s hair length. For example, gender is likely to help explain some of the variation in hair lengths (i.e. women tend to have longer hair than men). If you take gender into consideration when analyzing hair length, then it is said that the conditional distribution of hair length given gender is being considered.

Conditional Distribution – the distribution of the response variable conditioning on (i.e. taking into consideration) one or more other variables

Visually, the conditional distribution simply means that the distribution of the response will be considered for each gender, separately.

[Figure panels: Marginal Distribution; Conditional Distributions: dividing the response into subgroups]

The following graph communicates the difference between the marginal distribution and the two conditional distributions.

Akin to our investigation of the marginal distribution, the mean and variance are sufficient quantities for describing the conditional distributions. Identify each of these quantities for the marginal and conditional distributions in the table below.

Quantity / Distribution of Hair Length / Hair Length | Female / Hair Length | Male
Mean / ______ / ______ / ______
Variance / ______ / ______ / ______
n / n = ______ / n = ______ / n = ______

The normal kernel density estimates for the marginal and conditional distributions are shown here.

From a modeling perspective, a significant advantage of considering the conditional distributions is the possible reduction in the unexplained variation. Consider the substantial reduction in these quantities for the hair length data.
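To make the reduction concrete, the sketch below compares the total unexplained variation (squared error loss) in the marginal distribution with the combined total from the two conditional distributions. The (gender, hair length) pairs are made-up values, not the class data, and `sum_sq` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical (gender, hair length in mm) pairs; not the class data
data = [("F", 300), ("F", 220), ("F", 150), ("F", 260),
        ("M", 25), ("M", 40), ("M", 60), ("M", 35)]

def sum_sq(values):
    """Sum of squared residuals about the mean of `values`."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)

# Marginal distribution: gender is ignored, one overall mean
marginal = sum_sq([y for _, y in data])

# Conditional distributions: split by gender, one mean per group
groups = {}
for gender, y in data:
    groups.setdefault(gender, []).append(y)
conditional = sum(sum_sq(values) for values in groups.values())

print(f"Unexplained variation, marginal:    {marginal}")
print(f"Unexplained variation, conditional: {conditional}")
print(f"Reduction from conditioning:        {marginal - conditional}")
```

Each group's residuals are measured from that group's own mean, which is why the conditional total can never exceed the marginal total.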

Questions:

Given that a squared error loss function was used, give the formula for how the total unexplained variation is computed for Females. How is it computed for Males?

What is the difference between the total unexplained variation in the marginal distribution and the total unexplained variation in the two conditional distributions?

Explain in context, what would it mean if this difference were zero?

If the difference between the unexplained variation in the marginal distribution and conditional distributions is close to zero, then which of the following is true?

The variable being conditioned on (i.e. Gender) is useful in understanding the response variable.
The variable being conditioned on (i.e. Gender) is *not* useful in understanding the response variable.

Explain.

On the graph below, sketch a possible scenario in which the difference between the unexplained variation in the marginal and conditional distributions is small.

Next, sketch a scenario in which the difference between the unexplained variation in the marginal and conditional distributions is very large (i.e. the total variation in the conditional distributions is very small, say 0).

Proportion of Variation Being Explained

The amount of unexplained variation that can potentially be explained by conditioning depends on the total amount in the marginal distribution. For the hair length data above, the reduction was

3836975 – 1209170 = 2627805

which is a substantial amount considering that the total unexplained variation in the marginal distribution was 3836975. As a result, the proportion of unexplained variation removed by considering the conditional distributions is typically used as a measure of the overall usefulness of the conditioning variable(s). This proportion is commonly referred to as the coefficient of determination or $R^2$.

Comment: $R^2$ is often misinterpreted in practice. The correct interpretation, albeit not necessarily eloquent, is the proportion of the unexplained variation in the marginal distribution of the response that is explained by the independent variables being considered by the model.

Notation:

Sum of Squares Total (i.e. SSTotal) is commonly used to identify the total unexplained variation in the marginal distribution of the response

Sum of Squares Error (i.e. SSError) is commonly used to identify the total amount of unexplained variation in the conditional distributions

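The coefficient of determination using this common notation is expressed as

$R^2 = \dfrac{SSTotal - SSError}{SSTotal}$

or more simply

$R^2 = 1 - \dfrac{SSError}{SSTotal}$

The coefficient of determination or $R^2$ value for our situation would then be computed as

$R^2 = \dfrac{3836975 - 1209170}{3836975} = \dfrac{2627805}{3836975} \approx 0.685$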
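The arithmetic can be reproduced with a few lines of Python. This sketch assumes the usual definition, the reduction in unexplained variation divided by SSTotal, and uses the two totals reported earlier for the hair length data; the variable names are illustrative.

```python
# Totals reported in the handout for the hair length data
ss_total = 3836975   # unexplained variation in the marginal distribution
ss_error = 1209170   # unexplained variation in the conditional distributions

# Proportion of the marginal unexplained variation removed by conditioning
r_squared = (ss_total - ss_error) / ss_total

# Equivalent "more simply" form
r_squared_alt = 1 - ss_error / ss_total

print(f"Reduction: {ss_total - ss_error}")
print(f"R^2 = {r_squared:.3f}")
```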

Question: What is the correct interpretation of this quantity in context?

Wiki entry for Coefficient of Determination

Example 2.3 Consider once again the Impact Crater dataset.

Goal: Determine which of the following conditional distributions is more advantageous to consider if the goal is to reduce the total unexplained variation in Diameter, the response variable of interest here.

Diameter | SandType

Diameter | ProjectileType

Compute the following quantities for the various distributions under investigation here.

Marginal Distribution of Diameter

Quantity / Marginal Distribution of Diameter
Mean / ______
Variance / ______
n / n = ______
Total Unexplained Variation / ______

Conditional Distribution of Diameter | SandType

Conditional Distributions for Sand Type

Quantity / Diameter | Coarse / Diameter | Fine
Mean / ______ / ______
Variance / ______ / ______
n / n = ______ / n = ______
Unexplained Variation / ______ / ______

Total Unexplained Variation = ______

Conditional Distribution of Diameter | ProjectileType

Conditional Distributions for Projectile Type

Quantity / Diameter | Glass / Diameter | Steel / Diameter | Wood
Mean / ______ / ______ / ______
Variance / ______ / ______ / ______
n / n = ______ / n = ______ / n = ______
Unexplained Variation / ______ / ______ / ______

Total Unexplained Variation = ______
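The same bookkeeping can be automated when several conditioning variables are being compared. The sketch below is one possible way to do it, using made-up diameters and labels rather than the actual Impact Crater data; `sum_sq` and `proportion_explained` are hypothetical helper names.

```python
from statistics import mean

def sum_sq(values):
    """Sum of squared residuals about the mean of `values`."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)

def proportion_explained(response, labels):
    """Proportion of SSTotal removed by conditioning on `labels`."""
    groups = {}
    for label, y in zip(labels, response):
        groups.setdefault(label, []).append(y)
    ss_total = sum_sq(response)                           # marginal distribution
    ss_error = sum(sum_sq(v) for v in groups.values())    # conditional distributions
    return (ss_total - ss_error) / ss_total

# Hypothetical crater diameters and labels (not the Impact Crater data)
diameter = [40, 42, 55, 57, 70, 72]
sand = ["Coarse", "Fine", "Coarse", "Fine", "Coarse", "Fine"]
projectile = ["Wood", "Wood", "Glass", "Glass", "Steel", "Steel"]

r2_sand = proportion_explained(diameter, sand)
r2_projectile = proportion_explained(diameter, projectile)

print(f"Proportion explained by sand type:       {r2_sand:.3f}")
print(f"Proportion explained by projectile type: {r2_projectile:.3f}")
```

In this made-up data the group means differ far more across projectile types than across sand types, so conditioning on projectile type removes much more of the unexplained variation.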

Questions:

What proportion of the total unexplained variation is accounted for by considering the Sand type?

What proportion of the total unexplained variation is accounted for by considering the Projectile type?

Which conditional distribution accounts for more of the unexplained variation in Diameter? Discuss.

Consider the following visual depictions of the conditional distributions. From these visual displays, explain why we’d expect the conditional distribution of Diameter | ProjectileType to account for more of the unexplained variation in Diameter.

[Figure panels: Conditional Distribution of Diameter | SandType; Conditional Distribution of Diameter | ProjectileType]