Chapter 10: One- and Two-Sample Tests of Hypotheses

Chapter 11: Simple Linear Regression and Correlation

We will examine simple linear regression briefly in this chapter. If you are interested in learning more about it, you can take STAT 450, a semester-long course on the topic.

11.1: Introduction to Linear Regression

Suppose we are interested in estimating the average GPA of all students at UNL. How would we do this? (Assume we do not have access to any student records.)

a) Define the random variable: let Y denote student GPA

b) Define the population: all UNL students

c) Define the parameter that we are interested in: μ = population mean GPA

d) Take a representative sample from the population: suppose a random sample of 100 students is selected

e) Calculate the statistic that estimates the parameter: ȳ = sample mean GPA

f) Make an inference about the value of the parameter using statistical science: construct confidence intervals or hypothesis tests using the sample mean and sample standard deviation
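Step f) can also be carried out in a few lines of code. Below is a minimal Python sketch (the notes use Excel; Python is my supplement here, and the summary numbers n = 100, ȳ = 3.00, s = 0.50 are hypothetical, not from the notes) of a large-sample confidence interval for μ:

```python
import math
from statistics import NormalDist  # standard library, no extra installs

# Hypothetical summary statistics (NOT from the notes):
# n = 100 students, sample mean GPA 3.00, sample standard deviation 0.50
n, ybar, s = 100, 3.00, 0.50

# With n this large, the t critical value t_{0.025, 99} (~1.984) is close
# to the standard normal value, which keeps us in the standard library.
z = NormalDist().inv_cdf(0.975)       # ~1.96
half_width = z * s / math.sqrt(n)     # margin of error
ci = (ybar - half_width, ybar + half_width)
print(ci)                             # roughly (2.90, 3.10)
```

The interval says the population mean GPA is plausibly between about 2.90 and 3.10 under these made-up numbers.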

The diagram below demonstrates these steps. Note that not all GPAs could be shown in the diagram.

What factors may be related to GPA?

1) High school (HS) GPA

2) ACT score

3) Involvement in activities

4) Etc.

Suppose we are interested in the relationship between college and HS GPA and we want to use HS GPA to predict college GPA. How could we do this? (Assume we do not have access to any student records.)

Use similar steps as on page 11.1, but now with regression models.

Data shown as: (HS GPA, College GPA)

Example: HS and College GPA (HS_college_GPA.xls)

A random sample of 20 UNL students is taken producing the data set below (data is different from above).

Student / X (HS GPA) / Y (College GPA)
1 / 3.04 / 3.10
2 / 2.35 / 2.30
3 / 2.70 / 3.00
4 / 2.05 / 1.90
5 / 2.83 / 2.50
6 / 4.32 / 3.70
7 / 3.39 / 3.40
8 / 2.32 / 2.60
9 / 2.69 / 2.80
10 / 0.83 / 1.60
11 / 2.39 / 2.00
12 / 3.65 / 2.90
13 / 1.85 / 2.30
14 / 3.83 / 3.20
15 / 1.22 / 1.80
16 / 1.48 / 1.40
17 / 2.28 / 2.00
18 / 4.00 / 3.80
19 / 2.28 / 2.20
20 / 1.88 / 1.60

Plot of the data observation pairs in a scatter plot or scatter diagram:

Regression allows us to develop an equation, like ŷ = 0.71 + 0.70*(HS GPA), to predict College GPA from HS GPA.

Notice that the regression model does not perfectly predict the college GPAs. There is some error in the prediction. This error can be quantified through the use of PDFs!

Goal of Chapter 11:

Develop a model (equation) that numerically describes the relationship between two variables using simple linear regression.

Algebra Review

X / Y
-1/2 / 0
0 / 1
1 / 3

These three points fall on the line Y = mx + b with y-intercept b = 1 and slope m = 2.

Y = dependent variable

x = independent variable

b = y-intercept

m = slope of the line; measures how fast (or slow) Y changes as x increases by one unit

Origins of Regression:

“Regression Analysis was first developed by Sir Francis Galton in the latter part of the 19th Century. Galton had studied the relation between heights of fathers and sons and noted that the heights of sons of both tall & short fathers appeared to ‘revert’ or ‘regress’ to the mean of the group. He considered this tendency to be a regression to ‘mediocrity.’ Galton developed a mathematical description of this tendency, the precursor to today’s regression models.” (From page 6 of Neter, Kutner, Nachtsheim, and Wasserman, 1996)

11.2: The Simple Linear Regression Model

Suppose you are interested in studying the relationship between two variables x and Y (x may be HS GPA and Y may be college GPA),

where

Y = dependent random variable value

x = independent variable value (this is assumed to be a fixed constant here)

Y = α + βx + ε is the population regression model

  • ε = random variable (random error term) that has a normal PDF with E(ε) = 0 and Var(ε) = σε² (the book calls this just σ²)
  • E(Y) = μY|x = α + βx is what Y is expected to be on average for a specific value of x since

E(Y) = E(α + βx + ε)

= E(α) + E(βx) + E(ε)

= α + βx + 0

= α + βx

Note that the book uses both E(Y) and μY|x to denote the same thing.

  • α = y-intercept for the population regression model
  • β = slope for the population regression model
  • α and β are parameters that need to be estimated

ŷ = a + bx is the sample regression model (estimated regression model, equation, or line; also called the fitted line)

  • ŷ = a + bx estimates E(Y) = α + βx
  • a is the estimated value of α; y-intercept for the sample regression model
  • b is the estimated value of β; slope for the sample regression model
  • a and b are observed values of statistics
  • ŷ is the estimated or predicted value of E(Y)

Note that x is a constant value – not a random variable. In settings like the GPA example, it makes sense for HS GPA to be a random variable. Even if it is, the estimators derived and inferences made in this chapter would remain the same. STAT 970 discusses this in more detail when X is a random variable.
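The claim that E(Y) = α + βx can be illustrated by simulation. A small Python sketch (the parameter values α = 1, β = 2, σε = 0.5, and x = 3 are my own illustrative choices, not from the notes):

```python
import random

random.seed(1)

# Illustrative parameter values (assumptions, not from the notes)
alpha, beta, sigma = 1.0, 2.0, 0.5
x = 3.0  # one fixed value of the independent variable

# Simulate Y = alpha + beta*x + eps with eps ~ N(0, sigma^2) many times;
# the average of the simulated Y's should be near alpha + beta*x = 7.
ys = [alpha + beta * x + random.gauss(0, sigma) for _ in range(100_000)]
ybar = sum(ys) / len(ys)
print(ybar)  # close to 7.0
```

The average of the simulated Y values is essentially α + βx, matching the expectation derivation above.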

Below are two nice diagrams showing what is being done here.

Model assumptions

At each possible x value, Y has a normal PDF with E(Y) = α + βx and Var(Y) = σε². Why?

ε is a normal random variable with E(ε) = 0 and Var(ε) = σε². Thus,

Var(Y) = Var(α + βx + ε)

= Var(ε) since α + βx is a constant

= σε²

E(Y) = E(α + βx + ε)

= E(α) + E(βx) + E(ε)

= α + βx + 0

= α + βx

For each x, we can then use the normal PDF the same way as we did in Chapter 6. For example,

P[E(Y) - 2σε < Y < E(Y) + 2σε] = 0.954 (see p. 6.24)

⇒ P[α + βx - 2σε < Y < α + βx + 2σε] = 0.954

Please remember that Y is dependent on the particular value of x here.

This result is shown on p. 355 of the book and below, illustrating how Y has a normal PDF at each x.

Also see p. 11.41 for a partially related discussion.

11.3: Least Squares and the Fitted Model

Calculation of a (estimate of α) and b (estimate of β)

Suppose we have a sample of size n where we obtain (x1, y1), (x2, y2), …, (xn, yn). The formulas for a and b are found using the least squares method (more on this later):

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = [Σxiyi - (Σxi)(Σyi)/n] / [Σxi² - (Σxi)²/n]

where Σ denotes summing over i = 1,…,n

a = ȳ - b·x̄

Example: What is the relationship between sales and advertising for a company?

Let x = Advertising ($100,000)

y = Sales units (10,000)

x / y / x2 / y2 / xy
x1=1 / y1=1 / 1 / 1 / 1
x2=2 / y2=1 / 4 / 1 / 2
3 / 2 / 9 / 4 / 6
4 / 2 / 16 / 4 / 8
5 / 4 / 25 / 16 / 20
 / 15 / 10 / 55 / 26 / 37

(Ex. x2=55)

a= - b= 10/5 – (0.70)*15/5 = 2 – 2.1 = -0.1

= -0.1 + 0.7x
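The hand calculations above can be reproduced programmatically. A short Python sketch (the notes use Excel; Python is my supplement) applying the least squares formulas to the advertising data:

```python
# Advertising example data: x = advertising ($100,000s), y = sales (10,000s)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 37
sum_x2 = sum(xi ** 2 for xi in x)               # 55

# b = [Sxy - (Sx)(Sy)/n] / [Sx2 - (Sx)^2/n],  a = ybar - b*xbar
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n
print(round(a, 4), round(b, 4))  # -0.1 0.7
```

This matches the hand calculation: b = 0.7 and a = -0.1.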

Scatter plot – A plot where each observation pair is plotted as a point

Scatter Plot with sample regression model: graph showing the sampled values and the sample (estimated regression) model.


x / y / ŷ / y - ŷ
1 / 1 / 0.6 / 0.4
2 / 1 / 1.3 / -0.3
3 / 2 / 2.0 / 0
4 / 2 / 2.7 / -0.7
5 / 4 / 3.4 / 0.6

For example:

ŷ = -0.10 + (0.70)*1 = 0.6 when x = 1

Residual (Error)
ei = yi - ŷi = observed sample value - predicted model value.
This gives a measurement of how far the predicted value is from the sampled value.
Obviously, we want these to be small.
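The residual table above can be computed directly. A Python sketch using the advertising data and the fitted line ŷ = -0.1 + 0.7x:

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7   # least squares estimates found earlier

y_hat = [a + b * xi for xi in x]                 # predicted values
resid = [yi - yh for yi, yh in zip(y, y_hat)]    # e_i = y_i - y_hat_i
sse = sum(e ** 2 for e in resid)                 # sum of squared errors

print([round(v, 1) for v in y_hat])   # [0.6, 1.3, 2.0, 2.7, 3.4]
print([round(e, 1) for e in resid])   # [0.4, -0.3, 0.0, -0.7, 0.6]
print(round(sse, 4))                  # 1.1
```

The predicted values and residuals match the table, and the squared residuals sum to 1.1 (used later as SSE).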

Example: Sales and advertising

What does the sales and advertising estimated regression model mean? (Remember advertising is measured in $100,000 units and sales is measured in 10,000 units)

1) Estimated slope b = 0.7:

Sales volume (y) is expected to increase by (0.7)*10,000 = 7,000 units for each $100,000 (one-unit) increase in advertising (x).

2) Use model for prediction:

Estimate sales when advertising is $100,000.

ŷ = -0.10 + 0.70*1 = 0.60

Estimated sales is 6,000 units.

Estimate sales when advertising is $250,000.

ŷ = -0.10 + 0.70*2.5 = 1.65

Estimated sales is 16,500 units.

Least Squares Method Explanation: Method used to find equations for a and b (least_squares_demo.xls).

Below is the explanation of the least squares method relative to the HS and College GPA example.

  • Notice how the sample (estimated) regression model seems to go through the “middle” of the points on the scatter plot. For this to happen, a and b must be –0.1 and 0.7, respectively. This provides the “best fit” line through the points.
  • The least squares method finds the a and b such that SSE = Σ(yi - ŷi)² = Σ(yi - a - bxi)² is minimized (where SSE = Sum of Squares Error). These formulas are derived using calculus (more on this soon!).
  • Least squares method demonstration with least_squares_demo.xls:
  • Uses the GPA example data set with a = 0.7060 and b = 0.7005
  • The demo examines what happens to the SSE and the sample regression line plot if values other than a and b are used as the y-intercept and slope in the sample regression model.
  • Below are a few cases:

Notice that as the y-intercept and slope get closer to a and b, SSE becomes smaller and the line better approximates the relationship between x and y!

The actual formulas for a and b can be derived through using calculus. The purpose is to find an a and b such that

SSE = Σ(yi - ŷi)² = Σ(yi - a - bxi)²

is minimized. Here’s the process:

  • Find the partial derivatives of SSE with respect to a and b
  • Set the partial derivatives equal to 0
  • Solve for a and b!

∂SSE/∂a = -2Σ(yi - a - bxi). Setting the derivative equal to 0 produces

Σyi - na - bΣxi = 0 ⇒ a = ȳ - b·x̄ (1)

And ∂SSE/∂b = -2Σxi(yi - a - bxi). Setting the derivative equal to 0 produces

Σxiyi - aΣxi - bΣxi² = 0 (2)

Substituting (1) into (2) results in

Σxiyi - (ȳ - b·x̄)Σxi - bΣxi² = 0 ⇒ b = [Σxiyi - (Σxi)(Σyi)/n] / [Σxi² - (Σxi)²/n]

Then a becomes a = ȳ - b·x̄.

It can be shown that these values do indeed result in a minimum (not a maximum) for SSE. Note that the book equivalently expresses b on p. 356 as:

b = [nΣxiyi - (Σxi)(Σyi)] / [nΣxi² - (Σxi)²]

which results from multiplying the numerator and denominator by n.
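One way to convince yourself that the calculus solution really minimizes SSE is to compare it to a brute-force search. A Python sketch using the advertising data (the grid range and resolution are arbitrary choices of mine):

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

def sse(a, b):
    """Sum of squared errors for a candidate line y = a + b*x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

# Closed-form least squares solution
b_ls = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / \
       (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)
a_ls = sum(y) / n - b_ls * sum(x) / n

# Brute-force search over a lattice of (a, b) pairs in [-1.5, 1.5] x [-1.5, 1.5]
grid = [(a0 / 100, b0 / 100)
        for a0 in range(-150, 151) for b0 in range(-150, 151)]
a_gs, b_gs = min(grid, key=lambda ab: sse(*ab))

print(a_ls, b_ls)  # the calculus answer: a ~ -0.1, b = 0.7
print(a_gs, b_gs)  # the grid search lands on the same point: (-0.1, 0.7)
```

No combination of y-intercept and slope on the grid produces a smaller SSE than the least squares solution, mirroring the least_squares_demo.xls demonstration.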

Example: HS and College GPA (HS_college_GPA.xls)

In Excel, the Regression analysis tool can be used to do most of the necessary calculations. Select Tools > Data Analysis from the main Excel menu bar to bring up the Data Analysis window. Select Regression and OK to produce the Regression window. Below is the finished window (using Windows 2000).

The Residual option produces the residuals in the output. The Line Fit Plots option produces a plot similar to a scatter plot with a sample regression model plotted upon it. Below is the output generated by Excel.

SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.9144
R Square / 0.8361
Adjusted R Square / 0.8270
Standard Error / 0.2964
Observations / 20
ANOVA
df / SS / MS / F / Significance F
Regression / 1 / 8.0677 / 8.0677 / 91.8059 / 1.72E-08
Residual / 18 / 1.5818 / 0.0879
Total / 19 / 9.6495
Coef. / Stand. Err. / t Stat / P-value / Low. 95% / Up. 95%
Intercept / 0.7060 / 0.1991 / 3.55 / 0.0023 / 0.2877 / 1.1243
X=HS GPA / 0.7005 / 0.0731 / 9.58 / 1.72E-08 / 0.5469 / 0.8541

Using the Excel output:

ŷ = 0.7060 + 0.7005x
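The Excel regression output can be spot-checked from the raw data. A Python sketch using the 20 transcribed (x, y) pairs; small third-decimal differences from Excel's 0.7060 and 0.7005 likely reflect rounding in the transcribed data, so the check below is only to about two decimal places:

```python
# Transcribed HS GPA (x) and college GPA (y) for the 20 sampled students
x = [3.04, 2.35, 2.70, 2.05, 2.83, 4.32, 3.39, 2.32, 2.69, 0.83,
     2.39, 3.65, 1.85, 3.83, 1.22, 1.48, 2.28, 4.00, 2.28, 1.88]
y = [3.10, 2.30, 3.00, 1.90, 2.50, 3.70, 3.40, 2.60, 2.80, 1.60,
     2.00, 2.90, 2.30, 3.20, 1.80, 1.40, 2.00, 3.80, 2.20, 1.60]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)   # total sum of squares (SST)

b = sxy / sxx           # slope
a = ybar - b * xbar     # intercept
ssr = b * sxy           # regression sum of squares
r2 = ssr / syy          # R Square

print(round(a, 3), round(b, 3))  # near Excel's 0.7060 and 0.7005
print(round(r2, 3))              # near Excel's 0.8361
print(round(syy, 4))             # 9.6495, matching the ANOVA Total SS
```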

RESIDUAL OUTPUT
Observation / Predicted College GPA / Residuals
1 / 2.8349 / 0.2651
2 / 2.3496 / -0.0496
3 / 2.6003 / 0.3997
4 / 2.1443 / -0.2443
5 / 2.6851 / -0.1851
6 / 3.7297 / -0.0297
7 / 3.0791 / 0.3209
8 / 2.3314 / 0.2686
9 / 2.5921 / 0.2079
10 / 1.2844 / 0.3156
11 / 2.3788 / -0.3788
12 / 3.2597 / -0.3597
13 / 2.0014 / 0.2986
14 / 3.3864 / -0.1864
15 / 1.5627 / 0.2373
16 / 1.7427 / -0.3427
17 / 2.3048 / -0.3048
18 / 3.5081 / 0.2919
19 / 2.3036 / -0.1036
20 / 2.0209 / -0.4209


Notice the above graph does not look exactly like a scatter plot with sample (estimated) regression line plotted upon it. Below is one way to fix the plot. Note that other steps are often necessary to make the plot more “professional” looking (changing the scale on the axes, adding tick marks, changing graph titles, etc…)

1) Change background from grey to white

a) Right click on the grey background (a menu should appear)

b) Select Format Plot Area to bring up the following window:

i) Select None as the area

ii) Select OK

2) Remove legend

a) Right click in the legend

b) Select Clear

3) Create the line for the sample regression model

a) Right click on one of the estimated y values (should be in pink) and a menu should appear

b) Select Format Data Series to bring up the following window:

i) Under Marker, select None

ii) Under Line, select Automatic

iii) Select OK

The plot should now look like this:

IMPORTANT:

Do not extrapolate beyond the range of data when estimating E(Y).

Example: The GPA data has 0.83 ≤ x ≤ 4.32. Do not try to estimate E(Y) (i.e., find a ŷ) when x is 0.5 or 4.5.

11.4: Properties of the Least Squares Estimators

Review of model:

Y = + x +  is the population regression model

  • = random variable with a normal PDF which has E() = 0 and Var() =
  • Y has a normal PDF with E(Y) = + x and Var(Y) =
  •  = y-intercept for the population regression model
  •  = slope for the population regression model
  • and  are parameters that need to be estimated

is the sample regression model

  • estimates E(Y) = + x
  • a is the estimated value of ; Y-intercept for the sample regression model
  • b is the estimated value of ; slope for the sample regression model
  • a and b are observed values of statistics

Add the following:

  • Suppose we have a random sample of size n. Then the population version of the model could be denoted by

Yi = + xi + i for i=1,…,n.

Each i is assumed to be independent with normal PDFs which have E(i) = 0 and Var(i) = . Thus, each i has the same mean and variance. Also, Yi has a normal PDF with E(Yi) = + xi and Var(Yi) = , and each Yi is independent.

  • a and b are observed values of statistics, and their corresponding random variables are denoted as A and B.
  • Our “random variable” version of the sample regression model could be written as .

Mean and variance of A and B

In Sections 8.4-8.5, we saw that

E(X̄) = μ and Var(X̄) = σ²/n

which was helpful information to help derive confidence intervals and hypothesis tests for μ. Similarly, we would like to do the same thing here with A and B, which are statistics that estimate α and β.

First, let's find E(B).

Note that

B = Σ(xi - x̄)(Yi - Ȳ) / Σ(xi - x̄)²

where Yi has a normal PDF with E(Yi) = α + βxi and Var(Yi) = σε². Then

E(B) = Σ(xi - x̄)E(Yi) / Σ(xi - x̄)² = Σ(xi - x̄)(α + βxi) / Σ(xi - x̄)² = βΣ(xi - x̄)xi / Σ(xi - x̄)² = β

using Σ(xi - x̄) = 0 and Σ(xi - x̄)xi = Σ(xi - x̄)². Thus, B is an unbiased estimator of β!

Also, the variance of B, denoted by σB² in the book, is

Var(B) = σε² / Σ(xi - x̄)²

To help simplify the proof, notice that

Σ(xi - x̄)(Yi - Ȳ) = Σ(xi - x̄)Yi since ȲΣ(xi - x̄) = 0

In a similar manner, it can be shown that

Σ(xi - x̄)² = Σ(xi - x̄)xi

Then

Var(B) = Var[Σ(xi - x̄)Yi / Σ(xi - x̄)²] = Σ(xi - x̄)²Var(Yi) / [Σ(xi - x̄)²]²

Also, notice that the Yi are independent, so the variance of the sum is the sum of the variances. Thus,

Var(B) = σε²Σ(xi - x̄)² / [Σ(xi - x̄)²]² = σε² / Σ(xi - x̄)²

Note that E(A) and Var(A) can be derived in a similar manner. See p. 362 for their expressions.
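The facts E(B) = β and Var(B) = σε²/Σ(xi - x̄)² can be illustrated by simulation. A Python sketch (the true parameter values and the x grid are my own illustrative choices):

```python
import random

random.seed(7)

# Hypothetical true parameters and design points (assumptions for illustration)
alpha, beta, sigma = 1.0, 0.5, 1.0
x = list(range(1, 21))                       # fixed x values, n = 20
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)      # sum of (x_i - xbar)^2 = 665

# Repeatedly sample Y_i = alpha + beta*x_i + eps_i and refit the slope
bs = []
for _ in range(2000):
    ys = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(ys) / len(ys)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, ys)) / sxx
    bs.append(b)

mean_b = sum(bs) / len(bs)
var_b = sum((bi - mean_b) ** 2 for bi in bs) / (len(bs) - 1)

print(mean_b)                  # near beta = 0.5 (B is unbiased)
print(var_b, sigma**2 / sxx)   # both near 1/665, i.e. Var(B) = sigma^2/Sxx
```

Across the 2,000 simulated samples, the average slope estimate is essentially β and the empirical variance of the slopes matches σε²/Σ(xi - x̄)².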

Mean square error

Notice that the variance of B has a parameter in it: σε². We estimate it by calculating a variance measure similar to the one introduced in Chapter 8. The observed value of the estimate of σε² is

sε² = SSE/(n - 2) = Σ(yi - ŷi)²/(n - 2)

The random variable version, Sε², replaces the observed values with the corresponding random variables. It can be shown that E(Sε²) = σε²!

Notes:

  • The sample variance presented in Sections 8.1-8.2 was s² = Σ(xi - x̄)²/(n - 1). Notice that the denominator had n-1 in it, which was the degrees of freedom of the numerator. The numerator had 1 estimator of μ in it, which caused a “loss” of a degree of freedom. Without going into too much detail, the degrees of freedom of the numerator in sε² is n-2 since it includes the estimates of α and β.
  • sε² is denoted in the book by just s². I chose to use the ε subscript to avoid confusion with notation introduced in Chapter 8!
  • sε² is often called the mean square error (MSE) since it is similar to an “average” squared error of the regression model.

Since Var(B) contained the parameter σε² in it, we will need to replace σε² with sε² in order to come up with an actual estimated variance:

Vâr(B) = sε² / Σ(xi - x̄)²

Notice the use of the extra ^ above the variance to help denote an estimated quantity.

  • What do σε² and sε² visually represent?

σε² measures the variability of the points from the
E(Y) = α + βx line in the graph below.

sε² measures the variability of the sampled points from the ŷ = a + bx line in the graph below.

Question: Which plot below is associated with the larger MSE (sε²)? Note that the same x values are used in both plots.

1)

2)

If your goal was to produce a sample regression model which predicted the y values as well as possible, which plot would you prefer?

Practical interpretation of MSE

Rule of thumb: All observed values lie within 2 to 3 standard deviations from the mean.

Here, ŷ plays the role of the mean and sε plays the role of the standard deviation.

So all of the y values for a given x should lie within ŷ ± 2sε.

Example: Sales and Advertising

x / y / ŷ / y - ŷ / (y - ŷ)²
1 / 1 / 0.6 / 0.4 / 0.16
2 / 1 / 1.3 / -0.3 / 0.09
3 / 2 / 2.0 / 0 / 0
4 / 2 / 2.7 / -0.7 / 0.49
5 / 4 / 3.4 / 0.6 / 0.36
Σ / / / / 1.1

Note: sε² = 1.1/(5-2) = 0.3667, so sε = 0.6055 and 2sε = 1.2111

Example calculation:

For x = 4,

ŷ ± 2sε = 2.7 ± 1.2111 = (1.4889, 3.9111)

For all observations in the data set with x = 4, one would expect all the corresponding y values to be between 1.49 and 3.91.
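The 2sε calculation for the advertising data can be sketched in Python:

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7                            # fitted line from earlier

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # = 1.1
s = math.sqrt(sse / (len(x) - 2))           # s_eps = sqrt(SSE/(n-2))

y_hat = a + b * 4                           # prediction at x = 4 (= 2.7)
lo, hi = y_hat - 2 * s, y_hat + 2 * s
print(round(2 * s, 4))                      # 1.2111
print(round(lo, 4), round(hi, 4))           # 1.4889 3.9111
```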

Example: HS and College GPA(HS_college_GPA.xls)

ANOVA
df / SS / MS / F / Significance F
Regression / 1 / 8.0677 / 8.068 / 91.8059 / 1.72E-08
Residual / 18 / 1.5818 / 0.0879
Total / 19 / 9.6495

MSE = sε² = 0.0879 and SSE = 1.5818

11.5: Inference Concerning the Regression Coefficients

Is x linearly related to Y?

Can you use x to predict Y?

Construct a C.I. and/or perform a hypothesis test for β to answer these questions.

Suppose β = 0

Population model: Y = α + βx + ε

Then Y = α + 0·x + ε

= α + ε

Example plot where α = 3:

As x changes, E(Y) does not change; thus, x is not linearly related to Y. Our inferences about β are typically centered on whether or not β = 0.

Inferences about 

From Section 8.7, we learned that if

  • Z is a standard normal random variable
  • V is a chi-square random variable with  degrees of freedom
  • Z and V are independent

then has a t-distribution with  degrees of freedom.

With respect to this chapter, we can show that B has a normal PDF (using Chapter 7 material) with E(B) = β and Var(B) = σε²/Σ(xi - x̄)². Thus,

Z = (B - β) / [σε/√Σ(xi - x̄)²] (1)

has a standard normal PDF.

From Section 8.6, we learned that if S² is the variance of a random sample of size n taken from a normal population having the variance σ², then the statistic

χ² = (n - 1)S²/σ²

has a chi-squared PDF with ν = n-1 degrees of freedom.

With respect to this chapter,

V = (n - 2)Sε²/σε²

has a chi-square PDF with ν = n-2 degrees of freedom.

One can show that Z and V here are independent. Then combining (1) and V produces

T = Z/√(V/(n-2)) = (B - β) / [Sε/√Σ(xi - x̄)²]

Thus, T here has a t-distribution with ν = n-2 degrees of freedom! Using methods from Chapters 9 and 10, we can conclude the following:

P(-t_{α/2, n-2} < T < t_{α/2, n-2}) = 1 - α

Thus, the (1-α)100% C.I. for β is

b ± t_{α/2, n-2}·sε/√Σ(xi - x̄)²

Similarly, the hypothesis test for β = 0 can use the following procedures.

C.I. method:

1) Ho: β = 0 (no linear relationship)
Ha: β ≠ 0 (linear relationship)

2) Calculate the (1-α)100% C.I.

3) Decide whether or not to reject Ho by checking if 0 is in the interval

4) State a conclusion in terms of the problem

Reject Ho – There is sufficient evidence to show that ____ is linearly related to ____

Don’t Reject Ho – There is not sufficient evidence to show that ____ is linearly related to ____

where ____ means to put in what x and Y represent in the problem

Test statistic method:

1) Ho: β = 0 (no linear relationship)
Ha: β ≠ 0 (linear relationship)

2) Calculate the test statistic: t = b / [sε/√Σ(xi - x̄)²]

3) State the critical value: t_{α/2, n-2}

4) Decide whether or not to reject Ho (reject if |t| > t_{α/2, n-2})

5) State a conclusion in terms of the problem

p-value method:

1) Ho: β = 0 (no linear relationship)
Ha: β ≠ 0 (linear relationship)

2) Calculate the p-value: p-value = 2*P(T > |t|) where T is a random variable with a t-distribution and ν = n-2.

3) State α

4) Decide whether or not to reject Ho (reject if p-value < α)

5) State a conclusion in terms of the problem

Example: Sales and Advertising

Is advertising linearly related to sales? Use α = 0.05.

1) Ho: β = 0
Ha: β ≠ 0

2) We previously calculated sε = 0.6055. Note that Σ(x - x̄)² = 10 from using the calculations below.

x / y / (x - x̄)²
1 / 1 / 4
2 / 1 / 1
3 / 2 / 0
4 / 2 / 1
5 / 4 / 4
Σ / / 10

Then

t = b / [sε/√Σ(x - x̄)²] = 0.7 / (0.6055/√10) = 3.6556

3) t_{0.05/2, 5-2} = 3.182

4) Since 3.6556 > 3.182, reject Ho.

5) There is sufficient evidence to show that advertising is linearly related to sales.
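The test statistic calculation can be verified in a few lines of Python (the critical value 3.182 comes from a t table, as in the notes):

```python
import math

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
a, b = -0.1, 0.7                                  # fitted line from earlier

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)           # = 10
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))                      # s_eps

t = b / (s / math.sqrt(sxx))                      # test statistic
print(round(t, 4))                                # 3.6556

t_crit = 3.182                                    # t_{0.025, 3} from a t table
print(abs(t) > t_crit)                            # True -> reject Ho
```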

Example: HS and College GPA (HS_college_GPA.xls)

Is HS GPA linearly related to college GPA? Use α = 0.05.

Coef. / S. E. / t Stat / P-value / Low 95% / Up 95%
Intercept / 0.7060 / 0.1991 / 3.5457 / 0.0023 / 0.2877 / 1.1243

X=HS GPA / 0.7005 / 0.0731 / 9.5815 / 1.72E-08 / 0.5469 / 0.8541

95% C.I. for β: (0.5469, 0.8541), from the Excel output

1) Ho: β = 0
Ha: β ≠ 0

2) p-value = 1.72*10⁻⁸ = 0.0000000172

3) α = 0.05

4) Since 0.0000000172 < 0.05, reject Ho

5) There is sufficient evidence to show that HS GPA is linearly related to college GPA

P-value interpretation: If β is really 0 in the population, then a test statistic value, t, at least this large in absolute value (9.5815) would occur about 17 times if the hypothesis test process (take a new sample and perform a new hypothesis test) is repeated 1 billion times. This is about 2 times in 100 million!

In other words, this is very unlikely to occur if β = 0. Thus, β is most likely not 0 and Ho is rejected.

Examine the scatter plot with the sample model plotted upon it to see why this conclusion makes intuitive sense.

Example: Pizza and college GPA (HS_college_GPA.xls)

Suppose we also want to know if there is a linear relationship between the number of times a student ate pizza during their freshman year of college and their college GPA. The same 20 students are in this sample.

Pizza / Y
2 / 3.1
9 / 2.3
3 / 3
5 / 1.9
3 / 2.5
1 / 3.7
4 / 3.4
7 / 2.6
8 / 2.8
4 / 1.6
1 / 2
3 / 2.9
2 / 2.3
15 / 3.2
1 / 1.8
6 / 1.4
4 / 2
5 / 3.8
6 / 2.2
7 / 1.6
SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.0355
R Square / 0.0013
Adjusted R Square / -0.0542
Standard Error / 0.7317
Observations / 20
ANOVA
df / SS / MS / F / Sig. F
Regression / 1 / 0.0122 / 0.0122 / 0.0228 / 0.8817
Residual / 18 / 9.6373 / 0.5354
Total / 19 / 9.6495
Coef. / S. E. / t Stat / P-value / Low 95% / Up 95%
Intercept / 2.4689 / 0.2900 / 8.5137 / 9.99E-08 / 1.8596 / 3.0781
Pizza / 0.0075 / 0.0499 / 0.1509 / 0.8817 / -0.097 / 0.1123

Use =0.01 for the hypothesis test.