Reading: Lohr, Chapter 3.2-3.3

Stat 475 Notes 5

I. Comment on Ratio Estimator

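We estimate the ratio $B = \bar{y}_U / \bar{x}_U$ by $\hat{B} = \bar{y} / \bar{x}$.

What about the estimator $\tilde{B} = \frac{1}{n} \sum_{i \in S} \frac{y_i}{x_i}$?

$\tilde{B}$ is not as good an estimator as $\hat{B}$.

For a large sample, $\hat{B}$ will have small bias and small variance, but $\tilde{B}$ can still have substantial bias because, in general, for two random variables $X$ and $Y$, $E(Y/X)$ might not equal $E(Y)/E(X)$.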
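The point that the expectation of a ratio need not equal the ratio of expectations can be checked numerically. The following is a minimal sketch in R; the distributions of `x` and `y` are made up for illustration and are not from the notes.

```r
# Hypothetical simulation: x and y are made-up positive random variables.
set.seed(475)
x <- runif(100000, min = 1, max = 10)    # auxiliary variable, bounded away from 0
y <- 2 * x + 5 + rnorm(100000, sd = 1)   # response with a nonzero intercept

ratio_of_means <- mean(y) / mean(x)      # approximates E(Y) / E(X)
mean_of_ratios <- mean(y / x)            # approximates E(Y / X)

ratio_of_means
mean_of_ratios
```

Under these assumed distributions, $E(Y)/E(X) = 16/5.5 \approx 2.91$ while $E(Y/X) = 2 + 5\ln(10)/9 \approx 3.28$, so the two simulated quantities should settle near different values even with a very large number of draws.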

II. Regression Estimator

The regression estimator is another estimator, besides the ratio estimator, for $\bar{y}_U$ when we have available an auxiliary variable $x$ for which we know $\bar{x}_U$.

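Suppose a linear regression model holds in the population:

$y_i = B_0 + B_1 x_i + \varepsilon_i$, where the errors $\varepsilon_i$ average to zero over the population.

Then

$\bar{y}_U = B_0 + B_1 \bar{x}_U$. (1.1)

The regression estimator substitutes the least squares estimates of $B_0$ and $B_1$ from the sample in (1.1) to estimate $\bar{y}_U$.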

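Specifically, the least squares estimates are

$\hat{B}_1 = \frac{\sum_{i \in S} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i \in S} (x_i - \bar{x})^2}, \qquad \hat{B}_0 = \bar{y} - \hat{B}_1 \bar{x}$

and the regression estimator is

$\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y} + \hat{B}_1 (\bar{x}_U - \bar{x})$.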
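As an illustration, the regression estimator can be computed directly from the usual least squares slope and intercept and checked against `lm`. This is only a sketch: the helper name `reg_estimate`, the five data points, and the value `xbar_U = 6` are all made up for the example.

```r
# Hypothetical helper (not from the notes): regression estimate of the population mean.
reg_estimate <- function(x, y, xbar_U) {
  B1hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # least squares slope
  B0hat <- mean(y) - B1hat * mean(x)                                  # least squares intercept
  B0hat + B1hat * xbar_U                                              # fitted line evaluated at xbar_U
}

# Made-up sample and made-up known population mean of x
x <- c(2, 4, 5, 7, 9)
y <- c(5, 8, 9, 13, 16)
xbar_U <- 6

by_hand <- reg_estimate(x, y, xbar_U)

# Same quantity from lm(), written two equivalent ways
fit <- lm(y ~ x)
by_lm <- unname(coef(fit)[1] + coef(fit)[2] * xbar_U)
adjusted_mean <- mean(y) + unname(coef(fit)[2]) * (xbar_U - mean(x))

all.equal(by_hand, by_lm)
all.equal(by_hand, adjusted_mean)
```

The second form shows the estimator as the sample mean of y adjusted by the slope times the gap between the known population mean of x and its sample mean.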

Bias and standard error of the regression estimator:

These properties do not assume that the regression model (1.1) holds. The standard error is an estimate of the standard deviation that uses a Taylor series expansion.

Bias: $E[\hat{\bar{y}}_{\text{reg}}] - \bar{y}_U = -\text{Cov}(\hat{B}_1, \bar{x})$.

The bias is approximately zero for large samples.

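Let $s_e^2$ be the sample variance of the residuals $e_i = y_i - (\hat{B}_0 + \hat{B}_1 x_i)$ from the least squares regression in the sample:

$s_e^2 = \frac{1}{n-2} \sum_{i \in S} e_i^2$

Then

$SE(\hat{\bar{y}}_{\text{reg}}) = \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_e^2}{n}}$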

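Also, if we are interested in estimating the population total $t_y$ for $y$ and we know the population size $N$, then

$\hat{t}_{y,\text{reg}} = N \hat{\bar{y}}_{\text{reg}}$

and

$SE(\hat{t}_{y,\text{reg}}) = N \cdot SE(\hat{\bar{y}}_{\text{reg}})$.

Unlike the ratio estimator, the regression estimator cannot be used to estimate a population total if we do not know $N$.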
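The mean, total, and standard error calculations can be collected in one small function. This is a sketch, not a definitive implementation: the name `reg_summary` and the example inputs are made up, and the residual variance uses the n-2 denominator that the R code later in these notes uses.

```r
# Hypothetical helper (not from the notes): regression estimates of mean and total
# under simple random sampling of n units from a population of size N.
reg_summary <- function(x, y, xbar_U, N) {
  n <- length(y)
  fit <- lm(y ~ x)
  ybar_reg <- unname(coef(fit)[1] + coef(fit)[2] * xbar_U)  # estimate of the mean
  s2_e <- sum(resid(fit)^2) / (n - 2)                       # residual variance (n - 2 denominator)
  se_mean <- sqrt((1 - n / N) * s2_e / n)                   # includes finite population correction
  c(mean = ybar_reg, se_mean = se_mean,
    total = N * ybar_reg, se_total = N * se_mean)           # totals require a known N
}

# Made-up data, population mean of x, and population size
out <- reg_summary(x = c(2, 4, 5, 7, 9), y = c(5, 8, 9, 13, 16), xbar_U = 6, N = 50)
out
```

The last two entries make the dependence on N explicit: without N there is no way to scale the mean estimate up to a total.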

Comparison of ratio estimator and regression estimator for estimating mean:

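Let $d_i = y_i - B x_i$, where $B = \bar{y}_U / \bar{x}_U$, and $e_i = y_i - (B_0 + B_1 x_i)$, and let $S_d^2$ and $S_e^2$ be the population variances of the $d_i$ and the $e_i$. Then

$V(\hat{\bar{y}}_r) \approx \left(1 - \frac{n}{N}\right) \frac{S_d^2}{n}$ and $V(\hat{\bar{y}}_{\text{reg}}) \approx \left(1 - \frac{n}{N}\right) \frac{S_e^2}{n}$.

Thus, the ratio estimator and regression estimator perform similarly when $S_d^2 \approx S_e^2$, which will be the case if the regression line (1.1) goes through the origin. But if the regression line (1.1) does not go through the origin, then $S_e^2$ might be much less than $S_d^2$, and the regression estimator will be much better than the ratio estimator.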
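The comparison can be explored by simulation. The sketch below builds a made-up population whose regression line has a large intercept, then repeatedly draws simple random samples and records both estimators of the mean. All numbers (population size, intercept, slope, sample size) are assumptions chosen for illustration.

```r
# Hypothetical population: the regression line clearly misses the origin.
set.seed(475)
N <- 1000
x_pop <- runif(N, 5, 15)
y_pop <- 20 + 1 * x_pop + rnorm(N, sd = 1)
xbar_U <- mean(x_pop)   # known population mean of x
ybar_U <- mean(y_pop)   # target (unknown in practice)
n <- 50

# One simple random sample: return the ratio and regression estimates of the mean
one_draw <- function() {
  s <- sample(N, n)
  x <- x_pop[s]
  y <- y_pop[s]
  fit <- lm(y ~ x)
  c(ratio = mean(y) / mean(x) * xbar_U,
    reg = unname(coef(fit)[1] + coef(fit)[2] * xbar_U))
}

est <- replicate(2000, one_draw())         # 2 x 2000 matrix of estimates
rmse <- sqrt(rowMeans((est - ybar_U)^2))   # root mean squared error of each estimator
rmse
```

With an intercept this far from zero, the regression estimator's root mean squared error should come out well below the ratio estimator's. Setting the intercept to 0 in `y_pop` should bring the two much closer together.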

As with the ratio estimator, the regression estimator will gain more over the sample mean when y and x are highly correlated.

To decide among the ratio estimator, the regression estimator, and the sample mean, we should first plot the data. If a simple linear regression model appears to hold approximately, the regression line goes approximately through the origin, and there is a reasonably high correlation between y and x (say above 0.5), then either the ratio or the regression estimator is reasonable. If a simple linear regression model appears to hold approximately but the regression line does not go approximately through the origin, then the regression estimator is the best choice. If a simple linear regression model does not appear to hold even approximately, then the regression estimator could have substantial bias, and a more advanced regression estimator than we study here could be considered.

Example: To estimate the number of dead trees in an area, we divide the area into 100 square plots and count the number of dead trees on a photograph of each plot. Photo counts can be made quickly, but sometimes a tree is misclassified or not detected. So we select a simple random sample of size 25 of the plots for field counts of dead trees. We know that the population mean number of dead trees per plot from the photo counts is 11.3.

The following is a scatterplot of the data and the least squares regression line:

photo=c(10,12,7,13,13,6,17,16,15,10,14,12,10,5,12,10,10,9,6,11,7,9,11,10,10);

field=c(15,14,9,14,8,5,18,15,13,15,11,15,12,8,13,9,11,12,9,12,13,11,10,9,8);

regmodel=lm(field~photo);

plot(photo,field);

abline(a=coef(regmodel)[1],b=coef(regmodel)[2]);

The following is a residual plot for the simple linear regression model:

plot(photo,resid(regmodel),ylab="Residuals",main="Residual Plot for Simple Linear Regression Model");

abline(0,0);

The simple linear regression model appears to approximately hold but the regression line does not approximately go through the origin, so the regression estimator is a better choice than the ratio estimator.
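As a rough numerical companion to the plots, one can look at a confidence interval for the fitted intercept. This is only a sketch: `confint` gives the standard model-based interval, which ignores the finite population correction, so it is an informal check rather than part of the survey analysis in these notes.

```r
# Data from the example above
photo <- c(10,12,7,13,13,6,17,16,15,10,14,12,10,5,12,10,10,9,6,11,7,9,11,10,10)
field <- c(15,14,9,14,8,5,18,15,13,15,11,15,12,8,13,9,11,12,9,12,13,11,10,9,8)

regmodel <- lm(field ~ photo)

# Model-based 95% interval for the intercept (informal check, no finite population correction)
ci.intercept <- confint(regmodel, "(Intercept)", level = 0.95)
ci.intercept

# Sample correlation between photo and field counts
r <- cor(photo, field)
r
```

An interval for the intercept that lies entirely above zero is consistent with the visual impression that the line does not pass through the origin.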

The correlation is reasonably high:

> cor(photo,field)

[1] 0.6241967

so the regression estimator may gain over the sample mean.

The regression estimator is computed as follows:

> regmodel=lm(field~photo); # Least squares regression of field on photo

> B0hat=coef(regmodel)[1]; # Least squares estimate of intercept

> B0hat

(Intercept)

5.059292

> B1hat=coef(regmodel)[2]; # Least squares estimate of slope

> B1hat

photo

0.6132743

> ybar.reg=B0hat+B1hat*11.3; # Regression estimator, uses that known population mean of photo is 11.3

> ybar.reg

(Intercept)

11.98929

> sereg.sq=sum(resid(regmodel)^2)/(length(field)-2); # Variance of residuals from least squares regression (denominator n-2)

> se.ybar.reg=sqrt((1-(25/100))*sereg.sq/25); # Standard error of regression estimator

> se.ybar.reg

[1] 0.4167579

> ybar=mean(field); # Sample mean

> ybar

[1] 11.56

> se.ybar=sqrt((1-(25/100))*var(field)/25); # Standard error of sample mean

> se.ybar

[1] 0.5222069

The regression estimator has a considerably smaller standard error, 0.417, than the sample mean, 0.522.
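For comparison, the ratio estimator of the mean can be computed on the same data. This is a sketch under an assumption: it uses the standard error formula for the ratio estimator from Lohr, including the factor $(\bar{x}_U/\bar{x})^2$, which is not restated in these notes.

```r
# Data from the example above
photo <- c(10,12,7,13,13,6,17,16,15,10,14,12,10,5,12,10,10,9,6,11,7,9,11,10,10)
field <- c(15,14,9,14,8,5,18,15,13,15,11,15,12,8,13,9,11,12,9,12,13,11,10,9,8)
N <- 100
n <- length(field)
xbar_U <- 11.3                       # known population mean of photo counts

Bhat <- mean(field) / mean(photo)    # estimated ratio
ybar.ratio <- Bhat * xbar_U          # ratio estimator of the mean
d <- field - Bhat * photo            # residuals from the line through the origin

# Assumed formula: SE of the ratio estimator of the mean under simple random sampling
se.ybar.ratio <- sqrt((1 - n / N) * (xbar_U / mean(photo))^2 * var(d) / n)

ybar.ratio
se.ybar.ratio
```

The resulting standard error can be set beside the values 0.417 and 0.522 reported above for the regression estimator and the sample mean.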

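Our estimate of the total number of dead trees with the regression estimator is $\hat{t}_{y,\text{reg}} = N \hat{\bar{y}}_{\text{reg}} = 100(11.989) = 1198.9$.

An approximate 95% confidence interval for the total number of dead trees is $1198.9 \pm 2.07(100)(0.4168) = 1198.9 \pm 86.3$, or roughly 1113 to 1285.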

Here we use the t-distribution percentile (with n-2=23 degrees of freedom) of 2.07 rather than the normal distribution percentile of 1.96 because of the relatively small sample size.
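The total and its confidence interval can be reproduced in R from the values printed above. This is a minimal sketch: the variable names `t.hat`, `se.t.hat`, `t.crit`, and `ci` are made up here, and `qt` supplies the exact t percentile rather than the rounded 2.07.

```r
N <- 100                    # number of plots in the population
n <- 25                     # sample size
ybar.reg <- 11.98929        # regression estimate of the mean, as printed above
se.ybar.reg <- 0.4167579    # its standard error, as printed above

t.hat <- N * ybar.reg       # estimated total number of dead trees
se.t.hat <- N * se.ybar.reg # standard error of the estimated total

t.crit <- qt(0.975, df = n - 2)              # t percentile with n - 2 = 23 degrees of freedom
ci <- t.hat + c(-1, 1) * t.crit * se.t.hat   # approximate 95% confidence interval

t.hat
ci
```

Because `qt` returns a value slightly below 2.07, the interval endpoints may differ from a hand calculation in the first decimal place.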