Stat 475 Notes 6

Reading: Lohr, Chapter 3.3, 4.1-4.2

Corrections for Note 4 Addendum

A first order Taylor expansion of $\hat{B} = \bar{y}/\bar{x}$ around $(\bar{x}_U, \bar{y}_U)$ is

$$\hat{B} \approx B + \frac{1}{\bar{x}_U}\left[(\bar{y} - \bar{y}_U) - B(\bar{x} - \bar{x}_U)\right] = B + \frac{\bar{y} - B\bar{x}}{\bar{x}_U}.$$

We approximate $V(\hat{B})$ by the variance of the right hand side of the above:

$$V(\hat{B}) \approx \frac{1}{\bar{x}_U^2}\, V(\bar{y} - B\bar{x}) = \left(1 - \frac{n}{N}\right)\frac{S_e^2}{n\,\bar{x}_U^2},$$

where $S_e^2$ is the population variance of the residuals $e_i = y_i - Bx_i$.

I. Ratio and regression estimation review

Ratio and regression estimation are useful for estimating the mean or total of a variable y when there is an auxiliary variable x whose population total (or mean) is known and which is highly correlated with y.

The intuition is that when x and y are highly correlated, comparing the sample mean of x to the known population mean of x tells us whether the sample mean of y is likely to be an overestimate or an underestimate of the population mean of y, and we can adjust for this.
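For reference, with $\bar{x}_U$ denoting the known population mean of x and $\bar{x}$, $\bar{y}$ the sample means, the two estimators of the population mean of y (as implemented in the R code in Section II) are

$$\hat{\bar{y}}_{\text{ratio}} = \hat{B}\,\bar{x}_U \quad\text{with}\quad \hat{B} = \frac{\bar{y}}{\bar{x}}, \qquad\qquad \hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1\,\bar{x}_U,$$

where $\hat{B}_0$ and $\hat{B}_1$ are the least squares intercept and slope from regressing y on x in the sample. Multiplying either estimator by N gives the corresponding estimator of the population total.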

Steps in using ratio and regression estimation:

1. Plot the data. Fit a simple linear regression model. Make a residual plot.

2. Based on the residual plot, decide whether the simple linear regression model is a reasonable model for the data. If it is not, then both the ratio and regression estimators could have large bias; an estimator based on a more appropriate regression model can be considered instead (see Section V at the end of these notes).

3. If the simple linear regression model is a reasonable model for the data, check whether the regression line approximately goes through the origin, e.g., by testing whether the intercept equals 0. If the regression line approximately goes through the origin, either the ratio or regression estimator can be used. If the regression line does not approximately go through the origin, then the regression estimator should be used.

II. Example of ratio and regression estimation:

An advertising firm is concerned about the effect of a new regional promotional campaign on the total dollar sales for a particular product. A simple random sample of 20 stores is drawn from the 452 regional stores in which the product is sold. Quarterly sales data are obtained for the current 3-month period and the 3-month period prior to the campaign. The total sales among all 452 stores for the 3-month period prior to the campaign is known to be 216,256. Use the data to estimate the total sales for the current period and give a 95% confidence interval for the total sales for the current period.

Plot the data:

precampaign.sales=c(208,400,440,259,351,880,273,487,183,863,599,510,828,473,924,110,829,257,388,244);

present.sales=c(239,428,472,276,363,942,294,514,195,897,626,538,888,510,998,171,889,265,419,257);

plot(precampaign.sales,present.sales);

Precampaign sales and present sales are highly correlated:

cor(precampaign.sales,present.sales)

[1] 0.9985922

Fit a simple linear regression model:

regmodel=lm(present.sales~precampaign.sales);

summary(regmodel);

Call:

lm(formula = present.sales ~ precampaign.sales)

Residuals:

Min 1Q Median 3Q Max

-19.037 -7.918 -2.345 8.252 45.423

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 10.10523 7.08325 1.427 0.171

precampaign.sales 1.04975 0.01314 79.871 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.93 on 18 degrees of freedom

Multiple R-squared: 0.9972, Adjusted R-squared: 0.997

F-statistic: 6379 on 1 and 18 DF, p-value: < 2.2e-16

Residual plot:

plot(precampaign.sales,resid(regmodel),xlab="Precampaign Sales",ylab="Residual",main="Residual plot");

abline(0,0); # Draws a horizontal line at 0

The simple linear regression model appears to be a reasonable model for the data.

The p-value for testing whether the intercept of the regression line is 0 is 0.171, so there is no evidence against a zero intercept. It is reasonable to use either the ratio or the regression estimator.

Comparison of ratio, regression and standard estimators
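As a summary of what the code below computes (with N = 452, n = 20, and $\bar{x}_U$ the population mean of precampaign sales):

$$\hat{t}_{\text{ratio}} = N\hat{B}\bar{x}_U, \qquad SE(\hat{t}_{\text{ratio}}) \approx N\sqrt{\left(1-\frac{n}{N}\right)\frac{s_e^2}{n}}, \qquad s_e^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \hat{B}x_i)^2;$$

$$\hat{t}_{\text{reg}} = N(\hat{B}_0 + \hat{B}_1\bar{x}_U), \qquad SE(\hat{t}_{\text{reg}}) \approx N\sqrt{\left(1-\frac{n}{N}\right)\frac{s_{\text{reg}}^2}{n}}, \qquad s_{\text{reg}}^2 = \frac{1}{n-2}\sum_{i \in S} \hat{e}_i^2;$$

$$\hat{t}_{\text{SRS}} = N\bar{y}, \qquad SE(\hat{t}_{\text{SRS}}) = N\sqrt{\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}},$$

where the $\hat{e}_i$ are the residuals from the fitted regression and $s_y^2$ is the sample variance of present sales.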

# Ratio estimator

# Population mean of precampaign sales

population.mean.precampaign.sales=216256/452;

# Sample estimate of ratio

Bhat=mean(present.sales)/mean(precampaign.sales)

# Ratio estimator of population total

ytotal.ratio=452*Bhat*population.mean.precampaign.sales;

# s_e^2 needed for estimating standard error of ratio estimator of population total

se.ratio.sq=sum((present.sales-Bhat*precampaign.sales)^2)/(length(present.sales)-1);

# Standard error of ratio estimator of total

standard.error.ytotal.ratio=452*sqrt((1-20/452)*se.ratio.sq/20);

# Approximate 95% confidence interval

lci.ratio=ytotal.ratio-1.96*standard.error.ytotal.ratio;

uci.ratio=ytotal.ratio+1.96*standard.error.ytotal.ratio;

# Regression estimator

# Fit a simple linear regression model of present sales on precampaign sales

regmodel=lm(present.sales~precampaign.sales);

# Extract least squares estimates of intercept and slope from regression fit

B0hat=coef(regmodel)[1];

B1hat=coef(regmodel)[2];

# Regression estimator of population total

ytotal.reg=452*(B0hat+B1hat*population.mean.precampaign.sales);

# Variance of residuals from regression

sereg.sq=sum(resid(regmodel)^2)/(length(present.sales)-2);

# Standard error of regression estimator of total

standard.error.ytotal.reg=452*sqrt((1-(20/452))*sereg.sq/20);

# Approximate 95% confidence interval

lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg;

uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg;

# Standard total estimator that ignores precampaign sales

ybar=mean(present.sales);

ytotal.standard=452*ybar;

standard.error.ytotal.standard=452*sqrt((1-20/452)*var(present.sales)/20);

# Approximate 95% confidence interval

lci.standard=ytotal.standard-1.96*standard.error.ytotal.standard;

uci.standard=ytotal.standard+1.96*standard.error.ytotal.standard;

Method / Estimated Total / SE / 95% CI
Standard / 230,091 / 27,073 / (177,027, 283,154)
Ratio / 231,612 / 1,537 / (228,600, 234,624)
Regression / 231,582 / 1,475 / (228,690, 234,474)

The ratio and regression estimators are about equally efficient here, and both are much more efficient than the standard estimator.

III. Domain Estimation

Often we want separate estimates of means for subpopulations; the subpopulations are called domains. For example, we may take an SRS of visitors who fly to Philadelphia on September 18th and estimate the proportion of visitors who intend to stay longer than 1 week. For this survey, there are two domains of study: visitors from in-state and visitors from out-of-state. We do not know which domain a person in the population belongs to until that person is sampled, though. Thus, the number of persons in the sample who fall into each domain is a random variable, with value unknown at the time the survey is designed.

Suppose there are D domains. Let $U_d$ be the set of units in the population that are in domain $d$ and let $S_d$ be the set of units in the sample that are in domain $d$, for $d = 1, \ldots, D$. Let $N_d$ be the number of population units in $U_d$ and $n_d$ be the number of sample units in $S_d$. Suppose we want to estimate the population mean in domain $d$:

$$\bar{y}_{U_d} = \frac{1}{N_d}\sum_{i \in U_d} y_i.$$

A natural estimator of $\bar{y}_{U_d}$ is

$$\bar{y}_d = \frac{1}{n_d}\sum_{i \in S_d} y_i. \qquad (1.2)$$

(1.2) looks at first just like the sample mean we have studied for estimating the whole population mean. The quantity $n_d$ is a random variable, however: if a different simple random sample is taken, we will very likely have a different value for $n_d$. Different samples would have different numbers of out-of-state visitors. Technically, (1.2) is a ratio estimator. To see this, for each unit $i$ let

$$x_i = \begin{cases} 1 & \text{if } i \in U_d \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad u_i = y_i x_i.$$

Then $n_d = \sum_{i \in S} x_i = n\bar{x}$, and

$$\bar{y}_d = \frac{\sum_{i \in S} u_i}{\sum_{i \in S} x_i} = \frac{\bar{u}}{\bar{x}} = \hat{B}.$$

Because we are estimating a ratio, we use the formula from Notes 4 for calculating the standard error:

$$SE(\bar{y}_d) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}^2}\cdot\frac{\sum_{i \in S}(u_i - \hat{B}x_i)^2}{n-1}} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{(n_d - 1)\,s_d^2}{(n-1)\,n\,\bar{x}^2}},$$

where $s_d^2 = \frac{1}{n_d - 1}\sum_{i \in S_d}(y_i - \bar{y}_d)^2$ is the sample variance in domain d. (The second equality uses the fact that $\sum_{i \in S}(u_i - \hat{B}x_i)^2 = \sum_{i \in S_d}(y_i - \bar{y}_d)^2$.)

If the expected sample size in domain d is large enough, then we expect that $n_d - 1 \approx n_d$ and $n - 1 \approx n$, and since $\bar{x} = n_d/n$, we have the approximation

$$SE(\bar{y}_d) \approx \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_d^2}{n_d}}.$$

When the finite population correction factor is about 1, the above is just the standard error we would get if we assumed that we took a sample of fixed size $n_d$ from domain d. Thus, in a sufficiently large sample, the technicality that we are using a ratio estimator makes little difference in practice for estimating a domain mean.

The situation is a little more complicated when estimating a domain total. If $N_d$ is known, estimation is simple: use $\hat{t}_d = N_d\,\bar{y}_d$. If $N_d$ is unknown, though, we need to estimate it by $\hat{N}_d = N n_d/n$. Then

$$\hat{t}_d = \hat{N}_d\,\bar{y}_d = \frac{N}{n}\sum_{i \in S} u_i = N\bar{u}.$$

The standard error is

$$SE(\hat{t}_d) = N\sqrt{\left(1 - \frac{n}{N}\right)\frac{s_u^2}{n}},$$

where $s_u^2$ is the sample variance of the $u_i$.

Example 3.8 from book.
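Below is a minimal sketch of these domain calculations in R. The data are made up purely for illustration (a hypothetical SRS of 100 visitors from a population of 5,000, with x indicating the out-of-state domain); it simply demonstrates the formulas above.

# Hypothetical SRS of n=100 visitors from a population of N=5000
N=5000;
n=100;
set.seed(1);
x=rbinom(n,1,0.6); # x_i = 1 if visitor i is out-of-state (domain d), 0 otherwise
y=rnorm(n,mean=5+3*x,sd=2); # y_i = intended length of stay in days (hypothetical)
u=y*x; # u_i = y_i for domain members, 0 otherwise

# Domain mean: ratio estimator ubar/xbar, which equals the mean of y among sampled domain members
nd=sum(x);
ybar.d=mean(u)/mean(x);
se.ybar.d=sqrt((1-n/N)*var(y[x==1])/nd); # large-sample approximation to the standard error

# Domain total when N_d is unknown: N*ubar, with standard error based on the sample variance of the u_i
total.d=N*mean(u);
se.total.d=N*sqrt((1-n/N)*var(u)/n);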

IV. Simulation Studies

(a) Population in which the simple linear regression model approximately holds, the intercept of the regression line is approximately zero and the correlation is high (0.91):

Population regression:

Call:

lm(formula = popy ~ popx)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.47783 0.23496 -2.034 0.0425 *

popx 1.14619 0.02287 50.124 <2e-16 ***

For samples of size 50 from population of 500:

Estimator / Bias / Root Mean Squared Error
Sample Mean / 0.00 / 0.330
Ratio / 0.00 / 0.135
Regression / 0.00 / 0.135

Root Mean Squared Error (RMSE) = $\sqrt{E\left[(\text{estimator} - \text{population mean})^2\right]}$; in the simulations it is estimated by averaging the squared errors over the 50,000 simulated samples and taking the square root.

(b) Population in which there is a linear relationship between E(y|x) and log(x) and the correlation is moderately high (0.66)

Call:

lm(formula = popy ~ popx)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.404816 0.056785 24.74 <2e-16 ***

popx 0.109780 0.005569 19.71 <2e-16 ***

For samples of size 50 from population of 500:

Estimator / Bias / Root Mean Squared Error
Sample Mean / 0.00 / 0.046
Ratio / 0.00 / 0.052
Regression / 0.00 / 0.035

(c) Population in which there is a linear relationship between E(y|x) and exp(x) and the correlation is moderate (0.43)

Call:

lm(formula = popy ~ popx)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.28542 0.09928 -2.875 0.00421 **

popx 0.38301 0.03521 10.877 < 2e-16 ***

For samples of size 20 from population of 500:

Estimator / Bias / Root Mean Squared Error
Sample Mean / 0.00 / 0.969
Ratio / -0.04 / 0.899
Regression / -0.11 / 0.890

Code for simulation

# Simulation studies to compare ratio, regression and sample mean estimators

# Simple linear regression model

beta=1.1;

popx=rnorm(500,mean=10,sd=2);

popy=beta*popx+rnorm(500,mean=0,sd=1);

# Linear relationship between E(y|x) and log(x)

beta=1.1

popx=rnorm(500,mean=10,sd=2);

popy=beta*log(popx)+rnorm(500,mean=0,sd=.25);

# Linear relationship between E(y|x) and exp(x)

beta=.01;

popy=beta*exp(popx)+rnorm(500,mean=0,sd=1);

popxmean=mean(popx);

popymean=mean(popy);

samplesize=50; # set to 20 for population (c)

nosims=50000;

samplemean=rep(0,nosims);

ratioest=rep(0,nosims);

regest=rep(0,nosims);

for(i in 1:nosims){

tempsample=sample(1:500,samplesize,replace=FALSE);

y.sample=popy[tempsample];

x.sample=popx[tempsample];

samplemean[i]=mean(y.sample);

Bhat=mean(y.sample)/mean(x.sample);

ratioest[i]=Bhat*popxmean;

tempreg=lm(y.sample~x.sample);

B0hat=coef(tempreg)[1];

B1hat=coef(tempreg)[2];

regest[i]=B0hat+B1hat*popxmean;

}

bias.samplemean=mean(samplemean)-popymean;

bias.ratioest=mean(ratioest)-popymean;

bias.regest=mean(regest)-popymean;

bias.samplemean;

bias.ratioest;

bias.regest;

rmse.samplemean=sqrt(mean((samplemean-popymean)^2));

rmse.ratioest=sqrt(mean((ratioest-popymean)^2));

rmse.regest=sqrt(mean((regest-popymean)^2));

rmse.samplemean;

rmse.ratioest;

rmse.regest;

plot(popx,popy,xlab="x",ylab="y",main="Population");

# Population regression

popreg=lm(popy~popx);

summary(popreg);

V. Generalized Regression Estimator

Suppose the following model holds in the population:

$$y_i = \beta_0 + \beta_1 g(x_i) + \varepsilon_i, \quad E(\varepsilon_i \mid x_i) = 0, \quad \text{i.e.,} \quad E(y_i \mid x_i) = \beta_0 + \beta_1 g(x_i),$$

for a known function $g$. Then $\bar{y}_U \approx \beta_0 + \beta_1 \frac{1}{N}\sum_{i=1}^{N} g(x_i)$.

The generalized regression estimator is

$$\hat{\bar{y}}_{\text{greg}} = \hat{\beta}_0 + \hat{\beta}_1 \frac{1}{N}\sum_{i=1}^{N} g(x_i),$$

where $(\hat{\beta}_0, \hat{\beta}_1)$ is an estimate of $(\beta_0, \beta_1)$ based on regressing y on g(x) in the sample.

For simulation (c), consider the generalized regression estimator based on $g(x) = \exp(x)$, the model from which the population was generated.

Estimator / Bias / Root Mean Squared Error
Sample Mean / 0.00 / 0.969
Ratio / -0.04 / 0.899
Regression / -0.11 / 0.890
Generalized Regression / 0.00 / 0.175

The code for computing the generalized regression estimator in each sample (run inside the simulation loop above, with regest.exp=rep(0,nosims) initialized before the loop) is:

exp.x.sample=exp(x.sample);

tempreg.exp=lm(y.sample~exp.x.sample);

B0hat.exp=coef(tempreg.exp)[1];

B1hat.exp=coef(tempreg.exp)[2];

regest.exp[i]=B0hat.exp+B1hat.exp*mean(exp(popx));
