Stat 475 Notes 6
Reading: Lohr, Chapter 3.3, 4.1-4.2
Corrections for Note 4 Addendum
A first order Taylor expansion of is
We approximate by the variance of the right hand side of the above:
I. Ratio and regression estimation review
Ratio and regression estimation are useful for estimating the mean or population total of a variable y when there is an auxiliary variable x for which we know the population total and that it is highly correlated with y.
The intuition is that when x and y are highly correlated, by comparing the sample mean of x to the population mean of x, we can predict whether the sample mean of y is likely to be an overestimate and underestimate of the population mean of y and we can adjust for this.
Steps in using ratio and regression estimation:
1. Plot the data. Fit a simple linear regression model. Make a residual plot.
2. Based on the residual plot, decide whether the simple linear regression model is a reasonable model for the data. If the simple linear regression model is not a reasonable model for the data, then both the ratio and regression estimator could have large bias. An estimator based on a more appropriate regression model can be considered (see end of notes).
3. If the simple linear regression model is a reasonable model for the data, check whether the regression line approximately goes through the origin, e.g., by testing whether the intercept equals 0. If the regression line approximately goes through the origin, either the ratio or regression estimator can be used. If the regression line does not approximately go through the origin, then the regression estimator should be used.
II. Example of ratio and regression estimation:
An advertising firm is concerned about the effect of a new regional promotional campaign on the total dollar sales for a particular product. A simple random sample of stores is drawn from then regional stores in which the product is sold. Quarterly sales data are obtained for the current 3-month period and the 3-month period prior to the campaign. The total sales among all 425 stores for the 3-month period prior to the campaign is known to be 216,256. Use the data to estimate the total sales for the current period and a 95% confidence interval for the total sales for the current period.
Plot the data:
precampaign.sales=c(208,400,440,259,351,880,273,487,183,863,599,510,828,473,924,110,829,257,388,244);
present.sales=c(239,428,472,276,363,942,294,514,195,897,626,538,888,510,998,171,889,265,419,257);
plot(precampaign.sales,present.sales);
Precampaign sales and present sales are highly correlated:
cor(precampaign.sales,present.sales)
[1] 0.9985922
Fit a simple linear regression model:
regmodel=lm(present.sales~precampaign.sales);
summary(regmodel);
Call:
lm(formula = present.sales ~ precampaign.sales)
Residuals:
Min 1Q Median 3Q Max
-19.037 -7.918 -2.345 8.252 45.423
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.10523 7.08325 1.427 0.171
precampaign.sales 1.04975 0.01314 79.871 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.93 on 18 degrees of freedom
Multiple R-squared: 0.9972, Adjusted R-squared: 0.997
F-statistic: 6379 on 1 and 18 DF, p-value: < 2.2e-16
Residual plot:
plot(precampaign.sales,resid(regmodel),xlab="Precampaign Sales",ylab="Residual",main="Residual plot");
abline(0,0); # Draws a horizontal line at 0
The simple linear regression model appears to be a reasonable model for the data.
The p-value for testing whether the intercept of the regression line is 0 is 0.171, so there is no evidence that the intercept is not zero. It is reasonable to use either the ratio or the regression estimator.
Comparison of ratio, regression and standard estimators
# Ratio estimator
# Population mean of precampaign sales
population.mean.precampaign.sales=216256/452;
# Sample estimate of ratio
Bhat=mean(present.sales)/mean(precampaign.sales)
# Ratio estimator of population total
ytotal.ratio=452*Bhat*population.mean.precampaign.sales;
# s_e^2 needed for estimating standard error of ratio estimator of population total
se.ratio.sq=sum((present.sales-Bhat*precampaign.sales)^2)/(length(present.sales)-1);
# Standard error of ratio estimator of total
standard.error.ytotal.ratio=452*sqrt((1-20/452)*se.ratio.sq/20);
# Approximate 95% confidence interval
lci.ratio=ytotal.ratio-1.96*standard.error.ytotal.ratio;
uci.ratio=ytotal.ratio+1.96*standard.error.ytotal.ratio;
# Regression estimator
# Fit a simple linear regression model of present sales on precampaign sales
regmodel=lm(present.sales~precampaign.sales);
# Extract least squares estimates of intercept and slope from regression fit
B0hat=coef(regmodel)[1];
B1hat=coef(regmodel)[2];
# Regression estimator of population total
ytotal.reg=452*(B0hat+B1hat*population.mean.precampaign.sales);
# Variance of residuals from regression
sereg.sq=sum(resid(regmodel)^2/(length(present.sales)-2));
# Standard error of regression estimator of total
standard.error.ytotal.reg=452*sqrt((1-(20/452))*sereg.sq/20); # Standard error of regression estimator
# Approximate 95% confidence interval
lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg;
uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg;
# Standard total estimator that ignores precampaign sales
ybar=mean(present.sales);
ytotal.standard=452*ybar;
standard.error.ytotal.standard=452*sqrt((1-20/452)*var(present.sales)/20);
# Approximate 95% confidence interval
lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg;
uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg;
Method / Estimated Total / SE / 95% CIStandard / 230,091 / 27,073 / (177,027,
283,154)
Ratio / 231,612 / 1,537 / (228,600,
234,624)
Regression / 231,582 / 1,475 / (228,690,
234,474)
The ratio and regression estimators are about equally as efficient and are much more efficient than the standard estimator.
III. Domain Estimation
Often we want separate estimates of means for subpopulations; the subpopulations are called domains. For example, we may want to take a SRS of visitors who fly to Philadelphia on September 18th and to estimate the proportion of visitors who intend to stay longer than 1 week. For this survey, there are two domains of study: visitors from in-state and visitors from out-of-state. We do not know which persons in the population to which domain until they are sampled, though. Thus, the number of persons in the sample who fall into each domain is a random variable, with value unknown at the time the survey is designed.
Suppose there are D domains. Let be the set of units in the population that are in domain dand let be the set of units in the sample that are in domain d for . Let be the number of population units in and be the number of sample units in . Suppose we want to estimate the population mean in domain d:
A natural estimator of is
(1.2)
(1.2)looks at first just like the sample mean we have studied for estimating the whole population mean. The quantity is a random variable, however: If a different simple random sample is taken, we will very likely to have a different value for . Different samples would have different numbers of out-of-state visitors. Technically, (1.2) is a ratio estimate. To see this, let
Then , and
Because we are estimating a ratio, we use the formula from Notes 4 for calculating the standard error:
,
where is the sample variance in domain d.
If the expected sample size in domain d is large enough, then we expect that and and have the approximation
When the finite population correction factor is about 1, the above is just the standard error we would get if we assumed that we took a sample of fixed size on the domain D. Thus, in a sufficiently large sample, the technicality that we are using a ratio estimator makes little difference in practice for estimating a domain mean.
The situation is a little more complicated when estimating a domain total. If is known, estimation is simple: use . If is unknown thought, we need to estimate it by . Then
The standard error is
.
Example 3.8 from book.
IV. Simulation Studies
(a) Population in which the simple linear regression model approximately holds, the intercept of the regression line is approximately zero and the correlation is high (0.91):
Population regression:
Call:
lm(formula = popy ~ popx)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.47783 0.23496 -2.034 0.0425 *
popx 1.14619 0.02287 50.124 <2e-16 ***
For samples of size 50 from population of 500:
Bias / Root Mean Squared ErrorSample Mean / 0.00 / 0.330
Ratio / 0.00 / 0.135
Regression / 0.00 / 0.135
Root Mean Squared Error (RMSE) =
(b) Population in which there is a linear relationship between E(y|x) and log(x) and the correlation is moderately high (0.66)
Call:
lm(formula = popy ~ popx)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.404816 0.056785 24.74 <2e-16 ***
popx 0.109780 0.005569 19.71 <2e-16 ***
For samples of size 50 from population of 500:
Bias / Root Mean Squared ErrorSample Mean / 0.00 / 0.046
Ratio / 0.00 / 0.052
Regression / 0.00 / 0.035
(c) Population in which there is a linear relationship between E(y|x) and exp(x) and the correlation is moderate (0.43)
Call:
lm(formula = popy ~ popx)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.28542 0.09928 -2.875 0.00421 **
popx 0.38301 0.03521 10.877 < 2e-16 ***
For samples of size 20 from population of 500:
Bias / Root Mean Squared ErrorSample Mean / 0.00 / 0.969
Ratio / -0.04 / 0.899
Regression / -0.11 / 0.890
Code for simulation
# Simulation studies to compare ratio, regression and sample mean estimators
# Simple linear regression model
beta=1.1;
popx=rnorm(500,mean=10,sd=2);
popy=beta*popx+rnorm(500,mean=0,sd=1);
# Linear relationship between E(y|x) and log(x)
beta=1.1
popx=rnorm(500,mean=10,sd=2);
popy=beta*log(popx)+rnorm(500,mean=0,sd=.25);
# Linear relationship between E(y|x) and exp(x)
beta=.01;
popy=beta*exp(popx)+rnorm(500,mean=0,sd=1);
popxmean=mean(popx);
popymean=mean(popy);
samplesize=50;
nosims=50000;
samplemean=rep(0,nosims);
ratioest=rep(0,nosims);
regest=rep(0,nosims);
for(i in 1:nosims){
tempsample=sample(1:500,samplesize,replace=FALSE);
y.sample=popy[tempsample];
x.sample=popx[tempsample];
samplemean[i]=mean(y.sample);
Bhat=mean(y.sample)/mean(x.sample);
ratioest[i]=Bhat*popxmean;
tempreg=lm(y.sample~x.sample);
B0hat=coef(tempreg)[1];
B1hat=coef(tempreg)[2];
regest[i]=B0hat+B1hat*popxmean;
}
bias.samplemean=mean(samplemean)-popymean;
bias.ratioest=mean(ratioest)-popymean;
bias.regest=mean(regest)-popymean;
bias.samplemean;
bias.ratioest;
bias.regest;
rmse.samplemean=sqrt(mean((samplemean-popymean)^2));
rmse.ratioest=sqrt(mean((ratioest-popymean)^2));
rmse.regest=sqrt(mean((regest-popymean)^2));
rmse.samplemean;
rmse.ratioest;
rmse.regest;
plot(popx,popy,xlab="x",ylab="y",main="Population");
# Population regression
popreg=lm(popy~popx);
summary(popreg);
V. Generalized Regression Estimator
Suppose the following model holds in the population:
, i.e.,
Then .
The generalized regression estimator is
where is an estimate of based on the sample.
For simulation (c), consider the generalized regression estimator based on , the model from which the population was generated.
Bias / Root Mean Squared ErrorSample Mean / 0.00 / 0.969
Ratio / -0.04 / 0.899
Regression / -0.11 / 0.890
Generalized Regression / 0.00 / 0.175
The code for estimating the generalized regression estimator in each sample is:
exp.x.sample=exp(x.sample);
tempreg.exp=lm(y.sample~exp.x.sample);
B0hat.exp=coef(tempreg.exp)[1];
B1hat.exp=coef(tempreg.exp)[2];
regest.exp[i]=B0hat.exp+B1hat.exp*mean(exp(popx));
1