Statistics 475 Notes 13

Reading: Lohr, Chapters 5.3-5.4

I. Two-stage cluster sampling

In one-stage cluster sampling, we examine all secondary sampling units (ssu’s) within the selected clusters (primary sampling units). In many situations, though, the units in a cluster may be so similar that examining all units within a cluster wastes resources; alternatively, it may be expensive to measure units in a cluster relative to the cost of sampling clusters. In these situations, taking a subsample within each cluster may be cheaper and more efficient.

In two-stage cluster sampling, we

  1. Select a simple random sample (SRS) of n clusters from the population of N clusters.
  2. Select an SRS of m_i ssu's from each selected cluster. The SRS of ssu's from the ith cluster is denoted by S_i.
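The two stages above can be sketched in R. This is a minimal illustration with a made-up population; the values of N, n, and the cluster sizes here are hypothetical, not from the downtime example below:

```r
# Hypothetical population: N clusters with varying sizes (made-up numbers)
set.seed(475)
N <- 20
cluster.sizes <- sample(40:70, N, replace=TRUE)  # Mi for each cluster

# Stage 1: SRS of n clusters from the N clusters
n <- 5
sampled.clusters <- sample(1:N, n)

# Stage 2: within each sampled cluster i, an SRS of mi ssu's
# (here roughly 20% of each cluster, as in the downtime example)
ssu.samples <- lapply(sampled.clusters, function(i) {
  mi <- ceiling(0.2 * cluster.sizes[i])
  sample(1:cluster.sizes[i], mi)  # labels of the sampled ssu's
})

length(ssu.samples)  # 5: one SRS of ssu labels per sampled cluster
```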

Examples:

(1) The Gallup poll for presidential elections samples approximately 300 election districts from around the U.S. At the second stage, this poll randomly selects approximately five households per district.

(2) Sampling for quality control purposes often involves two stages of sampling. For example, when an investigator samples packaged products, such as frozen foods, he or she commonly samples cartons and then samples packages from within cartons. When sampling requires the detailed investigation of components of products, such as measuring plate thickness in automobile batteries, a quite natural procedure is to sample some of the products (batteries) and then sample components (plates) within these products.

Data example: A garment manufacturer has 90 plants located throughout the U.S. and wants to estimate the average number of hours that the sewing machines were down for repairs in the past three months. Because the plants are widely scattered, the manufacturer's statistician decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant contains many machines, and checking the repair record for each machine would be time consuming. Therefore, she uses two-stage sampling. Enough time and money are available to sample 10 plants and approximately 20% of the machines in each plant. The manufacturer knows she has a combined total of 4500 machines in all plants.

Plant  M_i  m_i  Downtime (hours)             ybar_i  s_i^2
1      50   10   5,7,9,0,11,2,8,4,3,5          5.40   11.38
2      65   13   4,3,7,2,11,0,1,9,4,3,2,1,5    4.00   10.67
3      45    9   5,6,4,11,12,0,1,8,4           5.67   16.75
4      48   10   6,4,0,1,0,9,8,4,6,10          4.80   13.29
5      52   10   11,4,3,1,0,2,8,6,5,3          4.30   11.12
6      58   12   12,11,3,4,2,0,0,1,4,3,2,4     3.83   14.88
7      42    8   3,7,6,7,8,4,3,2               5.00    5.14
8      66   13   3,6,4,3,2,2,8,4,0,4,5,6,3     3.85    4.31
9      40    8   6,4,7,3,9,4,1,5               4.88    6.13
10     56   11   6,7,5,10,11,2,1,4,0,5,4       5.00   11.80

(M_i = number of machines in plant i, m_i = number of machines sampled from plant i, ybar_i = sample mean downtime, s_i^2 = sample variance of downtime.)

Unbiased estimates of the population total and population mean:

In one-stage cluster sampling, we could estimate the population total by \hat{t} = (N/n) \sum_{i \in S} t_i; the cluster totals t_i were known because we sampled every secondary sampling unit in the cluster. In two-stage cluster sampling, however, since we do not observe every ssu in the sampled clusters, we need to estimate the individual cluster totals by

\hat{t}_i = M_i \bar{y}_i = \frac{M_i}{m_i} \sum_{j \in S_i} y_{ij}

and an unbiased estimator of the population total is

\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} \hat{t}_i.

In two-stage sampling, the \hat{t}_i's are random variables. Consequently, the variance of \hat{t}_{unb} has two components: (1) the variability between clusters; and (2) the variability of the ssu's within clusters. We do not have to worry about component (2) in one-stage cluster sampling.

The variance of \hat{t}_{unb} equals the variance of \hat{t} from one-stage cluster sampling plus an extra term because the \hat{t}_i's estimate the cluster totals. For two-stage cluster sampling,

V(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}     (1.1)

where S_t^2 is the population variance of the cluster totals and S_i^2 is the population variance among elements within cluster i. The first term in (1.1) is the variance from one-stage cluster sampling and the second term is the additional variance due to subsampling.

To estimate V(\hat{t}_{unb}), we substitute the sample estimates

s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2 \quad \text{and} \quad s_i^2

for S_t^2 and S_i^2 respectively:

\hat{V}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} + \frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i}.

The standard error is of course the square root of \hat{V}(\hat{t}_{unb}).

If we know the number of units in the population, K = \sum_{i=1}^{N} M_i, we can estimate the population mean by

\hat{\bar{y}}_{unb} = \hat{t}_{unb} / K

with standard error

SE(\hat{\bar{y}}_{unb}) = SE(\hat{t}_{unb}) / K.

R code:

Mi=c(50,65,45,48,52,58,42,66,40,56);

mi=c(10,13,9,10,10,12,8,13,8,11);

yibar=c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5);

sisq=c(11.38,10.67,16.75,13.29,11.12,14.88,5.14,4.31,6.13,11.8);

N=90;

n=10;

K=4500;

that.i=Mi*yibar;

that.unb=(N/n)*sum(Mi*yibar);

stsq=sum((that.i-that.unb/N)^2)/(n-1);

varhat.that.unb=N^2*(1-n/N)*stsq/n+(N/n)*sum((1-mi/Mi)*Mi^2*(sisq/mi));

se.that.unb=sqrt(varhat.that.unb);

yhat.unb=that.unb/K;

se.yhat.unb=se.that.unb/K;

yhat.unb

[1] 4.80118

se.yhat.unb

[1] 0.192594

Thus, we estimate that the average sewing machine was down for repair for 4.8 hours in the past three months, and an approximate 95% confidence interval for this average is

4.80 ± 1.96 × 0.193 = (4.42, 5.18).
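The interval can be computed directly from the point estimate and standard error produced by the R code above (the numeric values are copied from that output):

```r
# Point estimate and standard error from the output above
yhat.unb <- 4.80118
se.yhat.unb <- 0.192594

# Approximate 95% CI: estimate +/- 1.96 * SE
ci <- yhat.unb + c(-1, 1) * 1.96 * se.yhat.unb
round(ci, 2)  # 4.42 5.18
```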

Ratio Estimation: As with one-stage cluster sampling, we can also use a ratio estimator for estimating the population mean, where the auxiliary variables are the cluster sizes M_i.

The ratio estimates of the population mean and population total are:

\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i}, \qquad \hat{t}_r = K \hat{\bar{y}}_r.

These have the same form as in one-stage cluster sampling; the difference is that in two-stage cluster sampling we need to estimate the cluster totals by \hat{t}_i = M_i \bar{y}_i.

Using a Taylor series approximation, the variance of the ratio estimator is estimated by

\hat{V}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}^2} \left[ \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n} + \frac{1}{nN} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i} \right]

where

s_r^2 = \frac{1}{n-1} \sum_{i \in S} \left(\hat{t}_i - M_i \hat{\bar{y}}_r\right)^2

and \bar{M} is the average cluster size; either the population average or the sample average can be used in the estimate of the variance.

# Ratio estimate

ybarhat.r=sum(that.i)/sum(Mi);

Mbar=mean(Mi);

srsq=sum((Mi*yibar-Mi*ybarhat.r)^2)/(n-1);

varhat.ybarhat.r=(1/Mbar^2)*((1-n/N)*srsq/n+(1/(n*N))*sum(Mi^2*(1-mi/Mi)*sisq/mi));

se.ybarhat.r=sqrt(varhat.ybarhat.r);

ybarhat.r

[1] 4.598831

se.ybarhat.r

[1] 0.2220061

In this example, the ratio estimator has a higher standard error than the unbiased estimator. However, if the population size K were unknown, then we would have to use the ratio estimator. Furthermore, the ratio estimator will be better than the unbiased estimator when the cluster sizes vary considerably and the cluster totals are roughly proportional to the cluster sizes M_i.

When the cluster sizes are all the same, the ratio estimator and the unbiased estimator are the same.
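A quick numerical check of this fact, using made-up values (four equal-sized clusters; the N, n, and cluster sample means below are hypothetical, not from the downtime example):

```r
# Made-up example: all clusters the same size
Mi <- rep(50, 4)                 # equal cluster sizes (hypothetical)
yibar <- c(5.1, 4.2, 6.0, 3.7)   # hypothetical cluster sample means
N <- 40; n <- 4
K <- N * 50                      # total number of ssu's in the population

that.i <- Mi * yibar             # estimated cluster totals
unb <- (N/n) * sum(that.i) / K   # unbiased estimator of the mean
rat <- sum(that.i) / sum(Mi)     # ratio estimator of the mean
c(unb, rat)                      # 4.75 4.75 -- identical
```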

II. Sampling weights for cluster samples

Recall from Chapter 4.3 that the sampling weight for an observation in a sample is the inverse of the probability that the observation is selected for the sample; it equals the number of units in the population that the observation represents.

For two-stage cluster sampling, the probability that the jth ssu in the ith cluster is selected is

P(cluster i selected) × P(ssu j selected | cluster i selected) = \frac{n}{N} \cdot \frac{m_i}{M_i}.

Thus, the sampling weight for the jth unit in the ith cluster is

w_{ij} = \frac{N M_i}{n m_i}.

We have

\hat{t}_{unb} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}

and

\hat{\bar{y}} = \frac{\sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij}}{\sum_{i \in S} \sum_{j \in S_i} w_{ij}}.

In two-stage cluster sampling, for each ssu in the sample to represent the same number of ssu's in the population, m_i needs to be proportional to M_i so that w_{ij} is constant (a self-weighting design). This is approximately true in the above example on downtime of sewing machines.
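As a check on the weight formula, the weights for the sewing-machine example (computed from the M_i and m_i given earlier) reproduce the unbiased estimate of the population total, and they are all close to 45, so the design is approximately self-weighting:

```r
# Summary data from the sewing-machine example above
Mi <- c(50,65,45,48,52,58,42,66,40,56)
mi <- c(10,13,9,10,10,12,8,13,8,11)
yibar <- c(5.4,4,5.67,4.8,4.3,3.83,5,3.85,4.88,5)
N <- 90; n <- 10

# Weight shared by every sampled ssu in cluster i: N*Mi/(n*mi)
wi <- (N * Mi) / (n * mi)
range(wi)  # all weights are close to 45: approximately self-weighting

# Sum of w_ij*y_ij over the sample = sum over clusters of wi*mi*yibar;
# this equals the unbiased estimator (N/n) * sum(Mi * yibar)
sum(wi * mi * yibar)
(N/n) * sum(Mi * yibar)
```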

III. Analysis of cluster samples using the survey package in R

For using the survey package in R to analyze cluster samples, we need to create a variable that specifies which cluster each observation in the sample belongs to, a variable which gives a unique number to each ssu, a variable that gives the sampling weight of each observation and variables that give the overall number of clusters in the population and the cluster size of each cluster in the sample.

# Load the survey package

library(survey);

# Responses

downtime=c(5,7,9,0,11,2,8,4,3,5,4,3,7,2,11,0,1,9,4,3,2,1,5,5,6,4,11,12,0,1,8,4,6,4,0,1,0,9,8,4,6,10,11,4,3,1,0,2,8,6,5,3,12,11,3,4,2,0,0,1,4,3,2,4,3,7,6,7,8,4,3,2,3,6,4,3,2,2,8,4,0,4,5,6,3,6,4,7,3,9,1,4,5,6,7,5,10,11,2,1,4,0,5,4)

# Cluster membership (plant) for each sampled machine

plant=c(rep(1,10),rep(2,13),rep(3,9),rep(4,10),rep(5,10),rep(6,12),rep(7,8),rep(8,13),rep(9,8),rep(10,11));

# Number of clusters

fpc1=rep(90,length(plant));

# Size of cluster for each ssu

fpc2=c(rep(50,10),rep(65,13),rep(45,9),rep(48,10),rep(52,10),rep(58,12),rep(42,8),rep(66,13),rep(40,8),rep(56,11));

# Sampling weights

wght=(fpc1*fpc2)/(10*c(rep(10,10),rep(13,13),rep(9,9),rep(10,10),rep(10,10),rep(12,12),rep(8,8),rep(13,13),rep(8,8),rep(11,11)));

# Unique id for each ssu

machine.id=seq(1,length(downtime),1);

# Data frame that contains the information for the survey design

garment.dataframe=data.frame(machine.id,plant,fpc1,fpc2,wght);

# Specify the sample design

garment.design=svydesign(id=~plant+machine.id,weights=~wght,data=garment.dataframe,fpc=~fpc1+fpc2);

# Estimate the population mean of downtime using the ratio estimator

svymean(downtime,design=garment.design);

mean SE

[1,] 4.598 0.2219

The survey package automatically uses the ratio estimator for estimating the population mean for cluster samples. The survey package uses \hat{t} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij} for estimating the population total:

> svytotal(downtime,design=garment.design)

total SE

[1,] 21602 866.64

IV. Designing a cluster sample

When designing a cluster sample, you need to decide four major issues:

(1) What overall precision (standard error) is needed?

(2) What size should the clusters be?

(3) How many ssu's should be sampled in each cluster selected for the sample?

(4) How many clusters should be sampled?

Issue (1) is faced in any survey design. For issue (2), there are often natural clusters.

We will focus on issues (3)-(4) in the case of equal cluster sizes (in which case \hat{\bar{y}}_{unb} = \hat{\bar{y}}_r). Let M_1 = \cdots = M_N = M be the common cluster size and assume we will sample the same number m of ssu's from each sampled cluster.

The ANOVA decomposition can be used to write

V(\hat{\bar{y}}_{unb}) = \left(1 - \frac{n}{N}\right) \frac{MSB}{nM} + \left(1 - \frac{m}{M}\right) \frac{MSW}{nm}

where MSB and MSW are the between and within mean squares from the ANOVA table (see Notes 12; Lohr, pg. 138).

Consider the simple cost function

C = c_1 n + c_2 n m

where c_1 is the fixed cost of sampling each cluster (not including the cost of measuring ssu's) and c_2 is the additional cost of measuring each ssu. Using calculus, one can determine that the values

m_{opt} = \sqrt{\frac{c_1 M \cdot MSW}{c_2 (MSB - MSW)}}, \qquad n_{opt} = \frac{C}{c_1 + c_2 m_{opt}}

minimize V(\hat{\bar{y}}_{unb}) for fixed total cost C.

Example: An inspector samples cans from a truckload of canned creamed corn to estimate the average number of worm fragments per can. The truck has 580 cases; each case contains 24 cans. The inspector samples 12 cases at random and subsamples 3 cans randomly from each selected case.

case=rep(seq(1,12,1),each=3);

can=seq(1,36,1);

frag=c(1,5,7,4,2,4,0,1,2,3,6,6,4,9,8,0,7,3,5,5,1,3,0,2,7,3,5,3,1,4,4,7,9,0,0,0);

(a) Find an approximate 95% confidence interval for the average number of worm fragments per can.

# Cluster sizes

clustersize=rep(24,36);

noclusters=rep(580,36);

# Sampling weights

N=580;

n=12;

mi=rep(3,36);

sampwght=(N*clustersize)/(n*mi);

corn.design.dataframe=data.frame(case,can,clustersize,noclusters,sampwght);

corn.design=svydesign(id=~case+can,weights=~sampwght,data=corn.design.dataframe,fpc=~noclusters+clustersize);

svymean(frag,design=corn.design);

mean SE

[1,] 3.6389 0.6102

Thus, an approximate 95% confidence interval for the average number of worm fragments per can is

3.64 ± 1.96 × 0.610 = (2.44, 4.83).

(b) Suppose a new truckload is to be inspected and it is thought to be similar to this one. It takes 20 minutes to locate and open a case, and 8 minutes to locate and examine each specified can within a case. How many cans should be examined per case? How many cases? Assume your budget is 120 minutes.

To find the MSB and MSW, we can use the one-way analysis of variance.

> aov.frag=aov(frag~as.factor(case)); # as.factor(case) makes case into a categorical variable with categories 1,...,12

> summary(aov.frag)

Df Sum Sq Mean Sq F value Pr(>F)

as.factor(case) 11 149.639 13.604 3.0045 0.01172 *

Residuals 24 108.667 4.528

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Thus, we estimate MSB=13.60 and MSW=4.53.

Then,

m_{opt} = \sqrt{\frac{c_1 M \cdot MSW}{c_2 (MSB - MSW)}} = \sqrt{\frac{(20)(24)(4.53)}{(8)(13.60 - 4.53)}} \approx 5.47.

Round up to 6 cans per case.

Then,

n_{opt} = \frac{C}{c_1 + c_2 m} = \frac{120}{20 + 8(6)} \approx 1.76.

Round up and sample 2 cases.

The total cost would then be 2 × (20 + 8 × 6) = 136 minutes, which is over budget. Using 5 cans per case instead keeps us within budget: 2 × (20 + 8 × 5) = 120 minutes.
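The optimal allocation calculation above can be carried out in R (MSB and MSW taken from the ANOVA output; c1, c2, M, and the budget as given in part (b)):

```r
MSB <- 13.604; MSW <- 4.528   # from the ANOVA output above
c1 <- 20; c2 <- 8             # minutes per case, minutes per can
M <- 24                       # cans per case
C <- 120                      # budget in minutes

# Optimal number of cans to subsample per case
m.opt <- sqrt((c1 * M * MSW) / (c2 * (MSB - MSW)))
m <- ceiling(m.opt)           # round up: 6 cans per case

# Number of cases affordable at a cost of c1 + c2*m per case
n <- ceiling(C / (c1 + c2 * m))   # 2 cases

c(m.opt, m, n)
```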
