Outline for Class meeting 15 (Chapter 5, Lohr, 3/29/06)
Two-stage sampling: Optimizing the design parameters
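I. Estimators
A. The unbiased estimator of total is $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} M_i \bar{y}_i$, where N and n are the numbers of psu's in the population and in the sample, M_i is the number of ssu's in psu i, and $\bar{y}_i$ is the mean of the m_i ssu's sampled from psu i.
B. The ratio estimator of total is $\hat{t}_r = K \, \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}$, where $K = \sum_{i=1}^{N} M_i$ is the number of ssu's in the population.
C. The corresponding estimators of the mean simply divide each by K.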
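The two estimators can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes the standard forms t_unb = (N/n) Σ M_i ȳ_i and t_r = K Σ M_i ȳ_i / Σ M_i, the function name `two_stage_estimates` is hypothetical, and the toy numbers are made up rather than taken from any data set in these notes.

```python
def two_stage_estimates(N, K, sample):
    """Unbiased and ratio estimates from a two-stage cluster sample.

    N      : number of psu's in the population
    K      : number of ssu's in the population (sum of all M_i)
    sample : list of (M_i, ys) pairs, one per sampled psu, where ys holds
             the y-values of the m_i ssu's sampled from that psu
    """
    n = len(sample)
    # Estimated psu totals: t_hat_i = M_i * ybar_i
    t_hat = [M_i * sum(ys) / len(ys) for M_i, ys in sample]
    # Unbiased estimator: scale the sampled psu totals up by N/n
    t_unb = N / n * sum(t_hat)
    # Ratio estimator: estimated mean per ssu, then scaled up by K
    ybar_ratio = sum(t_hat) / sum(M_i for M_i, _ in sample)
    t_ratio = K * ybar_ratio
    return {
        "t_unb": t_unb,
        "t_ratio": t_ratio,
        "ybar_unb": t_unb / K,
        "ybar_ratio": ybar_ratio,
    }


# Made-up toy sample: n = 2 psu's of sizes 10 and 20 drawn from N = 10 psu's
toy = [(10, [1, 2, 3]), (20, [4, 6])]
est = two_stage_estimates(N=10, K=100, sample=toy)
print(est)
```

In the toy sample the two estimates differ noticeably because the sampled psu's happen to be larger than the population average of K/N = 10 ssu's per psu, which is the situation where the ratio form adjusts and the unbiased form does not.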
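II. Variance
A. The variance of the unbiased estimator is
$V(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}$,
where $S_t^2$ is the population variance of the psu totals and $S_i^2$ is the variance among the ssu's within psu i.
B. The (approximate) variance of the ratio estimator is
$V(\hat{t}_r) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S_r^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}$,
where
$S_r^2 = \frac{1}{N-1} \sum_{i=1}^{N} M_i^2 (\bar{y}_{iU} - \bar{y}_U)^2$, with $\bar{y}_{iU}$ the population mean of psu i and $\bar{y}_U$ the overall population mean.
C. To estimate these variances, substitute sample variances for the population variances in the expressions above, summing the within-psu term over the sampled psu's only. That is, calculate
$s_t^2$, $s_r^2$, and $s_i^2$.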
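As a rough sketch of the plug-in step, the Python below computes the three sample variances and the two estimated variances. It assumes the usual textbook forms: a between-psu term N²(1 − n/N)s²/n (with s_t² for the unbiased estimator and s_r² for the ratio estimator) plus a common within-psu term (N/n) Σ (1 − m_i/M_i) M_i² s_i²/m_i over the sampled psu's. The function name and the toy data are hypothetical.

```python
from statistics import variance


def estimated_variances(N, sample):
    """Plug-in variance estimates for the unbiased and ratio estimators of total.

    N      : number of psu's in the population
    sample : list of (M_i, ys) pairs; each ys needs at least two values
    """
    n = len(sample)
    M = [M_i for M_i, _ in sample]
    m = [len(ys) for _, ys in sample]
    ybar = [sum(ys) / len(ys) for _, ys in sample]
    t_hat = [M_i * yb for M_i, yb in zip(M, ybar)]      # estimated psu totals
    s2_i = [variance(ys) for _, ys in sample]           # within-psu variances
    ybar_r = sum(t_hat) / sum(M)                         # ratio estimate of the mean

    # s_t^2: variance of the estimated psu totals about their own mean
    s2_t = variance(t_hat)
    # s_r^2: variance of the estimated psu totals about M_i * ybar_r
    s2_r = sum((t - M_i * ybar_r) ** 2 for t, M_i in zip(t_hat, M)) / (n - 1)

    # Within-psu contribution, shared by both estimators
    within = N / n * sum(
        (1 - m_i / M_i) * M_i ** 2 * s2 / m_i for M_i, m_i, s2 in zip(M, m, s2_i)
    )
    fpc = N ** 2 * (1 - n / N) / n
    return {
        "s2_t": s2_t,
        "s2_r": s2_r,
        "within": within,
        "v_unb": fpc * s2_t + within,
        "v_ratio": fpc * s2_r + within,
    }


# Made-up toy sample: n = 2 psu's of sizes 10 and 20 drawn from N = 10 psu's
toy = [(10, [1, 2, 3]), (20, [4, 6])]
var_est = estimated_variances(N=10, sample=toy)
print(var_est)
```

Returning the within-psu term separately makes it easy to see that only the between-psu term differs between the two estimators.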
III. Comparing the two estimators
A. The unbiased estimator is best when the cluster totals have small variance; the ratio estimator is best when the cluster means have small variance. This is the same as for one-stage cluster designs.
B. However, the within-cluster contribution to the variance is the same regardless of the method, so the two estimators do not differ from each other as much as they do in a one-stage design.
IV. Optimal Design
There are more sample size decisions to make for two-stage than for one-stage sampling. How should we balance the cost of sampling first-stage units against the cost of sampling second-stage units?
A. A simplified analysis assumes equal-sized psu's and a design that selects an equal number of ssu's from each (m_i = m, M_i = M). In that case, the two estimators coincide, and the variance of the estimated mean can be expressed as
$V(\hat{\bar{y}}) = \left(1 - \frac{n}{N}\right) \frac{MSB}{nM} + \left(1 - \frac{m}{M}\right) \frac{MSW}{nm}$,
where MSB and MSW are the between and within mean squares (see p. 138). It is then typical to find the sample allocation that minimizes this variance subject to a fixed total cost C, where the cost function is $C = c_1 n + c_2 n m$. The solution to this problem (see Exercise 5.23) is to choose
$m = \sqrt{\frac{c_1 M (N - 1)(1 - R_a^2)}{c_2 (NM - 1) R_a^2}}$,
where $R_a^2$ is a measure of intra-cluster correlation similar to ICC, and to choose n as the number of psu's you can afford,
$n = \frac{C}{c_1 + c_2 m}$.
Another way to write the optimal (common) second-stage sample size is
$m = \sqrt{\frac{c_1 M \cdot MSW}{c_2 (MSB - MSW)}}$.
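The allocation rule can be checked numerically. The sketch below assumes the equal-size setup, the variance $(1 - n/N)\,MSB/(nM) + (1 - m/M)\,MSW/(nm)$ for the estimated mean, and the cost function C = c1·n + c2·n·m. It computes m from the mean squares and again from the adjusted R², so the two forms can be compared. All function names and input values are hypothetical; capping m at M, and taking m = M when MSB ≤ MSW, are choices made here because the variance is then decreasing in m.

```python
import math


def variance_of_mean(n, m, N, M, msb, msw):
    """Variance of the estimated mean with equal psu sizes M and subsample size m."""
    return (1 - n / N) * msb / (n * M) + (1 - m / M) * msw / (n * m)


def adjusted_r2(N, M, msb, msw):
    """R_a^2 = 1 - MSW / S^2, with S^2 the total mean square of the population."""
    s2 = ((N - 1) * msb + N * (M - 1) * msw) / (N * M - 1)
    return 1 - msw / s2


def optimal_allocation(c1, c2, M, msb, msw, budget):
    """Return (m, n) minimizing the variance for total cost c1*n + c2*n*m = budget."""
    if msb <= msw:
        m_opt = float(M)  # variance keeps falling as m grows, so subsample fully
    else:
        m_opt = min(float(M), math.sqrt(c1 * M * msw / (c2 * (msb - msw))))
    n_opt = budget / (c1 + c2 * m_opt)  # number of psu's the budget allows
    return m_opt, n_opt


# Hypothetical inputs, chosen only to exercise the formulas
c1, c2, N, M, msb, msw, budget = 10.0, 2.0, 100, 40, 50.0, 10.0, 500.0
m_opt, n_opt = optimal_allocation(c1, c2, M, msb, msw, budget)

# Same m from the R_a^2 form of the formula
ra2 = adjusted_r2(N, M, msb, msw)
m_from_r2 = math.sqrt(c1 * M * (N - 1) * (1 - ra2) / (c2 * (N * M - 1) * ra2))

print(m_opt, n_opt, m_from_r2)
```

In practice m and n would be rounded to integers, and the variance compared at the neighbouring integer values of m.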
B. When Mi’s are not constant, what to do?
In practice, there is usually little loss in precision from choosing a self-weighting design. So choose the second-stage sample size as above.
This generally works OK for either the unbiased or the ratio estimator.
Example: Find a good design for the book replacement value survey (#6, p. 170)
Suppose c1 = 10; c2 = 2;
data;
input SHELF NUMBER PURCHASE REPLACE;
cards;
2 3 1 13
2 5 1 13
etc.
43 6 1 4
43 12 1 10
43 16 1 7
43 24 3 6
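;
* One-way ANOVA of replacement cost on shelf: Model MS is the between-shelf mean square, Error MS the within-shelf mean square;
proc glm;
class shelf;
model replace = shelf;
run;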
The GLM Procedure
Class Level Information
Class     Levels    Values
SHELF         12    2 4 11 14 20 22 23 31 37 38 40 43
Number of observations 60
The GLM Procedure
Dependent Variable: REPLACE
                          Sum of
Source        DF         Squares     Mean Square    F Value    Pr > F
Model         11     25570.98333      2324.63485       4.76    <.0001
Error         48     23445.20000       488.44167
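One way to turn this ANOVA table into a design is sketched below. It treats the sample Model and Error mean squares as stand-ins for MSB and MSW, uses c1 = 10 and c2 = 2 from the example, and plugs them into m = sqrt(c1·M·MSW / (c2·(MSB − MSW))). The shelf size is not given in these notes, so `M_bar = 30` is purely a placeholder assumption; substitute the actual average number of books per shelf.

```python
import math

# Sums of squares and degrees of freedom from the PROC GLM output
ss_model, df_model = 25570.98333, 11
ss_error, df_error = 23445.20000, 48

msb = ss_model / df_model   # between-shelf mean square (Model)
msw = ss_error / df_error   # within-shelf mean square (Error)

# Sample adjusted R^2: 1 - MSW / (total SS / total df)
s2_total = (ss_model + ss_error) / (df_model + df_error)
r2_adj = 1 - msw / s2_total

c1, c2 = 10, 2   # costs per shelf and per book, from the example
M_bar = 30       # ASSUMPTION: placeholder average shelf size, not from the notes

# Optimal number of books to sample per shelf under these assumptions
m_opt = math.sqrt(c1 * M_bar * msw / (c2 * (msb - msw)))

print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, adjusted R^2 = {r2_adj:.3f}")
print(f"m_opt = {m_opt:.2f} books per shelf (for the assumed M_bar = {M_bar})")
```

With the placeholder shelf size this suggests sampling roughly six books per shelf; the number of shelves then follows from n = C / (c1 + c2·m) once a total budget C is fixed.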