Outline for Class meeting 15 (Chapter 5, Lohr, 3/29/06)

Two-stage sampling: Optimizing the design parameters

I. Estimators
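A. The unbiased estimator of total is

$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} M_i \bar{y}_i$$

where N and n are the numbers of psu's in the population and in the sample, M_i is the number of ssu's in psu i, and \bar{y}_i is the sample mean of the m_i ssu's selected from psu i.

B. The ratio estimator of total is

$$\hat{t}_r = K \, \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}$$

C. The corresponding estimators of the mean simply divide each by K, the total number of ssu's in the population.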

II. Variance

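A. The variance of the unbiased estimator is

$$V(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}$$

B. The (approximate) variance of the ratio estimator is

$$V(\hat{t}_r) \approx N^2 \left(1 - \frac{n}{N}\right) \frac{S_r^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{S_i^2}{m_i}$$

where

$$S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2, \qquad S_r^2 = \frac{1}{N-1} \sum_{i=1}^{N} M_i^2 \left(\bar{y}_{iU} - \bar{y}_U\right)^2,$$

and S_i^2 is the variance among the ssu's within psu i.

C. To estimate these variances, substitute sample variances for the population variances in the expressions above, and sum the within-psu term over the sampled psu's only. That is, calculate

$$s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(M_i \bar{y}_i - \frac{\hat{t}_{unb}}{N}\right)^2, \qquad s_r^2 = \frac{1}{n-1} \sum_{i \in S} M_i^2 \left(\bar{y}_i - \hat{\bar{y}}_r\right)^2, \qquad s_i^2 = \frac{1}{m_i - 1} \sum_{j \in S_i} \left(y_{ij} - \bar{y}_i\right)^2,$$

where \hat{\bar{y}}_r = \hat{t}_r / K.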

III. Comparing the two estimators

A. The unbiased estimator is best when the cluster totals have small variance; the ratio estimator is best when the cluster means have small variance. This is the same as for one-stage cluster designs.

B. However, the within-cluster contribution to the variance is the same regardless of the method, so the two estimators differ less from each other than they do in a one-stage design.
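To make point B concrete, here is a short worked comparison. It assumes the standard two-stage variance expressions from Lohr, Chapter 5, in which both variances share the same second-stage term; S_t^2 and S_r^2 denote the between-psu variances of the cluster totals and of the size-weighted deviations of the cluster means, respectively.

```latex
\begin{align*}
V(\hat{t}_{unb}) &= N^2\left(1-\frac{n}{N}\right)\frac{S_t^2}{n} + W, \\
V(\hat{t}_r) &\approx N^2\left(1-\frac{n}{N}\right)\frac{S_r^2}{n} + W, \\
W &= \frac{N}{n}\sum_{i=1}^{N}\left(1-\frac{m_i}{M_i}\right)M_i^2\,\frac{S_i^2}{m_i}, \\
V(\hat{t}_{unb}) - V(\hat{t}_r) &\approx N^2\left(1-\frac{n}{N}\right)\frac{S_t^2 - S_r^2}{n}.
\end{align*}
```

The difference between the two variances is the same as in one-stage cluster sampling, but each variance now also carries the common term W, so the relative difference between the estimators is smaller.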

IV. Optimal Design

There are more sample size decisions to make for two-stage than for one-stage sampling. How should we balance the cost of sampling first-stage and second-stage units?

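A. A simplified analysis assumes equal-sized psu's and a design that selects an equal number of ssu's from each (m_i = m, M_i = M). In that case the two estimators coincide, and the variance of the estimated mean can be expressed as

$$V(\hat{\bar{y}}) = \left(1 - \frac{n}{N}\right) \frac{MSB}{nM} + \left(1 - \frac{m}{M}\right) \frac{MSW}{nm}$$

where MSB and MSW are the between and within mean squares (see p. 138). It is then typical to find the sample allocation that minimizes this variance subject to a fixed total cost, where the cost function is C = c_1 n + c_2 n m (c_1 is the cost per psu, c_2 the cost per ssu). The solution to this problem (see Exercise 5.23) is to choose

$$m = \sqrt{\frac{c_1 M (N-1)(1 - R_a^2)}{c_2 (NM-1) R_a^2}}$$

where R_a^2 = 1 - MSW/S^2 is a measure of intra-cluster correlation similar to ICC, and to choose n as the number of psu's you can afford,

$$n = \frac{C}{c_1 + c_2 m}.$$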

Another way to write the optimal (common) second-stage sample size is

$$m = \sqrt{\frac{c_1 M \, MSW}{c_2 (MSB - MSW)}}$$
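A sketch of where this allocation comes from, assuming the equal-size setup (M_i = M, m_i = m), the variance of the estimated mean written in terms of MSB and MSW, and the linear cost function C = c_1 n + c_2 n m. The idea is to substitute n = C/(c_1 + c_2 m) into the variance, drop the term that does not involve m, and set the derivative to zero. The function g(m) is notation introduced only for this sketch; Exercise 5.23 asks for the full argument.

```latex
\begin{align*}
V(\hat{\bar{y}}) &= \left(1-\frac{n}{N}\right)\frac{MSB}{nM} + \left(1-\frac{m}{M}\right)\frac{MSW}{nm} \\
 &= \frac{1}{n}\left[\frac{MSB-MSW}{M} + \frac{MSW}{m}\right] - \frac{MSB}{NM}, \\
g(m) &= \frac{c_1+c_2 m}{C}\left[\frac{MSB-MSW}{M} + \frac{MSW}{m}\right]
 \qquad \text{(after setting } n = C/(c_1+c_2 m)\text{)}, \\
\frac{dg}{dm} &= \frac{1}{C}\left[\frac{c_2\,(MSB-MSW)}{M} - \frac{c_1\,MSW}{m^2}\right] = 0, \\
m_{opt} &= \sqrt{\frac{c_1 M\,MSW}{c_2\,(MSB-MSW)}}.
\end{align*}
```

Writing MSW = S^2 (1 - R_a^2) and MSB - MSW = (NM-1) S^2 R_a^2 / (N-1), where R_a^2 = 1 - MSW/S^2, turns this into the R_a^2 form of the same solution. The formula requires MSB > MSW; when MSB <= MSW, g(m) decreases in m, and when the formula gives a value larger than M, the best choice in either case is m = M.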

B. What should we do when the M_i's are not constant?

In practice, there is usually little loss in precision from choosing a self-weighting form of the estimator, so choose m and n as above.

This generally works well for either the unbiased or the ratio estimator.

Example: Find a good design for the book replacement value survey (#6, p. 170)

Suppose c1 = 10 and c2 = 2.

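data books;
  /* one record per sampled book; variables as named in the input statement */
  input SHELF NUMBER PURCHASE REPLACE;
  cards;
2 3 1 13
2 5 1 13
etc.
43 6 1 4
43 12 1 10
43 16 1 7
43 24 3 6
;
run;

/* one-way ANOVA of replacement cost on shelf:          */
/* the Model and Error mean squares stand in for MSB and MSW */
proc glm data=books;
  class shelf;
  model replace = shelf;
run;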

The GLM Procedure

Class Level Information

Class     Levels    Values
SHELF         12    2 4 11 14 20 22 23 31 37 38 40 43

Number of observations    60

Dependent Variable: REPLACE

                          Sum of
Source      DF           Squares     Mean Square    F Value    Pr > F
Model       11       25570.98333      2324.63485       4.76    <.0001
Error       48       23445.20000       488.44167
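As a rough check on the arithmetic, the mean squares from this ANOVA table can be plugged into the equal-size formula for the optimal second-stage sample size from section IV. This sketch assumes that the Model and Error mean squares are adequate stand-ins for MSB and MSW, and that M is a typical (for example, average) number of books per shelf. M is not given in this outline, so it is left symbolic.

```latex
\begin{align*}
m_{opt} &= \sqrt{\frac{c_1 M\,MSW}{c_2\,(MSB-MSW)}} \\
 &\approx \sqrt{\frac{10\,M\,(488.44)}{2\,(2324.63-488.44)}}
 = \sqrt{\frac{4884.4\,M}{3672.4}} \\
 &\approx \sqrt{1.33\,M} \approx 1.15\sqrt{M}.
\end{align*}
```

For instance, a purely hypothetical shelf size of M = 25 would give m of about 5.8, or roughly 6 books per shelf. The number of shelves then follows from the available budget C as n = C/(c1 + c2 m).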