Statistics 550 Notes 6

Reading: Section 1.5

I. Sufficiency: Review and Factorization Theorem

Motivation: It is useful to condense the data to a statistic that contains all of the information about the parameter $\theta$ that is in the sample.

Definition: A statistic $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not depend on $\theta$ for any value of $t$.

Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. The statistic $T(X) = \sum_{i=1}^n X_i$ is sufficient for $\theta$.
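The defining property can be checked empirically in small cases. The following minimal Python sketch (the choices $n = 3$, $t = 1$, and the two values of $\theta$ are arbitrary, illustrative choices) estimates the conditional distribution of $X$ given $\sum_i X_i = t$ by simulation under two different values of $\theta$; the estimates agree, as sufficiency requires.

import random
from collections import Counter

def conditional_dist(theta, n=3, t=1, reps=200000, seed=0):
    # Estimate P(X = x | sum(X) = t) for iid Bernoulli(theta) data by
    # simulating and keeping only the samples whose sum equals t.
    rng = random.Random(seed)
    counts = Counter()
    kept = 0
    for _ in range(reps):
        x = tuple(int(rng.random() < theta) for _ in range(n))
        if sum(x) == t:
            counts[x] += 1
            kept += 1
    return {x: round(c / kept, 3) for x, c in counts.items()}

# Both conditional distributions should put probability near 1/3 on each of
# (1,0,0), (0,1,0), (0,0,1), regardless of theta.
print(conditional_dist(theta=0.2))
print(conditional_dist(theta=0.8))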

Example 2:

Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $T(X) = \max(X_1, \ldots, X_n)$.

We showed in Notes 4 that $T(X) = \max(X_1, \ldots, X_n)$ has density $f_T(t \mid \theta) = n t^{n-1}/\theta^n$ for $0 \le t \le \theta$.

For $x$ such that $\max(x_1, \ldots, x_n) = t$ (with $0 < t < \theta$), we have a conditional distribution of $X$ given $T(X) = t$ in which one coordinate equals $t$ and the remaining coordinates behave as iid Uniform$(0, t)$,

which does not depend on $\theta$.

For $x$ such that $\max(x_1, \ldots, x_n) \neq t$, the conditional probability that $X = x$ given $T(X) = t$ is $0$.

NOTE: NEED TO THINK MORE ABOUT THIS EXAMPLE, AS the conditional distribution does seem at first glance to depend on $\theta$.
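One way to probe this example numerically is to simulate: for two different values of $\theta$, keep the samples whose maximum falls near a fixed value $t$ and look at the remaining coordinates; if the maximum is sufficient, their conditional behavior should not depend on $\theta$. This is only an illustrative sketch; the sample size, the target value $t$, and the tolerance eps are arbitrary choices.

import random

def coords_below_max(theta, n=3, t=1.0, eps=0.05, reps=200000, seed=1):
    # Simulate X1,...,Xn iid Uniform(0, theta); for samples whose maximum is
    # within eps of t, collect the non-maximal coordinates.
    rng = random.Random(seed)
    kept = []
    for _ in range(reps):
        x = [rng.uniform(0, theta) for _ in range(n)]
        m = max(x)
        if abs(m - t) < eps:
            x.remove(m)
            kept.extend(x)
    return kept

for theta in (1.5, 4.0):
    vals = coords_below_max(theta)
    # Conditionally on max near t = 1, the other coordinates should look
    # roughly Uniform(0, 1) (mean near 0.5) under either value of theta.
    print(theta, len(vals), round(sum(vals) / len(vals), 3))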

It is often hard to verify or disprove sufficiency of a statistic directly because we need to work out the conditional distribution of the data given the statistic. The following theorem is often helpful.

Factorization Theorem: A statistic $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(x)$ such that

$p(x \mid \theta) = g(T(x), \theta)\, h(x)$

for all $x$ and all $\theta$

(where $p(x \mid \theta)$ denotes the probability mass function of the data given the parameter for discrete data and the probability density function for continuous data).

Proof: We prove the theorem for discrete data; the proof for continuous distributions is similar. First, suppose that the probability mass function factors as given in the theorem. Consider $P_\theta(X = x \mid T(X) = t)$. If $T(x) \neq t$, then $P_\theta(X = x \mid T(X) = t) = 0$ for all $\theta$. Suppose $T(x) = t$.

We have

$P_\theta(T(X) = t) = \sum_{y : T(y) = t} p(y \mid \theta) = \sum_{y : T(y) = t} g(T(y), \theta)\, h(y) = g(t, \theta) \sum_{y : T(y) = t} h(y),$

so that

$P_\theta(X = x \mid T(X) = t) = \dfrac{p(x \mid \theta)}{P_\theta(T(X) = t)} = \dfrac{g(t, \theta)\, h(x)}{g(t, \theta) \sum_{y : T(y) = t} h(y)} = \dfrac{h(x)}{\sum_{y : T(y) = t} h(y)}.$

Thus, $P_\theta(X = x \mid T(X) = t)$ does not depend on $\theta$, and

$T(X)$ is sufficient for $\theta$ by the definition of sufficiency.

Conversely, suppose $T(X)$ is sufficient for $\theta$. Then the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$. Let $g(t, \theta) = P_\theta(T(X) = t)$ and $h(x) = P(X = x \mid T(X) = T(x))$. Then

$p(x \mid \theta) = P_\theta(X = x, T(X) = T(x)) = P_\theta(T(X) = T(x))\, P(X = x \mid T(X) = T(x)) = g(T(x), \theta)\, h(x).$

Thus, we can take $g(t, \theta) = P_\theta(T(X) = t)$ and $h(x) = P(X = x \mid T(X) = T(x))$.

Example 1 Continued: $X_1, \ldots, X_n$ is a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. To show that $T(X) = \sum_{i=1}^n X_i$ is sufficient for $\theta$, we factor the probability mass function as follows:

$p(x \mid \theta) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}.$

The pmf is of the form $g(T(x), \theta)\, h(x)$ where $T(x) = \sum_{i=1}^n x_i$, $g(t, \theta) = \theta^{t}(1 - \theta)^{n - t}$, and $h(x) = 1$.

Example 2 continued: Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. To show that $T(X) = \max(X_1, \ldots, X_n)$ is sufficient, we factor the pdf as follows:

$f(x \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, 1\{0 \le x_i \le \theta\} = \theta^{-n}\, 1\{\max(x_1, \ldots, x_n) \le \theta\}\, 1\{\min(x_1, \ldots, x_n) \ge 0\}.$

The pdf is of the form $g(T(x), \theta)\, h(x)$ where $g(t, \theta) = \theta^{-n}\, 1\{t \le \theta\}$ and $h(x) = 1\{\min(x_1, \ldots, x_n) \ge 0\}$.

Example 3: Let $X_1, \ldots, X_n$ be iid Normal$(\mu, \sigma^2)$. The pdf factors as

$f(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2} \right).$

The pdf is thus of the form $g(T(x), (\mu, \sigma^2))\, h(x)$ where $T(x) = \left( \sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2 \right)$ and $h(x) = 1$.

Thus, $T(X) = \left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right)$ is a two-dimensional sufficient statistic for $(\mu, \sigma^2)$, i.e., the distribution of $X$ given $T(X)$ does not depend on $(\mu, \sigma^2)$.
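A concrete way to see that the likelihood depends on the data only through $\left(\sum x_i, \sum x_i^2\right)$: the sketch below (an illustrative check; the datasets and parameter values are arbitrary choices) evaluates the $N(\mu, \sigma^2)$ log-likelihood for two different datasets that share the same sum (6) and sum of squares (14), and the results coincide for every $(\mu, \sigma^2)$ tried.

import math

def normal_loglik(data, mu, sigma2):
    # N(mu, sigma2) log-likelihood written directly from the density.
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))

data1 = [1.0, 2.0, 3.0]
a = (1.0 / 3.0) ** 0.5
data2 = [2 + a, 2 + a, 2 - 2 * a]   # different data, same sum and sum of squares

for (mu, sigma2) in [(0.0, 1.0), (2.0, 0.5), (-1.0, 4.0)]:
    print(mu, sigma2,
          round(normal_loglik(data1, mu, sigma2), 10),
          round(normal_loglik(data2, mu, sigma2), 10))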

A theorem for proving that a statistic is not sufficient:

Theorem 1: Let $T(X)$ be a statistic. If there exist parameter values $\theta_1, \theta_2$ and sample points $x, y$ such that

(i) $T(x) = T(y)$;

(ii) $p(x \mid \theta_1)\, p(y \mid \theta_2) \neq p(y \mid \theta_1)\, p(x \mid \theta_2)$,

then $T(X)$ is not a sufficient statistic.

Proof: First, suppose one side of (ii) equals 0 and the other side of (ii) does not equal 0. Say $p(x \mid \theta_1)\, p(y \mid \theta_2) = 0$ and $p(y \mid \theta_1)\, p(x \mid \theta_2) > 0$ (the other case is symmetric), so that $p(y \mid \theta_1) > 0$ and $p(x \mid \theta_2) > 0$ while $p(x \mid \theta_1) = 0$ or $p(y \mid \theta_2) = 0$; that is, one of the two points is in the support of one of the distributions but not the other. If $T$ were sufficient, then by (i), $x$ and $y$ lie in the same set $\{z : T(z) = T(x)\}$, this set has positive probability under both $\theta_1$ and $\theta_2$, and the conditional distribution of $X$ on this set is the same under $\theta_1$ and $\theta_2$; it would follow that $p(x \mid \theta_1) > 0$ and $p(y \mid \theta_2) > 0$, a contradiction. Hence $T$ is not sufficient in this case.

Second, suppose both sides of (ii) are greater than zero, so that $p(x \mid \theta_1), p(y \mid \theta_1), p(x \mid \theta_2), p(y \mid \theta_2) > 0$. If $T$ were sufficient, then since the distribution of $X$ given $T(X)$ does not depend on $\theta$, we must have

$P_{\theta_1}(X = x \mid T(X) = T(x)) = P_{\theta_2}(X = x \mid T(X) = T(x)).$   (0.1)

The left hand side of (0.1) equals $p(x \mid \theta_1)\, /\, P_{\theta_1}(T(X) = T(x)),$

and the right hand side of (0.1) equals $p(x \mid \theta_2)\, /\, P_{\theta_2}(T(X) = T(x)).$

Thus, from (0.1), and from the same argument applied to $y$ (using $T(y) = T(x)$ from (i)), we conclude that if $T$ were sufficient,

we would have

$\dfrac{p(x \mid \theta_1)}{p(x \mid \theta_2)} = \dfrac{P_{\theta_1}(T(X) = T(x))}{P_{\theta_2}(T(X) = T(x))} = \dfrac{p(y \mid \theta_1)}{p(y \mid \theta_2)}$, so that

$p(x \mid \theta_1)\, p(y \mid \theta_2) = p(y \mid \theta_1)\, p(x \mid \theta_2)$, which contradicts (ii).

Thus, (i) and (ii) show that $T$ is not a sufficient statistic.

Example 4: Consider a series of three independent Bernoulli trials $X_1, X_2, X_3$ with probability of success $p$. Let $T(X) = X_1$. Show that $T$ is not sufficient.

Let $x = (1, 0, 0)$ and $y = (1, 1, 1)$. We have $T(x) = T(y) = 1$.

But $\dfrac{p(x \mid p)}{p(y \mid p)} = \dfrac{p(1 - p)^2}{p^3} = \left(\dfrac{1 - p}{p}\right)^2$ is not constant in $p$, so $p(x \mid p_1)\, p(y \mid p_2) \neq p(y \mid p_1)\, p(x \mid p_2)$ for any $p_1 \neq p_2$ in $(0, 1)$.

Thus, by Theorem 1, $T$ is not sufficient.
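The conclusion can also be checked by computing the conditional distribution directly. The sketch below uses the statistic $T(X) = X_1$ from Example 4 and computes $P(X = x \mid T(X) = 1)$ exactly under two values of $p$ (the values 0.3 and 0.7 are arbitrary choices); the two conditional distributions differ, so $T$ is not sufficient.

from itertools import product

def pmf(x, p):
    # Joint pmf of three independent Bernoulli(p) trials.
    s = sum(x)
    return p ** s * (1 - p) ** (3 - s)

def conditional_given_T(p, t, T):
    # Exact conditional distribution P(X = x | T(X) = t) under success prob p.
    points = [x for x in product((0, 1), repeat=3) if T(x) == t]
    total = sum(pmf(x, p) for x in points)
    return {x: round(pmf(x, p) / total, 4) for x in points}

def T(x):
    return x[0]   # the non-sufficient statistic from Example 4

for p in (0.3, 0.7):
    print(p, conditional_given_T(p, t=1, T=T))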

II. Implications of Sufficiency

We have said that reducing the data $X$ to a sufficient statistic $T(X)$ does not sacrifice any information about $\theta$.

We now justify this statement in two ways:

(1) We show that for any decision procedure $\delta(X)$, we can find a randomized decision procedure that is based only on the sufficient statistic and that has the same risk function.

(2) We show that any point estimator that is not a function of the sufficient statistic can be improved upon when the loss function is strictly convex.

(1) Let $\delta(X)$ be a decision procedure and $T(X)$ be a sufficient statistic. Consider the following randomized decision procedure [call it $\delta'(T)$]:

Based on $T(X) = t$, randomly draw $X'$ from the conditional distribution of $X$ given $T(X) = t$ (which does not depend on $\theta$ and is hence known) and take the action $\delta(X')$.

$X'$ has the same distribution as $X$, so $\delta(X')$ has the same distribution as $\delta(X)$, and hence $\delta'(T)$ has the same risk function as $\delta(X)$.

Example (Bernoulli, $n = 2$): Let $X_1, X_2$ be iid Bernoulli($\theta$) and $T(X) = X_1 + X_2$. $T$ is sufficient because, given $T(X) = 1$, $X$ is equally likely to be $(1,0)$ or $(0,1)$ for all $\theta$ (and $X$ is determined by $T$ when $T = 0$ or $T = 2$). Given $T(X) = 1$, construct $X'$ to be $(1,0)$ or $(0,1)$ with probability 0.5 each (and set $X' = (0,0)$ if $T = 0$, $X' = (1,1)$ if $T = 2$). Then $X'$ has the same distribution as $X$, so $\delta(X')$ has the same risk function as $\delta(X)$.
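A small simulation can confirm that the reconstructed procedure has the same risk. The sketch below assumes the Bernoulli $n = 2$ setup above and takes $\delta(X) = X_1$ as an arbitrary, purely illustrative estimator of $\theta$, comparing its squared-error risk with that of $\delta'(T) = \delta(X')$.

import random

def simulate(theta, reps=200000, seed=2):
    rng = random.Random(seed)
    err_delta = err_delta_prime = 0.0
    for _ in range(reps):
        x = (int(rng.random() < theta), int(rng.random() < theta))
        t = sum(x)   # sufficient statistic
        # Reconstruct X' from T alone by drawing from the known conditional
        # distribution of X given T.
        if t == 1:
            x_prime = (1, 0) if rng.random() < 0.5 else (0, 1)
        else:
            x_prime = (t // 2, t // 2)   # (0,0) if t == 0, (1,1) if t == 2
        err_delta += (x[0] - theta) ** 2         # delta(X) = X1
        err_delta_prime += (x_prime[0] - theta) ** 2   # delta'(T) = delta(X')
    return err_delta / reps, err_delta_prime / reps

print(simulate(0.3))   # the two risks should agree up to simulation error
print(simulate(0.8))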

(2) The Rao-Blackwell Theorem.

Convex functions: A real-valued function $\phi$ defined on an open interval $I$ is convex if for any $x, y \in I$ and any $0 \le \alpha \le 1$,

$\phi(\alpha x + (1 - \alpha) y) \le \alpha \phi(x) + (1 - \alpha) \phi(y).$

$\phi$ is strictly convex if the inequality is strict whenever $x \neq y$ and $0 < \alpha < 1$.

If $\phi''$ exists, then $\phi$ is convex if and only if $\phi'' \ge 0$ on $I$.

A convex function lies above all its tangent lines.

Convexity of loss functions:

For point estimation:

  • squared error loss $l(\theta, d) = (d - \theta)^2$ is strictly convex in $d$.
  • absolute error loss $l(\theta, d) = |d - \theta|$ is convex but not strictly convex.
  • Huber's loss functions,

$l(\theta, d) = \begin{cases} (d - \theta)^2 & \text{if } |d - \theta| \le k \\ 2k|d - \theta| - k^2 & \text{if } |d - \theta| > k \end{cases}$

for some constant $k > 0$, are convex but not strictly convex.

  • the zero-one loss function

$l(\theta, d) = \begin{cases} 0 & \text{if } |d - \theta| \le c \\ 1 & \text{if } |d - \theta| > c \end{cases}$

for some constant $c > 0$ is nonconvex (see the numerical check sketched below).
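A quick numerical probe of these convexity claims is to test the midpoint inequality $l\big(\theta, \tfrac{a+b}{2}\big) \le \tfrac{1}{2}\big(l(\theta, a) + l(\theta, b)\big)$ on random pairs $(a, b)$. The sketch below does this for the four losses listed above; the particular values of $\theta$, $k$, and $c$ are arbitrary illustrative choices.

import random

def midpoint_convex(loss, trials=10000, lo=-5.0, hi=5.0, seed=3):
    # Numerically probe convexity of d -> loss(d) via the midpoint inequality
    # loss((a + b) / 2) <= (loss(a) + loss(b)) / 2.
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if loss((a + b) / 2) > (loss(a) + loss(b)) / 2 + 1e-12:
            return False
    return True

theta, k, c = 0.0, 1.0, 0.5   # illustrative parameter value and constants

def squared(d):  return (d - theta) ** 2
def absolute(d): return abs(d - theta)
def huber(d):    return (d - theta) ** 2 if abs(d - theta) <= k else 2 * k * abs(d - theta) - k ** 2
def zero_one(d): return 0.0 if abs(d - theta) <= c else 1.0

for name, loss in [("squared", squared), ("absolute", absolute),
                   ("Huber", huber), ("zero-one", zero_one)]:
    print(name, midpoint_convex(loss))   # expect True, True, True, False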

Jensen’s Inequality: (Appendix B.9)

Let $X$ be a random variable. (i) If $\phi$ is convex in an open interval $I$, $P(X \in I) = 1$, and $E(X)$ is finite, then

$E[\phi(X)] \ge \phi(E[X]).$

(ii) If $\phi$ is strictly convex, then $E[\phi(X)] > \phi(E[X])$ unless $X$ equals a constant with probability one.

Proof of (i): Let $L(x) = a + bx$ be a tangent line to $\phi(x)$ at the point $x = E(X)$, so that $L(E(X)) = \phi(E(X))$. By the convexity of $\phi$, $\phi(x) \ge a + bx$ for all $x \in I$. Since expectations preserve inequalities,

$E[\phi(X)] \ge E[a + bX] = a + bE(X) = L(E(X)) = \phi(E(X)),$

as was to be shown.
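For instance, if $X$ takes the values 0 and 1 with probability 1/2 each and $\phi(x) = x^2$, then $E[\phi(X)] = 1/2 > 1/4 = \phi(E[X])$, illustrating the strict inequality in (ii).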

Rao-Blackwell Theorem: Let $T(X)$ be a sufficient statistic. Let $\delta(X)$ be a point estimator of $q(\theta)$ and assume that the loss function $l(\theta, d)$ is strictly convex in $d$. Also assume that $E_\theta|\delta(X)| < \infty$ and $R(\theta, \delta) < \infty$ for all $\theta$. Let $\eta(T) = E[\delta(X) \mid T(X)]$. Then $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.

Proof: Fix $\theta$ and $t$. Apply Jensen's inequality with $\phi(d) = l(\theta, d)$, letting the random variable in Jensen's inequality be $\delta(X)$ with the conditional distribution of $\delta(X)$ given $T(X) = t$ for a particular choice of $t$.

By Jensen's inequality,

$l(\theta, \eta(t)) = l\big(\theta, E[\delta(X) \mid T(X) = t]\big) < E\big[ l(\theta, \delta(X)) \mid T(X) = t \big]$   (0.2)

unless $\delta(X)$ equals the constant $\eta(t)$ with probability one given $T(X) = t$. Taking the expectation over $T(X)$ on both sides of this inequality yields $R(\theta, \eta) < R(\theta, \delta)$ unless $\delta(X) = \eta(T(X))$ with probability one.
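To see the improvement concretely, consider iid Bernoulli($\theta$) data with $\delta(X) = X_1$ (an arbitrary illustrative estimator of $\theta$); its Rao-Blackwellization is $\eta(T) = E[X_1 \mid \sum_i X_i] = \frac{1}{n}\sum_i X_i$ by symmetry. The sketch below compares the two risks under squared error loss by simulation (the values of $n$ and $\theta$ are arbitrary choices).

import random

def risks(theta, n=3, reps=200000, seed=4):
    rng = random.Random(seed)
    r_delta = r_eta = 0.0
    for _ in range(reps):
        x = [int(rng.random() < theta) for _ in range(n)]
        delta = x[0]        # crude estimator using only the first trial
        eta = sum(x) / n    # its Rao-Blackwellization E[X1 | sum(X)]
        r_delta += (delta - theta) ** 2
        r_eta += (eta - theta) ** 2
    return r_delta / reps, r_eta / reps

print(risks(0.3))   # the second (Rao-Blackwellized) risk should be smaller
print(risks(0.7))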

Comments:

(1) Sufficiency ensures that $\eta(T) = E[\delta(X) \mid T(X)]$ is an estimator (i.e., it depends only on the data $X$ and not on $\theta$); without sufficiency, the conditional expectation could depend on $\theta$.

(2) If the loss is convex rather than strictly convex, we get $\le$ in (0.2) and hence $R(\theta, \eta) \le R(\theta, \delta)$.

(3) The theorem is not true without convexity of the loss function.

Consequence of Rao-Blackwell theorem: For convex loss functions, we can dispense with randomized estimators.

A randomized estimator randomly chooses the estimate according to a known probability distribution that may depend on the observed data. A randomized estimator can be written as an estimator $\delta^*(X, U)$ where $X$ and $U$ are independent and $U$ is uniformly distributed on $(0, 1)$: we observe $X = x$ and then use $U$ to generate the estimate from its known conditional distribution given $X = x$. For the data $(X, U)$, $X$ is sufficient. Thus, by the Rao-Blackwell Theorem, the nonrandomized estimator $E[\delta^*(X, U) \mid X]$ dominates $\delta^*(X, U)$ for strictly convex loss functions.

III. Minimal Sufficiency

For any model, there are many sufficient statistics.

Example 1: For $X_1, \ldots, X_n$ iid Bernoulli($\theta$), $T_1(X) = \sum_{i=1}^n X_i$ and $T_2(X) = (X_1, \ldots, X_n)$ are both sufficient, but $T_1$ provides a greater reduction of the data.

Definition: A statistic $T(X)$ is minimally sufficient if it is sufficient and it provides a reduction of the data that is at least as great as that of any other sufficient statistic $T'(X)$, in the sense that we can find a transformation $r$ such that $T(X) = r(T'(X))$.

Comments:

(1) To say that we can find a transformation $r$ such that $T(X) = r(T'(X))$ means that if $T'(x) = T'(y)$, then $T(x)$ must equal $T(y)$.

(2) Data reduction in terms of a particular statistic can be thought of as a partition of the sample space. A statistic $T(X)$ partitions the sample space into sets $A_t = \{x : T(x) = t\}$.

If a statistic $T(X)$ is minimally sufficient, then for any other sufficient statistic $T'(X)$, which partitions the sample space into sets $B_s = \{x : T'(x) = s\}$, every set $B_s$ must be a subset of some $A_t$. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic, and in this sense the minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.
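To make comment (2) concrete: for $n = 2$ Bernoulli trials, the sufficient statistic $T(X) = X_1 + X_2$ partitions the sample space $\{0, 1\}^2$ into $A_0 = \{(0,0)\}$, $A_1 = \{(0,1), (1,0)\}$, and $A_2 = \{(1,1)\}$, while the sufficient statistic $T'(X) = (X_1, X_2)$ partitions it into the four singletons; each singleton is a subset of one of $A_0, A_1, A_2$, so $T$ gives the coarser partition.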

A useful theorem for finding a minimal sufficient statistic is the following:

Theorem 2 (Lehmann and Scheffé, 1950): Suppose $T(X)$ is a sufficient statistic for $\theta$. Also suppose that for every two sample points $x$ and $y$, if the ratio $p(x \mid \theta)/p(y \mid \theta)$ is constant as a function of $\theta$, then $T(x) = T(y)$. Then $T(X)$ is a minimal sufficient statistic for $\theta$.

Proof: Let $T'(X)$ be any statistic that is sufficient for $\theta$. By the factorization theorem, there exist functions $g'$ and $h'$ such that $p(x \mid \theta) = g'(T'(x), \theta)\, h'(x)$. Let $x$ and $y$ be any two sample points with $T'(x) = T'(y)$. Then

$\dfrac{p(x \mid \theta)}{p(y \mid \theta)} = \dfrac{g'(T'(x), \theta)\, h'(x)}{g'(T'(y), \theta)\, h'(y)} = \dfrac{h'(x)}{h'(y)}.$

Since this ratio does not depend on $\theta$, the assumption of the theorem implies that $T(x) = T(y)$. Thus, $T(X)$ gives at least as coarse a partition of the sample space as $T'(X)$, and consequently $T(X)$ is minimal sufficient.

Example 1 continued: Consider the ratio

$\dfrac{p(x \mid \theta)}{p(y \mid \theta)} = \dfrac{\theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}}{\theta^{\sum_{i=1}^n y_i} (1 - \theta)^{n - \sum_{i=1}^n y_i}} = \theta^{\sum_i x_i - \sum_i y_i} (1 - \theta)^{\sum_i y_i - \sum_i x_i}.$

This ratio is constant as a function of $\theta$ if and only if $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i$. Since we have shown that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic, it follows from the above sentence and Theorem 2 that $\sum_{i=1}^n X_i$ is a minimal sufficient statistic.

Note that a minimal sufficient statistic is not unique. Any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic. For example, $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is also a minimal sufficient statistic for the i.i.d. Bernoulli case.
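The "constant in $\theta$ exactly when the sums agree" property in Example 1 is easy to check numerically. In the sketch below (the sample points and the grid of $\theta$ values are arbitrary choices), the likelihood ratio is evaluated over a grid of $\theta$ values for a pair of points with equal sums and a pair with unequal sums.

def ratio(x, y, theta):
    # p(x | theta) / p(y | theta) for iid Bernoulli(theta) data.
    sx, sy, n = sum(x), sum(y), len(x)
    return (theta ** sx * (1 - theta) ** (n - sx)) / (theta ** sy * (1 - theta) ** (n - sy))

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
x, y = (1, 0, 1, 0), (0, 1, 0, 1)   # equal sums: ratio constant (equal to 1)
print([round(ratio(x, y, t), 6) for t in thetas])
x, y = (1, 1, 1, 0), (0, 1, 0, 1)   # unequal sums: ratio varies with theta
print([round(ratio(x, y, t), 6) for t in thetas])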

Example 2: Suppose $X_1, \ldots, X_n$ are iid uniform on the interval $(\theta, \theta + 1)$, $-\infty < \theta < \infty$. Then the joint pdf of $X$ is

$f(x \mid \theta) = \prod_{i=1}^n 1\{\theta < x_i < \theta + 1\} = 1\{\max(x_1, \ldots, x_n) - 1 < \theta < \min(x_1, \ldots, x_n)\}.$

The statistic $T(X) = (X_{(1)}, X_{(n)}) = (\min(X_1, \ldots, X_n), \max(X_1, \ldots, X_n))$ is a sufficient statistic by the factorization theorem with $g((t_1, t_2), \theta) = 1\{t_2 - 1 < \theta < t_1\}$ and $h(x) = 1$.

For any two sample points $x$ and $y$, the numerator and denominator of the ratio $f(x \mid \theta)/f(y \mid \theta)$ will be positive for the same values of $\theta$ if and only if $\min_i x_i = \min_i y_i$ and $\max_i x_i = \max_i y_i$; and if the minima and maxima are equal, then the ratio is constant in $\theta$ and in fact equals 1. Thus, $T(X) = (X_{(1)}, X_{(n)})$ is a minimal sufficient statistic by Theorem 2.

Example 2 is a case in which the dimension of the minimal sufficient statistic (two) does not match the dimension of the parameter (one). There are also models in which the dimension of the minimal sufficient statistic equals the sample size, e.g., $X_1, \ldots, X_n$ iid Cauchy($\theta$) with density $f(x \mid \theta) = \dfrac{1}{\pi\big(1 + (x - \theta)^2\big)}$ (Problem 1.5.15).

IV. Ancillary Statistics

A statistic $S(X)$ is ancillary if its distribution does not depend on $\theta$.

Example 5: Suppose our model is $X_1, \ldots, X_n$ iid $N(\theta, 1)$. Then $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is a sufficient statistic and the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ is an ancillary statistic.
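The ancillarity of the sample variance can be checked by simulation: its distribution should be the same for every value of $\theta$. A minimal Python sketch (the choices $n = 5$ and $\theta \in \{0, 10\}$ are arbitrary illustrative values):

import random
import statistics

def sample_variances(theta, n=5, reps=50000, seed=5):
    # Simulate the sample variance of n iid N(theta, 1) observations.
    rng = random.Random(seed)
    return [statistics.variance([rng.gauss(theta, 1) for _ in range(n)])
            for _ in range(reps)]

for theta in (0.0, 10.0):
    s2 = sample_variances(theta)
    # The summaries should be essentially identical for the two values of theta.
    print(theta, round(statistics.mean(s2), 3), round(statistics.stdev(s2), 3))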

Although ancillary statistics contain no information about $\theta$ when the model is true, ancillary statistics are useful for checking the validity of the model.
