9 - Sampling Distributions and Confidence Intervals for $\mu$ & p

Introduction:

When we take a sample of size n from a population and calculate summary statistics like the sample mean ($\bar{x}$), the sample median (med), the sample variance ($s^2$), the sample standard deviation (s), or the sample proportion ($\hat{p}$), we must realize that these quantities will ______ and hence are themselves ______.

Any random variable in statistics has a probability distribution. We have been talking about three common probability distributions in statistics. When X = # of “successes” in n independent trials we used the binomial distribution to talk about X probabilistically, when X = # of occurrences in a fixed time/space unit we used the Poisson distribution, and finally when X was continuous and had an approximate bell-shaped distribution we used the normal distribution to calculate probabilities and quantiles associated with X.

Because the summary statistics discussed above are random variables, they also have a probability distribution that determines the likelihood of certain values of these statistics being obtained. The distribution of a summary statistic, e.g. the sample mean, is called the ______.

In this handout we explore the sampling distributions of the sample mean ($\bar{x}$) and the sample proportion ($\hat{p}$).

Sampling Distribution of $\bar{x}$

The sample mean ($\bar{x}$) is a random quantity that varies from sample to sample. The probability distribution the sample mean follows is called the sampling distribution of $\bar{x}$.

The sampling distribution demo I showed in class is found at the following web address:

http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/

The Central Limit Theorem for the Sample Mean (CLT) tells us about the sampling distribution of the sample mean ($\bar{x}$). There is also a version (which we will see later) that tells us about the sampling distribution of the sample proportion ($\hat{p}$).

The CLT for $\bar{x}$ says the following:

1.

2.

3. The sampling distribution will be ______ if either of the conditions below is met:

·
or if

·
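The behavior the CLT describes can be checked with a small simulation. The sketch below is only an illustration: the population (an exponential distribution with mean 10, so its standard deviation is also 10), the sample size n = 40, and the helper name `simulate_sample_means` are all assumptions made for this example and are not part of the handout. It draws many samples, computes each sample mean, and compares the center and spread of those means with $\mu$ and $\sigma/\sqrt{n}$.

```python
import random
import statistics


def simulate_sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from n values produced by draw(rng)."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


# Hypothetical skewed population: exponential with mean 10 (sigma is also 10).
mu, sigma = 10.0, 10.0
n = 40
means = simulate_sample_means(lambda rng: rng.expovariate(1 / mu), n, reps=5000)

mean_of_means = statistics.fmean(means)   # should be close to mu
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n)

print("mean of sample means:", round(mean_of_means, 3), "vs mu =", mu)
print("SD of sample means:  ", round(sd_of_means, 3),
      "vs sigma/sqrt(n) =", round(sigma / n ** 0.5, 3))
```

Even though the assumed population is strongly skewed, a histogram of `means` would look roughly bell-shaped for a sample size of this magnitude.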

We now consider applications of the central limit theorem (CLT).

Applications to Decision Making

Example 1: Cholesterol levels of adult males (50-60 yrs. old)

The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are approximately normally distributed in this population.

a) What is the probability that when taking a sample of size n = 25 you would obtain a sample mean greater than 225 mg/dl?

b) Give a range of values within which we would expect the sample mean to fall approximately 95% of the time.

c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also strict vegetarians and obtained a sample mean of 188 mg/dl. Does this provide evidence that the subpopulation of vegetarians has a lower mean cholesterol level than the general population of men in this age group? Explain.
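Parts (a) and (b) can be checked numerically. This is a sketch using Python's standard-library `NormalDist`, with the values stated in the example ($\mu$ = 200, $\sigma$ = 20, n = 25); the variable names are chosen here for illustration. Part (c) is left out because the example does not state the size of the vegetarian sample.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200.0, 20.0, 25       # values stated in Example 1
se = sigma / sqrt(n)                 # standard error of the sample mean
xbar_dist = NormalDist(mu, se)       # sampling distribution of the sample mean

# (a) probability of a sample mean above 225 mg/dl
p_above_225 = 1 - xbar_dist.cdf(225)

# (b) central range that contains the sample mean about 95% of the time
lower, upper = xbar_dist.inv_cdf(0.025), xbar_dist.inv_cdf(0.975)

print(f"SE = {se}")
print(f"P(xbar > 225) = {p_above_225:.2e}")
print(f"95% range for xbar: ({lower:.2f}, {upper:.2f})")
```

A sample mean of 225 is more than six standard errors above 200, which is why the probability in (a) is essentially zero.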

Example 2: Mercury Levels Found in Boulder Reservoir Walleyes

Fish consumption guidelines suggest you should limit the number of fish you eat with Hg levels above .25 ppm. Is there evidence to suggest that walleyes from Boulder Reservoir have a mean Hg content exceeding .25 ppm?

Confidence Intervals for the Population Mean ($\mu$)

Motivating Example: Suppose we are trying to estimate the mean protein content of zebra mussels, which are becoming an increased part of the diet for ducks on the Mississippi River. A sample of n = 25 zebra mussels is analyzed for protein content, giving a sample mean of $\bar{x} = 9.14$ units.

This is called a ______ for the population mean ($\mu$) because it yields a single value for this unknown quantity.

A better estimate might be 9.14 give or take _____ units, i.e. ______ up to ______. This is called an ______, as it gives a range or interval of plausible values for the population mean.

How do we know if this is a good interval estimate? ______

What properties should a good interval estimate have?

·  It

·  It

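The central limit theorem states that if our sample size (n) is sufficiently large, then $\bar{x}$ is approximately normally distributed with mean $\mu$ and standard error $\sigma/\sqrt{n}$, which also implies that after standardizing

$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ is approximately N(0,1).

This means that when we collect our data, the probability that our observed sample mean will fall within two standard errors of the population mean is approximately .95, i.e. a 95% chance. To be more precise, we could use 1.96 standard errors because

$P(-1.96 < Z < 1.96) = .95$

which gives

$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .95$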

For a 99% chance we use ______ and for 90% we use ______ in place of 1.96.

Starting with the statement,

$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .95$

we will perform algebraic manipulations to isolate the population mean $\mu$ in the middle of this inequality instead. By doing this we will obtain an interval that has a 95% chance of covering the true population mean.

Algebraic Manipulations of the Inequality Above:
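One way to carry out these steps is sketched below. Multiply through by the standard error $\sigma/\sqrt{n}$, subtract $\bar{x}$ from all three parts, and then multiply by -1, which reverses the direction of the inequalities:

```latex
\begin{align*}
P\left(-1.96 < \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} < 1.96\right) &= .95 \\
P\left(-1.96\,\frac{\sigma}{\sqrt{n}} < \bar{x}-\mu < 1.96\,\frac{\sigma}{\sqrt{n}}\right) &= .95 \\
P\left(-\bar{x}-1.96\,\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x}+1.96\,\frac{\sigma}{\sqrt{n}}\right) &= .95 \\
P\left(\bar{x}-1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x}+1.96\,\frac{\sigma}{\sqrt{n}}\right) &= .95
\end{align*}
```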

This says that the interval from $\bar{x} - 1.96\,\sigma/\sqrt{n}$ up to $\bar{x} + 1.96\,\sigma/\sqrt{n}$ has a 95% chance of covering the true population mean $\mu$. This interval is simply the sample mean plus or minus roughly two standard errors. However, this interval cannot be calculated in practice! WHY?

A “simple fix” to this would be to replace ____ by the estimated standard deviation from our data, _____.

The problem with our “simple fix” is that the distribution of $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ is not standard normal, i.e. N(0,1); therefore the 1.96 value will not necessarily produce the desired level of confidence.

FACT: If the population we are sampling from is approximately normal then

$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$

has a t-distribution with degrees of freedom df = n – 1.

What does a t-distribution look like?

·

Examples: Using the t-table to find confidence intervals

a) n = 20 and 95% confidence t =

b) n = 20 and 99% confidence t =
c) n = 50 and 90% confidence t =

d) n = 10 and 95% confidence t =
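Table values like those in (a) through (d) can also be checked without a printed t-table. The sketch below is not how the table was built; it is just one self-contained way to approximate the same numbers, assuming df = n - 1 and a central area equal to the confidence level. It integrates the t density with Simpson's rule and then inverts the result by bisection. The function names `t_pdf`, `t_cdf`, and `t_critical` are made up for this example, and the approach is only meant for the modest sample sizes used here.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus symmetry about 0."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence, n):
    """Approximate t-table value for the given confidence level and sample size n."""
    df = n - 1
    target = 1 - (1 - confidence) / 2   # e.g. .975 for 95% confidence
    lo, hi = 0.0, 100.0
    for _ in range(60):                 # bisection: shrink [lo, hi] around the quantile
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n, conf in [(20, 0.95), (20, 0.99), (50, 0.90), (10, 0.95)]:
    print(f"n = {n}, {conf:.0%} confidence: t = {t_critical(conf, n):.3f}")
```

If SciPy is available, `scipy.stats.t.ppf` computes the same quantiles directly.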

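The basic form of most confidence intervals is:

$\text{estimate} \pm (\text{table value}) \times SE(\text{estimate})$

General Form for a Confidence Interval for the Mean

For the population mean we have,

$\bar{x} \pm t \cdot SE(\bar{x})$

or

$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$

The appropriate columns in the t-distribution table for the different confidence levels are as follows: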

90% Confidence look in the .05 column (if n is “large” we can use 1.645)

95% Confidence look in the .025 column (if n is “large” we can use 1.960)

99% Confidence look in the .005 column (if n is “large” we can use 2.576)

Example: Suppose we are trying to estimate the mean protein content of zebra mussels, which are becoming an increased part of the diet for ducks on the Mississippi River. A sample of n = 25 zebra mussels is analyzed for protein content, giving a sample mean of $\bar{x} = 9.14$ units with a sample standard deviation of s = 2.98 units.

a) Use this information to find a 95% CI for the mean protein content found in the tissues of zebra mussels, assuming that protein content of zebra mussels has a normal distribution.

Suppose a sample of n = 25 freshwater clams was obtained and similar protein analysis was conducted resulting in a sample mean units with a standard deviation of s = 12.12 units.

b) Find a 95% confidence interval for the mean protein content found in the tissue of freshwater clams.

c) Does this interval in conjunction with the interval obtained for zebra mussels provide evidence that freshwater clams are richer in protein than zebra mussels?
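Part (a) for the zebra mussels can be sketched as follows. The calculation takes the sample mean to be the 9.14 units quoted in the motivating example, uses s = 2.98 and n = 25 as stated, and plugs in t = 2.064, the usual t-table value for df = 24 in the .025 column. The helper name `t_interval` is an illustration only. Part (b) is not computed because the clam sample mean is not given in the text above.

```python
from math import sqrt


def t_interval(xbar, s, n, t_star):
    """Return (lower, upper) for xbar +/- t_star * s / sqrt(n)."""
    margin = t_star * s / sqrt(n)
    return xbar - margin, xbar + margin


# t_star = 2.064 is read from a t-table (df = 24, .025 column); it is an input, not computed here.
lower, upper = t_interval(xbar=9.14, s=2.98, n=25, t_star=2.064)
print(f"95% CI for mean protein content: ({lower:.2f}, {upper:.2f}) units")
```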

Sampling Distribution of the Sample Proportion ($\hat{p}$)

Just like the sample mean, the sample proportion ($\hat{p}$) is random, as it too varies from sample to sample. The sampling distribution of $\hat{p}$ has the following properties:

1. The mean of the sampling distribution is the population proportion (p)

2. The standard deviation of the sampling distribution, or the standard error of $\hat{p}$, is given by:

$SE(\hat{p}) = \sqrt{\frac{pq}{n}}$ where $q = 1 - p$

3. The sampling distribution is approx. normal provided n is “sufficiently large”.

Note: When estimating proportions, large sample sizes are generally used (e.g. $n \geq 100$).

APPLICATIONS TO DECISION MAKING

Example: New Method for Treating a Certain Illness/Disease

Suppose the current treatment method for a certain disease has a 70% success rate. A new method has been proposed that will hopefully have a higher success rate. The new method is administered to a sample of n = 50 patients, and 40 have a successful treatment.

Can we conclude on the basis of this result that the new method has a higher success rate?

Using the Binomial Table (this is called the Binomial Exact Test, see Sec. 11.1)
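The question in this example can be explored numerically in two ways. The sketch below assumes the old success rate p = 0.70 still applies and asks how likely 40 or more successes out of 50 would be. It computes the exact binomial tail probability (the quantity a binomial table provides) and, for comparison, the normal approximation based on the sampling distribution of the sample proportion. Variable names are chosen for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, x, p0 = 50, 40, 0.70          # sample size, observed successes, current success rate

# Exact: P(X >= 40) when X ~ Binomial(n = 50, p = 0.70)
p_exact = sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(x, n + 1))

# Normal approximation: standardize the sample proportion using p0
p_hat = x / n
se0 = sqrt(p0 * (1 - p0) / n)    # standard error of p-hat if p really is 0.70
z = (p_hat - p0) / se0
p_normal = 1 - NormalDist().cdf(z)

print(f"p-hat = {p_hat:.2f}, z = {z:.3f}")
print(f"exact P(X >= {x}) = {p_exact:.4f}")
print(f"normal approximation = {p_normal:.4f}")
```

A small tail probability would suggest the new method is better than 70%; a probability that is not small means a result like 40 out of 50 could plausibly occur by chance under the old rate.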

CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION (Sec. 4.5)

Motivating Example: A study of 200 rainbow trout caught on baited size 8 barbed hooks and released with the line cut at the hook (but the hook not removed from the fish) showed that 58 fish died (from the National Symposium on Catch and Release Fishing).

An estimate of the proportion of trout that die when caught and released in this fashion is .29 or 29%. A better estimate might be 29% give or take 4%, i.e. estimating that the actual percentage of trout that will die is somewhere between 25% and 33%. This is called an “interval estimate”, as it gives a range or interval of plausible values for the population proportion/percentage. As with the population mean discussed earlier, we wish this interval to be narrow enough to provide useful information about this unknown percentage, yet have a high probability or chance of covering the actual percentage of trout that will die under this catch and release strategy.

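The central limit theorem for proportions states that if our sample size (n) is sufficiently large, then $\hat{p}$ is approximately normally distributed with mean p and standard error $\sqrt{p(1-p)/n}$. This means that when we take our sample and find our sample proportion, $\hat{p}$, the probability that our observed sample proportion will fall within approximately two standard errors of the population proportion is roughly 95%, or more precisely

$P\left(-1.96 < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < 1.96\right) = .95$

Recall: $P(-1.96 < Z < 1.96) = .95$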

Starting with this statement we can perform some algebraic manipulations to isolate the population proportion, p, in the middle of the inequality above. By doing this we will see that the resulting interval will have a 95% chance of covering the true population proportion (p).

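After a wonderful algebraic manipulation of the inequality above:

$P\left(\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} < p < \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}\right) = .95$

This says that the interval from $\hat{p} - 1.96\sqrt{p(1-p)/n}$ up to $\hat{p} + 1.96\sqrt{p(1-p)/n}$ has a 95% chance of covering the true population proportion p. This interval is simply the sample proportion plus or minus roughly two standard errors, i.e. $\hat{p} \pm 1.96\sqrt{p(1-p)/n}$. However, this interval cannot be calculated in practice! WHY?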

A simple fix is to replace ______ by our sample-based estimate ______. Provided the sample size is sufficiently large, the resulting interval will still have an approximate 95% chance of covering the true population proportion. This gives what we should technically call the estimated standard error of the proportion, but when we say “standard error of the proportion” it is assumed this estimated version is the one we are talking about, because in reality the population proportion p is NOT known. If p were known we would not be conducting a study in the first place!

General Form for a CI for Population Proportion (p)

$\hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Normal Table Values:

95% Confidence we use z = 1.96

90% Confidence we use z = 1.645

99% Confidence we use z = 2.576
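These normal table values can be reproduced with the standard library. The sketch below finds the z value that leaves half of the leftover probability in each tail; the function name `z_critical` is an assumption made for this example.

```python
from statistics import NormalDist


def z_critical(confidence):
    """z value such that P(-z < Z < z) equals the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```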

Example: Mortality of Released Rainbow Trout with Barbed Hooks (cont’d)

A study of 200 rainbow trout caught on baited size 8 barbed hooks and released with the line cut at the hook (but the hook not removed from the fish) showed that 58 fish died (from the National Symposium on Catch and Release Fishing). Use this information to construct a 95% confidence interval for the percentage of rainbow trout that will die when caught and released using baited size 8 barbed hooks.

Mortality of Released Rainbow Trout Caught with Barbless Hooks

In a similar study of 855 rainbow trout caught using barbless hooks it was found that 26 of them died. Use this information to construct a 95% confidence interval for the percentage of rainbow trout that will die when caught and released using barbless hooks.
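Both trout intervals follow the same recipe: the sample proportion plus or minus z times the estimated standard error. A minimal sketch, using the counts from the two studies and z = 1.96 for 95% confidence (the helper name `proportion_interval` is made up for this example):

```python
from math import sqrt


def proportion_interval(count, n, z=1.96):
    """Large-sample CI for a proportion: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = count / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p-hat
    return p_hat - z * se, p_hat + z * se


barbed = proportion_interval(58, 200)     # 58 deaths among 200 trout, barbed hooks
barbless = proportion_interval(26, 855)   # 26 deaths among 855 trout, barbless hooks

print(f"barbed:   ({barbed[0]:.1%}, {barbed[1]:.1%})")
print(f"barbless: ({barbless[0]:.1%}, {barbless[1]:.1%})")
```

The "29% give or take 4%" in the motivating example was only a rough illustration; the margin of error computed this way for the barbed-hook study is larger.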

Comparing the Mortality Rates

Does this interval suggest that the mortality rate of rainbow trout is lower when caught and released using barbless hooks as opposed to barbed hooks? Explain.

Construct a 2 X 2 contingency table from the combined results of these studies. Then find the RR and OR associated with using barbed hooks when catching and releasing rainbow trout.
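The table and the two measures can be sketched as below. The layout (rows for hook type, columns for died and survived) and the variable names are choices made for this illustration; the counts come from the two studies above, with survivors obtained by subtraction.

```python
# 2 x 2 table: rows are hook type, columns are (died, survived)
table = {
    "barbed": (58, 200 - 58),
    "barbless": (26, 855 - 26),
}

a, b = table["barbed"]      # died, survived with barbed hooks
c, d = table["barbless"]    # died, survived with barbless hooks

risk_barbed = a / (a + b)                # proportion dying, barbed
risk_barbless = c / (c + d)              # proportion dying, barbless
rr = risk_barbed / risk_barbless         # relative risk of death, barbed vs. barbless
odds_ratio = (a / b) / (c / d)           # odds of death, barbed vs. barbless

print(f"RR = {rr:.2f}")
print(f"OR = {odds_ratio:.2f}")
```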

10 – Confidence Intervals for the RR and OR

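Recall that the RR = $\frac{P(\text{disease} \mid \text{risk factor present})}{P(\text{disease} \mid \text{risk factor absent})}$

The RR can only be calculated when the numbers of individuals with and without the disease in the study are random. If a case-control study is used, where these numbers are fixed, it is inappropriate to calculate the conditional probabilities needed to find the RR.

The OR = $\frac{\text{odds of disease with risk factor present}}{\text{odds of disease with risk factor absent}}$ = $\frac{P(\text{disease} \mid \text{present}) / (1 - P(\text{disease} \mid \text{present}))}{P(\text{disease} \mid \text{absent}) / (1 - P(\text{disease} \mid \text{absent}))}$

and gives a measure of the risk associated with the risk factor in terms of a multiplicative statement regarding the odds of having the disease. We now examine confidence intervals for these quantities.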

CI for RR:

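1) Take the natural logarithm of RR to obtain $\ln(RR)$.

2) Compute SE(ln(RR)) = $\sqrt{\frac{1}{a} - \frac{1}{n_1} + \frac{1}{c} - \frac{1}{n_2}}$, where a of the $n_1$ individuals with the risk factor and c of the $n_2$ individuals without the risk factor have the disease.

3) Find $e^{\ln(RR) \pm z \cdot SE(\ln(RR))}$ to obtain (LCL, UCL)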
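The three steps can be sketched in code using the trout data from the previous section (58 deaths among 200 fish with barbed hooks, 26 among 855 with barbless hooks). The standard error used here is the common large-sample formula for ln(RR), sqrt(1/a - 1/n1 + 1/c - 1/n2); treat that choice, along with the function name `rr_confidence_interval`, as an assumption of this example.

```python
from math import exp, log, sqrt


def rr_confidence_interval(a, n1, c, n2, z=1.96):
    """Return (RR, LCL, UCL) using the log method.

    a of n1 exposed individuals and c of n2 unexposed individuals had the outcome.
    """
    rr = (a / n1) / (c / n2)
    log_rr = log(rr)                                  # step 1
    se_log = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)    # step 2 (assumed large-sample formula)
    lcl = exp(log_rr - z * se_log)                    # step 3: back-transform the endpoints
    ucl = exp(log_rr + z * se_log)
    return rr, lcl, ucl


rr, lcl, ucl = rr_confidence_interval(a=58, n1=200, c=26, n2=855)
print(f"RR = {rr:.2f}, 95% CI = ({lcl:.2f}, {ucl:.2f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the RR itself.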