SAMPLING DISTRIBUTIONS
Recall what we said in Chapter 1:
Statistics are calculated from samples.
Random samples allow probability calculations.
Probability calculations give measures of reliability in statistics.
Sampling is usually done without replacement (i.e., after each item in the sample is selected, it is not allowed to be picked again).
The only problem with this form of sampling is that the population will usually change its composition as each new item is selected, which makes probability calculations more difficult.
Example: (Population = 7 good items and 3 defective ones)
To find the probability of getting 1 defective item in a random sample of 3 items:
P(exactly 1 defective) = C(3,1)·C(7,2)/C(10,3) = (3)(21)/120 = 63/120 = .525
The probability calculation isn’t that hard here, but imagine if you tried this with a larger sample, say n = 300, from a population of, say, N = 1,000 items.
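If you would rather check this calculation by computer than by hand, here is a minimal Python sketch (assuming Python 3.8+ for math.comb); the numbers are the ones from the example above:

from math import comb

# Population: N = 10 items (7 good, 3 defective); sample n = 3 without replacement.
# P(exactly 1 defective) = C(3,1) * C(7,2) / C(10,3)
p_one_defective = comb(3, 1) * comb(7, 2) / comb(10, 3)
print(p_one_defective)   # 0.525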
Probability calculations are much easier if you sample with replacement (because then the population never changes its composition). “With replacement” means that any item chosen to be in your sample is “replaced” back into the population and has a chance of being picked again.
Here’s the part that explains why you’d even consider using formulas/results based on sampling “with replacement”:
Fact: When taking small samples from large populations, sampling “with” or “without” replacement gives almost identical results.
Bottom line: yes, we do use sampling without replacement to obtain our data, but we use the simpler formulas associated with sampling with replacement in our probability calculations.
What’s a “small” sample? Answer: in practice, as long as n ≤ .05N, the “with replacement” formulas will be essentially equivalent to the more complicated “without replacement” formulas.
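Here is a small Python sketch of that fact: it compares the exact “without replacement” (hypergeometric) probabilities with the simpler “with replacement” (binomial) probabilities for a made-up population of N = 1,000 items that is 30% defective, using a sample of n = 30 (so n = .03N ≤ .05N):

from math import comb

def without_replacement(k, N, D, n):
    # P(k defectives in a sample of n drawn WITHOUT replacement
    # from N items of which D are defective)
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

def with_replacement(k, n, p):
    # P(k defectives in a sample of n drawn WITH replacement,
    # where each draw is defective with probability p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, D, n = 1000, 300, 30          # illustrative values only
for k in (5, 10, 15):
    print(k,
          round(without_replacement(k, N, D, n), 4),
          round(with_replacement(k, n, D / N), 4))

The two probability columns come out nearly identical, just as the fact above claims.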
In Chapter 4, the text more or less assumes that we are sampling without replacement and that we are using fairly small (n ≤ .05N) sample sizes relative to the population size.
Later, in Chapter 5, the text addresses the issue of what to do when n > .05N, i.e., in those situations when the sample size gets to be more than 5% of the population size. We address these issues right now (below in the notes).
In the meantime, it would be nice to know how to obtain a random sample (without replacement) using Excel:
(1) Associate a unique integer from 1 to N with each item in the population.
(2) Use the “=RANDBETWEEN” function to generate a random sample of integers from 1, 2, 3, …, N. (Note: in older versions of Excel this function will not work unless you have first made the Analysis ToolPak (DATA ANALYSIS) an add-in.)
Example: Double-click on the embedded Excel worksheet below. The formula “=RANDBETWEEN(1,100)” has been entered into cell D3. Click and drag this cell down to generate more sampled values from the numbers 1, 2, 3, ..., 100.
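If you would rather not use Excel, the same “unique labels plus random integers” idea can be sketched in a few lines of Python; random.sample draws n distinct labels, i.e., a random sample without replacement (N = 100 and n = 10 are just illustrative values):

import random

N, n = 100, 10                           # illustrative population and sample sizes
labels = range(1, N + 1)                 # one unique integer label per population item
chosen = random.sample(labels, n)        # n distinct labels = sampling WITHOUT replacement
print(sorted(chosen))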
The sampling distribution of a statistic is the probability distribution of all possible values of the statistic that can be generated by taking random samples of size n from a population.
Example: (toss a fair die twice; let x̄ be the average of the two tosses)
[Figure: side-by-side plots of the “Population” distribution (heights of 1/6 over the values 1–6) and the sampling distribution of x̄ for n = 2 (over the values 1, 1.5, 2, …, 6).]
Notice that the sampling distribution has the same mean as the
population, but with a smaller standard deviation.
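Here is a short Python enumeration of the die example that reproduces those numbers: it lists all 36 equally likely pairs of tosses, computes x̄ for each pair, and compares the mean and standard deviation of the x̄ values with the population values:

from itertools import product

faces = [1, 2, 3, 4, 5, 6]

# Population mean and standard deviation (each face has probability 1/6)
mu = sum(faces) / 6
sigma = (sum((x - mu) ** 2 for x in faces) / 6) ** 0.5

# Sampling distribution of x-bar for n = 2: all 36 equally likely ordered pairs
xbars = [(a + b) / 2 for a, b in product(faces, repeat=2)]
mu_xbar = sum(xbars) / len(xbars)
sigma_xbar = (sum((x - mu_xbar) ** 2 for x in xbars) / len(xbars)) ** 0.5

print(mu, sigma)             # 3.5  and about 1.708
print(mu_xbar, sigma_xbar)   # 3.5  and about 1.208  (= sigma / sqrt(2))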
Fact (section 4.11 of text):
Relationship between means and standard deviations
of population and sampling distribution
μ_x̄ = μ
σ_x̄ = σ/√n
These equalities are always true whenever you use random samples with replacement or when you sample without replacement and the sample size, n, is small compared to the population size, N (i.e., when n ≤ .05N).
When the sample size is large compared to the population size,
then a correction factor must be used (when n > .05N):
μ_x̄ = μ
σ_x̄ = (σ/√n)·√((N − n)/(N − 1))
These equalities are always true whenever you
use random sampling without replacement.
As mentioned before, you’ll have to wait until Chapter 5 (section 5.6)
to see the text mention this.
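Here is a small Python sketch of how the correction factor changes σ_x̄; the population values (σ = 10 and N = 1,000) are assumed only for illustration:

from math import sqrt

def sigma_xbar(sigma, n, N):
    # Standard deviation of the sampling distribution of x-bar.
    # Apply the finite-population correction factor when n > .05N.
    se = sigma / sqrt(n)
    if n > 0.05 * N:
        se *= sqrt((N - n) / (N - 1))    # correction factor
    return se

sigma, N = 10, 1000                       # illustrative values only
print(sigma_xbar(sigma, 30, N))           # n = 30 <= .05N: no correction, 10/sqrt(30), about 1.826
print(sigma_xbar(sigma, 300, N))          # n = 300 > .05N: correction shrinks it to about 0.483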
Fact: If you happen to know that the population you are sampling
from is normally distributed, then it is a mathematical fact that the
sampling distribution of x̄ is exactly normal:
Normal Population
When the population is known to be normal, then
the shape of the sampling distribution of x̄ is
exactly normal for any sample size, n (no matter
how small or large n may be).
The situation of “knowing in advance” that a population is normal
usually occurs for populations that people have studied before, where
histograms of data have shown that the data does indeed look
bell-shaped.
●In the many cases where you don’t know in advance (i.e., before sampling) if a population is normal, or even approximately bell-shaped, the Central Limit Theorem (section 4.11) says you can still treat the sampling distribution of x̄ as being normally distributed:
Central Limit Theorem
The shape of the sampling distribution of x̄ is approximately normal for large n (n ≥ 30). The approximation gets better and better as n increases.
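Here is a quick simulation sketch of the Central Limit Theorem in Python: it draws many samples of size n = 30 from a clearly non-normal (right-skewed, exponential) population and looks at the resulting x̄ values; the population choice and the number of repetitions are assumptions made only for illustration:

import random
from statistics import mean, stdev

random.seed(1)
n, reps = 30, 10_000

# A right-skewed (exponential, mean 1) population: definitely not bell-shaped.
xbars = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

# The x-bar values pile up symmetrically around the population mean (1.0),
# with spread close to sigma / sqrt(n) = 1 / sqrt(30), or about 0.18.
print(round(mean(xbars), 3), round(stdev(xbars), 3))

A histogram of the xbars list would look roughly bell-shaped even though the population itself is badly skewed.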
●The most commonly occurring situation in practice is when we:
(1) (randomly) sample without replacement
(2) from a large population (so n ≤ .05N)
(3) and use a large sample size (n ≥ 30)
In this situation,
(1) the sampling distribution of x̄ is approximately normal, and
(2) μ_x̄ = μ and σ_x̄ = σ/√n
Example: Problem 4.154 in the text is a typical example of where one could use the Central Limit Theorem. In this problem, not only don’t we know whether the population of all such salaries is normal, it is probably the case that it isn’t normal at all (because salary data often follow a skewed distribution: most of the salaries are bunched close together, but there is likely a small right-hand tail representing salaries of people who are paid a good deal above average). But that doesn’t matter! As long as n is large (n ≥ 30), the Central Limit Theorem says that the sampling distribution of x̄ is still approximately normal. Pretty powerful stuff!
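Here is a hedged sketch of the kind of normal-curve calculation such a problem calls for; the numbers (μ = $50,000, σ = $15,000, n = 50, and the $55,000 cutoff) are hypothetical stand-ins, not the values from Problem 4.154:

from math import sqrt, erf

def normal_cdf(z):
    # Standard normal cumulative probability P(Z <= z)
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 50_000, 15_000, 50        # hypothetical population values
sigma_xbar = sigma / sqrt(n)             # standard deviation of x-bar

# P(x-bar > 55,000), using the CLT since n >= 30
z = (55_000 - mu) / sigma_xbar
print(1 - normal_cdf(z))                 # about 0.009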
●Chapter 4 doesn’t mention this (the text waits until Chapter 5, section 5.4 to say it), but the Central Limit Theorem also tells us all about the sampling distribution of the sample proportion, p̂.
When talking about a proportion p, such as the proportion of people who buy a certain product, you can “code” the two responses (here, “buy”, “don’t buy”) as 1 (for “buy”) and 0 (for “don’t buy”).
In statistics “coding” is the word we use to describe assigning a number or numerical value to a non-numerical item in the data.
You don’t have to do this, but it can easily be shown that when we code responses with the “0 and 1” scheme, then the sample mean of the 0’s and 1’s in the data is identical to the sample proportion, p̂, of 1’s in the data and that the formulas for the mean and standard deviation of the sampling distribution of the mean (the subject of section 4.11) can be rewritten in terms of p as:
μ_p̂ = p   and   σ_p̂ = √(p(1 − p)/n)
Furthermore, “yes/no” or “0/1” data is so easy to get that we almost always have large samples available when using such data. That means that the Central Limit Theorem can be used to show that the sampling distribution of p̂ is approximately normal. This is all described later in Section 5.4 of Chapter 5, but I thought it would be nice to introduce it here in Chapter 4 where sampling distributions are first discussed.
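To close, here is a small Python sketch of the 0/1 coding idea; the list of responses and the population proportion p = .5 are made up just to illustrate that the sample mean of coded data is exactly p̂ and that σ_p̂ = √(p(1 − p)/n):

from math import sqrt
from statistics import mean

# Hypothetical coded responses: 1 = "buy", 0 = "don't buy"
responses = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

p_hat = mean(responses)              # the sample mean of the 0/1 data ...
print(p_hat)                         # ... is exactly the sample proportion of 1's: 0.6

p, n = 0.5, len(responses)           # assumed true population proportion, sample size
print(sqrt(p * (1 - p) / n))         # sigma_p-hat = sqrt(p(1-p)/n), about 0.158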