SAMPLING DISTRIBUTIONS
Recall what we said in Chapter 1:
Statistics are calculated from samples.
Random samples allow probability calculations.
Probability calculations give measures of reliability in statistics.
Sampling is usually done without replacement (i.e., after each item in the sample is selected, it is not allowed to be picked again).
The only problem with this form of sampling is that the population will usually change its composition as each new item is selected, which makes probability calculations more difficult.
Example: (Population = 7 good items and 3 defective ones)
To find the probability of getting 1 defective item in a random sample of 3 items:
P(exactly 1 defective) = C(3,1)·C(7,2)/C(10,3) = (3)(21)/120 = 63/120 = .525
The probability calculation isn’t that hard here, but imagine if you tried this with a larger sample, say n = 300, from a population of, say, N = 1,000 items.
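If you would rather check this calculation by computer than by hand, here is a minimal Python sketch (assuming Python 3.8+ for math.comb); the numbers are the ones from the example above:

from math import comb

# Population: N = 10 items (7 good, 3 defective); sample n = 3 without replacement.
# P(exactly 1 defective) = C(3,1) * C(7,2) / C(10,3)
p_one_defective = comb(3, 1) * comb(7, 2) / comb(10, 3)
print(p_one_defective)   # 0.525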
Probability calculations are much easier if you sample with replacement (because then the population never changes its composition). “With replacement” means that any item chosen to be in your sample is “replaced” back into the population and has a chance of being picked again.
Here’s the part that explains why you’d even consider using formulas/results based on sampling “with replacement”:
Fact: When taking small samples from large populations, sampling “with” or “without” replacement gives almost identical results.
Bottom line: yes, we do use sampling without replacement to obtain our data, but we use the simpler formulas associated with sampling with replacement in our probability calculations.
What’s a “small” sample? Answer: in practice, as long as n ≤ .05N, the “with replacement” formulas will be essentially equivalent to the more complicated “without replacement” formulas.
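Here is a small Python sketch of that fact: it compares the exact “without replacement” (hypergeometric) probabilities with the simpler “with replacement” (binomial) probabilities for a made-up population of N = 1,000 items that is 30% defective, using a sample of n = 30 (so n = .03N ≤ .05N):

from math import comb

def without_replacement(k, N, D, n):
    # P(k defectives in a sample of n drawn WITHOUT replacement
    # from N items of which D are defective)
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

def with_replacement(k, n, p):
    # P(k defectives in a sample of n drawn WITH replacement,
    # where each draw is defective with probability p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, D, n = 1000, 300, 30          # illustrative values only
for k in (5, 10, 15):
    print(k,
          round(without_replacement(k, N, D, n), 4),
          round(with_replacement(k, n, D / N), 4))

The two probability columns come out nearly identical, just as the fact above claims.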
In Chapter 4, the text more or less assumes that we are sampling without replacement and that we are using fairly small (n ≤ .05N) sample sizes relative to the population size.
Later, in Chapter 5, the text addresses the issue of what to do when n > .05N, i.e., in those situations when the sample size gets to be more than 5% of the population size. We address these issues right now (below in the notes).
In the meantime, it would be nice to know how to obtain a random sample (without replacement) using Excel:
(1) Associate a unique integer from 1 to N with each item in the population.
(2) Use the “=RANDBETWEEN” function to generate a random sample of integers from 1, 2, 3, …, N. (Note: in older versions of Excel this function will not work unless you have first made the Analysis ToolPak (DATA ANALYSIS) an add-in.)
Example: Double-click on the embedded Excel worksheet below. The formula “=RANDBETWEEN(1,100)” has been entered into cell D3. Click and drag this cell down to generate more sampled values from the numbers 1, 2, 3, ..., 100.
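If you would rather not use Excel, the same “unique labels plus random integers” idea can be sketched in a few lines of Python; random.sample draws n distinct labels, i.e., a random sample without replacement (N = 100 and n = 10 are just illustrative values):

import random

N, n = 100, 10                           # illustrative population and sample sizes
labels = range(1, N + 1)                 # one unique integer label per population item
chosen = random.sample(labels, n)        # n distinct labels = sampling WITHOUT replacement
print(sorted(chosen))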
The sampling distribution of a statistic is the probability distribution of all possible values of the statistic that can be generated by taking random samples of size n from a population.
Example: (toss a fair die twice; let x̄ be the average of the two tosses)
[Figure: side-by-side plots of the “Population” distribution (heights of 1/6 over the values 1–6) and the sampling distribution of x̄ for n = 2 (over the values 1, 1.5, 2, …, 6).]
Notice that the sampling distribution has the same mean as the
population, but with a smaller standard deviation.
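Here is a short Python enumeration of the die example that reproduces those numbers: it lists all 36 equally likely pairs of tosses, computes x̄ for each pair, and compares the mean and standard deviation of the x̄ values with the population values:

from itertools import product

faces = [1, 2, 3, 4, 5, 6]

# Population mean and standard deviation (each face has probability 1/6)
mu = sum(faces) / 6
sigma = (sum((x - mu) ** 2 for x in faces) / 6) ** 0.5

# Sampling distribution of x-bar for n = 2: all 36 equally likely ordered pairs
xbars = [(a + b) / 2 for a, b in product(faces, repeat=2)]
mu_xbar = sum(xbars) / len(xbars)
sigma_xbar = (sum((x - mu_xbar) ** 2 for x in xbars) / len(xbars)) ** 0.5

print(mu, sigma)             # 3.5  and about 1.708
print(mu_xbar, sigma_xbar)   # 3.5  and about 1.208  (= sigma / sqrt(2))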
Fact (section 4.11 of text):
Relationship between means and standard deviations
of population and sampling distribution
μ_x̄ = μ
σ_x̄ = σ/√n
These equalities are always true whenever you use random samples with replacement or when you sample without replacement and the sample size, n, is small compared to the population size, N (i.e., when n ≤ .05N).
When the sample size is large compared to the population size,
then a correction factor must be used (when n > .05N):
μ_x̄ = μ
σ_x̄ = (σ/√n)·√((N − n)/(N − 1))
These equalities are always true whenever you
use random sampling without replacement.
As mentioned before, you’ll have to wait until Chapter 5 (section 5.6)
to see the text mention this.
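Here is a small Python sketch of how the correction factor changes σ_x̄; the population values (σ = 10 and N = 1,000) are assumed only for illustration:

from math import sqrt

def sigma_xbar(sigma, n, N):
    # Standard deviation of the sampling distribution of x-bar.
    # Apply the finite-population correction factor when n > .05N.
    se = sigma / sqrt(n)
    if n > 0.05 * N:
        se *= sqrt((N - n) / (N - 1))    # correction factor
    return se

sigma, N = 10, 1000                       # illustrative values only
print(sigma_xbar(sigma, 30, N))           # n = 30 <= .05N: no correction, 10/sqrt(30), about 1.826
print(sigma_xbar(sigma, 300, N))          # n = 300 > .05N: correction shrinks it to about 0.483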
Fact: If you happen to know that the population you are sampling
from is normally distributed, then it is a mathematical fact that the
sampling distribution of x̄ is exactly normal:
Normal Population
When the population is known to be normal, then
the shape of the sampling distribution of x̄ is
exactly normal for any sample size, n (no matter
how small or large n may be).
The situation of “knowing in advance” that a population is normal
usually occurs for populations that people have studied before, where
histograms of data have shown that the data does indeed look
bell-shaped.
●In the many cases where you don’t know in advance (i.e., before sampling) if a population is normal, or even approximately bell-shaped, the Central Limit Theorem (section 4.11) says you can still treat the sampling distribution of x̄ as being normally distributed:
Central Limit Theorem
The shape of the sampling distribution of x̄ is approximately normal for large n (n ≥ 30). The approximation gets better and better as n increases.
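Here is a quick simulation sketch of the Central Limit Theorem in Python: it draws many samples of size n = 30 from a clearly non-normal (right-skewed, exponential) population and looks at the resulting x̄ values; the population choice and the number of repetitions are assumptions made only for illustration:

import random
from statistics import mean, stdev

random.seed(1)
n, reps = 30, 10_000

# A right-skewed (exponential, mean 1) population: definitely not bell-shaped.
xbars = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

# The x-bar values pile up symmetrically around the population mean (1.0),
# with spread close to sigma / sqrt(n) = 1 / sqrt(30), or about 0.18.
print(round(mean(xbars), 3), round(stdev(xbars), 3))

A histogram of the xbars list would look roughly bell-shaped even though the population itself is badly skewed.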
●The most commonly occurring situation in practice is when we:
(1) (randomly) sample without replacement
(2) from a large population (so n ≤ .05N)
(3) and use a large sample size (n ≥ 30)
In this situation,
(1) the sampling distribution of x̄ is approximately normal, and
(2) μ_x̄ = μ and σ_x̄ = σ/√n
Example: Problem 4.154 in the text is a typical example of where one could use the Central Limit Theorem. In this problem, not only don’t we know whether the population of all such salaries is normal, it is probably the case that it isn’t normal at all (because salary data often follow a skewed distribution: most of the salaries are bunched close together, but there is likely a small right-hand tail representing salaries of people who are paid a good deal above average). But that doesn’t matter! As long as n is large (n ≥ 30), the Central Limit Theorem says that the sampling distribution of x̄ is still approximately normal. Pretty powerful stuff!
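Here is a hedged sketch of the kind of normal-curve calculation such a problem calls for; the numbers (μ = $50,000, σ = $15,000, n = 50, and the $55,000 cutoff) are hypothetical stand-ins, not the values from Problem 4.154:

from math import sqrt, erf

def normal_cdf(z):
    # Standard normal cumulative probability P(Z <= z)
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 50_000, 15_000, 50        # hypothetical population values
sigma_xbar = sigma / sqrt(n)             # standard deviation of x-bar

# P(x-bar > 55,000), using the CLT since n >= 30
z = (55_000 - mu) / sigma_xbar
print(1 - normal_cdf(z))                 # about 0.009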
●Chapter 4 doesn’t mention this (the text waits until Chapter 5, section 5.4 to say it), but the Central Limit Theorem also tells us all about the sampling distribution of the sample proportion, p̂.
When talking about a proportion p, such as the proportion of people who buy a certain product, you can “code” the two responses (here, “buy”, “don’t buy”) as 1 (for “buy”) and 0 (for “don’t buy”).
In statistics “coding” is the word we use to describe assigning a number or numerical value to a non-numerical item in the data.
You don’t have to do this, but it can easily be shown that when we code responses with the “0 and 1” scheme, then the sample mean of the 0’s and 1’s in the data is identical to the sample proportion, p̂, of 1’s in the data and that the formulas for the mean and standard deviation of the sampling distribution of the mean (the subject of section 4.11) can be rewritten in terms of p as:
μ_p̂ = p   and   σ_p̂ = √(p(1 − p)/n)
Furthermore, “yes/no” or “0/1” data is so easy to get that we almost always have large samples available when using such data. That means that the Central Limit Theorem can be used to show that the sampling distribution of p̂ is approximately normal. This is all described later in Section 5.4 of Chapter 5, but I thought it would be nice to introduce it here in Chapter 4 where sampling distributions are first discussed.
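To close, here is a small Python sketch of the 0/1 coding idea; the list of responses and the population proportion p = .5 are made up just to illustrate that the sample mean of coded data is exactly p̂ and that σ_p̂ = √(p(1 − p)/n):

from math import sqrt
from statistics import mean

# Hypothetical coded responses: 1 = "buy", 0 = "don't buy"
responses = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

p_hat = mean(responses)              # the sample mean of the 0/1 data ...
print(p_hat)                         # ... is exactly the sample proportion of 1's: 0.6

p, n = 0.5, len(responses)           # assumed true population proportion, sample size
print(sqrt(p * (1 - p) / n))         # sigma_p-hat = sqrt(p(1-p)/n), about 0.158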