Sampling Variability and an (Empirical) Sampling Distribution for a Proportion
Part 1: Sampling Variability
Note: this part of the activity involves group work. Each group consists of 3 students.
Introduction: Presidential elections are a complex process, but essential for democracy as they empower the people to choose their leaders who will be making decisions that are important for the entire nation. This year, on November 8th, the people of the United States will be voting again, for the 58th time in the history of the USA. Currently, the 2016 Republican presidential nominee is Donald J. Trump, while the Democrats have two possible candidates, Hillary Clinton and Bernie Sanders. There may also be Third Party and Independent Candidates; for example, Gary Johnson is currently running as a Libertarian and a Third Party candidate.
We will simulate the elections by drawing chips from a bag. Our population of eligible voters will be represented by 200 chips, some of which are blue (representing Democratic votes), some red (representing Republican votes), and some yellow (representing other votes). We will define success as “getting a Democratic vote” (drawing a blue chip).
- Imagine that you have polled two randomly selected groups of 10 eligible voters. Would you expect to get the same combination of Democratic, Republican, and “Other” votes each time? Why?
- Now, suppose that you are analyzing the votes of a randomly selected group of 10 eligible voters. For that purpose, without looking into the bag, take a sample of 10 chips from the bag. Record the number of Democratic votes in your sample. Return the chips to the bag and then shuffle all the chips. Repeat the sampling two more times, so that you have three samples in total. For each sample, record the number of Democratic votes and calculate the corresponding sample proportion ($\hat{p}$).
Sample size: n = 10 / Sample 1 / Sample 2 / Sample 3
Number of Democratic votes
Sample Proportion ($\hat{p}$)
- Is the sample proportion ($\hat{p}$) always the same?
The observed variation in the proportion of Democratic votes in your samples is called Sampling Variability. Sampling variability is a result of random chance: each time we draw a sample, we can get a different sample proportion just by chance.
- How could you use the three sample values ($\hat{p}$) to estimate the true population proportion ($p$)?
- Calculate the mean sample proportion: Mean of all $\hat{p}$ = ______
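The chip-drawing procedure in Part 1 can also be sketched in code. The following is a minimal illustration, not part of the original activity. The worksheet does not state how many chips of each color are in the bag, so the counts in `BAG` are an assumption, and the function names are hypothetical.

```python
import random

# Hypothetical bag: the worksheet does not state the true color counts,
# so these numbers are an assumption made only for illustration.
BAG = ["blue"] * 90 + ["red"] * 80 + ["yellow"] * 30  # 200 chips in total


def draw_sample_proportion(bag, n, rng):
    """Draw n chips without replacement and return the proportion of blue chips."""
    # rng.sample leaves the bag itself unchanged, which mirrors returning
    # the chips to the bag before the next draw.
    sample = rng.sample(bag, n)
    return sample.count("blue") / n


def three_samples(bag, n, seed=None):
    """Repeat the draw three times, as each group does in Part 1."""
    rng = random.Random(seed)
    return [draw_sample_proportion(bag, n, rng) for _ in range(3)]


if __name__ == "__main__":
    p_hats = three_samples(BAG, n=10, seed=1)
    print("sample proportions:", p_hats)
    print("mean of all p-hat:", sum(p_hats) / len(p_hats))
    print("proportion of blue in this hypothetical bag:", BAG.count("blue") / len(BAG))
```

Running the script with different seeds gives different sample proportions, which is the sampling variability described above.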
Part 2: Sampling Distribution
Note: In this part the entire class works together.
- Let’s now plot all the sample proportions that we obtained in our class. For each sample value make a dot in the plot. This dotplot represents our empirical Sampling Distribution for the proportion of Democratic votes in random samples of size n = 10. (Note: a theoretical sampling distribution includes sample values that correspond to all the possible samples that can be created; since our sampling yields only a subset of all the possible samples, we use the term empirical.)
- How many dots are there? What does each dot in the dotplot represent?
- Has everyone in the class obtained the same proportion of Democratic votes? Why?
- Let’s now estimate the shape, center, and spread of our empirical sampling distribution.
a) What is the approximate shape of our empirical sampling distribution? Given the shape, which would be the best measures for the center and the spread of the empirical sampling distribution?
b) Now, recall the Empirical Rule for a Normal Distribution (a.k.a. the 68-95-99.7 rule). This rule says that the middle 95% of normally distributed data lie within about _____ standard deviations of the mean. How many dots make up 95% of all the dots?
c) To find the middle 95% of the data, we count off 2.5% of the dots from each end, left and right. Thus, we will count _____ dots from the left and _____ dots from the right in order to determine the Lower Limit and the Upper Limit for the middle 95% of the data. Use the dotplot to find these values.
Lower Limit: / Upper Limit:
d) Note: the standard deviation of the (theoretical) sampling distribution is called the Standard Error. The standard error is determined as $SE = \sqrt{p(1-p)/n}$, where $n$ is the sample size and $p$ is the population proportion. In reality, we do not know the true population value $p$, so we usually take one random sample and estimate the standard error using the sample proportion $\hat{p}$. Since in this case we already have a sampling distribution, we can use a different approach and implement the Empirical Rule to approximate the Standard Error. Namely, we will use the fact that the distance between the lower and upper limit amounts to approximately ______ standard deviations of the sampling distribution. Therefore,
Approximation for the Standard Error = (Upper Limit - Lower Limit) / ______ = ______
e) The midpoint between the lower and upper limit represents an approximation for the ______ of the sampling distribution:
Approximation for the Mean = (Lower Limit + Upper Limit) / 2 = ______
f) Describe the (empirical) sampling distribution using the above approximations.
Shape: / Center: / Spread:
- Again, how can you use the (empirical) sampling distribution to estimate the true population value $p$?
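For checking the hand count afterward, the dot-counting steps in this part can be sketched in code. This is an illustrative sketch rather than the activity's prescribed method. It assumes the usual reading of the Empirical Rule (mean ± 2 standard deviations, so the two limits are about 4 standard deviations apart), and the function names are hypothetical.

```python
import math


def middle_95_limits(proportions):
    """Return (lower, upper) after counting off about 2.5% of the dots from each end."""
    dots = sorted(proportions)
    k = round(0.025 * len(dots))  # number of dots to count off on each side
    return dots[k], dots[len(dots) - 1 - k]


def approximate_center_and_se(proportions):
    """Approximate the mean and standard error from the two limits."""
    lower, upper = middle_95_limits(proportions)
    mean_approx = (lower + upper) / 2  # midpoint of the two limits
    se_approx = (upper - lower) / 4    # assumes the limits span about 4 standard deviations
    return mean_approx, se_approx


def theoretical_se(p, n):
    """Standard error of a sample proportion, sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Made-up class results (40 dots), used only to show the calculation.
    class_dots = [0.1] + [0.2] * 3 + [0.3] * 8 + [0.4] * 14 + [0.5] * 9 + [0.6] * 4 + [0.7]
    print("limits:", middle_95_limits(class_dots))
    print("approximate mean and SE:", approximate_center_and_se(class_dots))
```

With few dots, 2.5% of the total is rarely a whole number, so the rounding in `middle_95_limits` is a judgment call, just as it is when counting dots by hand.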
Conclusion:
When taking random samples from a population, due to random chance, each time that we draw a sample, we usually obtain a different sample with a different sample value. This variation in the sample value is called the ______. By plotting all the sample proportions that we obtained in the class, we created a dotplot which represents our empirical ______ for the proportion of Democratic votes in random samples of size n = _____. Since the shape of the empirical sampling distribution is ______, the best measure for the center is the ______ and the best measure for the spread is the ______. The standard deviation of the sampling distribution is called the ______; for our empirical sampling distribution, this value is approximately ______. The approximation we obtained for the center of our empirical sampling distribution is ______. We will use this value to ______ the unknown population proportion.
Part 3: Sampling Distribution for Increased Sample Size
Note: This is group work. Three students work together to obtain 3 samples, and then the class puts all the results together.
- Let’s now increase the sample size to simulate the polling of 20 randomly selected voters at a time. Without looking into the bag, take 20 chips and record the number of Democratic votes. Return the chips to the bag, and then shuffle all the chips. Repeat the sampling so that you have three samples in total. For each sample, record the number of Democratic votes and find the sample proportion ($\hat{p}$).
Sample size: n = 20 / Sample 1 / Sample 2 / Sample 3
Number of Democratic votes
Sample Proportion ($\hat{p}$)
- Let’s now focus on the mean of the three sample proportions.
a) Calculate the mean sample proportion: Mean of all $\hat{p}$ = ______
b) Compare this value to the previous case. Write down the values of the two means. Which mean do you think lies closer to the true population proportion?
Sample size n = 10: Mean of all $\hat{p}$ = ______
Sample size n = 20: Mean of all $\hat{p}$ = ______
- Again, construct a dotplot of all the sample proportions obtained in the class. For each sample value, put a dot on the plot to construct a new (empirical) sampling distribution.
- Let’s estimate the two values that contain the middle 95% of our new (empirical) sampling distribution.
The middle 95% of the dots lie between the ______ Limit and the ______ Limit. The total number of dots is ______. 95% of this equals ______ dots, so 5% is ______ dots. We will need to count ______ dots from the left and ______ dots from the right to find the two limits. Therefore,
Lower Limit: / Upper Limit:
- Using the two limits, we will now estimate the center and the spread of our new (empirical) sampling distribution.
Approximation for the Mean =
Approximation for the Standard Error =
- Now fill out the table and then compare your approximations for the shape, center, and spread for the two (empirical) sampling distributions that we created in the class.
Sample size / Shape / Center / Spread
n = 10
n = 20
Which similarities and differences do you observe between the two (empirical) sampling distributions?
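The comparison can also be anticipated from the standard error formula for a proportion, $SE = \sqrt{p(1-p)/n}$. The following is a short worked sketch. It assumes the same (unknown) population proportion $p$ for both sample sizes and ignores the small finite-population correction that applies when chips are drawn without replacement from a bag of 200:

```latex
\[
SE_{10} = \sqrt{\frac{p(1-p)}{10}}, \qquad
SE_{20} = \sqrt{\frac{p(1-p)}{20}}, \qquad
\frac{SE_{20}}{SE_{10}} = \sqrt{\frac{10}{20}} = \frac{1}{\sqrt{2}} \approx 0.71
\]
```

Under these assumptions, doubling the sample size leaves the center unchanged and shrinks the spread to roughly 71% of its former value. With only a classroom's worth of dots, the observed ratio can differ noticeably from this figure.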
Part 4: Simulations using Technology
Introduction: Again, we will think of our population as a population of eligible voters, some Democratic, some Republican, and some with “Other” political preference. As earlier, we will define success as “getting a Democratic vote.” The difference is that this time we will use statistical software to help us carry out simulations. For that purpose, we will use an online statistics tool called “StatKey.” To enable simulations, we will need to input our population data into StatKey. Follow these steps:
- Copy the population data.
Open the Excel file “Elections.xlsx” (this file is posted on the class website).
To copy the data, click on the header of the data column and copy (right-click Copy or Ctrl+C).
- Input the population data into “StatKey”.
Open StatKey and, under “Sampling Distributions,” click on “Proportion.”
Click “Edit Data,” select the data (Ctrl+A), delete it (Del), and paste the new data (right-click Paste or Ctrl+V).
- Carry out simulations, following the instructions below.
- First, carry out simulations for sample size n = 10.
- Using “Generate 1 Sample,” create a random sample. Click on “Show Data Table” to see the randomized sample that StatKey generated. Count and record the number of Democratic votes, and then find the corresponding sample proportion. Repeat this twice, so that you have 3 randomized simulations. Each time, record the mean sample proportion (it is displayed in the top right corner of the dotplot, as well as under the arrow in the bottom part of the dotplot).
Sample size: n = 10 / Sample 1 / Sample 2 / Sample 3
Number of Democratic votes
Sample Proportion ($\hat{p}$)
Mean Sample Proportion
- Use “Generate 1 Sample” to make 7 more samples. Then use “Generate 10 Samples” to make 90 more samples. Finally, click on “Generate 100 Samples” a few times. Answer the questions:
a) StatKey has simulated an empirical sampling distribution for us. What does each dot represent? How many dots are there in total?
b) The original population proportion ($p$) is a fixed value, yet the computer generated different sample proportions $\hat{p}$. Why does this happen?
c) Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape: / Center: / Spread:
- Now carry out simulations for sample size n = 20.
- Using “Generate 1 Sample,” create 3 random samples. Each time, click on “Show Data Table” to see the randomized sample that StatKey generated. Count and record the number of Democratic votes and then find their proportions. Also, each time record the mean value of the sampled proportions.
Sample size: n = 20 / Sample 1 / Sample 2 / Sample 3
Number of Democratic votes
Sample Proportion ($\hat{p}$)
Mean Sample Proportion
- Generate additional random samples, so that the total number of samples is exactly the same as in the previous case.
- Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape: / Center: / Spread:
- If we sampled chips from the bag, it would not be easy to carry out simulations for sample size n = 50. Let’s see what happens when software carries out the simulations for us.
- Using “Generate 1 Sample,” create 3 random samples. Each time, click on “Show Data Table” to see the randomized sample that StatKey generated. Count and record the number of Democratic votes and find their proportion. Also, each time record the mean of the sample proportions.
Sample size: n = 50 / Sample 1 / Sample 2 / Sample 3
Number of Democratic votes
Sample Proportion ($\hat{p}$)
Mean Sample Proportion
- Generate additional random samples, so that the total number of samples is exactly the same as in the previous two cases.
- Describe the shape, center, and spread of the simulated empirical sampling distribution.
Shape: / Center: / Spread:
- Compare the three cases (n = 10, n = 20, and n = 50).
a) Which (simulated empirical) sampling distribution has a shape closest to the shape of a normal distribution?
b) State the center for each (simulated empirical) sampling distribution.
Case n = 10 / Case n = 20 / Case n = 50
Mean of the simulated empirical sampling distribution
Compare the three means; which one do you think is closest to the true population proportion?
c) State the spread for each (simulated empirical) sampling distribution.
Case n = 10 / Case n = 20 / Case n = 50
Standard Deviation of the simulated empirical sampling distribution (our approximation for the Standard Error)
How do the three measures of spread compare?
d) Draw a conclusion about how the shape, center, and spread of a (simulated empirical) sampling distribution change with the sample size.
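The Part 4 simulations can be approximated outside StatKey with a short script. The sketch below is an illustration only. The contents of “Elections.xlsx” are not given in the worksheet, so `POPULATION` is an assumed mix, the function names are hypothetical, and the script draws each sample without replacement, which may or may not match what StatKey does internally.

```python
import random
import statistics

# Hypothetical population: the contents of "Elections.xlsx" are not given in
# the worksheet, so this 45% Democratic mix is an assumption for illustration.
POPULATION = ["D"] * 450 + ["R"] * 400 + ["O"] * 150


def simulate_sampling_distribution(population, n, num_samples, seed=None):
    """Return a list of sample proportions of "D", one per simulated sample."""
    rng = random.Random(seed)
    return [rng.sample(population, n).count("D") / n for _ in range(num_samples)]


def summarize(p_hats):
    """Return the center (mean) and spread (standard deviation) of the dots."""
    return statistics.mean(p_hats), statistics.stdev(p_hats)


if __name__ == "__main__":
    for n in (10, 20, 50):
        p_hats = simulate_sampling_distribution(POPULATION, n, num_samples=1000, seed=n)
        center, spread = summarize(p_hats)
        print(f"n = {n:2d}: mean = {center:.3f}, standard deviation = {spread:.3f}")
```

Each printed standard deviation plays the role of the approximate Standard Error recorded in part c) of the comparison.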
Part 5: Test Your Understanding
Let’s now assess your understanding of sampling variability and the (empirical) sampling distribution.
- Answer the questions below.
a) What does “Sampling Variability” mean?
b) Briefly (in two sentences or fewer) describe how the activities that we completed in class illustrate the concept of Sampling Variability.
c) Briefly explain the concept of the (empirical) Sampling Distribution of proportions.
d) State the synonym for the standard deviation of the (theoretical) sampling distribution.
e) What was your favorite part of the activities we did?
f) What is the one concept that you are still unclear about?
- Let’s carry out some more simulations. Imagine that you are considering the population of 225.8 million eligible voters in the USA. Your task is to find an approximation for the number of voters with Democratic preference. The only information you have is obtained by simulating 200 random samples of size 80; for each sample, the proportion of Democratic votes is computed and displayed as a single dot in the dotplot below. Use the dotplot to carry out the task.
Answer the following questions:
a) What does the dotplot above represent? Circle the correct answer:
theoretical Sampling Distribution
empirical Sampling Distribution
simulated empirical Sampling Distribution
b) How would you describe the shape of the dotplot?
c) What would be the best measure for the center of the dotplot? State this value.
d) What would be the best measure for the spread of the dotplot? State this value.
e) Use your knowledge of the sampling distribution to estimate the proportion of Democratic votes.
f) Using your answer from the previous part, estimate the total number of Democratic votes.
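The last step, scaling an estimated proportion up to a count, is simple arithmetic. The following is a minimal sketch, assuming the center read from the dotplot is used as the estimate of the population proportion. The function name is hypothetical, and the value 0.5 in the example is made up, not a reading from the activity's dotplot.

```python
ELIGIBLE_VOTERS = 225.8e6  # population size stated in the activity


def estimate_total_votes(center_of_dotplot, population_size=ELIGIBLE_VOTERS):
    """Scale an estimated proportion up to an estimated number of voters."""
    if not 0.0 <= center_of_dotplot <= 1.0:
        raise ValueError("a proportion must lie between 0 and 1")
    return center_of_dotplot * population_size


if __name__ == "__main__":
    # 0.5 is a made-up center used only to demonstrate the calculation.
    print(f"{estimate_total_votes(0.5):,.0f} estimated Democratic votes")
```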
- Again, we will use StatKey to simulate 100 random drawings from a population of 225.8 million eligible voters in the USA. Each time that a random sample is taken, we record the proportion of Democratic votes and then plot this value in a dotplot. Observe how the shape, center, and spread of the (simulated empirical) sampling distribution change with sample size.
Sample size n = 10
Shape:
Center:
Spread:
Estimated proportion of Democratic votes in the population:
Sample size n = 25
Shape:
Center:
Spread:
Estimated proportion of Democratic votes in the population:
Sample size n = 50
Shape:
Center:
Spread:
Estimated proportion of Democratic votes in the population:
a) How does the shape of the (simulated empirical) sampling distribution change with sample size?
b) What happens to the spread of the (simulated empirical) sampling distribution as sample size increases?
c) Which do you think would be the best estimate for the unknown population proportion?
d) Based on your estimate for the proportion of Democratic votes in the population, approximate the number of Democratic votes in the entire population.
- Your friend wants to illustrate how we can use a (simulated) empirical sampling distribution to estimate the number of students at the College of the Canyons who are 40 or more years old. Your friend seeks data from 20 quintuplets of COC students. For the purpose of research, the data are obtained from a college administrator, who generates the data by using a computer to simulate a random selection of 20 quintuplets of COC students. For each quintuplet (n = 5), the proportion of students of age 40 or older is computed. Based on this, your friend makes a dotplot that represents a (simulated) empirical sampling distribution. You look at the dotplot and conclude that the estimate obtained from this (simulated) empirical sampling distribution is not going to be very good, i.e., your friend should ask the administrator for a new data set in which one variable is changed. What should be changed? Why?
- How would you determine if a coin is fair or tainted?
Recall that if a coin is fair, the probability of getting a head (or a tail) is $p$ = ______. This means that if, for example, we flip a coin n = 60 times, we would expect to obtain about ______ heads on average. However, due to ______, each sequence of 60 flips can yield a different number of heads. This means the sample proportion will vary from one sample to another, so if we plot all the sample proportions, we will obtain a dotplot that consists of different values, and this dotplot will represent the ______ for the given coin. We can then use the center of the sampling distribution as a point estimate for the probability of getting a head.
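The coin experiment can be simulated in the same spirit. The sketch below is an illustration, not part of the original activity. The function names, the number of repeated sequences (500), and the 0.7 head probability used for the second coin are assumptions chosen only for the demonstration.

```python
import random
import statistics


def proportion_of_heads(n_flips, p_heads, rng):
    """Flip a coin n_flips times and return the proportion of heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < p_heads)
    return heads / n_flips


def coin_sampling_distribution(n_flips=60, num_samples=500, p_heads=0.5, seed=None):
    """Return one sample proportion per simulated sequence of flips."""
    rng = random.Random(seed)
    return [proportion_of_heads(n_flips, p_heads, rng) for _ in range(num_samples)]


if __name__ == "__main__":
    fair = coin_sampling_distribution(seed=0)
    tainted = coin_sampling_distribution(p_heads=0.7, seed=0)  # hypothetical biased coin
    print("center for the fair coin:   ", round(statistics.mean(fair), 3))
    print("center for the tainted coin:", round(statistics.mean(tainted), 3))
```

Comparing where each dotplot is centered, relative to its spread, is one informal way to judge whether a coin looks fair.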