Options for Demonstrating Sampling Variability and Sampling Distributions in Teaching Statistics

Robin Beaumont

Tuesday, 11 October 2011

Contents

Sampling in SPSS and R

1 Using SPSS

1.1 Using SPSS syntax

1.1.1 One sample

1.1.2 Multiple samples all the same size and from same distribution.

1.1.3 Samples of different sizes

1.1.4 Sampling distributions

2 Online Apps

3 The standard error of the Mean

3.1.1 Effect of sample size upon SEM - formula appreciation

4 Using SPSS script

4.1.1 Alternative script - Distribution.sbs

5 In R

6 Online presentations and other tools

Sampling in SPSS and R

The aim of this handout is to describe the various options available for teaching the concept of sampling variability along with some student material.

The process usually involves creating samples and then comparing them with both the parent population and amongst themselves (SEM demonstration).

I offer four ways of doing this below: using SPSS (two methods), online apps, and R.

1 Using SPSS

1.1 Using SPSS syntax

The traditional way of investigating random samples in SPSS is to use the SPSS syntax window:

1.1.1 One sample

A simple example that creates a single sample of 10,000 cases from a Normal distribution with mean = 100 and SD = 15 (use the Analyze menu afterwards to get the results):

* Example of creating a random sample.
* Create 10,000 cases for the sample.
NEW FILE.
INPUT PROGRAM.
LOOP #1 = 1 TO 10000.
COMPUTE X = RV.NORMAL(100,15).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.

And to get a boxplot:

The next exercise is to produce several samples.

1.1.2 Multiple samples all the same size and from same distribution.

The next syntax creates 11 variables, V20 to V30, each holding a sample of the same size. The first version assumes that you have already run the syntax above; if you have not, use the second version.

If you have run the above script (the new variables are added to the existing data set, one random value per existing case, so each sample has as many values as there are cases):

NUMERIC V20 TO V30.
VECTOR v = V20 TO V30.
* Loop over the samples (columns); SPSS repeats this for every existing case (row).
LOOP #i = 1 TO 11.
COMPUTE v(#i) = RV.NORMAL(100,15).
END LOOP.
EXECUTE.

If you have not run the above script (this creates a new data set of 100 cases):

NEW FILE.
INPUT PROGRAM.
NUMERIC V20 TO V30.
VECTOR v = V20 TO V30.
* Outer loop: one pass per case (row), which sets the sample size.
LOOP #case = 1 TO 100.
* Inner loop: one pass per sample (column).
LOOP #i = 1 TO 11.
* Fill in the value for this sample (column) in the current case (row).
COMPUTE v(#i) = RV.NORMAL(100,15).
END LOOP.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
Typical output:

Descriptive Statistics

Variable   N     Mean       Std. Deviation
V20        100   101.5421   14.20531
V21        100   101.0039   15.53362
V22        100   99.1124    14.14247
V23        100   97.6240    14.07071
V24        100   99.9382    14.43248
V25        100   100.1818   13.80487
V26        100   100.4502   15.45697
V27        100   101.6055   15.04477
V28        100   100.8888   14.05551
V29        100   101.6523   14.24829
V30        100   99.9043    14.19884

1.1.3 Samples of different sizes

There are two main ways to do this: you can create all the samples in a single variable and add a grouping variable, or alternatively create several variables with a different sample size in each. For various reasons the former strategy is best; however, just for interest, I have included below the latter option of putting the samples of different sizes in separate variables:

NEW FILE.
INPUT PROGRAM.
LOOP #count = 1 TO 500.
DO IF (#count <31).
COMPUTE samp30 = RV.NORMAL(100,15).
END IF.
DO IF (#count <51).
COMPUTE samp50 = RV.NORMAL(100,15).
END IF.
DO IF ( #count <101).
COMPUTE samp100 = RV.NORMAL(100,15).
END IF.
COMPUTE samp500 = RV.NORMAL(100,15).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.

This approach (a separate variable for each sample) causes problems when analysing the data, because SPSS treats the smaller samples as having missing values. The better solution is therefore to use a grouping variable, that is, an identifier indicating which sample each observation (case) belongs to.

The next SPSS syntax script produces the same four samples but creates just two variables, GROUP and VALUE:

new file.
input program.
loop #i=1 to 30.
compute group=1.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 50.
compute group=2.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 100.
compute group=3.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 500.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
execute.

The code above is not the most elegant; you could instead use one loop with a number of DO IF statements:
new file.
input program.
loop #i=1 to 500.
DO IF (#i<31).
compute group=1.
compute value=rv.Normal(100,15).
end case.
END IF.
DO IF (#i<51).
compute group=2.
compute value=rv.Normal(100,15).
end case.
END IF.
DO IF (#i<101).
compute group=3.
compute value=rv.Normal(100,15).
end case.
END IF.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
SORT CASES by group(a).
execute .

Both of the above SPSS syntax files do the same thing: they produce four samples of different sizes from a Normal distribution with mean = 100 and SD = 15.

You could easily change the parameters of the distribution, or even the distribution itself. Two alternatives are:

the uniform: rv.Uniform(lower, upper) or

exponential: rv.exp(mean)
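
As a point of comparison only, swapping the parent distribution looks like this in R. This is a sketch, not part of the SPSS exercise: the sample size, limits and seed are arbitrary assumptions. Note that R's rexp() is parameterised by the rate (1/mean), so it is worth checking the parameterisation of whichever function you use.

```r
# Illustrative sketch: the same sample size drawn from three parent distributions.
set.seed(1)                                    # arbitrary seed, for reproducibility
n <- 100                                       # assumed sample size
samples <- list(
  normal      = rnorm(n, mean = 100, sd = 15),
  uniform     = runif(n, min = 50, max = 150), # assumed lower and upper limits
  exponential = rexp(n, rate = 1 / 100)        # rate = 1 / mean, so mean = 100
)
# Mean and SD of each sample, one column per distribution
print(sapply(samples, function(x) c(mean = mean(x), sd = sd(x))))
# Side-by-side boxplots make the different shapes visible
boxplot(samples, main = "Samples from three parent distributions")
```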

Using the Explore command in SPSS shows the SD for each group and also a box plot.

Having carried out the above tasks, students can complete the following table.

Sample size                    Minimum value   Mean   Maximum value   Standard deviation
30
50
100
500
Theoretical population value

The above exercise will demonstrate that:

The standard deviation varies little with sample size - there must be a sample size adjustment factor in it!

The mean also varies little from the population mean of 100 (repeated sampling shows wider variation for smaller samples - see the next exercise).
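
For readers who prefer R, the table can be filled in with a short sketch such as the one below. It is an illustrative equivalent of the SPSS exercise rather than part of it; the seed and the names sizes, summarise_sample and results are assumptions made for this example.

```r
# Illustrative sketch (not from the SPSS material): draw one sample of each
# size from Normal(mean = 100, sd = 15) and summarise it.
set.seed(2011)                      # arbitrary seed, only for reproducibility
sizes <- c(30, 50, 100, 500)        # sample sizes used in the exercise

# Hypothetical helper: returns the statistics asked for in the table
summarise_sample <- function(n) {
  x <- rnorm(n, mean = 100, sd = 15)
  c(n = n, min = min(x), mean = mean(x), max = max(x), sd = sd(x))
}

# One row per sample size, one column per statistic
results <- as.data.frame(t(sapply(sizes, summarise_sample)))
print(round(results, 2))
```

The sd column should stay near 15 at every sample size, while the minimum and maximum tend to move further apart as the sample grows.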

The above exercise can then be repeated, changing the sample sizes to 3, 10, 20 and 30:

new file.
input program.
loop #i=1 to 3.
compute group=1.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 10.
compute group=2.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 20.
compute group=3.
compute value=rv.Normal(100,15).
end case.
end loop.
loop #i=1 to 30.
compute group=4.
compute value=rv.Normal(100,15).
end case.
end loop.
end file.
end input program.
execute.

Given that these are random samples, each person will obtain a different result; however, what they should notice is that the sample means (and the medians shown in a boxplot) vary less as the sample size gets larger. You could ask students to repeatedly create multiple random samples of varying size and then plot the means (technically, this would produce a sampling distribution of the mean), but at this stage it is probably better to switch to online simulations (see below).

1.1.4 Sampling distributions

Typical student explanation:

So far we have looked at the characteristics of one or more samples from a population, but what about the characteristics across samples? Why, you may well ask, would we bother with such additional complexity? Just consider this:

I have a valuable substance (Guinness) and want to take as small a sample as possible to find an accurate mean value of substance X.

So how can we work out the smallest sample that will still produce an accurate mean value?

To answer this question obviously we need to assess the variation of means across samples of a specific size. While we have done this for a small number of samples we will now consider many samples to produce a distribution.

2 Online Apps

Go to

Using the app at this website we can ask for repeated samples of different sizes and then plot their means. I have done it for 10,000 samples of size 5 and also of size 25.

- Students should notice how much more spread out the means are for the smaller samples.

Student explanation:

3 The standard error of the Mean

The Standard Error of the Mean (SEM) is the standard deviation of the sample means. In other words, it is just another standard deviation, but now we are at the between-sample level rather than the within-sample level. Because we are working at a different level the name has changed, but the idea concerning spread is the same. From the above exercise, we have both the population data and information about a set of samples from it. Interestingly, all we need to estimate the SEM is information from a single sample (its standard deviation and its size). We will now compare the observed answer (for the samples generated above, 2.23 for samples of size 5) with the value given by a specific formula, known as the SEM (Standard Error of the Mean) formula: SEM = SD/√N = 5/√5 = 2.236

and for the sample size of 25: SEM = 5/√25 = 1

We can see from the above formula that the Standard Error of the Mean is equal to the standard deviation divided by the square root of the sample size. We have samples of size 5 and 25, so we can calculate the SEM for each one. You will notice that the observed SD of the sample means is almost identical to the value given by the formula. This is truly amazing: we can predict the distribution of means of random samples without carrying out the sampling, just by using the SEM formula.
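
The comparison can also be reproduced without the app. The sketch below is only a stand-in for it, built on assumptions: a Normal population with SD = 5 (the value used in the formula above), an arbitrary population mean of 50, and 10,000 samples at each size. The function name sd_of_means is made up for this example.

```r
# Illustrative sketch: compare the observed SD of many sample means with
# the value predicted by SEM = SD / sqrt(N).
set.seed(42)                # arbitrary seed, only for reproducibility
pop_mean  <- 50             # assumed; the SEM does not depend on it
pop_sd    <- 5              # population SD used in the text
n_samples <- 10000          # number of samples drawn at each size

# Hypothetical helper: SD of the means of n_samples samples of size n
sd_of_means <- function(n) {
  means <- replicate(n_samples, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))
  sd(means)
}

for (n in c(5, 25)) {
  cat("n =", n,
      " observed SD of means =", round(sd_of_means(n), 3),
      " formula SEM =", round(pop_sd / sqrt(n), 3), "\n")
}
```

The observed values should land close to 2.236 and 1, though not exactly on them, because they come from a finite number of random samples.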

3.1.1 Effect of sample size upon SEM - formula appreciation

We know that the formula for the standard error of the mean (SEM) is:

SEM = SD/√N

Let's consider what happens to the SEM as the sample size changes. In the above equation the top value (numerator) remains constant, but the bottom value (denominator) increases with the sample size. When this happens, as with any fraction, the total value decreases; therefore as the sample size increases the variability of the sample means decreases. You can think of it in terms of accuracy: the larger the random sample, the more accurately the sample mean estimates the population mean. A statistician would say that this indicates that the sample mean is a consistent estimator.

As N increases -> SEM decreases
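
A few lines of R make the shrinkage concrete. This is a minimal sketch: the SD of 15 matches the Normal(100, 15) population used earlier, and the particular sample sizes are arbitrary choices for illustration.

```r
# Illustrative sketch: SEM = SD / sqrt(N) for a fixed SD and growing N.
pop_sd <- 15                          # SD of the Normal(100, 15) population
n      <- c(4, 16, 64, 256, 1024)     # each size is 4 times the previous one
sem    <- pop_sd / sqrt(n)

print(data.frame(n = n, sem = sem))
# Quadrupling N halves the SEM, because sqrt(4 * N) = 2 * sqrt(N)
print(sem[-1] / sem[-length(sem)])    # ratio of each SEM to the previous one
```

Each ratio is 0.5: to halve the SEM you need four times as many observations.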

To learn more about SPSS syntax see the excellent tutorial including datasets and videos at:

4 Using SPSS script

SPSS scripts allow users to create additional dialog boxes and several people have produced scripts which provide dialog boxes for creating random samples. This is probably an easier alternative to learning SPSS syntax.

provides three possible scripts

Right mouse click on the "Generate Random variables EN SBS" link, select the "Save Link as" option to save the script file to your local drive, and change the default extension from txt to sbs.

Back in SPSS:

This allows you to create multiple samples of a specific size. You can also run the script several times to create many samples by un-checking the "Replace the working data file" option.

4.1.1 Alternative script - Distribution.sbs

You will then be presented with a dialog box. Type in the sample size you want, then:

Step 1 - click Next to move on to selecting the distribution.
Step 2 - select the distribution; I selected Normal.
Step 3 - change the mean and SD if you wish.
Once you have created one sample you can create up to 20 different ones, clicking Next each time.
To finish, click the Finish button.
Typical results using the Explore menu option:

Case Processing Summary (variable: value)

group   Valid N   Valid %   Missing N   Missing %   Total N   Total %
1.00    30        100.0%    0           .0%         30        100.0%
2.00    20        100.0%    0           .0%         20        100.0%
3.00    15        100.0%    0           .0%         15        100.0%
4.00    10        100.0%    0           .0%         10        100.0%

5 In R

R is not for the lazy, but it is amazingly versatile. This section is included for completeness.

# this is a comment
# create an empty plot: x axis = 0 to 62, y axis = 50 to 150
# and give the axes labels
plot(c(0,62), c(50,150), type="n", xlab="Sample size", ylab="mean")
# sample sizes 3 to 61 in steps of 2 (= df)
for (df in seq(3,61,2))
{
  # number of samples (= 60) at each size
  for (i in 1:60)
  {
    # create a random sample of size df from a normal distribution
    # and store it in the vector x
    x <- rnorm(df, mean=100, sd=15)
    # plot the mean of this sample against its size
    points(df, mean(x))
  } # end for each group of samples
} # end for each sample size

You can see an animated version of the above at:

This site has a large number of animations, all written in R using the free R animation package. To the casual visitor all the R code is hidden away; they just see the beautiful animations.

With more R knowledge one can create more complex examples. The following is taken from Maindonald & Braun, 3rd ed. 2010, p. 89. It simulates 1,000 samples at each of several sample sizes from a skewed distribution. The code below can be used as the basis for a large number of similar exercises.

############################### from Maindonald & Braun p.89-90
######## CUP 2010
## uses the lattice library
library(lattice)
##############
# function to generate n sample values
sampvals <- function(n) exp(rnorm(n, mean = 0.5, sd = 0.3))
## Means across rows of a dimension nsamp x sampsize matrix of
## sample values gives nsamp means of samples of size sampsize.
samplingDist <- function(sampsize = 3, nsamp = 1000, FUN = mean)
  apply(matrix(sampvals(sampsize * nsamp), ncol = sampsize), 1, FUN)
size <- c(3, 9, 30)
## Simulate means of samples of 3, 9 and 30; place in dataframe
df <- data.frame(y3 = samplingDist(sampsize = size[1]),
                 y9 = samplingDist(sampsize = size[2]),
                 y30 = samplingDist(sampsize = size[3]))
###############
## use strip.custom to customise the strip labelling
doStrip <- strip.custom(strip.names = TRUE,
                        factor.levels = as.expression(size),
                        var.name = " sample size",
                        sep = expression(" = "))
## The argument 'strip=doStrip' is included in the call to densityplot below
###############
## Simulate source population (sampsize = 1)
y <- samplingDist(sampsize = 1)
densityplot(~y3+y9+y30, data = df, outer = TRUE, layout = c(3,1),
            plot.points = FALSE, panel = function(x, ...) {
              panel.densityplot(x, ..., col = "black")
              panel.densityplot(y, col = "gray40", lty = 2, ...)
            }, strip = doStrip)

6 Online presentations and other tools

The New Zealand Census at School website

contains a section on informal inference, called "The eyes have it", which contains animated GIFs that people can use in their presentations, and also an excellent presentation concerning sampling variability and how this can informally relate to hypothesis testing; see:
