Introduction to Statistics
Five – Samples and Variability


Written by: Robin Beaumont

Date last updated 24 February 2010

Version: 1

How this document should be used:
This document has been designed to be suitable for both web based and face-to-face teaching. The text has been made to be as interactive as possible with exercises, Multiple Choice Questions (MCQs) and web based exercises.

If you are using this document as part of a web-based course you are urged to use the online discussion board to discuss the issues raised in this document and share your solutions with other students.

This document is part of a series; see:
http://www.robin-beaumont.co.uk/virtualclassroom/contents.htm

Who this document is aimed at:
This document is aimed at those people who want to learn more about statistics in a practical way. It is the fifth in the series.

I hope you enjoy working through this document.

Robin Beaumont

Acknowledgments

My sincere thanks go to Claire Nickerson for not only proofreading several drafts but also providing additional material and technical advice.

Many of the graphs in this document have been produced using RExcel, a free add-on to Excel that allows communication with R, along with excellent teaching spreadsheets; see http://www.statconn.com/ and Heiberger & Neuwirth 2009.


Contents

1. Before you start

2. Learning Outcomes

3. Introduction

4. Frequencies, Histograms & Probability

4.1 Random Variables

4.2 Probability histogram

4.3 Functions

4.3.1 Importance of Functions in Statistics

5. Probability Density Functions (PDFs)

5.1 Continuous Variables

5.1.1 Explanation

5.2 Definition of PDF

5.2.1 Parameters

6. Reference ranges - Application of the normal curve

7. Basic Differences between Populations and Samples

7.1.1 Estimation and expectation operator

7.2 Degrees of freedom

7.3 Sampling Error

7.4 Sampling Distribution of the Mean

7.5 Standard Error (SEM) of the Mean

7.5.1 Effect of sample size upon SEM

7.6 The Central Limit Theorem

7.7 Standardized Scores - z

7.8 Standard Normal PDF

7.9 Sampling Distributions

7.10 Importance of Sampling Distributions

8. Confidence intervals

8.1 Confidence interval for mean (large samples)

8.2 Confidence interval for mean (small samples)

8.3 Effect of sample size on Confidence interval width

9. Summary

10. References

1.  Before you start

Prerequisites

This document assumes that you have worked through the previous documents: 'Data', 'Finding the centre', 'Graphics' and 'Measuring spread'. You can find copies of these at:

http://www.robin-beaumont.co.uk/virtualclassroom/contents.htm

A list of the specific skills and knowledge you should already possess can be found in the learning outcomes section of the above documents. I would also suggest that you work through the Multiple Choice Questions (MCQs) section of those documents to make sure you do understand the learning outcomes.

You do not require any particular resources, such as computer programs, to work through this document; however, you might find it helpful to have access to a statistical program or even a spreadsheet program such as Excel.

2.  Learning Outcomes

This document aims to provide you with the following information. A separate document provides a set of guided exercises. After you have completed it you should come back to these points, ticking off those with which you feel happy.

Learning outcome / Tick box
Be able to describe the relationship between a frequency histogram and a probability histogram. / q
Be able to discuss the PDF concept / q
Be able to discuss the differences between populations and samples. / q
Be able to discuss the importance of understanding the concept of ‘random’ sampling / q
Explain how standard scores are calculated and what effect this has upon raw scores. / q
Describe the standard normal PDF and how it is used to calculate reference ranges / q
Be able to explain the standard error concept and sampling distributions / q
Be able to describe the concept of bootstrapping / q
Be able to discuss three different methods of estimating the confidence interval (CI) for the mean / q
Be able to discuss how sample size and estimation method influence the CI / q

3.  Introduction

This document begins to lay the ground for understanding how it is possible to make sensible statements about the parent populations from which our samples came.

Unfortunately, to understand how this is possible, something that appears rather simple to the unwary, we need to discuss a variety of topics, which initially may seem to be completely unconnected! I hope that in the end I will help guide you to see the bigger picture.

4.  Frequencies, Histograms & Probability

The link between histograms and probabilities can best be explained by way of diagrams. Consider the frequency histogram opposite, which shows a previous year's assignment results:

4.1  Random Variables

Score range / n / Relative frequency (probability) [n × 0.020833]
40-44 / 11 / 0.229163
45-49 / 10 / 0.208330
50-54 / 8 / 0.166664
55-59 / 3 / 0.062499
60-64 / 3 / 0.062499
65-69 / 7 / 0.145831
70-74 / 3 / 0.062499
75-79 / 2 / 0.041666
80-84 / 0 / 0.000000
85-90 / 1 / 0.020833
total / 48 / 0.999984
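As a check on the arithmetic, the relative-frequency column can be reproduced with a short script (a minimal sketch; the counts are copied from the table above, and 1/48 is used directly rather than the rounded 0.020833):

```python
# Relative frequencies from the assignment-score counts.
counts = {"40-44": 11, "45-49": 10, "50-54": 8, "55-59": 3,
          "60-64": 3, "65-69": 7, "70-74": 3, "75-79": 2,
          "80-84": 0, "85-90": 1}

n = sum(counts.values())                        # 48 students in total
rel_freq = {band: c / n for band, c in counts.items()}

print(rel_freq["40-44"])        # 11/48, roughly 0.2292
print(sum(rel_freq.values()))   # the relative frequencies sum to 1
```

Note the tiny discrepancy with the table: multiplying by the rounded 0.020833 gives totals slightly below 1 (0.999984), whereas dividing by 48 directly does not.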

For any score obtained in the class I can discover how likely it is from the above frequency distribution. In other words, for any student in our sample the above represents all the possible scores they might have gained. A variable with these characteristics is known as a random variable: we do not know its outcome in advance. Note that a random variable is NOT a probability but a value that represents an outcome.

Chebyshev (1821–1894), in the middle of the 19th century, defined a random variable as 'a real variable which can assume different values with different probabilities' (Spanos 1999 p.35).

Example: The possible outcomes for tossing a coin are {head, tail}. These events are not themselves a random variable; allocating the values head = 1 and tail = 2, and thus redefining the events as {1, 2}, creates a random variable whose two values each occur with probability 0.5.

Knowing how students are fascinated by assignment results, I felt they would be interested in the probability of obtaining a particular one. Let us consider how I might go about converting the above histogram into a probability histogram.

From my knowledge of probability, I know:

1.  If I consider all 48 outcomes from the students as being equally probable, each has a probability of approximately 0.02 (= 1/48 = 0.020833).

2.  The total of all the individual probabilities must equal 1

4.2  Probability histogram

I will make the total area of the histogram equal 1. I can achieve this by making each column of unit width (so that a column's area equals its height, e.g. 8 × 1 = 8) and rescaling the y axis from actual counts to multiples of 0.0208. The result is now a probability distribution instead of a frequency distribution, and it must produce the same results as I would have got using the traditional method. To check this, let's consider the eleven students who scored 40-44. As these are mutually exclusive outcomes (a randomly chosen student obtains exactly one of these scores) I can use the additive rule. Therefore the probability of getting a score of 40-44 is:

0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 + 0.0208 = 0.2288

This agrees with the probability distribution value obtained by reading the y axis. Not only can I use this new probability distribution to let students know the probability of obtaining a particular score, but I can also use it to find the probability of obtaining a range of scores. For example, consider the probability of obtaining a score of 55 or more. This is just the area to the right of the 50-54 bar.

Writing it as a probability statement:

p(score ≥ 55) = 0.0624 + 0.0624 + 0.1456 + 0.0624 + 0.0416 + 0.0208

= 0.3952 ≈ 19/48 = 0.3958 (the small difference is due to rounding 1/48 to 0.0208)

"On average a student will obtain a score of 55 or more about 40 times in every hundred."

In the above probability distribution each score range is represented as a probability and the total of all the probabilities is 1. Notice that we can only work out probabilities for the ranges provided; we could not have worked out the probability of a score of exactly 45.75 (in technical terms, the distribution is discrete).
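The range calculation above can be sketched in code, summing the relative frequencies of every band at or above 55 (band labels and counts are copied from the earlier table):

```python
# p(score >= 55): add the relative frequencies of the mutually
# exclusive bands whose lower limit is 55 or more (additive rule).
counts = {"40-44": 11, "45-49": 10, "50-54": 8, "55-59": 3,
          "60-64": 3, "65-69": 7, "70-74": 3, "75-79": 2,
          "80-84": 0, "85-90": 1}
n = sum(counts.values())

p_55_or_more = sum(c for band, c in counts.items()
                   if int(band.split("-")[0]) >= 55) / n
print(p_55_or_more)   # 19/48, roughly 0.3958
```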

4.3  Functions

The way we have used events and assignment scores to generate frequency and probability distributions is really an example of a function. You plug one or more variables with appropriate values into a function and get a result out the other end. We plugged in a range of exam scores and out came the probability of obtaining them.

Here are some examples:

·  The mean = Σx/n

·  Σ is known as the summation operator, i.e. add together all the values x can take.

·  VO2Max

·  myfunction ⇒ f(x) = 1 if x > 5, else 0

The last line shows the standard way of representing a function, f(x). The x indicates that you can only plug one value into it; a function of the form f(x,y) would require two variables. No matter how many values you supply to the function, you will still only receive a single answer.
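The single-valued nature of a function is easy to see in code; here is the f(x) from the bullet list above, written as a minimal sketch:

```python
# The indicator-style function from the bullet list:
# f(x) returns 1 if x > 5 and 0 otherwise.
def f(x):
    return 1 if x > 5 else 0

print(f(7))  # 1
print(f(3))  # 0
```

However many times you call it, each single input always yields exactly one output.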

Question: List some other functions you have come across so far.

Answer: probability = p(A) with 'p' being the function and 'A' being the variable. In this case the variable is of a particular type called a 'random variable'. E.g. p(throwing a die produces a 6) = 1/6

Other functions include; median, range, interquartile range, sample variance, sample standard deviation etc.

4.3.1  Importance of Functions in Statistics

Consider this example. What is the probability of winning a prize if you are one of 20 people and you all stand an equal chance of winning? You could solve this in one of two ways. First, you know that the total probability is 1 and, because you all have an equal chance, your individual chance will be 1/20 = 0.05. Alternatively, there is a function that describes this situation:

f(x) = 1/x, where x is the total number of equally likely outcomes.

I therefore plug in my value: f(20) = 1/20 = 0.05 as before
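The same prize calculation as a one-line function (a trivial sketch of f(x) = 1/x):

```python
# Probability of one of x equally likely outcomes occurring.
def f(x):
    return 1 / x

print(f(20))  # 0.05, as in the prize example
```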

So in some instances we can use mathematical functions to obtain probabilities rather than collect large amounts of data. Some of the above examples were very simple while the exam scores one was a little more complex; the approach taken with the exam scores leads us to a very important special type of function used in statistics, the 'Probability Density Function', which we will now consider, as it is the key to inferential statistics.

5.  Probability Density Functions (PDFs)

Probability Density Functions (PDFs) are a particular type of probability function that allows probabilities to be obtained for continuous variables. But first, a short diversion: what exactly do we mean by 'continuous variable'? The three examples below are designed to get you thinking about this a little more.

5.1  Continuous Variables

Example 1 - Consider the problem of measuring the exact age of a group of students: what does this mean? Do we measure each to the nearest month, day, minute, second or even split second? In the end, is it possible to get a perfectly accurate measure?

Example 2 -The description in the box below provides the second example.

Consider a point half way between two posts, say X1 and X2, and call the point Nirvana. If it exists I should be able to divide the interval in three, discard the outer two thirds and then repeat the process with the new interval. Will I ever reach the point if I repeat this process a certain number of times?

i.e. X1 ------X2

This can be divided up into three equal parts, the middle part divided up into three equal parts again, and so on without end...
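The trisection thought-experiment can be sketched numerically; each step keeps the middle third, so the interval width shrinks by a factor of three but is never zero:

```python
# Trisection sketch: keep the middle third of the interval each time.
lo, hi = 0.0, 1.0
for step in range(10):
    third = (hi - lo) / 3
    lo, hi = lo + third, hi - third   # discard the outer two thirds
    print(step + 1, hi - lo)
```

After ten steps the width is (1/3)^10, about 0.0000169: tiny, yet still a proper interval containing infinitely many points, with the elusive midpoint somewhere inside.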


Example 3 - Consider another situation, the tossing of a coin:

As the number of trials increases:

⇒ Vertical lines get closer - the width of each bar becomes infinitely small.

⇒ Line becomes a continuous curve

⇒ Curve never touches the x axis (an 'asymptote'): every event has a probability, no matter how small

⇒ Area still = 1
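The narrowing of the bars can be sketched with the exact binomial probabilities rather than actual tosses (a minimal illustration, not part of the original example): as the number of tosses n grows, the possible proportions of heads are spaced only 1/n apart, while the total probability stays 1.

```python
# For n tosses of a fair coin, the proportion of heads can take the
# values 0/n, 1/n, ..., n/n, so the gap between adjacent bars is 1/n.
from math import comb

def binomial_dist(n, p=0.5):
    """Probability of each possible number of heads in n tosses."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for n in (4, 16, 64):
    probs = binomial_dist(n)
    # bar spacing 1/n shrinks; the probabilities still sum to 1
    print(n, 1 / n, round(sum(probs), 10))
```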

5.1.1  Explanation

All the above examples have one aspect in common: they all involve continuous variables. These are variables that can take an infinite number of values between any two values you suggest. In all three examples, when searching for a particular 'point' we found an infinite number of them we did not know existed before and, paradoxically, at least in the second one, discovered that the actual one avoids detection!

5.2  Definition of PDF

Returning once again to the assignment scores example, assume now that students could score any mark between 40 and 90, e.g. 47.6666. What practical implications would the problem of the 'lost point' have? Most importantly, we would no longer be able to request specific probabilities from our function, as we would not have a single value to plug in. However, a mathematical device called the 'density function' comes to the rescue. This device can be thought of as providing each of the infinitely many values the random variable can take with a body, or density. We then take the area under the curve between two values to be the probability we are looking for (just as with our earlier exam scores histogram).
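As a sketch of this 'area under the curve' idea, the following approximates the area under one well-known density, the standard normal PDF (discussed later in this document), between -1 and 1 using a simple midpoint rule. The helper names are mine, and the true value is about 0.6827:

```python
# For a continuous variable the probability of any single exact value
# is zero; instead we integrate the density over a range of values.
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def area(a, b, steps=10_000):
    """Midpoint-rule approximation of the area under the curve on [a, b]."""
    width = (b - a) / steps
    return sum(normal_pdf(a + (i + 0.5) * width) for i in range(steps)) * width

print(round(area(-1, 1), 4))   # about 0.6827
```

Notice that as the range shrinks to a single point the area, and hence the probability, shrinks to zero; only ranges of values carry probability.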