Standard Deviation (Measures of Spread)
Statistics Prof. Miller
OK, there are three aspects of every data set that a statistician needs to know, and we have discussed two of them: what is in the middle (measures of central tendency) and what is the shape of the distribution (is it normally distributed?). Two down, one very important one to go. The third aspect of the data set that we study is how spread out the values are. One quick way to distinguish between data sets in this regard is the range, the difference between the highest and lowest values. Like the mode and mean, though, the range can be misleading because one or two extreme values could cause the range to change significantly from what it would have been without those outliers.
So, we need a more robust measure. We use the standard deviation.
The idea used to compute the standard deviation is that it should be the average distance from the mean for all of the observations. This idea is not strictly adhered to when we derive the formula. The reason we stray from this idea has to do with the “big picture” issue of using this statistic, the standard deviation of the sample, as an estimator for the standard deviation for the population.
Let’s develop a formula using the data from the example of “student A” who had test scores 78, 79, 80, 81, 82. But let’s change the context of test scores and imagine that this data represents a sample of children’s weights.
The first step is to find the mean which is 80 (400/5, or just obviously the value in the middle since 78 is 2 below which is balanced by the 82 (2 above), etc.)
Now since the idea is to find the average distance from the mean, subtract the mean from each observation to find how far each is from the mean (deviations from the mean).
Well, since that gives some negative and some positive values, the total comes out to be 0 (again, intuitively obvious as noted above). What should we do? Well, there are two options. One option makes intuitive sense, but that is not what we actually do. What makes sense is to use the absolute value (basically ignore the minus sign for a negative number.) That would give us a distance from the mean and allow us to compute the total of all the distances from the mean. Dividing by n would yield a statistic that is called the mean absolute deviation from the mean. The problem with this is that that statistic does not have the desirable properties for an estimator of the associated parameter, the mean absolute deviation of the population. (I have played around with this- the statistic obtained from the sample would typically underestimate the parameter. A good estimator is equally likely to miss by being too high or by being too low.)
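To make the two observations above concrete, here is a minimal sketch using the five weights from the example. The variable names are illustrative only; the point is that the signed deviations total 0 while the absolute deviations give a usable "average distance."

```python
# Sample from the text: five children's weights (originally test scores).
weights = [78, 79, 80, 81, 82]

mean = sum(weights) / len(weights)

# Signed deviations cancel out, so their total is 0.
deviations = [x - mean for x in weights]
total_deviation = sum(deviations)

# Taking absolute values turns each deviation into a distance.
mean_absolute_deviation = sum(abs(d) for d in deviations) / len(weights)

print(mean, total_deviation, mean_absolute_deviation)
```

For this sample the mean absolute deviation works out to 6/5 = 1.2.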
So, another way to resolve the issue of negative deviations from the mean is to square the deviations, since a negative number multiplied by itself yields a positive product. Doing that leads to an unbiased estimator of the associated parameter (strictly speaking, this holds for the squared quantity, the variance, once we divide by n – 1 as described next). The next step is to add the squares of the deviations. The next logical step would be to divide by n since we are seeking the average distance from the mean, in a sense. But once again, we deviate a bit from the original idea (deviate, get it?). Instead of dividing by n, we divide by n – 1. The reason for this is interesting (big picture again) but not all that important for now. The reason is that we have actually already made a mistake. Our mistake is that we were supposed to find deviations from THE mean. We actually found deviations from the sample mean as opposed to the true population mean. So, all of our deviations are off by a little because we used only an estimate of the mean. To adjust for that, we work with only n – 1 of our data values and leave the last one available to be fudged later to fix that error we made. Finally, we take the square root to counteract the squaring we did, in a sense. (Not precisely true since $\sqrt{a^2 + b^2} \neq a + b$; but it does have the effect of giving the proper units to the answer, since before taking the square root the units would be square pounds.) So, here is a formula for standard deviation for a sample:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
The standard deviation for a population that s estimates is called $\sigma$ (sigma, lower case).
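The n – 1 adjustment can be checked exactly on a tiny case. The sketch below is an illustration under stated assumptions, not part of the original notes: it treats the five weights as the entire population, lists every possible sample of size 2 drawn with replacement, and averages two candidate estimators of the squared spread (the variance). Note that the check is for the variance; the square root is left out on purpose.

```python
from itertools import product
from statistics import pvariance, variance

# Assumption for illustration: these five values ARE the whole population.
population = [78, 79, 80, 81, 82]
population_variance = pvariance(population)  # divides by N: sigma squared

# Every ordered sample of size 2, drawn with replacement (25 samples).
samples = list(product(population, repeat=2))

# Average each candidate estimator over all possible samples.
avg_divide_by_n = sum(pvariance(s) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

print(population_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

Dividing by n averages out to 1, half the true value of 2, while dividing by n – 1 averages out to exactly 2.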
Here is the example worked out:
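$x$ / $x - \bar{x}$ / $(x - \bar{x})^2$
78 / -2 / 4
79 / -1 / 1
80 / 0 / 0
81 / 1 / 1
82 / 2 / 4
Total: 400 / 0 / 10
So, $\bar{x} = 400/5 = 80$ and $s = \sqrt{10/(5 - 1)} = \sqrt{2.5} \approx 1.58$.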
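The same steps can be followed in a few lines of Python. This is only a sketch of the hand calculation, with made-up variable names; the last line compares the result with the standard library's `statistics.stdev`, which also divides by n – 1.

```python
import math
import statistics

weights = [78, 79, 80, 81, 82]
n = len(weights)

mean = sum(weights) / n                                  # 400 / 5 = 80
squared_deviations = [(x - mean) ** 2 for x in weights]  # 4, 1, 0, 1, 4
sum_of_squares = sum(squared_deviations)                 # 10
s = math.sqrt(sum_of_squares / (n - 1))                  # sqrt(10 / 4)

print(round(s, 2))
# Cross-check against the library's sample standard deviation.
print(math.isclose(s, statistics.stdev(weights)))
```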
There is an equivalent form of this formula which is useful when computing the standard deviation by hand. Since this is an online course and you presumably have access to a computer, you probably will not need this formula. However, I will use it occasionally because some of my materials are for seated and online courses. By the way, Excel has a function for computing the standard deviation of a sample: =STDEV(...).
Equivalent form of standard deviation formula:
$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$
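A quick check that a shortcut form agrees with the definition. The sketch assumes the common computational form, in which the sum of squared deviations is replaced by the sum of the squares minus (sum of x)²/n; the variable names are illustrative.

```python
import math
import statistics

weights = [78, 79, 80, 81, 82]
n = len(weights)

sum_x = sum(weights)                          # 400
sum_x_squared = sum(x * x for x in weights)   # 32010

# Shortcut: sum of squared deviations = sum(x^2) - (sum x)^2 / n
sum_of_squares = sum_x_squared - sum_x ** 2 / n
s_shortcut = math.sqrt(sum_of_squares / (n - 1))

print(round(s_shortcut, 2), round(statistics.stdev(weights), 2))
```

Both values print as 1.58, so no deviations from the mean ever need to be computed by hand.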
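Also, for data given in a frequency distribution (ungrouped), the standard deviation is given by:
$$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{n - 1}}$$
or
$$s = \sqrt{\frac{\sum x^2 f - \frac{(\sum x f)^2}{n}}{n - 1}}$$
where $n = \sum f$ is the total number of observations.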
For example,
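$x$ / $f$ / $xf$ / $x - \bar{x}$ / $(x - \bar{x})^2$ / $f(x - \bar{x})^2$
18 / 3 / 54 / -1.75 / 3.0625 / 9.1875
19 / 6 / 114 / -0.75 / 0.5625 / 3.375
20 / 7 / 140 / 0.25 / 0.0625 / 0.4375
21 / 3 / 63 / 1.25 / 1.5625 / 4.6875
24 / 1 / 24 / 4.25 / 18.0625 / 18.0625
Totals: $\sum f = 20$, $\sum xf = 395$, $\sum f(x - \bar{x}) = 0$, $\sum f(x - \bar{x})^2 = 35.75$
So, $\bar{x} = 395/20 = 19.75$ and $s = \sqrt{35.75/(20 - 1)} \approx 1.4$.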
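The frequency-table calculation can be sketched as follows. The dictionary layout (age mapped to frequency) is just one convenient way to hold the table and is not something the course prescribes.

```python
import math

# Ages (x) and how many times each age occurs (f), from the table.
frequency_table = {18: 3, 19: 6, 20: 7, 21: 3, 24: 1}

n = sum(frequency_table.values())                          # 20
mean = sum(x * f for x, f in frequency_table.items()) / n  # 395 / 20

# Each squared deviation is counted f times.
weighted_sum_of_squares = sum(
    f * (x - mean) ** 2 for x, f in frequency_table.items()
)
s = math.sqrt(weighted_sum_of_squares / (n - 1))

print(mean, weighted_sum_of_squares, round(s, 2))
```

This reproduces the totals 19.75 and 35.75, and gives s of about 1.37, which rounds to the 1.4 used in the text.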
Note that 1.4 seems to be reasonable since it represents, in a sense, the average distance of the ages from the mean age. Another way to check for reasonableness is an important relationship between the standard deviation and the range. Recall that an outlier is an observation that lies more than two standard deviations away from the mean. So, excluding these “weirdos”, all of the observations lie within two standard deviations of the mean. That implies that the range is approximately four times the standard deviation (two sd’s on either side of the mean). This works for the age example above: the range is 6 years (from 18 to 24) and the sd is about ¼ of that (1.4 is close to 1.5 years).
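As a small illustration of this reasonableness check, the sketch below expands the age table into a plain list and compares range/4 with the computed standard deviation. The names `rough_sd` and `actual_sd` are made up for the example.

```python
import statistics

# The age data written out one observation at a time (20 values).
ages = [18] * 3 + [19] * 6 + [20] * 7 + [21] * 3 + [24] * 1

data_range = max(ages) - min(ages)   # 24 - 18 = 6
rough_sd = data_range / 4            # quick estimate from the range
actual_sd = statistics.stdev(ages)   # sample standard deviation

print(rough_sd, round(actual_sd, 2))
```

The quick estimate is 1.5 and the computed value is about 1.37, close enough for a sanity check.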
Let’s investigate this idea further.
Percent of observations within 1, 2, and 3 standard deviations from the mean: What percent of the observations might we expect to be within 1 sd of the mean? It is not possible for 100% to be within 1 sd of the mean since our intuitive idea of the sd is the average distance from the mean. It is therefore impossible that all of the observations are less than the average distance from the mean (imagine a town in which every student is above average!!?!!?) Hence, we might expect 50% of the data to be within 1 sd of the mean, but we could not require that since it is only true that ½ of values are less than a median, not a mean- and we did stray quite a bit from the intuitive idea of average distance from the mean when we derived the formula for sd. Now, since we defined an outlier as an observation that is more than two sd’s from the mean, we expect a relatively high percentage (90’s?) within 2 sd’s of the mean- all except the weirdos. More guidelines in a minute but first an example:
The following data are the birth weights (in pounds) of 12 babies born in a hospital:
3.6 4.9 4.9 5.1 6.6 7.2 7.3 7.3 7.5 7.5 8.4 11.7
(a) Find the mean and standard deviation for the given data.
(b) What proportion of the data lies within:
· one standard deviation of the mean
· two standard deviations of the mean
· three standard deviations of the mean?
(a) $\bar{x} = 82.0/12 \approx 6.83$ lb and $s \approx 2.10$ lb
(b) Within 1 sd: $(\bar{x} - s, \bar{x} + s) = (6.83 - 2.10, 6.83 + 2.10) = (4.73, 8.93)$
Proportion of data in this interval is 83% (all of the babies weigh between 4.73 and 8.93 lb except the 3.6 lb and the 11.7 lb babies)
Within 2 sd’s: $(\bar{x} - 2s, \bar{x} + 2s) = (6.83 - 2(2.10), 6.83 + 2(2.10)) = (2.63, 11.03)$
Proportion of data in this interval is 92% (only the 11.7 lb baby, who is now considered an outlier, is outside of the interval within 2 sd’s of the mean)
Within 3 sd’s: $(\bar{x} - 3s, \bar{x} + 3s) = (6.83 - 3(2.10), 6.83 + 3(2.10)) = (0.53, 13.13)$
Proportion of data in this interval is 100%
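The counting in this example can be reproduced with a short script. This is a sketch only; `proportion_within` is a helper name invented here, and the script uses the unrounded mean and sd, so the interval endpoints differ slightly from the hand-rounded ones without changing the counts.

```python
import statistics

birth_weights = [3.6, 4.9, 4.9, 5.1, 6.6, 7.2, 7.3, 7.3, 7.5, 7.5, 8.4, 11.7]

mean = statistics.mean(birth_weights)
s = statistics.stdev(birth_weights)  # sample sd, divides by n - 1


def proportion_within(k):
    """Fraction of observations within k sample sd's of the mean."""
    low, high = mean - k * s, mean + k * s
    inside = [w for w in birth_weights if low <= w <= high]
    return len(inside) / len(birth_weights)


print(round(mean, 2), round(s, 2))
for k in (1, 2, 3):
    print(k, round(proportion_within(k), 3))
```

The proportions come out as 10/12, 11/12, and 12/12, matching the 83%, 92%, and 100% found by hand.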
The guidelines related to these percentages are the percentages of data that lie within 1, 2 and 3 sd’s of the mean for a normal distribution. These percentages are known as the Empirical Rule.
Empirical Rule: For a normal distribution, approximately
· 68% of the data lies within 1 sd of the mean,
· 95% of the data lies within 2 sd’s of the mean,
· 99.7% of the data lies within 3 sd’s of the mean.
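For the curious, the three percentages are rounded values of areas under the normal curve. One way to see this, sketched here with Python's `math.erf` (the function name `normal_fraction_within` is invented for the example), uses the fact that the fraction of a normal distribution within k sd's of the mean equals erf(k/√2).

```python
import math


def normal_fraction_within(k):
    """Fraction of a normal distribution within k sd's of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * normal_fraction_within(k), 2))
```

The printed values are roughly 68.27, 95.45, and 99.73, which the Empirical Rule rounds to 68%, 95%, and 99.7%.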
The Empirical Rule can be used as a test for normality. For example, for the babies’ weights above, we found that 83% of the babies’ weights were within 1 sd of the mean, 92% were within 2 sd’s of the mean, and 100% were within 3 sd’s of the mean. These percentages are “relatively close” to those for a normal distribution, so this data is approximately normal.
Another use of the Empirical Rule is to make specific statements about populations that are normally distributed. (lots of this on Test 2, it won’t be on Test 1)
Batteries’ lives are normally distributed with a mean of 13 hours and a standard deviation of 0.3 hours. What percent of batteries last:
a) More than 13 hours
Answer is 50% (since 13 is the median; 13 is given as the mean, but for a normal distribution mean = median)
b) Exactly 13 hours
Answer: 0%. (More on this later, but it is reassuring, since it means that in the last problem it would not matter whether the question was “more than 13” or “13 or more”.)
c) Between 12.4 and 13.6 hours
Answer is 95% by the Empirical Rule (12.4 is 2 sd’s below 13 and 13.6 is 2 sd’s above the mean)
d) More than 13.3 hours
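Answer is 16% ($(100\% - 68\%)/2 = 16\%$, since 13.3 is 1 sd above the mean and the 32% of batteries more than 1 sd from the mean is split equally between the two tails)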
e) Between 12.7 and 13.6 hours (Answer is 81.5%- see if you can determine why)
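Answers of this kind can be checked with `statistics.NormalDist` from Python's standard library (version 3.8 or later). The sketch below covers parts (a), (c), and (d); `percent_between` is a helper invented for the example. Expect small differences from the answers above, because the Empirical Rule uses rounded percentages.

```python
from statistics import NormalDist

# Battery lives: normal with mean 13 hours and sd 0.3 hours.
battery_life = NormalDist(mu=13, sigma=0.3)


def percent_between(low, high):
    """Percent of batteries lasting between low and high hours."""
    return 100 * (battery_life.cdf(high) - battery_life.cdf(low))


more_than_13 = 100 * (1 - battery_life.cdf(13))       # part (a)
between_12_4_and_13_6 = percent_between(12.4, 13.6)   # part (c)
more_than_13_3 = 100 * (1 - battery_life.cdf(13.3))   # part (d)

print(round(more_than_13, 1))
print(round(between_12_4_and_13_6, 1))
print(round(more_than_13_3, 1))
```

The exact values are about 50.0, 95.4, and 15.9, in line with the 50%, 95%, and 16% from the Empirical Rule.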
Again, this type of problem is not on Test 1, but is of paramount importance later so it is worthwhile taking a look already. The other reason for discussing it now is how it relates to our next topic . . .
How does one know whether to use the 68%, the 95%, or the 99.7%? It depends on how many sd’s the given value is from the mean. That number has a name. It is called the z-score or standard score. The name standard score comes from the use of this idea to re-scale any normal distribution to a standard scale so that all normal distributions can be studied using the same “yard stick.” So, the analysis and use of the Empirical Rule depended on the number of sd’s away from the mean for a certain given value or values. That is the next topic, called z-score.
A Z-score indicates how many sd’s an observation is away from the mean and in which direction.
Example: Suppose that babies’ weights are normally distributed with a mean of 6.83 lb (mean of population, $\mu$) and a sd of 2.1 lb (sd of population, $\sigma$). What is the z-score of the 11.7 lb baby?
Well, we have an idea from our analysis of the percent within 2 sd’s from the mean. The 11.7 lb baby was not in that interval, so we know its z-score is a bit greater than 2 (but less than 3).
More precisely, the definition requires that we subtract the mean from the observation (how far from the mean . .) and then divide by the sd (how many sd’s from the mean). So, here it is: The Most Important Formula in this Course:
$$z = \frac{x - \mu}{\sigma}$$
For this example,
$$z = \frac{11.7 - 6.83}{2.1} \approx 2.32$$
The 11.7 lb baby is 2.32 sd’s above the mean (and is an outlier.)
For the 3.6 lb baby,
$$z = \frac{3.6 - 6.83}{2.1} \approx -1.54$$
The 3.6 lb baby is 1.54 sd’s below the mean (it is the lowest weight, but is not an outlier.)
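Both z-scores can be verified with a one-line function. The function name `z_score` and its argument names are illustrative; the arithmetic is just "subtract the mean, divide by the sd."

```python
def z_score(x, mu, sigma):
    """Number of sd's that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


z_heaviest = z_score(11.7, mu=6.83, sigma=2.1)
z_lightest = z_score(3.6, mu=6.83, sigma=2.1)

print(round(z_heaviest, 2), round(z_lightest, 2))
```

The sign carries the direction: positive for the heaviest baby, negative for the lightest.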
Notice that z-score is a bit different from the statistics and graphs discussed up to this point. A z-score is a characteristic of an observation, not a characteristic of a data set like mean, standard deviation, and shape of the distribution.
Here is one last example, a problem that I solved using z-scores:
Mill College was in trouble with the State Education Department because State Ed did not approve of the exam being used for admissions. State Ed mandated that a new test be administered instead. Mill College asked that I determine a cut-off score on the new approved exam that would admit a similar proportion of students. The old test had a mean of 120 and a sd of 17. The college had used a cut-off score of 89 on the old test. I administered the new test to a large group of students and found a mean of 79 with a sd of 7. (Both test score distributions had the same shape.) Obviously the two tests were scored on different scales. Hence, I used z-scores to compare the two tests on a common, standard scale.
Solution: First find the z-score for the cut-off score on the old test:
$$z = \frac{x - \mu}{\sigma} = \frac{89 - 120}{17} \approx -1.82$$
Now find the cut-off score on the new test, given its parameters, that will have that same z-score. There are two approaches. Either use “the most important formula in this course”, plugging in z and solving for x, or do some algebra on the formula to derive a new form of it for use in cases like this. Solving for x yields $x = \mu + z\sigma$. Using this new form of the formula for this example,
$$x = 79 + (-1.82)(7) \approx 66$$
so the cut-off score on the new test is about 66.
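The whole conversion can be sketched in Python as two small functions, one for each direction of the formula. The function names are made up for illustration, and the sketch relies on the stated assumption that both score distributions have the same shape.

```python
def z_score(x, mu, sigma):
    """Convert a raw score to a z-score."""
    return (x - mu) / sigma


def score_from_z(z, mu, sigma):
    """Convert a z-score back to a raw score: x = mu + z * sigma."""
    return mu + z * sigma


# Old test: mean 120, sd 17, cut-off 89.
old_cutoff_z = z_score(89, mu=120, sigma=17)

# New test: mean 79, sd 7. Find the score with the same z-score.
new_cutoff = score_from_z(old_cutoff_z, mu=79, sigma=7)

print(round(old_cutoff_z, 2), round(new_cutoff, 1))
```

Carrying the unrounded z-score through gives a new cut-off of about 66.2, which agrees with the hand calculation once rounded to a whole score.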