Chapter 5: Introduction to Statistical Inference
1. Sampling Distributions
As you have learned, it is close to impossible to test an entire population with regard to whatever theory or hypothesis you’ve concocted; it would be far too time-intensive, and it would be hard to confirm that you had found every person in the population you are studying.
So instead we make inferences about populations by using data from a sample of subjects. Using a subgroup of the population, we can find point estimates (the sample mean and standard deviation), which may or may not be exactly right. To allow for the probable error between the actual population mean and the sample mean, we create a buffer on each side of the point estimate so that we can be reasonably confident the population mean is contained somewhere in this interval estimate (aka confidence interval).
You can also think of this visually as a bull’s-eye. The point estimate is the black center point. As you edge away from that center point, there is still some chance that the actual population parameters fall within each band, but the chances get slimmer and slimmer as you become a worse shot. As a researcher, you are hoping to get results as close to the actual population parameters as possible with as few data points as you can get away with (because collecting data can be expensive!).
To make inferences about populations based on the samples we collect, we turn to sampling distributions. In layman’s terms, the sampling distribution of the mean is created when you find the mean for an infinite number of samples of the same size and create a distribution with those values. Luckily, the statistics gods decreed that the sampling distribution of the mean will follow the ND rather closely even if the actual population is only somewhat close to being normally distributed. Additionally, as the size of the samples increases, the sampling distribution of the mean resembles the ND ever more closely, even when the population being sampled does not (this result is known as the Central Limit Theorem).
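If you would like to watch the Central Limit Theorem in action, the short Python sketch below draws thousands of samples from a decidedly non-normal population and summarizes their means. (The exponential population, the sample sizes, the random seed, and the use of numpy/scipy are purely illustrative assumptions; they are not part of the example that follows.)

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=1)

# A skewed, non-normal "population" (exponential distribution, mean = 75)
population = rng.exponential(scale=75, size=100_000)

for n in (3, 25, 100):
    # Means of 10,000 random samples of size n
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: mean of sample means = {sample_means.mean():6.2f}, "
          f"SD of sample means = {sample_means.std():5.2f}, "
          f"skew = {skew(sample_means):.2f}")
```

The sample means stay centered near the population mean of 75, but as n grows their spread shrinks and their skew heads toward 0; in other words, the distribution of means looks more and more normal even though the population itself is badly skewed.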
As an example, imagine you are deciding whether you should buy a ticket to the Kings of Leon concert from some guy on StubHub. Before you pay $75 for the ticket, you want to figure out the average price that people have paid to go to the concert. So you start talking to your friends, reading some online forums, going to Ticketmaster, etc., to find out how much other people are paying. You find the following ticket prices that were paid: 70, 45, 90, 100, 50, 100, 110, 90, 60, 50, 45, 50, 50, 55, 35, 70, 90, 120, 100, 80, 85, 65, 100, 140, 45. Now, if you found the mean for samples of three tickets, the means of those samples taken from this larger sample could range from 123.33 (the average of 120, 110, and 140) down to only 41.67 (the average of 45, 45, and 35), because you could happen to randomly choose all high-priced or all low-priced tickets when sampling, which obviously wouldn’t give you a very accurate picture. However, the mean of all 25 ticket prices is 75.8. If you had the time to waste surfing the Internet looking for another 25 ticket prices, there is a much better chance that this new mean would fall somewhere close to 75.8 rather than near either 41.67 or 123.33. In essence, it would be much harder to find 25 rather than 3 ticket prices (especially at random) that all fall so close to the high or the low end of the range. So now you can see why larger samples give you a more accurate picture of the population when creating the sampling distribution. Plus, now you can feel perfectly fine buying that $75 ticket to see your favorite band!
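Here is a small Python sketch of that resampling logic. (The list of prices comes straight from the example; the seed, the number of resamples, and the use of numpy are illustrative assumptions.)

```python
import numpy as np

rng = np.random.default_rng(seed=7)

prices = [70, 45, 90, 100, 50, 100, 110, 90, 60, 50, 45, 50, 50,
          55, 35, 70, 90, 120, 100, 80, 85, 65, 100, 140, 45]

print("Mean of all 25 prices:", np.mean(prices))                # 75.8

# Means of many random samples of 3 tickets, drawn without replacement
means_of_3 = [np.mean(rng.choice(prices, size=3, replace=False))
              for _ in range(10_000)]
print("Lowest sample mean found: ", round(min(means_of_3), 2))  # near 41.67
print("Highest sample mean found:", round(max(means_of_3), 2))  # near 123.33
```

Means based on only three tickets can land anywhere from the low 40s to the low 120s, while the mean of all 25 prices sits right at 75.8, which is the whole point of preferring larger samples.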
Now you try these examples:
1. Calculate the point estimate (i.e., mean) for the number of students in the current statistics classes at NYU: 32, 35, 40, 31, 29
2. Calculate the point estimate (i.e., mean) for the number of students in the current economics classes at Columbia University: 99, 65, 74, 83, 99, 100, 80, 65, 63, 89, 90, 102, 110, 79, 65, 120
3. Which of these examples would give you a more accurate point estimate for the actual population for each type of course? Why?
2. The Standard Error of the Mean
It is important to keep in mind that, as the samples get larger, the standard deviation of the sampling distribution of the mean decreases. As noted in the example from the previous section, as the samples get larger, there is less possibility for great variability from one sample to the next. Sure, it would be inspiring if you found 25 people who paid an average of $42 for a KOL ticket, but that’s highly unlikely if the true mean is around $75. Keep in mind that it’s the extremes (aka outliers) that inflate a standard deviation, so diluting the extremes (which is what happens when you average them together with a bunch of other randomly chosen scores) gives you less variability between samples.
Of course, there will still be some difference between the sample mean and the population mean, otherwise known as sampling error. To figure out the standard deviation of the sampling distribution we have been describing, which is called the standard error of the mean (SEM), you simply need to use the following formula, where σ is the standard deviation of the population and N is the size of each sample:

σ_M = σ / √N
(Note: we used M as the subscript for the SEM in the preceding formula, but you will more often see X̄ (X-bar) used as the symbol for the mean in textbooks.) In the ticket-price example, if the standard deviation of the actual population is $12, then the standard error of the mean for samples of 25 each would equal: 12/(√25) = 2.4. This smaller value (smaller, that is, than the population standard deviation of $12) reflects the idea that sample means will be closer to one another than individual ticket prices drawn randomly from the population. Also, the smaller the SEM, the better estimate you have of the population mean. In essence, you’re getting closer and closer to the bull’s-eye!
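As a quick check on that arithmetic, here is a minimal Python sketch (the function name is made up for illustration; the $12 population SD and n = 25 come from the example):

```python
import math

def standard_error_of_mean(population_sd: float, n: int) -> float:
    """SEM: population standard deviation divided by the square root of n."""
    return population_sd / math.sqrt(n)

# Ticket-price example: population SD of $12, samples of 25 tickets each
print(standard_error_of_mean(12, 25))   # 2.4
```

Notice that quadrupling the sample size to 100 would cut the SEM in half (12/√100 = 1.2), which is why bigger samples home in on the bull’s-eye.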
Now try these examples:
4. If the population standard deviation corresponding to the following set of numbers is 6.5, calculate the SEM for: 32, 35, 40, 31, 29. How does this compare to the standard deviation of the five numbers?
5. If the population standard deviation corresponding to the following set of numbers is 26.2, calculate the SEM for: 99, 65, 74, 83, 99, 100, 80, 65, 63, 89, 90, 102, 110, 79, 65, 120. How does this compare to the standard deviation of just these 16 numbers?
6. In both of these examples, why is it important to note the difference between the samples’ standard deviations and the corresponding SEMs? Explain why the sample size plays such an important role in the SEM formula.
3. The z-Score for Sample Means
Not only can z-scores be applied to individual scores, they can be used with sample means as well, and they are useful for comparing a sample mean to a hypothesized population mean. This time, you’ll use the following formula (which should look VERY familiar to you):

z = (M − μ) / σ_M
It’s basically the same formula as the one you used before; however, you will be using the standard error of the mean, a hypothesized population mean, and the sample mean you calculated. Continuing with the Kings of Leon ticket example, if you assume that $75 is the hypothesized population mean (we’re assuming this based on its being the supposed face value of the ticket), you can plug in the rest of the values to determine the z-score for the sample mean you collected: z = (75.8 – 75)/2.4 = .33. Although we can’t make a definite decision regarding how significant this result actually is without further information (which will be explained shortly), since the z-score is a mere .33, you can be somewhat confident that what you collected wasn’t that far off from the actual population mean. So once again, you can feel pretty good about purchasing your ticket to see your favorite band!
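Plugging those numbers in by hand is easy enough, but here is a small Python sketch that does it (the function name is invented for illustration; the values M = 75.8, μ = 75, σ = 12, and n = 25 all come from the running example):

```python
import math

def z_for_sample_mean(sample_mean: float, pop_mean: float,
                      pop_sd: float, n: int) -> float:
    """z = (sample mean - hypothesized population mean) / SEM."""
    sem = pop_sd / math.sqrt(n)              # 12 / sqrt(25) = 2.4
    return (sample_mean - pop_mean) / sem

# Kings of Leon example
print(round(z_for_sample_mean(75.8, 75, 12, 25), 2))   # 0.33
```

The same function can be reused for the class-size exercises below, once you have computed each sample mean.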
Now it is time for you to compute z-scores based on the data from the preceding class-size examples:
7. You want to find out if this year’s mean class size is similar to the average number of students that have been in statistics classes at NYU in the past. If the population mean for the past courses offered is 45.4 and the σ = 6.5, calculate the z-score for the average class size for the current year (32, 35, 40, 31, 29). What does this z-score tell us? Does this sample seem representative of the population?
8. You want to find out if this year’s mean class size is similar to the average number of students that have been in economics classes at Columbia in the past. If the population mean for the past courses offered is 88.0 and the σ = 26.2, calculate the z-score for the average class size for the current year (99, 65, 74, 83, 99, 100, 80, 65, 63, 89, 90, 102, 110, 79, 65, 120). What does this z-score tell us? Does this sample seem representative of the population?
9. By simply eyeballing the numbers in each example, were you surprised by the differing results of the two tests? How can you explain why the possibly counterintuitive answers occurred?
4. Null Hypothesis Testing and Statistical Decision Making
Null hypothesis testing (NHT) is the standard in the behavioral sciences when you’re trying to determine whether something interesting has occurred in your experiment. In essence, you’re trying to show that what you found with your data would not be easily obtained by chance if your experimental manipulation did not work at all. You often want your results to be unique, different, special, exceptional, unusual, anomalous, unparalleled; just think of all those adjectives you learned in health class when trying to boost your self-esteem. Not only should you feel those things, but you want your data to be all of those things, too! And to determine just how significant (think “not easily found by chance”) your findings are, you use NHT to, hopefully, reject the idea that nothing (null) is going on.
Thus, NHT is a way to determine what constitutes a significant (i.e., relatively reliable) finding, as opposed to something that could happen easily as a result of sampling error. (Think back to that possible sample of three tickets with an average price of about $42: not something that ordinary sampling produces easily, but an oddly skewed sample could occasionally produce it.) So, to control the possibility of reporting such sampling accidents as significant findings (mistakes that are called Type I errors), we have to set a criterion of significance (α, the Greek letter alpha) that determines what proportion of experiments in which nothing is actually going on will nevertheless be declared significant. In other words, how many oopsies are we willing to let chance sneak through?
Typically, the proportion of “null” experiments allowed to come out significant (i.e., the Type I error rate) is controlled by setting alpha equal to .05. Basically, this means that an average of 5% of the true null hypotheses that get tested will fool us (by appearing significant when, in actuality, there is no effect in the population). For a one-tailed test, the entire 5% sits in just one tail. However, it is much more likely that you will be performing a two-tailed test, which means that you will end up placing 2.5% in each tail. These percentages translate visually into areas under the curve that represent all the results that can occur when the null hypothesis is true, as you can see in the next figure:
You’ll notice that z-scores of ±1.96 each cut off an area (p value) of .025 in one tail, which gives you guidance about what you typically need to attain to find statistical significance. As you will see in later chapters, several factors play a role in determining the critical value for a significance test, but the z-score of 1.96 is a particularly important value that you should keep in mind. Keep repeating to yourself: 1.96, 1.96, 1.96 (trust us, it’s a keeper!). And in layman’s terms, this means that a sample mean roughly two standard errors above or below the hypothesized population mean may very well have some significance that deserves further inspection.
Keep in mind, however, that sometimes researchers are looking for a stricter criterion, and they will opt to use the guideline of α = .01, which requires critical z-scores of ±2.58 for a two-tailed test, with .005 in each tail.
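You don’t have to memorize those critical values; the quantile function of the standard normal distribution will reproduce them. A minimal sketch (assuming scipy is available, which the chapter itself does not require) is shown here:

```python
from scipy.stats import norm

# Two-tailed critical z-scores: split alpha evenly between the two tails
for alpha in (0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: critical z = +/-{z_crit:.2f}")
# alpha = 0.05 -> +/-1.96, alpha = 0.01 -> +/-2.58
```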
One more point to solidify in your mind is the reason that we set an alpha level at all. It may seem silly to allow any “nulls” to slip through and produce significant results, but the reason for accepting that possibility is so we don’t too often retain a null hypothesis when, in fact, it should be rejected (which would result in a Type II error). It’s all about trade-offs! We do not want to wind up declaring that an experiment isn’t significant when it involves a fantastic new drug that fights cancer, simply because we set the bar too high (i.e., made the alpha level too small); the smaller the alpha we use, the fewer Type I errors get committed, but the more Type II errors get made. On the other hand, we don’t want to lead other researchers into wasting tons of money on an acne medication that doesn’t actually work but slipped through the cracks because the alpha level (Type I error rate) was too lenient. Just keep reminding yourself that we need to strike a balance between the rates of Type I and Type II errors, so that your results are most helpful to the scientific community.