
Introduction to Statistical Inference

Dr. Tom Pierce

Department of Psychology

Radford University

Researchers in the behavioral sciences make decisions all the time. Is cognitive-behavioral therapy an effective approach to the treatment of traumatic stress? What type of evaluation system results in the highest levels of employee productivity? Should I eat these chips on the desk in front of me? And almost all of these decisions are based on data (not the chips thing). The problem for researchers is that there’s almost never a way to know for sure that they’ve made the right choice. No matter what conclusion the researcher comes to, they could be wrong. And while that doesn’t sound like a great position to be in, statisticians can tell us something for sure about the decision we’ve made. They can tell us the odds that we’re wrong. This may still not seem very comforting, but, if you think about it, if you knew that the odds of making a mistake were one in a thousand, you’d probably be okay with that. You’d be confident that your decision was correct, even if you couldn’t know for sure. That’s what statistical inference is like. In every situation covered in this book, no matter how complex the design, we’ll always know two things: just how confident we need to be in order to adopt a particular conclusion and how confident we can be of this conclusion. These may be decisions based only on a set of odds, but at least we’ll always know for sure what those odds are.

What we’re going to do next is describe a situation where a researcher has to make a decision based only on some odds. The situation is relatively simple, requiring a decision about a single raw score and using a statistic you’re already familiar with (a Z-score). However, this example presents every major concept in statistical decision making. In this way, we can show you the steps and reasoning involved in a test of statistical inference without having to deal with any real math at all. Later on, when we get to data from other designs, we’ll be able to apply an already familiar strategy to these new situations. So, if you’re okay with how the tests work in this chapter, you’ll be okay with how statistical inference works in every chapter to follow.

The case of the wayward raw score

One variable we use in a lot of our studies is reaction time. Let’s say that 20 older adults do a two-choice reaction time task where the participants are instructed to press one button if a stimulus on a computer screen is a digit and another button if the stimulus is a letter. The task has 400 trials. From this set of older adults we’re going to have 400 trials from each of 20 participants for a total of 8000 reaction times. Now, let’s say, for the sake of argument, that this collection of 8000 reaction times is normally distributed. The mean reaction time in the set is .6 seconds and the standard deviation is .1 seconds. A graph of this hypothetical distribution is presented in Figure 3.1.


Figure 3.1
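If you’d like to see numbers like these for yourself, here is a minimal sketch in Python (our choice of tool, not something from the original example) that simulates a set of 8,000 reaction times under the same assumptions: a normal distribution with a mean of .6 seconds and a standard deviation of .1 seconds.

    import numpy as np

    # Simulate the hypothetical distribution described above:
    # 20 participants x 400 trials = 8,000 reaction times,
    # assumed to be normally distributed with mean 0.6 s and SD 0.1 s.
    rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is reproducible
    reaction_times = rng.normal(loc=0.6, scale=0.1, size=8000)

    print(round(reaction_times.mean(), 2))  # close to 0.6 seconds
    print(round(reaction_times.std(), 2))   # close to 0.1 seconds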

A problem we run into is that the reaction times for three or four trials out of the 8000 are up around 1.6 seconds. The question we need to answer is whether to leave these reaction times in the data set or to throw them out. They’re obviously scores that are very different from the others, so maybe we’re justified in throwing them out. However, data is data. Maybe this is just the best the participants could do on these particular trials; so, to be fair, maybe we should leave them in.

One thing to keep in mind is that the instructions we gave people were to press the button on each trial as fast as they could while making as few errors as they could. This means that when we get the data, we only want to include the reaction times for trials when this is what was happening – when people were doing the best they could – when nothing went wrong that prevented them from doing their best. So now, we’ve got a reaction time out there at 1.6 seconds and we have to decide between two options, which are:

  1. The reaction time of 1.6 seconds belongs in the data set because this is a trial where nothing went wrong. It’s a reaction time where the person was doing the task the way we assumed they were. Option 1 is to keep the RT of 1.6 seconds in the data set. What we’re really saying is that the reaction time in question really is a member of the collection of 8000 other reaction times that makes up the normal curve.

Alternatively…

  2. The reaction time does not belong in the data set because this was a trial where the participant wasn’t doing the task the way we assumed that they were. Option 2 is to throw it out. What we’re saying here is that the RT of 1.6 seconds does NOT belong with the other RTs in the set. This means that the RT of 1.6 seconds must belong to some other set of RTs – a set of RTs where something went wrong, causing the mean of that set of reaction times to be higher than .6 seconds.

In statistical jargon, Option 1 is called the null hypothesis and says that our one event only differs from the mean of the other events by chance. If the null hypothesis in this case is really true, it means there was no reason or cause for the reaction time on this trial to be this slow; it just happened by accident. The symbol “H₀” is often used to represent the null hypothesis.

In general, the null hypothesis of a test says that we got the results we did just by chance. Nothing made it happen, it was just an accident.

In statistical jargon, the name for Option 2 is the alternative hypothesis and says that our event didn’t just differ from the mean of the other events by chance or by accident – it happened for a reason. Something caused that reaction time to be a lot slower than the other ones. We may not know exactly what that reason is, but we can be pretty confident that SOMETHING happened to give us a really slow reaction time on that particular trial. The alternative hypothesis is often symbolized as “H₁”.

Of course, there’s no way for the null hypothesis and the alternative hypothesis to both be true at the same time. We have to pick one or the other. But there’s no information available that can tell us for sure which option is correct. Again, this is something we’ve just got to learn to live with. Psychological research is never able to prove anything, or to settle for certain whether an idea is true or not. We never get to know for sure whether the null hypothesis is true, so there’s nothing in the data that can prove whether a RT of 1.6 seconds really belongs in our data set or not. It’s always possible that someone could have a reaction time of 1.6 seconds just by accident. There’s no way of telling for sure what the right answer is. So we’re just going to have to do the best we can with what we’ve got. We have to accept the fact that whichever option we pick, we could be wrong.

The choice between Options 1 and 2 basically comes down to whether we’re willing to believe we could have gotten a reaction time of 1.6 seconds just by chance. If the RT was obtained just by chance, then it belongs with the rest of the RTs in the set (and we should keep it). If there’s any reason other than chance for how we could have ended up with a reaction time that slow – if there was something going on besides the conditions we had in mind for the experiment – then the RT wasn’t obtained under the same conditions as the other RTs (and we should throw it out).

So what do we have to go on in deciding between the two options? Well, the scores in the data set are normally distributed, and we already know something about the normal curve. We can use it to tell us exactly what the odds are of getting a reaction time that’s this much slower than the mean reaction time of .6 seconds.

For starters, if you convert the RT of 1.6 seconds to a standard score, what do you get? If we convert the original raw score (a value of X) to a standard score (a value of Z), we get:

Zx = (X - Mean) / SD = (1.6 - .6) / .1 = 10.0

The reaction time we’re making our decision about is 10.0 standard deviations above the mean. That seems like a lot! The symbol Zx can be read as “the standard score for a raw score (X)”.
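If you want to check that arithmetic with a few lines of Python (a sketch of ours, using only the values from the example), it looks like this:

    # Standard score for the wayward reaction time, using the values from the text:
    # mean = 0.6 s, standard deviation = 0.1 s, raw score X = 1.6 s.
    mean_rt = 0.6
    sd_rt = 0.1
    x = 1.6

    z_x = (x - mean_rt) / sd_rt
    print(z_x)  # 10.0, i.e., ten standard deviations above the mean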

So what does this tell us about the odds of getting a reaction time that far away from the mean just by chance? Well, right away you know that roughly 95% of all the reaction times in the set will fall between standard scores of –2 and +2 and that over 99% will fall between standard scores of –3 and +3. So automatically, we know that the odds of getting a reaction time with a standard score of +3 or higher must be less than 1% – and our reaction time is ten standard deviations above the mean! If the normal curve table went out far enough, it would show us that the odds of getting a reaction time with a standard score of 10.0 are far less than one in a million! Our knowledge of the normal curve, combined with the knowledge of where our raw score falls on that curve, gives us something solid to go on when making our decision. We now know that the odds are far less than one in a million that our reaction time belongs in the data set.
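The text reads these odds off the normal curve table, but the same sort of numbers can be pulled from SciPy’s standard normal distribution. This is a hedged sketch for the curious; the library and function calls are our choice for illustration.

    from scipy.stats import norm

    # Proportion of scores more than 3 standard deviations from the mean (both tails):
    print(2 * norm.sf(3))   # about 0.0027, i.e., less than 1%

    # Proportion of scores 10 or more standard deviations above the mean:
    print(norm.sf(10))      # about 7.6e-24, vastly rarer than one in a million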

That brings us to the question of what the odds would have to be like to make us believe that a score didn’t belong in the data set. An alpha level tells us just how unlikely a null hypothesis would have to be before we just can’t believe it anymore. In this example, the alpha level would tell us how unlikely a reaction time would have to be before we just can’t believe it could belong in the set. For example, an investigator might decide they’re not willing to believe that a reaction time really belongs in the set if they find that the odds of this happening are less than 5%. If the odds of getting a particular reaction time turn out to be less than 5%, then it’s different enough from the mean for the investigator to bet they didn’t get that reaction time just by chance. It’s different enough for them to bet that this reaction time must have been obtained when the null hypothesis was false.

Odds are often expressed as numbers between zero and 1.0. This means that if we want to tell someone we’re using odds of 5% for our alpha level, we can simply write the expression “α = .05”, which can be translated as “reject the null hypothesis if the odds are less than or equal to 5% that it’s true”. This tells the researcher about the conditions that have to be met in order to choose one option (the alternative hypothesis) over another (the null hypothesis). Therefore this expression is an example of a decision rule. A decision rule is an “if-then” statement that simply says “if such-and-such happens, then do this”.

So we now have a decision rule for knowing when to reject the null hypothesis: reject the null hypothesis if the odds are less than 5% that it’s true. But having a decision rule expressed in terms of a percentage doesn’t help us much when the number we’re making our decision about isn’t a percentage, but a single raw score. We need something more concrete. We need to know what a raw score has to look like to know that the odds are less than 5% that it belongs in the set. In other words, we need to know how far away from the mean we need to go to hit the start of the 5% of scores that are furthest away from the mean – the 5% of scores that are least likely to belong there.

Figure 3.2 shows a graph of the normal curve. Using an alpha level of .05, we’re saying we want to keep the 95% of reaction times that are closest to the mean (i.e., we fail to reject the null hypothesis) and get rid of the 5% of reaction times that are furthest away from the mean (i.e., we reject the null hypothesis). In other words, we want to get rid of the least likely 2.5% on the right-hand side of the curve and the least likely 2.5% on the left-hand side of the curve.

Figure 3.2

Now we need to identify the two places on the scale that correspond to the starting points for the most extreme 2.5% of scores on the high side of the curve and the extreme 2.5% on the low side of the curve. These are places where we’re willing to change our mind about tossing out a reaction time; any reaction time this far away from the mean or further is a reaction time we’re willing to toss out. In general, a place on the scale where you change your mind about a decision is known as a critical value. Fortunately, the normal curve gives us a way of translating a value expressed as a percentage into a value expressed as a standard score. Specifically, the Normal Curve Table in Appendix X tells us that we have to go 1.96 standard deviations away from the center of the curve to hit the start of the outer 5%.
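The Normal Curve Table is where the 1.96 comes from, but the same value can be recovered in code. Here is a small sketch (again using SciPy, our choice of tool) that asks for the point cutting off the outer 2.5% in each tail:

    from scipy.stats import norm

    alpha = 0.05
    critical_value = norm.ppf(1 - alpha / 2)  # cut off 2.5% in each tail
    print(round(critical_value, 2))           # 1.96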

So, we can now say that if the standard score for a reaction time is at or above a positive 1.96, or if it’s at or below a negative 1.96, it falls in the 5% of the curve where we’re willing to reject the null hypothesis. This is our decision rule. It states the conditions that have to be met to say a reaction time doesn’t belong in the set. In shorthand form, the decision rule now becomes:

If Zx ≥ +1.96 or if Zx ≤ -1.96, reject H₀.
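Written out as an “if-then” statement in code, the decision rule might look like the following sketch (the function name and structure are ours, for illustration only):

    def decide(z_x, critical_value=1.96):
        """Apply the two-tailed decision rule to a standard score."""
        if z_x >= critical_value or z_x <= -critical_value:
            return "reject the null hypothesis"
        return "fail to reject the null hypothesis"

    print(decide(10.0))  # reject the null hypothesis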

We already know the reaction time in question is 10.0 standard deviations above the mean, so it fits one of the two conditions for rejecting the null hypothesis. Our decision is therefore “reject the null hypothesis”. A decision is always a statement about the null hypothesis and will always be either “reject the null hypothesis” or “fail to reject the null hypothesis”. A conclusion, however, is what the researcher has learned from making the decision. In this example, because the decision is to reject the null hypothesis, the researcher can draw the conclusion that the reaction time does not belong with the other reaction times in the data set and should be thrown out.

The important thing to note about this example is that it boils down to a situation where one event (a raw score in this case) is being compared to a bunch of other events to see if it belongs with them. If you’re okay with this and with how the decision got made in this example, you’re going to be okay with every test we talk about in the rest of the book. That’s because all of those different tests are going to work the same way – they all use the same strategy. It’s always going to come down to seeing if one number belongs with a bunch of other examples of the same kind of number. It won’t really matter if the letter we use to label that number is a capital “X” for a raw score (like here), a value for “t” in a t-test, or a value for “F” in an F-test. What we do with those numbers will always be the same. It’ll always come down to one number compared to a bunch of other numbers to see if it belongs with them.

The Z-Test

The example in the last section was one where we compared one raw score to a bunch of other raw scores. Now let’s try something a little different.

Let’s say you’ve been trained in graduate school to administer I.Q. tests. You get hired by a school system to do the testing for that school district. On your first day at work the principal calls you into their office and tells you they’d like you to administer an I.Q. test to the 25 seventh graders in a classroom. The principal then says that all you have to do is answer a simple, straightforward question: Are the students in that classroom typical/average seventh graders or not?

Now, before we start: what would you expect the I.Q. scores in this set to look like? The I.Q. test is set up so that the mean I.Q. for all of the scores in the population is 100 and the standard deviation of all the I.Q. scores for the population is 15. So, if you were testing a sample of seventh graders from the general population, you’d expect the mean to be 100.

Now, let’s say you test all 25 students. You get their I.Q. scores and find that the mean for this group is 135. 135! Are you thinking these are typical/average seventh graders or not? Given what you know about I.Q. scores, you’re probably not. But why not? What if the mean had turned out to be 103? Is this a believable result from typical/average seventh graders? Probably. How about if the mean was 106? Or 109? Or 115? At what point do you change your mind from “yes, they were typical/average seventh graders” to “no, they’re not”? What do you have to go on in deciding where this cutoff point ought to be? At this point in our discussion your decision is being made at the level of intuition. But this intuition is informed by something very important; it’s informed by your sense of the odds of getting the results you did. Is it believable that you could have gotten a mean of 135 when the mean of the population is 100? No, not really. It seems like the odds of this happening are pretty gosh-darned low.
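To put some rough numbers behind that intuition, here is a hedged sketch. It relies on the standard error of the mean (the population standard deviation divided by the square root of the sample size, 15 / √25 = 3), which is the standard way of describing how much sample means bounce around just by chance; treat this as a preview rather than something the example has spelled out yet.

    from math import sqrt
    from scipy.stats import norm

    # Assumed setup: 25 students sampled from a population with mean 100 and SD 15.
    population_mean = 100
    population_sd = 15
    n = 25
    sample_mean = 135

    standard_error = population_sd / sqrt(n)              # 15 / 5 = 3
    z = (sample_mean - population_mean) / standard_error  # (135 - 100) / 3 = 11.67
    print(round(z, 2))  # 11.67 standard errors above 100
    print(norm.sf(z))   # essentially zero; the odds of this happening by chance are astronomically low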