Descriptive Statistics
Dr. Tom Pierce
Department of Psychology
Radford University
Descriptive statistics comprise a collection of techniques for understanding what a group of people looks like in terms of the measure or measures you’re interested in.
In general, there are four classes of descriptive techniques. First, frequency distributions are used to display information about where the scores in a data set fall along the scale going from the lowest score to the highest score. Second, measures of central tendency, or averages, provide the best single numbers to use in representing all of the scores on a particular measure. Third, measures of variability provide information about how spread out a set of scores is. Fourth, the original raw scores one collects are often transformed to other types of scores in order to provide the investigator with different types of information about the research participants in a study. A standard score is a very good example of a transformed score that provides much more information about an individual subject than a raw score can.
Frequency distributions
Let’s say that you obtain Beck Depression Inventory scores from each of 400 research participants. The scores on this measure can range anywhere from 1 to 73. Typically, scores fall somewhere between 35 and 55. You’ve got 400 numbers to keep track of here. If someone asks you how the scores in your study came out you could say “well, subject number one had a score of 38, subject two had a 25, subject three had a 46, …”. You get the idea. This is too many numbers for anyone to be able to look at them and get a general idea about where most of the scores fall on the scale and how spread out the scores are around this point on the scale. The job of a frequency distribution is to take a very large set of numbers and to boil it down to a much smaller set of numbers, a collection that is small enough for the pathetically limited human mind to keep track of at one time. A good frequency distribution allows the consumer to extract the necessary information about the scores in the data set while working within our cognitive limitations.
Regular frequency distribution
The most straightforward example of a frequency distribution goes like this. Let’s say that you’re given ratings of teaching effectiveness from the students in a large Introduction to Psychology class. There are 400 students in the class. The questionnaire provides students with 15 statements, and each student is asked to pick a number between one and five that indicates the degree to which they agree or disagree with each statement. One of these statements is “The instructor in this course is outstanding”. A response of “five” indicates that the student agrees with the statement completely. A response of “one” indicates that the student disagrees with the statement completely. A regular frequency distribution will allow the instructor to see how many of the students gave each possible rating from one to five. In other words, how many students gave the instructor a “one”, how many gave a “two”, and so on. You get the idea. This information is often displayed in the form of a table.
Table 1.1
X f
------
5 150
4 200
3 40
2 10
1 0
-----
400
There are two columns of numbers in this table. There is a capital X at the top of the column on the left. Every possible raw score that a subject could provide is contained in this column. A capital X is used to label this column because a capital X is the symbol that is usually used to represent a raw score. The column on the right is labeled with a lowercase letter f. The numbers in this column represent the number of times, or the frequency, that each possible score actually showed up in the data set. The letter f at the top of the column is just shorthand for “frequency”.
Thus, this dinky little table contains everything the instructor needs in order to know every score in the data set. Instead of having to keep track of 400 numbers, the instructor only has to keep track of five: the number of times each possible score appeared in the data set. This table is said to represent a frequency distribution because it shows us how the scores in the set are distributed as you go from the smallest possible score in the set to the highest possible score. It basically answers the question “where did the scores fall on the scale?” This particular example is said to represent a regular frequency distribution because every possible score is displayed in the raw score (capital X) column.
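For readers who want to build a table like Table 1.1 from raw scores, here is a minimal Python sketch. The list of ratings is not the actual class data; it is reconstructed from the frequencies in Table 1.1. The function name regular_frequency_distribution is made up for this illustration.

```python
from collections import Counter

# Hypothetical raw ratings, reconstructed to match the frequencies in Table 1.1.
ratings = [5] * 150 + [4] * 200 + [3] * 40 + [2] * 10


def regular_frequency_distribution(scores, lowest, highest):
    """Return (score, frequency) rows for every possible score, highest first."""
    counts = Counter(scores)
    # A Counter returns 0 for scores that never occurred, so a rating of 1
    # still gets its own row with a frequency of zero.
    return [(x, counts[x]) for x in range(highest, lowest - 1, -1)]


table = regular_frequency_distribution(ratings, lowest=1, highest=5)

print("X   f")
for x, f in table:
    print(f"{x}   {f}")
print("total:", sum(f for _, f in table))
```

Notice that the possible scores (one through five) are passed in separately from the data. That is what makes the distribution "regular": a row appears for every possible score, even one that no student chose.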
Interval frequency distribution
A slightly different situation where a frequency distribution might come in handy is in displaying the IQ scores collected from 90 people. In a random sample of people drawn from the population, what would you expect these IQ scores to look like? What is the lowest score you might expect to see in the set? What’s the highest score you might reasonably expect to see in the data set? It turns out that the lowest score in this particular set is 70 and the highest score is 139. Is it reasonable to display these data in a regular frequency distribution? No! So why not?
What would you have to do to generate a regular frequency distribution for these data? You’d start with the highest possible score at the top of the raw score column and in each row below that you’d list the next lower possible raw score. Like this…
Table 1.2
X
---
139
138
137
136
You get the idea. There are SEVENTY possible raw scores between 70 and 139. That means there would be 70 rows in the table and 70 numbers that you’d have to keep track of. That’s too many! The whole idea is to keep the number of values that you have to keep track of to somewhere between five and ten.
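The count of 70 rows is easy to get wrong by one. Subtracting 70 from 139 gives 69, but both endpoints are possible scores, so one has to be added back. A quick check in Python, using only the two values given above:

```python
lowest, highest = 70, 139

# Both endpoints count as possible scores, so add 1 to the difference.
n_possible_scores = highest - lowest + 1
print(n_possible_scores)  # 70

# Listing every possible score one by one gives the same count.
assert n_possible_scores == len(range(lowest, highest + 1))
```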
So what can you do? A regular frequency distribution isn’t very efficient when the number of possible raw scores is greater than ten or twelve. So it’s not a good idea to keep track of how often every possible raw score shows up in the set. Your best bet is to be satisfied with keeping track of how many times you have scores that fall within a range of possible scores. For example, how many times did you get scores falling between 130 and 139? How many times did you get scores falling between 120 and 129? Between 110 and 119? You get the idea. The table below presents a frequency distribution in which the numbers in the frequency column represent the number of times that scores fell within a particular interval or range of possible raw scores. This makes this version of a frequency distribution an interval frequency distribution or, as it is sometimes referred to, a grouped frequency distribution.
Table 1.3
X f
------
130-139 3
120-129 7
110-119 15
100-109 28
90-99 23
80-89 9
70-79 5
---
90
As you can see, when you add up all of the numbers in the frequency column you get 90. The table accounts for all 90 scores in the set, and it does this while requiring you to keep track of only seven numbers: the seven frequencies displayed in the frequency column. The interval frequency distribution retains the advantage of allowing you to get a sense of how the scores are distributed as you go from the low end of the scale to the high end. BUT you gain this simplicity at a price. What do you have to give up? You have to give up some precision in knowing exactly where any one score fell on the scale of possible scores. You might know that three scores fell between 130 and 139, but you have no way of knowing from the table whether there were three examples of a score of 130, or two 135s and a 137, or three examples of a 139. An interval frequency distribution represents a tradeoff between two competing benefits: the benefit of being precise and the benefit of being concise. When the measure you’re working with has a lot of possible scores it will be impossible to enjoy both benefits at the same time. And anyone who says otherwise is itching for a fight!
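The grouping step and the loss of precision can both be seen in a short Python sketch. The 90 raw IQ scores behind Table 1.3 are not listed in the text, so the two small samples below are hypothetical, and the function name interval_frequency_distribution is invented for this example. The intervals are ten points wide, as in Table 1.3.

```python
from collections import Counter


def interval_frequency_distribution(scores, lowest, highest, width):
    """Return (interval label, frequency) rows, highest interval first."""
    # Integer division assigns each score to the index of its interval:
    # 70-79 is interval 0, 80-89 is interval 1, and so on.
    counts = Counter((x - lowest) // width for x in scores)
    n_intervals = (highest - lowest) // width + 1
    rows = []
    for i in reversed(range(n_intervals)):
        start = lowest + i * width
        rows.append((f"{start}-{start + width - 1}", counts[i]))
    return rows


# Two hypothetical samples that differ in their individual scores.
sample_a = [130, 130, 130, 125, 104, 101, 98, 85, 72]
sample_b = [135, 135, 137, 121, 109, 100, 90, 89, 79]

table_a = interval_frequency_distribution(sample_a, 70, 139, 10)
table_b = interval_frequency_distribution(sample_b, 70, 139, 10)

for label, f in table_a:
    print(f"{label:<8} {f}")

# The raw scores differ, but the grouped tables are identical.
print(table_a == table_b)  # True
```

The final comparison is the tradeoff in miniature: once the scores are grouped, three 130s cannot be told apart from two 135s and a 137.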
Cumulative frequency distribution
Sometimes the investigator is most interested in the pattern one sees in a running total of the number of scores that have been encountered as one goes from the lowest score in the set to the highest. For example, in the IQ data set, I might want to know the number of people with IQ scores at or below the interval of 70-79, then how many have scores at or below the interval of 80-89, 90-99, and so on. A cumulative frequency distribution is a table or graph that presents the number of scores within or below each interval that the investigator might be interested in, not just the number of scores within each interval.
Constructing this type of frequency distribution is easy. The cumulative frequency that corresponds to each interval is nothing more than the number of subjects that fell within that interval PLUS the number of subjects that had scores below that interval. In Table 1.4, the cumulative frequencies for each interval are contained in the column labeled “cf”.
Table 1.4
X f cf
------
130-139 3 90
120-129 7 87
110-119 15 80
100-109 28 65
90-99 23 37
80-89 9 14
70-79 5 5
Cumulative frequencies are a standard tool for presenting data in a number of fields, including data from operant conditioning experiments and behavior modification interventions. In these studies, the cumulative frequency displays the number of correct responses made up to each point in the experiment. Psychologists have found that the shape of the distribution changes when different schedules of reinforcement are used.
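The running total in the cf column of Table 1.4 can be reproduced with a few lines of Python. This is a minimal sketch that starts from the interval frequencies in Table 1.3, listed from the lowest interval to the highest so that the total builds in the right direction. The variable names are made up for this example.

```python
from itertools import accumulate

# Intervals and frequencies from Table 1.3, lowest interval first.
intervals = ["70-79", "80-89", "90-99", "100-109", "110-119", "120-129", "130-139"]
frequencies = [5, 9, 23, 28, 15, 7, 3]

# Each cumulative frequency is the frequency for that interval plus the
# frequencies of all the intervals below it.
cumulative = list(accumulate(frequencies))

# Print with the highest interval on top, the way the tables are laid out.
print("X        f   cf")
for label, f, cf in reversed(list(zip(intervals, frequencies, cumulative))):
    print(f"{label:<8} {f:>2}  {cf:>3}")
```

A handy check on the arithmetic: the cumulative frequency for the highest interval has to equal the total number of scores, which is 90 here.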
Graphs of frequency distributions
Tables are one way of presenting information about where the scores in a data set are located on the scale going from the lowest score in the set to the highest. Another common strategy for presenting the same information is to provide a graph of a frequency distribution. The idea here is that the X-axis of the graph represents the full range of scores that showed up in the data set. The Y-axis represents the number of times that scores were observed at each point on the scale of possible scores.
[Figure 1.1]
The points in the graph in Figure 1.1 contain all of the information about the distribution of IQ scores that was contained in the frequency distribution presented in Table 1.3. Obviously, the higher the point in the graph, the more scores there were at that particular location on the scale of possible scores. If we “connect the dots” in the graph, we get what’s called a frequency polygon.
An alternative way of displaying the same information is to generate a bar graph. In a bar graph, the number of scores occurring for each individual score (for a regular frequency distribution) or for each grouping of scores (for an interval frequency distribution) is represented by the height of a bar rising above a particular score or interval on the X-axis.
[Figure 1.2]
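A rough text-only version of a bar graph for the Table 1.3 frequencies can be sketched in Python. This is only an illustration, not a reproduction of Figure 1.2: the bars run sideways instead of rising above the X-axis, and the function name text_bar_graph is hypothetical.

```python
# Intervals and frequencies from Table 1.3, highest interval first.
intervals = ["130-139", "120-129", "110-119", "100-109", "90-99", "80-89", "70-79"]
frequencies = [3, 7, 15, 28, 23, 9, 5]


def text_bar_graph(labels, freqs, symbol="#"):
    """Return one line per interval; the bar length equals the frequency."""
    width = max(len(label) for label in labels)
    return [
        f"{label:<{width}} | {symbol * f} {f}"
        for label, f in zip(labels, freqs)
    ]


bars = text_bar_graph(intervals, frequencies)
for line in bars:
    print(line)
```

The longest bar sits at the 100-109 interval, which is the same thing the tallest bar in a regular bar graph would tell you: that is where the scores pile up.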
Shapes of frequency distributions
There are several ways in which the shapes of frequency distributions can differ from each other. The first one that we’ll talk about is whether the shape of a frequency distribution is symmetrical or not. A distribution is said to be symmetrical when the shape on the left-hand side of the curve is the same as the shape on the right-hand side of the curve. For example, take a look at the graph of a normal distribution in Figure 1.3.
[Figure 1.3]
If you were to fold the left-hand side of the curve over on top of the right-hand side of the curve, the lines would overlap perfectly with each other. The right side of the distribution is a mirror image of the left side. Because of this we know that the normal curve is a symmetrical distribution.
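The "fold it over" test can be written as a very small Python check on a list of frequencies. In this sketch the first list of frequencies is made up to be symmetrical, the second comes from Table 1.3, and the function name is_symmetrical is invented for the example.

```python
def is_symmetrical(frequencies):
    """Fold test: the frequencies read the same left-to-right and right-to-left."""
    freqs = list(frequencies)
    return freqs == freqs[::-1]


# Hypothetical frequencies chosen to be a perfect mirror image.
made_up = [2, 7, 15, 7, 2]

# Frequencies from Table 1.3, lowest interval first.
table_1_3 = [5, 9, 23, 28, 15, 7, 3]

print(is_symmetrical(made_up))    # True
print(is_symmetrical(table_1_3))  # False
```

This is a strict test. A theoretical curve like the normal distribution passes it exactly, but frequencies from real samples almost never will, so in practice the question is whether a distribution is roughly symmetrical.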
Skewness. A distribution is said to be asymmetrical (i.e., without symmetry) if the two sides of the distribution are not mirror images of each other. For example, the frequency distribution below (Figure 1.4) displays reaction times on a choice reaction time task for one college-age research participant.
[Figure 1.4]
Obviously, if you were to fold the right side onto the left side, the lines wouldn’t be anywhere close to overlapping. The reaction times in the set are much more bunched up at the lower end of the scale and then the curve trails off slowly towards the longer reaction times. There are many more outliers on the high end of the scale than on the low end of the scale. This particular shape for a frequency distribution is referred to as a skewed distribution. Distributions can be skewed either to the right or to the left. In this graph the longer tail of the distribution is pointing to the right, so we’d say that the distribution is skewed to the right. Other people would refer to this shape as being positively skewed (the tail is pointing towards the more positive numbers). If the scores in a distribution were bunched up at the higher end of the scale and the longer tail were pointing to the left (or towards the negative numbers), we’d say that the distribution was skewed to the left or negatively skewed.
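The terms "positively" and "negatively" skewed line up with the sign of a number that can be computed from the scores. The sketch below uses one common formula (the average cubed deviation from the mean, divided by the cubed standard deviation). That formula is not introduced in this chapter, so treat it as an outside assumption. The reaction times are hypothetical values invented to have a long right tail; they are not the data behind Figure 1.4.

```python
def skewness(scores):
    """Moment-based skewness: positive for a right tail, negative for a left tail."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # average squared deviation
    m3 = sum((x - mean) ** 3 for x in scores) / n  # average cubed deviation
    return m3 / m2 ** 1.5


# Hypothetical reaction times in milliseconds: bunched up at the low end,
# with a few long times trailing off to the right.
reaction_times = [310, 320, 325, 330, 335, 340, 350, 360, 380, 420, 480, 610]

# Flipping the scale end for end turns the right tail into a left tail.
mirrored = [1000 - x for x in reaction_times]

print(skewness(reaction_times) > 0)  # True: skewed to the right
print(skewness(mirrored) < 0)        # True: skewed to the left
```

Cubing the deviations is what gives the sign: a few scores far above the mean contribute large positive values that outweigh the many small negative ones, and the reverse happens when the long tail points to the left.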