The CEO Makes $100,000 Per Year

Mean

This is one of the more common statistics you will see. And it's easy to compute. All you have to do is add up all the values in a set of data and then divide that sum by the number of values in the data set. Here's an example:

Let's say you are writing about the World Wide Widget Co. and the salaries of its nine employees.

The CEO makes $100,000 per year,

Two managers make $50,000 per year,

Four factory workers make $15,000 each,

and

Two trainees make $9,000 per year.

So you add $100,000 + $50,000 + $50,000 + $15,000 + $15,000 + $15,000 + $15,000 + $9,000 + $9,000 (all the values in the set of data), which gives you $278,000. Then divide that total by 9 (the number of values in the set of data).

That gives you the mean, which is $30,889.
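If you would rather let a computer do the arithmetic, here is a minimal sketch in Python of the same calculation. The variable names are my own; the salary figures are the ones from the example above.

```python
# Salaries at the World Wide Widget Co., taken from the example above.
salaries = [100_000] + [50_000] * 2 + [15_000] * 4 + [9_000] * 2

total = sum(salaries)   # add up all the values in the data set
count = len(salaries)   # the number of values in the data set
mean = total / count    # divide the sum by the count

print(f"Total payroll: ${total:,}")
print(f"Employees: {count}")
print(f"Mean salary: ${mean:,.0f}")
```

The last line rounds the mean to the nearest dollar, which is how the $30,889 figure is reached.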

Not a bad average salary. But be careful when using this number. After all, only three of the nine workers at WWW Co. make that much money. And the other six workers don't even make half the average salary.

So what statistic should you use when you want to give some idea of what the average worker at WWW Co. is earning? It's time to learn about the median.

Median

Whenever you find yourself writing the words, "the average worker" this, or "the average household" that, you don't want to use the mean to describe those situations. You want a statistic that tells you something about the worker or the household in the middle. That's the median.

Again, this statistic is easy to determine because the median literally is the value in the middle. Just line up the values in your set of data, from largest to smallest. The one in the dead center is your median.

For the World Wide Widget Co., here are the workers' salaries:

$100,000

$50,000

$50,000

$15,000

$15,000

$15,000

$15,000

$9,000

$9,000

That's 9 employees. So the one halfway down the list, the fifth value, is $15,000. That's the median. (If halfway lies between two numbers, split 'em.)
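Here is a rough sketch of that procedure in Python, including the "split 'em" rule for when the halfway point falls between two numbers. The function name is my own choice, and the sketch assumes the list is not empty. Python's standard library also offers statistics.median, which does the same job.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values, reverse=True)  # line up from largest to smallest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: one value sits in the dead center.
        return ordered[middle]
    # Even count: halfway lies between two numbers, so split them.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Salaries at the World Wide Widget Co., from the example above.
salaries = [100_000] + [50_000] * 2 + [15_000] * 4 + [9_000] * 2

print(f"Median salary: ${median(salaries):,}")
```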

Comparing the mean to the median for a set of data can give you an idea how widely the values in your data set are spread apart. In this case, there's a substantial gap between the CEO at WWW Co. and the rank and file, which is why the mean ($30,889) is roughly double the median ($15,000). (Of course, in the real world, a set of just nine numbers won't be enough to tell you very much about anything. But we're using a small data set here to help keep these concepts clear.)

Statisticians have a value, called a standard deviation, that tells them how widely the values in a set are spread apart. A large SD tells you that the data are fairly diverse, while a small SD tells you the data are pretty tightly bunched together. If you'll be doing a lot of work with numbers or scientific research, it will be worth your time to learn a bit about the standard deviation.

Standard Deviation

I'll be honest. Standard deviation is a more difficult concept than the others we’ve covered. And unless you are writing for a specialized, professional audience, you'll probably never use the words "standard deviation" in a story. But that doesn't mean you should ignore this concept.

The standard deviation is, roughly speaking, the typical distance between the values in a set of data and the mean of that set, and it often can help you find the story behind the data. To understand this concept, it can help to learn about what statisticians call normal distribution of data.

A normal distribution of data means that most of the examples in a set of data are close to the "average," while relatively few examples tend to one extreme or the other.

Let's say you are writing a story about nutrition. You need to look at people's typical daily calorie consumption. Like many kinds of data, the numbers for people's typical consumption probably will turn out to be normally distributed. That is, for most people, their consumption will be close to the mean, while fewer people eat a lot more or a lot less than the mean.

When you think about it, that's just common sense. Not that many people are getting by on a single serving of kelp and rice. Or on eight meals of steak and milkshakes. Most people lie somewhere in between.

If you looked at normally distributed data on a graph, it would look something like this:

The x-axis (the horizontal one) is the value in question... calories consumed, dollars earned or crimes committed, for example.

And the y-axis (the vertical one) is the number of datapoints for each value on the x-axis... in other words, the number of people who eat x calories, the number of households that earn x dollars, or the number of cities with x crimes committed.

Now, not all sets of data will have graphs that look this perfect. Some will have relatively flat curves, others will be pretty steep. Sometimes the mean will lean a little bit to one side or the other. But all normally distributed data will have something like this same "bell curve" shape.

The standard deviation is a statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are pretty tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, that tells you that you have a relatively large standard deviation.

Computing the value of a standard deviation is complicated. But let me show you graphically what a standard deviation represents...

One standard deviation away from the mean in either direction on the horizontal axis (the red area on the above graph) accounts for somewhere around 68 percent of the people in this group. Two standard deviations away from the mean (the red and green areas) account for roughly 95 percent of the people. And three standard deviations (the red, green and blue areas) account for about 99.7 percent of the people.
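Those percentages are not arbitrary; they fall out of the math of the bell curve. As a small sketch, assuming the data follow a perfectly normal distribution, the share of examples within k standard deviations of the mean can be computed with the error function from Python's math module. The function name share_within is my own.

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```

Real data rarely match the ideal curve exactly, so treat these figures as approximations.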

If this curve were flatter and more spread out, the standard deviation would have to be larger in order to account for those 68 percent or so of the people. So that's why the standard deviation can tell you how spread out the examples in a set are from the mean.

Why is this useful? Here's an example: If you are comparing test scores for different schools, the standard deviation will tell you how diverse the test scores are for each school.

Let's say Springfield Elementary has a higher mean test score than Shelbyville Elementary. Your first reaction might be to say that the kids at Springfield are smarter.

But a bigger standard deviation for one school tells you that there are relatively more kids at that school scoring toward one extreme or the other. By asking a few follow-up questions you might find that, say, Springfield's mean was skewed up because the school district sends all of the gifted education kids to Springfield. Or that Shelbyville's scores were dragged down because students who recently have been "mainstreamed" from special education classes have all been sent to Shelbyville.

In this way, looking at the standard deviation can help point you in the right direction when asking why data is the way it is.
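To make the school comparison concrete, here is a sketch using Python's statistics module. The test scores are hypothetical numbers I invented purely for illustration; they do not come from any real school.

```python
import statistics

# Hypothetical test scores, invented purely for illustration.
springfield = [55, 60, 62, 65, 68, 70, 95, 98, 99, 100]
shelbyville = [70, 72, 73, 74, 75, 75, 76, 77, 78, 80]

for name, scores in (("Springfield", springfield), ("Shelbyville", shelbyville)):
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    print(f"{name}: mean {mean:.1f}, standard deviation {sd:.1f}")
```

In this made-up data, Springfield has the higher mean but also a much larger standard deviation: a handful of very high scores sit alongside a group of low ones. That is the kind of pattern that should prompt follow-up questions.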

The standard deviation can also help you evaluate the worth of all those so-called "studies" that seem to be released to the press every day. A large standard deviation in a study that claims to show a relationship between eating Twinkies and killing politicians, for example, might tip you off that the study's claims aren't all that trustworthy.

Of course, you'll want to seek the advice of a trained statistician whenever you try to evaluate the worth of any scientific research. But if you know at least a little about standard deviation going in, that will make your interview much more productive.

Okay, because so many of you asked nicely...

Here is one formula for computing the standard deviation. A warning: this is for math geeks only!

Writers and others seeking only a basic understanding of stats don't need to read any more in this chapter. Remember, a decent calculator or stats program will calculate this for you...

Terms you'll need to know

x = one value in your set of data

avg (x) = the mean (average) of all values x in your set of data

n = the number of values x in your set of data

For each value x, subtract the overall avg (x) from x, then multiply that result by itself (otherwise known as determining the square of that value). Sum up all those squared values. Then divide that result by (n-1). Got it? Then, there's one more step... find the square root of that last number. That's the standard deviation of your set of data.

I told you it was for math geeks only.
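For readers who do want to try it, the steps described above can be sketched in Python like this. The function name is my own, and the sketch assumes at least two values in the data set, since it divides by (n-1). Python's built-in statistics.stdev follows the same (n-1) recipe.

```python
import math


def standard_deviation(values):
    """Standard deviation following the (n-1) recipe described in the text."""
    n = len(values)                                    # number of values x
    avg = sum(values) / n                              # avg (x), the mean
    squared_diffs = [(x - avg) ** 2 for x in values]   # (x - avg) squared, for each x
    variance = sum(squared_diffs) / (n - 1)            # sum them, divide by (n-1)
    return math.sqrt(variance)                         # one more step: the square root


# Salaries at the World Wide Widget Co., from the earlier example.
salaries = [100_000] + [50_000] * 2 + [15_000] * 4 + [9_000] * 2

print(f"Standard deviation of salaries: ${standard_deviation(salaries):,.0f}")
```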

Margin of Error

Margin of error deserves better than the throwaway line it gets at the bottom of stories about polling data. Writers who don't understand margin of error, and its importance in interpreting scientific research, can easily embarrass themselves and their news organizations.

Check out the following story that moved in the summer of 1996 on a major news wire:

WASHINGTON (Reuter) - President Clinton, hit by bad publicity recently over FBI files and a derogatory book, has slipped against Bob Dole in a new poll released Monday but still maintains a 15 percentage point lead.

The CNN/USA Today/Gallup poll taken June 27-30 of 818 registered voters showed Clinton would beat his Republican challenger if the election were held now, 54 to 39 percent, with seven percent undecided. The poll had a margin of error of plus or minus four percentage points.

A similar poll June 18-19 had Clinton 57 to 38 percent over Dole.

Unfortunately for the readers of this story, it is wrong. There is no statistical basis for claiming that Clinton's lead over Dole has slipped.

Why? The margin of error. In this case, the CNN et al. poll had a margin of error of four percentage points. That means that if you ran this poll 100 times, 95 of those times the percentage of people giving a particular answer would be within 4 points of the percentage who gave that same answer in this poll.

(WARNING: Math Geek Stuff!)

Why 95 times out of 100? In reality, the margin of error is what statisticians call a confidence interval. The math behind it is much like the math behind the standard deviation. So you can think of the margin of error at the 95 percent confidence level as being equal to about two standard deviations in your polling sample. Occasionally you will see surveys with a 99 percent confidence level, which would correspond to roughly 2.6 standard deviations and a larger margin of error.
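For the truly curious, here is a sketch of the textbook margin-of-error formula for a simple random sample. The function name and its defaults are my own assumptions: it uses 1.96 standard deviations for the 95 percent level and assumes a 50-50 split, which gives the widest margin. I do not know the exact method the pollsters in this story used.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Textbook margin of error for a simple random sample.

    z=1.96 corresponds to the 95 percent confidence level;
    use z=2.576 for the 99 percent level.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    return z * standard_error


moe = margin_of_error(818)  # 818 registered voters, as in the poll above
print(f"Margin of error: {moe * 100:.1f} percentage points")
```

For 818 respondents this simple formula gives roughly 3.4 percentage points. The published figure of four points is a bit larger, which may reflect rounding up or adjustments for how the poll was conducted; that is a guess on my part.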

(End of Math Geek Stuff!)

So let's look at this particular week's poll as a repeat of the previous week's (which it was). The percentage of people who say they support Clinton is within 4 points of the percentage who said they supported Clinton the previous week (54 percent this week to 57 last week). Same goes for Dole. So statistically, there is no change from the previous week's poll. Dole has made up no measurable ground on Clinton.

And reporting anything different is just plain wrong.
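The check described above is simple enough to write down as a few lines of Python. This is only a sketch of the rule of thumb used in this section: the names are my own, and the numbers are the ones quoted in the wire story.

```python
MARGIN = 4  # percentage points, as reported for the poll

previous = {"Clinton": 57, "Dole": 38}  # June 18-19 poll
current = {"Clinton": 54, "Dole": 39}   # June 27-30 poll


def changes_within_margin(before, after, margin):
    """For each candidate, report whether the change is within the margin of error."""
    return {name: abs(after[name] - before[name]) <= margin for name in before}


result = changes_within_margin(previous, current, MARGIN)
for name, within in result.items():
    change = current[name] - previous[name]
    verdict = "within" if within else "outside"
    print(f"{name}: {change:+d} points, {verdict} the margin of error")
```

Both changes land inside the four-point margin, so by this test neither one counts as a measurable move.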

Don't overlook the fact that the margin of error is a 95 percent confidence interval, either. That means that for every 20 times you repeat this poll, statistics say that, on average, one time you'll get an answer that falls outside the margin of error.
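You can watch this one-in-20 effect happen with a small simulation. The sketch below pretends that true support is exactly 54 percent and that each poll is a simple random sample of 818 people; both are assumptions made for illustration, as are the function name and the seed.

```python
import math
import random


def share_outside_margin(true_support=0.54, sample_size=818, polls=2000, seed=1996):
    """Simulate repeated polls and return the share that miss by more than the margin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    standard_error = math.sqrt(true_support * (1 - true_support) / sample_size)
    margin = 1.96 * standard_error  # margin at the 95 percent level
    misses = 0
    for _ in range(polls):
        # Ask sample_size simulated voters; each supports with probability true_support.
        supporters = sum(rng.random() < true_support for _ in range(sample_size))
        if abs(supporters / sample_size - true_support) > margin:
            misses += 1
    return misses / polls


share = share_outside_margin()
print(f"Polls outside the margin of error: {share:.1%}")
```

The share should come out somewhere near 5 percent, or about one poll in 20, though the exact figure will wobble with the seed and the number of simulated polls.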

You might remember that just after Dole resigned from the U.S. Senate, the CNN et al. poll had Clinton's lead down to six points. Reports attributed this surge by Dole to positive public reaction to his resignation. But the next week, Dole's surge was gone.

Perhaps there never was a surge. It very well could be that that week's poll was the one in 20 where the results lie outside the margin of error. Who knows? Just remember to never place too much faith in one week's poll or survey. No matter what you are writing about, only by looking at many surveys can you get an accurate look at what is going on.