Understanding Statistics by Using Spreadsheets to Generate and Analyze Samples

SPORTSCIENCE

/ sportsci.org

Perspectives / Research Resources

Understanding Statistics by Using Spreadsheets to Generateand AnalyzeSamples

Will G Hopkins

Sportscience 11,23-36, 2007 (sportsci.org/2007/wghstats.htm)
Institute of Sport and Recreation Research, AUT University, Auckland 0627, New Zealand. Email. Reviewers: Bernadette Murphy, Faculty of Health Sciences, University of Ontario Institute of Technology, North Oshawa, OntarioL1H 7L7; Patria Hume, Institute of Sport and Recreation Research, AUT University, Auckland 0627, New Zealand.

The random number and probability distribution functions in Excel allow the user to easily generate samples that simulate data typical of anykind of biomedical study.The act of generating thesamplesshould provide the user with animplicit understanding of fundamental statistical concepts, includingvariables, probability, independence, sampling variation, linear modeling,random error, fixed effects,random effects, and individual responses. Analysis of the samples, whichis essentially an attempt to recover the formulae that generated the samples,shouldreinforcethese concepts and developothers related to statistical inference, including bias, confidence limits, statistical significance, and chances of benefit and harm. The spreadsheets accompanying this article provide examples of generation and analysis of data for reliability and validity studies and forsimple and covariate-adjusted comparisons of group means without and with repeated measurement.An example is also given for generation ofa binary variable for data simulating events, such as the occurrenceof injuries, but the analysis by generalized linear modeling is currently not available in these spreadsheets. KEYWORDS: confidence limits,data analysis, probability,random number, research design, simulation
Reprintpdf· Reprintdoc· Spreadsheets

RAND() and Coin Tossing

Simple Sample Statistics

Reliability and Error of Measurement

Log Transformation

Percent Error of Measurement

Validity and Error of the Estimate

Comparison of Means in Two Groups

Means in Two Groups Plus a Predictor

Pre-post Controlled Trial

Binary Outcome

Other Designs

References

When I began to use an advanced statistical package, I found that I could come to terms with the output by analyzing data with an obvious effect that I had made up. This exercise also gave me a deeper understanding of statistics. In this article I explain how you can go through a similar process with spreadsheets.

In the beginning I used my imagination to get numbers representingan effect, such as a correlation between weight and height or a difference in mean IQ between boys and girls. I soon realized that it would be better to use the stats package itself to generate a known correlation or a known difference between means. If the analysis reproduced the known effect, I could be more confident that I was using the package correctly.

What did I mean by aknown effect? There is a true value for an effect in a population,but when you do a study,you get only a sample value, which is never exactly the true value. So when you generate a sample containing an effect, the trick is toreproduce the behavior of real samples: the known effect is the true or population value, and you make up a sample that gives something likethe true value. If you were to make up many samples, every sample would give a different value scattered around the true value. On average, the sample value would be the true value. Or you could make up a very large sample, in which case the sample value should be very close to the true value.

Of course, in real life you can never know the true value of an effect in a population, so when you analyze real data, you can't check if you've got the right answer. But with made-updata, you can. And it's not simply a matter of making lots of samples or a very large sample. With any statistical analysis of a sample,you have to derive the confidence interval for the effect. Do this with your made-up sampleand you should find that the confidence interval includes the right answer–except that there's a small chance it won't: 10%, if you choose 90% for the level of your confidence interval. If you are confused at this point, you won'tbe when you build up some simple data from scratch. All these ideas will emerge naturally.

Let's now see how to make samples with a spreadsheet. In the following sections I will work my way through the accompanying workbook, which has a spreadsheet for eachsection. At the end of each section I list the concepts for which you should have developed an implicit understanding.Within each spreadsheet I have filled cells with the colors of the rainbow torepresent the sequence in which the cells were created: red first through to violet and sometimes to greyandwhite.(My apologies to the color blind.) You'll probably learn moreby recreatingeachcell's formula in an adjacent blank cell or in a new blank spreadsheet rather than by simply clicking your way through the filled cells.

RAND() and Coin Tossing

At the heart of sampling is the notion of a random sample of a population. In Excel you access randomness via thefunction RAND(). Nothing goes in the brackets, by the way: all Excel functions have brackets, whether or not they have an argument. A similar function is PI(), which generates the value of π.If you can't remember a function's exact name, access it and allthe other available functions in twoways: either click on the fx symbol towards the left-hand end of the lowest menu bar, or select Insert/Function… from the top menu bar. A window will open that you can navigate through to find the function of interest.

So, put or type =RAND() into a cell (in upper or lower case, followed by pressing Enter, which I won't bother to state again) and you will get a number between 0 and 1, such as 0.230592. That's only the first six decimal places of a number that has 15 decimal places, but we don't have to worry about all those extra digits. Each digit is chosen randomly from 0 through 9. The result is a random number between 0 and 1. Put =RAND() into another cell and you will get another similar random number. But notice that the first cell you put =RAND() into has now changed to yet another random number! Whenever you get into and out of any cell, even if you put nothing into it, Excel updates all the values of RAND everywhere in the spreadsheet. To make this updating happen rapidly (which is often useful), hit Ctrl-D ("copy down", so make sure the cursor is not sitting in the first row of the table or in a blank cell immediately underneath a non-blank cell).

A number that ranges between 0 and 1 can be interpreted as a probability: values closer to 0 represent more unlikely, and values closer to 1 represent morelikely. You can interpret probability as the proportion of times you expect something to happen. So let's have a bit of fun using IF and RAND to simulate coin tosses. Type =IF(RAND()<0.5,"heads","tails") into a cell. Now click on that cell, then click on the little black box in the bottom right-hand corner and drag down for 10 or so cells. You have just simulated 10 or so tosses of a coin. Now do the same again, but change the 0.5 to 0.9 and see what happens when you toss the coin 10 times. To change the value in a lot of cells all at once, put =RAND() in one cell, say G5 as shown, then put=IF(RAND()<$G$5,"heads","tails") in another cell. To do it using the full functionality of Excel, type =IF(RAND()< then click on cell G5, then hit the "Function 4" (F4) key and $ signs will appear immediately. (Toggle the F4 key and you cycle through all four possible combinations of $ and no $--for advanced users!Mac users:sorry, F4 doesn't work for you.) The $ signs "freeze" the reference to a cell when you do a copy operation, as you will see when you complete this cell and copy it down for 10 or so cells, as above. Now change the contents of G5 to see what happens when you have a weird weighted coin thatcomes up heads or tails more often than a fair coin.

Concepts: random,probability.

Simple Sample Statistics

I will start with a sample that gives nothing more than some simple statistics–the mean and standard deviation (SD)–for a single continuous variable. Excel has a function, NORMSINV, that converts a probability into a value of a normally distributed variable drawn from a population with a mean value of 0 and an SD of 1. You don't have to understandwhat NORMSINV does; just give NORMSINV the probability generated by RAND, et voila, a value of a variable! The rest of this paragraph is for the clever orcurious. NORM and S stand for the standard unit normalprobability distribution, and INV stands for the inverse thereof. In fact, it is the inverse of the cumulative normal probability distribution: =NORMSINV(p) interprets p as a cumulative probability by returning a value of a normally distributed variable with mean of 0 and SD of 1, such that p is the proportion of the population with values less than the value it returns. (Draw a normal distribution and shade the area under the curve to the left of a value: that's p. Also, p here is not the p value of null-hypothesis tests.) Small values of p give large negative values of the unit normally distributed variable, p=0.5 gives the value zero, and values of p close to 1 give large positive values of the variable. p=0.025 and p=0.975 give -1.96 and 1.96, which you will recognize if you've done stats courses.

Let's generate that sample. Use the layout in the spreadsheet as a guide to label and make a column of RAND values and a column of NORMSINV of those values. (Having made one cell, copy it down by dragging down the little black box in the corner as before, or by highlighting the cell and blank cells below it, then hitting Ctrl-D.) Then make up a few names or copy my column of gender-neutral names to emphasize that we are about to make values of a subject characteristic like peak power.

Next, decide on a mean and SD (e.g., 400 and 50, as shown). The mean is a measure of where the middle of the population falls, and the SD is a measure of the scatter of the values about the middle. Generate an observed value of peak power from these and the NORMSINV value using this equation: peak power = mean + SD*(the NORMSINV value). Use the F4 key to freeze the cells for mean and SD, then copy the cell down to generate your sample of 100 or so. Get a feel for the way the RAND value makes the NORMSINV value, and how the NORMSINV value combines with the mean and SD to make the observed value. It's obvious that the mean is a measure of middle of the numbers, and the SD is a measure of their scatter about the middle.

Each value we have generated is called an observation, short for the observed value of a variable. Notice that we generated the observations with only three things: a mean, a standard deviation, and a random number. Although the observations for different subjects are related by the mean and SD, the random number makes it impossible to say anything about the specific value of any one observation, given the value of any other observation.The observations are therefore said to be independent. Most analyses are based on the assumption that the values of each subject are independent of the values of all other subjects. Repeated measurements on the same subjects in a data set are, of course, not independent: you get a similar value whenever you re-measure the same subject, as we will see in the next spreadsheet.

Display a frequency histogram of the values as shown on the spreadsheet. (See instructions in the spreadsheet for this unbelievably clunky aspect of Excel.) With bigger and bigger sample sizes, the frequency histogram looks more and more like a classic normal distribution. What is a normal distribution, anyway? Just another one of those incredible manifestations of the laws of nature, which here make most variables have the same expected shape for their sampling distribution: their distribution of values. What's also incredible is that someone worked out its equation. Of course, real variables do nothave perfect normal distributions, but they don't have to for statistical analyses.

Now for something more advanced... Derive the sample mean, sample SD and sample size using the AVERAGE, STDEV and COUNT functions. Click the cursor in a blank cell somewhere, then hit Ctrl-D a few times and watch what happens to the sample values. What you are seeing is sampling variation. Notice how the sample values hover around the population value.

Finally, some really advanced statistics… Create a confidence interval(or confidence limits) for the mean and for the SD. A confidence interval represents an inference or conclusion about the population or huge-sample value of a statistic. When you publish a 90% (say) confidence interval for a statistic, you are saying to the world: "I did this study with a sample, so I can't be sure about the true or population value, but if I were to do the study with a huge sample, the value of the statistic would probably fall in here; by probably I mean there's a 90% chance."

The confidence interval is defined by the lower and upper confidence limits. Generate these for the mean using the TINV function and for the SD using the CHIINV function. These functions are based on the shape of the sampling distribution of values you get for the mean and the SD. The formula for the confidence limits for the mean is simple enough: mean ± t*SD/√(sample size), where t is the value of a t statistic (given by TINV) appropriate for the level of confidence and the sample size. The formula for the confidence limits for the SD is a bit more complicated, soget the formulae into your spreadsheet by copying my cells. Check that the formulae point to the right cells by double clicking and noting where the colored boxes fall.Drag the colored boxes to the right cells if they aren't right. Click on the cells below ±CL and CL to see how I derived the way of showing confidence limits as a single plus-or-minus number for the uncertainty in the mean and a single times-or-divide number for the uncertainty in the SD.

We seldom use confidence limits for a simple sample mean, but confidence limits for some kinds of SD are becoming more common. Regardless, I've got you to do some confidence limits here to show you what happens when you draw repeated samples. Do your population values of the mean and SD fall between their confidence limits? Hit Ctrl-D repeatedly and check each time. Do it until you get a sense that, yes, maybe one time in 10 the true limits do not enclose the true value. That's what a 90% confidence interval means: there's a 90% chance that it encloses the true value.

Play with the data. Try a 50% confidence interval to get a quicker sense of whether it really is 50%. Make the samplea lot smaller or biggeras follows: click the cursor in the far left of a row (not the first row of the sample) to highlight it, drag down to highlight multiple rows, then either right-click Delete…or right-click Copy and right-click Insert Copied Cells. (Do this operation below the level of the figure, to avoid problems.) Notice the effect of a change in sample size on the sampling variation and on the width of the confidence interval, but notice also with repeated use of Ctrl-D that the confidence interval stays true to its level.

Concepts:independent, observation, normal distribution, sampling distribution, population and sample mean and standard deviation, sampling variation, confidence interval or limits, inference.

Reliability and Error of Measurement

Measurement error and the concept of reliability enter naturally at this point. All we do here is tweak the values of a variable to make them more like the values you would measure in reality. The tweaking consists of adding two terms: another RAND, to represent random measurement error, and a constant term to represent the kind of systematic error, offset or bias that occurs when there is a learning, familiarization, fatigue or other order effect with repeated measurement on the same subjects. The spreadsheet shows two trials (measurements) on a sample of subjects, and an additional eight trials on one subject.You can change the random and systematic measurement errors for each trial, but the random error is shown as the same for all trials, and the systematic error is shownas disappearing after the first two trials. There is no need to make values for Trials 3 to 10 for more than the first subject, but I have put $ signs in the right places for the first subject in case you do.

Notice how each subject's second measurement is similar to his or her first. If you were to lump all the observations for both trials together, the observations would no longer all be independent: given one observation, you know there is a similar value for another observation, the observation from the same subject. Repeated measurements on the same subjects produce repeated measures (here, the variables represent the values in each trial) and each set of repeated measurements makes a cluster of observations (here, the pairs of measurements, representing a cluster for each subject).