Statistics – A Survival Manual
Part 1 - Measures Of Central Tendency
"What's the price of bacon?" "How is the weather in Southern California this time of year?" "How tall does the Douglas fir grow?" These are examples of questions that are really looking for a figure that would be representative of your total population, that would give a fairly good general impression, a measure of central tendency. Some form of "average" is generally used to provide the answer. The mean and median are the most common measures of central tendency.
Mean
The mean, occasionally called the arithmetic mean, is the most common measure of central tendency. This measure of central tendency allows the value (magnitude) of each score in a sample to have an influence on the final measure. To obtain a mean, all of the observations (X) in a sample are added together, then that sum is divided by the total number (n) of observations in the sample.
Mean = ΣX / n
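If you like to see the arithmetic spelled out, here is a minimal Python sketch of the calculation (the sample values are hypothetical, made up just for illustration):

    # Hypothetical sample of five observations
    observations = [12, 15, 11, 18, 14]

    # Add all of the observations together, then divide by n
    mean = sum(observations) / len(observations)
    print(mean)   # prints 14.0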
Median
The median is a second measure of central tendency. Unless you have a particularly large group of numbers, this is a rather simple measure. To determine the median, sort your data so the observations either increase or decrease in value. The median is then simply the middle observation in your data set.
If the number of observations in your data set is odd, the middle observation can be found by dividing the total number of observations by two, then rounding the resulting number up to the nearest whole number. For example, if you have 25 observations in your data set, the median would be obtained by dividing 25 by 2, giving 12.5. Rounding up to the nearest whole number gives 13, making observation 13 in your ordered data set the median value for your sample.
If the number of observations in the data set is even, the median is obtained by averaging the values for the middle two observations after the data set has been ordered. For example, in a data set with 24 values, the middle two observations would be observations 12 and 13. If the value for observation 12 was "100" and the value for observation 13 was "102", the median for the data set would be "101".
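Here is a minimal Python sketch that handles both the odd and even cases. The data values are hypothetical; note that the even-numbered case reproduces the 100/102 example from the text:

    # Hypothetical sample with an even number of observations
    data = [102, 95, 100, 98, 110, 104]

    # Step 1: order the observations
    ordered = sorted(data)            # [95, 98, 100, 102, 104, 110]
    n = len(ordered)

    if n % 2 == 1:
        # Odd number of observations: take the middle one
        median = ordered[n // 2]
    else:
        # Even number: average the two middle observations
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

    print(median)   # prints 101.0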
Part 2 - Measures Of Variability Or Dispersion
Means and medians are expressions of central tendency. They provide a general picture of what a group of numbers is like; however, measures of central tendency can be misleading. For example, let us say the average height of two basketball teams is 180 cm. On team A, all of the players are exactly 180 cm, but on team B, one is 180 cm, two are 160 cm and two are 200 cm. Knowledge of this variation, or dispersion from the mean, would be meaningful to the coach of team A. A number of measures of dispersion are in common use; the most common are the range and the interquartile range (used in conjunction with the median), and the variance, standard deviation and coefficient of variation (used in conjunction with the mean).
The Range
When the nature of the distribution is not known, or when the distribution is known to be other than normal (e.g., skewed), the range will give a rough idea of the dispersion. The range is the difference between the high and low scores. The range does not tell us anything about the nature of the distribution, only its extent. If our sample observations were 18, 17, 17, 12, 11, 11, the range would be 7 (18 - 11 = 7).
The Interquartile Range
Quartiles, like the median, represent points in an ordered set of observations. The first quartile (Q1) is a point below which 25% of the observations fall; 50% of the cases fall below Q2 and 75% below Q3. The interquartile range includes those 50% of the scores that fall between Q1 and Q3. When the distribution of numbers is not normal because of extremes on either or both ends, an inspection of the middle 50% may prove to be most revealing in describing a group of numbers. Inspect the following set of ordered data. In this data set, the first quartile includes observations 1 through 5, the second quartile includes observations 6 through 10, the third quartile includes observations 11 through 15, and the interquartile range extends from 17 to 61.
Obs. # / Value / Obs. # / Value
1 / 10 / 11 / 35
2 / 10 / 12 / 39
3 / 11 / 13 / 42
4 / 13 / 14 / 50
5 / 15 / 15 / 61
6 / 17 / 16 / 73
7 / 21 / 17 / 95
8 / 23 / 18 / 102
9 / 26 / 19 / 113
10 / 30 / 20 / 140
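If you want to let the computer find the middle 50% for you, here is a minimal Python sketch using the twenty ordered values above. It follows the manual's convention of splitting the ordered observations into four equal groups, which works neatly here because 20 divides evenly by 4:

    # The twenty ordered values from the table above
    values = [10, 10, 11, 13, 15, 17, 21, 23, 26, 30,
              35, 39, 42, 50, 61, 73, 95, 102, 113, 140]
    n = len(values)

    # The middle 50% runs from observation 6 through observation 15
    # (Python counts from 0, so that is values[5] through values[14])
    middle_half = values[n // 4 : 3 * n // 4]

    print(middle_half[0], middle_half[-1])   # prints 17 61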
The Variance
Another group of measures of dispersion is based on the distance of each measurement from the center of the distribution (i.e. the mean or median). The first of these is the variance. To obtain the variance of a set of numbers, follow these steps:
1. Subtract the measure of central tendency (usually the mean) from each observation to obtain the "deviation" of each observation.
2. Square each of these deviations.
3. Add together all of your squared deviations.
4. Subtract 1 from the total number of observations in your sample (n - 1).
5. Divide your "sum of squared deviations" by this number.
Bingo! You've got the variance of your data set. Just as a side note, dividing by "n - 1" gives what is technically known as the sample variance.
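Here is a minimal Python sketch of the five steps, using the same six observations from the range example above:

    # The observations from the range example
    data = [18, 17, 17, 12, 11, 11]

    # The mean (our measure of central tendency)
    mean = sum(data) / len(data)

    # Steps 1-3: deviation of each observation, squared, then summed
    sum_sq_dev = sum((x - mean) ** 2 for x in data)

    # Steps 4-5: divide by n - 1 to get the sample variance
    variance = sum_sq_dev / (len(data) - 1)
    print(round(variance, 2))   # prints 11.07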
The Standard Deviation
With a normal distribution, the mean is the most accurate and descriptive measure of central tendency because it considers the magnitude (or value) of each score. In like manner, the standard deviation considers the magnitude of each score and therefore is the preferred measure of dispersion if the distribution is normal. Although the variance is a useful starting point in describing the variability of a normal data set, the standard deviation is more commonly used (for reasons we will not get into now). Once you have the variance of a set of scores, getting the standard deviation is simple. All you have to do is take the square root of the variance.
Coefficient of Variation
The final measure of dispersion is the coefficient of variation (CV for short). The CV is a useful measure of dispersion because it allows us to compare the magnitude of variation between two sets of data when the means (and the ranges) differ significantly. To obtain the CV, simply divide the standard deviation by the mean, and multiply the result by 100.
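Continuing the sketch from the variance section, the standard deviation and the CV take only a couple more lines (again using the hypothetical data from the range example):

    import math

    data = [18, 17, 17, 12, 11, 11]
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

    # The standard deviation is the square root of the variance
    std_dev = math.sqrt(variance)    # about 3.33

    # The CV: standard deviation divided by the mean, times 100
    cv = (std_dev / mean) * 100      # about 23.21, i.e., roughly 23%
    print(round(std_dev, 2), round(cv, 2))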
Part 3 - A Test Of Location
Perhaps one of the most basic questions we ask in science is "Do these two [fill in the blank with your favorite research topic] differ?" In most cases, we begin answering that question by using a statistical test that looks for differences in the location of the measure of central tendency for the two "things". The most commonly used test for differences in central tendency between two samples is the t-test.
The t-test should only be applied to independent samples. This means the selection of a member of one group has no influence on the selection of any member of the second group. If you do not understand what this means, you need to ask me about it, because this concept of independence between or among samples is a very basic assumption in many of the statistics commonly used in the biological sciences.
The t-test incorporates not only a measure of central tendency (the mean) for the samples, but also measures of dispersion (the standard deviation) and the total number of observations in each sample (the sample size). Our ability to detect a true difference between two samples depends not just on the absolute difference between the means, but on the amount of dispersion and the size of our sample. In general, for the same difference between means, as variability decreases, our ability to detect a statistically significant difference increases. Similarly, as sample size increases, our ability to detect a difference increases.
Once our arithmetic manipulations are completed, the result is applied to a table of values (often called a "values of t" table, see the attached sheet). The table has several columns of values, or levels of significance. The researcher must decide how much flexibility can be allowed; this will depend, in part, on the degree of accuracy of the measuring instrument. Most scientific research sets the significance level at .05. This means that, if the two population means were actually alike, there would be only a 5 percent chance of mistakenly declaring them different.
The t-test
The formula for the t-test was developed by W.S. Gosset in the early 1900s. As a brewery employee, he developed a system of sampling the brew as a periodic check on quality. He published his work under the pen name "Student," which explains why you sometimes hear this formula called the Student t.
In order to proceed with a t-test, the "standard error of the difference" between the means of the two groups must be determined. Just as the name implies, this is a measure of variability in our estimate of the difference between the two means. To illustrate the t-test, look at the following table. The numbers represent average fish lengths (in mm) from an urban and a rural stream.
Stream / Mean Length (mm) / Standard Deviation / Sample Size
Urban Stream / 80 / 9.16 / 30
Rural Stream / 84 / 7.83 / 30
At first glance this would seem to suggest fish in the rural stream are larger. But we must ask the question: is the 4 mm difference a chance difference, or does it really represent a true difference? If we were to take another sample from each stream and do it again, would we get the same results?
To answer this question, you must first compute the standard error of the difference (sdiff). The formula given below is appropriate for independent samples, if the distributions are normal. Once you compute the standard error of the difference, you can determine whether a true (statistically significant) difference exists with the aid of the t-test and tables, or an appropriate computer program. If we plug in our numbers from the data table, we find that the standard error of the difference for our 4 mm length difference is 2.20 mm. The next step in the process is to use our value for sdiff to calculate a t-score. The formula for the t-score is also given below.
As mentioned above, in order to determine if there is a true difference between the means, you need a table of values of t. You also need to know the degrees of freedom used in the problem. Degrees of freedom are determined by adding the number of scores in each group and subtracting 2, [n1 + n2 - 2].
Looking at the attached sheet, the left hand column is labeled df (for degrees of freedom). Look down that column until you find the appropriate df for your problem and read across to the column labeled .05 for two-tailed tests. If the t value you calculated is larger than the .05 figure, there is a true (statistically significant) difference between the means.
For the example above, the degrees of freedom would be 58 (n1 + n2 - 2 = 30 + 30 - 2). Looking down the degrees of freedom column, there is no row 58, so we drop to row 50, the more conservative alternative. The two-tailed values in this row are .05 = 2.009, .01 = 2.678 and .001 = 3.496. Our t value was 1.82, which is smaller than 2.009, so we can say there is not a significant difference between the means at the .05 (5%) level. It should be noted that this only tells us whether there is a significant difference, not the direction of any difference; concluding the latter requires more than the intent of this survival manual.
Formula to Calculate the Standard Error of the Difference

For two independent samples, the standard error of the difference is:

sdiff = sqrt( (s1)²/n1 + (s2)²/n2 )

where s1 and s2 are the sample standard deviations and n1 and n2 are the sample sizes.
Formula to Calculate t-scores

t = (mean1 - mean2) / sdiff

where mean1 and mean2 are the means of the two samples.
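Putting the pieces together, here is a minimal Python sketch of the whole calculation for the fish-length example. It uses only the summary statistics from the table and the two formulas above:

    import math

    # Summary statistics from the fish-length table
    mean_urban, sd_urban, n_urban = 80, 9.16, 30
    mean_rural, sd_rural, n_rural = 84, 7.83, 30

    # Standard error of the difference between the two means
    s_diff = math.sqrt(sd_urban**2 / n_urban + sd_rural**2 / n_rural)

    # The t-score: the difference between the means divided by
    # the standard error of that difference
    t = (mean_rural - mean_urban) / s_diff

    # Degrees of freedom: n1 + n2 - 2
    df = n_urban + n_rural - 2

    print(round(s_diff, 2), round(t, 2), df)   # prints 2.2 1.82 58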
Part 4 - Correlation
Even though we often think of statistical techniques as a way of telling if two things, groups or sets of data are different, we can also use statistical analysis to ask if two things are related. For example, we might want to know whether the rate of growth in a bacterial colony is related to temperature. Alternatively, we might be interested in the relationship between the density of two types of plants that use similar types of resources. These and many other questions can be tested using a statistical process called correlation.
There are several different formulas to determine correlation. The most frequently used is the Pearson product moment correlation, named after Karl Pearson. This procedure requires that the numbers be at least interval in nature. It also requires that the data be paired.
We can use the bacterial colony question as an example of what is meant by paired data. Each member of the pair must come from a common source. In this instance, the number of colonies counted and the temperature both come from the same petri dish. The temperature in a given dish is one member of the pair, and the number of bacterial colonies is the other. The pair of scores for dish #1 (see the next table) is 100 (number of colonies) and 42 (temperature).
Dish Number / Number of Colonies / Temperature
1 / 100 / 42
2 / 70 / 34
3 / 85 / 34
4 / 75 / 32
5 / 65 / 30
6 / 60 / 29
From the data above it can easily be seen that the warmer the dish, the greater the number of colonies. Knowing this may be adequate for your purposes, or you may wish to have a more precise measure of the degree of relationship. You may even be interested in graphically showing this phenomenon.
Scattergram
The graphical procedure to show correlation is called a scattergram. To develop a scattergram you first place one set of the numbers in a column in descending order and label it X. In our example of temperature and bacterial growth, we can label the number of colonies as the X column. The column of the paired figures (temperature) is listed next to the X column and labeled the Y column. Do not break up the pair. Except in the instance of a perfect positive correlation the Y column will not be in perfect descending order.
On graph paper, a horizontal scale is marked off for the X column near the bottom of the page. The values increase from left to right. A vertical scale is drawn to intersect one line to the left of the lowest X value. The vertical scale represents the temperature in the Y column. The horizontal scale is the X axis and the vertical scale the Y axis. The length of the scale on the X axis should be about equal to the length of the scale on the Y axis. There is no rule that requires this; however, the practice helps space the scattergram, which makes it easier to interpret. Each pair of figures is plotted on the graph. A dot is placed where the X and Y values intersect for each pair of figures. The dots are not connected.
Graph the data presented above on a sheet of graph paper. If you have done this correctly, the dots should form nearly a straight line from lower left to upper right. This configuration would be considered a positive correlation, and the more nearly the dots form a straight line, the stronger the relationship between the two variables.
If the dots had formed a pattern from upper left to lower right, the correlation would have been regarded as negative; that is to say, there was indeed a relationship, but the conclusion would be that the variables plotted had an inverse effect on each other. If we had continued to increase the temperature above 50, we might have seen a decline in the number of colonies as we exceeded the temperature tolerance of the bacterial species we were testing. If we had only graphed the response to temperatures above 50, we might have observed a negative relationship. If the dots appeared randomly over the scattergram, you would interpret this as no relationship between the X and Y column measurements.
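If you would rather let the computer draw the scattergram, here is a minimal Python sketch using the matplotlib plotting library (assumed to be installed) and the dish data from the table above:

    import matplotlib.pyplot as plt

    # Paired data from the bacterial colony table
    colonies = [100, 70, 85, 75, 65, 60]     # the X column
    temperature = [42, 34, 34, 32, 30, 29]   # the Y column

    # Place one dot where the X and Y values intersect for each
    # pair of figures; the dots are not connected
    plt.scatter(colonies, temperature)
    plt.xlabel("Number of colonies (X)")
    plt.ylabel("Temperature (Y)")
    plt.show()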
Computation
As mentioned above, there are several formulas for determining correlation. At this point we shall only be concerned with the raw score method. The formula is:

r = [ nΣXY - (ΣX)(ΣY) ] / sqrt( [ nΣX² - (ΣX)² ][ nΣY² - (ΣY)² ] )

where n is the number of pairs and the sums are taken over the paired X and Y values.
A correlation coefficient always falls between +1.0 and -1.0; it cannot be anything else. The closer to +1.0, the more nearly perfect the positive relationship between the two sets of numbers; that is, when one figure goes up, the corresponding paired number also goes up, and in like manner, when one of the paired numbers goes down, so does the other. A correlation close to -1.0 supports the opposite conclusion: when the first number goes up, its pair goes down, or vice versa. The closer to zero, the weaker the relationship.
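Here is a minimal Python sketch of the raw score formula applied to the dish data from the table above:

    import math

    # Paired data from the bacterial colony example
    x = [100, 70, 85, 75, 65, 60]    # number of colonies
    y = [42, 34, 34, 32, 30, 29]     # temperature

    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    # The raw score formula for the Pearson correlation
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

    print(round(r, 2))   # prints 0.94 -- a strong positive correlation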
Part 5 - Contingency Table Analysis
Introduction
The types of statistics you have seen so far allow you to describe measures of central tendency and dispersion (means, medians, ranges, standard deviations, etc.), test for differences among populations (the t-test), or look for relationships between two measures (correlation). One additional group of statistical techniques that is often useful in ecological studies consists of tests that allow us to compare the frequency of observations in various categories between two different sets of observations. Our comparisons may be between a set of real-world observations and some expected distribution of observations, or we may want to compare frequencies of counts between two sets of real-world observations drawn from different areas or populations.
You have already encountered the first type of comparison in your introductory biology courses when you compared observed phenotypes in a population to those that would be expected if the population were in Hardy-Weinberg equilibrium. This general class of tests is known as a "goodness-of-fit" test since we are attempting to determine if our observations "fit" our expectations. The most common goodness-of-fit test is the chi-square goodness-of-fit test.
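As a quick sketch of how this looks in practice, here is a Python example using the scipy library (assumed to be installed). The counts are hypothetical: 120 offspring scored against an expected 3:1 dominant-to-recessive ratio:

    from scipy.stats import chisquare

    # Hypothetical phenotype counts in a sample of 120 offspring
    observed = [84, 36]

    # Expected counts under a 3:1 ratio: 90 and 30
    expected = [120 * 0.75, 120 * 0.25]

    result = chisquare(f_obs=observed, f_exp=expected)
    print(result.statistic, result.pvalue)   # about 1.6 and 0.21

Since the p value is well above .05, we would conclude that these hypothetical counts "fit" the expected ratio.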
The second type of comparison is more common in ecological field studies. In this case, we usually have two variables that we have used to classify our observations. Each variable has two or more categories. Based on these categories we can set up a table, with n columns (where n = the number of categories for the first variable) and p rows (where p = the number of categories for the second variable). Each observation can then be assigned to one of the cells in our table. The question we hope to answer is whether the value of the column variable has any influence on the value of the row variable. Statistically, we are asking the following question:
"Is the value of our second variable independent of the value for our first variable?"
This type of test is known as a chi-square contingency table analysis. The null hypothesis for contingency table tests is "The value of our second variable is not influenced by (i.e., is independent of) our first variable." If this seems a little fuzzy, hang on while we run through an example. If it still seems fuzzy after that, come and talk to me about it!
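As a quick preview of the computation, here is a minimal Python sketch using the scipy library (assumed to be installed). The counts are hypothetical: the presence or absence of a plant species in burned versus unburned plots:

    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows are species present/absent,
    # columns are burned/unburned plots
    table = [[30, 10],
             [15, 25]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p, dof)

If the p value falls below .05, we reject the null hypothesis and conclude that the two variables are not independent.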