Simple Statistical Analyses

There are very few simple yes-or-no questions in biology. Most biological questions are answered by gathering quantitative data (rainfall rates, potassium influx rates, the area of an animal’s home range, etc.) and making INFERENCES from this information. When you begin working with quantitative data you encounter two sources of potential error:

1) biological variability (no two individual animals are exactly alike) and
2) random events (or chance occurrences).

How do you determine whether your quantitative measurements accurately describe the “real” biological situation you are studying? The basic theme of this laboratory is the use of several simple statistical tests: first, to extract worthwhile information about a POPULATION when you only have data from a small portion (SAMPLE) of the entire population; and second, to determine whether you can say with some degree of confidence that two samples come from different (or similar) populations. This question is encountered very frequently in many branches of science.

A population is considered here to be some set of items which are all similar and which reside within a definable boundary. It is up to the investigator to decide what “population” she is going to work with. Examples of defined populations include: all residents of Kāneʻohe between the ages of 30 and 60; all pineapple plants in a 5-acre plot; the heights of all trees on a particular site; the monthly rainfall values at UH over the last 10 years; etc. Note that either individuals or measurements made on individuals may compose a population. Since it is usually impractical, or even impossible, to measure all of the items under study in a population (i.e., a CENSUS), you generally try to describe the characteristics of a population by taking measurements on a portion of the population. That portion is your SAMPLE. The methods used to obtain unbiased samples will be dealt with extensively below; for the moment you need not worry about how you get your sample, but as you are sampling in this exercise think about the possible sources of bias. Once the sample is obtained, you will estimate the values of parameters that describe the population and assign some CONFIDENCE LEVEL to these descriptions. Then you will compare your population estimates to determine if they are different.

II.  Making the Measurements

In the lab you will be provided with several bags containing Koa Haole (Leucaena leucocephala (Lam.) de Wit) seedpods. Two of the bags of seedpods were collected from one tree while the third bag of seedpods came from a different tree. The goal of today’s exercise is to

1) determine which two bags came from the same tree and which bag came from a
different tree.

To do this you will make measurements and observations of the seedpods and analyze
the data you collect.

BUT there are far too many seedpods in each bag for you to measure them all (i.e. do a
census) so you will have to work with only some seedpods from each bag – which ones??

This will be the second goal of today’s exercise:

2) How can you obtain an unbiased sample of seedpods to measure from the population of seedpods in each bag?

Each bag will be coded with a letter and you will collect and analyze samples from all three bags. Your first chore will be to determine how to get an unbiased sample from the bag. This is not an easy task. Your group should discuss this among yourselves and try to

1) think about possible sources of bias in getting the sample

2) devise ways to avoid or reduce each of these sources of bias (one possible approach is sketched below).
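If your group settles on numbering every pod and letting a random-number generator pick which ones to measure, a short script can do the picking. This is only a sketch of one possible protocol: the pod count and sample size below are made up, and drawing numbered slips of paper from a hat accomplishes exactly the same thing.

```python
import random

# Hypothetical example: suppose you have spread out and numbered all the
# pods from one bag (here, 250 of them) and want to measure 20 of them.
total_pods = 250      # you would count the pods in your own bag
sample_size = 20      # see the discussion of sample size below

# random.sample() draws without replacement, so every pod has the same
# chance of being chosen and no pod can be chosen twice.
chosen_ids = sorted(random.sample(range(1, total_pods + 1), sample_size))
print("Measure the pods with these ID numbers:", chosen_ids)
```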

Your next chore will be to determine what the sample size should be. Obviously the total number of seedpods in the bag is too large for you to be able to measure them all (also remember the contents of each bag are themselves a sample [of the tree]). On the other hand too small a sample may not give you a good estimate of the population parameters. Can you devise an objective mechanism to tell you what an appropriate sample size should be? What information would you need to know before you can answer that question?
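One objective approach (certainly not the only one) is to watch how the running estimates of the mean and standard deviation settle down as you add pods to the sample: when measuring more pods no longer changes the estimates much, the sample is probably large enough for your purposes. The sketch below illustrates the idea with an invented "bag" of simulated pod lengths; the numbers are assumptions, not real data.

```python
import random
import statistics

# Simulated "bag": 400 pod lengths (cm) drawn from a bell-shaped distribution.
# These values are invented for illustration only; your bag is the real thing.
random.seed(1)
bag = [random.gauss(14.0, 2.5) for _ in range(400)]

# Draw pods one at a time (without replacement) and watch how the running
# mean and standard deviation change as the sample grows.
order = random.sample(range(len(bag)), len(bag))
for k in (5, 10, 20, 40, 80, 160):
    sample = [bag[i] for i in order[:k]]
    print(f"n = {k:3d}  mean = {statistics.mean(sample):5.2f} cm  "
          f"sd = {statistics.stdev(sample):4.2f} cm")
```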

Before you begin, check with the TA or teacher to go over your sampling protocol.

There are many traits you could determine for the seedpods, including chemical and physical factors. For the purposes of this exercise you will limit the traits measured to two: one continuous (seedpod length) and one discrete (number of seeds per pod). Measure the length of each pod in metric units (cm) and also record the number of seeds in that pod. Record all pertinent data in your lab notebook.

III.  Sample Population Statistics

It is always important to get a “feel” for your data before you spend much time on more complex analyses. Simply looking at the data (as in a list of measurements) isn’t likely to be much help, so graphic visualizations are used. A very basic type of graphic visualization is the FREQUENCY HISTOGRAM. To make a frequency histogram you need to group the measurement data into discrete intervals. This is more straightforward for discrete variables (such as number of seeds per pod) but may require some thinking for continuous variables (such as pod length). The choice of interval is up to you, but there is usually some optimum number of intervals that maximizes the visual information available. This, of course, may differ in different circumstances. But remember, if you want to make comparisons between data sets you may want to use the same X axis for all of them. Tabulate your data values in the interval categories to get the number of measurements in each category. These are then plotted as a histogram with the interval along the horizontal (X) axis and the frequency along the vertical (Y) axis.
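If you would rather let a computer do the tallying, the following sketch groups a set of made-up pod lengths into 1-cm intervals and prints a crude text histogram. The interval width and the data values are assumptions for illustration; graph paper or a spreadsheet works just as well.

```python
from collections import Counter

# Hypothetical pod lengths in cm (placeholders, not real measurements).
lengths = [11.2, 13.5, 14.1, 12.8, 15.0, 13.9, 16.3, 12.1, 14.6, 13.2,
           15.7, 14.9, 11.8, 13.3, 14.4, 12.6, 15.2, 13.8, 14.0, 12.9]

interval = 1.0  # width of each interval in cm; try other widths and compare

# Assign each measurement to the lower edge of its interval, then count.
bins = Counter(interval * int(x // interval) for x in lengths)
for edge in sorted(bins):
    label = f"{edge:4.1f}-{edge + interval:4.1f} cm"
    print(f"{label}  {'#' * bins[edge]}  ({bins[edge]})")
```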

For many biological samples, the resulting frequency distribution is often “bell shaped”. This bell-shaped curve is informally referred to as a NORMAL DISTRIBUTION (the real definition of a normal distribution is given by the relationship between two parameters, which we will discuss below). Why should most measurements on biological material be distributed in such a manner, with most of the measurements clustering about an “average” value, but with a few extreme values on either side? The answer lies partly in the genetic variability that is inherent in all biological material. No two individuals are genetically exactly alike. While closely related individuals have many genes in common, slight genetic differences do exist. Thus, if the fur color of rabbits has a genetic basis, then a good camouflage color like brown will be the most common color in the population. However, a few rabbits will carry genes for both lighter and darker coat color, especially if fur color is under the control of several genes. But you also see normal distributions in situations where genetics (or even biology) plays no part, so there is more (lots more) to this tendency for values to be distributed this way.
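A quick way to convince yourself that bell-shaped curves do not require biology is to add up many small, independent chance events. The sketch below (simulated dice, nothing more) tallies the sum of ten dice thrown 2000 times; the totals pile up around the middle and tail off at the extremes.

```python
import random
from collections import Counter

# Each "measurement" is the sum of 10 dice; no genetics or biology involved,
# yet the histogram of totals comes out roughly bell shaped.
random.seed(2)
totals = Counter(sum(random.randint(1, 6) for _ in range(10))
                 for _ in range(2000))

for value in sorted(totals):
    print(f"sum = {value:2d}  {'#' * (totals[value] // 10)} ({totals[value]})")
```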

The nice thing about the normal distribution is that it is symmetrical, so that any particular normal distribution can be completely specified (or described) if you know just two parameters: the central value, the MEAN (or average), and the distance from the mean to the point of inflection, the STANDARD DEVIATION. You are very familiar with the concept of a mean or an average. The standard deviation is basically a measure of how broadly the values are scattered around the mean.

Statistical analyses dealing with normal distributions (so-called parametric analyses) rely heavily on these two parameters, the mean and the standard deviation (or its square, the VARIANCE). The majority of this exercise will be devoted to the methods of calculating these statistics and seeing what can be done with them. But you must always remember that parametric statistics should only be used when the distribution of data is known to be normal. (See Ch. 6 in Sokal & Rohlf for a more extensive discussion of the normal distribution; Ch. 4 & 5 discuss other useful descriptive statistics and distributions.)

IV.  Calculation of Sample Mean and Standard Deviation


If the distribution of sample values is normal, the mean and standard deviation may be calculated from the sample measurements by following the procedures given below.

1.  The mean

$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$

where $\bar{X}$ is the symbol for the mean,

$X_i$ is the $i$th individual measurement,

$n$ is the total number of measurements (the sample size), and

$\sum$ is a symbol which means “the sum of all measurements from $i = 1$ to $i = n$”.

2.  The standard deviation

$s = \sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n-1}}$

or a similar formula obtained through algebra, which is more convenient when using calculating machines,

$s = \sqrt{\dfrac{\sum X_i^2 - \left(\sum X_i\right)^2 / n}{n-1}}$

where $s$ is the symbol for the standard deviation and the rest of the notation is the same as given for the mean.

Many pocket calculators are programmed to calculate the mean and standard deviation directly from your data but you should study these formulae to get an understanding of the meaning of these statistics. Other statistics (e.g. mode, median, range, etc.) also have their uses but you will use them less in this lab.
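To see that the two standard-deviation formulas really do give the same answer, here is a short sketch that computes the mean and both forms of $s$ for a handful of made-up pod lengths; the data values are placeholders, and Python’s statistics module is used only as a check.

```python
import math
import statistics

# Made-up pod lengths in cm (placeholders, not real measurements).
x = [11.2, 13.5, 14.1, 12.8, 15.0, 13.9, 16.3, 12.1, 14.6, 13.2]
n = len(x)

mean = sum(x) / n                # X-bar = (sum of Xi) / n

# Definitional form: sum of squared deviations from the mean.
s_def = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

# "Machine" form: uses only the sums of Xi and Xi^2, no deviations needed.
s_machine = math.sqrt((sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1))

print(f"mean = {mean:.3f} cm")
print(f"s (definition) = {s_def:.3f} cm")
print(f"s (machine)    = {s_machine:.3f} cm")
print(f"s (library)    = {statistics.stdev(x):.3f} cm")  # agrees with both
```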

V.  Comparing Two Populations

In almost all cases where you compare data obtained from two samples they will be different, even if they come from the same population. Imagine you flip a coin 10 times and keep a record of the results (e.g. H T T H T etc.); it is unlikely (and more and more unlikely as you flip the coin more and more times) that a second set of flips will give you the same sequence of heads and tails. BUT what about the mean number of heads? If it is an honest coin you expect that on average you will get about 50% heads in a sample of appropriate size (and exactly 50% heads in a sample of infinite size). But in two sets of 10 coin tosses are you always going to get 5 heads and 5 tails, even if it is an honest coin with no bias in the tossing? There are statistical tests to help you determine whether two samples that are somewhat different may in fact have come from the same population. An assumption of the tests you will use in this exercise is that the underlying distributions are normal. These parametric tests are based on asking the question
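A quick simulation makes the point: even with a perfectly honest coin, two sets of 10 tosses almost never match toss-for-toss, and they frequently do not even contain the same number of heads. The sketch below is illustrative only.

```python
import random

# Two sets of 10 tosses of an honest coin: how often do the counts of heads
# match, and how often does the whole sequence match?
random.seed(3)
trials = 10_000
same_count = same_sequence = 0
for _ in range(trials):
    a = [random.choice("HT") for _ in range(10)]
    b = [random.choice("HT") for _ in range(10)]
    same_count += (a.count("H") == b.count("H"))
    same_sequence += (a == b)

print(f"same number of heads: {same_count / trials:.1%} of trials")
print(f"identical sequence:   {same_sequence / trials:.2%} of trials")
```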

“What are the chances that two unbiased samples taken from the same population will differ by the amount that my two samples differ?”

Note there are actually two populations under consideration here: the real-world population that you took your samples from, and an ideal population with certain known features. The tables you will use (or can find in any statistics book) are based (at least theoretically) on drawing many samples of a given size (n) from this ideal population and determining how much difference is found. That is (in theory), the table-maker takes thousands of samples of size 2, thousands of size 3, thousands of size 4, and so on and so on, and gets the distribution of the parameters (say mean and standard deviation) of each set of 1000 samples. Since the table-maker knows that the samples were drawn from the same population, the differences she finds are those expected when two (or more) samples are drawn from what is really the same source population. As you would expect, most of the pairwise differences are small (e.g. for most pairs of means the differences are not very big), but out of 1000 samples you would be sure to find a couple of means that were pretty different by chance alone.

The tables you will use are constructed such that for two samples (so far you are only considering pairwise tests, but all this can be generalized to more than two samples) of a specific size (n), drawn from the same population, you can look up the chance (probability) that the difference between your two samples from the real world is bigger than you would expect if they had been drawn from the same population. So to enter the table you need three things: some statistic (a single number) that compares your two samples (in this exercise you will use one ratio and one difference); the sample size [or actually the degrees of freedom (in our case df = n - 1)]; and the level of probability that will make you happy.

What does this happiness depend on? Think back to the ideal population and the thousands of samples (or pairs of samples with their comparative statistic). As you saw, most pairs of sample parameters will be similar (so the ratio will be close to 1 if you are looking at ratios, or the difference will be close to zero if you are looking at differences), but in a few cases the numbers will be pretty big. In 1000 pairs of samples there will be a few with a really big difference just by chance. With 1000 sample pairs you can count how many are greater than some value (what you will come to call the significance level, though significant to whom or what is never really clear). You will find that in 1000 sample pairs only 100 have a difference (or ratio) greater than some value. So then you can say: “In only 10% (100 out of 1000) of cases of pairs of samples drawn from the same population will the differences (or ratios) be bigger than this.” Of course you can do this for 5%, 1%, 0.01%, or whatever. {Don’t worry about the poor table-maker; all this is done by algorithms on computers today.}
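You can play table-maker yourself with a short simulation: draw many pairs of samples from one and the same (invented) population, record how far apart their means fall, and see what size of difference is exceeded only 5% or 1% of the time by chance alone. The population, sample size, and cutoffs below are all assumptions made just for this sketch.

```python
import random
import statistics

# Playing "table-maker": draw many pairs of samples (n = 15 each) from the
# SAME ideal population and record the difference between their means.
random.seed(4)
n, pairs = 15, 1000
diffs = []
for _ in range(pairs):
    a = [random.gauss(14.0, 2.5) for _ in range(n)]
    b = [random.gauss(14.0, 2.5) for _ in range(n)]
    diffs.append(abs(statistics.mean(a) - statistics.mean(b)))

diffs.sort()
# The values below which 95% (or 99%) of the purely chance differences fall.
print(f"95% of chance differences are below {diffs[int(0.95 * pairs)]:.2f} cm")
print(f"99% of chance differences are below {diffs[int(0.99 * pairs)]:.2f} cm")
```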

So the end result of all this is that you can decide on some chance of being wrong (that is, of saying the difference (or ratio) you found reflects a real difference when it is actually due to chance) that you are willing to accept, and then set that value as your level for rejecting the null hypothesis (that there is really no difference). To get this measure of the difference between the estimates of the population parameters for two samples in this exercise, you will perform two statistical tests. This will allow you to objectively state (with a certain degree of confidence, say 95% sure) whether you think your two samples are from one population or two.
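If it helps to see the arithmetic laid out, here is a sketch of computing one ratio-type statistic and one difference-type statistic for two made-up samples. The handout does not name the two tests at this point, so the variance ratio (F) and the pooled two-sample t shown here are assumptions, and the data are placeholders; you would still compare the results against the critical values in a published table at the probability level you chose.

```python
import math
import statistics

# Placeholder pod lengths (cm) for two bags; not real data.
bag_a = [13.1, 14.5, 12.8, 15.2, 13.9, 14.1, 12.5, 14.8, 13.4, 15.0]
bag_b = [15.8, 16.4, 14.9, 17.1, 15.5, 16.0, 15.2, 16.7, 14.8, 16.2]

n1, n2 = len(bag_a), len(bag_b)
m1, m2 = statistics.mean(bag_a), statistics.mean(bag_b)
v1, v2 = statistics.variance(bag_a), statistics.variance(bag_b)

# Ratio: larger sample variance over the smaller, df = (n1 - 1, n2 - 1).
F = max(v1, v2) / min(v1, v2)

# Difference: two-sample t using the pooled variance, df = n1 + n2 - 2.
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

print(f"F = {F:.2f}  (df = {n1 - 1}, {n2 - 1})")
print(f"t = {t:.2f}  (df = {n1 + n2 - 2})")
# Look these values up in a printed F or t table at your chosen level (e.g. 5%).
```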