Statistics for Biology
Descriptive Statistics
Repeated measurements in biology are rarely identical, due to random errors and natural variation. If enough measurements are repeated they can be plotted on a histogram, like the one on the right. This usually shows a normal distribution, with most of the repeats close to some central value. Many biological phenomena follow this pattern: e.g. people's heights, the number of peas in a pod, the breathing rate of insects, etc.
The central value of the normal distribution curve is the mean (also known as the arithmetic mean or average). But how reliable is this mean? If the data are all close together, the mean is probably good; if they are scattered widely, the calculated mean may not be very reliable. The width of the normal distribution curve is given by the standard deviation (SD), and the larger the SD, the less reliable the data. For comparing different sets of data a better measure is the 95% confidence interval (CI), which is derived from the SD. The purpose of taking a random sample from a population and computing a statistic such as the mean is to estimate the true mean of that population. A 95% confidence interval gives a range of values that will contain the population parameter of interest (in this case the mean) in approximately 95% of such samples, so you can be reasonably confident that the real mean lies somewhere in this range. Whenever you calculate a mean you should also calculate a confidence limit to indicate the quality of your data.
In Excel the mean is calculated using the formula =AVERAGE(range), the SD is calculated using =STDEV(range), and the 95% CI is calculated using =CONFIDENCE(0.05,STDEV(range),COUNT(range)).
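The same three quantities can be checked outside Excel. Here is a minimal Python sketch (standard library only) that mirrors the three formulas above; the repeat measurements are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_95(data):
    """Half-width of the 95% CI, matching Excel's
    =CONFIDENCE(0.05, STDEV(range), COUNT(range)),
    which uses the normal (z) distribution."""
    z = NormalDist().inv_cdf(0.975)          # approximately 1.96
    return z * stdev(data) / sqrt(len(data))

repeats = [5, 2, 0, 2, 1]                    # made-up repeat measurements
print(round(mean(repeats), 2))               # =AVERAGE(range)
print(round(stdev(repeats), 2))              # =STDEV(range), the sample SD
print(round(confidence_95(repeats), 2))      # 95% CI half-width
```

A mean of 2.00 with a CI half-width of 1.64 would be reported as 2.00 ± 1.64.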
This spreadsheet shows two sets of data with the same mean. In group A the confidence interval is small compared to the mean, so the data are reliable and you can be confident that the real mean is close to your calculated mean. But in group B the confidence interval is large compared to the mean, so the data are unreliable, as the real mean could be quite far away from your calculated mean. Note that Excel will return the results of a calculation to many decimal places. This spurious precision is meaningless, and cells with calculated results should always be formatted to a more sensible precision, usually 2 decimal places (Format menu > Cells > Number tab > Number).
see also: https://explorable.com/statistics-tutorial
Plotting Data
Once you have collected data you will want to plot a graph or chart to show trends or relationships clearly. With a little effort, Excel produces very nice charts. First enter the data you want to plot into two columns (or rows) and select them.
Drawing the Graph. Click on the Chart Wizard icon. This has four steps:
1. Graph Type. For a bar graph choose Column, and for a scatter graph (often loosely called a line graph) choose XY(Scatter), then press Next. Do not choose Line, which treats the X values as evenly-spaced categories rather than plotting them to scale.
2. Source Data. If the sample graph looks OK, just hit Next. If it looks wrong you can correct it by clicking on the Series tab, then the red arrow in the X Values box, then highlight the cells containing the X data on the spreadsheet. Repeat for the Y Values box.
3. Chart Options. You can do these now or change them later, but you should at least enter suitable titles for the graph and the axes and probably turn off the gridlines and legend.
4. Graph Location. Just hit Finish. This puts the chart beside the data so you can see both.
Changing the Graph. Once you have drawn the graph, you can now change any aspect of it by double-clicking (or sometimes right-clicking) on the part you want to change. For example you can:
· move and re-shape the graph
· change the background colour (white is usually best!)
· change the shape and size of the markers (dots)
· change the axes scales and tick marks
· add a trend line or error bars (see below)
Lines. To draw a straight "line of best fit", right-click on a point, select Add Trendline, and choose Linear. In the Options tab you can force the line through the origin if you think it should go there, and you can even display the line equation on the chart if you are interested in the slope or intercept of the trend line. If instead you want to "join the dots" (and you rarely do), double-click on a point and set Line to Automatic.
Error bars. These are used to show the confidence intervals on the graph. You must already have entered the 95% confidence limits on the spreadsheet beside the X and Y data columns. Then double-click on the points on the graph to get the Format Data Series dialog box and choose the Y Error Bars tab. Click on the red arrow in the Custom + box, and highlight the range of cells containing your confidence limits. Repeat for the Custom - box.
The standard error of the mean (SEM) is calculated by dividing the standard deviation by the square root of the number of measurements that make up the mean (often represented by N). In this case 5 measurements were made (N = 5), so the standard deviation is divided by the square root of 5. Because of this division, the standard error grows smaller as the number of measurements (N) grows larger, reflecting the greater confidence you have in your mean value as you make more measurements. You can use Excel's square root function, SQRT, in calculating this value. 2 × SEM ≈ 95% CI (the exact multiplier for large N is 1.96).
The standard deviation is a measure of the fluctuation of a set of data about the sample mean. The SEM is an estimate of the fluctuation of a sample mean about the "true" population mean. Error bars attempt to give a range of plausible values for the population mean (at that point in time), given that sample means fluctuate from sample to sample. In short: the standard deviation is a measure of the variability within a sample, while the SEM is a measure of how well the sample mean represents the whole population.
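The SD/√N relationship is easy to sketch in Python (standard library only; the five repeat values are invented for illustration):

```python
from math import sqrt
from statistics import stdev

def sem(data):
    # standard error of the mean: SD divided by the square root of N
    return stdev(data) / sqrt(len(data))

repeats = [12, 4, 5, 8, 7]            # N = 5 invented repeat measurements
print(round(sem(repeats), 2))         # SEM: shrinks as N grows
print(round(2 * sem(repeats), 2))     # 2 x SEM, the rough 95% CI half-width
```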
Problems
1. Here are the results of an investigation into the rate of photosynthesis in the pond weed Elodea. The number of bubbles given off in one minute was counted under different light intensities, and each measurement was repeated 5 times. Use Excel to calculate the means, standard deviations and 95% confidence limits of these results, then plot a graph of the mean results with error bars and a line of best fit.
light intensity (Lux) / repeat 1 / repeat 2 / repeat 3 / repeat 4 / repeat 5
0 / 5 / 2 / 0 / 2 / 1
500 / 12 / 4 / 5 / 8 / 7
1000 / 7 / 20 / 18 / 14 / 24
2000 / 42 / 25 / 31 / 14 / 38
3500 / 45 / 40 / 36 / 50 / 28
5000 / 65 / 54 / 72 / 58 / 36
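If you want to check your spreadsheet answers, the same calculations can be sketched in Python (standard library only), using the table above:

```python
from math import sqrt
from statistics import mean, stdev

# Elodea data from the table: light intensity (Lux) -> bubbles per minute
results = {
    0:    [5, 2, 0, 2, 1],
    500:  [12, 4, 5, 8, 7],
    1000: [7, 20, 18, 14, 24],
    2000: [42, 25, 31, 14, 38],
    3500: [45, 40, 36, 50, 28],
    5000: [65, 54, 72, 58, 36],
}

for lux, reps in results.items():
    m = mean(reps)
    sd = stdev(reps)
    ci = 1.96 * sd / sqrt(len(reps))   # 95% CI half-width, as Excel's CONFIDENCE
    print(f"{lux:>5} Lux: mean = {m:5.2f}, SD = {sd:5.2f}, 95% CI = +/-{ci:5.2f}")
```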
There is a bewildering variety of statistical tests available, and it is important to choose the right one. This flow chart will help you to decide which statistical test to use, and the tests are described in detail on the following pages.
Statistics to Test for a Correlation
Correlation statistics are used to investigate an association between two factors such as age and height; weight and blood pressure; or smoking and lung cancer. After collecting as many pairs of measurements as possible of the two factors, plot a scatter graph of one against the other. If both factors increase together then there is a positive correlation, or if one factor decreases when the other increases then there is a negative correlation. If the scatter graph has apparently random points then there is no correlation.
There are two statistical tests to quantify a correlation: the Pearson correlation coefficient (r), and Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect positive correlation) through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and normally-distributed use Pearson, otherwise use Spearman. In both cases the larger the absolute value (positive or negative), the stronger, or more significant, the correlation. Values greater than 0.8 are very significant, values between 0.5 and 0.8 are probably significant, and values less than 0.5 are probably insignificant.
In Excel the Pearson coefficient r is calculated using the formula =CORREL(X range, Y range). To calculate the Spearman coefficient rs, first make two new columns showing the ranks (or order) of the two sets of data, and then calculate the Pearson correlation on the rank data. The highest value is given a rank of 1, the next highest a rank of 2 and so on. Equal values are given the same rank, but the next rank should allow for this (e.g. if there are two values ranked 3, then the next value is ranked 5).
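This rank-then-correlate procedure can be sketched in Python (standard library only). The pearson function mirrors Excel's =CORREL, ranks follows the tie rule described above, and the paired masses are invented for illustration:

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation coefficient r, as Excel's =CORREL(X range, Y range)
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def ranks(values):
    # Highest value ranked 1; equal values share a rank and the next rank
    # allows for the tie (e.g. two values ranked 2, then the next is 4).
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    # Spearman's rs is just the Pearson correlation of the rank columns
    return pearson(ranks(xs), ranks(ys))

# invented masses (kg) for five breeding pairs: female, male
females = [4.2, 3.9, 4.8, 4.5, 3.7]
males   = [5.1, 4.6, 5.6, 5.2, 4.8]
print(round(pearson(females, males), 2))
print(round(spearman(females, males), 2))
```

Note that statistical packages usually give tied values their mean rank instead; with no ties the two conventions agree.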
In this example the size of breeding pairs of penguins was measured to see if there was a correlation between the sizes of the two sexes. The scatter graph and both correlation coefficients clearly indicate a strong positive correlation. In other words large females do pair with large males. Of course this doesn't say why, but it shows there is a correlation to investigate further.
Linear Regression to Investigate a Causal Relationship.
If you know that one variable causes the changes in the other variable, then there is a causal relationship. In this case you can use linear regression to investigate the relation in more detail. Regression fits a straight line to the data, and gives the values of the slope and intercept of that line (m and c in the equation y = mx + c).
The simplest way to do this in Excel is to plot a scatter graph of the data and use the trendline feature of the graph. Right-click on a data point on the graph, select Add Trendline, and choose Linear. Click on the Options tab, and select Display equation on chart. You can also choose to set the intercept to be zero (or some other value). The full equation with the slope and intercept values are now shown on the chart.
**You can also do this in Logger Pro: Analyze > Linear Fit, then look at the Correlation value.
In this example the absorption of a yeast cell suspension is plotted against its cell concentration from a cell counter. The trendline intercept was fixed at zero (because 0 cells have 0 absorbance), and the equation on the graph shows the slope of the regression line.
The regression line can be used to make quantitative predictions. For example, using the graph above, we could predict that a cell concentration of 9 × 10⁷ cells per cm³ would have an absorbance of 1.37 (9 × 0.152).
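The same fit-through-the-origin can be computed directly; in this Python sketch the calibration numbers are invented, chosen so that they reproduce the 0.152 slope quoted above:

```python
# invented calibration data: cell concentration (x 10^7 cells per cm3)
# against absorbance, with the line forced through the origin
conc = [1, 2, 4, 6, 8]
absorbance = [0.16, 0.30, 0.62, 0.90, 1.22]

# least-squares slope for a line through the origin: m = sum(xy) / sum(x^2)
m = sum(x * y for x, y in zip(conc, absorbance)) / sum(x * x for x in conc)
print(round(m, 3))        # slope of the regression line

# prediction at 9 x 10^7 cells per cm3, i.e. y = m * x
print(round(m * 9, 2))    # predicted absorbance
```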
Correlation Coefficient, r
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. The linear correlation coefficient is sometimes referred to as the Pearson product-moment correlation coefficient.
Coefficient of Determination, r² or R²
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association between x and y.
The coefficient of determination represents the proportion of the variation in y that is explained by the line of best fit. For example, if r = 0.922 then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.
The coefficient of determination is a measure of how well the regression line represents the data. If the regression line passes exactly through every point on the scatter plot, it would be able to explain all of the variation. The further the line is away from the points, the less it is able to explain.
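The arithmetic in the r² example above is easy to verify:

```python
# r squared: proportion of the variation in y explained by the line
r = 0.922
r_squared = r ** 2
print(round(r_squared, 2))         # about 0.85, i.e. 85% explained
print(round(1 - r_squared, 2))     # the remaining unexplained fraction
```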
http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm
t Test to Compare Two Sets of Data
Another common form of data analysis is to compare two sets of measurements to see if they are the same or different. For example, are plants treated with fertiliser taller than those without? If the means of the two sets are very different then it is easy to decide, but often the means are quite close and it is difficult to judge whether the two sets are the same or significantly different. To compare two sets of data use the t test, which tells you the probability (P) that there is no real difference between the two sets (this assumption of no difference is called the null hypothesis).
P varies from 0 (impossible) to 1 (certain). The higher the probability, the more likely it is that the two sets are the same, and that any differences are just due to random chance. The lower the probability, the more likely it is that the two sets are significantly different, and that the differences are real. Where do you draw the line between these two conclusions? In biology the critical probability is usually taken as 0.05 (or 5%). This may seem low, but it reflects the fact that biology experiments are expected to produce quite varied results. So if P ≥ 0.05 then the two sets are taken to be the same (i.e. fail to reject the null hypothesis), and if P < 0.05 then the two sets are significantly different (i.e. reject the null hypothesis). For the t test to work, the number of repeats should be at least 5.
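In Excel, P comes straight from =TTEST(range1, range2, 2, 3) (two-tailed, unequal variances). As a sketch of what is happening underneath, the t statistic itself can be computed with Python's standard library; the plant heights below are invented, and converting t into P requires the t distribution (for example via scipy.stats.ttest_ind):

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a, b):
    """t statistic for two independent samples with unequal variances
    (Welch's t test). The larger |t| is, the smaller P becomes."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(var_a + var_b)

# invented plant heights (cm), five repeats each, as the text recommends
fertilised = [23, 25, 28, 24, 26]
control    = [20, 22, 19, 24, 21]
t = welch_t(fertilised, control)
print(round(t, 2))
```

A t value this far from zero would give P < 0.05 here, so the fertilised plants would be judged significantly taller.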