Statistics with Excel

Statistics with Excel

Statistical Tests in Excel

This handout is by no means comprehensive, but does include some tests you might use in analyzing your nature experience data in Excel.

In statistical tests, a test statistic is calculated and compared to the critical value of that statistic for the degrees of freedom and desired significance level. The probability of obtaining a particular value of the test statistic by chance alone is calculated. If the test statistic is higher than the critical value, the probability of getting that result by chance is very small (a 5% or less chance), and we can reject the null hypothesis. If the test statistic is less than the critical value, the probability of calculating that value will be higher than 5% and the null hypothesis can not be rejected.

Parametric tests: t test and ANOVA

The following tests are parametric. This means that they require several assumptions to be made. One assumption is that the data are normally distributed. Another is that the variances are equal (unless otherwise indicated). Parametric methods work best with continuous variables. Discontinuous variables, such as counts, or derived variables may be used if they meet the assumptions. Sometimes transforming the data by taking the log or square root may make the data meet the normality assumption.

T test: testing for differences between two groups

T test: paired two-sample for means, Use when natural pairing of observations in the sample, such as when a group is sampled before and after an experiment. Assumes the variances of both populations are equal (although the help section of Excel states the opposite).

T test: Two sample assuming equal variances.

This test is for two independent groups of data. It assumes that both groups have equal variances.

T test: Two samples assuming unequal variances.

This test is for two independent groups of data. It assumes that the two groups have unequal variances.

Example of a t test:


After choosing a t test, a window appears:

Variable 1 range: choose cells containing data for first group

Variable 2 range: choose cells containing data for second group

Hypothesized mean difference: default is 0, but it can be set to any value.

Labels box: check if you included the cells containing the data labels. If not checked, and the labels were included in the ranges, an error message of "Non-numeric data included" will appear.

Alpha: default is 0.05 (5%). This is the significance level.

The output options are three buttons (choose one):

Output range: must specify the cell where you want the output to begin (it will take up three columns and up to thirteen rows).

New worksheet ply: may type in name you want for output sheet

New workbook: if you choose this option, the output will be placed in a separate file.

Output from the t test

The output from the t test is shown below. The most important information is the t statistic and the p value (probability value). The t stat (t statistic) is the test statistic calculated for the t test. It is compared to the critical value of the t statistic. The p value is the probability of having as high a value of the t test statistic by chance alone. Values above 0.05 correspond to non-significant results and the null hypothesis can not be rejected. Values of 0.05 or less correspond to significant results, leading to the rejection of the null hypothesis of no significant differences between the two groups. When running the test, you must decide if your alternative hypothesis is that the means of the two groups differ, in which case you would use the p value and critical t value for the two-tailed test. If your alternative hypothesis is that the two groups differ in a specific way (the variable being measured is either higher or lower in value in one group than the other), you would use the one-tailed test.

Table 1. Results from the paired t test.

t-Test: Paired Two Sample for Means
Variable 1 / Variable 2
Mean / 4 / 3.5
Variance / 1.428571429 / 2.28571429
Observations / 8 / 8
Pearson Correlation / 0.632455532
Hypothesized Mean Difference / 0
df / 7
t Stat / 1.183215957
P(T<=t) one-tail / 0.137672774
t Critical one-tail / 1.894577508
P(T<=t) two-tail / 0.275345549
t Critical two-tail / 2.36462256

ANOVA: Testing for differences among three or more groups.

Excel is a bit limited in its ability to perform ANOVAs. There are three choices.

ANOVA: single factor analysis

Use when testing if the groups differ by a single factor and the data are continuous, such as length or weight, or if the data are discontinuous with a range of at least 30.

ANOVA: Two-factor with replication

This test allows you to analyze three or more groups that differ by two factors, such as food type and physical state. Each factor has more than a single value.

ANOVA: Two-factor without replication

This test is similar to the previous test, except for each factor combination, there is only a single value.

Example of performing a single factor ANOVA in Excel:

ANOVA: single factor analysis

Choosing Tools, Data Analysis, ANOVA: Single Factor, brings up a window:


Input range: Choose the cells containing the data from the groups. The buttons below the input range allow you to indicate whether the data are grouped by columns (one column per group) or rows (one group per row).

The rest of the window is similar to the t test window.

Output for the ANOVA:

The output from the single factor ANOVA is shown below. A brief description is provided, followed by the results of the ANOVA. The most important information is the F statistic and the p value. P values less than 0.05 mean that the groups significantly differ. P values above 0.05 correspond to no significant differences among the groups. The results of the ANOVA below should reported like this, "Groups 1-3 significantly differed in their weights (F2,21= 4.10, p<0.05)" or "The three groups showed significant weight differences (F2,21= 4.096, p=0.03)".

Table 2. Results for the single factor ANOVA.

ANOVA: Single Factor
SUMMARY
Groups / Count / Sum / Average / Variance
Group 1 / 8 / 32 / 4 / 1.42857143
Group 2 / 8 / 28 / 3.5 / 2.28571429
Group 3 / 8 / 1236 / 154.5 / 44384
ANOVA
Source of Variation / SS / df / MS / F / P-value / F crit
Between Groups / 121204 / 2 / 60602 / 4.09586308 / 0.03148292 / 3.46679485
Within Groups / 310714 / 21 / 14795.9048
Total / 431918 / 23

Example of performing a two-factor ANOVA with replication in Excel:

In order to perform a two-factor ANOVA, your data must be arranged in a matrix with one factor in columns and the second factor in rows. Here is a sample data set which is divided by two factors: physical state (liquid and solid) and food type (lipid, protein, and carbohydrate).

Lipid / Protein / Carbohydrate
Liquid / 59 / 28 / 30
62 / 19 / 28
75 / 26 / 40
Solid / 2 / 57 / 19
26 / 34 / 42
13 / 39 / 23

Choosing Tools, Data Analysis, ANOVA: Two Factors, brings up a window:


Choose the input range by selecting the cells containing the data and the labels. Start at the cell just above the label “Liquid” and to the left of the cell containing the label “Lipid”. Type in the number of rows per sample (in this case, 3). Choose “new worksheet ply” and type in a name. Your ANOVA window should now look something like this:


The output looks like this:

Table 3. Output from the 2 factor ANOVA with replication.

ANOVA: Two-Factor With Replication
SUMMARY / Lipid / Protein / Carbohydrate / Total
Liquid
Count / 3 / 3 / 3 / 9
Sum / 196 / 73 / 98 / 367
Average / 65.3333333 / 24.3333333 / 32.6666667 / 40.7777778
Variance / 72.3333333 / 22.3333333 / 41.3333333 / 386.194444
Solid
Count / 3 / 3 / 3 / 9
Sum / 41 / 130 / 84 / 255
Average / 13.6666667 / 43.3333333 / 28 / 28.3333333
Variance / 144.333333 / 146.333333 / 151 / 275.5
Total
Count / 6 / 6 / 6
Sum / 237 / 203 / 182
Average / 39.5 / 33.8333333 / 30.3333333
Variance / 887.5 / 175.766667 / 83.4666667
ANOVA
Source of Variation / SS / df / MS / F / P-value / F crit
Sample [rows, i.e., state] / 696.888889 / 1 / 696.888889 / 7.23831506 / 0.01965488 / 4.74722128
Columns [food type] / 256.777778 / 2 / 128.388889 / 1.33352568 / 0.29993738 / 3.88529031
Interaction / 3881.44444 / 2 / 1940.72222 / 20.1575303 / 0.00014566 / 3.88529031
Within / 1155.33333 / 12 / 96.2777778
Total / 5990.44444 / 17

The first part describes the groups by physical state (liquid, solid) and by type of food (lipid, protein, or carbohydrate). The second part is the ANOVA table. It calculates an F value for each of the two factors: physical state and food type. Sample is the physical state, columns are the food types. It also calculates an F statistic for the interaction between physical state and food type. In this example, the number of ants feeding on liquids and solids were significantly different from each other (p = 0.0197), but the number of ants feeding on lipids, proteins, and carbohydrates were not (p = 0.2999). The interaction was significantly different from zero. This means that the response of the ants to one factor (such as carbohydrates) depended upon the second factor (liquid or solid).

Nonparametric tests

Non parametric tests have fewer assumptions about the data. Specifically, they do not require the data to be normally distributed or to have equal variances. They may be used with any type of variable: continuous, discontinuous, ordinal, ranked, or derived. They are less powerful than parametric methods which means that sometimes they will not be able to detect significant differences among groups when they really exist. Excel does not seem to offer nonparametric tests, but they are relatively simple to compute by hand.

A goodness of fit test: The Chi-square (2) test

The chi-square (2) test is a nonparametric test used to analyze frequency distributions of discrete variables. The data are usually presented in tables showing the expected and observed frequencies for various categories. The null hypothesis is that the observed and expected frequencies are not different from each other. The alternative hypothesis is that they do differ.

The calculations for the 2 test are relatively easy (see p. 15 in Brower et al. 1998). This is a test in Excel that is NOT performed by using the menus. The order in which you do the test is backwards from how you would do it by hand, First, you must have your expected and observed values in two columns or rows, as in the example below.

A / B / C
27 / Substrate / observed / expected
28 / Sand (50%) / 8 / 15
29 / Gravel (30%) / 18 / 9
30 / Silt (20%) / 4 / 6

Second, you choose the function "chitest". The formula is "=chitest(observed range, expected range)". The result is the probability value for the 2 test statistic. If this is below 0.05, your expected and observed values significantly differ. Third, you choose the function "chiinv". The formula is "=chiinv(probability, df)". The probability is the value calculated by the first function, chitest. The df is the degrees of freedom of the test, which equals the number of categories minus 1 (in this case, 3-1 = 2). The formulas and results are shown below.

=CHITEST(B28:B30,C28:C30) / 0.00155439843514915
=CHIINV(A33,2) / 12.9330437630415

Tests for association between variables

We will briefly examine two tests for examining associations between variables, correlation and regression analysis. Correlation is used to measure the association between variables that are assumed to covary without a causal relationship. Regression is used when variation in one variable is hypothesized to cause variation in another variable.

Correlation


If two variables are hypothesized to vary with one another but one isn’t dependent on the other, the degree of association, or correlation between them can be measured.. The correlation coefficient is unitless and ranges from -1 to 1. To calculate the correlation coefficient, have your data in columns (or rows). Choose Tools…Data Analysis and then click on Correlation in the window provided.

Your output will look something like this:

tail length (cm) / weight (kg) / height (cm)
tail length (cm) / 1
weight (kg) / 0.862171126 / 1
height (cm) / 0.678133829 / 0.89102804 / 1

Once the correlation coefficient is calculated for two variables, you must determine if it is significantly different from zero (no correlation). To do this you must calculate a t statistic for r using Equation 1 below.

Equation 1.

where

The critical value of t is associated with n-2 d.f. and an alpha of 0.05. You can look up the critical value of t in the table in Brower et al. (1998) or you can use Excel to find it. The function "tinv" will return the critical t value for a particular alpha level and df. The function uses two arguments: the alpha level and the df, calculated as n-2. The function "tdist" will return the probability value for your calculated t value, specified alpha level and df.

Regression

Regression analysis is used when one variable (the independent variable) is thought to be causal upon another variable (the dependent variable). For example, in plants, stem diameter is assumed to be dependent on age and not vice versa, so diameter is the dependent variable and age is the independent variable. Regression analysis derives a line that best fits the data. This line is known as the regression equation or trendline in Excel. The simplest type of regression is a linear one, which follows the form:

y = a + bx

where y is the dependent variable, x is the independent variable, a is the Y-intercept, and b is the slope of the line, also called the regression coefficient.

Regression in Excel can be performed in two ways. The simplest way is to plot the data and apply a trendline, displaying the equation and R2 value. This is the only way to choose an equation other than a linear one. The R2 value shows how much of the variation in the dependent variable is due to variation in the independent variable. The information obtained by using a trendline is limited. A more complete analysis can be obtained by choosing Tools…Data analysis and then selecting Regression in the window provided. A window will appear with many options. The minimum input you need is the Y and X cell ranges and where you want your output to go. Choosing the line plot option is advisable, this is a graph with the data points and expected points (based on the regression equation).


The output is shown on the next page.