Mission 3 Logistic regression and 2x2 tests. (28 October)

Learning goals

·  Be able to produce logistic regression graphs and bar graphs.

·  Be able to test data with a categorical (binary) response variable.

·  Be able to evaluate plots and tests with a categorical (binary) response variable.

·  Start to be aware of statistical power.

Data collection

·  Collect data for either

o  a binomial study

o  a 2x2 type study

o  a logistic regression type study

Do it 2 by 2.

·  Type the logistic regression type data into Excel. Do it in the right way! [ à Open and Save in R manual]. The 2x2 data and the binomial data are not necessary to type into Excel.

·  Exchange logistic regression excel files with your class mates. Exchange the binomial and 2x2 results on a piece of paper.

·  You will now have 3 data sets:

o  one from an binomial study

o  one from a 2x2 type study

o  one from a logistic regression type study

Mission

For each of the three data sets:

·  Create graphs in R.

·  Evaluate the graph. Do you believe in the study?

·  Test the study.

·  Evaluate the study and write a Mini-report.

Hand in

·  One Mini-report on the binomial study.

·  One Mini-report on the 2x2 type study.

·  One Mini-report on the logistic regression type study.

Mail all three Mini-reports to me, , and to your feed-back pal.

Mail your Excel data file to me.

Binomial tests: categorical response – no explanatory

This is a special case when you just want to test if one response outcome is more common than the other. Do ants prefer seeds with elaiosomes? Give two seeds to an ant and see which one it takes. Repeat twenty times. How often does it pick seeds with elaiosomes? Has the ant a real preference or is it picking seeds at random?

This manual is written as if you studied ants. Of course you have to transfer its meaning to your study.

Make a graph.

(this time you don't need to open an Excel data file).

Change the numbers and names.

barplot(c(14,20),names.arg=c("First","Second"))

This kind of graph is also easy to make in Powerpoint:

·  Start Powerpoint

·  Choose Insert àInsert Chart (=Infoga à Diagram)

·  Right click on the chart and choose chart type (=Diagramtyp)

·  Erase non-necessary cells

Type in your headers and values

How confident are you of the result?

Is there a risk that the pattern is due to chance only?

Randomization test

1.  How many ants picked large elaiosomes?
large.choosers <- 18
large.choosers

2.  If the ants didn't have any preferences, what differences could be found in an experiment by chance if we tested 30 ants? Paste the code several times and realize that you will get different results.
random.choice<-sample(c("Large","Small"),size=30,replace=T)
sum(random.choice=="Large")

3.  If we do this a thousand times, is there any risk that we will get more large choosers than 18?


random.large.choosers <- NULL
for (i in 1:1000){
random.choice<-sample(c("Large","Small"),size=30,replace=T)
random.large.choosers[i]<- sum(random.choice=="Large")
}

4.  You can check all random differences by typing random.large.choosers, or fix(random.large.choosers) but it is easier to inspect a histogram. Change 30 to your sample size.
hist(random.large.choosers,breaks=0:30)

5.  The 95 % of the values in random.large.choosers that are closest to 15, are not totally unlikely to get just by chance. If the number of large-choosers in your actual (real) data set is within these 95 % random values, then there is a risk that there is no real preference, and that the apparent preference in your study was just due to chance. On the other hand, if the number of large-choosers from your real data set is larger, then it is very unlikely that it is caused only by chance. Then the conclusion will be that there is indeed a preference. To see the range of the 95 % most likely values in random.large.choosers you check the value of the 25th lowest value and the 975th value (that is, 2.5 % from each end). First you have to sort the difference with sort(). And then use the address tag [] to get value 25 and 975.
sort(random.large.choosers)[c(25,975)]

6.  To get a nice graphical illustration of this, make a histogram and then add vertical lines to illustrate the 95 % interval (red) and the slope from your actual real data (blue). Change 30 to your sample size.

hist(random.large.choosers,breaks=0:30)

abline(v=sort(random.large.choosers)[c(25,975)],col="red",lwd=2)

abline(v=large.choosers,col="blue",lwd=4)

What are your conclusions? Can you trust the result from your study?

Make a binomial test.

Change the numbers. The second number should be the total sample size.

binom.test(18,30)

Is your study significant?

N.B. When you use a p-value in a report use two significant digits, like 0.0023. If it's below 0.0001 just write p< 0.0001.

If you want to compare your value to a mathematical binomial distribution instead of a randomly sampled distribution you can do like this. First change the values of your sample size (n) and the number of outcomes for one of the alternatives (k). The barplot function in unfortunately a little tricky to work with…

n<-30

k<-21

mp<-barplot(dbinom(0:n,n,0.5),names.arg=0:n,axis.lty=1)

red<-NULL

for (i in 1:(n+1)){red[i]<-sum(dbinom(0:(i-1),n,.5))}

barplot(dbinom(0: (sum(red<0.025)-1),n,0.5),add=T,col="red")

barplot(c(rep(0, n-(sum(red<0.025)-1)),dbinom((n-(sum(red<0.025)-
1)):n,n,0.5)),add=T,col="red")

abline(v=mp[k+1],col="blue",lwd=2)

The binomial test is very easy to do!

Let's say that the ants really prefer seeds with elaiosomes. And that they almost always pick those seeds. Sometimes, however, you might get stupid ants. If there are 1 stupid ant in your lab, how many ants do you have to test (= how large sample size do you need)? If there are 2 stupid ants? Or three? Or if one third of the ants are stupid? Check it out with binom.test().

2´2 tests: categorical response – categorical explanatory

First create a 2×2-table

(so this time you don't need to open an Excel data file).

This is an example from a strider experiment (Geometridae, walking on water insects). Do male and female respond differently to presence of predatory fish in there aquarium? With one insect of each sex in the aquarium, who will feed first on a dropped banana fly?

For your data, change the numbers in the matrix, and the names. R first fills column 1, then column 2.

Strider<-matrix(c(11,9,17,3),2,dimnames=list(Sex=c("Male", "Female"), Treatment=c("Fish","Predator free")))

Check that the matrix is correct:


Strider

Make a barplot:

barplot(Strider)

Or perhaps better with the bar side by side:

barplot(Strider,beside=T)

You can specify the colours and add a legend:

barplot(Strider,beside=T,col=c("red","black"),legend.text=T)

If the legend interferes with your bars, increase the graph height or width with ylim or xlim.

barplot(Strider,beside=T,col=c("red","black"),legend.text=T, ylim=c(0,max(Strider)*1.5))

Or just pimp the graph in Inkscape, where you easily can move legends and change bar colours and text.

Tip!

This kind of graph is also easy to make in Powerpoint:

·  Start Powerpoint

·  Choose Insert àInsert Chart (=Infoga à Diagram)

·  Right click on the chart and choose chart type (=Diagramtyp)

·  Erase non-necessary cells

Type in your headers and values

When you have a graph – evaluate it!

Do you think there is difference in response between the explanatory groups?

Or could this just be caused by chance?

Test your 2×2-table

Test your 2x2 matrix from above. Of course change Strider to what your matrix is called.

fisher.test(Strider)

Is your study significant?

What does it mean biologically?

You already have nice graph.

Instead of the fisher.test many scientists have used the Chi-square-test. It is not as reliable with small sample sizes, but has the same purpose. If you like to try, this is how you do it:

chisq.test(Strider)

If you want you can use the chisq.test function to run a simulation for you (like the permutations you did yesterday). 2000 is the number of randomisations.

chisq.test(Strider,simulate.p.value=TRUE, B=2000)

Logistic regression: categorical response – continuous explanatory

1.  Open and attach your data set [ à Open and Save manual ]. Get it right!

2.  First examine your data graphically. Change y (= response) and x (= explanatory). N.B. you must write x~y (not the usual y~x) to get this graph right.

stripchart(x~y, method="jitter", jitter=0.1, pch=19)

Do you think that the outcome of your y (= response) depends on your x(=explanatory)?

How confident are you?

Is there a risk that the pattern is due to chance only?

3.  Test logistic regression with a generalised linear model (glm). Change y (= response) and x (= explanatory). Now you shall write y~x as usual. The word binomial tells R that the response is of an "either…or…" type.

mod.glm<-glm(y~x,binomial)

anova(mod.glm,test="Chi")

Check the p-value of your x variable! Is it significant?

4.  Make a neat graph. This is one of the hardest things to do in any statistical programs. We want the points to be jittered (= scattered), because in logistic regression data it is otherwise very common that they hide each other. And we want a line that shows how the probability of the two different response outcomes changes along the x axis. In many programs this is impossible. In R it requires some code. Remember to change y and x, and other green fat things to your names.

par(mar=c(5,4,4,5))

plot(jitter(as.numeric(y)-1,factor=0.5)~x,pch=19,axes=F, xlab="Seed.size",ylab="",cex.lab=1.5)

box()

axis(1,cex.axis=1.5)

axis(2,1:0,c("Germinated","Not germinated"),cex.axis=1.5)

Code explanation, which you can skip!

The par(mar=… makes the graph window a little larger. To reset just close the graph window.

The next row transforms your categorical y variable to a numeric with values 0 and 1, and adds jitter.

The last three rows adds axis in a manual way. To get it neat…

5.  If your x variable (explanatory) was significant, you can add a probability line. (If it was not significant try anyway, but use a graph without line in your report). Change y and x, and other green fat things.

axis(4,las=1, col.axis="red")

mtext("Probability of germination",4,line=3,cex=1.5,col="red")

mod.glm<-glm(y~x,binomial)

intervalx<-seq(min(x),max(x),length.out=30)

lines(intervalx,predict(mod.glm,list(x=intervalx),type="resp"), col="red",lwd=2)

Code explanation, which you can skip!

The first row adds a red axis to the right.

The second row adds text there.

The third row specifies the generalised linear model for your data.

The forth row creates new values evenly spaced along the x axis.

The last lines then predicts the probabilities along the x axis and make a smooth line out of them.

6.  If the graph looks upside down it's because R uses alphabetic order of your response outcomes. To change the order use factor(), and start all over again...

y<-factor(y,levels=c("Survived","Died"))

7.  Paste the graph into your Mini report.

What biological conclusions can you make?

7