Lecture R4: R statistical models – Part I

R4_S1 Statistical models in R – Part I

  • R has several statistical functions and packages
  • Once you’ve decided which statistical test to use, applying the R functions is quick and easy
  • You will be introduced to many functions. In this session we’ll cover t-tests, Wilcoxon tests, hypothesis testing and chi-squared tests
  • In the next session, we’ll learn more about linear and multiple regression, analysis of variance and correlation coefficients
  • What we cover, however, will not be comprehensive by any means, and you should explore further on your own

R4_S2 Basic R - review

  • Just to get you started – a brief review of how to attach preloaded datasets, list the objects in your workspace with ls(), and use the hash (i.e. ‘#’) to insert comments
  • You should open an R session on your computer, and work through the commands shown on this slide

R4_S3 Basic R - review

  • And here we have a few simple commands to plot variables
  • Work through the commands and see what your output is

R4_S4 T-tests

  • Several tab-delimited files accompany these R lectures
  • Find the ‘gain.txt’ file
  • Download this file onto your computer, open and attach the file in R
  • Upon opening, remember to
  • - change to the correct directory
  • - include the filename extension

R4_S5 One sample t-tests

  • To open, read and attach the file, you would have used the commands shown here
  • To execute the one sample t-test is simple – just call t.test with the variable name in parentheses
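Since the contents of gain.txt aren’t reproduced here, a minimal sketch of the one-sample t.test call using a hypothetical vector:

```r
# Hypothetical sample (a stand-in for the gain.txt data)
x <- c(4.1, 5.2, 6.3, 4.8, 5.9, 5.5, 4.4, 6.1)

# Test H0: mu = 5 against the two-sided alternative (the default)
res <- t.test(x, mu = 5)
res$statistic   # the t statistic
res$p.value     # the p-value
res$conf.int    # 95% confidence interval for the mean
```

With the lecture’s data file you would first use read.table("gain.txt", header = TRUE) and attach() as described above.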

R4_S6 Two sample t-tests

  • When two samples are assumed to have equal variances, then the data can be pooled to find an estimate for the variance
  • By default, R assumes unequal variances
  • If the variances are assumed equal, then you need to specify it when using t.test

R4_S7 Two sample t-tests

  • For example, using the ‘gain’ data:
  • here we execute the Welch two-sample t-test, with unequal variances
  • and the two-sample t-test, with equal variances, where we’ve specified that the variances are equal
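The two calls differ only in the var.equal argument; a sketch with hypothetical samples:

```r
x <- c(12.1, 11.4, 13.0, 12.6, 11.9)   # hypothetical sample 1
y <- c(10.8, 11.2, 10.5, 11.0, 10.9)   # hypothetical sample 2

t.test(x, y)                    # Welch test: unequal variances (the default)
t.test(x, y, var.equal = TRUE)  # pooled test: variances assumed equal
```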

R4_S8 Two sample t-tests, equal variances

  • If we consider an example of the two-sample t-test:
  • Suppose the recovery time for patients taking a new drug is measured (in days), and the data are as shown
  • A side-by-side boxplot (not shown here) indicates that the assumptions of equal variances and normality are valid
  • A one-sided test for equality of means using the t-test is performed
  • This tests the null hypothesis of equal means against the one-sided alternative that the drug group has a smaller mean

R4_S9 Two sample t-tests, equal variances

  • So, to take you stepwise through commands which you would enter into R – and here, you should be working along, entering the data into your own R session….
  • Looking at the output of the two-sample t-test, we fail to reject the null hypothesis
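The slide’s actual recovery-time data aren’t reproduced here, but a sketch with hypothetical values shows the form of the call; alternative = "less" gives the one-sided test that the drug group has the smaller mean:

```r
control <- c(18, 16, 17, 20, 19)   # hypothetical recovery times (days)
drug    <- c(14, 13, 15, 12, 16)   # hypothetical recovery times (days)

# One-sided pooled-variance test: H1 is mean(drug) < mean(control)
t.test(drug, control, alternative = "less", var.equal = TRUE)
```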

R4_S10 Paired t-tests

  • Then there’s the matched, or paired, t-test, which uses a different statistical model
  • The test assumes that the two samples share common traits
  • In R you only need to specify that you want a paired t-test

R4_S11 Paired t-tests

  • As an example of the paired t-test, consider the dilemma of having two graders:
  • In order to promote fairness in grading, each application was graded twice by different graders. Based on the grades, can we see if there is a difference between the two graders?
  • Clearly there are differences
  • Are they described by random fluctuations, or is there a bias of one grader over another?
  • A matched sample test will give us some insight

R4_S12 Paired t-tests

  • You will enter the data into R. But first you should check the assumption of normality, with normal probability plots, say. You would find that the general shape appears normal
  • Then you can apply the t-test as follows indicating that the data are paired
  • Looking at the output, we reject the null hypothesis
  • Notice that the data are not independent of each other as grader 1 and grader 2 each grade the same papers – we expect that if grader 1 finds a paper good, that grader 2 will also, and vice versa. This is exactly what non-independent means
  • A t-test without the ‘paired’ argument would lead to a different conclusion – try it and see
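The grades themselves aren’t reproduced here, so this sketch uses hypothetical scores; note that the paired test is equivalent to a one-sample t-test on the differences:

```r
grader1 <- c(3, 0, 5, 2, 5, 5, 5, 4, 4, 5)   # hypothetical grades
grader2 <- c(2, 1, 4, 1, 4, 3, 3, 2, 3, 5)   # same papers, second grader

t.test(grader1, grader2, paired = TRUE)

# Equivalent formulation: one-sample t-test on the paired differences
t.test(grader1 - grader2, mu = 0)
```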

R4_S13 One sample Wilcoxon test

  • As an example of a one sample Wilcoxon test:
  • Suppose we have measurements for a specific response variable, taken from 9 random plants, at time1 (x) and time2 (y) after treatment was applied
  • Data are entered, and the one sample Wilcoxon test is run, specifying the alternative hypothesis
  • Usage is rather simple
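A sketch of the call, using the paired data from R’s own ?wilcox.test examples as a stand-in for the plant measurements:

```r
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)    # time1
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29) # time2

# One-sample signed-rank test on the differences, H1: location shift > 0
wilcox.test(x - y, alternative = "greater")
```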

R4_S14 Two sample Wilcoxon test

  • As an example of a two sample Wilcoxon test:
  • Suppose we have permeability constants of a placental membrane at term (x) and between 12 and 26 weeks gestational age (y)
  • Data are entered, and the two sample Wilcoxon test is run, specifying the alternative hypothesis
  • For more information on the Wilcoxon tests, use R’s help tools (?wilcox.test)
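A sketch using the permeability data from the two-sample example in ?wilcox.test:

```r
x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)  # at term
y <- c(1.15, 0.88, 0.90, 0.74, 1.21)             # 12-26 weeks gestational age

# Two-sample rank-sum test, H1: location shift is greater than 0
wilcox.test(x, y, alternative = "greater")
```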

R4_S15 Wilcoxon rank tests

  • The wilcox.test function performs the Wilcoxon signed rank test
  • Many books first introduce the sign test, where ranks are not considered. This can be calculated using R as well.
  • A function to do so is simple.median.test, which computes the p-value for a two-sided test for a specified median
  • See how it works in the box…
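simple.median.test is not part of base R (it accompanies the course materials); the sign test it performs can be built from binom.test, as in this hypothetical version:

```r
# Sign test: count observations above the hypothesised median and compare
# that count to Binomial(n, 1/2); values equal to the median are dropped
sign.median.test <- function(x, median = 0) {
  x <- x[x != median]
  binom.test(sum(x > median), length(x), p = 0.5)$p.value
}

x <- c(12, 15, 9, 21, 14, 11, 18, 17)   # hypothetical data
sign.median.test(x, median = 10)        # two-sided p-value
```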

R4_S16 Confidence interval for the median

  • Confidence intervals for the median are important too
  • The R function wilcox.test performs a non-parametric test for the median, but we need to specify that we want a confidence interval computed
  • As an example, suppose the following data are the earnings of American CEOs in dollars - the command shown creates a test for the median
  • We couldn't have used a t-test as the data isn't even close to normal
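A sketch using skewed hypothetical earnings; the conf.int argument requests the interval:

```r
sals <- c(12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25)  # hypothetical, heavily skewed

# Non-parametric test for the median, with a 90% confidence interval
res <- wilcox.test(sals, conf.int = TRUE, conf.level = 0.9)
res$conf.int   # confidence interval for the (pseudo)median
```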

R4_S17 Testing a population parameter

  • Hypothesis testing is mathematically related to the problem of finding confidence intervals, but the approach is different
  • For confidence intervals, you use the data to tell you where the unknown parameter is likely to lie;
  • for hypothesis testing, you make a hypothesis about the value of the unknown parameter and then calculate how likely it is that you would have observed data this extreme, or worse
  • With R, however, you will not notice much difference, as the same functions are used for both. The way you use them is slightly different though
  • Consider a simple survey: You ask 100 randomly chosen people, and 42 say “yes" to your question. Does this support the hypothesis that the true proportion is 50%?
  • To answer this, we set up a test of hypothesis. The null hypothesis is that p = 0.5; the alternative hypothesis is that p ≠ 0.5. This is a “two-sided” alternative
  • We use prop.test to carry out the test

R4_S18 Testing a population parameter

  • In the example of the simple survey, the prop.test command:
  • Yields a p-value of 0.1336. The p-value reports how likely we are to see these data, or worse, assuming the null hypothesis. The notion of “worse” is implied by the alternative hypothesis
  • In particular, the p-value is the probability that 42 or fewer, or 58 or more, people answer “yes” when the chance that a person answers “yes” is 50-50
  • Now, the p-value is not so small as to make an observation of 42 seem unreasonable in 100 samples, assuming the null hypothesis. Thus, one would “accept" the null hypothesis
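The survey example as a call:

```r
res <- prop.test(42, 100, p = 0.5)   # 42 "yes" out of 100, H0: p = 0.5
res$p.value                          # about 0.1336 (continuity-corrected)
```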

R4_S19 Testing a population parameter

  • Consider another simple survey:
  • But this time, suppose we ask 1000 people and 420 say “yes”. Does this still support the null hypothesis that p = 0.5?
  • The R command yields a very tiny p-value of 4.956e-07 and the null hypothesis is rejected
  • This illustrates that the p-value depends not just on the ratio, but also on n
  • In particular, it is because the standard error of the sample average gets smaller as n gets larger.
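The larger survey, same call:

```r
res <- prop.test(420, 1000, p = 0.5)  # 420 "yes" out of 1000, H0: p = 0.5
res$p.value                           # very small: reject H0
```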

R4_S20 Testing a mean

  • Suppose a car manufacturer claims a model gets 25 mpg. A consumer group asks 10 owners of this model to calculate their mpg and the mean value was 22 with a standard deviation of 1.5.
  • Is the manufacturer's claim supported?
  • In this case we test the null hypothesis that μ = 25 against the one-sided alternative hypothesis that μ < 25
  • To test using R we simply need to tell R about the type of test
  • For this example, the built-in R function t.test isn't going to work - the data is already summarized - so we need to calculate the test statistic and then find the p-value…

R4_S21 Testing a mean

  • Compute the test statistic
  • Get the p value
  • This is a small p-value (0.000068)
  • The manufacturer's claim is suspicious
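With only the summary statistics available, the steps above can be carried out by hand:

```r
xbar <- 22; mu0 <- 25; s <- 1.5; n <- 10   # summarized data from the survey

t.stat <- (xbar - mu0) / (s / sqrt(n))  # test statistic: -6.3246
p.val  <- pt(t.stat, df = n - 1)        # one-sided p-value, H1 is mu < 25
p.val                                   # about 0.000068
```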

R4_S22 Chi-squared test

  • The chi-squared distribution allows for statistical tests of categorical data
  • A goodness of fit test checks to see if the data came from some specified population
  • The chi-squared goodness of fit test allows one to test if categorical data corresponds to a model where the data is chosen from the categories according to some specified set of probabilities.
  • In R, we need to specify the actual frequencies and the assumed probabilities, but the usage is very simple

R4_S23 Chi-squared test

  • As an example:
  • If we toss a die 150 times and find that we have the distribution of rolls as shown on the slide, is the die fair?
  • Of course, you suspect that if the die is fair, the probability of each face should be the same or 1/6. In 150 rolls then you would expect each face to have about 25 appearances. Yet the 6 appears 36 times. Is this coincidence or perhaps something else?
  • R has a built-in test for this type of problem. To use it we need to specify the actual frequencies, the assumed probabilities and the necessary language to get the result we want. In this case - goodness of fit - the usage is very simple
  • The formal hypothesis test assumes the null hypothesis is that each category has probability 1/6, against the alternative that at least one category doesn't have this specified probability
  • You will see that the value of chi-squared is 6.72, the degrees of freedom are 6 - 1 = 5, and the calculated p-value is 0.2423
  • So we have no reason to reject the hypothesis that the die is fair.
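The slide’s exact counts aren’t reproduced here, but the following hypothetical counts are consistent with the quoted statistics (150 rolls, the 6 appearing 36 times, chi-squared = 6.72):

```r
freq  <- c(22, 21, 22, 27, 22, 36)   # hypothetical counts for faces 1-6
probs <- rep(1/6, 6)                 # fair-die probabilities

chisq.test(freq, p = probs)          # X-squared = 6.72, df = 5
```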

R4_S24 Chi-squared test

  • And finally, to end off this lecture, work through the R commands shown on this slide, to further familiarize yourself with the chi-squared test
  • You will need to access the ‘Genedata.txt’ file which accompanies this lecture-series
  • Remember to change to the correct directory and attach the file
  • Work through these commands – make a table, get the chi-squared values
  • Try this and see what you get
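Since Genedata.txt isn’t reproduced here, this sketch uses hypothetical categorical data to show the same table-then-test workflow:

```r
# Hypothetical stand-in for two categorical columns of Genedata.txt
genotype  <- c("AA","Aa","aa","AA","Aa","Aa","aa","AA","Aa","aa","AA","Aa")
phenotype <- c("tall","tall","short","tall","short","tall","short",
               "tall","tall","short","short","tall")

tab <- table(genotype, phenotype)   # cross-tabulate the two factors
tab
chisq.test(tab)   # test of independence (warns: small expected counts here)
```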
