Math Talk – January 30, 2014
Rachel Saidi
Introduction to R
1. What is R?
R is a statistical software system for data analysis and graphics, developed by Ross Ihaka and Robert Gentleman of the University of Auckland Department of Statistics (1995). It is considered a dialect of the S language, which is also the basis of S-Plus. R is freely distributed and has many built-in functions for statistical analysis, along with excellent graphics.
2. How is R-Studio different from R?
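R-Studio is not a separate version of R. It is a more user-friendly interface that runs R for you and, by default, arranges your work in 4 window panels: a script editor, the console, the workspace and history, and a viewer for files, plots, and help.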
3. Where can I get R and R-Studio?
First download R:
http://cran.us.r-project.org/bin/windows
Then you can download R-Studio:
http://www.rstudio.com/products/rstudio/download/
You need R before you can use R-Studio (below is what R-Studio looks like with 4 panels).
4. How can I get help when in R-Studio?
Type: help() and put the particular command you are interested in inside the parentheses.
Highlight that line and press Ctrl+Enter (older versions of R-Studio also accept Ctrl+R), and the information will come up in the bottom-right panel.
Example:
I would like more information about the command mean.
Type: help(mean)
The resulting information that appears is shown below:
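Besides help(), a few other built-in commands are handy for finding documentation. The lines below are a minimal sketch using only base R functions; the choice of mean and sd is just for illustration.

```r
# Open the help page for mean() (the two forms are equivalent)
help(mean)
?mean

# Run the examples listed at the bottom of the help page
example(mean)

# Show a function's arguments without opening the full help page
args(sd)

# List objects whose names contain "mean" when you only remember part of a name
apropos("mean")
```

In R-Studio the help page opens in the bottom-right panel, while args() and apropos() print their results in the console.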
5. How do I import data?
Small data sets are easy to enter by typing, but for large data sets it is more convenient to import them from external sources. Use the read.csv() command for CSV (comma-separated values) files, a format that Excel can save to:
a. Choose a name for the data set in R-Studio, such as mydata
b. Follow the name with the assignment operator <-
c. Add the command: read.csv(...)
d. Inside the parentheses, in quotes, type the name of the file you are importing. The file must be in your working directory, or you must give its full path. To use the first line as the headings for the columns, include header=TRUE (this is also the default for read.csv).
mydata <- read.csv("Pollution.csv", header=TRUE)
e. Optionally, use the command attach(mydata) so that you can refer to the columns by name alone. Without it, refer to a column as mydata$columnname.
*** Note: R is case-sensitive, so capital letters are read differently than lower case letters (mydata and Mydata are two different names)
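The import steps above can be sketched end to end as follows. So that the example runs without any external file, it first writes a tiny CSV to a temporary location; the column names (city, ozone, temp) and the values are made up for illustration and are not from Pollution.csv.

```r
# Write a small CSV file to a temporary location (hypothetical data)
csvfile <- tempfile(fileext = ".csv")
writeLines(c("city,ozone,temp",
             "A,41,67",
             "B,36,72",
             "C,12,74"), csvfile)

# header = TRUE treats the first line of the file as the column names
mydata <- read.csv(csvfile, header = TRUE)

# Inspect what was imported
str(mydata)    # 3 observations of 3 variables
head(mydata)   # first rows of the data set

# Refer to a column with $, so attach() is not required
mean(mydata$ozone)

# with() evaluates an expression using the columns of mydata
with(mydata, mean(temp))

# R is case-sensitive: mydata exists, Mydata does not
exists("mydata")
exists("Mydata")
```

Using mydata$ozone or with() instead of attach() is a design choice rather than a requirement: it avoids name clashes when several data sets share column names.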
6. Now let’s type data directly into R to find a linear regression equation and correlation. For fun, I have found some silly data with “Spurious (false) Correlation” at www.tylervigen.com. Feel free to browse this site on your own, but remember…. CORRELATION DOES NOT IMPLY CAUSATION!!!!!
Example:
Per capita consumption of cheese (US)
correlates with
Number of people who died by becoming tangled in their bedsheets
Per capita consumption of cheese (US)
Pounds (USDA) / 29.8 / 30.1 / 30.5 / 30.6 / 31.3 / 31.7 / 32.6 / 33.1 / 32.7 / 32.8
Number of people who died by becoming tangled in their bedsheets
Deaths (US) (CDC) / 327 / 456 / 509 / 497 / 596 / 573 / 661 / 741 / 809 / 717
Create R-Code to make a linear regression and correlation for per capita consumption of cheese in pounds (US) correlated with numbers of people who died by becoming tangled in their bedsheets:
***** but remember…. CORRELATION DOES NOT IMPLY CAUSATION!!!!!
You can copy from here as we go through the code together, or you can copy the entire code provided in the appendix at the end of this document.
# Clear all
rm (list = ls())
# Define variables
percapcheese = c (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets = c (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
# Create a histogram
hist (percapcheese, probability = TRUE, breaks=seq(29, 34, 0.5), col="lightblue")
# Create a normal curve on top of the histogram
curve (dnorm (x, mean=mean(percapcheese), sd=sd(percapcheese)), col = "red", add=TRUE)
# Create a scatterplot – start with a new window
dev.new()
plot (percapcheese, deathbysheets, main = "Scatterplot of Death By Sheet Entanglement vs Cheese Consumption")
# Compute statistical properties
summary (percapcheese)
summary (deathbysheets)
# Perform a simple linear regression (or linear model – lm)
linearfit = lm(deathbysheets ~ percapcheese)
summary (linearfit)
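Notice that the “Multiple R-squared” value, 0.897, is the square of the correlation coefficient, not the correlation coefficient itself. The correlation coefficient is its square root, about 0.947, which you can confirm with cor(percapcheese, deathbysheets). The website appears to report the correlation coefficient rather than R-squared, which would explain why the two numbers look different.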
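To see how the correlation coefficient and R-squared relate, the short sketch below recomputes both from the same data. It uses only base R functions; the variable names r and r2 are introduced here for illustration.

```r
# Same data as in the example above
percapcheese  <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

# Fit the simple linear regression
linearfit <- lm(deathbysheets ~ percapcheese)

# Pearson correlation coefficient
r <- cor(percapcheese, deathbysheets)

# "Multiple R-squared" as reported by summary()
r2 <- summary(linearfit)$r.squared

r    # about 0.947
r2   # about 0.897

# In simple linear regression, R-squared equals the squared correlation
all.equal(r^2, r2)

# Intercept and slope of the fitted regression line
coef(linearfit)
```

To draw the fitted line on an existing scatterplot, abline(linearfit) can be called right after plot().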
7. What should you do next?
Of course, this is just a very brief introduction to R code. Download R and R-Studio on your own computer and try to explore more. I have posted sample data sets, sites to find more data, and more sample code on my site: ______
Also, most R commands and procedures are very easy to find with a Google search.
Thank you for attending this presentation!
Appendix - Code
# Clear all
rm (list = ls())
# Define variables
percapcheese = c (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets = c (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
# Create a histogram
hist (percapcheese, probability = TRUE, breaks=
seq(29, 34, 0.5), col="lightblue")
# Create a normal curve on top of the histogram
curve (dnorm (x, mean=mean(percapcheese),
sd=sd(percapcheese)), col = "red", add=TRUE)
# Create a scatterplot - start with a new window
dev.new()
plot (percapcheese, deathbysheets, main = "Scatterplot of Death
By Sheet Entanglement vs Cheese Consumption")
# Compute statistical properties
summary (percapcheese)
summary (deathbysheets)
# Perform a simple linear regression (or linear model – lm)
linearfit = lm(deathbysheets ~ percapcheese)
summary (linearfit)