Practical 2
Linear Models: Single Linear Regression, Two-sample t-test & One-Way ANOVA
This is the first practical where we start doing stats in a more serious way. Before we can use more advanced techniques, though, it is worthwhile to revisit some basic techniques and pay attention to their underlying assumptions. All too often these techniques are used for significance testing without checking whether they were appropriate in the first place. Let's reverse this trend, start using statistics in a more critical manner, and make the interpretation of our data rigorous and sound.
In this practical, we revisit three very basic techniques, considered special cases of linear models:
a) Single Linear Regression,
b) Two-sample t-test,
c) One-Way ANOVA.
Table of Contents
2.1 Running your commands from a script file
2.2 Single Linear Regression
    Plotting the data and the linear trend
    Saving the plot as a graphics file
    Obtaining the R2 value from the model object
    Calculating the R2 value from first principles
    The model parameters and their associated standard errors
    The F-test
    Confidence intervals about the regression
    Checking the assumptions
    Checking for outliers and their impact on your model
    Cook's distance as a measure of influence
2.3 Two-sample t-test
2.4 One-Way ANOVA
2.1 Running your commands from a script file
Typing all your commands directly in the R console is but one way to interact with R. Another way, definitely more practical, is to use a script in the R editor window. You type your commands in this script exactly as you would type them in the console. You can then select part of a line, a whole line or even several lines from your script and "send" these to the R console for processing. If you have made a mistake, you can then correct the call in your script file (not retype the whole blooming thing) and rerun it. These scripts also have the merit of keeping a record of the calls that you used in your session (not all of them, just the ones that eventually worked as you wished), which you can then save (as a text file without formatting) for later.
To create a blank script file, select New script in the File menu. A blank page (called Untitled) will then appear in the R editor. May I suggest that you type all the commands from this practical (without the R prompts, '>' or '+') in such a script, saving it regularly (with Ctrl-S) under the name "Prac2.R" in a directory of your choice. Saving it with the .R extension is essential for it to be recognised automatically by R when you want to open it at a later date (using the Open script option in the File menu). Note that these script files, being really just text files, can also be opened in Notepad, independently from R.
When you are ready to "send" an instruction to R, just select (using the mouse or combining the Shift and arrow keys) what you need from your script (a name, a line or multiple lines) and press Ctrl-R. This will effectively send your selection to the R console and ask R to process it.
If you want a whole line to be processed, there is an even neater trick. Placing the cursor anywhere in the line (without selecting anything) and pressing Ctrl-R will automatically send the whole line to the R console for processing and then move the cursor to the next line. Using this trick, you can process all the lines of your script, one line at a time, by repeatedly pressing the Ctrl-R combination.
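For instance, the first few lines of such a script for this practical might look like this (the same commands we will use below, typed without the prompts):

library(faraway)   ## load the faraway package
data(seatpos)      ## load the seatpos data frame
attach(seatpos)    ## make its columns visible by name
plot(Weight ~ Ht)  ## a first look at the data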
2.2 Single Linear Regression
To be able to redo all the analyses (and a few more) that we have seen together in Lecture 2, you first need to install the faraway package. Please do so (following the instructions from Practical 1, section 6, Obtaining R and Installing Packages). When you are done, type the following commands in your script file:
> library(faraway) ## mind the spelling
Loads the package faraway (in position 2 of the R search path, see Section 1.10).
> data(seatpos)
Loads the seatpos data frame from the faraway package into your R session (see Section 1.9).
> help(seatpos)
Opens the Help page about the seatpos data, to figure out what it is about (see Section 1.9).
> attach(seatpos)
Attaches the seatpos data frame in position 2 of the R search path so that the names of its components can be recognized on their own by R (see Section 1.10).
> names(seatpos)
[1] "Age" "Weight" "HtShoes" "Ht" "Seated" "Arm"
[7] "Thigh" "Leg" "hipcenter"
Displays the names of all the components (here columns) belonging to the seatpos data frame.
> model <- lm(Weight ~ Ht)
Performs a linear model analysis with Weight as the response variable and Ht as the explanatory variable, and assigns the result to the object called model (feel free to use any name you fancy).
> model
Call:
lm(formula = Weight ~ Ht)
Coefficients:
(Intercept) Ht
-292.992 2.653
Prints the model object. This may not necessarily show you the entire content of the object though. The designers of the lm function have decided in their infinite wisdom to keep some components of the object hidden (to keep the console uncluttered). If you want to see the list of the components of this object, use the function names:
> names(model)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
As you can see, the only two components effectively printed when calling the model object itself are coefficients and call.
The equation of your single linear regression, Weight = a + b × Ht, therefore has an intercept, a, equal to -292.992 and a slope, b, equal to 2.653. The slope expresses the fact that the linear trend predicts a 2.653-unit increase in weight for every unit increase in height. It doesn't imply causation, only that the two variables may be somehow connected (directly or indirectly).
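If you want to see this equation at work, you can ask the model for a prediction with the predict() function (a quick illustration; the height of 170 is just an arbitrary value):

> predict(model, newdata=data.frame(Ht=170))

This should return -292.992 + 2.653 × 170, i.e. roughly 158 (pounds).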
Plotting the data and the linear trend
The plot of the points which we analysed can be produced with plot(), either:
> plot(Ht, Weight)
When the two variables are separated by a comma, the convention is to supply the x-variable first and the y-variable second.
> plot(Weight ~ Ht)
When the two variables are separated by a tilde, ~, (i.e. as a formula) the convention is to supply the y-variable first and the x-variable second.
By default, both plots will use the names of the variables (here Weight and Ht) as labels for the two axes, and will display the data as "points" with an open circle symbol (the default plotting character, pch=1).
You can improve the look of this plot by stipulating a few options. Try the following:
> plot(Weight ~ Ht, xlab="Height (cm)", ylab="Weight (pounds)",
+ main="38 Car Drivers", pch=16)
The xlab and ylab options stipulate the text to use as the labels for the x-axis and y-axis respectively. The main option stipulates the main title at the top of the figure, and pch stipulates the graphic symbol to use for the data points: here 16 is a large closed circle, 20 would be a smaller closed circle, and there are many more to choose from (see page 44 in Paradis (2005) R for Beginners for all available choices). If you felt so inclined, you could even specify the colour of these symbols, with col="blue" for instance (see the R_ColorChart.pdf in the Useful Docs directory of the Adv. Research Skills module for a listing of all colours available by "name").
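For instance, combining these options (just an illustrative variation on the plot above, with smaller blue symbols):

> plot(Weight ~ Ht, xlab="Height (cm)", ylab="Weight (pounds)",
+ main="38 Car Drivers", pch=20, col="blue")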
Displaying the trend line on top of this plot is easily done with abline(), in one of several ways:
> abline(a=-292.992, b=2.653) ## or more simply abline(-292.992, 2.653)
Passing the arguments (a and b) by value.
> abline(model$coef[1], model$coef[2])
Extracting the arguments (a and b) from the coefficients component of the model object. Note that I have abbreviated the name of this component to just coef (which is fine as long as there is no possible confusion with any other component of the model object).
> abline(model, col="red", lwd=2)
By far the simplest way, just passing on the model object itself (if it exists of course). Here I stipulated the options col="red" to display a red line and lwd=2 to make the line width thicker (the default is 1). I could also have chosen a dashed line (lty=2) or dotted line (lty=3) instead of a continuous line (the default: lty=1).
Saving the plot as a graphics file
At this point, if you are happy with your plot, you may want to save a copy for inclusion in a report for instance. Different graphics formats are available, and which one is best to use will depend on what you intend to do with the picture. To include it in a PowerPoint presentation or display it on a computer screen in a Word document, the best choice I have found is png (for portable network graphics). To print in hard copy (from a Word document for example), the best choice is eps (for encapsulated postscript), but it won't look as nice on screen. You may also save your graph as a pdf document (generally high definition and scalable), but unless you have Adobe Acrobat Writer (or a similar program), it won't be easy to manipulate these files directly and make them part of bigger documents (a report for instance).
To save the graph in any of these formats, make sure the graphics window is active (by clicking once on its top bar for example), then select Save as in the File menu and click on the file format you want (note that eps is listed as Postscript).
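You can also save the plot directly from your script with one of R's graphics device functions, such as png(). Here is a quick sketch (the filename and the dimensions, in pixels, are just examples):

> png("drivers.png", width=600, height=600) ## open a png graphics device
> plot(Weight ~ Ht, xlab="Height (cm)", ylab="Weight (pounds)",
+ main="38 Car Drivers", pch=16)
> abline(model, col="red", lwd=2)
> dev.off() ## close the device and write the file to your working directory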
If you want to compare the png and eps formats and how they look printed and on screen, open the Word document located in the Practicals directory of the Adv. Research Skills module; the two versions are shown side by side in Figure 2.1.
Figure 2.1 The plot of weight versus height for 38 car drivers, saved either as a png file (on the left) or as an eps file (on the right).
Obtaining the R2 value from the model object
To gauge the strength of this single linear regression, the first thing to calculate is of course the R2 value, which measures the "proportion of variance explained".
In R, all we need to do is use the function summary() and pass our model object to it.
> summary(model)
Call:
lm(formula = Weight ~ Ht)
Residuals:
Min 1Q Median 3Q Max
-29.388 -12.213 -4.876 4.945 59.586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.9915 50.6402 -5.786 1.34e-06 ***
Ht 2.6533 0.2989 8.878 1.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.31 on 36 degrees of freedom
Multiple R-squared: 0.6865, Adjusted R-squared: 0.6777
F-statistic: 78.82 on 1 and 36 DF, p-value: 1.351e-10
On the penultimate line, we can see that the R2 value is equal to 0.6865, which means that 68.65% of the variance in the response variable (here Weight) is "summarized" (or accounted for) by our linear model with Ht as the unique explanatory variable. Don't worry about the fact that it is called Multiple R-squared; it applies whether one has a single predictor or multiple predictors in the model (see Lectures 3 & 4).
Note that you could assign the result of the summary() function to a new object, let's say sum.model, to be able to retrieve the R2 value on its own (as one of the components of the new object).
> sum.model <- summary(model)
> names(sum.model)
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
> sum.model$r.squared
[1] 0.6864548
Or, now that we know what the function summary() returns when provided with our model object, we can instead extract the relevant component without having to create an object:
> summary(model)$r.squared # or abbreviated as summary(model)$r.s
[1] 0.6864548
You could also use the standalone function cor(), which calculates the correlation coefficient, r, and square the result to retrieve R2 (this equivalence only holds for a single linear regression though):
> cor(Weight, Ht)^2
[1] 0.6864548
However, since one of the aims of this module is to make you examine more critically whether the assumptions of your tests are met by your data, I would not recommend relying on cor() alone. Use it sparingly, to examine very quickly whether two or more variables may be linearly related.
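For such a quick look, you can even pass the whole data frame to cor() and obtain the matrix of all pairwise correlation coefficients in one go (rounded here to 2 decimal places for readability; this works because all the columns of seatpos are numeric):

> round(cor(seatpos), 2)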
Calculating the R2 value from first principles
We can also calculate the R2 value from first principles using the Residual Sum of Squares (RSS) and the Total Sum of Squares (TSS).
Recall that R2 = 1 – RSS/TSS (Lecture 2, slide 6).
Where RSS is the sum of the squared differences between the observed y-values and the fitted y-values (those predicted by the model for our observations), and TSS is the sum of the squared differences between the observed y-values and the average of these y-values (i.e. under the null hypothesis that there is no linear trend).
R makes these sorts of calculations very easy to do, with a feature called vectorized arithmetic. Simply stated, it means that we can perform arithmetic operations on vectors as if they were single values. R performs the operations element-wise, so that the first value of one vector is paired with the first value of the other vector, and so on for each subsequent value. A vector is then produced with the results of every element-wise operation. A very simple example of this was shown in the previous practical, when we calculated the confidence interval around the mean (Practical 1, page 9).
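As a toy illustration of vectorized arithmetic (with arbitrary numbers):

> c(1, 2, 3) - c(10, 20, 30)
[1]  -9 -18 -27
> (c(1, 2, 3) - c(10, 20, 30))^2
[1]  81 324 729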
In our case, then, RSS can be calculated as follows (breaking it down into a few steps to show you the internal logic of it):
> fit <- fitted(model); fit
1 2 3 4 5 6 7 8
197.5949 146.1218 110.8335 204.2280 168.9397 176.6342 143.7338 191.7577
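From these fitted values, the remaining steps follow directly from the definitions of RSS and TSS given above (a sketch; the final value should match summary(model)$r.squared):

> RSS <- sum((Weight - fit)^2)          ## squared differences between observed and fitted values
> TSS <- sum((Weight - mean(Weight))^2) ## squared differences from the mean
> 1 - RSS/TSS
[1] 0.6864548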