Week 17: linear regression and log transforms, Tutorial Solutions
Ordinary Linear Regression
Question 1
1(i) and (ii) Graphical answers. Use these check points to confirm that your graph is correct.
When time = 4.70, myoglobin level = 1310
When time = 6.00, myoglobin level = 559
These are calculated from the line. You need not have calculated these points but your plotted line should go through them. If it does not, either you are just waving a ruler and hoping for the best or your calculations were wrong.
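As a quick check, the two points can be recomputed from the fitted equation myo = 4027 - 578 time quoted in part (iv). This is only a minimal sketch; the function name is made up for illustration, and the units (hours, ng/ml) are those used in part (iii).

```python
# Fitted line as quoted in the regression output: myo = 4027 - 578 time
INTERCEPT = 4027.0  # ng/ml
GRADIENT = -578.0   # ng/ml per hour


def predicted_myoglobin(time_hours):
    """Return the fitted myoglobin level (ng/ml) at a given finishing time."""
    return INTERCEPT + GRADIENT * time_hours


# The two check points given in the solution
for t in (4.70, 6.00):
    print(f"time = {t:.2f} -> fitted myoglobin = {predicted_myoglobin(t):.0f}")
```

Any two points computed this way are enough to draw the line; choosing them near the ends of the data range makes the plotted line more accurate.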
(iii) Gradient of the fitted line is -578 (do not lose the minus sign).
This means that for every extra hour taken, the myoglobin level was lower by 578 ng/ml on average.
(iv) Look at the row for time in the output copied below.
MTB > Regress 'myo' 1 'time'.
The regression equation is
myo = 4027 - 578 time
Predictor      Coef    Stdev   t-ratio      p
Constant       4027     1082      3.72  0.003
time         -577.9    195.4     -2.96  0.011
s = 321.1   R-sq = 40.2%   R-sq(adj) = 35.6%
Analysis of Variance
SOURCE        DF        SS       MS     F      p
Regression     1    902226   902226  8.75  0.011
Error         13   1340398   103108
Total         14   2242624
Null Hypothesis H0: gradient = β = 0
Alternative Hypothesis H1: β ≠ 0
The p-value for the gradient (the row for time in the output) is 0.011.
Since this is less than 0.05 we can reject the null hypothesis in favour of the alternative and conclude that there is some (fairly weak) evidence that the true gradient is non-zero. Hence the line has some meaning and could be used for prediction.
(The p-value is not less than 0.01, so we cannot reject the null hypothesis at 1% significance. That is why I said it was only fairly weak evidence of a non-zero gradient.)
(v) 40.2% of the variation in myoglobin level can be ascribed to the length of time the athletes were competing. (So 59.8% is due to something else, natural variation between athletes, fitness etc?)
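The R-sq, F and s values in the output all follow from the Analysis of Variance table, so they can be checked by hand. The sketch below redoes that arithmetic using only the numbers printed in the table; the variable names are just for illustration.

```python
import math

# Values copied from the Analysis of Variance table
ss_regression = 902226.0
ss_error = 1340398.0
ss_total = 2242624.0
df_regression = 1
df_error = 13

# R-sq: proportion of the total variation explained by the regression
r_squared = ss_regression / ss_total

# F: mean square for regression divided by mean square for error
ms_error = ss_error / df_error
f_ratio = (ss_regression / df_regression) / ms_error

# s: residual standard deviation, the square root of the error mean square
s = math.sqrt(ms_error)

print(f"R-sq = {100 * r_squared:.1f}%")
print(f"F = {f_ratio:.2f}")
print(f"s = {s:.1f}")
```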
(vi) A 95% confidence interval for the gradient is given by
Sample gradient ± t_{0.025, n-2} × st. error of gradient
i.e. -577.9 ± t_{0.025, 13} × 195.4
The numbers come from the row for time in the output above. Remember that the column labelled Stdev is the standard error, so no calculation is needed.
So the 95% C.I. for the gradient is -577.9 ± 2.160 × 195.4 = -577.9 ± 422.1, that is, from about -1000.0 to -155.8.
(vii) A 95% confidence interval for the intercept is 4027 ± 2.160 × 1082 = 4027 ± 2337, by the same method as in (vi). The intercept gives the average myoglobin level at time = 0, that is, for an athlete before the start of the triathlon. It is very far to the left of the earliest finishing time (understandably) and represents an extrapolation. Therefore, even though it makes sense that it is significantly different from 0, it is not reliable, given the distance from the data and an R² < 75%.
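The intervals in (vi) and (vii) both use the same recipe: estimate ± t × standard error. A minimal sketch of that recipe is given below. The helper name is made up for illustration, and the t value 2.160 is taken from tables, as in the solution, rather than computed.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) for estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


T_CRIT_13 = 2.160  # t_{0.025, 13} from tables (n - 2 = 13 degrees of freedom)

# Coef and Stdev are read straight from the regression output
grad_low, grad_high = confidence_interval(-577.9, 195.4, T_CRIT_13)
int_low, int_high = confidence_interval(4027.0, 1082.0, T_CRIT_13)

print(f"Gradient:  {grad_low:.1f} to {grad_high:.1f}")
print(f"Intercept: {int_low:.0f} to {int_high:.0f}")
```

Neither interval contains 0, which agrees with both p-values in the output being below 0.05.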
Question 2
Y1 and X1
Null Hypothesis H0: gradient = β = 0
Alternative Hypothesis H1: β ≠ 0
y1 = 4.39 + 0.857 x1
Predictor      Coef    StDev      T      P
Constant      4.393    1.872   2.35  0.057
x1           0.8571   0.3708   2.31  0.060
S = 2.403   R-Sq = 47.1%   R-Sq(adj) = 38.3%
Analysis of Variance
Source            DF       SS       MS     F      P
Regression         1   30.857   30.857  5.34  0.060
Residual Error     6   34.643    5.774
Total              7   65.500
(i) The p-value for the gradient is 0.060 (the row for x1 in the output).
Since this is greater than 0.05, we cannot reject the null hypothesis at 5% significance. There is insufficient evidence that the gradient of a straight line through these points differs from zero. So we are not justified in quoting the equation or using it.
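An equivalent way to reach the decision in (i) is to compare the t-ratio for x1 with the tabulated critical value for 6 degrees of freedom (the Residual Error DF), t_{0.025, 6} = 2.447. The sketch below assumes the usual two-sided test at 5%; the variable names are only illustrative.

```python
# Coef and StDev for x1, copied from the regression output
coef = 0.8571
std_error = 0.3708

T_CRIT_6 = 2.447  # t_{0.025, 6} from tables

# The T column in the output is simply Coef / StDev
t_ratio = coef / std_error

# Two-sided test at 5%: reject H0 only if |t| exceeds the critical value
reject_h0 = abs(t_ratio) > T_CRIT_6

print(f"t-ratio = {t_ratio:.2f}")
print(f"Reject H0 at 5% significance: {reject_h0}")
```

The t-ratio falls just short of the critical value, which matches a p-value just above 0.05.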
(ii) Does this say there is no relation between y1 and x1? Look back at the graph!
No. The graph shows a clear curve, so a curve should be fitted rather than a straight line.
Y2 and X2
Null Hypothesis H0: gradient = β = 0
Alternative Hypothesis H1: β ≠ 0
The regression equation is
y2 = 3.64 + 0.190 x2
Predictor      Coef    StDev      T      P
Constant      3.643    2.554   1.43  0.204
x2           0.1905   0.5058   0.38  0.719
S = 3.278   R-Sq = 2.3%   R-Sq(adj) = 0.0%
Analysis of Variance
Source            DF      SS      MS     F      P
Regression         1    1.52    1.52  0.14  0.719
Residual Error     6   64.48   10.75
Total              7   66.00
(iii) The p-value is 0.719. Since this is greater than 0.05 we cannot reject the null hypothesis at 5% significance so there is no evidence that the gradient of the line is non-zero. The line is of no use for describing the relationship.
(iv) Does this say there is no relation between y2 and x2? Look back at the graph!
Looking at the graph suggests that fitting a straight line, or anything else, is a waste of time.
Linear regression after logging one or both axes
Question 3
(i) The plot should show a curve with a steep gradient to start with, then falling more slowly.
(ii) You should have something that looks more like a straight line.
(iii) The equation is Ln(viscosity) = 3.89 - 0.0866 temp
(iv) If viscosity = A e^{kx}, where x is the temperature,
and we take natural logs we get
Ln(viscosity) = Ln(A e^{kx}) = Ln A + Ln(e^{kx}) = Ln A + kx
If you have trouble with this, note that Ln(e^{a}) = a for all a.
Try Ln(e^{2}) on your calculator. You should get the answer 2.
Try Ln(e^{5}) on your calculator. You should get the answer 5. And so on.
If we match Ln(viscosity) = Ln A + kx with 'y = c + mx'
we can see that the intercept gives Ln A and that the gradient k is taken directly from the equation.
So from (iii) above k = -0.0866 (watch the minus sign!)
And Ln A = 3.89, so A = e^{3.89} = 48.9
So the equation for the original data is: viscosity = 48.9 e^{-0.0866x}
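The back-transformation can be checked numerically. The sketch below computes A from the intercept and confirms that the exponential form and the straight-line form on the log scale agree. The function names and the test temperature of 20 are made up for illustration and are not taken from the data.

```python
import math

LN_A = 3.89   # intercept of the fitted line for Ln(viscosity)
K = -0.0866   # gradient of the fitted line

# Undo the natural log to recover A
A = math.exp(LN_A)


def viscosity(temp):
    """Exponential model on the original scale: A * e^(k * temp)."""
    return A * math.exp(K * temp)


def ln_viscosity_line(temp):
    """Fitted straight line on the log scale: Ln A + k * temp."""
    return LN_A + K * temp


print(f"A = {A:.1f}")

# Ln(e^a) = a, so logging the exponential model gives back the straight line
temp = 20.0  # arbitrary illustrative value
print(math.isclose(math.log(viscosity(temp)), ln_viscosity_line(temp)))
```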
(v) A 95% confidence interval for the gradient of the straight line is
-0.086558 ± t_{0.025, 6} × 0.007658
= -0.086558 ± 2.447 × 0.007658
= -0.086558 ± 0.018739
We are 95% confident that the gradient lies between -0.105 and -0.068.
So the true value of k is encompassed by these limits.
A 95% confidence interval for the intercept of the straight line is
3.8859 ± t_{0.025, 6} × 0.1929 = 3.8859 ± 0.4720
So the intercept of the line based on the transformed data is between 3.4139 and 4.358.
This says Ln A is encompassed by these limits, so A lies between e^{3.4139} = 30.4 and e^{4.358} = 78.1 with 95% confidence. (Use a calculator to get back out of natural logs.)
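The limits for k and A can be reproduced as follows. This is a sketch only: the helper name is made up, and the t value 2.447 is read from tables as in the solution. Note that the limits for A come from exponentiating the limits for Ln A, so the interval for A is not symmetric about 48.9.

```python
import math


def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) for estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


T_CRIT_6 = 2.447  # t_{0.025, 6} from tables

# Interval for the gradient k (no back-transformation needed)
k_low, k_high = confidence_interval(-0.086558, 0.007658, T_CRIT_6)

# Interval for Ln A, then exponentiate each limit to get the interval for A
ln_a_low, ln_a_high = confidence_interval(3.8859, 0.1929, T_CRIT_6)
a_low, a_high = math.exp(ln_a_low), math.exp(ln_a_high)

print(f"k:    {k_low:.3f} to {k_high:.3f}")
print(f"Ln A: {ln_a_low:.4f} to {ln_a_high:.4f}")
print(f"A:    {a_low:.1f} to {a_high:.1f}")
```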