Week 17: linear regression and log transforms, Tutorial Solutions

Ordinary Linear Regression

Question 1

(i) and (ii) Graphical answers. Use these check points to confirm that your graph is correct.

When time = 4.70, myoglobin level = 1310

When time = 6.00, myoglobin level = 559

These are calculated from the line. You need not have calculated these points but your plotted line should go through them. If it does not, either you are just waving a ruler and hoping for the best or your calculations were wrong.
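If you want to reproduce the check points yourself, the short sketch below evaluates the fitted line at the two times. It uses the rounded coefficients printed in the Minitab output (4027 and -577.9), so the results agree with the values above only to within about 1 ng/ml. The function name is just for illustration.

```python
# Fitted line from the Minitab output: myo = 4027 - 577.9 * time
# (rounded coefficients as printed, so answers are approximate)
INTERCEPT = 4027.0   # ng/ml
GRADIENT = -577.9    # ng/ml per hour


def predicted_myoglobin(time_hours: float) -> float:
    """Return the fitted myoglobin level (ng/ml) at a given finishing time."""
    return INTERCEPT + GRADIENT * time_hours


# Evaluate the line at the two check-point times quoted in the solution.
check_points = {t: predicted_myoglobin(t) for t in (4.70, 6.00)}
for t, myo in check_points.items():
    print(f"time = {t:.2f} h -> myoglobin = {myo:.1f} ng/ml")
```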

(iii) Gradient of the fitted line is -578. (Do not lose the minus sign.)

This means that for every extra hour taken, the myoglobin level was on average 578 ng/ml lower.

(iv) Look at the highlighted line (the row for 'time') in the output copied below.

MTB > Regress 'myo' 1 'time'.

The regression equation is

myo = 4027 - 578 time

Predictor     Coef   Stdev  t-ratio      p
Constant      4027    1082     3.72  0.003
time        -577.9   195.4    -2.96  0.011

s = 321.1 R-sq = 40.2% R-sq(adj) = 35.6%

Analysis of Variance

SOURCE        DF        SS        MS       F       p
Regression     1    902226    902226    8.75   0.011
Error         13   1340398    103108
Total         14   2242624

Null Hypothesis H0: gradient = β = 0

Alternative Hypothesis H1: β ≠ 0

The p-value for the gradient (highlighted line) is 0.011.

Since this is less than 0.05 we can reject the null hypothesis in favour of the alternative and conclude that there is some (fairly weak) evidence that the true gradient is non-zero. Hence the line has some meaning and could be used for prediction.

(The p-value is not less than 0.01, so we cannot reject the null hypothesis at 1% significance. That is why I said it was only fairly weak evidence of a non-zero gradient.)
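The same decision can be reached from the t-ratio rather than the p-value. The sketch below recomputes the t-ratio from the coefficient and its standard error and compares it with two-tailed critical values for 13 degrees of freedom. The 5% value (2.160) is the one used later in this sheet; the 1% value (3.012) is taken from standard t tables and is an assumption in the sense that it is not printed anywhere in the output.

```python
# Values copied from the 'time' row of the Minitab output.
coef = -577.9
std_error = 195.4

# Two-tailed critical values of t with 13 degrees of freedom,
# looked up in tables rather than computed here.
T_CRIT_5_PERCENT = 2.160   # t(0.025, 13)
T_CRIT_1_PERCENT = 3.012   # t(0.005, 13)

# The t-ratio is simply the coefficient divided by its standard error.
t_ratio = coef / std_error

# Reject H0 when the size of the t-ratio exceeds the critical value.
reject_at_5 = abs(t_ratio) > T_CRIT_5_PERCENT
reject_at_1 = abs(t_ratio) > T_CRIT_1_PERCENT

print(f"t-ratio = {t_ratio:.2f}")
print(f"reject H0 at 5%: {reject_at_5}")
print(f"reject H0 at 1%: {reject_at_1}")
```

The t-ratio lies between the two critical values, which is the same thing as the p-value lying between 0.01 and 0.05.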

(v) 40.2% of the variation in myoglobin level can be ascribed to the length of time the athletes were competing. (So the remaining 59.8% is due to something else: natural variation between athletes, fitness, etc.)
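If you want to see where the 40.2% comes from, it can be recovered from the Analysis of Variance table. The sketch below uses the standard definitions (R-sq = regression SS / total SS, and the adjusted version based on mean squares); the variable names are mine, the numbers are from the output above.

```python
# Figures copied from the Analysis of Variance table for Question 1.
ss_regression = 902226
ss_total = 2242624
ms_error = 103108
df_total = 14

# R-sq: proportion of the total variation explained by the regression.
r_sq = ss_regression / ss_total

# R-sq(adj): compares the error mean square with the total mean square,
# which penalises the fit for the degrees of freedom it uses.
r_sq_adj = 1 - ms_error / (ss_total / df_total)

print(f"R-sq      = {100 * r_sq:.1f}%")
print(f"R-sq(adj) = {100 * r_sq_adj:.1f}%")
```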

(vi) A 95% confidence interval for the gradient is given by

Sample gradient ± t(0.025, n-2) × st. error of gradient

i.e. -577.9 ± t(0.025, 13) × 195.4

The numbers come from the highlighted line above. Remember that the column labelled stdev is the standard error – no calculation needed.

So the 95% C.I. for the gradient is -577.9 ± 2.160 × 195.4 = -577.9 ± 422.1, i.e. from about -1000 to -156.
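The arithmetic for the interval can be checked with a few lines of code. This is only a sketch: the helper name is made up, and the critical value 2.160 is typed in from tables rather than computed.

```python
def confidence_interval(estimate: float, std_error: float, t_crit: float):
    """Return (lower, upper) limits: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


# Gradient and its standard error from the 'time' row; t(0.025, 13) = 2.160.
lower, upper = confidence_interval(-577.9, 195.4, 2.160)
print(f"95% C.I. for the gradient: {lower:.1f} to {upper:.1f}")

# The whole interval is below zero, which agrees with rejecting
# H0: gradient = 0 at 5% significance in part (iv).
print(f"interval excludes zero: {upper < 0}")
```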

(vii) A 95% confidence interval for the intercept is 4027 ± 2.160 × 1082 = 4027 ± 2337, by the same method as in (vi). The intercept is the fitted value at time = 0, so it gives the average myoglobin level for an athlete before the start of the triathlon. It is very far to the left of the earliest finishing time (understandably) and represents an extrapolation. Therefore, even though it makes sense that it is significantly different from 0, it is not reliable, given the distance from the data and an R² < 75%.

Question 2

Y1 and X1

Null Hypothesis H0: gradient = β = 0

Alternative Hypothesis H1: β ≠ 0

y1 = 4.39 + 0.857 x1

Predictor     Coef   StDev        T      P
Constant     4.393   1.872     2.35  0.057
x1          0.8571  0.3708     2.31  0.060

S = 2.403 R-Sq = 47.1% R-Sq(adj) = 38.3%

Analysis of Variance

Source            DF        SS        MS       F       P
Regression         1    30.857    30.857    5.34   0.060
Residual Error     6    34.643     5.774
Total              7    65.500

(i) The p-value for the gradient is 0.060 (highlighted, the row for x1).

Since this is greater than 0.05 we cannot reject the null hypothesis at 5% significance. There is no significant evidence that the gradient of a straight line through these points differs from zero. So we are not justified in quoting the equation or using it.

(ii) Does this say there is no relation between y1 and x1? Look back at the graph!

No. The graph shows a clear curve, so these data need a curve fitted, not a straight line.
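To see why a non-significant gradient does not mean "no relation", here is a small sketch using made-up data (not the x1 and y1 values from the tutorial). The points lie exactly on a symmetric curve, yet the least-squares straight line through them has gradient 0 and R-sq of 0. The tutorial data are less extreme, but the principle is the same.

```python
def least_squares(xs, ys):
    """Fit y = c + m*x by least squares; return (m, c, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = y_bar - gradient * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return gradient, intercept, r_squared


# Hypothetical data for illustration only: y depends on x exactly,
# but through a U-shaped curve rather than a straight line.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [(x - 4) ** 2 for x in xs]

gradient, intercept, r_squared = least_squares(xs, ys)
print(f"gradient = {gradient}, intercept = {intercept}, R-sq = {r_squared}")
```

The straight-line fit finds nothing even though y is completely determined by x, which is why you should always look at the graph before trusting the regression output.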

Y2 and X2

Null Hypothesis H0: gradient = β = 0

Alternative Hypothesis H1: β ≠ 0

The regression equation is

y2 = 3.64 + 0.190 x2

Predictor     Coef   StDev        T      P
Constant     3.643   2.554     1.43  0.204
x2          0.1905  0.5058     0.38  0.719

S = 3.278 R-Sq = 2.3% R-Sq(adj) = 0.0%

Analysis of Variance

Source            DF        SS        MS       F       P
Regression         1      1.52      1.52    0.14   0.719
Residual Error     6     64.48     10.75
Total              7     66.00

(iii) The p-value is 0.719. Since this is greater than 0.05 we cannot reject the null hypothesis at 5% significance so there is no evidence that the gradient of the line is non-zero. The line is of no use for describing the relationship.

(iv) Does this say there is no relation between y2 and x2? Look back at the graph!

Looking at the graph suggests that fitting a straight line, or anything else, is a waste of time.

Linear regression after logging one or both axes

Question 3

(i) The plot should show a curve that falls steeply to start with, then more slowly.

(ii) You should have something that looks more like a straight line.

(iii) The equation is Ln(Viscosity) = 3.89 - 0.0866 × temp

(iv) If viscosity = A e^(kx)

and we take natural logs we get

Ln(viscosity) = Ln(A e^(kx)) = Ln A + Ln(e^(kx)) = Ln A + kx

If you have trouble with this, note that Ln(e^a) = a for all a.

Try Ln(e^2) on your calculator. You should get the answer 2.

Try Ln(e^5) on your calculator. You should get the answer 5. And so on.

If we match Ln(viscosity) = Ln A + kx with 'y = c + mx', we can see that the intercept gives Ln A and that the gradient k is taken directly from the equation.

So from (iii) above k = -0.0866 (watch the minus sign!)

And Ln A = 3.89 so A = e^3.89 = 48.9

So the equation for the original data is: viscosity = 48.9 e^(-0.0866x), where x is the temperature.
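If you would rather check the back-transformation on a computer than a calculator, the sketch below does the same steps. The names are mine; the two numbers come from the fitted line in (iii).

```python
import math

# Intercept and gradient of the straight line fitted to Ln(viscosity).
LN_A = 3.89
K = -0.0866

# Undo the natural log on the intercept to recover A.
A = math.exp(LN_A)


def viscosity(temp: float) -> float:
    """Fitted viscosity on the original scale: A * e^(k * temp)."""
    return A * math.exp(K * temp)


print(f"A = {A:.1f}")

# Ln(e^a) = a, so logging the fitted curve gives back the straight line.
temp = 10.0  # arbitrary temperature chosen only for the check
print(math.log(viscosity(temp)), LN_A + K * temp)
```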

(v) 95% confidence interval for the gradient of the straight line is

-0.086558 ± t(0.025, 6) × 0.007658

= -0.086558 ± 2.447 × 0.007658

= -0.086558 ± 0.018739

We are 95% confident that the gradient lies between -0.105 and -0.068.

So, with 95% confidence, the true value of k lies within these limits.

95% confidence interval for the intercept of the straight line is

3.8859 ± t(0.025, 6) × 0.1929 = 3.8859 ± 0.4720

So the intercept of the line based on the transformed data is between 3.4139 and 4.358.

This says Ln A lies within these limits, so A lies between e^3.4139 = 30.4 and e^4.358 = 78.1 with 95% confidence. (Use your calculator to get back out of natural logs.)
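Both intervals in (v), and the step back out of natural logs, can be checked with the sketch below. The helper name is made up and the critical value 2.447 is typed in from tables. Notice that the limits for A come from exponentiating the limits for Ln A, so the interval for A is not symmetric about the estimate of A.

```python
import math

T_CRIT = 2.447  # t(0.025, 6), taken from tables


def confidence_interval(estimate: float, std_error: float, t_crit: float):
    """Return (lower, upper) limits: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


# Interval for the gradient k of the logged line.
k_low, k_high = confidence_interval(-0.086558, 0.007658, T_CRIT)

# Interval for the intercept Ln A, then exponentiate each limit to get A.
ln_a_low, ln_a_high = confidence_interval(3.8859, 0.1929, T_CRIT)
a_low, a_high = math.exp(ln_a_low), math.exp(ln_a_high)

print(f"k:    {k_low:.3f} to {k_high:.3f}")
print(f"Ln A: {ln_a_low:.4f} to {ln_a_high:.4f}")
print(f"A:    {a_low:.1f} to {a_high:.1f}")
```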
