5/12/99 252y9943 ECO252 QBA2 Name
FINAL EXAM Hour of Class Registered (Circle)
May 5, 1999 MWF 10 11 TR 12:30 2:00
Note: If this is the only thing you look at before taking the final, you are badly cheating yourself. Problems like 5e, 5f and 6 appeared on the 1998 final. People who used this final and did not read the problems carefully got very wrong answers to them.
Note: If you still think that a large p-value means that a coefficient is significant, you need a conference with an audiologist. Further note that a p-value is a probability and can only be compared with another probability (like the significance level).
Note: Have you reread “Things that You Should Never Do On a Statistics Exam …?” I think I could have graded this exam by just looking for violations of these rules.
I. (16 points) Do all the following.
- Hand in your fourth regression problem (2 points) and answer the following questions.
- For the regression of the number of hours of work against the number of machines, what coefficients are significant at the 1% level? Why? What about the 5% level? (2)
- Would you say that the regression of number of hours of work against the number of machines and months of experience is more successful than the regression against machines alone? Why? (3)
- What was the surprise that occurred when you did the stepwise regression? (2)
Solution: The rule on p-value:
If the p-value is less than the significance level, reject the null
hypothesis; if the p-value is greater than or equal to the significance
level, do not reject the null hypothesis.
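The rule above can be written as a one-line comparison. The following is only a sketch: the function name `decide` is made up for illustration, and the p-values are the rounded ones that Minitab prints for the regression of 'Hours' against 'Number'.

```python
def decide(p_value: float, alpha: float) -> str:
    """Apply the p-value rule: reject H0 only when p-value < significance level."""
    if not (0.0 <= p_value <= 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("a p-value can only be compared with another probability")
    return "reject H0" if p_value < alpha else "do not reject H0"


# Rounded p-values from the printout (0.000 means "less than .0005", not exactly zero).
for name, p in [("Constant", 0.939), ("Number", 0.000)]:
    for alpha in (0.01, 0.05):
        print(f"{name}: p = {p:.3f}, alpha = {alpha:.2f} -> {decide(p, alpha)}")
```

Note that a p-value exactly equal to the significance level falls on the 'do not reject' side of the rule.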
a) part of the printout follows: (For the entire printout see 252x9943.)
MTB > print c1-c4
Data Display
Row Hours Number Exper Inter
1 1.0 1 12 12
2 3.1 3 8 24
3 17.0 10 5 50
4 14.0 8 2 16
5 6.0 5 10 50
6 1.8 1 1 1
7 11.5 10 10 100
8 9.3 5 2 10
9 6.0 4 6 24
10 12.2 10 18 180
MTB > brief 3
MTB > regress c1 on 1 c2 'resid''pred'
Regression Analysis
The regression equation is
Hours = 0.10 + 1.42 Number
Predictor Coef Stdev t-ratio p
Constant 0.101 1.267 0.08 0.939
Number 1.4192 0.1908 7.44 0.000
s = 2.056 R-sq = 87.4% R-sq(adj) = 85.8%
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 233.84 233.84 55.30 0.000
Error 8 33.83 4.23
Total 9 267.67
Obs. Number Hours Fit Stdev.Fit Residual St.Resid
1 1.0 1.000 1.520 1.108 -0.520 -0.30
2 3.0 3.100 4.358 0.830 -1.258 -0.67
3 10.0 17.000 14.293 1.047 2.707 1.53
4 8.0 14.000 11.454 0.785 2.546 1.34
5 5.0 6.000 7.197 0.664 -1.197 -0.61
6 1.0 1.800 1.520 1.108 0.280 0.16
7 10.0 11.500 14.293 1.047 -2.793 -1.58
8 5.0 9.300 7.197 0.664 2.103 1.08
9 4.0 6.000 5.777 0.727 0.223 0.12
10 10.0 12.200 14.293 1.047 -2.093 -1.18
Since the p-value column has .939 as the p-value for 0.101, the coefficient ‘Constant,’ and .939 is above the significance levels of .01 and .05, we can say that the constant is not significant at these levels. If you look up the values of t for 8 degrees of freedom with .005 and .025 in each tail (3.355 and 2.306), you will find that the t-ratio of 0.08 is less than either table value.
Likewise, since the p-value column has zero as the p-value for 1.4192, the coefficient of ‘Number,’ and zero is below the significance levels of .01 and .05, we can say that the coefficient of ‘Number’ is significant at these levels. If you look up the values of t for 8 degrees of freedom with .005 and .025 in each tail, you will find that the t-ratio of 7.44 is more than either table value.
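The coefficients and t-ratios in the printout can be checked by hand from the ten data points. The sketch below uses the usual simple-regression formulas in plain Python; the variable names are illustrative, and the table value 2.306 (8 degrees of freedom, .025 in each tail) is typed in from a t table rather than computed.

```python
import math

# Data from the 'print c1-c4' display: Hours (c1) and Number (c2).
hours = [1.0, 3.1, 17.0, 14.0, 6.0, 1.8, 11.5, 9.3, 6.0, 12.2]
number = [1, 3, 10, 8, 5, 1, 10, 5, 4, 10]
n = len(hours)

x_bar = sum(number) / n
y_bar = sum(hours) / n
sxx = sum((x - x_bar) ** 2 for x in number)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(number, hours))
syy = sum((y - y_bar) ** 2 for y in hours)   # total sum of squares

b1 = sxy / sxx                               # slope (coefficient of Number)
b0 = y_bar - b1 * x_bar                      # constant
sse = syy - b1 * sxy                         # error sum of squares
s = math.sqrt(sse / (n - 2))                 # standard error of the estimate
se_b1 = s / math.sqrt(sxx)
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)
t_b1 = b1 / se_b1
t_b0 = b0 / se_b0
r_sq = 1 - sse / syy

T_025_8DF = 2.306  # t table value, typed in by hand

print(f"Hours = {b0:.3f} + {b1:.4f} Number, s = {s:.3f}, R-sq = {r_sq:.1%}")
print(f"t(Constant) = {t_b0:.2f}, significant at 5%: {abs(t_b0) > T_025_8DF}")
print(f"t(Number)   = {t_b1:.2f}, significant at 5%: {abs(t_b1) > T_025_8DF}")
```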
b) part of the printout follows:
MTB > regress c1 on 2 c2 c3 'resid''pred'
Regression Analysis
The regression equation is
Hours = 1.62 + 1.53 Number - 0.293 Exper
Predictor Coef Stdev t-ratio p
Constant 1.6191 0.9746 1.66 0.141
Number 1.5333 0.1335 11.49 0.000
Exper -0.29311 0.09019 -3.25 0.014
s = 1.388 R-sq = 95.0% R-sq(adj) = 93.5%
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 254.19 127.09 65.99 0.000
Error 7 13.48 1.93
Total 9 267.67
SOURCE DF SEQ SS
Number 1 233.84
Exper 1 20.34
Obs. Number Hours Fit Stdev.Fit Residual St.Resid
1 1.0 1.000 -0.365 0.946 1.365 1.34
2 3.0 3.100 3.874 0.579 -0.774 -0.61
3 10.0 17.000 15.487 0.796 1.513 1.33
4 8.0 14.000 13.299 0.776 0.701 0.61
5 5.0 6.000 6.355 0.518 -0.355 -0.28
6 1.0 1.800 2.859 0.854 -1.059 -0.97
7 10.0 11.500 14.021 0.712 -2.521 -2.12R
8 5.0 9.300 8.699 0.644 0.601 0.49
9 4.0 6.000 5.994 0.495 0.006 0.00
10 10.0 12.200 11.676 1.071 0.524 0.59
R denotes an obs. with a large st. resid.
This is one of the few really good ones. R-squared and R-squared adjusted went up and, equally important, the coefficient of ‘Number’ remained significant while the coefficient of ‘Exper’ was significant at the 5% level (its p-value of .014 is below .05, though not below .01).
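As a check on the comparison, R-squared adjusted can be recomputed from the two Analysis of Variance tables. This is a sketch using the standard formula, one minus the ratio of the error mean square to the total mean square; the function name is made up for illustration.

```python
def adjusted_r_squared(sse: float, sst: float, n: int, k: int) -> float:
    """R-sq(adj) = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)] for k independent variables."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Error and Total sums of squares from the two printouts (n = 10 observations).
machines_only = adjusted_r_squared(sse=33.83, sst=267.67, n=10, k=1)
machines_and_exper = adjusted_r_squared(sse=13.48, sst=267.67, n=10, k=2)

print(f"Number only:      R-sq(adj) = {machines_only:.1%}")
print(f"Number and Exper: R-sq(adj) = {machines_and_exper:.1%}")
```

Unlike plain R-squared, this measure can fall when a useless variable is added, which is why its rise here counts as evidence for the second regression.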
c) part of the printout follows:
Stepwise Regression
F-to-Enter: 4.00 F-to-Remove: 4.00
Response is Hours on 3 predictors, with N = 10
Step 1 2
Constant 0.1005 -0.4758
Number 1.42 1.85
T-Ratio 7.44 10.81
Inter -0.040
T-Ratio -3.56
S 2.06 1.31
R-Sq 87.36 95.51
More? (Yes, No, Subcommand, or Help)
SUBC> y
No variables entered or removed
More? (Yes, No, Subcommand, or Help)
SUBC> n
MTB > print c1-c6
This was a surprise! Though the coefficients in column 1 are the same as those in the regression of ‘Hours’ against ‘Number’ above, the variable it brought in in column 2 is the interaction variable, ‘Inter.’ I had assumed that it would only help the other two variables to explain ‘Hours,’ but the computer refused to bring in ‘Exper.’
2. The following pages show the regression of the variable 'mins', the winning time in minutes in a triathlon, against some of the following independent variables:
'female' A dummy variable that is 1 if the contestant is female.
'swim' Number of miles of swimming
'bike' Number of miles of biking
'run' Number of miles of running
c6 ‘swim’ multiplied by ‘female’
c7 ‘bike’ multiplied by ‘female’
c8 ‘run’ multiplied by ‘female’
c9 ‘swim’ squared
c10 ‘bike’ squared
c11 ‘run’ squared
a. In the regression of ‘mins’ against ‘female’, ‘swim’, ‘bike’ and ‘run’, which coefficients have signs that look wrong? Why? Which coefficients are not significant at the 99% confidence level? (3)
b. Look at the regression of ‘mins’ against ‘run’, c8 and c11 and the regression of ‘mins’ against ‘run’ and c8. Use $\alpha = .10$. Does either seem to be an improvement over the regression of ‘mins’ against ‘run’ alone? Why? (2)
c. Explain the meaning of the F test in the regression of ‘mins’ against ‘female’, ‘swim’, ‘bike’ and ‘run’. What is being tested and what are the conclusions? (2)
d. The printout concludes with a printout of the data and of a correlation matrix. What does this suggest about the problems that are occurring with these regressions? (2)
Solution: The printout enclosed with the exam follows:
Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\LR13-49.MTW'.
Retrieving worksheet from file: C:\MINITAB\LR13-49.MTW
Worksheet was saved on 5/ 3/1999
MTB > regress c1 on 4 c2 c3 c4 c5
Regression Analysis
The regression equation is
mins = - 24.6 + 35.5 female - 25.0 swim + 7.13 bike - 6.37 run
Predictor Coef Stdev t-ratio p
Constant -24.57 20.13 -1.22 0.241
female 35.47 14.77 2.40 0.030
swim -25.01 45.75 -0.55 0.593
bike 7.130 1.331 5.36 0.000
run -6.372 5.384 -1.18 0.255
s = 33.02 R-sq = 98.0% R-sq(adj) = 97.4%
a) There is no reason to expect the constant to be positive or negative. However, in this equation, a negative coefficient for ‘swim’, ‘bike’ or ‘run’ would lead us to believe that an extra mile of swimming, biking or running would lead to a faster time, so the negative coefficients of ‘swim’ and ‘run’ look wrong. The positive coefficient for ‘female’ is expected, because, at least at present, women’s times in speed events have been longer than men’s. (However, at the rate that women’s times in athletic events are falling this may not always be true!)
The constant and the coefficients of ‘female’, ‘swim’ and ‘run’ have p-values above .01. If the confidence level is 99%, the significance level is .01, so these coefficients are not significant. Only the coefficient of ‘bike’ is significant at this level.
Analysis of Variance
SOURCE DF SS MS F p
Regression 4 786104 196526 180.29 0.000
Error 15 16351 1090
Total 19 802455
c) The F test here is a test of the null hypothesis that the independent variables do not explain the dependent variable. The low p-value means that we reject the null hypothesis.
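The F ratio in the Analysis of Variance table is just the regression mean square divided by the error mean square. A minimal sketch of the arithmetic follows; the 5% table value of 3.06 for 4 and 15 degrees of freedom is typed in from an F table, not computed.

```python
# Sums of squares from the ANOVA table for mins against female, swim, bike and run.
ss_regression = 786104
ss_error = 16351
k = 4    # number of independent variables
n = 20   # number of observations

ms_regression = ss_regression / k
ms_error = ss_error / (n - k - 1)
f_ratio = ms_regression / ms_error

F_05_4_15 = 3.06  # F table value for (4, 15) degrees of freedom, typed in by hand

print(f"F = {ms_regression:.0f} / {ms_error:.0f} = {f_ratio:.2f}")
print("reject H0" if f_ratio > F_05_4_15 else "do not reject H0")
```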
SOURCE DF SEQ SS
female 1 6291
swim 1 726098
bike 1 52189
run 1 1526
Unusual Observations
Obs. female mins Fit Stdev.Fit Residual St.Resid
1 0.00 489.25 547.00 17.48 -57.75 -2.06R
18 1.00 660.48 582.47 17.48 78.01 2.79R
R denotes an obs. with a large st. resid.
MTB > regress c1 on 1 c5
Regression Analysis
The regression equation is
mins = - 19.2 + 23.6 run
Predictor Coef Stdev t-ratio p
Constant -19.25 23.19 -0.83 0.417
run 23.615 1.582 14.92 0.000
s = 57.74 R-sq = 92.5% R-sq(adj) = 92.1%
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 742445 742445 222.69 0.000
Error 18 60011 3334
Total 19 802455
Unusual Observations
Obs. run mins Fit Stdev.Fit Residual St.Resid
1 26.2 489.2 599.5 25.7 -110.2 -2.13R
12 18.6 589.1 420.0 16.4 169.1 3.05R
R denotes an obs. with a large st. resid.
MTB > regress c1 on 2 c5 c8
Regression Analysis
The regression equation is
mins = - 19.2 + 22.1 run + 3.02 C8
Predictor Coef Stdev t-ratio p
Constant -19.25 21.83 -0.88 0.390
run 22.106 1.705 12.96 0.000
C8 3.017 1.659 1.82 0.087
s = 54.36 R-sq = 93.7% R-sq(adj) = 93.0%
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 752216 376108 127.27 0.000
Error 17 50240 2955
Total 19 802455
SOURCE DF SEQ SS
run 1 742445
C8 1 9771
Unusual Observations
Obs. run mins Fit Stdev.Fit Residual St.Resid
2 18.6 505.1 391.9 21.9 113.2 2.27R
11 26.2 540.9 639.0 32.5 -98.1 -2.25R
12 18.6 589.1 448.0 21.9 141.0 2.83R
R denotes an obs. with a large st. resid.
MTB > regress c1 on 2 c5 c11
Regression Analysis
The regression equation is
mins = - 102 + 39.6 run - 0.519 C11
Predictor Coef Stdev t-ratio p
Constant -101.71 49.18 -2.07 0.054
run 39.550 8.654 4.57 0.000
C11 -0.5192 0.2778 -1.87 0.079
s = 54.11 R-sq = 93.8% R-sq(adj) = 93.1%
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 752675 376337 128.52 0.000
Error 17 49780 2928
Total 19 802455
SOURCE DF SEQ SS
run 1 742445
C11 1 10230
Unusual Observations
Obs. run mins Fit Stdev.Fit Residual St.Resid
12 18.6 589.1 454.3 24.0 134.8 2.78R
R denotes an obs. with a large st. resid.
MTB > regress c1 on 3 c5 c8 c11
Regression Analysis
The regression equation is
mins = - 102 + 38.0 run + 3.02 C8 - 0.519 C11
Predictor Coef Stdev t-ratio p
Constant -101.71 45.45 -2.24 0.040
run 38.042 8.033 4.74 0.000
C8 3.017 1.526 1.98 0.066
C11 -0.5192 0.2567 -2.02 0.060
s = 50.01 R-sq = 95.0% R-sq(adj) = 94.1%
b) Some people have all the luck! Unlike most of the choices I gave, these are really good. Not only do R-squared and R-squared adjusted go up relative to the regression against ‘run’ alone (R-sq = 92.5%, R-sq(adj) = 92.1%), but the p-values are all below 10% and the signs of the coefficients are reasonable. So yes, these are probably improvements.
Analysis of Variance
SOURCE DF SS MS F p
Regression 3 762446 254149 101.64 0.000
Error 16 40009 2501
Total 19 802455
SOURCE DF SEQ SS
run 1 742445
C8 1 9771
C11 1 10230
Unusual Observations
Obs. run mins Fit Stdev.Fit Residual St.Resid
12 18.6 589.1 482.3 26.3 106.7 2.51R
R denotes an obs. with a large st. resid.
MTB > print c1-c11
Data Display
Row mins female swim bike run C6 C7 C8 C9
1 489.250 0 2.40 112.0 26.2 0.00 0.0 0.0 5.7600
2 505.150 0 2.00 100.0 18.6 0.00 0.0 0.0 4.0000
3 245.500 0 1.20 55.3 13.1 0.00 0.0 0.0 1.4400
4 204.400 0 1.50 48.0 10.0 0.00 0.0 0.0 2.2500
5 114.533 0 0.93 24.8 6.2 0.00 0.0 0.0 0.8649
6 108.267 0 0.93 24.8 6.2 0.00 0.0 0.0 0.8649
7 79.417 0 0.50 18.0 5.0 0.00 0.0 0.0 0.2500
8 566.500 0 2.40 112.0 26.2 0.00 0.0 0.0 5.7600
9 74.983 0 0.50 20.0 4.0 0.00 0.0 0.0 0.2500
10 116.117 0 0.60 25.0 6.2 0.00 0.0 0.0 0.3600
11 540.933 1 2.40 112.0 26.2 2.40 112.0 26.2 5.7600
12 589.067 1 2.00 100.0 18.6 2.00 100.0 18.6 4.0000
13 280.100 1 1.20 55.3 13.1 1.20 55.3 13.1 1.4400
14 235.033 1 1.50 48.0 10.0 1.50 48.0 10.0 2.2500
15 127.167 1 0.93 24.8 6.2 0.93 24.8 6.2 0.8649
16 120.750 1 0.93 24.8 6.2 0.93 24.8 6.2 0.8649
17 90.317 1 0.50 18.0 5.0 0.50 18.0 5.0 0.2500
18 660.483 1 2.40 112.0 26.2 2.40 112.0 26.2 5.7600
19 83.150 1 0.50 20.0 4.0 0.50 20.0 4.0 0.2500
20 131.817 1 0.60 25.0 6.2 0.60 25.0 6.2 0.3600
Row C10 C11
1 12544.0 686.44
2 10000.0 345.96
3 3058.1 171.61
4 2304.0 100.00
5 615.0 38.44
6 615.0 38.44
7 324.0 25.00
8 12544.0 686.44
9 400.0 16.00
10 625.0 38.44
11 12544.0 686.44
12 10000.0 345.96
13 3058.1 171.61
14 2304.0 100.00
15 615.0 38.44
16 615.0 38.44
17 324.0 25.00
18 12544.0 686.44
19 400.0 16.00
20 625.0 38.44
MTB > Correlation c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11.
Correlations (Pearson)
mins female swim bike run C6 C7 C8
female 0.089
swim 0.951 0.000
bike 0.984 0.000 0.973
run 0.962 0.000 0.965 0.985
C6 0.510 0.792 0.432 0.420 0.417
C7 0.584 0.716 0.480 0.494 0.487 0.982
C8 0.564 0.726 0.470 0.479 0.487 0.980 0.993
C9 0.956 0.000 0.985 0.979 0.982 0.426 0.483 0.478
C10 0.975 0.000 0.954 0.989 0.983 0.412 0.488 0.478
C11 0.928 0.000 0.932 0.954 0.985 0.403 0.471 0.479
C9 C10
C10 0.982
C11 0.974 0.975
d) There are many correlations above .9 here. These indicate that collinearity is a problem and that many regressions will give coefficients that are insignificant or have unreasonable signs.
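To see where the large correlations come from, the entries for ‘swim’, ‘bike’ and ‘run’ can be recomputed from the data display. The sketch below uses only the ten distinct courses (rows 1 to 10); rows 11 to 20 repeat the same distances for the women, and repeating every observation does not change a Pearson correlation. The helper name `pearson` is made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Course distances in miles, rows 1-10 of the data display.
swim = [2.40, 2.00, 1.20, 1.50, 0.93, 0.93, 0.50, 2.40, 0.50, 0.60]
bike = [112.0, 100.0, 55.3, 48.0, 24.8, 24.8, 18.0, 112.0, 20.0, 25.0]
run = [26.2, 18.6, 13.1, 10.0, 6.2, 6.2, 5.0, 26.2, 4.0, 6.2]

r_swim_bike = pearson(swim, bike)
r_swim_run = pearson(swim, run)
r_bike_run = pearson(bike, run)

for label, r in [("swim, bike", r_swim_bike), ("swim, run", r_swim_run), ("bike, run", r_bike_run)]:
    print(f"r({label}) = {r:.3f}")
```

Long courses are long in all three events, so the three distances carry nearly the same information and the regression cannot separate their effects.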
II. Do at least 4 of the following 7 Problems (at least 15 each) (or do sections adding to at least 60 points. Anything extra you do helps, and grades wrap around). Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise.
Note: These problems involve comparing population means, variances, proportions or medians. To do this you use sample means, variances or proportions. If you look at a problem and tell me that the sample means, variances, proportions or medians differ without incorporating them in a test, you are wasting both your time and mine, and it is possible that my annoyance will affect how I grade the rest of your exam, since I will now suspect that you have no idea what a significant difference is or what we mean by a statistical test!
1. a. Premiums on a group of 11 closed-end mutual funds were as follows. (These are in per cent, but that shouldn’t affect your analysis.) Test the hypothesis that the mean is 3 per cent using (i) either a test ratio or a critical value and (ii) a confidence interval. (6)
+4.7 -0.7 +5.3 +9.2 -0.3 -0.3 +5.0 +0.4 -1.9 +0.5 -3.1
b. Test that the following data (i) has a Poisson distribution (6) and (ii) has a Poisson distribution with a mean of 4.5 (6). If you do both parts, do only one with a chi-square method.
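Solution: a) From Table 3 of the Syllabus Supplement:
Interval for / Confidence Interval / Hypotheses / Test Ratio / Critical Value
Mean ($\sigma$ unknown) / $\mu = \bar{x} \pm t_{\alpha/2}\, s_{\bar{x}}$ / $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$ / $t = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}}$ / $\bar{x}_{cv} = \mu_0 \pm t_{\alpha/2}\, s_{\bar{x}}$
$H_0: \mu = 3$ and $H_1: \mu \ne 3$, with $n = 11$, $\alpha = .05$, $n - 1 = 10$ degrees of freedom and $t_{.025}^{(10)} = 2.228$.
  $x$      $x^2$
 +4.7    22.09
 -0.7     0.49
 +5.3    28.09
 +9.2    84.64
 -0.3     0.09
 -0.3     0.09
 +5.0    25.00
 +0.4     0.16
 -1.9     3.61
 +0.5     0.25
 -3.1     9.61
 18.8   174.12
Since $\sum x = 18.8$ and $\sum x^2 = 174.12$, we find $\bar{x} = \dfrac{18.8}{11} = 1.709$ and $s^2 = \dfrac{\sum x^2 - n\bar{x}^2}{n-1} = \dfrac{174.12 - 11(1.709)^2}{10} = 14.199$, so $s = 3.768$ and $s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{3.768}{\sqrt{11}} = 1.136$.
(i) Test Ratio: $t = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}} = \dfrac{1.709 - 3}{1.136} = -1.136$. This is in the ‘accept’ region between $\pm t_{.025}^{(10)} = \pm 2.228$, so do not reject $H_0$.
Critical Value: Since this is a 2-sided test, $\bar{x}_{cv} = \mu_0 \pm t_{\alpha/2}\, s_{\bar{x}} = 3 \pm 2.228(1.136) = 3 \pm 2.531$, or 0.469 to 5.531. This means that we reject $H_0$ if the sample mean is above 5.531 or below 0.469. Since $\bar{x} = 1.709$ is between these critical values, do not reject $H_0$.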
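(ii) Confidence Interval: Since this is a two-sided test, $\mu = \bar{x} \pm t_{\alpha/2}\, s_{\bar{x}} = 1.709 \pm 2.228(1.136) = 1.709 \pm 2.531$, or $-0.822$ to $4.240$. This does not contradict $H_0: \mu = 3$, because 3 is between $-0.822$ and $4.240$, so do not reject $H_0$.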
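All three versions of the test in part a) use the same three ingredients: the sample mean, its standard error, and the table value of t. The sketch below reproduces the arithmetic from the eleven premiums; the variable names are illustrative, and 2.228 (10 degrees of freedom, .025 in each tail) is typed in from a t table rather than computed.

```python
import math

premiums = [4.7, -0.7, 5.3, 9.2, -0.3, -0.3, 5.0, 0.4, -1.9, 0.5, -3.1]
mu_0 = 3.0        # hypothesized mean under H0
T_TABLE = 2.228   # t table value for 10 degrees of freedom, typed in by hand

n = len(premiums)
x_bar = sum(premiums) / n
s_squared = (sum(x * x for x in premiums) - n * x_bar ** 2) / (n - 1)
s_xbar = math.sqrt(s_squared / n)          # standard error of the mean

t_ratio = (x_bar - mu_0) / s_xbar
half_width = T_TABLE * s_xbar
critical_values = (mu_0 - half_width, mu_0 + half_width)        # centered on mu_0
confidence_interval = (x_bar - half_width, x_bar + half_width)  # centered on x_bar

print(f"x_bar = {x_bar:.3f}, s_xbar = {s_xbar:.3f}, t = {t_ratio:.3f}")
print(f"critical values for x_bar: {critical_values[0]:.3f} to {critical_values[1]:.3f}")
print(f"confidence interval for mu: {confidence_interval[0]:.3f} to {confidence_interval[1]:.3f}")

reject = abs(t_ratio) > T_TABLE
print("reject H0" if reject else "do not reject H0")
```

The critical values are centered on the hypothesized mean while the confidence interval is centered on the sample mean; the two intervals have the same width, which is why the three methods always agree.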
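b) (i) Since the parameter is unknown, the chi-squared method is the only possible method. To find the mean, sum $xO$ and divide by $n$, which is the sum of $O$. The actual comparison can be done by either summing $\dfrac{(E-O)^2}{E}$ or by summing $\dfrac{O^2}{E}$ and subtracting $n$.
$\sum xO = 1602$ and $n = \sum O = 400$, so $\bar{x} = \dfrac{1602}{400} = 4.005 \approx 4.0$. We look up probabilities in the Poisson table for a mean of 4.0 (these are in the last column of the table below) and multiply them by $n = 400$ to get $E$.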
 x     O      E         E-O       (E-O)^2   (E-O)^2/E    O^2/E    Poisson prob.
 0    23     7.3264   -15.6736    245.662    33.5310     72.205    0.018316
 1    19    29.3052    10.3052    106.197     3.6238     12.319    0.073263
 2    42    58.6100    16.6100    275.892     4.7073     30.097    0.146525
 3    60    78.1468    18.1468    329.306     4.2139     46.067    0.195367
 4    89    78.1468   -10.8532    117.792     1.5073    101.361    0.195367
 5    79    62.5172   -16.4828    271.683     4.3457     99.829    0.156293
 6    48    41.6784    -6.3216     39.963     0.9588     55.280    0.104196
 7    40    23.8160   -16.1840    261.922    10.9977     67.182    0.059540
 8+    0    20.4520    20.4520    418.284    20.4520      0.000    0.051130
Sum  400   399.9988    -0.0012               84.3388    484.339
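So $\chi^2 = \sum \dfrac{(E-O)^2}{E} = 84.3388$ or $\chi^2 = \sum \dfrac{O^2}{E} - n = 484.339 - 400 = 84.339$. Since there are 9 items in the comparison and we have used the data to estimate 1 parameter, the degrees of freedom are $9 - 1 - 1 = 7$. Since $\chi^2 = 84.339$ is above $\chi^{2(7)}_{.05} = 14.0671$, we reject $H_0$.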
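The chi-squared computation can be sketched in a few lines. This version assumes the estimated mean is rounded to 4.0, as in the table lookup above, and computes the Poisson probabilities directly instead of reading them from a table, so the result can differ from the hand calculation in the third decimal place. The table value 14.0671 is typed in from a chi-squared table.

```python
import math

observed = [23, 19, 42, 60, 89, 79, 48, 40, 0]   # counts for x = 0, 1, ..., 7, 8+
n = sum(observed)

# Estimate the mean from the data, then round it as the table lookup does.
estimated_mean = sum(x * o for x, o in enumerate(observed)) / n
mean = round(estimated_mean, 1)

# Poisson probabilities for x = 0..7; the last cell (8+) gets the remaining probability.
probs = [math.exp(-mean) * mean ** x / math.factorial(x) for x in range(8)]
probs.append(1.0 - sum(probs))
expected = [n * p for p in probs]

chi_sq = sum((e - o) ** 2 / e for o, e in zip(observed, expected))
chi_sq_shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - n

degrees_of_freedom = len(observed) - 1 - 1   # cells - 1 - estimated parameters
CHI_SQ_05_7DF = 14.0671                      # chi-squared table value, typed in by hand

print(f"estimated mean = {estimated_mean:.3f}, chi-squared = {chi_sq:.3f}, DF = {degrees_of_freedom}")
print("reject H0" if chi_sq > CHI_SQ_05_7DF else "do not reject H0")
```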
(ii) We use the Kolmogorov-Smirnov method, though the Chi-squared method could also be used if you did not do part i. The $F_e$ column is the cumulative distribution from the Poisson table with a mean of 4.5, $F_o$ is the cumulative $O$ divided by $n = 400$, and $D = |F_o - F_e|$.
 x     O    Cumulative O    F_o       F_e       D
 0    23        23         .05750    .01111    .0464
 1    19        42         .10500    .06110    .0439
 2    42        84         .21000    .17358    .0364
 3    60       144         .36000    .34230    .0177
 4    89       233         .58250    .53210    .0504
 5    79       312         .78000    .70293    .0771
 6    48       360         .90000    .83105    .0690
 7    40       400        1.00000    .91341    .0866
 8+    0       400        1.00000   1.00000    .0000
Sum  400
From the Kolmogorov-Smirnov table, the critical value for a 95% confidence level is $\dfrac{1.36}{\sqrt{n}} = \dfrac{1.36}{\sqrt{400}} = .068$. Since the largest number in $D$, .0866, is above this value, we reject $H_0$.
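The Kolmogorov-Smirnov comparison is easy to reproduce. The sketch below computes the cumulative Poisson probabilities for a mean of 4.5 directly rather than reading them from a table, and it assumes the usual large-sample 5% critical value of 1.36 divided by the square root of n.

```python
import math

observed = [23, 19, 42, 60, 89, 79, 48, 40, 0]   # counts for x = 0, 1, ..., 7, 8+
n = sum(observed)
mean = 4.5                                       # mean specified by H0

# Observed cumulative distribution F_o.
f_o, running = [], 0
for o in observed:
    running += o
    f_o.append(running / n)

# Expected cumulative distribution F_e; the open-ended 8+ cell closes at 1.
f_e, cumulative = [], 0.0
for x in range(8):
    cumulative += math.exp(-mean) * mean ** x / math.factorial(x)
    f_e.append(cumulative)
f_e.append(1.0)

d = [abs(fo - fe) for fo, fe in zip(f_o, f_e)]
d_max = max(d)
critical_value = 1.36 / math.sqrt(n)   # large-sample 5% value (an assumption of this sketch)

print(f"largest D = {d_max:.4f} at x = {d.index(d_max)}, critical value = {critical_value:.3f}")
print("reject H0" if d_max > critical_value else "do not reject H0")
```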
Exam continues in 252z9943.