FINAL EXAM   Hour of Class Registered (Circle)   May 5, 1999   MWF 10   11   TR 12:30   2:00

5/12/99 252y9943 ECO252 QBA2   Name

Note: If this is the only thing you look at before taking the final, you are badly cheating yourself. Problems like 5e, 5f and 6 appeared on the 1998 final. People who used this final and did not read the problems carefully got very wrong answers to them.

Note: If you still think that a large p-value means that a coefficient is significant, you need a conference with an audiologist. Further note that a p-value is a probability and can only be compared with another probability (like the significance level).

Note: Have you reread “Things that You Should Never Do On a Statistics Exam …?” I think I could have graded this exam by just looking for violations of these rules.

I. (16 points) Do all the following.

  1. Hand in your fourth regression problem (2 points) and answer the following questions.
    a. For the regression of the number of hours of work against the number of machines, which coefficients are significant at the 1% level? Why? What about the 5% level? (2)
    b. Would you say that the regression of number of hours of work against the number of machines and months of experience is more successful than the regression against machines alone? Why? (3)
    c. What was the surprise that occurred when you did the stepwise regression? (2)

Solution: The rule on p-values:

If the p-value is less than the significance level, reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.

a) part of the printout follows: (For the entire printout see 252x9943.)

MTB > print c1-c4

Data Display

Row Hours Number Exper Inter

1 1.0 1 12 12

2 3.1 3 8 24

3 17.0 10 5 50

4 14.0 8 2 16

5 6.0 5 10 50

6 1.8 1 1 1

7 11.5 10 10 100

8 9.3 5 2 10

9 6.0 4 6 24

10 12.2 10 18 180

MTB > brief 3

MTB > regress c1 on 1 c2 'resid''pred'

Regression Analysis

The regression equation is

Hours = 0.10 + 1.42 Number

Predictor Coef Stdev t-ratio p

Constant 0.101 1.267 0.08 0.939

Number 1.4192 0.1908 7.44 0.000

s = 2.056 R-sq = 87.4% R-sq(adj) = 85.8%

Analysis of Variance

SOURCE DF SS MS F p

Regression 1 233.84 233.84 55.30 0.000

Error 8 33.83 4.23

Total 9 267.67

Obs. Number Hours Fit Stdev.Fit Residual St.Resid

1 1.0 1.000 1.520 1.108 -0.520 -0.30

2 3.0 3.100 4.358 0.830 -1.258 -0.67

3 10.0 17.000 14.293 1.047 2.707 1.53

4 8.0 14.000 11.454 0.785 2.546 1.34

5 5.0 6.000 7.197 0.664 -1.197 -0.61

6 1.0 1.800 1.520 1.108 0.280 0.16

7 10.0 11.500 14.293 1.047 -2.793 -1.58

8 5.0 9.300 7.197 0.664 2.103 1.08

9 4.0 6.000 5.777 0.727 0.223 0.12

10 10.0 12.200 14.293 1.047 -2.093 -1.18

Since the p-value column has .939 as the p-value for 0.101, the coefficient ‘Constant,’ and .939 is above the significance levels of .01 and .05, we can say that the constant is not significant at these levels. Equivalently, if you look up the values of t for 8 degrees of freedom and tail areas of .005 and .025 (3.355 and 2.306), you will find that the t-ratio of 0.08 is less than both table values.

Likewise, since the p-value column has zero as the p-value for 1.4192, the coefficient of ‘Number,’ and zero is below the significance levels of .01 and .05, we can say that the coefficient of ‘Number’ is significant at these levels. If you look up the values of t for 8 degrees of freedom and tail areas of .005 and .025 (3.355 and 2.306), you will find that the t-ratio of 7.44 is more than both table values.
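The t-ratio comparison described above can be checked by hand. The following is a minimal Python sketch, not part of the original Minitab session, that refits Hours against Number from the data display and compares each t-ratio with the table values. The function name `simple_regression` is illustrative, and the critical values 2.306 and 3.355 are typed in from a t table for 8 degrees of freedom rather than computed.

```python
import math

# Data copied from the 'Data Display' printout (c1 = Hours, c2 = Number).
hours = [1.0, 3.1, 17.0, 14.0, 6.0, 1.8, 11.5, 9.3, 6.0, 12.2]
number = [1, 3, 10, 8, 5, 1, 10, 5, 4, 10]


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x, returning coefficients and t-ratios."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                      # standard error of estimate
    se_b1 = s / math.sqrt(sxx)                        # standard error of the slope
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)   # standard error of the constant
    return {"b0": b0, "b1": b1, "t_b0": b0 / se_b0, "t_b1": b1 / se_b1}


# Two-sided critical values of t for 8 degrees of freedom (from a t table).
T_CRIT_8DF = {0.05: 2.306, 0.01: 3.355}

fit = simple_regression(number, hours)
for name in ("b0", "b1"):
    t_ratio = fit["t_" + name]
    for alpha, t_crit in T_CRIT_8DF.items():
        verdict = "significant" if abs(t_ratio) > t_crit else "not significant"
        print(f"{name}: t-ratio = {t_ratio:.2f}, alpha = {alpha}: {verdict}")
```

Comparing |t| with the table value gives the same verdict as comparing the p-value with the significance level; the sketch uses the t table only because the standard library has no t distribution.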

b) part of the printout follows:

MTB > regress c1 on 2 c2 c3 'resid''pred'

Regression Analysis

The regression equation is

Hours = 1.62 + 1.53 Number - 0.293 Exper

Predictor Coef Stdev t-ratio p

Constant 1.6191 0.9746 1.66 0.141

Number 1.5333 0.1335 11.49 0.000

Exper -0.29311 0.09019 -3.25 0.014

s = 1.388 R-sq = 95.0% R-sq(adj) = 93.5%

Analysis of Variance

SOURCE DF SS MS F p

Regression 2 254.19 127.09 65.99 0.000

Error 7 13.48 1.93

Total 9 267.67

SOURCE DF SEQ SS

Number 1 233.84

Exper 1 20.34

Obs. Number Hours Fit Stdev.Fit Residual St.Resid

1 1.0 1.000 -0.365 0.946 1.365 1.34

2 3.0 3.100 3.874 0.579 -0.774 -0.61

3 10.0 17.000 15.487 0.796 1.513 1.33

4 8.0 14.000 13.299 0.776 0.701 0.61

5 5.0 6.000 6.355 0.518 -0.355 -0.28

6 1.0 1.800 2.859 0.854 -1.059 -0.97

7 10.0 11.500 14.021 0.712 -2.521 -2.12R

8 5.0 9.300 8.699 0.644 0.601 0.49

9 4.0 6.000 5.994 0.495 0.006 0.00

10 10.0 12.200 11.676 1.071 0.524 0.59

R denotes an obs. with a large st. resid.

This is one of the few really good ones. R-squared and R-squared adjusted went up (from 87.4% to 95.0% and from 85.8% to 93.5%) and, equally important, the coefficient of ‘Number’ remained significant while the coefficient of ‘Exper’ was significant at the 5% level (its p-value of .014 is below .05, though not below .01).
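To see where the R-squared figures in the two printouts come from, here is a short Python sketch that recomputes them from the error and total sums of squares in the two Analysis of Variance tables. The function names are illustrative; the formulas are the usual definitions of R-squared and R-squared adjusted for degrees of freedom.

```python
def r_squared(sse, sst):
    """Fraction of the total sum of squares explained by the regression."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """R-squared adjusted for degrees of freedom; k counts independent variables."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


SST = 267.67   # Total SS, the same in both printouts
N = 10         # number of observations

# (Error SS, number of independent variables) from each ANOVA table.
models = {"Number": (33.83, 1), "Number, Exper": (13.48, 2)}

results = {}
for label, (sse, k) in models.items():
    results[label] = (r_squared(sse, SST), adjusted_r_squared(sse, SST, N, k))
    rsq, rsq_adj = results[label]
    print(f"{label}: R-sq = {100 * rsq:.1f}%  R-sq(adj) = {100 * rsq_adj:.1f}%")
```

R-squared can never fall when a variable is added, so the rise in R-squared adjusted is the more informative of the two comparisons.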

c) part of the printout follows:

Stepwise Regression

F-to-Enter: 4.00 F-to-Remove: 4.00

Response is Hours on 3 predictors, with N = 10

Step 1 2

Constant 0.1005 -0.4758

Number 1.42 1.85

T-Ratio 7.44 10.81

Inter -0.040

T-Ratio -3.56

S 2.06 1.31

R-Sq 87.36 95.51

More? (Yes, No, Subcommand, or Help)

SUBC> y

No variables entered or removed

More? (Yes, No, Subcommand, or Help)

SUBC> n

MTB > print c1-c6

This was a surprise! Though the coefficients in step 1 are the same as those in the regression of ‘Hours’ against ‘Number’ above, the variable it brought in at step 2 is the interaction variable, ‘Inter.’ I had assumed that it would only help the other two variables to explain ‘Hours,’ but the computer refused to bring in ‘Exper.’

2. The following pages show the regression of the variable 'mins', the winning time in minutes in a triathlon, against some of the following independent variables:

'female'   A dummy variable that is 1 if the contestant is female.
'swim'     Number of miles of swimming
'bike'     Number of miles of biking
'run'      Number of miles of running
c6         ‘swim’ multiplied by ‘female’
c7         ‘bike’ multiplied by ‘female’
c8         ‘run’ multiplied by ‘female’
c9         ‘swim’ squared
c10        ‘bike’ squared
c11        ‘run’ squared

  a. In the regression of ‘mins’ against ‘female’, ‘swim’, ‘bike’ and ‘run’, which coefficients have signs that look wrong? Why? Which coefficients are not significant at the 99% confidence level? (3)
  b. Look at the regression of ‘mins’ against ‘run’, c8 and c11 and the regression of ‘mins’ against ‘run’ and c8. Use $\alpha = .10$. Does either seem to be an improvement over the regression of ‘mins’ against ‘run’ alone? Why? (2)
  c. Explain the meaning of the F test in the regression of ‘mins’ against ‘female’, ‘swim’, ‘bike’ and ‘run’. What is being tested and what are the conclusions? (2)
  d. The printout concludes with a printout of the data and of a correlation matrix. What does this suggest about the problems that are occurring with these regressions? (2)

Solution: The printout enclosed with the exam follows:

Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\LR13-49.MTW'.
Retrieving worksheet from file: C:\MINITAB\LR13-49.MTW
Worksheet was saved on 5/ 3/1999
MTB > regress c1 on 4 c2 c3 c4 c5
Regression Analysis

The regression equation is

mins = - 24.6 + 35.5 female - 25.0 swim + 7.13 bike - 6.37 run

Predictor Coef Stdev t-ratio p

Constant -24.57 20.13 -1.22 0.241

female 35.47 14.77 2.40 0.030

swim -25.01 45.75 -0.55 0.593

bike 7.130 1.331 5.36 0.000

run -6.372 5.384 -1.18 0.255

s = 33.02 R-sq = 98.0% R-sq(adj) = 97.4%

a) There is no reason to expect the constant to be positive or negative. However, in this equation the negative coefficients of ‘swim’ (–25.0) and ‘run’ (–6.37) look wrong, because a negative coefficient for ‘swim’, ‘bike’ or ‘run’ would lead us to believe that an extra mile of swimming, biking or running would lead to a faster time. The positive coefficient for ‘female’ is expected, because, at least at present, women’s times in speed events have been longer than men’s. (However, at the rate that women’s times in athletic events are falling, this may not always be true!)

The constant and the coefficients of ‘female’, ‘swim’ and ‘run’ have p-values above .01 (.241, .030, .593 and .255). If the confidence level is 99%, the significance level is .01, so these coefficients are not significant. Only the coefficient of ‘bike’ is significant at this level.
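The p-value rule is mechanical enough to write down as code. This is a small Python sketch, assuming the p-values are typed in from the printout above; the helper name `is_significant` is made up for illustration.

```python
def is_significant(p_value, alpha):
    """Reject the null hypothesis (coefficient = 0) only if p-value < alpha."""
    return p_value < alpha


# p-values copied from the regression of 'mins' on 'female', 'swim', 'bike', 'run'.
p_values = {
    "Constant": 0.241,
    "female": 0.030,
    "swim": 0.593,
    "bike": 0.000,
    "run": 0.255,
}

# A 99% confidence level corresponds to a significance level of .01.
not_significant_01 = [name for name, p in p_values.items() if not is_significant(p, 0.01)]
not_significant_05 = [name for name, p in p_values.items() if not is_significant(p, 0.05)]

print("Not significant at .01:", not_significant_01)
print("Not significant at .05:", not_significant_05)
```

Note that a p-value is only ever compared with another probability, here the significance level, never with a t-ratio or a coefficient.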

Analysis of Variance

SOURCE DF SS MS F p

Regression 4 786104 196526 180.29 0.000

Error 15 16351 1090

Total 19 802455

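c) The F test here is a test of the null hypothesis that the independent variables do not explain the dependent variable, that is, that the coefficients of ‘female’, ‘swim’, ‘bike’ and ‘run’ are all zero. The p-value of 0.000 is below the 5% significance level, so we reject the null hypothesis and conclude that the independent variables, taken together, do help to explain ‘mins’.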
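As a check on the Analysis of Variance table, the F ratio can be rebuilt from the sums of squares. The Python sketch below assumes the usual definition F = MSR/MSE with k and n - k - 1 degrees of freedom; the critical value 3.06 is typed in from an F table for 4 and 15 degrees of freedom at the 5% level rather than computed.

```python
def f_statistic(ss_regression, ss_error, k, n):
    """F ratio for the overall regression: mean square regression / mean square error."""
    ms_regression = ss_regression / k
    ms_error = ss_error / (n - k - 1)
    return ms_regression / ms_error


# Sums of squares from the ANOVA table; k = 4 independent variables, n = 20 rows.
f_ratio = f_statistic(ss_regression=786104, ss_error=16351, k=4, n=20)

F_TABLE_05_4_15 = 3.06   # table value, F(.05; 4, 15)

print(f"F = {f_ratio:.2f}")
print("Reject H0" if f_ratio > F_TABLE_05_4_15 else "Do not reject H0")
```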

SOURCE DF SEQ SS

female 1 6291

swim 1 726098

bike 1 52189

run 1 1526

Unusual Observations

Obs. female mins Fit Stdev.Fit Residual St.Resid

1 0.00 489.25 547.00 17.48 -57.75 -2.06R

18 1.00 660.48 582.47 17.48 78.01 2.79R

R denotes an obs. with a large st. resid.

MTB > regress c1 on 1 c5

Regression Analysis

The regression equation is

mins = - 19.2 + 23.6 run

Predictor Coef Stdev t-ratio p

Constant -19.25 23.19 -0.83 0.417

run 23.615 1.582 14.92 0.000

s = 57.74 R-sq = 92.5% R-sq(adj) = 92.1%

Analysis of Variance
SOURCE DF SS MS F p

Regression 1 742445 742445 222.69 0.000

Error 18 60011 3334

Total 19 802455

Unusual Observations

Obs. run mins Fit Stdev.Fit Residual St.Resid

1 26.2 489.2 599.5 25.7 -110.2 -2.13R

12 18.6 589.1 420.0 16.4 169.1 3.05R

R denotes an obs. with a large st. resid.

MTB > regress c1 on 2 c5 c8

Regression Analysis

The regression equation is

mins = - 19.2 + 22.1 run + 3.02 C8

Predictor Coef Stdev t-ratio p

Constant -19.25 21.83 -0.88 0.390

run 22.106 1.705 12.96 0.000

C8 3.017 1.659 1.82 0.087

s = 54.36 R-sq = 93.7% R-sq(adj) = 93.0%

Analysis of Variance

SOURCE DF SS MS F p

Regression 2 752216 376108 127.27 0.000

Error 17 50240 2955

Total 19 802455

SOURCE DF SEQ SS

run 1 742445

C8 1 9771

Unusual Observations

Obs. run mins Fit Stdev.Fit Residual St.Resid

2 18.6 505.1 391.9 21.9 113.2 2.27R

11 26.2 540.9 639.0 32.5 -98.1 -2.25R

12 18.6 589.1 448.0 21.9 141.0 2.83R

R denotes an obs. with a large st. resid.

MTB > regress c1 on 2 c5 c11

Regression Analysis

The regression equation is

mins = - 102 + 39.6 run - 0.519 C11

Predictor Coef Stdev t-ratio p

Constant -101.71 49.18 -2.07 0.054

run 39.550 8.654 4.57 0.000

C11 -0.5192 0.2778 -1.87 0.079

s = 54.11 R-sq = 93.8% R-sq(adj) = 93.1%

Analysis of Variance

SOURCE DF SS MS F p

Regression 2 752675 376337 128.52 0.000

Error 17 49780 2928

Total 19 802455

SOURCE DF SEQ SS

run 1 742445

C11 1 10230

Unusual Observations

Obs. run mins Fit Stdev.Fit Residual St.Resid

12 18.6 589.1 454.3 24.0 134.8 2.78R

R denotes an obs. with a large st. resid.

MTB > regress c1 on 3 c5 c8 c11

Regression Analysis

The regression equation is

mins = - 102 + 38.0 run + 3.02 C8 - 0.519 C11

Predictor Coef Stdev t-ratio p

Constant -101.71 45.45 -2.24 0.040

run 38.042 8.033 4.74 0.000

C8 3.017 1.526 1.98 0.066

C11 -0.5192 0.2567 -2.02 0.060

s = 50.01 R-sq = 95.0% R-sq(adj) = 94.1%

b) Some people have all the luck! Unlike most of the choices I gave, these are really good. Not only do R-squared and R-squared adjusted go up, but the p-values of the coefficients of the added variables are all below 10% and the signs of the coefficients are reasonable. So yes, these are probably improvements.

Analysis of Variance
SOURCE DF SS MS F p
Regression 3 762446 254149 101.64 0.000
Error 16 40009 2501

Total 19 802455

SOURCE DF SEQ SS

run 1 742445

C8 1 9771

C11 1 10230

Unusual Observations

Obs. run mins Fit Stdev.Fit Residual St.Resid

12 18.6 589.1 482.3 26.3 106.7 2.51R

R denotes an obs. with a large st. resid.

MTB > print c1-c11

Data Display

Row mins female swim bike run C6 C7 C8 C9

1 489.250 0 2.40 112.0 26.2 0.00 0.0 0.0 5.7600

2 505.150 0 2.00 100.0 18.6 0.00 0.0 0.0 4.0000

3 245.500 0 1.20 55.3 13.1 0.00 0.0 0.0 1.4400

4 204.400 0 1.50 48.0 10.0 0.00 0.0 0.0 2.2500

5 114.533 0 0.93 24.8 6.2 0.00 0.0 0.0 0.8649

6 108.267 0 0.93 24.8 6.2 0.00 0.0 0.0 0.8649

7 79.417 0 0.50 18.0 5.0 0.00 0.0 0.0 0.2500

8 566.500 0 2.40 112.0 26.2 0.00 0.0 0.0 5.7600

9 74.983 0 0.50 20.0 4.0 0.00 0.0 0.0 0.2500

10 116.117 0 0.60 25.0 6.2 0.00 0.0 0.0 0.3600

11 540.933 1 2.40 112.0 26.2 2.40 112.0 26.2 5.7600

12 589.067 1 2.00 100.0 18.6 2.00 100.0 18.6 4.0000

13 280.100 1 1.20 55.3 13.1 1.20 55.3 13.1 1.4400

14 235.033 1 1.50 48.0 10.0 1.50 48.0 10.0 2.2500

15 127.167 1 0.93 24.8 6.2 0.93 24.8 6.2 0.8649

16 120.750 1 0.93 24.8 6.2 0.93 24.8 6.2 0.8649

17 90.317 1 0.50 18.0 5.0 0.50 18.0 5.0 0.2500

18 660.483 1 2.40 112.0 26.2 2.40 112.0 26.2 5.7600

19 83.150 1 0.50 20.0 4.0 0.50 20.0 4.0 0.2500

20 131.817 1 0.60 25.0 6.2 0.60 25.0 6.2 0.3600

Row C10 C11

1 12544.0 686.44

2 10000.0 345.96

3 3058.1 171.61

4 2304.0 100.00

5 615.0 38.44

6 615.0 38.44

7 324.0 25.00

8 12544.0 686.44

9 400.0 16.00

10 625.0 38.44

11 12544.0 686.44

12 10000.0 345.96

13 3058.1 171.61

14 2304.0 100.00

15 615.0 38.44

16 615.0 38.44

17 324.0 25.00

18 12544.0 686.44

19 400.0 16.00

20 625.0 38.44

MTB > Correlation c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11.

Correlations (Pearson)

mins female swim bike run C6 C7 C8

female 0.089

swim 0.951 0.000

bike 0.984 0.000 0.973

run 0.962 0.000 0.965 0.985

C6 0.510 0.792 0.432 0.420 0.417

C7 0.584 0.716 0.480 0.494 0.487 0.982

C8 0.564 0.726 0.470 0.479 0.487 0.980 0.993

C9 0.956 0.000 0.985 0.979 0.982 0.426 0.483 0.478

C10 0.975 0.000 0.954 0.989 0.983 0.412 0.488 0.478

C11 0.928 0.000 0.932 0.954 0.985 0.403 0.471 0.479

C9 C10

C10 0.982

C11 0.974 0.975

d) There are many correlations above .9 here. These indicate that collinearity is a problem and that many regressions will give coefficients that are insignificant or have unreasonable signs.
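The correlations among the three distance variables can be reproduced directly from the data display. This Python sketch uses only rows 1 to 10 of ‘swim’, ‘bike’ and ‘run’, repeated once because rows 11 to 20 carry the same distances for the female winners; the function name `pearson` is illustrative.

```python
import math

# Rows 1-10 of the data display; rows 11-20 repeat the same distances.
swim = [2.40, 2.00, 1.20, 1.50, 0.93, 0.93, 0.50, 2.40, 0.50, 0.60] * 2
bike = [112.0, 100.0, 55.3, 48.0, 24.8, 24.8, 18.0, 112.0, 20.0, 25.0] * 2
run = [26.2, 18.6, 13.1, 10.0, 6.2, 6.2, 5.0, 26.2, 4.0, 6.2] * 2


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


pairs = {
    "swim-bike": pearson(swim, bike),
    "swim-run": pearson(swim, run),
    "bike-run": pearson(bike, run),
}
for label, r in pairs.items():
    flag = "collinearity warning" if abs(r) > 0.9 else "ok"
    print(f"{label}: r = {r:.3f} ({flag})")
```

The 0.9 cutoff here simply mirrors the rule of thumb used in the answer above; it is not a formal test.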

II. Do at least 4 of the following 7 problems (at least 15 points each), or do sections adding to at least 60 points. Anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise.

Note: These problems involve comparing population means, variances, proportions or medians. To do this you use sample means, variances or proportions. If you look at a problem and tell me that the sample means, variances, proportions or medians differ without incorporating them in a test, you are wasting both your time and mine, and it is possible that my annoyance will affect how I grade the rest of your exam, since I will now suspect that you have no idea what a significant difference is or what we mean by a statistical test!

1. a. Premiums on a group of 11 closed-end mutual funds were as follows. (These are in per cent, but that shouldn’t affect your analysis.) Test the hypothesis that the mean is 3 per cent using (i) either a test ratio or a critical value and (ii) a confidence interval. (6)

+4.7 -0.7 +5.3 +9.2 -0.3 -0.3 +5.0 +0.4 -1.9 +0.5 -3.1

b. Test that the following data (i) have a Poisson distribution (6) and (ii) have a Poisson distribution with a mean of 4.5 (6). If you do both parts, do only one with a chi-square method.

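Solution: a) From Table 3 of the Syllabus Supplement:

Interval for: Mean ($\sigma$ unknown)
Confidence Interval: $\mu = \bar{x} \pm t_{\alpha/2}\, s_{\bar{x}}$
Hypotheses: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$
Test Ratio: $t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$
Critical Value: $\bar{x}_{cv} = \mu_0 \pm t_{\alpha/2}\, s_{\bar{x}}$

Here $H_0: \mu = 3$, $H_1: \mu \ne 3$, $n = 11$, $\alpha = .05$, $DF = n - 1 = 10$ and $t_{.025}^{(10)} = 2.228$. From the data listed below, $\bar{x} = 1.709$ and $s_{\bar{x}} = 1.136$.

(i) Test Ratio: $t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}} = \frac{1.709 - 3}{1.136} = -1.136$. This is in the ‘accept’ region between $\pm t_{.025}^{(10)} = \pm 2.228$, so do not reject $H_0$.

Critical Value: Since this is a 2-sided test, $\bar{x}_{cv} = \mu_0 \pm t_{\alpha/2}\, s_{\bar{x}} = 3 \pm 2.228(1.136) = 3 \pm 2.531$, or 0.469 to 5.531. This means that we reject $H_0$ if the sample mean is above 5.531 or below 0.469. Since $\bar{x} = 1.709$ is between these critical values, do not reject $H_0$.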

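   x       x^2
 +4.7     22.09
 -0.7      0.49
 +5.3     28.09
 +9.2     84.64
 -0.3      0.09
 -0.3      0.09
 +5.0     25.00
 +0.4      0.16
 -1.9      3.61
 +0.5      0.25
 -3.1      9.61
 ----    ------
 18.8    174.12

Since $\sum x = 18.8$ and $\sum x^2 = 174.12$, we find $\bar{x} = \frac{\sum x}{n} = \frac{18.8}{11} = 1.709$ and $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{174.12 - 11(1.709)^2}{10} = 14.199$, so that $s = 3.768$ and $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{3.768}{\sqrt{11}} = 1.136$.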

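(ii) Confidence Interval: Since this is a two-sided test, $\mu = \bar{x} \pm t_{\alpha/2}\, s_{\bar{x}} = 1.709 \pm 2.228(1.136) = 1.709 \pm 2.531$, or $-0.822 \le \mu \le 4.240$. This does not contradict $H_0: \mu = 3$, because 3 is between –0.822 and 4.240, so do not reject $H_0$.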
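All three versions of the test in part a) use the same three ingredients: the sample mean, its standard error and one value from the t table. The Python sketch below runs the three side by side under those assumptions; the table value 2.228 (10 degrees of freedom, .025 in each tail) is typed in, and the variable names are illustrative.

```python
import math

premiums = [4.7, -0.7, 5.3, 9.2, -0.3, -0.3, 5.0, 0.4, -1.9, 0.5, -3.1]
MU_0 = 3.0        # hypothesized mean under H0
T_TABLE = 2.228   # t table value for 10 degrees of freedom, .025 in each tail

n = len(premiums)
x_bar = sum(premiums) / n
s_squared = sum((x - x_bar) ** 2 for x in premiums) / (n - 1)
std_error = math.sqrt(s_squared / n)          # s / sqrt(n)

# (i) Test ratio and critical values for the sample mean.
t_ratio = (x_bar - MU_0) / std_error
half_width = T_TABLE * std_error
critical_values = (MU_0 - half_width, MU_0 + half_width)

# (ii) Confidence interval for the population mean.
confidence_interval = (x_bar - half_width, x_bar + half_width)

reject_by_ratio = abs(t_ratio) > T_TABLE
reject_by_cv = not (critical_values[0] <= x_bar <= critical_values[1])
reject_by_ci = not (confidence_interval[0] <= MU_0 <= confidence_interval[1])

print(f"x-bar = {x_bar:.3f}, s_x-bar = {std_error:.3f}, t = {t_ratio:.3f}")
print(f"critical values: {critical_values[0]:.3f} to {critical_values[1]:.3f}")
print(f"confidence interval: {confidence_interval[0]:.3f} to {confidence_interval[1]:.3f}")
print("Reject H0:", reject_by_ratio, reject_by_cv, reject_by_ci)
```

The three decisions must always agree, since each one rearranges the same inequality.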

b) (i) Since the parameter is unknown, the chi-squared method is the only possible method. To find the mean, sum $xO$ and divide by $n$, which is the sum of $O$. The actual comparison can be done by either summing $\frac{(E-O)^2}{E}$ or by summing $\frac{O^2}{E}$ and subtracting $n$.

$\sum xO = 1602$ and $n = \sum O = 400$, so $\bar{x} = \frac{1602}{400} = 4.005 \approx 4.0$. We look up probabilities in the Poisson table for a mean of 4.0 (these are in the column labeled $p$) and multiply them by $n = 400$ to get $E$.

  x      O         E        E-O    (E-O)^2   (E-O)^2/E     O^2/E          p
  0     23    7.3264   -15.6736    245.662     33.5310    72.205   0.018316
  1     19   29.3052    10.3052    106.197      3.6238    12.319   0.073263
  2     42   58.6100    16.6100    275.892      4.7073    30.097   0.146525
  3     60   78.1468    18.1468    329.306      4.2139    46.067   0.195367
  4     89   78.1468   -10.8532    117.792      1.5073   101.361   0.195367
  5     79   62.5172   -16.4828    271.683      4.3457    99.829   0.156293
  6     48   41.6784    -6.3216     39.963      0.9588    55.280   0.104196
  7     40   23.8160   -16.1840    261.922     10.9977    67.182   0.059540
  8+     0   20.4520    20.4520    418.284     20.4520     0.000   0.051130
 Sum   400  399.9998     0.0012                84.3388   484.339

So $\chi^2 = \sum \frac{(E-O)^2}{E} = 84.3388$ or $\chi^2 = \sum \frac{O^2}{E} - n = 484.339 - 400 = 84.339$. Since there are 9 items in the comparison and we have used the data to estimate 1 parameter, $DF = 9 - 1 - 1 = 7$. Since $\chi^2_{.05}(7) = 14.0671$ and the computed $\chi^2$ is larger than this table value, we reject $H_0$.
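The chi-squared comparison in part (i) can be reproduced without the Poisson table. The Python sketch below assumes, as the table of probabilities does, that the estimated mean is rounded to 4.0 and that the last class is "8 or more"; the critical value 14.0671 is typed in from a chi-squared table for 7 degrees of freedom. Because the probabilities are computed rather than read from a rounded table, the last decimal places can differ slightly from a hand calculation.

```python
import math

# Observed frequencies for x = 0, 1, ..., 7 and the open class 8+.
observed = [23, 19, 42, 60, 89, 79, 48, 40, 0]
n = sum(observed)

# Estimate the Poisson mean from the data (the 8+ class is empty, so it adds nothing).
mean = sum(x * o for x, o in enumerate(observed)) / n
lam = round(mean, 1)   # rounded to one decimal, as when using a Poisson table

# Poisson probabilities for 0..7, with the remainder assigned to 8+.
probs = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(8)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

# Two equivalent forms of the chi-squared statistic.
chi2_direct = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - n

# One degree of freedom is lost for the total, one for the estimated mean.
df = len(observed) - 1 - 1
CHI2_TABLE_05_7 = 14.0671   # table value for 7 degrees of freedom, alpha = .05

print(f"mean = {mean:.3f}, chi-squared = {chi2_direct:.2f}, df = {df}")
print("Reject H0" if chi2_direct > CHI2_TABLE_05_7 else "Do not reject H0")
```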

(ii) We use the Kolmogorov-Smirnov method, though the chi-squared method could also be used if you did not do part (i). The $F_e$ column is the cumulative distribution from the Poisson table with a mean of 4.5, $F_o$ is the observed cumulative relative frequency (cumulative $O$ divided by $n = 400$) and $D = |F_o - F_e|$.

  x      O   Cumulative O      F_o       F_e        D
  0     23        23         .05750    .01111    .0464
  1     19        42         .10500    .06110    .0439
  2     42        84         .21000    .17358    .0364
  3     60       144         .36000    .34230    .0177
  4     89       233         .58250    .53210    .0504
  5     79       312         .78000    .70293    .0771
  6     48       360         .90000    .83105    .0689
  7     40       400        1.00000    .91341    .0866
  8+     0       400        1.00000   1.00000    .0000
 Sum   400

From the Kolmogorov-Smirnov table, the critical value for a 95% confidence level is $\frac{1.36}{\sqrt{n}} = \frac{1.36}{\sqrt{400}} = .068$. Since the largest number in $D$, .0866, is above this value, we reject $H_0$.
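The Kolmogorov-Smirnov comparison in part (ii) is easy to script. The Python sketch below builds the observed and expected cumulative distributions and takes the largest absolute gap; it assumes the large-sample 5% critical value of 1.36 divided by the square root of n from the Kolmogorov-Smirnov table, and the variable names are illustrative.

```python
import math

# Observed frequencies for x = 0, 1, ..., 7 and the open class 8+.
observed = [23, 19, 42, 60, 89, 79, 48, 40, 0]
LAMBDA_0 = 4.5   # Poisson mean stated in the null hypothesis
n = sum(observed)

# Observed cumulative relative frequencies (F_o).
f_observed = []
running_total = 0
for o in observed:
    running_total += o
    f_observed.append(running_total / n)

# Expected cumulative probabilities (F_e); the open class 8+ closes at 1.
f_expected = []
cumulative = 0.0
for x in range(8):
    cumulative += math.exp(-LAMBDA_0) * LAMBDA_0 ** x / math.factorial(x)
    f_expected.append(cumulative)
f_expected.append(1.0)

# D is the absolute gap between the two cumulative distributions.
d_values = [abs(fo - fe) for fo, fe in zip(f_observed, f_expected)]
max_d = max(d_values)
critical_value = 1.36 / math.sqrt(n)   # large-sample 5% value from the K-S table

print(f"max D = {max_d:.4f}, critical value = {critical_value:.3f}")
print("Reject H0" if max_d > critical_value else "Do not reject H0")
```

No parameter is estimated from the data here, which is why this method is available for part (ii) but not for part (i).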

Exam continues in 252z9943.
