5/5/00 252z0043

2. A sales manager wishes to predict unit sales by salespersons ($y$) on the basis of the number of calls made ($x_1$) and the number of different products the salesperson pushes ($x_2$). The data are below. (Use $\alpha =$ ….)

Row  units ($y$)  calls ($x_1$)  products ($x_2$)  $x_1 y$

1  28  14  2  392
2  71  35  5  2485
3  38  22  4  836
4  70  29  5  2030
5  22  6  2  132
6  27  15  3  405
7  28  17  3  476
8  47  20  5  940
9  14  12  1  168
10  70  30  5  2100

Sum  $\sum x_1 y = 9964$

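Many people were too lazy or ignorant to compute $\sum x_1 y$. There is no way in the universe to get a sum of products such as $\sum x_1 y$ from $\sum x_1$ and $\sum y$, nor to get any other sum of products from the sums of its two factors. You will always be asked to compute a sum of this sort on an exam, so figure out how to do it in advance.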

The quantities below were given:

.You do not need all of these.

a. Compute a simple regression of units against calls.(8)

b. Compute $R^2$. (4)

c. Compute $s_e$. (3)

d. Compute $s_{b_1}$ (the standard deviation of the slope) and do a confidence interval for $\beta_1$. (3)

e. Do a prediction interval for units when the salesperson makes 5 calls. (3) Why is this interval likely to be larger than other prediction intervals we might compute for numbers of calls that we have actually observed? (1)

Solution:

a) See computation above.

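Spare Parts Computation:

$\bar{x}_1 = \frac{\sum x_1}{n} = \frac{200}{10} = 20$, $\bar{y} = \frac{\sum y}{n} = \frac{415}{10} = 41.5$

$S_{x_1 y} = \sum x_1 y - n\bar{x}_1\bar{y} = 9964 - 10(20)(41.5) = 1664$

$SS_{x_1} = \sum x_1^2 - n\bar{x}_1^2 = 4740 - 10(20)^2 = 740$

$SS_y = \sum y^2 - n\bar{y}^2 = 21471 - 10(41.5)^2 = 4248.5$

$b_1 = \frac{S_{x_1 y}}{SS_{x_1}} = \frac{1664}{740} = 2.2486$ and $b_0 = \bar{y} - b_1\bar{x}_1 = 41.5 - 2.2486(20) = -3.4730$

$\hat{y} = b_0 + b_1 x_1$ becomes $\hat{y} = -3.4730 + 2.2486 x_1$. Lots of people found a different regression instead.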


They hadn't read the question!

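b) $R^2 = \frac{SSR}{SS_y} = \frac{b_1 S_{x_1 y}}{SS_y} = \frac{2.24865(1664)}{4248.5} = \frac{3741.75}{4248.5} = .8807$

($0 \le R^2 \le 1$ always!)

$SSR = b_1 S_{x_1 y} = 3741.75$ could be used in b) or $SSE = SS_y - b_1 S_{x_1 y} = 506.75$ could be used in c).

c) $s_e^2 = \frac{SSE}{n-2} = \frac{506.75}{8} = 63.34$, so $s_e = 7.959$ ($s_e^2$ is always positive!)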

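d) $s_{b_1}^2 = \frac{s_e^2}{SS_{x_1}} = \frac{63.34}{740} = 0.0856$, so $s_{b_1} = 0.2926$, and the confidence interval is $\beta_1 = b_1 \pm t^{(n-2)}_{\alpha/2}\, s_{b_1} = 2.2486 \pm t^{(8)}_{\alpha/2}(0.2926)$. Note: Different versions of this exam asked for different quantities here. You have to read the question to find out which one is wanted. Many people didn't.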

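e) If $x_0 = 5$ and $\hat{y} = -3.4730 + 2.2486 x_1$, then $\hat{y}_0 = -3.4730 + 2.2486(5) = 7.770$.

From the regression formula outline the Prediction Interval is $Y_0 = \hat{y}_0 \pm t^{(n-2)}_{\alpha/2}\, s_Y$, where $s_Y^2 = s_e^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x}_1)^2}{SS_{x_1}} + 1\right)$. So

$s_Y^2 = 63.34\left(\frac{1}{10} + \frac{(5 - 20)^2}{740} + 1\right) = 63.34(1.4041) = 88.94$.

So $s_Y = 9.431$ and the interval is $7.770 \pm t^{(8)}_{\alpha/2}(9.431)$. This interval will be smallest when $x_0 = \bar{x}_1$. Because 5 is below any values of $x_1$ that we actually have, the prediction interval will be relatively gigantic, as the formula for $s_Y$ grows with the distance of $x_0$ from the mean.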
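The problem 2 arithmetic can be checked with a short script. This is only a sketch: it assumes the "spare parts" definitions $S_{xy} = \sum xy - n\bar{x}\bar{y}$ and $SS_x = \sum x^2 - n\bar{x}^2$, and the variable and function names are illustrative, not part of the exam.

```python
import math

# Data from problem 2: y = units, x1 = calls.
units = [28, 71, 38, 70, 22, 27, 28, 47, 14, 70]
calls = [14, 35, 22, 29, 6, 15, 17, 20, 12, 30]
n = len(units)

x_bar = sum(calls) / n
y_bar = sum(units) / n

# "Spare parts": corrected sums of squares and cross products.
s_xy = sum(x * y for x, y in zip(calls, units)) - n * x_bar * y_bar
ss_x = sum(x * x for x in calls) - n * x_bar ** 2
ss_y = sum(y * y for y in units) - n * y_bar ** 2

b1 = s_xy / ss_x                                # slope
b0 = y_bar - b1 * x_bar                         # intercept
r_squared = b1 * s_xy / ss_y                    # SSR / SST
s_e = math.sqrt((ss_y - b1 * s_xy) / (n - 2))   # standard error of estimate
s_b1 = s_e / math.sqrt(ss_x)                    # standard deviation of the slope


def prediction_std_error(x0):
    """Standard error for predicting a single new y at x0."""
    return s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x + 1)


print(round(b0, 4), round(b1, 4), round(r_squared, 4))
print(round(s_e, 4), round(s_b1, 4), round(prediction_std_error(5), 4))
```

Calling `prediction_std_error` at 5 and at 20 shows why the interval at 5 calls is wide: the term $(x_0 - \bar{x}_1)^2$ vanishes only at the mean.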


3. The data from problem 2 are repeated. (Use $\alpha =$ ….)

Row units calls products

1 28 14 2

2 71 35 5

3 38 22 4

4 70 29 5

5 22 6 2

6 27 15 3

7 28 17 3

8 47 20 5

9 14 12 1

10 70 30 5

.

a. Do a multiple regression of units against calls and products. (12)

b. Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. (6)

c. Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)

d. Use your regression to predict the number of units sold when a sales person makes 20 calls and pushes 5 products.(2)

e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)

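Solution: a) First, we compute $\bar{y} = 41.5$, $\bar{x}_1 = 20$ and $\bar{x}_2 = 3.5$. Second, we compute or copy $\sum x_1 y = 9964$, $\sum x_2 y = 1721$, $\sum y^2 = 21471$, $\sum x_1^2 = 4740$, $\sum x_2^2 = 143$ and $\sum x_1 x_2 = 806$. Third, we compute or copy our spare parts:

$SS_y = \sum y^2 - n\bar{y}^2 = 21471 - 10(41.5)^2 = 4248.5$*

$SS_{x_1} = \sum x_1^2 - n\bar{x}_1^2 = 4740 - 10(20)^2 = 740$*

$SS_{x_2} = \sum x_2^2 - n\bar{x}_2^2 = 143 - 10(3.5)^2 = 20.5$*

$S_{x_1 y} = \sum x_1 y - n\bar{x}_1\bar{y} = 9964 - 10(20)(41.5) = 1664$

$S_{x_2 y} = \sum x_2 y - n\bar{x}_2\bar{y} = 1721 - 10(3.5)(41.5) = 268.5$

and $S_{x_1 x_2} = \sum x_1 x_2 - n\bar{x}_1\bar{x}_2 = 806 - 10(20)(3.5) = 106$.

* indicates quantities that must be positive. (Note that some of these were computed for the last problem.)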


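Fourth, we substitute these numbers into the Simplified Normal Equations:

$S_{x_1 y} = b_1 SS_{x_1} + b_2 S_{x_1 x_2}$
$S_{x_2 y} = b_1 S_{x_1 x_2} + b_2 SS_{x_2}$,

which are

$1664.0 = 740.0\, b_1 + 106.0\, b_2$
$268.5 = 106.0\, b_1 + 20.5\, b_2$

and solve them as two equations in two unknowns for $b_1$ and $b_2$. We do this by multiplying the second equation by 6.9811, which is 740.00 divided by 106.00. The purpose of this is to make the coefficients of $b_1$ equal in both equations. We could do just as well by multiplying the second equation by 106 divided by 20.5 and making the coefficients of $b_2$ equal.

So the two equations become

$1664.00 = 740.00\, b_1 + 106.000\, b_2$
$1874.43 = 740.00\, b_1 + 143.113\, b_2$.

We then subtract the second equation from the first to get $-210.43 = -37.113\, b_2$, so that $b_2 = 5.6700$. The first of the two normal equations can now be rearranged to get $740\, b_1 = 1664 - 106(5.6700) = 1062.98$, which gives us $b_1 = 1.4365$. Finally we get $b_0$ by solving $b_0 = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 = 41.5 - 1.4365(20) - 5.6700(3.5) = -7.075$. Thus our equation is $\hat{y} = -7.075 + 1.4365 x_1 + 5.6700 x_2$.

Note: An alternate way of solving the Simplified Normal Equations is to multiply the second equation by 5.1707, which is 106 divided by 20.5. The resulting equations are

$1664.00 = 740.00\, b_1 + 106.00\, b_2$
$1388.33 = 548.09\, b_1 + 106.00\, b_2$.

We then subtract the second equation from the first to get $275.67 = 191.91\, b_1$, so that $b_1 = 1.4365$. If we then solve for $b_2$, we get essentially the same answer.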
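As a cross-check on the elimination, the same two equations can be solved directly. The sketch below recomputes the spare parts from the problem 2 data and uses Cramer's rule instead of the multiply-and-subtract method in the text; the helper names `mean` and `corrected_sum` are illustrative assumptions. Full-precision results may differ from hand-rounded ones in the last digit or two.

```python
units = [28, 71, 38, 70, 22, 27, 28, 47, 14, 70]     # y
calls = [14, 35, 22, 29, 6, 15, 17, 20, 12, 30]      # x1
products = [2, 5, 4, 5, 2, 3, 3, 5, 1, 5]            # x2


def mean(values):
    return sum(values) / len(values)


def corrected_sum(u, v):
    """Sum of u*v minus n * mean(u) * mean(v); gives SS when u is v."""
    return sum(a * b for a, b in zip(u, v)) - len(u) * mean(u) * mean(v)


ss_x1 = corrected_sum(calls, calls)
ss_x2 = corrected_sum(products, products)
s_x1x2 = corrected_sum(calls, products)
s_x1y = corrected_sum(calls, units)
s_x2y = corrected_sum(products, units)

# Simplified normal equations:
#   s_x1y = b1 * ss_x1  + b2 * s_x1x2
#   s_x2y = b1 * s_x1x2 + b2 * ss_x2
det = ss_x1 * ss_x2 - s_x1x2 ** 2
b1 = (s_x1y * ss_x2 - s_x1x2 * s_x2y) / det
b2 = (ss_x1 * s_x2y - s_x1x2 * s_x1y) / det
b0 = mean(units) - b1 * mean(calls) - b2 * mean(products)

ssr = b1 * s_x1y + b2 * s_x2y   # regression sum of squares
print(round(b0, 4), round(b1, 4), round(b2, 4), round(ssr, 2))
```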

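b) The Regression Sum of Squares is $SSR = b_1 S_{x_1 y} + b_2 S_{x_2 y} = 3912.672$ and is used in the ANOVA below.

The coefficient of determination is $R^2 = \frac{SSR}{SS_y} = \frac{3912.672}{4248.5} = .9210$. Our results can be summarized below as:

$R^2$ / $n$ / $k$ / $\bar{R}^2$
.8807 / 10 / 1 / .8658
.9210 / 10 / 2 / .8984

$\bar{R}^2$, which is $R^2$ adjusted for degrees of freedom, has the formula $\bar{R}^2 = \frac{(n-1)R^2 - k}{n-k-1}$, where $k$ is the number of independent variables. $R^2$ adjusted for degrees of freedom seems to show that our second regression is better.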
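The adjustment is easy to verify numerically. A minimal sketch, assuming the formula $\bar{R}^2 = \frac{(n-1)R^2 - k}{n-k-1}$; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """R-squared adjusted for degrees of freedom; k = number of independent variables."""
    return ((n - 1) * r_squared - k) / (n - k - 1)


one_variable = adjusted_r_squared(0.8807, n=10, k=1)
two_variables = adjusted_r_squared(0.9210, n=10, k=2)
print(round(one_variable, 4), round(two_variables, 4))  # 0.8658 0.8984
```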


One way to do the F test is to note that the total sum of squares is $SS_y = 4248.500$. For the regression with one independent variable the regression sum of squares is $b_1 S_{x_1 y} = 3741.75$. For the regression with two independent variables the regression sum of squares was computed in b) as 3912.672. The difference between these is 170.922. The remaining unexplained variation is

$SSE = SS_y - SSR$ = 4248.500 – 3912.672 = 335.828.

The ANOVA table is

Source / SS / DF / MS / $F$ / $F_\alpha$
$x_1$ / 3741.75 / 1 / 3741.75
$x_2$ / 170.922 / 1 / 170.922 / 3.563 /
Error / 335.828 / 7 / 47.9755
Total / 4248.500 / 9

Since our computed $F$ is smaller than the table $F$, we do not reject our null hypothesis that $x_2$ has no effect.

A faster way to do this is to use the $R^2$s directly. The difference between $R^2$ = 88.07% and $R^2$ = 92.10% is 4.03%.

Source / SS / DF / MS / $F$ / $F_\alpha$
$x_1$ / 88.07 / 1 / 88.07
$x_2$ / 4.03 / 1 / 4.03 / 3.57 /
Error / 7.90 / 7 / 1.12857
Total / 100.00 / 9

The numbers are a bit different because of rounding, but the conclusion is the same.
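The shortcut with percentages amounts to the ratio below. This is a sketch under the assumption that the two regressions are nested and fitted to the same $n$ observations; the function and parameter names are illustrative.

```python
def partial_f_from_r_squared(r2_small, r2_large, n, k_large, added=1):
    """F ratio for the variables added between two nested regressions."""
    explained_by_added = (r2_large - r2_small) / added
    unexplained = (1 - r2_large) / (n - k_large - 1)
    return explained_by_added / unexplained


# Problem 3: adding products (x2) to the regression on calls (x1).
f_ratio = partial_f_from_r_squared(0.8807, 0.9210, n=10, k_large=2)
print(round(f_ratio, 2))
```

The result is compared with the table value of $F$ with `added` and $n - k - 1$ degrees of freedom.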

c) We computed the regression sum of squares in the previous section.

Source / SS / DF / MS / $F$ / $F_\alpha$
$x_1$, $x_2$ / 3912.672 / 2 / 1956.33 / 40.78 /
Error / 335.828 / 7 / 47.975
Total / 4248.500 / 9

Since our computed $F$ is larger than the table $F$, we reject our null hypothesis that $x_1$ and $x_2$ do not explain $y$.

d) $\hat{y}_0 = -7.075 + 1.4365(20) + 5.6700(5) = 50.005$. Since the last few digits don't seem to mean a lot I used $\hat{y}_0 = 50$.


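e) From the ANOVA above

$s_e = \sqrt{47.9755} = 6.926$. This can be read from the MS in the ANOVA above.

According to the outline, "An approximate confidence interval is $\mu_{Y_0} = \hat{y}_0 \pm t\,\frac{s_e}{\sqrt{n}}$ and an approximate prediction interval is $Y_0 = \hat{y}_0 \pm t\, s_e$." Use $t = t^{(n-k-1)}_{\alpha/2} = t^{(7)}_{\alpha/2}$. So the Confidence Interval is $50 \pm t^{(7)}_{\alpha/2}\frac{6.926}{\sqrt{10}} = 50 \pm t^{(7)}_{\alpha/2}(2.190)$ and the Prediction Interval is $50 \pm t^{(7)}_{\alpha/2}(6.926)$.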


4. Your country's tourist office reports the following tourist arrivals over a 20 year period.

year arrivals (thousands)

0 11.75

1 78.93

2 203.04

3 268.95

4 380.49

5 457.32

6 525.51

7 596.56

8 640.74

9 710.67

10 748.02

11 795.13

12 845.21

13 843.08

14 922.58

15 945.22

16 934.72

17 945.67

18 952.38

19 933.86

Your assistant fits the following equations to the data:

arrivals = 162 + 50.0 year

(39.5) (3.55)

R-sq =91.7% Durbin-Watson statistic = 0.19

arrivals= -3.34 + 105 year - 2.90 yearsq

(8.94) (2.18) (0.111)

R-sq = 99.8% Durbin-Watson statistic = 2.48

Do the following:

a. Using only the $R^2$s given above:

(i) Show that $R^2$ adjusted for degrees of freedom rises between the first and second regression. (2)

(ii) Fake an F test to show that the addition of the year squared improves the regression. (4)

(iii) Test the correlation between arrivals and year for significance (3)

(iv) Test the hypothesis that the correlation between arrivals and year is .99 (4)

b. Compute a rank correlation between arrivals and year (Note: if you can't get you are wasting both our time) (3) and

(i) Test it for significance (2)

(ii) Explain why it is higher than the correlation you computed in part a above. (1)

c. Explain what the values of the Durbin-Watson statistics show. (4)


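Solution: a) (i) $\bar{R}^2$, which is $R^2$ adjusted for degrees of freedom, has the formula $\bar{R}^2 = \frac{(n-1)R^2 - k}{n-k-1}$, where $k$ is the number of independent variables. For the first one, $k = 1$ and $R^2 = .917$, so $\bar{R}^2 = \frac{19(.917) - 1}{18} = .9124$, and for the second one, $k = 2$ and $R^2 = .998$, so $\bar{R}^2 = \frac{19(.998) - 2}{17} = .9978$.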

(ii) The difference between $R^2$ = 91.7% and $R^2$ = 99.8% is 8.1%.

Source / SS / DF / MS / $F$ / $F_\alpha$
year / 91.7 / 1
yearsq / 8.1 / 1 / 8.1 / 688.5 /
Error / 0.2 / 17 / 0.011765
Total / 100.0 / 19

Since our computed $F$ is larger than the table $F$, we reject our null hypothesis that yearsq has no effect.


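(iii) The simple sample correlation coefficient is $r$, the square root of $R^2$. Since $R^2$ was given by the printout, we don't need to compute it, so $r = \sqrt{.917} = .9576$. From the outline, if we want to test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ and $x$ and $y$ are normally distributed, we use $t^{(n-2)} = \frac{r}{s_r} = \frac{r}{\sqrt{\frac{1 - r^2}{n-2}}} = \frac{.9576}{\sqrt{\frac{.083}{18}}} = 14.10$. Since this is far above the table value of $t^{(18)}_{\alpha/2}$, we reject $H_0$.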

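(iv) The outline says, if $\rho_0 \ne 0$ and we want to test $H_0: \rho = \rho_0$ against $H_1: \rho \ne \rho_0$, "we need to use Fisher's z-transformation. Let $\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that the test ratio is $\frac{\tilde{z} - \mu_z}{s_z}$." So if $r = .9576$ and $\rho_0 = .99$, $\tilde{z} = \frac{1}{2}\ln(46.17) = 1.916$, $\mu_z = \frac{1}{2}\ln(199) = 2.647$ and $s_z = \sqrt{\frac{1}{17}} = .2425$, and the test ratio is $\frac{1.916 - 2.647}{.2425} = -3.01$. Since its magnitude is above the table value and this is a 2-sided test, we reject $H_0$.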
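Both correlation tests can be reproduced in a few lines. This sketch assumes the $t$ ratio with $n-2$ degrees of freedom for $H_0: \rho = 0$ and the Fisher transformation with standard deviation $\sqrt{1/(n-3)}$ for $H_0: \rho = .99$; the variable names are illustrative.

```python
import math

n = 20
r = math.sqrt(0.917)   # r is the square root of R-sq from the first printout

# H0: rho = 0, t ratio with n - 2 degrees of freedom.
t_ratio = r / math.sqrt((1 - r ** 2) / (n - 2))

# H0: rho = 0.99, Fisher's z-transformation.
rho_0 = 0.99
z_tilde = 0.5 * math.log((1 + r) / (1 - r))
mu_z = 0.5 * math.log((1 + rho_0) / (1 - rho_0))
s_z = math.sqrt(1 / (n - 3))
fisher_ratio = (z_tilde - mu_z) / s_z

print(round(t_ratio, 2), round(fisher_ratio, 2))
```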

b) (i) The data is repeated below with the calculations for rank correlation.

year  arrivals  $r_x$  $r_y$  $d = r_x - r_y$  $d^2$

0  11.75  1  1  0  0
1  78.93  2  2  0  0
2  203.04  3  3  0  0
3  268.95  4  4  0  0
4  380.49  5  5  0  0
5  457.32  6  6  0  0
6  525.51  7  7  0  0
7  596.56  8  8  0  0
8  640.74  9  9  0  0
9  710.67  10  10  0  0
10  748.02  11  11  0  0
11  795.13  12  12  0  0
12  845.21  13  14  -1  1
13  843.08  14  13  1  1
14  922.58  15  15  0  0
15  945.22  16  18  -2  4
16  934.72  17  17  0  0
17  945.67  18  19  -1  1
18  952.38  19  20  -1  1
19  933.86  20  16  4  16

Sum  0  24

$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(24)}{20(399)} = 1 - .0180 = .9820$. If we want a 2-sided test at the 99% confidence level of $H_0: \rho_s = 0$, compare $r_s$ with the 0.5% value from the Spearman's rank correlation coefficient table. Since the table value is .4451, reject the null hypothesis. We conclude that the rank correlation is significant.
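The ranking and the rank correlation can be checked as follows. A sketch only: the `ranks` helper is illustrative and assumes there are no tied values, which holds for this data set.

```python
years = list(range(20))
arrivals = [11.75, 78.93, 203.04, 268.95, 380.49, 457.32, 525.51, 596.56,
            640.74, 710.67, 748.02, 795.13, 845.21, 843.08, 922.58, 945.22,
            934.72, 945.67, 952.38, 933.86]


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


rank_x = ranks(years)
rank_y = ranks(arrivals)
n = len(years)

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(r_s, 4))
```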


(ii) The second regression shows that there is a slight curvature in the relation between the two variables. Since correlation tests for a linear relationship, it is not quite appropriate, but rank correlation will detect a slightly curved but generally positive relationship.

c) A Durbin-Watson Test is a test for autocorrelation. For the first regression, with the appropriate $n$, $k$ and $\alpha$, the text table gives $d_L$ and $d_U$. The null hypothesis is 'No Autocorrelation' and our rejection region is $d < d_L$ or $d > 4 - d_L$. We really should use the value for …, but a check of the table leaves us sure that it is somewhat below .95. Thus the D-W statistic of 0.19 is probably in the rejection region. For the second regression, the text table again gives $d_L$ and $d_U$. The 'do not reject' region is between $d_U$ and $4 - d_U$. 2.48 is in this region, but this is really for … We can't be sure if we actually use …
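The printouts only report the Durbin-Watson statistics, but the first one can be recomputed from the data. The sketch below is an assumption about how the assistant's first equation was fitted (ordinary least squares of arrivals on year) and uses $d = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2}$; a value well below 2 points to positive autocorrelation in the residuals.

```python
arrivals = [11.75, 78.93, 203.04, 268.95, 380.49, 457.32, 525.51, 596.56,
            640.74, 710.67, 748.02, 795.13, 845.21, 843.08, 922.58, 945.22,
            934.72, 945.67, 952.38, 933.86]
years = list(range(len(arrivals)))
n = len(years)

x_bar = sum(years) / n
y_bar = sum(arrivals) / n

# Least-squares line: arrivals = b0 + b1 * year.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, arrivals))
      / sum((x - x_bar) ** 2 for x in years))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(years, arrivals)]

# Durbin-Watson: squared successive differences over squared residuals.
dw = (sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, n))
      / sum(e ** 2 for e in residuals))
print(round(b0, 1), round(b1, 1), round(dw, 2))
```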


5. An analysis of a sample of 200 prisoners of their adjustment to civil life after release from prison reveals the following:

Adjustment to Civil Life

Residence after release / Outstanding / Good / Fair / Poor / Total
Hometown / 27 / 34 / 34 / 25 / 120
Not Hometown / 15 / 16 / 24 / 25 / 80
Total / 42 / 50 / 58 / 50 / 200

Do statistical tests of the following:

a. The proportion in each adjustment category was the same for both 'hometown' and 'not hometown' groups. (8)

b. The proportion in the combined 'outstanding' and 'good' categories was higher in the 'hometown' group than the 'not hometown' group. (5)

c. The combined proportion of the whole group of 200 that made an 'outstanding' or 'good' adjustment was 50% (4)

Solution: Note!! A test of multiple proportions is a $\chi^2$ test! Every year I see people trying to compare more than two proportions by a method appropriate for b) below. It doesn't work! $\Delta p = p_1 - p_2$ is defined as a difference between two proportions; when you have more than two, that definition doesn't work. Also, simply computing the proportions and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.

a) The data is copied below. The $p_r$s are found by dividing the row sums in $O$ by the grand total. The $p_r$s are then used to multiply the column totals to get the material in $E$. For example, $p_1 = \frac{120}{200} = .6$, so the first expected count is $.6(42) = 25.2$.

This is a chi-squared test of homogeneity. Our null hypothesis is 'Homogeneity'. The calculations are done in two ways below. Save time by computing only $\sum\frac{O^2}{E} - n$.

Row  $O$  $E$  $E - O$  $(E - O)^2$  $\frac{(E - O)^2}{E}$  $\frac{O^2}{E}$

1  27  25.2  -1.80000  3.2400  0.12857  28.9286
2  34  30.0  -4.00000  16.0000  0.53333  38.5333
3  34  34.8  0.80000  0.6400  0.01839  33.2184
4  25  30.0  5.00000  25.0000  0.83333  20.8333
5  15  16.8  1.80000  3.2400  0.19286  13.3929
6  16  20.0  4.00000  16.0000  0.80000  12.8000
7  24  23.2  -0.80000  0.6400  0.02759  24.8276
8  25  20.0  -5.00000  25.0000  1.25000  31.2500

Total  200  200.0  3.78406  203.7841

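$\chi^2 = \sum\frac{(E - O)^2}{E} = 3.784$, or equivalently $\chi^2 = \sum\frac{O^2}{E} - n = 203.784 - 200 = 3.784$, with $(r - 1)(c - 1) = (1)(3) = 3$ degrees of freedom. This is below the table value of $\chi^2$, so do not reject the null hypothesis. We conclude that, except for random variations, the proportion in each category is the same for both groups.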
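The expected counts and both versions of the statistic can be generated from the observed table. A minimal sketch; the variable names are illustrative.

```python
observed = [
    [27, 34, 34, 25],   # Hometown
    [15, 16, 24, 25],   # Not Hometown
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count = row proportion times column total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

pairs = [(o, e) for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]
chi_sq = sum((o - e) ** 2 / e for o, e in pairs)
chi_sq_shortcut = sum(o * o / e for o, e in pairs) - n
df = (len(observed) - 1) * (len(col_totals) - 1)

print(round(chi_sq, 4), round(chi_sq_shortcut, 4), df)
```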


b) From Table 3

Interval for / Confidence Interval / Hypotheses / Test Ratio / Critical Value
Difference Between Proportions / / / /

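Our Hypotheses are $H_0: p_1 \le p_2$, $H_1: p_1 > p_2$ or $H_0: \Delta p \le 0$, $H_1: \Delta p > 0$, where $\Delta p = p_1 - p_2$. If we use the test ratio method, we need to find $\bar{p}_1 = \frac{61}{120} = .5083$, $\bar{p}_2 = \frac{31}{80} = .3875$ and $\bar{p}_0 = \frac{92}{200} = .46$. So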

.

. So . Since do not reject . We do not reject the null hypothesis if .
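The test ratio for part b) can be sketched as below. Using the pooled proportion in the standard error is an assumption here (it is one common choice when the null hypothesis is no difference); the variable names are illustrative.

```python
import math

n1, n2 = 120, 80            # hometown, not hometown
x1 = 27 + 34                # outstanding or good, hometown
x2 = 15 + 16                # outstanding or good, not hometown

p1 = x1 / n1
p2 = x2 / n2
p0 = (x1 + x2) / (n1 + n2)  # pooled proportion (assumed under H0)

sigma_dp = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
z = (p1 - p2) / sigma_dp

print(round(p1, 4), round(p2, 4), round(p0, 2), round(z, 2))
```

For the one-sided alternative $p_1 > p_2$, `z` is compared with $z_\alpha$.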

c) Table 3 says the following:

Interval for / Confidence Interval / Hypotheses / Test Ratio / Critical Value
Proportion / / / /

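In the last part of the problem, we found that the proportion of people in the 'outstanding' or 'good' categories was $\bar{p} = \frac{92}{200} = .46$. Thus, if we use the test ratio method, $z = \frac{\bar{p} - p_0}{\sigma_{\bar{p}}} = \frac{.46 - .50}{\sqrt{\frac{.5(.5)}{200}}} = \frac{-.04}{.03536} = -1.131$. We reject $H_0: p = .5$ if $z$ is not between $\pm z_{\alpha/2}$. It is between these values, so we do not reject $H_0$.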


6. In an effort to teach safety principles to a group of your employees, 22 employees were randomly assigned to one of four groups. After the sessions they took a test that was scored from 0 to 10 with the following results:

Programmed Instruction / Lecture / Videotape / Discussion
7 / 8 / 7 / 8
6 / 5 / 9 / 5
5 / 8 / 6 / 6
6 / 6 / 8 / 6
6 / 9 / 5 / 5
8 / / / 10

Do statistical tests of the following: (Assume that the underlying distribution is Normal)

a. Is there a difference between the means? (7)

b. Does column 4 have a Normal distribution with a population mean of 7.2 and a population standard deviation of 1.5? (5)

c. At the same time we gave the managers a test on safety and then a day of training - scores were not reported, but of 15 managers 11 performed better after the day of training. Use a sign test to show if the day of training was successful. (4)

Solution: Note!! A test of multiple means is an Analysis of Variance! Every year I see people trying to compare more than two means by a method appropriate for comparing two means. It doesn't work! $D = \mu_1 - \mu_2$ is defined as a difference between two means; when you have more than two, that definition doesn't work. Also, simply computing the means and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.

a) Because we are comparing means under the assumption that the underlying distribution is normal, this is an ANOVA.

/ $x_1$ / $x_2$ / $x_3$ / $x_4$ / Sum
/ 7 / 8 / 7 / 8 /
/ 6 / 5 / 9 / 5 /
/ 5 / 8 / 6 / 6 /
/ 6 / 6 / 8 / 6 /
/ 6 / 9 / 5 / 5 /
/ 8 / / / 10 /
Sum / 38 + / 36 + / 35 + / 40 / = 149
$n_j$ / 6 + / 5 + / 5 + / 6 / = 22
$\bar{x}_j$ / 6.3333 / 7.2000 / 7.0000 / 6.6667 / 6.7727
SS / 246 + / 270 + / 255 + / 286 / = 1057


Source / SS / DF / MS / / /
Between / 2.4052 / 3 / 0.8017 / 0.32 / ns / Column means equal
Within / 45.4584 / 18 / 2.5255
Total / 47.8636 / 21

Explanation: Since the Sum of Squares (SS) column must add up, 45.4584 is found by subtracting 2.4052 from 47.8636. Since $n = 22$, the total degrees of freedom are $n - 1 = 21$. Since there are 4 random samples or columns, the degrees of freedom for Between is 4 – 1 = 3. Since the Degrees of Freedom (DF) column must add up, 18 = 21 – 3. The Mean Square (MS) column is found by dividing the SS column by the DF column. 0.8017 is $\frac{2.4052}{3}$ and 2.5255 is $\frac{45.4584}{18}$. $F = \frac{0.8017}{2.5255} = 0.32$, and is compared with $F^{(3,18)}_{\alpha}$ from the F table. Because our computed $F$ is less than the table $F$, do not reject $H_0$.
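The ANOVA table can be rebuilt from the raw scores. This sketch uses the computational formulas $SST = \sum x^2 - n\bar{x}^2$ and $SSB = \sum n_j\bar{x}_j^2 - n\bar{x}^2$; because it carries full precision, the Between and Within sums of squares may differ in the third decimal from figures computed with rounded means.

```python
groups = {
    "Programmed Instruction": [7, 6, 5, 6, 6, 8],
    "Lecture": [8, 5, 8, 6, 9],
    "Videotape": [7, 9, 6, 8, 5],
    "Discussion": [8, 5, 6, 6, 5, 10],
}

all_scores = [x for scores in groups.values() for x in scores]
n = len(all_scores)
grand_mean = sum(all_scores) / n

ss_total = sum(x * x for x in all_scores) - n * grand_mean ** 2
ss_between = (sum(len(g) * (sum(g) / len(g)) ** 2 for g in groups.values())
              - n * grand_mean ** 2)
ss_within = ss_total - ss_between

df_between = len(groups) - 1
df_within = n - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(round(ss_between, 4), round(ss_within, 4), round(ss_total, 4))
print(df_between, df_within, round(f_ratio, 2))
```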

b) Because the mean and variance are known and the sample is small, the only test that is practical is the Kolmogorov-Smirnov Test.

The $F_e$ column is the cumulative distribution computed from the Normal table. $z$ is $\frac{x - 7.2}{1.5}$. $F_o$ is the Cumulative $O$ divided by $n = 6$. $D$ is $|F_o - F_e|$.

$x$  $O$  Cumulative $O$  $F_o$  $z$  $F_e$  $D$

5  1  1  .1667  -1.47  .0708  .0959
5  1  2  .3333  -1.47  .0708  .2625
6  1  3  .5000  -0.80  .2119  .2881
6  1  4  .6667  -0.80  .2119  .4548
8  1  5  .8333  0.53  .7019  .1314
10  1  6  1.0000  1.87  .9693  .0307

7

From the Kolmogorov-Smirnov Table, the critical value for a 95% confidence level is .4050. Since the largest number in $D$, .4548, is above this value, we reject $H_0$.
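The $D$ column can be reproduced without a Normal table. This sketch mirrors the layout of the table above (one row per observation, comparing the cumulative observed proportion with the Normal cumulative probability); `normal_cdf` is an illustrative helper built on `math.erf`, and because it does not round $z$ to two decimals the individual entries can differ slightly from table look-ups.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard Normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


sample = sorted([8, 5, 6, 6, 5, 10])   # column 4 (Discussion)
mu, sigma = 7.2, 1.5                   # hypothesized mean and std deviation
n = len(sample)

gaps = []
for i, x in enumerate(sample, start=1):
    f_observed = i / n
    f_expected = normal_cdf((x - mu) / sigma)
    gaps.append(abs(f_observed - f_expected))

max_gap = max(gaps)
print(round(max_gap, 4))
```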

c) We get the p-value for this result by using the binomial table with $n = 15$ and $p = .5$: $P(x \ge 11) = 1 - P(x \le 10) = 1 - .94077 = .05923$.

Since this is greater than $\alpha$, we do not reject $H_0$ and thus cannot conclude that the training was successful.
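The binomial tail probability used in the sign test can be computed exactly. A short sketch, assuming a one-sided alternative that more than half of the managers improve:

```python
from math import comb

n, improved = 15, 11

# P(x >= 11) when x is binomial with n = 15 and p = 0.5.
p_value = sum(comb(n, k) for k in range(improved, n + 1)) / 2 ** n
print(round(p_value, 5))
```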


7. Three groups of Executives are given a test on management principles. We will assume that the underlying distribution is not Normal. (M&L p627)

Manufacturing Executives / Finance Executives / Trade Executives
Score Rank / Score Rank / Score Rank
51 9 / 15 2 / 89 19
31 7 / 32 8 / 20 3.5
14 1 / 68 13 / 60 11
69 14 / 87 18 / 72 15
86 17 / 20 3.5 / 56 10
62 12 / 28 6 / 22 5
96 20 / 77 16 /
/ 97 21 /
Rank sum 80 / Rank sum 87.5 / Rank sum 63.5

Using rank tests, test the following:

a. The distributions of scores are the same for all three groups (7)

b. Taken as a single group, nonmanufacturing executives do worse on the test than manufacturing executives. (7)

c. The median score for Finance executives is 60 (Do not use a sign test if you used it in the last problem.)

(4 points for a sign test, 5 for a better method)

d. 45 days after you get back from Cancun, your doctor orders a runs test. Let + indicate days when you had the runs and - indicate days when you did not. There were 27 + days and 18 - days, and a total of 18 runs of either plusses or minuses. Was the sequence random? (5)

Solution: a) Since this involves comparing three apparently random samples from a non-normal distribution, we use a Kruskal-Wallis test. The null hypothesis is Columns come from same distribution or medians are equal.

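Sums of ranks were given above. To check the ranking, note that the sum of the three rank sums is 80 + 87.5 + 63.5 = 231, that the total number of items is 7 + 8 + 6 = 21 and that the sum of the first $n$ numbers is $\frac{n(n+1)}{2} = \frac{21(22)}{2} = 231$. Now, compute the Kruskal-Wallis statistic $H = \frac{12}{n(n+1)}\sum_i\frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{21(22)}\left(\frac{80^2}{7} + \frac{87.5^2}{8} + \frac{63.5^2}{6}\right) - 3(22) = 66.0613 - 66 = 0.0613$. If we try to look up this result in the (7, 8, 6) section of the Kruskal-Wallis table (Table 9), we find that the problem is too large for the table. Thus we must use the chi-squared table with 2 degrees of freedom. Since $H = 0.0613$ is below the table value of $\chi^2$, do not reject $H_0$.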
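The rank-sum check and the statistic can be sketched as follows. No correction for the one tied pair of ranks is applied, which is an assumption of this sketch; the dictionary layout is illustrative.

```python
# Group name -> (sum of ranks, number of executives), as given in the table.
rank_sums = {
    "Manufacturing": (80.0, 7),
    "Finance": (87.5, 8),
    "Trade": (63.5, 6),
}

n = sum(size for _, size in rank_sums.values())
total_of_ranks = sum(sr for sr, _ in rank_sums.values())

# Kruskal-Wallis H, compared with chi-squared on (groups - 1) degrees of freedom.
h_stat = (12 / (n * (n + 1))
          * sum(sr ** 2 / size for sr, size in rank_sums.values())
          - 3 * (n + 1))
df = len(rank_sums) - 1

print(total_of_ranks, n * (n + 1) / 2, round(h_stat, 4), df)
```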

b) Because we are comparing two random samples from a nonnormal distribution, we use the Wilcoxon-Mann-Whitney Method. If we designate manufacturing as sample 1 and nonmanufacturing as sample 2, our hypotheses are $H_0: \eta_1 \le \eta_2$ and $H_1: \eta_1 > \eta_2$, where $\eta$ is the median. The sum of ranks for manufacturing is 80. The sum of ranks for nonmanufacturing is 87.5 + 63.5 = 151. As in part a), their sum is 231, and this checks out as equal to $\frac{21(22)}{2}$.


We designate the smaller of the two rank sums, 80, as $W$. We are unable to find critical values or p-values for a 5% two-tailed test with $n_1 = 7$ and $n_2 = 14$ on either of the Wilcoxon-Mann-Whitney tables, since $n_2$ is too high. The outline says that for values of $n_1$ and $n_2$ that are too large for the tables, $W$ has the normal distribution with mean $\mu_W = \frac{1}{2}n_1(n_1 + n_2 + 1) = \frac{1}{2}(7)(22) = 77$ and variance $\sigma_W^2 = \frac{1}{6}n_2\mu_W = \frac{1}{6}(14)(77) = 179.667$. Note that our value of $W$ is above the mean. This is because the average rank of sample 1 is higher than the average rank of sample 2, as it would have to be if nonmanufacturing executives do worse on the test. This means that we are doing a right-sided test. $z = \frac{W - \mu_W}{\sigma_W} = \frac{80 - 77}{\sqrt{179.667}} = \frac{3}{13.404} = 0.224$. Since this is below $z_\alpha$, we do not reject $H_0$.
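The normal approximation for the rank sum takes only a few lines. A sketch, using the equivalent textbook form of the variance, $\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$; the variable names are illustrative.

```python
import math

n1, n2 = 7, 14      # manufacturing, nonmanufacturing
w = 80.0            # sum of ranks for sample 1

mu_w = 0.5 * n1 * (n1 + n2 + 1)
var_w = n1 * n2 * (n1 + n2 + 1) / 12
z = (w - mu_w) / math.sqrt(var_w)

print(mu_w, round(var_w, 3), round(z, 3))
```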

c) The Wilcoxon signed rank test for paired data was used in class as a powerful test of the median. Our hypotheses are $H_0: \eta = 60$ and $H_1: \eta \ne 60$. The difference column will be $d = x - 60$.

$x$  difference $d$  rank of $|d|$  sign

15  -45  8  -
32  -28  4  -
68  8  1  +
87  27  3  +
20  -40  7  -
28  -32  5  -
77  17  2  +
97  37  6  +

If we total negative and positive ranks separately, we get $T^- = 8 + 4 + 7 + 5 = 24$ and $T^+ = 1 + 3 + 2 + 6 = 12$. According to the Wilcoxon signed rank test table, the 2.5% value for $n = 8$ is 4. Since the smaller of the two rank sums, 12, is above this critical value, do not reject the null hypothesis.
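The signed ranks can be verified with a short script. A sketch that assumes no tied magnitudes and no scores equal to the hypothesized median, both of which hold for the Finance scores; the variable names are illustrative.

```python
scores = [15, 32, 68, 87, 20, 28, 77, 97]   # Finance executives
median_0 = 60

diffs = [x - median_0 for x in scores if x != median_0]
magnitudes = sorted(abs(d) for d in diffs)   # all distinct here


def rank_of(d):
    """Rank of |d| among the absolute differences (1 = smallest)."""
    return magnitudes.index(abs(d)) + 1


t_plus = sum(rank_of(d) for d in diffs if d > 0)
t_minus = sum(rank_of(d) for d in diffs if d < 0)
print(t_plus, t_minus, min(t_plus, t_minus))
```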

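d) This is, of course, a runs test. $n = 45$ is the total number of items, $n_1 = 27$, $n_2 = 18$, and $r = 18$.

To test the null hypothesis of randomness for a small sample, assume that the significance level is 5% and use the table entitled 'Critical values of $r$ in the Runs Test.' Unfortunately, $n_1$ is too high for the table. According to the outline, for a larger problem (if $n_1$ and $n_2$ are too large for the table), $r$ follows the normal distribution with $\mu = \frac{2n_1 n_2}{n} + 1$ and $\sigma^2 = \frac{(\mu - 1)(\mu - 2)}{n - 1}$. Then $\mu = \frac{2(27)(18)}{45} + 1 = 22.6$ and $\sigma^2 = \frac{(21.6)(20.6)}{44} = 10.1127$. So $z = \frac{r - \mu}{\sigma} = \frac{18 - 22.6}{\sqrt{10.1127}} = \frac{-4.6}{3.180} = -1.447$. Since this value of $z$ is between $\pm z_{.025} = \pm 1.960$, we do not reject $H_0$.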
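The large-sample runs test can be checked the same way. A minimal sketch of the normal approximation, with illustrative variable names:

```python
import math

n1, n2 = 27, 18     # + days and - days
runs = 18
n = n1 + n2

mu = 2 * n1 * n2 / n + 1
variance = (mu - 1) * (mu - 2) / (n - 1)
z = (runs - mu) / math.sqrt(variance)

print(round(mu, 1), round(variance, 4), round(z, 3))
```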
