MAS Applied Exam Sample Solutions, May 2014

1. (a) H0: 1 = 2

Ha: 1 ≠ 2

(b) Based on the box plots, there is no evidence that the population variances differ, so it is probably safe to use the pooled sample variance and the exact t-test.

Since |t| = 0.539 is not greater than t0.025, 55 = 2.005, we cannot reject H0; the two-tailed P-value is 2(0.296) = 0.592. We cannot conclude that the mean length of the lake frogs differs from the mean length of the pond frogs.
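A minimal sketch of this pooled t-test in Python (the frog length data are not reproduced in the exam output, so lake and pond below are hypothetical placeholder samples):

    import numpy as np
    from scipy import stats

    # Hypothetical placeholder data; the actual frog lengths are not shown above.
    lake = np.array([7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.8])
    pond = np.array([6.9, 7.2, 7.0, 6.7, 7.1, 7.5, 6.8, 7.0])

    # equal_var=True gives the pooled-variance two-sample t-test.
    t_stat, p_value = stats.ttest_ind(lake, pond, equal_var=True)
    print(t_stat, p_value)  # compare |t| to t(0.025, n1+n2-2), or p to 0.05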

(c) The significance level is set by the researcher to be the maximum allowable probability of a Type I error (i.e., rejecting H0 when it is in fact true). In this case, we don't want to make the error of concluding the mean lengths differ when in fact they are the same, so we set α to be relatively small (0.05); we then know ahead of time that there is only a 5% chance of making this error.

(d) Omitting the outliers is not a good idea at all, unless there was some reason the measurements were wrong (like faulty equipment). In this case, omitting the outliers could have caused a big discrepancy in the results, since most of the omitted values for the pond sample were on the small side. A better approach, if the outliers were a concern, would be to base the inference on robust statistics, such as medians or trimmed means.

(e) A nonparametric alternative would be the Wilcoxon Rank-Sum test (aka Mann-Whitney test). The necessary conditions for such a test are that the two independent samples come from populations that are continuous and identical except for a possible difference in location (center). [Note: The samples could come from different distributions but the null hypothesis would then be H0: P(X>Y) = P(X<Y).]
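A sketch of the nonparametric alternative, reusing the hypothetical lake and pond samples from the sketch above:

    from scipy import stats

    # Wilcoxon Rank-Sum / Mann-Whitney test (two-sided).
    u_stat, p_value = stats.mannwhitneyu(lake, pond, alternative='two-sided')
    print(u_stat, p_value)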

2. (a) Using the central limit theorem would give the following formula for the confidence interval:

(p̂1 − p̂2) ± zα/2 √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]

and traditionally p̂i = yi/ni would be substituted for the pi (called a Wald interval). The Agresti-Caffo solution is generally accepted as being better. It would use the values

ỹi = yi + (zα/2)²/4 and ñi = ni + (zα/2)²/2 (so p̃i = ỹi/ñi)

in place of the yi and ni. Adding 1 to each yi and 2 to each ni is a common substitution as well and is approximately the same for a 95% interval. (There is no penalty on this exam for using the traditional method, but it will give an extra set of assumptions to check in (b).)

The traditional interval was very similar, (0.1591, 0.3342).
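A sketch of both intervals in Python; the exam's actual counts are not reproduced above, so y1, n1 (garage users favoring) and y2, n2 (lot users favoring) are hypothetical:

    import numpy as np
    from scipy import stats

    y1, n1, y2, n2 = 75, 150, 38, 150   # hypothetical counts
    z = stats.norm.ppf(1 - 0.10 / 2)    # 90% interval

    def wald_form_ci(y1, n1, y2, n2, z):
        p1, p2 = y1 / n1, y2 / n2
        se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return (p1 - p2) - z * se, (p1 - p2) + z * se

    print(wald_form_ci(y1, n1, y2, n2, z))   # traditional Wald interval

    # Agresti-Caffo: adjust the counts, then use the same formula.
    a = z**2 / 4
    print(wald_form_ci(y1 + a, n1 + 2 * a, y2 + a, n2 + 2 * a, z))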

(b) Any form of this interval assumes we have two independent binomial experiments. Given that we are not sampling with replacement, the binomial part can at best be approximate (the counts would really be hypergeometric). To make the approximation work we need each population to be much larger (say 100 times) than the corresponding sample. It seems doubtful that there are 15,000 garage users and 15,000 parking lot users among the faculty! That the two samples are independent seems like it should approximately hold (especially if the populations were large enough for the first part).

The Agresti-Caffo interval has no other assumptions. The Wald interval would require that each of the yi and ni − yi (the numbers of successes and failures) be at least 5.

(c) We are 90% confident that the difference in the proportion of those favoring the new offering between the current garage users and current lot users is between 15.8% and 33.1% (so the garage users like it a lot more). Being 90% confident means that approximately 90% of intervals constructed in this way will contain the true difference in proportions; this particular interval either does or it doesn't.

(d) Hypothesis tests are always constructed under the assumption that the null hypothesis is true. Here, the null hypothesis is that the two population proportions are equal, so we should plug the same estimated value in for each. (So we pretend it is one large sample from a single population and estimate p for that).
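A sketch of the test statistic built under H0, continuing with the hypothetical counts from the sketch in (a):

    import numpy as np
    from scipy import stats

    # Pooled estimate: pretend it is one large sample from a single population.
    p_pool = (y1 + y2) / (n1 + n2)
    se0 = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z_stat = (y1 / n1 - y2 / n2) / se0
    p_value = 2 * stats.norm.sf(abs(z_stat))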

(e) This would be a test for homogeneity. The table would be three parking groups by two opinions. A 3x2 table has (3-1)x(2-1)=2 degrees of freedom.
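A sketch of the homogeneity test on a hypothetical 3x2 table (rows = the three parking groups, columns = favor / do not favor):

    import numpy as np
    from scipy import stats

    table = np.array([[60, 40],     # hypothetical counts
                      [35, 65],
                      [50, 50]])
    chi2, p_value, df, expected = stats.chi2_contingency(table)
    print(chi2, p_value, df)        # df = (3-1)*(2-1) = 2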

3. (a) Give the formal model being fit in this simple linear regression, including the model equation, identification of any symbols used, and all necessary assumptions.

Yi = β0 + β1xi + εi,

where Yi is the average FCAT mathematics score in school i, β0 is the y-intercept, β1 is the slope, xi is the percentage of students below poverty level in school i, and εi is the error for school i. One way of writing the assumptions is that the εi are independent N(0, σ²) random variables, where σ² is a common error variance.

b) That the mean of the errors is zero can be examined using the residual vs. predicted (or fitted) plot. Moving from left to right the errors seem to be fairly vertically symmetric around zero, so this assumption seems reasonable.

The assumption of equal variances is checked from the same graph. Moving from left to right, the vertical spread seems to decrease (a reversed fan structure), so this assumption seems violated.

We cheat in checking the normality of the errors by using a single q-q plot of the residuals (instead of one for each x value or even range of x values). It is slightly heavy tailed in this case, but it is hard to tell with a small sample. I can see an analyst going either way on this, but it might go away if the heteroscedasticity is dealt with.

For checking independence, there is no indication that this is a random sample and there are many ways in which the errors could be related (similar regions of the state, similar ethnic make-ups, etc…). Plots of residuals against a number of variables that could relate the schools could be reassuring if they showed no patterns.
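A sketch of these diagnostics in Python; the FCAT data are not reproduced here, so the schools data frame and its column names are hypothetical stand-ins:

    import matplotlib.pyplot as plt
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical stand-in data for illustration only.
    schools = pd.DataFrame({'pct_poverty': [10, 25, 40, 55, 70, 85, 30, 60],
                            'math_score': [330, 320, 310, 300, 285, 275, 315, 295]})

    slr = smf.ols('math_score ~ pct_poverty', data=schools).fit()

    # Residuals vs. fitted: check the zero-mean and equal-variance assumptions.
    plt.scatter(slr.fittedvalues, slr.resid)
    plt.axhline(0)
    plt.xlabel('Fitted values'); plt.ylabel('Residuals')
    plt.show()

    # Single q-q plot of all residuals: the "cheat" check of normality.
    sm.qqplot(slr.resid, line='45', fit=True)
    plt.show()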

(c) There is a statistically significant (p-value < 0.000003) relationship between the percent scoring below the poverty level and the average math score in schools at the 3rd grade level. Each additional percent below poverty decreases the estimated expected average math score for the school by 0.30544 (the slope estimate). This relationship is fairly strong, with the poverty rate explaining an estimated 67.31% (R-squared) of the variability in average math scores between schools.

(d) The confidence interval for the mean response μY|x gives us a range that we can be 95% confident the true underlying regression line passes through at the corresponding x value. The prediction interval is different: given a value x, we are 95% confident that a yet-unobserved corresponding Y value will fall in that range.
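Continuing the sketch above, both intervals are available from the fitted model; the mean_ci columns bracket the regression line at the given x, while the obs_ci columns bracket a yet-unobserved individual response:

    new = pd.DataFrame({'pct_poverty': [50]})
    print(slr.get_prediction(new).summary_frame(alpha=0.05))
    # mean_ci_lower/upper  -> confidence interval for the mean response
    # obs_ci_lower/upper   -> prediction interval for a new observation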

4) An experiment is conducted to compare the effectiveness of five different computer screen overlays for the reduction of eye strain. The five covers were “clear”, “clear-anti-glare”, “tinted”, “tinted anti-glare”, and “softening”. Ninety volunteers were gathered with approximately the same uncorrected vision, amount of time spent on-line, and ambient light in their offices. They were randomly assigned an overlay (18 to each) to use for an entire work week, and a measure of cumulative total eye strain (scale from 0=low to 50=high) was collected for each subject.

(a) H0: clear = clear-anti-glare=tinted=tinted-anti-glare=softening

vs. HA: at least one mean is different from the others

where the are the mean total eye-strain for all users like the ones in our experiment.

(b) The omnibus test only tells us that the mean total eye-strain for at least one of the covers differs from the mean total eye-strain of at least one other. It doesn't tell us which ones are different or how they differ.

(c) L½clear – ½ clear-anti-glare+ ½ tinted - ½ tinted-anti-glare + 0 softening

H0: L=0 or (clear+tinted)/2 = (clear-anti-glare+tinted-anti-glare)/2

HA: L≠0 or (clear+tinted)/2 ≠ (clear-anti-glare+tinted-anti-glare)/2
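A sketch of estimating and testing this contrast; the eye-strain measurements are not given, so the samples below are hypothetical placeholders (18 per cover, per the design):

    import numpy as np

    rng = np.random.default_rng(1)
    names = ['clear', 'clear_ag', 'tinted', 'tinted_ag', 'softening']
    coefs = {'clear': 0.5, 'clear_ag': -0.5, 'tinted': 0.5,
             'tinted_ag': -0.5, 'softening': 0.0}
    groups = {g: rng.normal(25, 5, 18) for g in names}   # placeholder data

    k = len(names)
    N = sum(len(v) for v in groups.values())
    mse = sum(((v - v.mean()) ** 2).sum() for v in groups.values()) / (N - k)

    L_hat = sum(coefs[g] * groups[g].mean() for g in names)
    se_L = np.sqrt(mse * sum(coefs[g] ** 2 / len(groups[g]) for g in names))
    t = L_hat / se_L     # compare to a t distribution with N - k = 85 df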

(d) We would determine which pairs of means were statistically significantly different from each other (10 different comparisons in this experiment) while maintaining the family-wise error rate at the specified level. Typically this is displayed by listing the means from largest to smallest and identifying which are not significantly different from each other; as a by-product, this also shows which means are not significantly different from the largest and which are not significantly different from the smallest.
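A sketch of Tukey's HSD using the hypothetical samples from the contrast sketch above:

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    strain = np.concatenate([groups[g] for g in names])
    cover = np.repeat(names, 18)
    print(pairwise_tukeyhsd(strain, cover, alpha=0.05))  # all 10 pairwise comparisons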

(e) We could select one of the screen cover types as a control, and then see which of the other means were statistically significantly different from it (so 4 different comparisons) while maintaining a family-wise error rate at the specified level. As part of the process it would indicate which had a statistically significantly larger mean and which were statistically significantly smaller. (As it does fewer comparisons, it would have more power than Tukey’s HSD for any false null hypotheses that were common between the two.)
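A sketch of these comparisons against a control, again with the hypothetical samples from above ('clear' is arbitrarily chosen as the control; scipy.stats.dunnett requires SciPy 1.11 or later):

    from scipy import stats

    res = stats.dunnett(groups['clear_ag'], groups['tinted'],
                        groups['tinted_ag'], groups['softening'],
                        control=groups['clear'])
    print(res.statistic, res.pvalue)   # 4 comparisons vs. the control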

5. (a) The Two-Factor ANOVA model assumes the responses at each factor level combination have a normal distribution. In this case, the assumption is not strictly met since the data are integer counts, but (especially since the counts are relatively large) they still may be approximately normal. A natural distribution for the counts would be a Poisson, but note that as the mean of the Poisson becomes large, the Poisson distribution resembles the normal distribution. So the normality assumption in this analysis may in fact be fine to use.

(b) [Interaction (profile) plot: the mean count for each machine-material combination plotted against material, with a separate line connecting the means for each machine.]

Or, equivalently:

[The mean count for each combination plotted against machine, with a separate line for each material.]

(c) The analyst has ignored the fact that there is significant interaction here. He should not compare the materials alone, nor should he compare the machines alone. A full conclusion would demand some inferential comparisons among machine-material combinations, whether post hoc multiple comparisons or pre-planned comparisons (contrasts). At first glance, it appears to be the case that machine I is better if producing Plastic or Rubber gaskets, but machine II is better if producing Cork gaskets.
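A sketch of the analysis that supports this conclusion, assuming a hypothetical gaskets data frame (the exam's actual counts are not reproduced here):

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Hypothetical stand-in counts for illustration only.
    gaskets = pd.DataFrame({
        'machine':  ['I', 'I', 'I', 'II', 'II', 'II'] * 3,
        'material': ['Plastic'] * 6 + ['Rubber'] * 6 + ['Cork'] * 6,
        'count':    [42, 40, 41, 36, 35, 37,
                     44, 43, 45, 39, 38, 40,
                     30, 31, 29, 36, 37, 35],
    })

    # Two-factor model with interaction; a significant interaction line in the
    # ANOVA table is what rules out comparing machines or materials alone.
    fit2way = smf.ols('count ~ C(machine) * C(material)', data=gaskets).fit()
    print(anova_lm(fit2way))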

6. (a) The fitted regression equation, with the coefficient estimates read from the SAS output, is

ŷ = b0 + b1(air flow) + b2(water temp) + b3(acid conc.)

The predicted stack loss for a plant having air flow 60, water temperature 25, and acid concentration 90 is found by substituting those values into the fitted equation:

ŷ = b0 + b1(60) + b2(25) + b3(90)
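The fit and prediction can be reproduced in Python. The output's n = 21 and F = 59.9 on (3, 17) df match Brownlee's classic stack loss data, which is assumed here; if the exam used different data, only the hard-coded values below would change:

    import numpy as np
    import statsmodels.api as sm

    # Brownlee's stack loss data (assumed to be the exam's data set; n = 21).
    air   = [80,80,75,62,62,62,62,62,58,58,58,58,58,58,50,50,50,50,50,56,70]
    water = [27,27,25,24,22,23,24,24,23,18,18,17,18,19,18,18,19,19,20,20,20]
    acid  = [89,88,90,87,87,87,93,93,87,80,89,88,82,93,89,86,72,79,80,82,91]
    loss  = [42,37,37,28,18,18,19,20,15,14,14,13,11,12,8,7,8,8,9,15,15]

    X = sm.add_constant(np.column_stack([air, water, acid]))
    fit = sm.OLS(loss, X).fit()
    print(fit.params)                      # b0, b1, b2, b3
    print(fit.fvalue)                      # about 59.9 on (3, 17) df
    print(fit.predict([[1, 60, 25, 90]]))  # predicted stack loss at the given x's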

(b) H0: 1 = 23 = 

Ha: Not all i = 0, for i = 1, 2, 3

For this data set, we would reject H0, because of the very large ANOVA F-value of 59.9. (P-value < 0.0001).

(c) Multicollinearity refers to correlation between the values of the predictor variables in the data set. In this data set, there is not a problem with multicollinearity since all of the Variance Inflation Factors (VIFs) are relatively small (less than 3). VIFs greater than 5 or 10 or so would indicate problems with multicollinearity. [Note that it is still possible to suffer the effects of multicollinearity with smaller VIFs.]

(d) Using the rule of thumb that an observation is influential if its |DFFITS| > [2(3 + 1)/21]^(1/2) = 0.617, we see that the observations with |DFFITS| > 0.617 are 1, 3, 4, and 21. Using the alternate rule of thumb that calls an observation influential if |DFFITS| > 1, only observation 21 (with |DFFITS| = 2.1) is influential.
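Continuing the stack loss sketch from (a), the DFFITS values and the 0.617 cutoff:

    infl = fit.get_influence()
    dffits, _ = infl.dffits                # statsmodels also returns its own threshold
    cutoff = np.sqrt(2 * (3 + 1) / 21)     # 0.617
    print(np.where(np.abs(dffits) > cutoff)[0] + 1)  # flagged observation numbers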

(e) H0: 23 = given thatis in the model)

Ha: Not both i = 0, for i = 2, 3

(f) Based on the SAS output for the F-test, since F = 6.67 > 3.59 = F0.05, 2, 17, we reject H0 and conclude that the model with only air flow may NOT be sufficient. (P-value = 0.0073)
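A sketch of this partial F-test as a nested-model comparison, continuing the stack loss sketch:

    from statsmodels.stats.anova import anova_lm

    X_red = sm.add_constant(np.array(air))   # reduced model: air flow only
    fit_red = sm.OLS(loss, X_red).fit()
    print(anova_lm(fit_red, fit))            # F on (2, 17) df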

(g) H0: 3 = given thatand 2 are in the model)

Ha: 3 ≠ 0

This can be done with a t-test. From the SAS output, t = -0.97, and since |t| is not greater than t0.025, 17 = 2.11, we fail to reject H0. The model with only air flow and water temperature is sufficient. (P-value = 0.344)