Master of Applied Statistics
Applied Statistics Comprehensive Exam
May 2014
Directions: This is a closed-book exam with a 3-hour time limit. Attached you will find the relevant computer output, two pages of formulas, and tables for the normal, t, chi-square, and F distributions. You may use a non-programmable, non-graphing calculator.
Answer Only Five of the Six Questions.
1) A scientist is trying to compare the mean lengths (in cm) of two populations of green tree frogs. The first sample of 23 frogs was found in large lakes, while the second sample of 34 frogs was found in small ponds. The data (along with sample means and variances and some graphical summaries) are attached.
(a) Set up the correct null and alternative hypotheses for this situation.
(b) Formally compare the means using the appropriate hypothesis test. Use α = 0.05.
(c) Carefully explain the reasoning behind setting α = 0.05 (e.g., what is the underlying meaning of the significance level?).
(d) Suppose hypothetically that it was found that the scientist had, before the analysis, omitted some values that seemed like outliers. The omitted values were, for the lake sample: 4.12, 3.72, 3.35, 6.89, 7.56, 7.77, and for the pond sample: 4.01, 3.23, 3.35, 3.12, 3.14, 3.56, 3.67, 8.13. Discuss whether the scientist’s action in omitting these values can be justified, explaining your reasoning clearly.
(e) Suggest a nonparametric approach to testing the hypothesis of interest in this problem. What are the conditions for the nonparametric method to be appropriate?
2) With the impending destruction of three more parking lots on campus, a survey is being conducted to assess the views of faculty and staff towards various parking options. One of these is the option of paying $40 per month for one of a limited number of reserved spots in the otherwise free, but first-come, first-served, parking. One of the research questions concerns the possible differences in opinion between those who currently pay for $60/month garage spots and those who use a free parking lot.
To assess this, a simple random sample of 150 current garage space users is gathered (of which 116 think this is a good option to offer), and a second simple random sample of 150 current lot users is gathered (of which 79 think this is a good option to offer).
(a) Construct a 90% confidence interval for the difference in the proportion of garage space users and lot users who think this is a good option to offer.
(b) Verify whether the assumptions for constructing this interval are met.
(c) Interpret this interval in terms of the problem, including what we mean by “90% confident”.
(d) Imagine that instead of creating a confidence interval, it was desired to test the hypothesis that the garage users would be more receptive to this idea than the lot users. Briefly explain why we use the pooled sample proportion p̂ in the denominator of the test statistic instead of the separate sample proportions p̂1 and p̂2.
(e) Imagine that the university decided to use three groups instead of two – garage users, Z lot users, and seniority-based lot users – for examining whether there were differences in opinions on this proposal. What sort of chi-square test could be applied in this case (goodness-of-fit, homogeneity, or independence), and how many degrees of freedom would it have?
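The large-sample interval asked for in part (a) can be sketched numerically as follows. This is a minimal illustration, assuming the usual Wald (unpooled) interval for a difference of two independent proportions. The function names and the success/failure threshold of 10 are assumptions made for illustration and may differ from the rule given on the attached formula sheet.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence=0.90):
    """Wald (large-sample) confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Two-sided critical value, e.g. z_{0.05} for a 90% interval.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Unpooled standard error: each sample contributes its own variance.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


def counts_large_enough(x, n, threshold=10):
    """Hypothetical check: both successes and failures reach the threshold."""
    return x >= threshold and (n - x) >= threshold


# Counts taken from the problem statement: garage users, then lot users.
lower, upper = two_proportion_ci(116, 150, 79, 150)
conditions_ok = counts_large_enough(116, 150) and counts_large_enough(79, 150)
print(f"90% CI for p_garage - p_lot: ({lower:.4f}, {upper:.4f})")
print("Large-sample counts condition met:", conditions_ok)
```

The sketch checks only the count condition. The sampling assumptions in part (b) (independent simple random samples) have to be argued from the study design rather than computed.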
3) The attached data set FCAT contains test scores from 22 Florida elementary schools as found in McClave and Sincich (2013) and based on data from Tekwe (2004). MATH and READING are the average scores of third graders in those schools on the corresponding sections of the Florida Comprehensive Assessment Test (FCAT). POVERTY is the percentage of students below the poverty level in the school. The goal is to determine if there is a relationship between the poverty level within the school and the average mathematics score (predicting the mathematics score from the poverty level).
(a) Give the formal model being fit in this simple linear regression – including both the model equation, identification of any symbols used, and all necessary assumptions.
(b) Verify each of the assumptions, or state what additional output would be needed to verify them and how it would be used.
For parts c and d, assume that the assumptions are believable.
(c) Describe the relationship between the average math scores and the percentage of students below the poverty level. Statistically justify your statements.
(d) Consider constructing a 95% confidence interval for E(y) and a 95% prediction interval for y for an observation with a poverty level of 45. (You do not need to actually calculate them!) Describe how each should be interpreted in terms of the problem.
4) An experiment is conducted to compare the effectiveness of five different computer screen overlays for the reduction of eye strain. The five overlays were “clear”, “clear anti-glare”, “tinted”, “tinted anti-glare”, and “softening”. Ninety volunteers were gathered with approximately the same uncorrected vision, amount of time spent on-line, and ambient light in their offices. They were randomly assigned an overlay (18 to each) to use for an entire work week, and a measure of cumulative total eye strain (scale from 0=low to 50=high) was collected for each subject.
(a) Give the null and alternative hypotheses that would be tested by the omnibus test (aka ANOVA table test), being sure to identify any symbols used in the context of the problem.
(b) Consider a colleague with less statistical experience than you. Briefly explain why an omnibus test like the one in (a) is unlikely to reveal anything of practical use.
(c) Specify the contrast that would be used for comparing “anti-glare” to “glare”. Identify the null and alternative hypotheses tested by it in terms of the notation in (a).
(d) Describe what purpose would be accomplished by using Tukey’s Honestly Significant Differences procedure with this experimental data.
(e) Describe what purpose would be accomplished by using Dunnett’s procedure on this experimental data.
5) A company wants to compare the mean production levels of two machines that produce gaskets of three different materials (cork, rubber, and plastic). An experiment is conducted in which 3 production runs (each lasting one hour) were made by each machine for each gasket material. The data (number of gaskets produced per run) are as follows:
                         Machine
Material     MI              MII
Rubber       76, 74, 82      65, 73, 63
Plastic      86, 84, 88      72, 81, 81
Cork         53, 54, 45      59, 64, 64
(a) The two-way ANOVA model assumes that at each combination of machine and material, the production follows what type of distribution? Explain why the two-way ANOVA model may be appropriate in this case even though the data here are necessarily counts (whole numbers).
(b) Draw an interaction plot (profile plot) based on these data that will indicate whether there appears to be interaction between Machine and Material. What is your conclusion, based on the plot?
(c) Some output from SAS for this analysis is attached. An analyst examined the output and stated, “Based on the F-tests, I can see that the mean production differs among the three materials, but there is no significant difference in mean production between the two machines.” His interpretation was that the company may confidently use either machine to produce the gaskets. Assess the validity of the analyst’s conclusion.
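The profile plot in part (b) is drawn from the six cell means, so a quick way to prepare it is to tabulate them first. The following is a minimal sketch using only the data given in the problem. The variable names are illustrative, and the plotting itself is left to be done by hand.

```python
# Gaskets produced per one-hour run, keyed by (material, machine),
# copied from the data table in the problem statement.
runs = {
    ("Rubber", "MI"): [76, 74, 82],
    ("Rubber", "MII"): [65, 73, 63],
    ("Plastic", "MI"): [86, 84, 88],
    ("Plastic", "MII"): [72, 81, 81],
    ("Cork", "MI"): [53, 54, 45],
    ("Cork", "MII"): [59, 64, 64],
}

# Cell means: one plotted point per machine-material combination.
cell_means = {cell: sum(values) / len(values) for cell, values in runs.items()}

# Difference between machines within each material. Roughly equal
# differences would give parallel profiles; unequal ones suggest interaction.
machine_gap = {
    material: cell_means[(material, "MI")] - cell_means[(material, "MII")]
    for material in ("Rubber", "Plastic", "Cork")
}

for material in ("Rubber", "Plastic", "Cork"):
    print(
        f"{material:8s} MI = {cell_means[(material, 'MI')]:6.2f}  "
        f"MII = {cell_means[(material, 'MII')]:6.2f}  "
        f"MI - MII = {machine_gap[material]:7.2f}"
    )
```

Plotting the means with Material on the horizontal axis and one line per machine gives the interaction plot. The printed differences are a descriptive aid only and do not replace the F-test for interaction in the attached SAS output.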
6) The “stack loss” (y) is a measure of the efficiency of an industrial plant in absorbing ammonia. To predict stack loss, a multiple regression model was fit with the following independent variables: Air Flow (x1), Water Temperature (x2), and Acid Concentration (x3). SAS output from this regression analysis is attached.
(a) Write the fitted least-squares regression equation and use it to predict the stack loss for a plant having air flow 60, water temperature 25, and acid concentration 90. Show your work!
(b) What are the null and alternative hypotheses for testing whether there is any regression relationship between y and the set of independent variables? For this data set, what would be the conclusion of such a test? Justify your answer.
(c) Explain what multicollinearity is. Does there appear to be a problem with multicollinearity in this regression analysis? Justify your answer.
(d) Based on the SAS output, do any of the observations appear to be particularly influential? If so, which one(s)? Justify your answer.
(e) A supervisor raised the question: “If we include ‘Air Flow’ in our model, do we really need to include ‘Water Temperature’ and ‘Acid Concentration’ as well? Those two variables are harder to measure.” Write the hypothesis we could test to answer the supervisor’s question.
(f) A statistician performed the hypothesis test from part (e) in SAS and obtained the following output:
                          Mean
Source          DF      Square    F Value    Pr > F
Numerator        2    70.14307       6.67    0.0073
Denominator     17    10.51941
At α = 0.05, what is the conclusion of the test regarding the need for ‘Water Temperature’ and ‘Acid Concentration’? Justify your answer numerically.
(g) Suppose the supervisor in part (e) had asked, “If we include ‘Air Flow’ and ‘Water Temperature’ in our model, do we really need to include ‘Acid Concentration’ as well? That variable is harder to measure.” Write the hypothesis we could test to answer the supervisor’s question, and conduct the test at α = 0.05.
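As a numerical cross-check on the output shown in part (f), the F value is the ratio of the two mean squares, and its p-value is the upper-tail area of an F distribution with 2 and 17 degrees of freedom. The sketch below assumes nothing beyond the four numbers printed in that output. It relies on the closed form of the F survival function that holds when the numerator has exactly 2 degrees of freedom, so it does not generalize to the single-variable test in part (g).

```python
def f_upper_tail_df1_equals_2(f_value, df2):
    """P(F > f_value) for an F(2, df2) distribution.

    With 2 numerator degrees of freedom the survival function has the
    closed form (1 + 2*f/df2) ** (-df2/2); this is not valid for other
    numerator degrees of freedom.
    """
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


# Values copied from the SAS output in part (f).
ms_numerator, df_numerator = 70.14307, 2
ms_denominator, df_denominator = 10.51941, 17
alpha = 0.05

f_stat = ms_numerator / ms_denominator
p_value = f_upper_tail_df1_equals_2(f_stat, df_denominator)

print(f"F = {f_stat:.2f} on ({df_numerator}, {df_denominator}) df")
print(f"p-value = {p_value:.4f}")
print("Reject H0 at alpha = 0.05:", p_value < alpha)
```

On the exam itself the same comparison is made with the attached F table, by comparing the F value with the tabled critical value for 2 and 17 degrees of freedom.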