252solng3-052 11/07/05
(Open this document in 'Page Layout' view!)
Name:
Class days and time:
Please include this on what you hand in!
Graded Assignment 3
A. In your outline there are 6 methods to compare means or medians: methods D1, D2, D3, D4, D5a and D5b. Methods D6a and D6b compare proportions and method D7 compares variances or standard deviations. In the following cases, identify $H_0$ and $H_1$ and identify which method to use. If the hypotheses involve a mean, state the hypotheses in terms of both $\mu$ and $D = \mu_1 - \mu_2$. If the hypotheses involve a proportion, state them in terms of both $p$ and $\Delta p = p_1 - p_2$. If the hypotheses involve standard deviations or variances, state them in terms of both $\sigma^2$ and $\frac{\sigma_1^2}{\sigma_2^2}$ or $\frac{\sigma_2^2}{\sigma_1^2}$. All the questions involve means, medians, proportions or variances. One of these problems is a chi-squared test. Remember that a yes answer is not acceptable without an explanation.
Note: Look at 252thngs on the syllabus supplement part of the website before you start (and before you take exams). Remember that I use $\mu$ and $\sigma$ as parameters and $\bar{x}$ and $s$ as sample statistics.
------
Example: This may seem long but it appears on an old Graded Assignment 3.
A group of supervisors are given the exams on management skills before and after taking a course in management. Scores are as follows.
Supervisor / Before / After
1 / 63 / 78
2 / 93 / 92
3 / 84 / 91
4 / 72 / 80
5 / 65 / 69
6 / 72 / 85
7 / 91 / 99
8 / 84 / 82
9 / 71 / 81
10 / 80 / 87
11 / 68 / 93
If we assume that the distribution of results is Normal, what method should we use to answer the question “Has the course improved the scores of the managers?”
Solution: You are comparing means before and after the course. You can get away with using means because the parent distributions are Normal. If $\mu_2$ is the mean of the second sample, you are hoping that $\mu_2 > \mu_1$, which, because it contains no equality, is an alternate hypothesis. So your hypotheses are $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \ge 0$, $H_1: D < 0$. The important thing to notice here is that the data are in before and after pairs, so you use Method D4.
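As a rough numerical check of the Method D4 reasoning, the paired test ratio can be computed directly from the table of scores. The following Python sketch is not part of the original solution. It takes d = Before - After for each supervisor, and it assumes a 5% significance level, for which the one-sided table value of t with 10 degrees of freedom is 1.812.

```python
from math import sqrt
from statistics import mean, stdev

before = [63, 93, 84, 72, 65, 72, 91, 84, 71, 80, 68]
after = [78, 92, 91, 80, 69, 85, 99, 82, 81, 87, 93]

# Paired differences d = before - after, one per supervisor.
d = [b - a for b, a in zip(before, after)]
n = len(d)

d_bar = mean(d)                   # sample mean of the differences
s_d_bar = stdev(d) / sqrt(n)      # standard error of the mean difference
t_ratio = (d_bar - 0) / s_d_bar   # test ratio for H0: D >= 0

# Assumed for illustration: alpha = 0.05, so the one-sided table value
# of t with n - 1 = 10 degrees of freedom is 1.812.
t_table = 1.812
reject_h0 = t_ratio < -t_table    # left-tailed test, H1: D < 0

print(f"d_bar = {d_bar:.3f}, s_d_bar = {s_d_bar:.3f}, t = {t_ratio:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With these data the test ratio comes out near -3.78, well below -1.812, so under the stated assumptions the null hypothesis is rejected and the scores appear to have improved.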
------
Notes: 1) All methods in section D are methods that can only be used for comparison of 2 samples. This is because, if $\theta$ (theta) is a parameter like $\mu$ or $p$, the difference $\theta_1 - \theta_2$ is easy to define and will be zero if $\theta_1$ and $\theta_2$ are equal. If we go to more than two samples, say 3, we need something like $\sum (\theta_i - \bar{\theta})^2$, where $\bar{\theta}$ is some sort of average of the parameters of the samples. This will equal zero if all the parameters are equal and will not allow positive discrepancies in one sample to cancel out negative discrepancies in another. This is what takes us to chi-squared and ANOVA methods.
Saying is not the same as saying , because would be negative if , but saying is the same as saying . (Try proving this – it’s simple algebra.)
2) You can always substitute a method for the median for a method for the mean, but not vice versa.
However, if a Normal distribution applies, a method involving means will be more efficient and powerful.
3) The computer will use Method D3 when it is not told what method to use. This is quite general because if the sample variances are similar, it gives results like D2 and if the sample sizes are large, it gives results like D1. However, if the variances are equal, D2 is easier to use and if the samples are large, D1 is easier to use.
4) The K-S and Lilliefors methods only exist because chi-squared performs so poorly for small samples. K-S needs $\mu$ and $\sigma$ or other parameters. Lilliefors uses $\bar{x}$ and $s$ and only works to test for a Normal distribution.
5) ‘Significant’ in statistics means that we have rejected a hypothesis like $H_0: \mu = 0$ and ‘significantly different’ means that we have rejected a hypothesis like $H_0: \mu_1 = \mu_2$. Of course, if two parameters are significantly different, their difference is significant.
6) Be careful of inequalities. If or and , then
------
1. You have data on income in two villages ($x_1$ in village 1, $x_2$ in village 2). You want to test the hypothesis that village 2 has higher earnings than village 1. You know that income has an extremely skewed distribution, and you have to decide whether to use the mean or the median income.
Solution: If $\eta$ is the median, $H_0: \eta_1 \ge \eta_2$, $H_1: \eta_1 < \eta_2$. Since we are comparing medians and the data are not paired, use Method D5a. (Note: If total earnings in the two villages are more important to you than typical earnings, go with the mean!)
Question for a Later Exam: What if there are 3 villages? Solution: Use Kruskal-Wallis.
2. The data in the file CONCRETE 1 on your CD represent the strength (measured by how many thousands of pounds per square inch they can take without buckling) of 40 concrete samples on the second and seventh days after pouring. ($x_1$ is the strength on the second day and $x_2$ is the strength on the seventh day; each line refers to a single sample.) Assume that the underlying distribution is Normal and test the hypothesis that the concrete is stronger on the seventh day.
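Solution: If it is stronger on the seventh day, we can say $\mu_1 < \mu_2$. Because this does not contain an equality, it is an alternate hypothesis. $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \ge 0$, $H_1: D < 0$. The important thing to notice here is that the data are in pairs because each line refers to a single sample, so you use Method D4. If we use a confidence interval it will be of the form $D \le \bar{d} + t_{\alpha}^{(39)} s_{\bar{d}}$, where $d = x_1 - x_2$ and $s_{\bar{d}} = s_d/\sqrt{40}$.
If we use a test ratio, we will compare $t = \frac{\bar{d} - 0}{s_{\bar{d}}}$ with $-t_{\alpha}^{(39)}$.
If we use a critical value for $\bar{d}$, we will use $\bar{d}_{cv} = 0 - t_{\alpha}^{(39)} s_{\bar{d}}$.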
3. You have interviewed a sample of 80 small businesses in the Northeast and 75 small businesses in the Southeast. Each business has indicated whether it sells in foreign markets. You want to show that businesses in the Northeast are more likely to export. ($x_1$ is the total number of firms that export in the Northeast sample, $x_2$ in the Southeast.)
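Solution: If $p_1$ and $p_2$ are the proportions of businesses that export in the Northeast and the Southeast, $H_0: p_1 \le p_2$, $H_1: p_1 > p_2$ or $H_0: p_1 - p_2 \le 0$, $H_1: p_1 - p_2 > 0$. If $\Delta p = p_1 - p_2$, then $H_0: \Delta p \le 0$, $H_1: \Delta p > 0$. Since we are comparing proportions from independent samples, use Method D6a.
If we use a confidence interval it will be of the form $\Delta p \ge \Delta\bar{p} - z_{\alpha} s_{\Delta\bar{p}}$, where $\bar{p}_1 = x_1/80$, $\bar{p}_2 = x_2/75$, $\bar{q} = 1 - \bar{p}$, $\Delta\bar{p} = \bar{p}_1 - \bar{p}_2$ and $s_{\Delta\bar{p}} = \sqrt{\frac{\bar{p}_1\bar{q}_1}{80} + \frac{\bar{p}_2\bar{q}_2}{75}}$.
If we use a test ratio, we will compare $z = \frac{\Delta\bar{p} - 0}{\sigma_{\Delta\bar{p}}}$ with $z_{\alpha}$, where $\sigma_{\Delta\bar{p}} = \sqrt{\bar{p}_0\bar{q}_0\left(\frac{1}{80} + \frac{1}{75}\right)}$ and $\bar{p}_0 = \frac{x_1 + x_2}{155}$.
If we use a critical value for $\Delta\bar{p}$, we will use $\Delta\bar{p}_{cv} = 0 + z_{\alpha}\sigma_{\Delta\bar{p}}$.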
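To make the Method D6a test ratio concrete, here is a minimal Python sketch of a right-tailed test for two independent proportions. The sample sizes 80 and 75 come from the problem, but the counts of exporters (30 and 20) and the 5% significance level are made up for illustration. The assignment gives no data for this question.

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 80, 75   # sample sizes from the problem
x1, x2 = 30, 20   # HYPOTHETICAL counts of exporters, not from the assignment
alpha = 0.05      # assumed significance level

p1_bar, p2_bar = x1 / n1, x2 / n2
delta_p_bar = p1_bar - p2_bar

# Under H0 the two proportions are equal, so the samples are pooled
# to estimate the common proportion.
p0_bar = (x1 + x2) / (n1 + n2)
sigma_delta = sqrt(p0_bar * (1 - p0_bar) * (1 / n1 + 1 / n2))

z_ratio = (delta_p_bar - 0) / sigma_delta
z_alpha = NormalDist().inv_cdf(1 - alpha)   # right-tailed critical value
reject_h0 = z_ratio > z_alpha               # H1: p1 - p2 > 0

print(f"z = {z_ratio:.3f}, z_alpha = {z_alpha:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With these invented counts the test ratio is about 1.44, which is below 1.645, so the null hypothesis would not be rejected. Real data could of course lead to the opposite conclusion.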
4. You expand the sample in 3 by adding 60 small businesses in the Midwest ($x_3$ is the number of these that export). You test the hypothesis that the same fraction of businesses export in each region.
Solution: $H_0: p_1 = p_2 = p_3$, $H_1$: not all of the proportions are equal, or $H_0$: the three regions are homogeneous, $H_1$: they are not. Since we are comparing more than two proportions, use a chi-squared test of homogeneity.
Somehow, I missed a question 5 and one method this year, so here’s a problem from last year’s assignment.
5. You have a sample of earned incomes for 25 couples, both of whom are teachers. ($x_1$ is the women's incomes in a column, $x_2$ is the men's. Each line represents one couple.) Test to see if the men make more than the women.
Solution: If we are working with incomes, especially if the sample is small and we have no reason to believe that a symmetrical distribution applies, we should compare medians. If $\eta$ is the median, $H_0: \eta_1 \ge \eta_2$, $H_1: \eta_1 < \eta_2$. Since we are comparing medians and the data are paired, use Method D5b.
Question for a Later Exam: In Southern Utah and Northern Arizona there are a number of polygamous families, so what if we add a column 3 for a second wife? Solution: Friedman Test.
6. In order to see which garage to use under contract for automobile repairs, 10 cars are towed first to garage 1 and then to garage 2. You end up with two data sets: the first data column, $x_1$, is estimates from the first garage and the second data column, $x_2$, is estimates from the second garage. Each of the 10 lines of data refers to one car. You believe that the estimates are approximately normally distributed. Compare the estimates in garage 1 and 2. Would you change your method if there were 200 cars?
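Solution: There is no reason to assume that one garage is cheaper than the other, so $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$ or $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$. If $D = \mu_1 - \mu_2$, then $H_0: D = 0$, $H_1: D \ne 0$. Again, you compare means because you are, presumably, interested in the total amount that you will pay for the repairs, which means that you want the lowest average cost. The important thing to notice here is that the data are in pairs, so you use Method D4. If the sample expands from 10 to 200, Method D4 is still the most efficient method and the only valid one, because it is the only method for comparing means with paired data.
If we use a confidence interval it will be of the form $D = \bar{d} \pm t_{\alpha/2}^{(9)} s_{\bar{d}}$, where $d = x_1 - x_2$ and $s_{\bar{d}} = s_d/\sqrt{10}$.
If we use a test ratio, we will compare $t = \frac{\bar{d} - 0}{s_{\bar{d}}}$ with $\pm t_{\alpha/2}^{(9)}$.
If we use critical values for $\bar{d}$, we will use $\bar{d}_{cv} = 0 \pm t_{\alpha/2}^{(9)} s_{\bar{d}}$.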
Question for a Later Exam: What if we want to check 3 garages? Solution: 2-way ANOVA, with one measurement per cell.
7. You have processing times in seconds, $x_1$, for a sample of 5 computer jobs from the accounting department and, $x_2$, for 6 jobs from the research department. You believe that the underlying distributions are Normal and want to show that research jobs take longer than accounting jobs. Would you change your method if
Solution: $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \ge 0$, $H_1: D < 0$. Because you believe that the Normal distribution applies, you use a method that compares means. The total sample size is too small to use Method D1, which means that D2 or D3 should work. You could test the variances for equality and use D2, or not bother and use D3.
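If we use a confidence interval it will be of the form $D \le \bar{d} + t_{\alpha} s_{\bar{d}}$, where $\bar{d} = \bar{x}_1 - \bar{x}_2$. Under Method D2, $s_{\bar{d}} = s_p\sqrt{\frac{1}{5} + \frac{1}{6}}$, where $s_p$ is the pooled standard deviation, and $t$ has $5 + 6 - 2 = 9$ degrees of freedom. Under Method D3, $s_{\bar{d}} = \sqrt{\frac{s_1^2}{5} + \frac{s_2^2}{6}}$ and the degrees of freedom must be computed from the sample variances.
If we use a test ratio, we will compare $t = \frac{\bar{d} - 0}{s_{\bar{d}}}$ with $-t_{\alpha}$.
If we use a critical value for $\bar{d}$, we will use $\bar{d}_{cv} = 0 - t_{\alpha} s_{\bar{d}}$.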
Question for a Later Exam: What if we want to compare three or more departments? Solution: One-way ANOVA.
8. You are having a part produced on two different machines. $x_1$ is 200 randomly selected data points that represent the length of parts from machine one; $x_2$ is 200 randomly selected data points that represent the length of parts from machine two. You want to test your suspicion that parts from machine 2 are longer than parts from machine 1. In a problem of this type you would assume that the lengths are normally distributed.
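Solution: In a problem of this type you would assume that the lengths are normally distributed. You could use Method D2 (if you tested the variances for equality) or D3 here, but, since you have two large samples, it would be far easier to use Method D1.
$H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \ge 0$, $H_1: D < 0$.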
9. You also suspect that parts from machine two are more variable in length than parts from machine one (This is the same as saying that machine 2 is less reliable than machine 1). Test this suspicion.
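Solution: When you see words like reliability or variability, think variance or standard deviation. $H_0: \sigma_1 \ge \sigma_2$, $H_1: \sigma_1 < \sigma_2$ or $H_0: \sigma_1^2 \ge \sigma_2^2$, $H_1: \sigma_1^2 < \sigma_2^2$. In terms of the variance ratio $\frac{\sigma_1^2}{\sigma_2^2}$ or $\frac{\sigma_2^2}{\sigma_1^2}$, the alternate hypothesis rules, so $H_0: \frac{\sigma_2^2}{\sigma_1^2} \le 1$ and $H_1: \frac{\sigma_2^2}{\sigma_1^2} > 1$. Since you are comparing variances, use Method D7. Compare the ratio $\frac{s_2^2}{s_1^2}$ against $F_{\alpha}^{(199,199)}$.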
Question for a Later Exam: What if you doubt that the Normal distribution applies? Solution: Levene Test.
Question for a Later Exam: What if there are 3 machines? Solution: Levene or Bartlett Test.
10. A panel is exposed to an ad for Smelly-Welly Dirt Devourer. Before seeing the ad, 5 out of the 40 members had a favorable impression of Smelly-Welly. After seeing the ad, 2 more members of the panel plus the original 5 had a favorable impression. Has the proportion with favorable impressions risen significantly?
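Solution: It clearly says that we are comparing proportions and that we want to see if $p_2 > p_1$. Since this does not contain an equality, it must be an alternate hypothesis. We can use $H_0: p_1 \ge p_2$, $H_1: p_1 < p_2$ or $H_0: p_1 - p_2 \ge 0$, $H_1: p_1 - p_2 < 0$. If $\Delta p = p_1 - p_2$, then $H_0: \Delta p \ge 0$, $H_1: \Delta p < 0$. We do not have independent samples but are questioning the same panel twice. Since we are not comparing proportions from independent samples, use Method D6b. Note that in our setup, if $x_{12}$ is the number who said yes the first time and no the second and $x_{21}$ is the number who said no the first time and yes the second, we have $x_{21} = 2$ people who said no on the first question and yes on the second, but $x_{12} = 0$ people who said yes on the first but no on the second.
We will compare $z = \frac{x_{12} - x_{21}}{\sqrt{x_{12} + x_{21}}} = \frac{0 - 2}{\sqrt{2}} = -1.414$ against $-z_{\alpha}$.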
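The test ratio for paired proportions depends only on the panel members who changed their answer, which the short Python sketch below makes explicit. The variable names and the 5% significance level are assumptions for illustration and do not come from the outline.

```python
from math import sqrt
from statistics import NormalDist

# Panel members who changed their opinion after seeing the ad.
yes_then_no = 0   # favorable before, unfavorable after
no_then_yes = 2   # unfavorable before, favorable after
alpha = 0.05      # assumed significance level

# Test ratio for paired proportions: only the switchers matter.
z_ratio = (yes_then_no - no_then_yes) / sqrt(yes_then_no + no_then_yes)

z_alpha = NormalDist().inv_cdf(1 - alpha)
reject_h0 = z_ratio < -z_alpha   # left-tailed test, H1: p1 - p2 < 0

print(f"z = {z_ratio:.3f}, critical value = {-z_alpha:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

At the assumed 5% level, -1.414 is not below -1.645, so the rise in favorable impressions would not be judged significant.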
Question for a Later Exam: What if we now show another ad and 1 more person switches into the favorable column?
Solution: I really haven’t figured this one out yet. Sounds like some variation on the Chi-squared test.
Summary
It may help to use the following table.
Comparison / Paired Samples / Independent Samples
Location - Normal distribution. Compare means. / Method D4 / Methods D1-D3
Location - Distribution not Normal. Compare medians. / Method D5b / Method D5a
Proportions / Method D6b / Method D6a
Variability - Normal distribution. Compare variances. / Method D7
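The summary table can also be read as a small decision rule. The Python sketch below is one possible encoding. The function name and its arguments are hypothetical, and the choice among D1, D2 and D3 follows the remarks in the notes and solutions above (large samples make D1 easiest, equal variances allow D2, and D3 is the general fallback).

```python
def choose_method(compare, paired, normal=True,
                  large_samples=False, equal_variances=False):
    """Return the outline's method label for a two-sample comparison.

    compare is "location", "proportion" or "variability".
    """
    if compare == "proportion":
        return "D6b" if paired else "D6a"
    if compare == "variability":
        return "D7"                      # assumes a Normal distribution
    if compare != "location":
        raise ValueError(f"unknown comparison: {compare!r}")
    if not normal:
        return "D5b" if paired else "D5a"   # compare medians
    if paired:
        return "D4"                          # paired means
    if large_samples:
        return "D1"
    return "D2" if equal_variances else "D3"


print(choose_method("location", paired=True))                  # D4
print(choose_method("location", paired=False, normal=False))   # D5a
print(choose_method("proportion", paired=True))                # D6b
print(choose_method("location", paired=False, large_samples=True))  # D1
```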
B. You have 3 methods that can be used for goodness of fit tests: Chi-squared, Kolmogorov-Smirnov and Lilliefors. Which would you use in the following cases?
1. You want to know if the Normal distribution applies to a data set.
a. The data set consists of 15 numbers – you do not know the population mean and variance and will have to compute sample means and variances from the data.
b. The data set consists of 15 numbers – you think that you know the population mean and variance.
c. The data set consists of 5000 numbers and you have observed frequencies for the following intervals: below 1000, 1000-1199.99, 1200-1399.99, 1400-1599.99, ..., 2600-2799.99, 2800 and above. You think you know the population mean and variance.
d. The data set consists of 5000 numbers and you have observed frequencies for the following intervals: below 1000, 1000-1199.99, 1200-1399.99, 1400-1599.99, ..., 2600-2799.99, 2800 and above. You have computed a sample mean and variance from the data.
Solution: a. If you only have 15 numbers, there are too few to be able to get more than three frequencies if we use Chi-squared. K-S cannot be used if the parameters are unknown, so that you will have to use Lilliefors.
b. If you only have 15 numbers, there are too few to be able to get more than three frequencies if we use Chi-squared. K-S can be used because the parameters are known in advance.
c. With large data sets, most people use Chi-squared. However, there is no reason not to use K-S in this case. You can use K-S because you think you know $\mu$ and $\sigma$.
d. With large data sets, most people use Chi-squared. However, there is no reason not to use Lilliefors in this case. You cannot use K-S because you do not know $\mu$ or $\sigma$.
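In a and d, our hypotheses are $H_0$: the distribution is Normal and $H_1$: the distribution is not Normal.
In b and c, our hypotheses are $H_0: x \sim N(\mu, \sigma)$ and $H_1$: $x$ does not have this distribution, where we replace $\mu$ and $\sigma$ with the numbers that we believe are correct before we start.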
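The reasoning in cases a through d reduces to two questions: is the sample large enough to group into frequencies, and are the parameters known in advance? A minimal Python sketch of that rule for testing Normality follows. The function name and the two boolean flags are hypothetical, and deciding whether a sample counts as large is left to the user.

```python
def goodness_of_fit_methods(parameters_known, large_grouped_sample):
    """List the usable tests for Normality, most customary first."""
    methods = []
    if large_grouped_sample:
        # Chi-squared needs enough data to fill several intervals.
        methods.append("Chi-squared")
    # K-S requires parameters fixed in advance; Lilliefors estimates them.
    methods.append("Kolmogorov-Smirnov" if parameters_known else "Lilliefors")
    return methods


cases = [("a", False, False), ("b", True, False),
         ("c", True, True), ("d", False, True)]
for label, known, large in cases:
    print(label, goodness_of_fit_methods(known, large))
```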