Practice Problems for Midterm #1

Solution to Practice Problems for Applied Statistics Midterm #1

1) What is the difference between a population and a sample? What is our objective in examining samples?

Answer: A population is the set of all items. A sample is a subset of the population. Our objective is to infer things about populations by examining samples.

2) Using the unemployment data in the Excel spreadsheet P1DATA.XLS (linked to the course website), create:

a) a frequency distribution

b) a relative frequency distribution

c) a percent frequency distribution

d) a cumulative relative frequency distribution

e) a cumulative percent frequency distribution

Answer: For brevity, I have eliminated the middle items.

Value / Frequency / Relative Frequency / Cumulative Relative Frequency / Percent Frequency / Cumulative Percent Frequency
3.4 / 9 / 0.0186 / 0.0186 / 1.86% / 1.86%
3.5 / 8 / 0.0165 / 0.0351 / 1.65% / 3.51%
3.6 / 1 / 0.0021 / 0.0372 / 0.21% / 3.72%
3.7 / 8 / 0.0165 / 0.0537 / 1.65% / 5.37%
3.8 / 16 / 0.0331 / 0.0868 / 3.31% / 8.68%
3.9 / 7 / 0.0145 / 0.1012 / 1.45% / 10.12%
… / … / … / … / … / …
10.3 / 1 / 0.0021 / 0.9897 / 0.21% / 98.97%
10.4 / 3 / 0.0062 / 0.9959 / 0.62% / 99.59%
10.5 / 0 / 0.0000 / 0.9959 / 0.00% / 99.59%
10.6 / 0 / 0.0000 / 0.9959 / 0.00% / 99.59%
10.7 / 0 / 0.0000 / 0.9959 / 0.00% / 99.59%
10.8 / 2 / 0.0041 / 1.0000 / 0.41% / 100.00%
Total / 492
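
The columns above can be built mechanically from the raw values. The following is a minimal sketch, assuming the rates are available as a plain Python list; the `sample_rates` values and the `frequency_table` helper are hypothetical and are not taken from P1DATA.XLS.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, freq, rel_freq, cum_rel_freq), sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / n, cumulative / n))
    return rows


# Hypothetical rates for illustration only, NOT the course data.
sample_rates = [3.4, 3.5, 3.4, 3.6, 3.5, 3.4, 3.8, 3.8]

for value, freq, rel, cum_rel in frequency_table(sample_rates):
    # Percent frequencies are the relative frequencies expressed as percentages.
    print(f"{value} / {freq} / {rel:.4f} / {cum_rel:.4f} / {rel:.2%} / {cum_rel:.2%}")
```

Note that this sketch lists only values that actually occur, so zero-frequency rows such as 10.5 in the table above would have to be added separately.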

f) a histogram

g) an ogive

h) a stem and leaf display (use only the years 1981-1984 for this)

Answer: Note that I have used the integer portion as the stem.

7 | 2 2 2 2 3 3 4 4 4 4 4 5 5 5 5 5 6 7 8 8 9
8 | 0 3 3 5 5 6 8 9
9 | 0 2 3 4 4 5 6 8 8
10 | 1 1 1 2 3 4 4 4 8 8
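
As a rough illustration of how such a display is built, the sketch below splits each rate into an integer stem and a tenths-digit leaf. The `rates` list is read off the 8 and 9 stems shown above; the `stem_and_leaf` helper is a hypothetical name, and the approach assumes every value has exactly one decimal place.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each integer stem to its sorted list of tenths-digit leaves."""
    display = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)       # e.g. 8.3 -> 83, sidesteps float noise
        stem, leaf = divmod(tenths, 10)  # 83 -> stem 8, leaf 3
        display[stem].append(leaf)
    return dict(display)


# Values taken from the 8 and 9 stems of the display above.
rates = [8.0, 8.3, 8.3, 8.5, 8.5, 8.6, 8.8, 8.9,
         9.0, 9.2, 9.3, 9.4, 9.4, 9.5, 9.6, 9.8, 9.8]

for stem, leaves in stem_and_leaf(rates).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```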

3) Using the unemployment data used in problem 2,

a) what is the mean?

b) what is the median?

c) what is the mode?

d) what is the range?

e) what is the interquartile range?

f) what is the five number summary?

g) what is the variance?

h) what is the standard deviation?

i) what is the z-score of the smallest observation?

j) what is the z-score of the largest observation?

Answer:

mean = 5.96

median = 5.7

mode = 5.4

range = 7.4

Q1 = 5.0 ; Q2 = 5.7 ; Q3 = 7.0

interquartile range = 2.0

five number summary = {3.4,5.0,5.7,7.0,10.8}

variance = 2.28

standard deviation = 1.51

z-score of 3.4 = -1.69

z-score of 10.8 = 3.20
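
The z-scores follow from z = (x - mean) / standard deviation. The sketch below applies that formula using the rounded mean and standard deviation reported above, so its results may differ from the reported z-scores in the last decimal place; the function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Rounded summary statistics from the answer above (assumed inputs).
mean, std_dev = 5.96, 1.51

z_min = z_score(3.4, mean, std_dev)    # smallest observation
z_max = z_score(10.8, mean, std_dev)   # largest observation

print(f"z(3.4)  = {z_min:.4f}")
print(f"z(10.8) = {z_max:.4f}")
```

The asymmetry is the point: the largest observation sits roughly twice as many standard deviations from the mean as the smallest one does.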

k) create a table that lists the z-score for every item

Answer: For brevity, I have eliminated the middle items.

Year / Month / Civilian Unemployment Rate / Z-Score
1960 / January / 5.2 / -0.50
1960 / February / 4.8 / -0.77
1960 / March / 5.4 / -0.37
1960 / April / 5.2 / -0.50
1960 / May / 5.1 / -0.57
1960 / June / 5.4 / -0.37
… / … / … / …
2000 / July / 4 / -1.30
2000 / August / 4.1 / -1.23
2000 / September / 3.9 / -1.36
2000 / October / 3.9 / -1.36
2000 / November / 4 / -1.30
2000 / December / 4 / -1.30

4) Are there any unemployment outliers in the data used in problem 2? Justify your answer carefully.

Answer: The z-score of 3.20 for November and December 1982 is quite high, but we have a large data set (492 observations). There are no other items with z-scores over 3, but there are 10 items with z-scores over 2.7. Also note that the lowest z-score is only -1.69. This suggests that the data are strongly skewed to the right. In this scenario, the high z-scores are not likely to indicate outliers.

5) Answer not provided due to its similarity to the take home exam.

6) Answer not provided due to its similarity to the take home exam.

7) Using the interest rate data and the classes formed in problem 5,

a) create a table of grouped interest rate data

Range / # of Observations
2.50 – 4.99 / 152
5.00 – 7.49 / 225
7.50 – 9.99 / 79
10.00 – 12.49 / 20
12.50 – 15.00 / 16

b) what is the mean of the grouped data?

c) what is the variance of the grouped data?

d) what is the standard deviation of the grouped data?

Answer: mean = 6.32; variance = 5.78; standard deviation = 2.40. Note that these calculations used the class midpoints, e.g., (2.5+4.99)/2 = 3.745 for the first class.
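
The grouped calculation can be sketched as follows, treating every observation in a class as if it sat at the class midpoint. The answer above does not say whether the variance divides by n or by n - 1, so both are computed here; that choice, and all variable names, are assumptions of this sketch.

```python
import math

# (lower limit, upper limit, number of observations) from the grouped table.
classes = [
    (2.50, 4.99, 152),
    (5.00, 7.49, 225),
    (7.50, 9.99, 79),
    (10.00, 12.49, 20),
    (12.50, 15.00, 16),
]

n = sum(count for _, _, count in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
counts = [count for _, _, count in classes]

# Weighted mean of the midpoints.
grouped_mean = sum(m * c for m, c in zip(midpoints, counts)) / n

# Weighted sum of squared deviations from the grouped mean.
squared_dev = sum(c * (m - grouped_mean) ** 2 for m, c in zip(midpoints, counts))

var_n = squared_dev / n               # divide by n
var_n_minus_1 = squared_dev / (n - 1)  # divide by n - 1

print(f"n = {n}, mean = {grouped_mean:.2f}")
print(f"variance (n) = {var_n:.3f}, variance (n - 1) = {var_n_minus_1:.3f}")
print(f"std dev (n) = {math.sqrt(var_n):.2f}")
```

With 492 observations the two divisors give nearly the same variance, so the choice matters little here.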

8) Compare the values calculated in problem 7 to the mean, variance, and standard deviation of the full interest rate data. Comment on the potential errors when we have grouped data.

Answer: full data mean = 6.18; full data variance = 5.86; full data standard deviation = 2.42. Notice that these are slightly different from what we found in problem 7. This illustrates the basic problem with grouped data: replacing each observation with its class midpoint discards information. As is typical (but not true all the time), our grouped estimates of variability are slightly below the estimates using the full data set. The means are also different, but this is likely due to the skewed nature of the data.

9) Consider the following sales data:

Month / Sales / Month / Sales

January / $200 / July / $140
February / $190 / August / $150
March / $200 / September / $140
April / $180 / October / $120
May / $170 / November / $110
June / $170 / December / $90

a) Create a histogram of the data that might mislead people to believe that sales are generally increasing.

b) Create a time series plot that might mislead people to believe that sales are generally increasing.

c) Based on your answers to a) and b), what advice would you give to people who review business plans for potential investment?

Answer: We must be careful to examine the labels on any graphical and/or tabular display. We should NEVER look at a presentation without carefully considering the labels.

d) Briefly comment on the relationship between transparency, ethics, and legality.

Answer: A display might be perfectly accurate and complete, but not be transparent. If the display is misleading, we certainly violate ethical standards. Legislation typically addresses accuracy, so in many cases, one could create a misleading display that is unethical yet legal.

10) Evaluate the validity of the following statement: “A cutoff rule of z = +/-3 should be used to determine outliers”. If you disagree, comment on how one might determine an appropriate outlier rule.

Answer: The statement is too general and is therefore incorrect. Any cutoff rule must be determined on a case-by-case basis. In making the determination, the sample size should be considered as well as potential skewness in the data. For example, large samples commonly have values with z=+/-3. Such values should not be discarded.

11) Evaluate the validity of the following statement: “Ethical behavior demands that we present data in such a way that it is accurate and complete, but not transparent”.

Answer: The statement is decidedly incorrect. Transparency is important for both communicative reasons and ethical reasons. If we present data (or descriptions of data) in such a way that it is difficult to interpret, we are guilty of potentially misleading our audience. Whether intentional or not, this could be considered an ethical violation.

12) Briefly comment on how outliers might cause descriptive statistics to be misleading.

Answer: The existence of just one outlier can dramatically alter many of our descriptive statistics. For example, suppose that we roll a die 10 times and come up with {3,2,5,4,3,6,1,2,4,2}. The mean is 3.20, the standard deviation is 1.55, and the five-number summary is {1,2,3,4.5,6}. Suppose, though, that we accidentally typed in 66 instead of 6 in the data. The mean is then 9.20, the standard deviation is 19.99, and the five-number summary is {1,2,3,4.5,66}. In this example, we would surely catch the error. In many cases, however, such errors are difficult to detect.
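
The die example can be checked with Python's standard `statistics` module. This is a small sketch: `stdev` is the sample (n - 1) standard deviation, which is consistent with the figures quoted above, and the median is included only to show a statistic that the typo does not move.

```python
import statistics

rolls = [3, 2, 5, 4, 3, 6, 1, 2, 4, 2]

# Simulate the typing error: the single 6 is entered as 66.
typo_rolls = [66 if x == 6 else x for x in rolls]

for label, data in [("clean", rolls), ("with typo", typo_rolls)]:
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)   # sample standard deviation (n - 1)
    median = statistics.median(data)
    print(f"{label}: mean = {mean:.2f}, std dev = {std_dev:.2f}, median = {median}")
```

The mean and standard deviation change sharply, while the median, like Q1 and Q3 in the five-number summary, is unaffected.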

13) Consider four sets A, B, C, and D such that A∩B≠∅, A∩C≠∅, A∩D=∅, B∩C≠∅, B∩D=∅, C∩D=∅, and A∩B∩C=∅. Draw a Venn diagram depicting this situation.

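Answer: (Diagram not reproduced here. From the conditions, A, B, and C each overlap the other two pairwise, but no region is common to all three, and D is drawn apart from the other three sets.)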

14) A door-to-door salesman has examined historical data on his success given the sex of the person who answers the door. 74% of the time, a woman answers the door. He has also noted the following:
P(sale)=0.3 (i.e., the man makes a sale at 30% of the houses he approaches)
P(sale ∩ woman)=0.19 (i.e., 19% of the time, a woman answers the door and a sale follows).
What is the probability of getting a sale given that a man answers the door?

Answer: We want P(sale|male). We are tempted to use Bayes’ Rule, but it isn’t necessary. We know that P(sale ∩ male) = 0.3-0.19 = 0.11 and that males answer the door 26% of the time. From our rule for conditional probabilities, P(sale|male)=P(sale ∩ male)/P(male) = 0.11/0.26 = 42.3%.
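
The same arithmetic, written out as a short sketch. It assumes, as the answer does, that either a man or a woman answers the door; the variable names are illustrative.

```python
p_woman = 0.74            # P(woman answers)
p_sale = 0.30             # P(sale)
p_sale_and_woman = 0.19   # P(sale and woman answers)

# Complement rule: a man answers whenever a woman does not.
p_man = 1 - p_woman

# Sales split into those made to women and those made to men.
p_sale_and_man = p_sale - p_sale_and_woman

# Conditional probability: P(sale | man) = P(sale and man) / P(man).
p_sale_given_man = p_sale_and_man / p_man
p_sale_given_woman = p_sale_and_woman / p_woman

print(f"P(sale | man)   = {p_sale_given_man:.1%}")
print(f"P(sale | woman) = {p_sale_given_woman:.1%}")
```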

15) A bank screens credit applicants based on three factors: current debt, income, and prior payment history. 40% of all applicants are rejected. 15% of applicants fail the debt test. 20% of applicants fail the income test. 5% of applicants fail the payment history test. You know that a certain customer applied and was rejected. What is the probability that the customer was rejected due to low income? Comment on your ability to answer the question if 30% of all applicants are rejected (and the other numbers are the same).

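Answer: We want P(low income|rejection). Assuming, as the problem implies, that failing any one test leads to rejection, P(low income ∩ rejection) = P(low income) = 0.20. We know from Venn diagrams and the derivation of Bayes’ Rule that

P(low income|rejection) = P(low income ∩ rejection)/P(rejection) = 0.20/0.40 = 50%.

So there is a 50% chance that the applicant failed due to low income. We can use this approach because the three reasons for rejection are mutually exclusive: 15% + 20% + 5% = 40%, which matches the overall rejection rate, so no applicant fails more than one test. If only 30% of all applicants are rejected, then some applicants must have been rejected for multiple reasons. To answer the question, we must clarify whether we are interested in “rejected due only to low income” or “rejected due to low income and/or other reasons”. If the latter, the answer is 0.20/0.30 = 66.67%, since every applicant who fails the income test is still rejected. If the former, we cannot answer without knowing how the failures overlap.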

16) Suppose that telemarketing sales are dependent on two factors: weather (when it’s raining, more people are home) and time of day (if you call during prime time, people are less likely to answer the phone). Those factors are independent. It rains with probability 0.1 and prime time constitutes 40% of the normal calling hours. A telemarketer can make 10 calls per hour. The net profit (including everything except telemarketer wages) per successful call is $9 and the probabilities of success on a given call are as follows.

Raining / Not Raining
Prime Time / 0.25 / 0.15
Not Prime Time / 0.3 / 0.2

Telemarketers charge $15 per hour. What is the expected profit per hour of calling? Should you implement a restricted calling plan? If so, what would you recommend? What is the expected profit per hour of calling under the new plan?

Answer: The probabilities for the possible scenarios are

Raining / Not Raining
Prime Time / 0.1×0.4 = 0.04 / 0.9×0.4 = 0.36
Not Prime Time / 0.1×0.6 = 0.06 / 0.9×0.6 = 0.54

P(success on a randomly chosen call) = P(prime time & raining)×P(success | prime time & raining)

+ P(not prime time & raining)×P(success | not prime time & raining)

+ P(prime time & not raining)×P(success | prime time & not raining)

+ P(not prime time & not raining)×P(success | not prime time & not raining)

= 0.04×0.25 + 0.06×0.3 + 0.36×0.15 + 0.54×0.2

= 0.19

The expected profit per hour of calling is then 10×0.19×$9 - $15 = $2.10. On average, 1.9 calls per hour are successful, giving a net profit of $17.10 less the $15 paid to the telemarketer.

The lowest probabilities of success occur during prime time, so we might consider not making calls during prime time. To answer this, we consider whether prime time calling is profitable or not.

P(success on a randomly chosen call during prime time) =

P(raining)×P(success | prime time & raining)

+ P(not raining)×P(success | prime time & not raining)

= 0.1×0.25 + 0.9×0.15 = 0.16

The expected profit per hour of calling during prime time is then 10×0.16×$9 - $15 = -$0.60. So, we should not make calls during prime time.

P(success on a randomly chosen call not during prime time) = 0.1×0.3 + 0.9×0.2 = 0.21.

Expected profit under the new plan = 10×0.21×$9 - $15 = $3.90.

One might also consider not calling unless it is raining, but that would be difficult to implement. It also might result in low morale because the employees would be on a very uncertain work schedule. I therefore chose not to consider that possibility.
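
The expected-profit comparison above can be reproduced with a short sketch. The constants come from the problem statement; the names (`hourly_profit`, `success_given_time`, the "prime"/"off" and "rain"/"dry" labels) are hypothetical and chosen only for readability.

```python
P_RAIN = 0.1            # probability that it is raining
P_PRIME = 0.4           # share of calling hours that fall in prime time
CALLS_PER_HOUR = 10
PROFIT_PER_SALE = 9.0   # dollars, before telemarketer wages
WAGE_PER_HOUR = 15.0    # dollars

# P(success | time slot, weather), from the table in the problem.
p_success = {
    ("prime", "rain"): 0.25, ("prime", "dry"): 0.15,
    ("off", "rain"): 0.30,   ("off", "dry"): 0.20,
}


def success_given_time(time_slot):
    """Average over the weather, which is independent of the time slot."""
    return (P_RAIN * p_success[(time_slot, "rain")]
            + (1 - P_RAIN) * p_success[(time_slot, "dry")])


def hourly_profit(p):
    """Expected profit for one hour of calls with success probability p."""
    return CALLS_PER_HOUR * p * PROFIT_PER_SALE - WAGE_PER_HOUR


p_prime = success_given_time("prime")
p_off = success_given_time("off")
p_overall = P_PRIME * p_prime + (1 - P_PRIME) * p_off

print(f"all hours:      p = {p_overall:.2f}, profit = {hourly_profit(p_overall):.2f}")
print(f"prime time:     p = {p_prime:.2f}, profit = {hourly_profit(p_prime):.2f}")
print(f"not prime time: p = {p_off:.2f}, profit = {hourly_profit(p_off):.2f}")
```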