Test 1, St 590, Dickey. Solutions on last page.
1. (10 pts.) A leaf in a tree computed on training data had 200 1’s and 130 0’s. What is the contribution ______ of that leaf to the misclassification count?
2. (10 pts.) If a contingency table has 3 rows and its Chi-square test statistic has 12 degrees of freedom, how many columns _____ must the table have?
3. (10 pts.) Average squared error is a way of assessing how well we did with which of the three kinds of predictions (estimates, decisions, rankings) mentioned in the book and in class?
4. (24 pts.) I ran a regression with monthly data. I fit a linear trend plus 11 monthly dummy variables using the reference cell approach with December as the reference cell. The linear part of the model is 100 + 3t where t is the observation number. The January dummy variable coefficient is 10.
(A) Observation 10 occurs in December for these data. Compute its predicted value ______.
(B) Find, if possible from the given information, the predicted values for observation 11 ______ and observation 12 ______. (Put “NP” if not possible.)
(C) In words that might interest a non-statistician, give a brief interpretation of that January coefficient 10.
5. (10 pts.) The following contingency table has counts of undergraduate students in 4 majors classified by whether they finished in 4 years or not, this being my target.
                      Major
                  A    B    C    D   Totals
Finished?  Yes   12   22    9    7     50
           No    13    4   15   18     50
(A) Compute the contribution ______ to the Chi-square statistic coming from the 9 people in major C who finished in 4 years.
(B) Suppose the p-value for the Chi-square statistic for this table is exactly p = 0.00010. Compute the associated logworth ______ (no Bonferroni adjustment).
6. (36 pts.) The target in the decision tree below is whether a person purchases (Target=1) or does not purchase (Target=0) a product. The first split is on gender (M or F) and the second on age in years as shown.
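[Tree diagram: first split on gender (M, F); second split on age (Age < 40.5, Age > 40.5). Node counts and percentages are not reproduced here.]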
(A) Estimate the probability of purchasing for a female age 35 ______ and a male age 35 ______.
(B) How many females ______ were in the data set and how many of them purchased ______?
(C) How many people ______ were in the root node?
(D) What is the percentage of 1’s ______ and 0’s ______ in the root node?
(E) Based on this tree, estimate the probability of purchasing for the 5% of people most likely to purchase ______ and, from that, the lift ______ at the fifth percentile.
SOLUTIONS
(1) Since it’s the training data, the decision is 1, the more likely class, leading to 200 correct and 130 incorrect decisions. Thus 130 is the misclassification count coming from this leaf.
(2) 12 = (3-1)(c-1) so c=7.
(3) Estimates
(4) December: 100 + 3(10) + 0 = 130 (reference month)
January: 100 + 3(11) + 10 = 143 (January effect 10)
February: 100 + 3(12) + ? = NP (the February coefficient is not given)
The 10 is the January effect minus the December effect. It is the difference between those means after adjusting for the trend.
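The arithmetic in (4) can be checked with a short Python sketch. The names `month_effects` and `predict` are hypothetical, introduced only for illustration. Only the December (reference cell, effect 0) and January (effect 10) coefficients are given in the problem, so the sketch returns None, standing for "NP", for any other month.

```python
# Dummy coefficients known from problem 4; December is the reference cell (effect 0).
# The coefficients for the other ten months are not given in the problem.
month_effects = {"Dec": 0, "Jan": 10}

def predict(t, month):
    """Return 100 + 3t + month effect, or None ("NP") if the effect is unknown."""
    if month not in month_effects:
        return None
    return 100 + 3 * t + month_effects[month]

print(predict(10, "Dec"))  # observation 10, December
print(predict(11, "Jan"))  # observation 11, January
print(predict(12, "Feb"))  # observation 12, February: coefficient not given
```

The first two calls reproduce the 130 and 143 above; the third has no answer because the February coefficient was never stated.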
(5) (A) Expected count: (9+15)(50)/100 = 12, so the contribution is (9-12)(9-12)/12 = 9/12 = 0.75.
(B) Logworth = -log10(0.0001) = 4 (p is 10 raised to the -4 power).
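As a check on (5), the following Python sketch recomputes the cell contribution and the logworth from the table in problem 5. The helper name `cell_contribution` is hypothetical; the formula used is the usual (observed - expected)^2 / expected, with expected = (row total)(column total)/n.

```python
import math

# Observed counts from problem 5 (rows: Yes, No; columns: majors A, B, C, D).
observed = [[12, 22, 9, 7],
            [13, 4, 15, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

def cell_contribution(i, j):
    """Chi-square contribution of cell (i, j): (observed - expected)^2 / expected."""
    expected = row_totals[i] * col_totals[j] / n
    return (observed[i][j] - expected) ** 2 / expected

# Row 0 is "Yes", column 2 is major C.
contribution_c_yes = cell_contribution(0, 2)

# Logworth is -log10 of the p-value (no Bonferroni adjustment).
logworth = -math.log10(0.0001)

print(contribution_c_yes, logworth)
```

Summing `cell_contribution` over all eight cells would give the full Chi-square statistic for the table.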
(6) (A) F .4, M .9 (regardless of age). The point here is to remember that a leaf cannot be split any further. Its elements cannot be distinguished from each other. All males, regardless of age, have the same predicted probability based on the tree model.
(B) 8000 females, (.65)(8000) = 5200 purchased. 35% did not.
(C), (D) 10000 people in root node, 5200 + (.9)(2000) = 7000 purchased, so 70% 1’s and 30% 0’s.
(E) Most likely (males) .9 probability, overall .7 probability, so lift = .9/.7 = 1.2857.
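The root-node and lift calculations in (6) can be retraced with the sketch below. It uses only the figures quoted in the solution (8000 females at 65%, 2000 males at 90%) and follows the solution in taking the male leaf as the highest-probability leaf. The sizes of the female age leaves are not given in the text, so they are left out. Variable names are illustrative.

```python
# Figures quoted in the solution to problem 6.
n_female, p_female = 8000, 0.65   # female node: count and proportion purchasing
n_male, p_male = 2000, 0.90       # male leaf: count and proportion purchasing

n_total = n_female + n_male
purchasers = n_female * p_female + n_male * p_male
overall_rate = purchasers / n_total

# The top 5% of people must fit inside the male leaf for its probability
# to apply to the whole group (assumes males are the most likely purchasers).
top_count = 0.05 * n_total
assert top_count <= n_male

# Lift = response rate in the top group / overall response rate.
lift = p_male / overall_rate

print(n_total, round(overall_rate, 4), round(lift, 4))
```

Because the male leaf holds 20% of the data, the same lift applies at any depth up to the 20th percentile.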