Question 1: a) Briefly outline the major steps of decision tree classification.
b) Consider the following data set:
X Y Z f
1 0 1 1
1 1 0 0
0 0 0 0
0 1 1 1
1 0 1 1
0 0 1 0
0 1 1 1
1 1 1 0
Draw the decision tree which would be learned from this data using the recursive splitting algorithm. Assume that splits are chosen using information gain, and gain ties are broken to prefer splits by alphabetical order.
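As a working aid for part (b), the information gain of each candidate root split can be checked numerically. The following is a minimal sketch rather than a prescribed solution method: it assumes base-2 entropy, and the names `ROWS`, `ATTRS`, `entropy` and `information_gain` are illustrative and not part of the question.

```python
from collections import Counter
from math import log2

# Rows of the data set above, in the order (X, Y, Z, f).
ROWS = [
    (1, 0, 1, 1),
    (1, 1, 0, 0),
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (1, 1, 1, 0),
]
ATTRS = ("X", "Y", "Z")


def entropy(labels):
    """Base-2 entropy of a list of class labels (assumed convention)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of f before the split minus the weighted entropy after it."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Gain of splitting the full data set on each attribute.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name in ATTRS:
    print(name, round(gains[name], 4))
```

The same function can be applied to the subsets produced by a split to choose the attribute at each lower node; ties are then resolved alphabetically, as the question states.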
Question 2: 1. What is meant by an ‘outlier’? Of the following set of values, which is the outlier?
{0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}.
2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.
3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning algorithm and give an example of how it can be used in practical applications.
4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of attribute.
5. Write down the definition of bias and variance when fitting a model m(x) on a dataset D. Draw a sketch graph showing the effect of sampling multiple training datasets on a linear regression model.
Question 3: This question concerns association rules for the following dataset.
Temperature Humidity Windy Play
hot high false no
hot high true no
hot high false yes
mild high false yes
cool normal false yes
cool normal true no
cool normal true yes
mild high false no
cool normal false yes
mild normal false yes
mild normal true yes
mild high true yes
hot normal false yes
mild high true no
a) What are the two main differences between association rules and classification rules?
Illustrate your answer with an example.
b) Define the ‘coverage’ and ‘accuracy’ of an association rule. Explain the role of these
measures in evaluating association rules.
c) Define the term ‘item set’. Give two examples of an item set for the dataset in this
question.
d) What is the ‘subset theorem’? How can it be applied to reduce the search space of item
sets? Find all two-item sets with coverage of at least 4.
e) From the item sets you found in part (d), calculate the two single-consequent rules with
the greatest accuracy.
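Counts for parts (d) and (e) can be checked with a short script. This is a sketch under stated assumptions: coverage is taken to be the number of instances that match every item in the set, and accuracy the fraction of instances matching the antecedent that also match the consequent. The names `RECORDS`, `coverage` and `accuracy` are illustrative.

```python
COLUMNS = ("Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("hot", "high", "false", "no"),
    ("hot", "high", "true", "no"),
    ("hot", "high", "false", "yes"),
    ("mild", "high", "false", "yes"),
    ("cool", "normal", "false", "yes"),
    ("cool", "normal", "true", "no"),
    ("cool", "normal", "true", "yes"),
    ("mild", "high", "false", "no"),
    ("cool", "normal", "false", "yes"),
    ("mild", "normal", "false", "yes"),
    ("mild", "normal", "true", "yes"),
    ("mild", "high", "true", "yes"),
    ("hot", "normal", "false", "yes"),
    ("mild", "high", "true", "no"),
]
RECORDS = [dict(zip(COLUMNS, row)) for row in ROWS]


def coverage(item_set):
    """Number of instances matching every attribute=value pair in item_set."""
    return sum(
        all(rec[attr] == value for attr, value in item_set.items())
        for rec in RECORDS
    )


def accuracy(antecedent, consequent):
    """Fraction of instances matching the antecedent that also match the consequent."""
    applies = coverage(antecedent)
    correct = coverage({**antecedent, **consequent})
    return correct / applies if applies else 0.0


# Example call on one two-item set and one rule derived from it.
example_set = {"Humidity": "high", "Windy": "false"}
print(coverage(example_set))
print(accuracy({"Humidity": "high"}, {"Windy": "false"}))
```

Looping `coverage` over all pairs of attribute=value items, and discarding any pair that contains an item whose own coverage is below the threshold, mirrors the pruning asked about in part (d).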
Question 4:
(a) Explain how the following clustering methods work:
i) Agglomerative ii) Hierarchical iii) k-Means
(b) The k-Means algorithm relies on iterating between two steps. List these two steps
succinctly.
(c) Name two computational limitations of the k-Means clustering algorithm.
(d) Consider applying k-Means to a dataset that consists only of binary variables. How could
you calculate distances between a centroid and an instance?
(e) What is the objective function of the k-Means algorithm?
(f) Name one way in which the clusters found by the agglomerative clustering algorithm differ from
those found by the k-Means clustering algorithm.
(g) A fundamental assumption of basic classification algorithms is that the training and test set
data distributions are stationary. What does this mean?
(h) The naïve Bayes algorithm assumes the independent variables are conditionally
independent of each other given the class label. What assumptions of independence do
decision tree algorithms make?
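For reference when working on parts (b) to (e), the k-Means iteration can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: it assumes squared Euclidean distance, caller-supplied initial centroids, and the convention that an empty cluster keeps its previous centroid. The function names `assign`, `update` and `k_means` are hypothetical.

```python
def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # assumed convention for empty clusters
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate the two functions above until the centroids stop changing."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels = k_means(points, [(0, 0), (10, 10)])
print(final_centroids, final_labels)
```

For binary variables, as in part (d), the same loop runs unchanged because the centroid coordinates simply become the fraction of ones in each dimension; whether squared Euclidean distance is the most suitable measure there is left open by this sketch.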
Question 5: The following is an example of a customer purchase transaction data set.
CID / TID / Date / Items Purchased
1 / 1 / 01/01/2001 / 10,20
1 / 2 / 01/02/2001 / 10,30,50,70
1 / 3 / 01/03/2001 / 10,20,30,40
2 / 4 / 01/03/2001 / 20,30
2 / 5 / 01/04/2001 / 20,40,70
3 / 6 / 01/04/2001 / 10,30,60,70
3 / 7 / 01/05/2001 / 10,50,70
4 / 8 / 01/05/2001 / 10,20,30
4 / 9 / 01/06/2001 / 20,40,60
5 / 10 / 01/11/2001 / 10,20,30,60
Note: CID = Customer ID and TID = Transaction ID
a) Calculate the support, confidence and lift of the following association rule. Indicate whether the items in the association rule are independent of each other or have a negative or positive impact on each other. {10} -> {50,70}
b) The following is the list of large two-item sets. Show the steps to apply the Apriori property to generate and prune the candidates for large three-item sets. Describe how the Apriori property is used in the steps. Give the final list of candidate large three-item sets. {10,20} {10,30} {20,30} {20,40}
c) Does customer 1 support the sequence <{20} {50,70} {10}>? Justify your answer.
d) Calculate the support of <{10}, {30}>.
e) Based on the types of association rules discussed in class, identify which type(s) of
rule {10} -> {50,70} is.
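The measures in part (a) can be checked with a short script. This is a sketch under assumptions not stated in the question: support is counted per transaction (TID) rather than per customer, confidence is support(A ∪ B) / support(A), and lift is confidence / support(B). The names `TRANSACTIONS`, `support`, `confidence` and `lift` are illustrative.

```python
# Items purchased, keyed by TID, copied from the table above.
TRANSACTIONS = {
    1: {10, 20},
    2: {10, 30, 50, 70},
    3: {10, 20, 30, 40},
    4: {20, 30},
    5: {20, 40, 70},
    6: {10, 30, 60, 70},
    7: {10, 50, 70},
    8: {10, 20, 30},
    9: {20, 40, 60},
    10: {10, 20, 30, 60},
}


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in TRANSACTIONS.values() if items <= basket)
    return hits / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """confidence / support(consequent); 1 indicates independence."""
    return confidence(antecedent, consequent) / support(consequent)


rule = ({10}, {50, 70})
print(support(rule[0] | rule[1]), confidence(*rule), lift(*rule))
```

A lift above 1 suggests a positive association between antecedent and consequent, a lift below 1 a negative one, and a lift of exactly 1 independence.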
Good luck! Marghny