Question 1: a) Briefly outline the major steps of decision tree classification.

b) Consider the following data set:

X Y Z f
1 0 1 1
1 1 0 0
0 0 0 0
0 1 1 1
1 0 1 1
0 0 1 0
0 1 1 1
1 1 1 0

Draw the decision tree that would be learned from this data using the recursive splitting algorithm. Assume that splits are chosen using information gain, and that ties in gain are broken by choosing the attribute that comes first in alphabetical order.
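As an illustration of how information gain is evaluated for a candidate split, the following minimal sketch computes the gain of each attribute at the root for the data set above. The names `ROWS`, `entropy` and `information_gain` are chosen for this example only, and entropy is assumed to be measured in bits (log base 2).

```python
from collections import Counter
from math import log2

# Rows of the data set above as (X, Y, Z, f); f is the class label.
ROWS = [
    (1, 0, 1, 1),
    (1, 1, 0, 0),
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (1, 1, 1, 0),
]
ATTRS = ("X", "Y", "Z")


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Gain of each attribute when used as the first (root) split.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name in ATTRS:
    print(f"gain({name}) = {gains[name]:.4f}")
```

The same function can be applied recursively to the subset of rows in each branch to choose the later splits.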

Question 2: 1. What is meant by an ‘outlier’? Which of the following values is the outlier?

{0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}

2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.

3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning algorithm and give an example of how it can be used in practical applications.

4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of attribute.

5. Write down the definitions of bias and variance when fitting a model m(x) on a dataset D. Sketch a graph showing the effect of sampling multiple training datasets on a linear regression model.

Question 3: This question is concerned with association rules and uses the following dataset.

Temperature  Humidity  Windy  Play
hot          high      false  no
hot          high      true   no
hot          high      false  yes
mild         high      false  yes
cool         normal    false  yes
cool         normal    true   no
cool         normal    true   yes
mild         high      false  no
cool         normal    false  yes
mild         normal    false  yes
mild         normal    true   yes
mild         high      true   yes
hot          normal    false  yes
mild         high      true   no

a) What are the two main differences between association rules and classification rules? Illustrate your answer with an example.

b) Define the ‘coverage’ and ‘accuracy’ of an association rule. Explain the role of these measures in evaluating association rules.
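For illustration, the sketch below computes both measures for one example rule on the dataset above. It assumes the common textbook definitions: coverage is the number of instances for which the whole rule holds, and accuracy is that number divided by the number of instances matching the antecedent. The function and variable names are hypothetical.

```python
# Weather data from this question: (temperature, humidity, windy, play).
ATTRIBUTES = ("temperature", "humidity", "windy", "play")
ROWS = [
    ("hot", "high", "false", "no"),
    ("hot", "high", "true", "no"),
    ("hot", "high", "false", "yes"),
    ("mild", "high", "false", "yes"),
    ("cool", "normal", "false", "yes"),
    ("cool", "normal", "true", "no"),
    ("cool", "normal", "true", "yes"),
    ("mild", "high", "false", "no"),
    ("cool", "normal", "false", "yes"),
    ("mild", "normal", "false", "yes"),
    ("mild", "normal", "true", "yes"),
    ("mild", "high", "true", "yes"),
    ("hot", "normal", "false", "yes"),
    ("mild", "high", "true", "no"),
]


def coverage(item_set):
    """Number of instances containing every attribute=value pair in item_set."""
    return sum(
        all(row[ATTRIBUTES.index(attr)] == value for attr, value in item_set.items())
        for row in ROWS
    )


def accuracy(antecedent, consequent):
    """Share of instances matching the antecedent that also match the consequent."""
    return coverage({**antecedent, **consequent}) / coverage(antecedent)


# Example rule (assumed for illustration): humidity = normal -> play = yes
rule_coverage = coverage({"humidity": "normal", "play": "yes"})
rule_accuracy = accuracy({"humidity": "normal"}, {"play": "yes"})
print(rule_coverage, round(rule_accuracy, 3))
```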

c) Define the term ‘item set’. Give two examples of an item set for the dataset in this question.

d) What is the ‘subset theorem’? How can it be applied to reduce the search space of item sets? Find all two-item sets with coverage of at least 4.

e) From the item sets you found in part (d), calculate the two single-consequent rules with the greatest accuracy.

Question 4:

(a) Explain how the following clustering methods work: i) Agglomerative ii) Hierarchical iii) K-Means

(b) The k-Means algorithm relies on iterating between two steps. List these two steps succinctly.
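A minimal pure-Python sketch of that iteration is given below, assuming squared Euclidean distance and numeric points stored as tuples. The function names `assign`, `update` and `k_means`, and the convention of leaving an empty cluster's centroid unchanged, are choices made for this example only.

```python
def assign(points, centroids):
    """Assignment step: label each point with the index of its nearest centroid."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumed convention: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members))
            )
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate the two steps until the centroids stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return centroids, labels


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels = k_means(points, [(0, 0), (10, 10)])
print(final_centroids, final_labels)
```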

(c) Name two computational limitations of the k-Means clustering algorithm.

(d) Consider applying k-Means to a dataset that consists only of binary variables. How could you calculate distances between a centroid and an instance?

(e) What is the objective function of the k-Means algorithm?

(f) Name one way in which the clusters found by the agglomerative clustering algorithm differ from those found by the k-Means clustering algorithm.

(g) A fundamental assumption of basic classification algorithms is that the training and test set data distributions are stationary. What does this mean?

(h) The naïve Bayes algorithm assumes the independent variables are conditionally independent of each other given the class label. What independence assumptions do decision tree algorithms make?

Question 5: The following is an example of a customer purchase transaction data set.

CID / TID / Date / Items Purchased
1 / 1 / 01/01/2001 / 10,20
1 / 2 / 01/02/2001 / 10,30,50,70
1 / 3 / 01/03/2001 / 10,20,30,40
2 / 4 / 01/03/2001 / 20,30
2 / 5 / 01/04/2001 / 20,40,70
3 / 6 / 01/04/2001 / 10,30,60,70
3 / 7 / 01/05/2001 / 10,50,70
4 / 8 / 01/05/2001 / 10,20,30
4 / 9 / 01/06/2001 / 20,40,60
5 / 10 / 01/11/2001 / 10,20,30,60

Note: CID = Customer ID and TID = Transaction ID

a) Calculate the support, confidence and lift of the association rule {10} -> {50,70}. Indicate whether the items in the rule are independent of each other or have a positive or negative impact on each other.
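The three measures can be sketched in code as follows. This is an illustration under one assumption that the question does not state explicitly: support is counted per transaction (10 transactions), not per customer. The names `TRANSACTIONS` and `support` are hypothetical.

```python
# Items purchased in transactions TID 1..10 from the table above.
TRANSACTIONS = [
    {10, 20},
    {10, 30, 50, 70},
    {10, 20, 30, 40},
    {20, 30},
    {20, 40, 70},
    {10, 30, 60, 70},
    {10, 50, 70},
    {10, 20, 30},
    {20, 40, 60},
    {10, 20, 30, 60},
]


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


antecedent, consequent = {10}, {50, 70}

# support(A -> B)    = P(A and B)
# confidence(A -> B) = P(A and B) / P(A)
# lift(A -> B)       = confidence(A -> B) / P(B)
rule_support = support(antecedent | consequent)
rule_confidence = rule_support / support(antecedent)
rule_lift = rule_confidence / support(consequent)

print(rule_support, round(rule_confidence, 4), round(rule_lift, 4))
```

A lift of 1 indicates independence, a value above 1 a positive association, and a value below 1 a negative association.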

b) The following is the list of large two-item sets: {10,20} {10,30} {20,30} {20,40}. Show the steps to apply the Apriori property to generate and prune the candidates for large three-item sets. Describe how the Apriori property is used in these steps. Give the final list of candidate large three-item sets.
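The join and prune steps can be sketched as below. This is a minimal illustration of the usual Apriori candidate generation, assuming item sets are kept as sorted tuples and joined when they share all but their last item; the function name `apriori_gen` is chosen for this example.

```python
from itertools import combinations


def apriori_gen(large_item_sets):
    """Generate size-(k+1) candidates from large size-k item sets."""
    large = sorted(tuple(sorted(s)) for s in large_item_sets)
    large_lookup = set(large)
    k = len(large[0])
    candidates = []
    for a, b in combinations(large, 2):
        # Join step: combine two item sets that agree on their first k-1 items.
        if a[:-1] != b[:-1]:
            continue
        candidate = a + (b[-1],)
        # Prune step (Apriori property): every size-k subset of a large
        # item set must itself be large, so drop candidates that fail this.
        if all(sub in large_lookup for sub in combinations(candidate, k)):
            candidates.append(candidate)
    return candidates


large_two_item_sets = [{10, 20}, {10, 30}, {20, 30}, {20, 40}]
candidates = apriori_gen(large_two_item_sets)
print(candidates)
```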

c) Does customer 1 support the sequence <{20} {50,70} {10}>? Justify your answer.

d) Calculate the support of <{10}, {30}>.
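Sequence support can be checked mechanically with a sketch like the one below. It assumes the usual sequential-pattern definition: a customer supports a sequence if its elements are contained, in order, in distinct transactions of that customer, and support is the fraction of customers that do so. Transactions are assumed to be listed in date order, as in the table. The names `CUSTOMERS`, `supports` and `sequence_support` are hypothetical.

```python
# Each customer's transactions in date order (from the table above).
CUSTOMERS = {
    1: [{10, 20}, {10, 30, 50, 70}, {10, 20, 30, 40}],
    2: [{20, 30}, {20, 40, 70}],
    3: [{10, 30, 60, 70}, {10, 50, 70}],
    4: [{10, 20, 30}, {20, 40, 60}],
    5: [{10, 20, 30, 60}],
}


def supports(transactions, sequence):
    """True if each element of `sequence` fits a distinct, later transaction."""
    position = 0
    for element in sequence:
        # Scan forward to the earliest remaining transaction containing element.
        while position < len(transactions) and not element <= transactions[position]:
            position += 1
        if position == len(transactions):
            return False
        position += 1  # the next element must match a strictly later transaction
    return True


def sequence_support(sequence):
    """Fraction of customers whose transaction history supports `sequence`."""
    hits = sum(supports(history, sequence) for history in CUSTOMERS.values())
    return hits / len(CUSTOMERS)


print(sequence_support([{10}, {30}]))
```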

e) Based on the types of association rules discussed in class, identify which type(s) of rule {10} -> {50,70} is.

Good luck! Marghny