Data Mining: Tutorial 1

Intelligent Data Analysis and Probabilistic Inference

Data Mining Tutorial 3: Clustering and Associations Rules

i. Explain the operation of the k-means clustering algorithm using pseudo code.

ii. Given the following eight points, and assuming initial cluster centroids given by A, B, C, and that a Euclidean distance function is used for measuring distance between points, use k-means to show only the three clusters and calculate their new centroids after the second round of execution.

ID / X / Y
A / 2 / 10
B / 2 / 5
C / 8 / 4
D / 5 / 8
E / 7 / 5
F / 6 / 4
G / 1 / 2
H / 4 / 9

i. Explain the meaning of support and confidence in the context of association rule discovery algorithms and explain how the a priori heuristic can be used to improve the efficiency of such algorithms.

ii. Given the transactions described below, find all rules between single items that have support >= 60%. For each rule report both support and confidence.

1: (Beer)
2: (Cola, Beer)
3: (Cola, Beer)
4: (Nuts, Beer)
5: (Nuts, Cola, Beer)
6: (Nuts, Cola, Beer)
7: (Crisps, Nuts, Cola)
8: (Crisps, Nuts, Cola, Beer)
9: (Crisps, Nuts, Cola, Beer)
10:(Crisps, Nuts, Cola, Beer)

3. a. Explain how hierarchical clustering algorithms work, make sure your answer describes what is meant by a linkage method and how it is used.

b. Explain the advantages and disadvantages of hierarchical clustering compared to K-means clustering.

4. The following table shows the distance matrix between five genes,

G1 / G2 / G3 / G4 / G5
G1 / 0
G2 / 9 / 0
G3 / 3 / 7 / 0
G4 / 6 / 5 / 9 / 0
G5 / 11 / 10 / 2 / 8 / 0

i. Based on a complete linkage method show the distance matrix between the first formed cluster and the other data points.

ii. Draw a dendrogram showing the full hierarchical clustering tree for five points based on complete linkage.

iii. Draw a dendrogram showing the full hierarchicatree for the five points based on single linkage.

Data Mining Tutorial 3: Answers

Clusters after 1st iteration

Cluster1: A (2,10), D (5,8), H (4,9)

Cluster2: B: B (2,5), G (1,2)

Cluster3: C (8,4), E (7,5), F (6,4)

Centroids after 1st iteration

Cluster1: centroid: (3.66, 9)

Cluster2: centroid: (1.5, 3.5)

Cluster3: centroid: (7, 4.33)

Clusters after 2nd iteration (no change) Cluster1: A (2,10), D (5,8), H (4,9)

Cluster2: B: B (2,5), G (1,2)

Cluster3: C (8,4), E (7,5), F (6,4)

Centroids after 2nd iteration (no change)

Cluster1: centroid: (3.66, 9)

Cluster2: centroid: (1.5, 3.5)

Cluster3: centroid: (7, 4.33)

Initial Supports

Beer: Support = 9/10

Cola: Support=8/10

Nuts: Support=7/10

Crisps: Support=4/10 (Drop Crisps)

Beer, Cola: Support=7/10

Beer, Nuts: Support=6/10

Cola, Nuts: Support=6/10

Beer->Cola (Support=70%, Confidence= 7/9=77%

Cola->Beer (Support=70%, Confidence= 7/8=87.5

Beer->Nuts (Support=60%, Confidence= 6/9=66%

Nuts->Beer (Support= 60%, Confidence= 6/7=85.7%

Cola->Nuts (Support=60%, Confidence= 6/8=75%

Nuts->Cola (Support=60%, Confidence= 6/7=85.7%

4. The first cluster will be formed from G3 and G5 since they have the minimum distance.

G35 / G1 / G2 / G4
G35 / 0
G1 / 11 / 0
G2 / 10 / 9 / 0
G4 / 9 / 6 / 5 / 0

, 16th Dec2003