Dr. Christoph F. Eick

Draft Graded Homework2 COSC 6340

Spring 2005

Due: Mo., April 11, 11a (electronic submission)

Last updated: Mo., April 4, 3p

Remark: Points associated with particular problems are subject to change.

5) Clustering [5] Graded

Suppose the task is to cluster the following seven points (1, 1), (1, 3), (4, 5), (5, 5), (2, 1), (7,7), (6,3) into 3 clusters (k=3) assuming Manhattan Distance (d((x1,x2),(y1,y2))=|x1-y1| + |x2-y2|) and the K-means clustering algorithm. Assume that the K-means initially assigns (1,1), (1, 3) and (7, 7) to Cluster1 , (4,5) to Cluster2 and the other 3 points to Cluster3. How do those clusters change in the first two iterations when applying k-means?

6) Similarity Assessment [7] Graded

Assume the following relation Students(ssn, age, gpa, avg_class_rank) that contains students that were admitted in the year 2000 into our undergraduate program is given. You can assume that

·  age is an integer; the maximum age is 50 the minimum age is 20, and the average age is 28 and the mean absolute deviation is 10.

·  gpa denotes the UH COSC gpa; the average gpa is 2.8 and the mean absolute deviation is 0.6; the maximum gpa is 4.0 the minimum gpa is 0.

·  Avg_class_rank has 6 values (5=top-5%, 4=top-10%, 3=top-20%, 2=top-30% 1=top_half, 0=bottom half)

A.  Define a student distance measure that considers gpa and class_rank of being of major importance, and age of being of minor importance. [7]

B.  Using your distance measure compute the distance for the following pair of students [2] :

(111111111, 25, 2.8, 2)

(222222222, 24, 3.7, 5)


7) Association Rule Mining [5] Graded

Assume you have to apply the APRIORI algorithm assuming that the minimum support is 3/9 (3 out of 9) to the following set of 9 transactions that involve purchases of items A, B, C, D, E, F, G.

T1={A, C, D} T6={A, C, D, E, F}

T2={A, D, F} T7={A, B, D, F}

T3={D, E, F} T8={A, B, C, D, F}

T4={A, B, D, F} T9={B, C, E}

T5={A, F}

Describe how Apriori’s Large Item Set Generation algorithm works for the example.

List what candidate item sets will be generated in each pass, and which remain in the candidate item set after pruning (use notations of the Han book) [4]

Assuming minimum confidence is 75%, give 2 rules (of your own choice) that would be generated by an association rule mining algorithm. [1]

8) Normalization [6] graded

R(A,B,C,D,E,F) is given with: (1) ABàCD (2)CDàAB (3)ABàF (4) FàE

a) What are the candidate keys of relation R? [1]

b) Transform R into a relational schema that is in BCNF and does not have any lost functional dependencies! [5]


9) Reasoning with MVD and FD’s Ungraded (solution to be discussed in class on April 12, 2005)

Assume a relation R(A,B,C,D,E, F) with the following dependencies is given:

(1) AB à CDEF

(2) C à ABDEF

(3) E àà BC

Answer the following questions giving reasons for your answers:

a) Is R in 4th normal form? [1]

b) Is it possible to decompose R into R1 (A, B, D, E) and R2 (C, E, F)

without loss of information? [3]

c) Does C àà BDEF hold for R? [1]

d) Does E àà D hold for R? [5]

e) Does E à A hold for R? [5]

f) Is R in BCNF? [3]

1