CES 514 Data Mining Spring 2010

Home Work # 1

(Due: Feb 24, 2010)

1.  Consider a conjunctive query that involves intersecting several sets. Write a program in Matlab that attempts tp optimize this computation by considering the sets in the increasing order of size. (Thus, the smallest two sets are first intersected, the result is intersected with the next smallest set etc.) Your program should also output the number of comparisons performed by this optimized algorithm. Assume that each set is an array of integers, and the first parameter to your program is a cell array of all these arrays. Since the number of sets being intersected is itself a variable, the second input to your program is an array containing the indices of the sets being intersected.

Example: Suppose the arrays are [1, 3, 5, 11], [2, 5, 10] [2, 3, 4, 11, 14] and [1, 11] and suppose we want to intersect the sets 1 and 4. Then, the call mintersect(C, [1,4]) (where C = {[1, 3, 5, 11], [2, 5, 10] [2, 3, 4, 11, 14], [1, 11]}) should get the task accomplished and the resulting intersection would be [1,11].

Solution: Shown below is the Matlab code to implement a solution to this problem.

function [res, cnt] = sinter(carr)

len = comp_lengths(carr);

[ind,len] = sort(len);

[cnt, res] = intersect(carr(ind(1),ind(2));

for j=3:length(len)

[tcnt, res] = intersect(res, carr(ind(j)));

cnt = cnt + tcnt;

end;

end;

function len = comp_lengths(carr)

len(1) = length(carr(1));

for j = 2:length(carr)

len(end+1) = length(carr(j));

end;

end;

2.  Exercise 1.11

SOLUTION.

A naive evaluation of the query x AND (NOT y) would be to calculate (NOT y) first as a new postings list, which takes O(N) time, and the merge it with x. Therefore, the overall complexity will be O(N).

An efficient postings merge algorithm to evaluate x AND (NOT y) is:

MERGE1 (x, y)

1 answer <- ( )

2 while x!= NIL and y!= NIL

3 do if docID(x)=docID(y)

4 then x<- next(x)

5 y<-next(y)

6 else if docID(x)<docID(y)

7 then ADD(answer,docID(x))

8 x<- next(x)

9 else y<-next(y)

10 return(answer)