University of Pittsburgh

School of Computing and Information

Department of Informatics and Networked Systems

INFSCI 2711: Advanced Topics in Database Management

Spring 2018: Wednesday

Homework #2: Distributed Query Processing

Graded out of 70 points;

Due: in class, Monday 4/11,

1. [20 pt] Considerthe following relations

Student:

s_num / s_name / grade
s1 / Johns Smith / 1
s2 / Bill Evasn / 2
s3 / Daniel Brown / 1
s4 / Mel Gibson / 5

Library:

person_name / borrow_book
J. Smith / Pride and Prejudice
Bill J. Evasn / Harry Potter
Daniel Brown / Vampire
Mal Gibson / Little Mermaid

Fill in the following table for every pair of s_names in Studentand person_name in the Library table. For Overlap and Jaccard distances assume that strings are sets of words separated by blank spaces.

Author.a_name / Book.author_name / sed / length filter / Sednorm / Overlap distance / Jaccard distance
Johns Smith / J. Smith / 4 / 3 / 0.36 / 2 / 0.67
Johns Smith / Bill J. Evasn / 12 / 2 / 0.92 / 5 / 1
Johns Smith / Daniel Brown / 11 / 1 / 0.92 / 4 / 1
Johns Smith / Mal Gibson / 11 / 1 / 1 / 4 / 1
Bill Evasn / J. Smith / 9 / 2 / 0.9 / 4 / 1
Bill Evasn / Bill J. Evasn / 3 / 3 / 0.23 / 1 / 0.33
Bill Evasn / Daniel Brown / 8 / 2 / 0.67 / 4 / 1
Bill Evasn / Mal Gibson / 7 / 0 / 0.7 / 4 / 1
Daniel Brown / J. Smith / 11 / 4 / 0.92 / 4 / 1
Daniel Brown / Bill J. Evasn / 11 / 1 / 0.85 / 5 / 1
Daniel Brown / Daniel Brown / 0 / 0 / 0 / 0 / 0
Daniel Brown / Mal Gibson / 9 / 2 / 0.75 / 4 / 1
Mel Gibson / J. Smith / 9 / 2 / 0.9 / 4 / 1
Mel Gibson / Bill J. Evasn / 10 / 3 / 0.77 / 5 / 1
Mel Gibson / Daniel Brown / 9 / 2 / 0.75 / 4 / 1
Mel Gibson / Mal Gibson / 1 / 0 / 0.1 / 2 / 0.67

2) [10pt] What is the result join table if we set the threshold of Overlap distance to 4? (less or less equal both count as correct)

Author.a_name / Book.author_name
Johns Smith / J. Smith
Bill Evasn / Bill J. Evasn
Daniel Brown / Daniel Brown
Mel Gibson / Mal Gibson

3) [10pt] What is minimum threshold for length filter if we get the following result?

Threshold = 3

Author.a_name / Book.author_name
Johns Smith / Bill J. Evasn
Bill Evasn / J. Smith
Bill Evasn / Daniel Brown
Daniel Brown / Mal Gibson
Mel Gibson / Daniel Brown
Mel Gibson / J. Smith
Daniel Brown / Daniel Brown
Mel Gibson / Mal Gibson
Bill Evasn / Mal Gibson
Johns Smith / Daniel Brown
Johns Smith / Mal Gibson
Daniel Brown / Bill J. Evasn

2. [20 pt] Consider the following terms and corresponding posting sizes:

Term Posting_size

Eyes 213312

Kaleidoscope 87009

Marmalade 107913

Skies 271658

Tangerine 46653

Trees 316812

Suggest the best query processing order for the following query:

(Tree or Tangerine) AND (Eyes OR marmalade) AND (skies OR kaleidoscope)

List the top three query processing order by ascending order?

Tree or Tangerine appears in 316812 + 46653 = 363465 documents in the corpus; Eyes OR marmalade appear in 213312 +107913 = 321225documents in the corpus; and skies OR kaleidoscope appear in 271658+87009 = 358667 documents in the corpus. Best query processing order process terms (set) from the least popular to most popular.

1: (Eyes OR marmalade) and (skies OR kaleidoscope) and(Tree or Tangerine)

2:(Eyes OR marmalade) and(Tree or Tangerine) and (skies OR kaleidoscope)

3: (skies OR kaleidoscope) and (Eyes OR marmalade) and(Tree or Tangerine)

3. [40 pt] Consider the following collection:

Doc1: brown bear is from forest

Doc2: iceberg melts and polar bear is gone

Doc3: bear is lovely

Consider the query below:

Query: where is polar bear from

  1. [10 pt] Please generate the term-document frequency matrix:

doc1 / doc2 / doc3
brown / 1
bear / 1 / 1 / 1
is / 1 / 1 / 1
from / 1
forest / 1
iceberg / 1
melts / 1
and / 1
polar / 1
gone / 1
lovely / 1
  1. [15 pt] If we limit the vocabulary with the following set, please generate the term-document TF-IDF weight matrix:

Vocabulary set: polar bear is from

TF / doc1 / doc2 / doc3 / IDF
polar / 0 / 1 / 0 / 0.477121
bear / 1 / 1 / 1 / 0
is / 1 / 1 / 1 / 0
from / 1 / 0 / 0 / 0.477121
TF-IDF / doc1 / doc2 / doc3
polar / 0 / 0.477121 / 0
bear / 0 / 0 / 0
is / 0 / 0 / 0
from / 0.477121 / 0 / 0
  1. [10 pt] Using the TF-IDF matrix you generate, please compute the cosine similarity between query and each document.

doc1 / doc2 / doc3
query / 0.7071 / 0.7071 / 0
TF-IDF / doc1 / doc2 / doc3 / Query
polar / 0 / 0.477121 / 0 / 0.477121
bear / 0 / 0 / 0 / 0
is / 0 / 0 / 0 / 0
from / 0.477121 / 0 / 0 / 0.477121
  1. [5 pt] Consider the cosine similarity, which document should be most relevant to the query.

Doc 1 and Doc2 is most relevant to the query.