University of Pittsburgh
School of Computing and Information
Department of Informatics and Networked Systems
INFSCI 2711: Advanced Topics in Database Management
Spring 2018: Wednesday
Homework #2: Distributed Query Processing
Graded out of 70 points;
Due: in class, Monday 4/11,
1. [20 pt] Considerthe following relations
Student:
s_num / s_name / grades1 / Johns Smith / 1
s2 / Bill Evasn / 2
s3 / Daniel Brown / 1
s4 / Mel Gibson / 5
Library:
person_name / borrow_bookJ. Smith / Pride and Prejudice
Bill J. Evasn / Harry Potter
Daniel Brown / Vampire
Mal Gibson / Little Mermaid
Fill in the following table for every pair of s_names in Studentand person_name in the Library table. For Overlap and Jaccard distances assume that strings are sets of words separated by blank spaces.
Author.a_name / Book.author_name / sed / length filter / Sednorm / Overlap distance / Jaccard distanceJohns Smith / J. Smith / 4 / 3 / 0.36 / 2 / 0.67
Johns Smith / Bill J. Evasn / 12 / 2 / 0.92 / 5 / 1
Johns Smith / Daniel Brown / 11 / 1 / 0.92 / 4 / 1
Johns Smith / Mal Gibson / 11 / 1 / 1 / 4 / 1
Bill Evasn / J. Smith / 9 / 2 / 0.9 / 4 / 1
Bill Evasn / Bill J. Evasn / 3 / 3 / 0.23 / 1 / 0.33
Bill Evasn / Daniel Brown / 8 / 2 / 0.67 / 4 / 1
Bill Evasn / Mal Gibson / 7 / 0 / 0.7 / 4 / 1
Daniel Brown / J. Smith / 11 / 4 / 0.92 / 4 / 1
Daniel Brown / Bill J. Evasn / 11 / 1 / 0.85 / 5 / 1
Daniel Brown / Daniel Brown / 0 / 0 / 0 / 0 / 0
Daniel Brown / Mal Gibson / 9 / 2 / 0.75 / 4 / 1
Mel Gibson / J. Smith / 9 / 2 / 0.9 / 4 / 1
Mel Gibson / Bill J. Evasn / 10 / 3 / 0.77 / 5 / 1
Mel Gibson / Daniel Brown / 9 / 2 / 0.75 / 4 / 1
Mel Gibson / Mal Gibson / 1 / 0 / 0.1 / 2 / 0.67
2) [10pt] What is the result join table if we set the threshold of Overlap distance to 4? (less or less equal both count as correct)
Author.a_name / Book.author_nameJohns Smith / J. Smith
Bill Evasn / Bill J. Evasn
Daniel Brown / Daniel Brown
Mel Gibson / Mal Gibson
3) [10pt] What is minimum threshold for length filter if we get the following result?
Threshold = 3
Author.a_name / Book.author_nameJohns Smith / Bill J. Evasn
Bill Evasn / J. Smith
Bill Evasn / Daniel Brown
Daniel Brown / Mal Gibson
Mel Gibson / Daniel Brown
Mel Gibson / J. Smith
Daniel Brown / Daniel Brown
Mel Gibson / Mal Gibson
Bill Evasn / Mal Gibson
Johns Smith / Daniel Brown
Johns Smith / Mal Gibson
Daniel Brown / Bill J. Evasn
2. [20 pt] Consider the following terms and corresponding posting sizes:
Term Posting_size
Eyes 213312
Kaleidoscope 87009
Marmalade 107913
Skies 271658
Tangerine 46653
Trees 316812
Suggest the best query processing order for the following query:
(Tree or Tangerine) AND (Eyes OR marmalade) AND (skies OR kaleidoscope)
List the top three query processing order by ascending order?
Tree or Tangerine appears in 316812 + 46653 = 363465 documents in the corpus; Eyes OR marmalade appear in 213312 +107913 = 321225documents in the corpus; and skies OR kaleidoscope appear in 271658+87009 = 358667 documents in the corpus. Best query processing order process terms (set) from the least popular to most popular.
1: (Eyes OR marmalade) and (skies OR kaleidoscope) and(Tree or Tangerine)
2:(Eyes OR marmalade) and(Tree or Tangerine) and (skies OR kaleidoscope)
3: (skies OR kaleidoscope) and (Eyes OR marmalade) and(Tree or Tangerine)
3. [40 pt] Consider the following collection:
Doc1: brown bear is from forest
Doc2: iceberg melts and polar bear is gone
Doc3: bear is lovely
Consider the query below:
Query: where is polar bear from
- [10 pt] Please generate the term-document frequency matrix:
doc1 / doc2 / doc3
brown / 1
bear / 1 / 1 / 1
is / 1 / 1 / 1
from / 1
forest / 1
iceberg / 1
melts / 1
and / 1
polar / 1
gone / 1
lovely / 1
- [15 pt] If we limit the vocabulary with the following set, please generate the term-document TF-IDF weight matrix:
Vocabulary set: polar bear is from
TF / doc1 / doc2 / doc3 / IDFpolar / 0 / 1 / 0 / 0.477121
bear / 1 / 1 / 1 / 0
is / 1 / 1 / 1 / 0
from / 1 / 0 / 0 / 0.477121
TF-IDF / doc1 / doc2 / doc3
polar / 0 / 0.477121 / 0
bear / 0 / 0 / 0
is / 0 / 0 / 0
from / 0.477121 / 0 / 0
- [10 pt] Using the TF-IDF matrix you generate, please compute the cosine similarity between query and each document.
doc1 / doc2 / doc3
query / 0.7071 / 0.7071 / 0
TF-IDF / doc1 / doc2 / doc3 / Query
polar / 0 / 0.477121 / 0 / 0.477121
bear / 0 / 0 / 0 / 0
is / 0 / 0 / 0 / 0
from / 0.477121 / 0 / 0 / 0.477121
- [5 pt] Consider the cosine similarity, which document should be most relevant to the query.
Doc 1 and Doc2 is most relevant to the query.