Search Engine Homework #2

Consider the following inverted index and the query Q1 = (3, 0, 1, 5, 0, 0, 2, 0), whose components are the query weights for terms t1 through t8.

Documents
Terms / D1 / D2 / D3 / D4 / D5 / dk
t1 / 4 / 3 / 0 / 0 / 2 / ?
t2 / 4 / 0 / 2 / 3 / 1 / ?
t3 / 0 / 0 / 0 / 2 / 0 / ?
t4 / 1 / 4 / 5 / 1 / 2 / ?
t5 / 2 / 3 / 0 / 0 / 0 / ?
t6 / 3 / 1 / 1 / 0 / 4 / ?
t7 / 0 / 0 / 1 / 4 / 2 / ?
t8 / 0 / 1 / 0 / 3 / 0 / ?

* The number in each cell indicates the term frequency (tf).

  1. Rank the documents with respect to Q1 using tf*idf term weights (wki) and cosine similarity.
  • tf = fki = number of times that term k occurs in document i
  • Nd = number of documents in the collection
  • dk = number of documents in which term k appears (postings)
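The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `tf`, `n_docs`, and `doc_freq` are hypothetical and not part of the assignment, and the table is copied from the term-frequency table above.

```python
# Term-frequency table from the assignment:
# one row per term t1..t8, one column per document D1..D5.
tf = [
    [4, 3, 0, 0, 2],  # t1
    [4, 0, 2, 3, 1],  # t2
    [0, 0, 0, 2, 0],  # t3
    [1, 4, 5, 1, 2],  # t4
    [2, 3, 0, 0, 0],  # t5
    [3, 1, 1, 0, 4],  # t6
    [0, 0, 1, 4, 2],  # t7
    [0, 1, 0, 3, 0],  # t8
]

# Nd: number of documents in the collection (number of columns).
n_docs = len(tf[0])

# dk: number of documents in which term k occurs at least once.
doc_freq = [sum(1 for f in row if f > 0) for row in tf]

print("Nd =", n_docs)
for k, dk in enumerate(doc_freq, start=1):
    print(f"d{k} = {dk}")
```

Counting only cells with a nonzero frequency follows the definition of dk as a document count, not a sum of occurrences.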

Step 1: compute tf*idf weights

- Nd = ?

- d1 = ?, d2 = ?, d3 = ?, d4 = ?, d5 = ?, d6 = ?, d7 = ?, d8 = ?

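Terms / D1 / D2 / D3 / D4 / D5 / dk
t1 / ? / ? / ? / ? / ? / ?
t2 / ? / ? / ? / ? / ? / ?
t3 / ? / ? / ? / ? / ? / ?
t4 / ? / ? / ? / ? / ? / ?
t5 / ? / ? / ? / ? / ? / ?
t6 / ? / ? / ? / ? / ? / ?
t7 / ? / ? / ? / ? / ? / ?
t8 / ? / ? / ? / ? / ? / ?

* The number in each document cell indicates the tf*idf weight.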

Terms / D1 / D2 / D3 / D4 / D5 / Q
t1 / 0.887 / 0.666 / 0 / 0 / 0.444 / 3
t2 / 0.388 / 0 / 0.194 / 0.291 / 0.097 / 0
t3 / 0 / 0 / 0 / 1.398 / 0 / 1
t4 / 0 / 0 / 0 / 0 / 0 / 5
t5 / 0.796 / 1.194 / 0 / 0 / 0 / 0
t6 / 0.291 / 0.097 / 0.097 / 0 / 0.388 / 0
t7 / 0 / 0 / 0.222 / 0.887 / 0.444 / 2
t8 / 0 / 0.398 / 0 / 1.194 / 0 / 0

* The number in each document cell indicates the tf*idf weight; the Q column holds the query weights from Q1.
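The worksheet does not state which idf formula it uses. The sketch below assumes wki = fki * log10(Nd / dk), an assumption that appears consistent with the weights in the table above when rounded to three decimals. The variable names are hypothetical.

```python
import math

# Term frequencies: rows t1..t8, columns D1..D5 (copied from the assignment).
tf = [
    [4, 3, 0, 0, 2],
    [4, 0, 2, 3, 1],
    [0, 0, 0, 2, 0],
    [1, 4, 5, 1, 2],
    [2, 3, 0, 0, 0],
    [3, 1, 1, 0, 4],
    [0, 0, 1, 4, 2],
    [0, 1, 0, 3, 0],
]

n_docs = len(tf[0])  # Nd

weights = []
for row in tf:
    dk = sum(1 for f in row if f > 0)   # document frequency of this term
    idf = math.log10(n_docs / dk)       # ASSUMPTION: base-10 log of Nd/dk
    weights.append([round(f * idf, 3) for f in row])

for k, row in enumerate(weights, start=1):
    print(f"t{k}", row)
```

Under this assumption a term that occurs in every document (t4 here) gets idf = log10(1) = 0, which is why its whole row is zero regardless of how often it occurs.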

Step 2: compute query-document cosine similarity with tf*idf weights.

Cosine Similarity:

- |D1| = sqrt(?) = 1.287
  |D2| = sqrt(?) = 1.427
  |D3| = sqrt(?) = 0.310
  |D4| = sqrt(?) = 2.062
  |D5| = sqrt(?) = 0.744
  |Q| = sqrt(?) = 6.245
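As a rough check on the vector lengths, the following sketch computes the Euclidean norm of each column of the tf*idf table. It takes the rounded weights exactly as printed in the table, so the last digit could in principle differ from a calculation with unrounded weights. The helper `norm` is a hypothetical name.

```python
import math

# tf*idf weights as printed in the table (rows t1..t8, columns D1..D5).
weights = [
    [0.887, 0.666, 0.0,   0.0,   0.444],
    [0.388, 0.0,   0.194, 0.291, 0.097],
    [0.0,   0.0,   0.0,   1.398, 0.0],
    [0.0,   0.0,   0.0,   0.0,   0.0],
    [0.796, 1.194, 0.0,   0.0,   0.0],
    [0.291, 0.097, 0.097, 0.0,   0.388],
    [0.0,   0.0,   0.222, 0.887, 0.444],
    [0.0,   0.398, 0.0,   1.194, 0.0],
]
query = [3, 0, 1, 5, 0, 0, 2, 0]  # Q1


def norm(vec):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vec))


# zip(*weights) transposes the table so each item is one document vector.
doc_norms = [norm(col) for col in zip(*weights)]
query_norm = norm(query)

for i, n in enumerate(doc_norms, start=1):
    print(f"|D{i}| = {n:.3f}")
print(f"|Q| = {query_norm:.3f}")
```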

-QD1 = ?

QD2 = ?

QD3 = ?
QD4 = ?

QD5 = ?

- SIM(Q,D1) = (?)/(?) = 0.331
  SIM(Q,D2) = (?)/(?) = 0.224
  SIM(Q,D3) = (?)/(?) = 0.229
  SIM(Q,D4) = (?)/(?) = 0.246
  SIM(Q,D5) = (?)/(?) = 0.478
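The dot products and similarities can be sketched together, taking SIM(Q,Di) as the dot product of Q and Di divided by the product of their lengths (the usual cosine formula). The code again uses the rounded weights as printed in the table, and the helpers `dot` and `cosine` are hypothetical names.

```python
import math

# tf*idf weights as printed in the table (rows t1..t8, columns D1..D5).
weights = [
    [0.887, 0.666, 0.0,   0.0,   0.444],
    [0.388, 0.0,   0.194, 0.291, 0.097],
    [0.0,   0.0,   0.0,   1.398, 0.0],
    [0.0,   0.0,   0.0,   0.0,   0.0],
    [0.796, 1.194, 0.0,   0.0,   0.0],
    [0.291, 0.097, 0.097, 0.0,   0.388],
    [0.0,   0.0,   0.222, 0.887, 0.444],
    [0.0,   0.398, 0.0,   1.194, 0.0],
]
query = [3, 0, 1, 5, 0, 0, 2, 0]  # Q1


def dot(u, v):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


docs = list(zip(*weights))            # one 8-component vector per document
dots = [dot(query, d) for d in docs]
sims = [cosine(query, d) for d in docs]

for i, (qd, s) in enumerate(zip(dots, sims), start=1):
    print(f"Q.D{i} = {qd:.3f}   SIM(Q,D{i}) = {s:.3f}")
```

Only the terms with a nonzero query weight (t1, t3, t4, t7) can contribute to a dot product, and t4 contributes nothing because its tf*idf weight is zero in every document.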

 / D1 / D2 / D3 / D4 / D5
SIM / 0.331 / 0.224 / 0.229 / 0.246 / 0.478
Rank / ? / ? / ? / ? / ?
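One way to fill in the Rank row is to sort the documents by similarity, highest first. The sketch below uses the SIM values exactly as given in the table; the names `sim` and `ranking` are hypothetical.

```python
# SIM values copied from the table above.
sim = {"D1": 0.331, "D2": 0.224, "D3": 0.229, "D4": 0.246, "D5": 0.478}

# Sort document labels by similarity, highest first.
ranking = sorted(sim, key=sim.get, reverse=True)

for rank, doc in enumerate(ranking, start=1):
    print(f"{rank}. {doc} (SIM = {sim[doc]:.3f})")
```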