Search Engine Homework #2
Consider the following inverted index and the query Q1 = (3, 0, 1, 5, 0, 0, 2, 0), whose components are the query weights for terms t1 through t8.
Terms / D1 / D2 / D3 / D4 / D5 / dk
t1 / 4 / 3 / 0 / 0 / 2 / ?
t2 / 4 / 0 / 2 / 3 / 1 / ?
t3 / 0 / 0 / 0 / 2 / 0 / ?
t4 / 1 / 4 / 5 / 1 / 2 / ?
t5 / 2 / 3 / 0 / 0 / 0 / ?
t6 / 3 / 1 / 1 / 0 / 4 / ?
t7 / 0 / 0 / 1 / 4 / 2 / ?
t8 / 0 / 1 / 0 / 3 / 0 / ?
* The number in each cell is the term frequency (tf).
- Rank the documents with respect to Q1 using tf*idf term weights (wki) and cosine similarity.
- tf = fki = number of times that term k occurs in document i
- Nd = number of documents in the collection
- dk = number of documents in which term k appears (postings)
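The definitions above can be turned into a short counting routine. The following is a minimal sketch, not part of the original assignment: the names `TF`, `collection_size`, and `document_frequencies` are hypothetical, and the matrix is copied from the term-frequency table above.

```python
# Term-frequency matrix copied from the table above:
# one row per term (t1..t8), one column per document (D1..D5).
TF = [
    [4, 3, 0, 0, 2],  # t1
    [4, 0, 2, 3, 1],  # t2
    [0, 0, 0, 2, 0],  # t3
    [1, 4, 5, 1, 2],  # t4
    [2, 3, 0, 0, 0],  # t5
    [3, 1, 1, 0, 4],  # t6
    [0, 0, 1, 4, 2],  # t7
    [0, 1, 0, 3, 0],  # t8
]


def collection_size(tf):
    """Nd: the number of documents, i.e. the number of columns."""
    return len(tf[0])


def document_frequencies(tf):
    """dk for each term k: how many documents have a nonzero tf for it."""
    return [sum(1 for f in row if f > 0) for row in tf]


n_docs = collection_size(TF)
dk = document_frequencies(TF)

print("Nd =", n_docs)
for k, count in enumerate(dk, start=1):
    print(f"d{k} = {count}")
```

Running the sketch prints Nd and one dk value per term, which can be checked against the blanks in Step 1.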
Step 1: compute tf*idf weights
- Nd = ?
- d1 = ?, d2 = ?, d3 = ?, d4 = ?, d5 = ?, d6 = ?, d7 = ?, d8 = ?
Terms / D1 / D2 / D3 / D4 / D5 / dk
t1
t2
t3
t4
t5
t6
t7
t8
* The number in each cell is the tf*idf weight.
Terms / D1 / D2 / D3 / D4 / D5 / Q
t1 / 0.887 / 0.666 / 0 / 0 / 0.444 / 3
t2 / 0.388 / 0 / 0.194 / 0.291 / 0.097 / 0
t3 / 0 / 0 / 0 / 1.398 / 0 / 1
t4 / 0 / 0 / 0 / 0 / 0 / 5
t5 / 0.796 / 1.194 / 0 / 0 / 0 / 0
t6 / 0.291 / 0.097 / 0.097 / 0 / 0.388 / 0
t7 / 0 / 0 / 0.222 / 0.887 / 0.444 / 2
t8 / 0 / 0.398 / 0 / 1.194 / 0 / 0
* The number in each document cell is the tf*idf weight; the Q column holds the query weights from Q1.
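The worksheet does not spell out the idf formula. The weights in the table above are consistent with wki = fki * log10(Nd / dk), so the sketch below assumes that form; it is an illustration under that assumption, not necessarily the definition used in class. The names `TF` and `tfidf_weights` are hypothetical.

```python
import math

# Term-frequency matrix copied from the first table:
# one row per term (t1..t8), one column per document (D1..D5).
TF = [
    [4, 3, 0, 0, 2],  # t1
    [4, 0, 2, 3, 1],  # t2
    [0, 0, 0, 2, 0],  # t3
    [1, 4, 5, 1, 2],  # t4
    [2, 3, 0, 0, 0],  # t5
    [3, 1, 1, 0, 4],  # t6
    [0, 0, 1, 4, 2],  # t7
    [0, 1, 0, 3, 0],  # t8
]


def tfidf_weights(tf):
    """Return wki = fki * idf_k for every term k and document i."""
    n_docs = len(tf[0])
    weights = []
    for row in tf:
        dk = sum(1 for f in row if f > 0)
        # Assumed idf form: log10(Nd / dk). A term that occurs in every
        # document gets idf = 0; a term in no document is given idf = 0 too.
        idf = math.log10(n_docs / dk) if dk else 0.0
        weights.append([f * idf for f in row])
    return weights


W = tfidf_weights(TF)
for k, row in enumerate(W, start=1):
    print(f"t{k} / " + " / ".join(f"{w:.3f}" for w in row))
```

Under this assumption, t4 occurs in all five documents, which is why its row is all zeros even though its term frequencies are high.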
Step 2: compute query-document cosine similarity with tf*idf weights.
Cosine Similarity: SIM(Q,Di) = (Q·Di) / (|Q| * |Di|)
- |D1| = sqrt(?) = 1.287
- |D2| = sqrt(?) = 1.427
- |D3| = sqrt(?) = 0.310
- |D4| = sqrt(?) = 2.062
- |D5| = sqrt(?) = 0.744
- |Q| = sqrt(?) = 6.245
- Q·D1 = ?
- Q·D2 = ?
- Q·D3 = ?
- Q·D4 = ?
- Q·D5 = ?
- SIM(Q,D1) = (?) / (?) = 0.331
- SIM(Q,D2) = (?) / (?) = 0.224
- SIM(Q,D3) = (?) / (?) = 0.229
- SIM(Q,D4) = (?) / (?) = 0.246
- SIM(Q,D5) = (?) / (?) = 0.478
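The three groups of values above (vector lengths, dot products, and similarities) can be computed together. This is a minimal sketch that takes the rounded tf*idf weights from the Step 1 table as input and uses the query weights from Q1 unchanged; the names `W`, `Q`, `norm`, and `cosine` are hypothetical. Because the inputs are rounded to three decimals, the last digit of a result may differ slightly from a hand calculation.

```python
import math

# tf*idf weights copied from the Step 1 table
# (rows t1..t8, columns D1..D5), rounded to three decimals.
W = [
    [0.887, 0.666, 0.0, 0.0, 0.444],    # t1
    [0.388, 0.0, 0.194, 0.291, 0.097],  # t2
    [0.0, 0.0, 0.0, 1.398, 0.0],        # t3
    [0.0, 0.0, 0.0, 0.0, 0.0],          # t4
    [0.796, 1.194, 0.0, 0.0, 0.0],      # t5
    [0.291, 0.097, 0.097, 0.0, 0.388],  # t6
    [0.0, 0.0, 0.222, 0.887, 0.444],    # t7
    [0.0, 0.398, 0.0, 1.194, 0.0],      # t8
]
Q = [3, 0, 1, 5, 0, 0, 2, 0]  # query weights from Q1, used as given


def norm(v):
    """Euclidean length of a vector: sqrt of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))


def cosine(q, d):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(q, d))
    denom = norm(q) * norm(d)
    return dot / denom if denom else 0.0


# Transpose the table so that each document is one vector over t1..t8.
docs = [list(col) for col in zip(*W)]
sims = [cosine(Q, d) for d in docs]

print(f"|Q| = {norm(Q):.3f}")
for i, (d, s) in enumerate(zip(docs, sims), start=1):
    print(f"|D{i}| = {norm(d):.3f}, SIM(Q,D{i}) = {s:.3f}")
```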
Document / D1 / D2 / D3 / D4 / D5
SIM / 0.331 / 0.224 / 0.229 / 0.246 / 0.478
Rank
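The Rank row follows from sorting the documents by similarity, highest first. A minimal sketch, using the SIM values from the row above; the names `sim`, `ordered`, and `rank` are hypothetical.

```python
# Similarity scores copied from the SIM row above.
sim = {"D1": 0.331, "D2": 0.224, "D3": 0.229, "D4": 0.246, "D5": 0.478}

# Sort document names by similarity, highest first; rank 1 is the best match.
ordered = sorted(sim, key=sim.get, reverse=True)
rank = {doc: position for position, doc in enumerate(ordered, start=1)}

for doc in sim:
    print(f"{doc}: SIM = {sim[doc]:.3f}, rank = {rank[doc]}")
```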