Document classification using Squared Euclidean Distance with pTrees:
Steps:
- Build a document-by-term (DxT) matrix. Assume that we have n documents containing m distinct terms, where each term appears with a different frequency in different documents. Each document is represented as a vector of its term frequencies. This vector is first normalized to the range 0 to 1 (in the example below, each frequency is divided by the document's total term count). The normalized values are then intervalized into several intervals, and each interval is represented by a binary code. These coded document vectors form the document-by-term matrix.
Example:
Assume that we have 10 documents and 3 terms having the term frequencies shown in the table:
t1 / t2 / t3
d1 / 3 / 5 / 2
d2 / 12 / 5 / 3
d3 / 5 / 2 / 3
d4 / 6 / 6 / 3
d5 / 8 / 7 / 5
d6 / 2 / 1 / 2
d7 / 4 / 3 / 3
d8 / 5 / 3 / 2
d9 / 4 / 2 / 4
d10 / 1 / 3 / 1
The normalized table will look like this:
t1 / t2 / t3
d1 / 0.3 / 0.5 / 0.2
d2 / 0.6 / 0.25 / 0.15
d3 / 0.5 / 0.2 / 0.3
d4 / 0.4 / 0.4 / 0.2
d5 / 0.4 / 0.35 / 0.25
d6 / 0.4 / 0.2 / 0.4
d7 / 0.4 / 0.3 / 0.3
d8 / 0.5 / 0.3 / 0.2
d9 / 0.4 / 0.2 / 0.4
d10 / 0.2 / 0.6 / 0.2
Consider 4 intervals [0,0.2], (0.2,0.3], (0.3,0.45] and (0.45,1.0], and represent them by the 2-bit binary values 00, 01, 10, and 11 (decimal 0, 1, 2 and 3).
So the DxT matrix will look like:
t1 / t2 / t3
d1 / 01 / 11 / 00
d2 / 11 / 01 / 00
d3 / 11 / 00 / 01
d4 / 10 / 10 / 00
d5 / 10 / 10 / 01
d6 / 10 / 00 / 10
d7 / 10 / 01 / 01
d8 / 11 / 01 / 00
d9 / 10 / 00 / 10
d10 / 00 / 11 / 00
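The encoding pipeline above (normalize, intervalize, code) can be sketched as follows. This is a minimal illustration, not the original implementation; the function name `encode` and the variable names are my own, and it assumes the normalization used in the worked example (dividing each frequency by the document's total term count).

```python
def encode(freqs):
    """Normalize a term-frequency vector and map each normalized value
    to its 2-bit interval code:
    [0,0.2] -> 0, (0.2,0.3] -> 1, (0.3,0.45] -> 2, (0.45,1.0] -> 3."""
    total = sum(freqs)
    codes = []
    for f in freqs:
        v = f / total          # normalize to [0, 1]
        if v <= 0.2:
            codes.append(0)
        elif v <= 0.3:
            codes.append(1)
        elif v <= 0.45:
            codes.append(2)
        else:
            codes.append(3)
    return codes

# The 10 example documents' raw term frequencies
docs = [(3, 5, 2), (12, 5, 3), (5, 2, 3), (6, 6, 3), (8, 7, 5),
        (2, 1, 2), (4, 3, 3), (5, 3, 2), (4, 2, 4), (1, 3, 1)]
DxT = [encode(d) for d in docs]
print(DxT[0])   # d1 -> [1, 3, 0], i.e. codes 01 / 11 / 00
```

Running this over all ten documents reproduces the DxT matrix in the table above.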
- Then form a pTree for each bit position of each term. If the binary code of each interval takes c bits, there will be m×c pTrees. The length of each pTree is n.
In this example there will be 3x2 = 6 pTrees, as shown below.
P1,1 / P1,0 / P2,1 / P2,0 / P3,1 / P3,0
0 / 1 / 1 / 1 / 0 / 0
1 / 1 / 0 / 1 / 0 / 0
1 / 1 / 0 / 0 / 0 / 1
1 / 0 / 1 / 0 / 0 / 0
1 / 0 / 1 / 0 / 0 / 1
1 / 0 / 0 / 0 / 1 / 0
1 / 0 / 0 / 1 / 0 / 1
1 / 1 / 0 / 1 / 0 / 0
1 / 0 / 0 / 0 / 1 / 0
0 / 0 / 1 / 1 / 0 / 0
Here Pi,j is the pTree of the jth bit of the ith term. In this example i = 1,2,3 and j = 1,0, so P2,1 is the pTree of the higher-order bits of term 2, and so on.
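Forming these bit-slice columns from the coded DxT matrix can be sketched like this (the dictionary `ptree` and the 0-based indexing are my own illustrative choices; a real pTree would also compress each column):

```python
# Coded DxT matrix from the example (d1..d10, codes as integers 0..3)
DxT = [[1, 3, 0], [3, 1, 0], [3, 0, 1], [2, 2, 0], [2, 2, 1],
       [2, 0, 2], [2, 1, 1], [3, 1, 0], [2, 0, 2], [0, 3, 0]]

# ptree[(i, j)] = bit j (1 = high, 0 = low) of term i, over all documents
ptree = {}
for i in range(3):            # term index (0-based here; P1.. in the text)
    for j in (1, 0):          # bit position
        ptree[(i, j)] = [(row[i] >> j) & 1 for row in DxT]

print(ptree[(1, 1)])   # P2,1, high bit of term 2 -> [1,0,0,1,1,0,0,0,0,1]
```

Each of the 6 columns produced this way matches a column of the pTree table above.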
- Suppose a new document dnew with certain term frequencies needs to be classified. Its term frequencies are first normalized, then intervalized, and then coded as described above.
For example:
Let dnew = (8, 5, 7). After normalization it becomes (0.4, 0.25, 0.35), and after intervalization and binary coding it becomes (10, 01, 10).
- Then we’ll use kNN to find the neighbors that vote for the classification.
First we’ll find the distance of dnew from each di (i = 1 to 10), measured with the Squared Euclidean Distance (SED). The documents having SED <= k² will vote for the classification.
For example, the SED between d1 and dnew (treating each 2-bit code as an integer):
S1 = (01 − 10)² + (11 − 01)² + (00 − 10)²
   = 0001 + 0100 + 0100
   = 001001 (decimal 1 + 4 + 4 = 9)
The following table shows all the SED from dnew
SED
d1 / 001001
d2 / 000101
d3 / 000011
d4 / 000101
d5 / 000010
d6 / 000001
d7 / 000001
d8 / 000101
d9 / 000001
d10 / 001100
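As a check, the distances can be computed horizontally on the interval codes with ordinary arithmetic (this is the baseline the pTree method below avoids; variable names are illustrative):

```python
# Coded DxT matrix from the example and the coded new document
DxT = [[1, 3, 0], [3, 1, 0], [3, 0, 1], [2, 2, 0], [2, 2, 1],
       [2, 0, 2], [2, 1, 1], [3, 1, 0], [2, 0, 2], [0, 3, 0]]
dnew = [2, 1, 2]   # codes 10, 01, 10

# Squared Euclidean Distance of dnew from each document
sed = [sum((a - b) ** 2 for a, b in zip(row, dnew)) for row in DxT]
print(sed)   # [9, 5, 3, 5, 2, 1, 1, 5, 1, 12]
```

Note that this horizontal scan touches every document individually; the point of the pTree formulation below is to get the same numbers with a fixed number of bitwise column operations.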
- Calculation of SED using pTrees:
The following pTree operations find the squared absolute difference between each term value and the corresponding dnew value, for all documents at once. Unlike horizontal processing, a fixed number of pTree operations computes these values regardless of n. In this example the squared values are represented by the pTrees R3, R2, R1, R0 (the squared difference of two 2-bit codes is at most 9, so 4 bits suffice).
If dnew_i = 00 then
R3 = Pi1 & Pi0
R2 = Pi1 & Pi0’
R1 = 0
R0 = Pi0
If dnew_i = 01 then
R3 = 0
R2 = Pi1 & Pi0
R1 = 0
R0 = Pi0’
If dnew_i = 10 then
R3 = 0
R2 = Pi1’ & Pi0’
R1 = 0
R0 = Pi0
If dnew_i = 11 then
R3 = Pi1’ & Pi0’
R2 = Pi1’ & Pi0
R1 = 0
R0 = Pi0’
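The four cases above can be sketched as a single function operating on whole bit columns at once, and checked against the direct squared differences. The function name and list-based "pTrees" are my own simplification (real pTrees are compressed):

```python
def squared_diff_slices(p1, p0, d):
    """Given the bit columns of one term (p1 = high bit, p0 = low bit)
    and the 2-bit code d of dnew for that term, return the bit slices
    R3, R2, R1, R0 of (v - d)^2, per the four cases in the text."""
    n = len(p1)
    NOT = lambda p: [1 - b for b in p]
    AND = lambda a, b: [x & y for x, y in zip(a, b)]
    zero = [0] * n
    if d == 0:
        return AND(p1, p0), AND(p1, NOT(p0)), zero, p0[:]
    if d == 1:
        return zero, AND(p1, p0), zero, NOT(p0)
    if d == 2:
        return zero, AND(NOT(p1), NOT(p0)), zero, p0[:]
    return AND(NOT(p1), NOT(p0)), AND(NOT(p1), p0), zero, NOT(p0)  # d == 3

# Term 1 of the example (values 1,3,3,2,2,2,2,3,2,0), dnew_1 = 10 (i.e. 2)
p1 = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # P1,1
p0 = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # P1,0
r3, r2, r1, r0 = squared_diff_slices(p1, p0, 2)
vals = [8*a + 4*b + 2*c + d for a, b, c, d in zip(r3, r2, r1, r0)]
print(vals)   # (v - 2)^2 per document: [1, 1, 1, 0, 0, 0, 0, 1, 0, 4]
```

Recombining the four slices as integers reproduces (v − d)² for every document, which is exactly what the case table asserts.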
Now these squared-value pTrees will be added, using the pTree addition algorithm, to get the SED.
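Bit-slice addition itself is an ordinary ripple-carry adder applied to whole columns at a time. A minimal sketch (my own illustrative code, assuming slices are stored most significant bit first):

```python
def slice_add(A, B):
    """A, B: lists of bit slices, most significant first; each slice is
    one bit per document. Returns the slice-wise sum (ripple carry)."""
    n = len(A[0])
    width = max(len(A), len(B)) + 1          # room for the final carry
    A = [[0] * n] * (width - len(A)) + A     # zero-pad high slices
    B = [[0] * n] * (width - len(B)) + B
    carry = [0] * n
    out = []
    for a, b in zip(reversed(A), reversed(B)):   # LSB to MSB
        s = [(x ^ y) ^ c for x, y, c in zip(a, b, carry)]
        carry = [(x & y) | (c & (x ^ y)) for x, y, c in zip(a, b, carry)]
        out.append(s)
    return list(reversed(out))

A = [[0, 1], [1, 0]]   # values 1 and 2 over two "documents", MSB first
B = [[1, 0], [1, 1]]   # values 3 and 1
print(slice_add(A, B))   # [[1,0],[0,1],[0,1]] -> values 4 and 3
```

Summing the three per-term R3..R0 slice sets this way yields the SED bit slices for all documents in one pass.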
- Finally, using the inequality pTree formula (Fei Pan’s formula), we’ll find the mask pTree of the documents with SED <= k²; those documents vote for the class of dnew.
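The effect of such an inequality mask can be sketched as a most-significant-bit-first comparison over the SED slices. This is a straightforward bitwise reformulation in the spirit of the inequality formula, not Pan's original derivation; names are illustrative:

```python
def le_mask(slices, T):
    """slices: bit slices of the SED values, most significant first.
    Returns the mask (one bit per document) of documents with value <= T.
    Assumes T fits in the same bit width as the slices."""
    n = len(slices[0])
    lt = [0] * n          # documents already known to be < T
    eq = [1] * n          # documents still equal to T on the bits seen
    width = len(slices)
    for k, s in enumerate(slices):                # MSB to LSB
        t = (T >> (width - 1 - k)) & 1
        if t == 1:
            # a 0 bit where T has 1 makes the value strictly smaller
            lt = [l | (e & (1 - b)) for l, e, b in zip(lt, eq, s)]
            eq = [e & b for e, b in zip(eq, s)]
        else:
            # a 1 bit where T has 0 makes the value strictly larger
            eq = [e & (1 - b) for e, b in zip(eq, s)]
    return [l | e for l, e in zip(lt, eq)]

# SED values from the example, as 4 bit slices, threshold k^2 = 4
vals = [9, 5, 3, 5, 2, 1, 1, 5, 1, 12]
slices = [[(v >> k) & 1 for v in vals] for k in (3, 2, 1, 0)]
print(le_mask(slices, 4))   # [0,0,1,0,1,1,1,0,1,0]
```

With threshold 4, the resulting mask selects d3, d5, d6, d7 and d9 as the voting neighbors.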