Document classification using Squared Euclidean Distance with pTrees:

Steps:

  1. Build a document-by-term (DxT) matrix. Assume we have n documents containing m terms, where the terms appear with different frequencies in different documents. Each document is treated as a vector of its term frequencies. This vector is normalized to the range 0 to 1 (in the example below, each frequency is divided by the document's total term frequency), the normalized values are then partitioned into intervals, and each interval is represented by a binary code. These coded document vectors form the document-by-term matrix. (A code sketch of this step follows the example tables below.)

Example:

Assume that we have 10 documents and 3 terms with term frequencies as shown in the table:

       t1   t2   t3
d1      3    5    2
d2     12    5    3
d3      5    2    3
d4      6    6    3
d5      8    7    5
d6      2    1    2
d7      4    3    3
d8      5    3    2
d9      4    2    4
d10     1    3    1

The normalized table will look like this:

       t1     t2     t3
d1     0.3    0.5    0.2
d2     0.6    0.25   0.15
d3     0.5    0.2    0.3
d4     0.4    0.4    0.2
d5     0.4    0.35   0.25
d6     0.4    0.2    0.4
d7     0.4    0.3    0.3
d8     0.5    0.3    0.2
d9     0.4    0.2    0.4
d10    0.2    0.6    0.2

Consider the 4 intervals [0, 0.2], (0.2, 0.3], (0.3, 0.45], and (0.45, 1.0], and represent them by the 2-bit binary values 00, 01, 10, and 11 (decimal 0, 1, 2, and 3).

So the DxT matrix will look like:

       t1   t2   t3
d1     01   11   00
d2     11   01   00
d3     11   00   01
d4     10   10   00
d5     10   10   01
d6     10   00   10
d7     10   01   01
d8     11   01   00
d9     10   00   10
d10    00   11   00
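
The encoding in step 1 can be sketched in Python as follows. This is only a minimal illustration of the normalization and interval coding used in the example above; the names (encode_document, dxt, and so on) are illustrative, not part of the original method.

```python
# Interval upper bounds and their 2-bit codes: [0,0.2]->00, (0.2,0.3]->01,
# (0.3,0.45]->10, (0.45,1.0]->11. Intervals are checked in order, so the
# first upper bound that is not exceeded gives the code.
INTERVALS = [(0.2, 0b00), (0.3, 0b01), (0.45, 0b10), (1.0, 0b11)]

def encode_document(freqs):
    """Normalize a term-frequency vector and return its list of 2-bit interval codes."""
    total = sum(freqs)
    codes = []
    for f in freqs:
        x = f / total                      # normalization: frequency / document total
        for hi, code in INTERVALS:
            if x <= hi:
                codes.append(code)
                break
    return codes

# Term frequencies from the first table above.
docs = {
    "d1": [3, 5, 2], "d2": [12, 5, 3], "d3": [5, 2, 3], "d4": [6, 6, 3], "d5": [8, 7, 5],
    "d6": [2, 1, 2], "d7": [4, 3, 3],  "d8": [5, 3, 2], "d9": [4, 2, 4], "d10": [1, 3, 1],
}

dxt = {name: encode_document(f) for name, f in docs.items()}
print(dxt["d1"])   # [1, 3, 0], i.e. (01, 11, 00), matching the DxT row for d1
```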
  2. Form a pTree for each bit of each term. If the interval code of each term takes c bits, then there will be m×c pTrees, each of length n. (A sketch of this step follows the pTree table below.)

In this example there will be 3×2 = 6 pTrees, as shown below.

       P1,1   P1,0   P2,1   P2,0   P3,1   P3,0
d1      0      1      1      1      0      0
d2      1      1      0      1      0      0
d3      1      1      0      0      0      1
d4      1      0      1      0      0      0
d5      1      0      1      0      0      1
d6      1      0      0      0      1      0
d7      1      0      0      1      0      1
d8      1      1      0      1      0      0
d9      1      0      0      0      1      0
d10     0      0      1      1      0      0

Here Pi,j is the pTree of the jth bit of the ith term. In this example i = 1, 2, 3 and j = 1, 0, so P2,1 is the pTree of the higher-order bits of term 2, and so on.
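
Continuing the sketch above, step 2 could build the bit-slice pTrees as follows. Each pTree is stored as a Python integer used as a bit vector (bit k corresponds to document k, with d1 as bit 0); this representation and the helper names are assumptions of the sketch, not part of the original text.

```python
def build_ptrees(dxt, num_terms, bits=2):
    """Return (ptree, names) where ptree[i][j] is the pTree of bit j of term i (0-based)."""
    names = list(dxt)                          # document order d1, d2, ..., d10
    ptree = [[0] * bits for _ in range(num_terms)]
    for k, name in enumerate(names):
        for i, code in enumerate(dxt[name]):
            for j in range(bits):
                if (code >> j) & 1:
                    ptree[i][j] |= 1 << k      # set document k's bit in P(i+1),j
    return ptree, names

ptree, names = build_ptrees(dxt, num_terms=3)
# P1,1 (higher-order bit of term 1) as a 0/1 list over d1..d10, matching the table above:
print([(ptree[0][1] >> k) & 1 for k in range(len(names))])
# [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```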

  3. Suppose a new document dnew with certain term frequencies needs to be classified. Its term frequencies are first normalized, then mapped to intervals, and then coded as described above.

For example:

Let dnew = (8, 5, 7); after normalization this becomes (0.4, 0.25, 0.35), and after interval coding it becomes (10, 01, 10).
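
With the same hypothetical encode_document helper from the sketch above, dnew is coded as:

```python
dnew = encode_document([8, 5, 7])
print(dnew)   # [2, 1, 2], i.e. (10, 01, 10)
```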

  4. Then we’ll use kNN to find the neighbors that vote for the classification.

First we’ll find the distance of dnew from each di (i = 1 to 10), measured as the Squared Euclidean Distance (SED). The documents having SED <= k² (equivalently, Euclidean distance <= k) will vote for the classification. (A horizontal version of this step is sketched after the table below.)

For example, the SED between d1 and dnew is

S1 = (01 - 10)² + (11 - 01)² + (00 - 10)²

   = 0001 + 0100 + 0100

   = 001001 (decimal 9)

The following table shows the SED of each document from dnew:

       SED
d1     001001
d2     000101
d3     000011
d4     000101
d5     000010
d6     000001
d7     000001
d8     000101
d9     000001
d10    001100
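
As a reference point, here is a horizontal (document-by-document) version of this step on the example data, continuing the sketches above. The choice k = 2 is made up for illustration, and since the example gives no class labels, only the neighbor set is shown.

```python
def sed(a, b):
    """Squared Euclidean distance between two interval-code vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

k = 2                                           # illustrative choice of k
neighbors = [name for name in names if sed(dxt[name], dnew) <= k * k]
print(neighbors)   # ['d3', 'd5', 'd6', 'd7', 'd9'] vote for the class of dnew
```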
  5. Calculation of SED using pTrees:

The following pTree operations find, for every document at once, the squared absolute difference between each term value and the corresponding dnew value. Unlike horizontal processing, a fixed number of pTree operations (independent of the number of documents) computes these values. In this example the squared values are represented by the R3, R2, R1, R0 pTrees.

dnew_i   R3              R2              R1   R0
  00     Pi,1 & Pi,0     Pi,1 & Pi,0'    0    Pi,0
  01     0               Pi,1 & Pi,0     0    Pi,0'
  10     0               Pi,1' & Pi,0'   0    Pi,0
  11     Pi,1' & Pi,0'   Pi,1' & Pi,0    0    Pi,0'

These squared-value pTrees are then added using the pTree addition algorithm to get the SED of every document at once (both pieces are sketched below).
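
A sketch of step 5 under the same integer-as-bit-vector representation used in the earlier sketches: r_ptrees applies the four cases in the table above (returning the slices least-significant first), and add_slices is a simple column-wise ripple-carry addition standing in for the pTree addition algorithm; both are illustrative names, not from the original text.

```python
n = len(names)
FULL = (1 << n) - 1                    # all-ones pTree over the n documents

def NOT(p):
    """Complement pTree, restricted to the n document positions."""
    return ~p & FULL

def r_ptrees(p1, p0, v):
    """Squared-difference pTrees [R0, R1, R2, R3] for one term, given dnew's 2-bit code v."""
    if v == 0b00:
        return [p0, 0, p1 & NOT(p0), p1 & p0]
    if v == 0b01:
        return [NOT(p0), 0, p1 & p0, 0]
    if v == 0b10:
        return [p0, 0, NOT(p1) & NOT(p0), 0]
    return [NOT(p0), 0, NOT(p1) & p0, NOT(p1) & NOT(p0)]      # v == 0b11

def add_slices(a, b):
    """Column-wise addition of two bit-slice-represented numbers (sketch of pTree addition)."""
    width = max(len(a), len(b)) + 1            # one extra slice for the final carry
    a = a + [0] * (width - len(a))
    b = b + [0] * (width - len(b))
    out, carry = [], 0
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)              # sum bit for every document at once
        carry = (x & y) | (carry & (x ^ y))    # carry bit for every document at once
    return out

# Accumulate the squared differences of the three terms into the SED slices.
sed_slices = [0]
for i in range(3):
    sed_slices = add_slices(sed_slices, r_ptrees(ptree[i][1], ptree[i][0], dnew[i]))

# Reading document d1's bits out of the slices reproduces the table above: SED = 9.
print(sum(((sed_slices[j] >> 0) & 1) << j for j in range(len(sed_slices))))
```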

  6. Now, using the inequality pTree formula (Fei Pan’s formula), we’ll find the mask pTree of the documents with SED <= k², and those documents vote for the class of dnew.
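
As a stand-in illustration only (the exact form of Fei Pan’s formula is not reproduced here), a generic most-significant-bit-first scan over the SED slices can produce the SED <= k² mask, reusing k = 2 and the names from the sketches above.

```python
def le_mask(slices, c):
    """Mask pTree of documents whose bit-slice value is <= c (assumes c < 2**len(slices))."""
    eq, lt = FULL, 0                           # eq: still tied with c; lt: already below c
    for j in reversed(range(len(slices))):
        if (c >> j) & 1:
            lt |= eq & NOT(slices[j])          # tied documents with a 0 bit here drop below c
            eq &= slices[j]
        else:
            eq &= NOT(slices[j])               # tied documents with a 1 bit here exceed c
    return lt | eq

votes = le_mask(sed_slices, k * k)             # k = 2 from the horizontal check above
print([names[i] for i in range(n) if (votes >> i) & 1])
# ['d3', 'd5', 'd6', 'd7', 'd9'] -- the same neighbor set as before
```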