Gradient-Variance Optimized Linear Functional Clustering
on
Vertically Structured Data
Dr. William Perrizo, Dr. Vasant A. Ubhaya, Arjun Roy
North Dakota State University
()
ABSTRACT:
This paper describes an approach for the data mining technique called clustering using vertically structured data and linear partitioning methodology. The partitioning methodology is based on the scalar product with a judiciously chosen unit vector whose direction has been optimized using a gradient ascent method. Choosing the best (or a good) unit vector for the gradient ascent is a challenge. We address that challenge by deriving a series of theorems to guide the process in which a starting vector is identified that leads to the maximum variance in the scalar product values. The use of such an initial vector gives the user of this technology high assurance that gaps in those scalar values will be prominent. A prominent gap in the scalar product values is a good cut point for a linear separation hyper-plane to be used in the classification of future unclassified samples. Placing that hyperplane at the middle of that gap guarantees a substantial margin between clusters.
For so-called “big data”, the speed at which a clustering method can be trained is a critical issue. Many very good algorithms are unusable in the big data environment due to the fact that they take an unacceptable amount of time. Therefore, algorithm speed is very important. To address the speed issue, in this paper, we use horizontal processing of vertically structured data rather than the ubiquitous vertical (scan) processing of horizontal (record) data, we use pTree, bit level, vertical data structuring. pTree technology represents and processes data differently from the ubiquitous horizontal data technologies. In pTree technology, the data is structured column-wise (into bit slices) and the columns are processed horizontally (typically across a few to a few hundred bit level columns), while in horizontal technologies, data is structured row-wise and those rows are processed vertically (often down millions, even billions of rows).
P-trees are lossless, compressed and data-mining ready data structures [9][10]. pTrees are lossless because the vertical bit-wise partitioning that is used in the pTree technology guarantees that all information is retained completely. There is no loss of information in converting horizontal data to this vertical format. pTrees are compressed because in this technology, segments of bit sequences which are either purely 1-bits or purely 0-bits, are represented by a single bit. This compression saves a considerable amount of space, but more importantly facilitates faster processing. pTrees are data-mining ready because the fast, horizontal data mining processes involved can be done without the need to decompress the structures first. pTree vertical data structures have been exploited in various domains and data mining algorithms, ranging from classification [1,2,3], clustering [4,7], association rule mining [9], as well as other data mining algorithms. Speed improvements are very important in data mining because many quite accurate algorithms require an unacceptable amount of processing time to complete, even with today’s powerful computing systems and efficient software platforms. In this paper, we evaluate and compare the speed of various data mining algorithms when using pTree technology.
Introduction
Supervised Machine Learning, Classification or Prediction is one of the important data mining technologies for mining information out of large data sets. The assumption is usually that there is a, usually very large, table of data in which the “class” of each instance is given (the training data set) and there is another data set in the class are not known (the test data set) but are to be predicted based on class information found in the training data set (therefore “supervised” prediction). [1, 2, 3].
Unsupervised Machine Learning or Clustering is also an important data mining technology for mining information out of new data sets. The assumption is usually that there is essentially nothing yet known about the data set (therefore it is “unsupervised”). The goal is often to partition the data set into subset of “similar” or “correlated” records [4, 7].
There may be various levels of supervision available and, of course, that additional information should be used to advantage during either the classification or clustering process. That is to say, often the problem is not a purely supervised nor is it purely an unsupervised problem. For instance, it may be known that there are exactly k clusters or similarity subsets, in which case, a method such as k-means clustering may be a very productive method. To mine a RGB image for, say red cars, white cars, grass, pavement and bare ground, k would be five. It would make sense to use that supervising knowledge by employing k-means clustering starting with a mean set consisting of RGB vectors as closely approximating the clusters as one can guess, e.g., red_car=(150,0,0), white_car=(85,85,85), grass=(0,150,0), etc. That is to say, it is productive to view the level of supervision available to us as a continuum and not just the two extremes. The ultimate in supervising knowledge is a very large training set of already classified samples, which has enough class information in it to very accurately assign predicted classes to all unclassified. We can think of a training set as a set of records that have been “classified” by an expert (human or machine) into similarity classes and assigned a class or label.
In this paper we assume there is no supervising knowledge except for an idea of what “similar instances” should mean. We will assume the data set is a table of non-negative integers with n columns and N rows and that two rows are similar if there are close in the Euclidean sense. More general assumptions could be made (e.g., that there are categorical data columns as well, or that the similarity is based on L1 distance or some correlation-based similarity) but we feel that would only obscure the main points we want to make by generalizing.
We structure the data vertically into columns of bits (possibly compressed into tree structures), called predicate Trees or pTrees. The simplest example of pTree structuring of a non-negative integer table is to slice it vertically into its bit-position slices. The main reason we do that is so that we can process across the (usually relatively few) vertical pTree structures rather than processing down the (usually very numerous) rows. Very often these days, data is called Big Data because there are many, many rows (billions or even trillions) while the number of columns, by comparison, is relatively small (tens or hundreds, thousands, multiple thousands). Therefore processing across (bit) columns rather than down the rows has a clear speed advantage, provided that the column processing can be done very efficiently. That is where the advantage of our approach lies, in devising very efficient (in terms of time taken) algorithms for horizontal processing of vertical (bit) structures. Our approach also benefits greatly from the fact that, in general, modern computing platforms can do logical processing of (even massive) bit arrays very quickly [9,10].
The broad question we address in this paper is “do we give up quality (accuracy) when horizontally processing vertically structured data, as compared to vertically processing horizontally structured data. That is, if we structure our data vertically and process across those vertical structures (horizontally), can those horizontal algorithms compete quality-wise with the time-honored methods that process horizontal (record) data vertically? The simplest and clearest setting to make that point, we believe is that of totally unsupervised machine learning or clustering.
Horizontal Clustering of Vertical Data Algorithms
We have developed a series of algorithms to cluster datasets by employing a distance-dominating functional (assigns a non-negative integer to each row). By distance dominating, we simply mean that the distance between any two output functional values is always dominated by the distance between the two input vectors. A class of functionals we have found productive are based on the dot or scalar product with a unit vector.
We let s◦t denote the dot or inner product siti of two vectors s, tRn, and ||s|| = √si2, the Euclidean norm. Note that ||s◦t|| ≤ ||s|| ||t||, and distance between s and t is ||s-t||. We assert that the dot product of a vector s with any unit vector d is “distance dominating, i.e., ||s◦d –t◦d|| ≤ ||s–t||. This follows since ||s◦d –t◦d|| = ||(s–t)◦d|| ≤ ||s–t|| ||d || = ||s–t||, as ||d|| = 1.
The goal in each of these algorithms is to produce a large gap (or several large gaps) in the functional values. A large gap in the functional values reveals a [at least as large] gap between the functional pre-images (due to the distance dominance of the functional. Thus, each functional range gap partitions the dataset into two clusters.
The algorithms we comparing in this paper are:
1. MVM (Mean-to-Vector_of_Medians),
2. GV (Gradient-based hill-climbing of the Variance),
3. GM (essentially, GV applied to MVM)
MVM: In the MVM algorithm, which is heuristic in nature, we simply take the vector d running from the mean of the data set to the vector of column-wise medians, and then unitize it by dividing by its length to get a unit vector d. The functional is then just the dot product with this unit vector d. In order to bit-slice the column of functional values, we need it to contain only non-negative integers, so we subtract the minimum from each functional value and round.
GV: In the GV algorithm, we start with a particular unit vector d (e.g., we can compute the variance matrix of the data set and takes a our initial d, the ek=(0, … 0,1,0, … 0) with a 1 only in the position k corresponding to the maximal diagonal element of the variance matrix). Next, we “hill-climb” the initial unit vector using the gradient of the variance matrix, until a [local] maximum is reached (a unit vector that [locally] maximizes the variance of the range set of the functional). Roughly “high variance” is likely to imply “larger gap(s)”.
GM: In the GM algorithm, we simply apply the GV algorithm but starting with what we believe to be a “very good” unit vector in terms of separating the space into clusters at a functional gap. One such “very good unit vector” is the unitization of the vector running from the mean to the vector of medians (the one used in MVM).
Formulas
In the MVM algorithm, the calculation of the mean or average function, which computes the average of one complete table column by horizontally (ANDs and Ors) the vertical bit slices horizontally, goes as shown here. The COUNT function is probably the simplest and most useful of all these Aggregate functions. It is not necessary to write special function for Count because a pTree RootCount function which efficiently counts the number of 1-bits in the full slice, provided the mechanism to implement it. Given a pTree Pi, RootCount(Pi) returns the number of 1-bits in Pi.
The Sum Aggregate function can total a column of numerical values.
The Average Aggregate, Mean or Expectation will show the average value of a column and is calculated from Count and Sum.
Average () = Sum ()/Count ().
The calculation of the Vector of Medians, or simply the median of any column, goes as follows. Median returns the median value in a column. Rank (K) returns the value that is the kth largest value in a field. Therefore, for a very fast and quite accurate determination of the median value use K = Roof(TableCardinality/2)
Finding the Optimal Scalar Product Unit Vector
The Gradient-based variance hill climbing calculations are given in this section. This includes the derivation of var(X◦d), the variance of the Dot Product Projection with unit vector d, which we denote by f(d).
We use the following conventions and notations: All our vectors are column vectors. {ai :1≤j≤r} denotes the sum of r numbers ai, s◦t, as stated earlier, denotes the inner product {siti:1≤j≤n} of two vectors s, t in Rn, AT is the transpose of a matrix A. Then, clearly, s◦t = sTt.
Let X = (X1 , X2 , … , Xn) be a matrix whose columns are vectors Xi in Rm. We let E(Xi ) denote the mean, average or expectation of X. Let also
vij = cov(Xi ,Xj) = E((Xi – E(Xi ) (Xj – E(Xj)) = E(Xi Xj ) – E(Xi)E(Xj),
so that vii = cov(Xi ,Xi) = var(Xj). We let V= (vij) to be the covariance matrix whose i, j th entry is vij. For any d in Rn, we define f(d) = var(X◦d). Our goal is to develop an algorithm to maximize f(d). Now
var(X◦d) = E(X◦d – E(X◦d))2
= E(X◦d – E(X)◦d)2
= E((X – E(X))◦d)2
= E((( Xi – E(Xi ) (Xj – E(Xj)) di dj)
= E((Xi – E(Xi )) (Xj – E(Xj)) di dj )
= di vij dj ,
where the summation is over all i, j = 1, 2, …,n. Hence,
f(d) = dT Vd.
Note that the gradient vectorf(d) of f(d) is given by f(d) = 2Vd, and its i th component is 2{vij dj :1≤j≤n}. We now state our algorithm:
Algorithm: Given an initial direction d(0)S and ε > 0, we successively compute d(1), d(2),…,d(k),… in S and function values f(d(1)), f(d(2))…,f(d(k))… until f(d(k+1)) – f(d(k)) < ε using the following formulae: