Vertical Set Square Distance Based Clustering without Prior Knowledge of K

Amal Perera, Taufik Abidin, Masum Serazi, William Perrizo

Computer Science Department
North Dakota State University
Fargo, ND 58105 USA

{amal.perera, taufik.abidin, md.serazi, william.perrizo}@ndsu.edu

Abstract

Clustering is the automated identification of groups of objects based on similarity. Two major research issues in clustering are scalability and the requirement of domain knowledge to determine input parameters. Most approaches suggest the use of sampling to address the issue of scalability; however, sampling does not guarantee the best solution and can cause a significant loss in accuracy. Most approaches also require domain knowledge, trial-and-error techniques, or exhaustive searching to determine the required input parameters. In this paper we introduce a new clustering technique based on the set square distance. Cluster membership is determined from the set square distance to the respective cluster. Just as the mean represents a cluster in k-means and a medoid represents a cluster in k-medoids, here the cluster is represented by the entire set of points in the cluster for each evaluation of membership. The set square distance for all n items can be computed efficiently in O(n) using a vertical data structure and a few pre-computed values. An ordering of the set square distance values is used to break the data into its "natural" clusters, removing the need for a known k as in k-means or k-medoids partition clustering. Superior results are observed when the new clustering technique is compared with classical k-means clustering. To demonstrate cluster quality and the resolution of the unknown k, data sets with known classes, such as the Iris data, the UCI KDD network intrusion data, and synthetic data, are used. The scalability of the proposed technique is demonstrated using a large RSI data set.

Keywords

Vertical Set Square Distance, P-trees, Clustering.

1.  INTRODUCTION

Clustering is an important human activity. Innate, trainable clustering models are refined continuously from early childhood, allowing us to distinguish between objects such as cats and dogs. Given a set of points in a multidimensional space, the goal of clustering is to compute a partition of these points into sets, called clusters, such that points in the same cluster are more similar to each other than to points in different clusters. Clustering identifies dense and sparse regions and therefore helps discover the overall distribution of the data as well as interesting patterns and correlations. Automated clustering is very valuable in analyzing large data sets and has found applications in data mining, search engine indexing, pattern recognition, image processing, trend analysis, and many other areas [1][2].

A large number of clustering algorithms exist. In the clustering literature these algorithms are grouped into four categories: partitioning methods, hierarchical methods, density-based (connectivity) methods, and grid-based methods [1][3]. In partitioning methods the n objects in the original data set are broken into k partitions iteratively in order to achieve a certain optimal criterion. The most classical and popular partitioning methods are k-means [4] and k-medoids [5]. The k clusters are represented by the center of gravity (mean) of the cluster in k-means or by a representative object of the cluster in k-medoids, and each object in the space is assigned to the closest cluster in each iteration, as sketched below. All partition-based methods suffer from the requirement that k (the number of partitions) be provided prior to clustering, are only able to identify spherical clusters, and may split large genuine clusters in order to optimize cluster quality [3].
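For reference, the following is a minimal Python sketch of the classical k-means iteration described above (not the method proposed in this paper); the function and variable names are illustrative only.

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        """Minimal k-means: each cluster is represented by its mean (center of gravity)."""
        rng = np.random.default_rng(seed)
        # Initialize the k means with randomly chosen data points.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign every object to its closest cluster center.
            dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of its assigned points.
            new_centers = np.array([points[labels == j].mean(axis=0)
                                    if (labels == j).any() else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

Note that k must be supplied up front and the result depends on the initialization, which are exactly the weaknesses discussed in this paper.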

A hierarchical clustering algorithm produces a representation of the nested grouping relationships among objects. If the hierarchy is formed bottom up, each data object starts as a cluster by itself; smaller clusters are then merged into larger ones at each level, based on similarity, until all data objects belong to a single cluster at the top of the hierarchy. The major difference among hierarchical algorithms is how the similarity between each pair of clusters is measured. Hierarchical clustering algorithms require a termination condition to be set using some prior domain knowledge, and they typically have high computational complexity [3][8]. Density-based clustering methods attempt to separate the dense and sparse regions of objects in the data space [1]. For each point of a cluster, the density of data points in its neighborhood has to exceed some threshold [10]. Density-based techniques can discover arbitrarily shaped clusters through a linking phase, but they also require prior parameters, set using domain knowledge, to arrive at the best possible clustering. A grid-based approach divides the data space into a finite set of multidimensional grid cells, performs clustering in each cell, and then groups neighboring dense cells into clusters [1]. The choice of cell size and other parameters affects the final quality of the clustering.

In general, two of the most demanding challenges in clustering are scalability and minimizing the domain knowledge required to determine input parameters [1]. In this work we describe a new clustering mechanism that is scalable and operates without an input parameter specifying the expected number of clusters in the data set. We describe an efficient vertical technique to compute an influence-based density using the set square distance of each data point with respect to all other data points in the space. Natural breaks in the density values are used to partition the data set into initial clusters. Subsequently, the cluster membership of each data point is confirmed or reassigned by efficiently recomputing the set square distance with respect to each cluster.

2.  Related Work

Many clustering algorithms work well on small data sets containing fewer than 200 data objects [1]. The NASA Earth Observing System, however, will deliver close to a terabyte of remote sensing data per day, and it is estimated that this coordinated series of satellites will generate petabytes of archived data in the next few years [12][13][14]. For real-world applications, the requirement is to cluster millions of records using scalable techniques [11]. A general strategy to scale up clustering algorithms is to draw a sample, or to apply some form of data compression, before applying the clustering algorithm to the resulting representative objects; this may lead to biased results [1][14]. CLARA [1] addresses the scalability issue by choosing a representative sample of the data set and then continuing with the classical k-medoid method, so its effectiveness depends on the size of the sample. CLARANS [6] is an example of a partition-based clustering technique that uses a randomized and bounded search strategy to pursue the optimal criterion; it does not fix the sample to a specific subset of the data for the entire clustering process, but an exhaustive traversal of the search space is not achieved in the final clustering. BIRCH [7] uses a tree structure that records sufficient statistics (a summary) for subsets of data that can be compressed and represented by that summary. Initial threshold parameters are required to obtain the best clustering and computational optimality in BIRCH.

Most clustering algorithms require the user to input certain parameters [1], and the clustering results are then sensitive to those parameters. For example, DENCLUE [10] requires the user to input the cell size used to compute the influence function. DBSCAN [15] needs the neighborhood radius and the minimum number of points required to mark a neighborhood as a core object with respect to density. To address the requirement for parameters, OPTICS [16] computes an augmented cluster ordering for automatic and interactive cluster analysis; it stores sufficient additional information to enable the user to extract any density-based clustering without re-scanning the data set. This freedom from parameters comes at a cost: OPTICS has a time complexity of O(n log n) even when used with a spatial index that allows it to walk through the search space easily. Less expensive partition-based techniques suffer from the requirement of specifying the number of expected partitions (k) prior to clustering [1][3][26]. X-means [26] attempts to find k by repeatedly testing different values of k against a model selection criterion based on the Bayesian Information Criterion (BIC). G-means [17] is another attempt to learn k, using a repeated top-down division of the data set until each individual cluster exhibits a Gaussian distribution at a user-specified significance level. ACE [14] maps the search space to a grid using a suitable weighting function, similar to the particle-mesh method used in physics, and then uses a few agents to search the mesh heuristically to identify the natural clusters in the data. The initial weighting costs only O(n), but the success of the technique depends on the agent-based heuristic search and the size of the grid cell. The authors suggest a linear weighting scheme based on neighboring grid cells and a variable grid cell size to avoid over-dependence on cell size for quality results; the linear weighting scheme adds more compute time to the process.

3.  Our Approach

Our approach attempts to address the problem of scalability in clustering with a partition-based algorithm that uses a vertical data structure (the P-tree [1]) to enable fast computation of counts. Three major inherent issues with partition-based algorithms are: the need to input k; the need to initialize the clusters in a way that leads to an optimal solution; and the choice of cluster representation (prototype) together with the computation of membership for each cluster. We address the first two problems by formally modeling the influence of each data point using a function first proposed for DENCLUE [10], combined with an efficient technique to compute the total influence rapidly over the entire search space. Significantly large differences in the total influence are used to identify the natural clusters in the data set, and data points with similar total influence are grouped together as initial clusters, giving a better initialization for the subsequent iterative search for an optimal clustering. Each cluster is represented by the entire cluster itself. The third issue is addressed with the vertical data structure: we show an efficient technique that computes the membership of each data item by comparing its total influence with respect to each cluster. The overall flow is sketched below.
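The following is a minimal Python sketch of this flow under our own simplifying assumptions: the densities here are computed with a plain horizontal scan of the data rather than the vertical P-tree technique of Section 3.1, the gap threshold gap_factor is an illustrative parameter not defined in the paper, and membership is decided by the size-normalized set square distance.

    import numpy as np

    def set_square_distance(a, pts_sum, pts_sq_sum, count):
        """Sum of squared distances from point a to a set of points, using only
        pre-computed aggregates of that set: its sum, sum of squares, and size."""
        return count * (a @ a) - 2.0 * (a @ pts_sum) + pts_sq_sum

    def cluster_without_k(points, gap_factor=3.0, max_iter=10):
        n, d = points.shape
        # Total influence (set square distance to the whole data set) per point,
        # evaluated in O(d) per point after a single pre-pass over the data.
        total_sum = points.sum(axis=0)
        total_sq = (points ** 2).sum()
        density = np.array([set_square_distance(p, total_sum, total_sq, n)
                            for p in points])

        # Order the density values and split at unusually large gaps
        # to obtain the initial "natural" clusters.
        order = np.argsort(density)
        gaps = np.diff(density[order])
        cut = gaps > gap_factor * gaps.mean()
        labels = np.empty(n, dtype=int)
        labels[order] = np.cumsum(np.concatenate(([False], cut)))

        # Iteratively confirm or reassign membership: each cluster is represented
        # by all of its points, and membership is decided by the (size-normalized)
        # set square distance to each cluster.
        for _ in range(max_iter):
            ids = np.unique(labels)
            sums = {c: points[labels == c].sum(axis=0) for c in ids}
            sqs = {c: (points[labels == c] ** 2).sum() for c in ids}
            cnts = {c: int((labels == c).sum()) for c in ids}
            new = np.array([min(ids, key=lambda c: set_square_distance(
                                p, sums[c], sqs[c], cnts[c]) / cnts[c])
                            for p in points])
            if np.array_equal(new, labels):
                break
            labels = new
        return labels

The decomposition inside set_square_distance() is the reason the approach scales: once the per-cluster sums, sums of squares, and counts are available, evaluating the distance of a point to an entire cluster does not require touching the cluster's individual members.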

3.1  Influence and Density

The influence function can be interpreted as a function that describes the impact of a data point within its neighborhood [10]. Examples of influence functions are the parabolic function, the square wave function, and the Gaussian function. The influence function can be applied to each data point, and an indication of the overall density of the data space can be calculated as the sum of the influence functions of all data points [10]. The density function that results from a Gaussian influence function, for a point a with respect to the data points x_i, is

$$ f_{Gauss}^{D}(a) = \sum_{i=1}^{n} e^{-\frac{d(a,\,x_i)^2}{2\sigma^2}}, $$

where d(a, x_i) is the distance between a and x_i and sigma is a smoothing parameter.

The Gaussian influence function is used in DENCLUE, and since computing the density for all n data points with it is O(n^2), DENCLUE uses a grid to compute the density locally [9][10]. The influence function should be radially symmetric about any point (in either variable), continuous, and differentiable. Other influence functions include the truncations of the Maclaurin series of the Gaussian, that is, the power-2m functions

$$ f_{2m}(a, x_i) = \sum_{j=0}^{m} \frac{(-1)^j}{j!} \left( \frac{d(a,\,x_i)^2}{2\sigma^2} \right)^{j}, $$

of which the parabolic influence function

$$ f_{Parabolic}(a, x_i) = 1 - \frac{d(a,\,x_i)^2}{2\sigma^2} $$

is the case m = 1. We note that the power-2m function is a truncation of the Gaussian Maclaurin series. Figure 1 shows the distribution of the density for the Gaussian (b) and the parabolic (c) influence functions.
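The link to the set square distance is direct: summing the parabolic influence over the whole data set differs from the (negated, scaled) sum of squared distances only by constants, so ordering points by parabolic density is equivalent to ordering them by their set square distance. Concretely, with n the number of data points (our own one-line derivation, using the parabolic form above):

$$ f_{Parabolic}^{D}(a) = \sum_{i=1}^{n} \left( 1 - \frac{d(a,\,x_i)^2}{2\sigma^2} \right) = n - \frac{1}{2\sigma^2} \sum_{i=1}^{n} d(a,\,x_i)^2 . $$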

Next we show how the density based on the parabolic influence function, hereafter denoted the Set Square Distance, can be efficiently computed using a vertical data structure. A vertical data representation consists of set structures representing the data column-by-column rather than row-by-row (as in relational data). Predicate trees (P-trees) are one choice of vertical data representation that can be used for data mining instead of the more common sets of relational records. P-trees [23] are a lossless, compressed, and data-mining-ready data structure. They have been successfully applied in data mining applications ranging from classification and clustering with k-nearest neighbor, to classification with decision tree induction, to association rule mining [18][20][22][24][25]. A basic P-tree represents one attribute bit, reorganized into a tree structure by recursive sub-division while recording a predicate truth value regarding purity for each division. Each level of the tree contains truth bits representing pure sub-trees, which can then be used for fast computation of counts. The construction continues recursively down each tree path until an entirely pure sub-division is reached (which may or may not be at the leaf level). The basic and complement P-trees are combined using Boolean algebra operations to produce P-trees for values, entire tuples, value intervals, or any other attribute pattern; the root count of any pattern tree gives the occurrence count of that pattern. The P-tree data structure thus provides an efficient means of counting patterns, as illustrated in the sketch below.
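As a simplified illustration (our own sketch, not the compressed tree implementation of [23]), the following Python code models each bit of an attribute as a flat vertical bit vector and obtains the "root count" of a value pattern by ANDing slices; a real P-tree produces the same counts while operating on compressed trees.

    import numpy as np

    def bit_slices(column, bits):
        """Decompose a column of unsigned integers into 'bits' vertical bit slices,
        with slice 0 holding the highest-order bit."""
        return [((column >> (bits - 1 - k)) & 1).astype(np.uint8) for k in range(bits)]

    def value_pattern_count(slices, value, bits):
        """Root count of a value pattern: the number of rows whose attribute equals
        'value'. Uses the slice itself for 1-bits and its complement for 0-bits,
        then ANDs the slices together."""
        pattern = np.ones_like(slices[0])
        for k in range(bits):
            bit = (value >> (bits - 1 - k)) & 1
            pattern &= slices[k] if bit else (1 - slices[k])
        return int(pattern.sum())

    # Example: a tiny column of 3-bit values.
    col = np.array([5, 3, 5, 7, 1, 5], dtype=np.uint8)
    slices = bit_slices(col, 3)
    print(value_pattern_count(slices, 5, 3))   # prints 3: the value 5 occurs 3 times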

Binary representation is fundamental to vertical data structures. Let x be a numeric value of attribute A1. Then the representation of x in b bits is written as

$$ x = \sum_{k=0}^{b-1} x_k \, 2^{k} = x_{b-1} x_{b-2} \cdots x_1 x_0 , $$

where x_{b-1} and x_0 are the highest and lowest order bits, respectively.
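Under this bit representation, the aggregates needed for the set square distance can be read off the root counts of bit-slice P-trees. For a single attribute whose bit slices are P_k,

$$ \sum_{x \in X} x = \sum_{k=0}^{b-1} 2^{k} \, rc(P_k), \qquad \sum_{x \in X} x^2 = \sum_{k=0}^{b-1} \sum_{l=0}^{b-1} 2^{k+l} \, rc(P_k \wedge P_l), $$

where rc(.) denotes the root count. The following sketch (our own, written against the flat bit-vector model above rather than compressed trees) shows the idea for one attribute.

    import numpy as np

    def vertical_sums(slices):
        """Sum and sum of squares of one attribute, from bit-slice counts only.
        slices[k] holds bit (b-1-k) of every row, as produced by bit_slices() above."""
        b = len(slices)
        weights = [1 << (b - 1 - k) for k in range(b)]            # 2^(bit position)
        rc = [int(s.sum()) for s in slices]                        # root counts rc(P_k)
        total = sum(w * c for w, c in zip(weights, rc))            # sum of x over the set
        total_sq = sum(weights[k] * weights[l] * int((slices[k] & slices[l]).sum())
                       for k in range(b) for l in range(b))        # sum of x^2 over the set
        return total, total_sq

    col = np.array([5, 3, 5, 7, 1, 5], dtype=np.uint8)
    s = [((col >> (3 - 1 - k)) & 1).astype(np.uint8) for k in range(3)]
    print(vertical_sums(s))   # (26, 134): 5+3+5+7+1+5 and 25+9+25+49+1+25

These two aggregates, together with the set size |X|, are exactly what the set_square_distance() decomposition in the earlier sketch of Section 3 requires, so once the root counts are available the density of a point can be evaluated without revisiting the individual members of the set.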

Figure 1. Distribution of density based on influence function


The Vertical Set Square Distance (VSSD) of a point a with respect to a data set X is defined as follows [24]: