NII International Internship Project:

Distributed Data Clustering

Supervisor: Michael HOULE, Visiting Professor

For data mining applications, the size of main memory typically limits the size of the data sets that can be analyzed. Avoiding the main-memory limitation necessitates a choice between the use of external memory (disk) and distributed processing (multiple cores). Either approach also requires a clustering method that is inherently decomposable. Relatively few parallelizable clustering methods are known, and most of them involve partitioning the data set, clustering each partition independently, and merging the resulting clusters across all partitions. For data mining applications, this divide-and-conquer approach tends to miss the very small aggregations (the nuggets of information) that may prove to be the most valuable to the user. For example, if the data set is partitioned into 10 subsets for clustering, any aggregation of 30 points would have (on average) 3 points in any given partition, typically too few to be recognized as a cluster within that partition.
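The arithmetic behind this example can be checked with a short calculation. The sketch below is illustrative only and assumes that each point is assigned to a partition independently and uniformly at random (the text does not specify the partitioning scheme); the function names are hypothetical. Under that assumption, the number of points from the aggregation that land in a given partition follows a binomial distribution.

```python
from math import comb


def expected_points_per_partition(aggregation_size: int, num_partitions: int) -> float:
    """Average number of points of the aggregation that land in one partition."""
    return aggregation_size / num_partitions


def prob_partition_has_at_least(aggregation_size: int,
                                num_partitions: int,
                                min_points: int) -> float:
    """Probability that one given partition receives at least `min_points`
    of the aggregation's points.

    Assumption (not from the text): points are assigned to partitions
    independently and uniformly at random, so the count is binomial.
    """
    p = 1.0 / num_partitions
    return sum(
        comb(aggregation_size, k) * p ** k * (1.0 - p) ** (aggregation_size - k)
        for k in range(min_points, aggregation_size + 1)
    )


if __name__ == "__main__":
    # The example from the text: 30 points spread over 10 partitions.
    print(expected_points_per_partition(30, 10))
    for threshold in (3, 5, 10):
        print(threshold, prob_partition_has_at_least(30, 10, threshold))
```

The probability falls quickly as the threshold grows, which is the effect described above: a partition rarely holds enough of a small aggregation for it to be recognized locally.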

The project will investigate the application of the relevant-set correlation (RSC) clustering model [1] to the clustering of data from distributed databases, in such a way that the smallest nuggets of information are still preserved. Developed at NII, RSC is a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the model relies solely on the existence of an oracle for queries-by-example that accepts a reference to a data item and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even by a search engine whose internal ranking function and relevancy scores are secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets.
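To make the oracle idea concrete, the following minimal sketch shows a query-by-example oracle and one possible correlation between two item sets. It rests on several assumptions that are not stated in the text above: the oracle here is a hypothetical brute-force Euclidean ranking (any similarity search structure could take its place), and the correlation is taken to be the Pearson correlation between the membership indicator vectors of the two sets. The significance testing mentioned above is omitted, and all names are illustrative.

```python
from math import sqrt
from typing import Callable, List, Sequence, Set, Tuple

# An oracle maps (item id, k) to a ranked list of the k most relevant item ids.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(points: Sequence[Tuple[float, ...]]) -> Oracle:
    """Hypothetical stand-in oracle: ranks items by squared Euclidean distance.

    The clustering code only ever sees item ids, never the coordinates.
    """
    def oracle(query_id: int, k: int) -> List[int]:
        q = points[query_id]
        ranked = sorted(
            range(len(points)),
            # Ties are broken by item id so the ranking is deterministic.
            key=lambda i: (sum((a - b) ** 2 for a, b in zip(points[i], q)), i),
        )
        return ranked[:k]
    return oracle


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation of the indicator vectors of sets a and b,
    both drawn from a universe of n items (assumed form of correlation)."""
    if not a or not b or len(a) == n or len(b) == n:
        raise ValueError("sets must be non-empty proper subsets of the universe")
    common = len(a & b)
    return (n * common - len(a) * len(b)) / sqrt(
        len(a) * (n - len(a)) * len(b) * (n - len(b))
    )


if __name__ == "__main__":
    # Two well-separated groups of three points each.
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    oracle = make_brute_force_oracle(points)
    rel0, rel1, rel3 = (set(oracle(q, 3)) for q in (0, 1, 3))
    print(set_correlation(rel0, rel1, len(points)))  # same group
    print(set_correlation(rel0, rel3, len(points)))  # different groups
```

In this toy setting, relevant sets of items from the same group coincide and are maximally correlated, while relevant sets from different groups are disjoint and negatively correlated.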

Based on the RSC model, a general-purpose scalable clustering heuristic, GreedyRSC, has already been developed and demonstrated for very large, high-dimensional datasets, using a fast approximate similarity search structure (the SASH [2]) as the oracle [1]. The features of GreedyRSC include:

·  The ability to scale to large data sets, both in terms of the number of items and the size of the attribute sets.

·  Genericity: the ability to handle different types of attributes (categorical, ordinal, spatial).

·  Automatic determination of an appropriate number of clusters, with the user specifying as input parameters only the minimum desired cluster size and the maximum allowable correlation (proportion of overlap) between pairs of clusters.

·  Robustness with respect to noisy data.

·  The ability to identify clusters of any size (as small as three items).
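As an illustration of how a minimum cluster size and a maximum allowable overlap can determine the number of clusters without the user fixing it in advance, consider the sketch below. It is not the GreedyRSC heuristic itself: the quality scores, the overlap measure (size of the intersection divided by the size of the smaller set), and all names are assumptions made for this example.

```python
from typing import List, Set, Tuple


def overlap_proportion(a: Set[int], b: Set[int]) -> float:
    """Assumed overlap measure: shared items relative to the smaller set."""
    return len(a & b) / min(len(a), len(b))


def select_clusters(candidates: List[Tuple[float, Set[int]]],
                    min_size: int,
                    max_overlap: float) -> List[Set[int]]:
    """Greedily keep candidates in order of decreasing quality.

    A candidate is kept if it has at least `min_size` members and overlaps
    every previously kept cluster by at most `max_overlap`.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    selected: List[Set[int]] = []
    for _quality, members in sorted(candidates, key=lambda c: c[0], reverse=True):
        if len(members) < min_size:
            continue  # too small to report
        if all(overlap_proportion(members, kept) <= max_overlap for kept in selected):
            selected.append(members)
    return selected


if __name__ == "__main__":
    # Hypothetical (quality, member ids) pairs.
    candidates = [
        (0.90, {1, 2, 3, 4}),
        (0.80, {3, 4, 5}),   # overlaps the first candidate too heavily
        (0.70, {6, 7, 8}),
        (0.95, {9, 10}),     # below the minimum size
    ]
    print(select_clusters(candidates, min_size=3, max_overlap=0.5))
```

The number of clusters returned depends only on the data and the two thresholds, which mirrors the parameterization described in the list above.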

As currently implemented, GreedyRSC is a batch method that makes use of a single CPU. It is capable of clustering large data sets through the use of external memory (disk), by breaking the data into several chunks, each of which can reside in main memory. However, dividing the data into c chunks increases the computational cost by a factor of c.

The specific goals of this project are:

·  To develop a true parallel clustering tool based on the GreedyRSC heuristic, suitable for use on a network of PCs.

·  To demonstrate the efficiency and effectiveness of the parallel GreedyRSC implementation through an experimental comparison with sequential GreedyRSC and other clustering methods. In particular, the ability of the method to discover arbitrarily small clusters is to be demonstrated.

·  To make the clustering tool freely available under the GNU General Public License.

The ideal duration of this project is 6 months, although visits as short as 5 months will still be considered. Although it is possible to reduce the length of the internship after being accepted, it may be difficult to extend the duration beyond that which is stated in the candidate’s application. Therefore, candidates are strongly advised to state in their application only the longest possible duration for their intended stay at NII.

References

[1] M. E. Houle, "The relevant-set correlation model for data clustering", in Proc. 8th SIAM International Conference on Data Mining (SDM 2008), pp. 775-786, Atlanta, GA, USA, 2008.

[2] M. E. Houle and J. Sakuma, "Fast approximate similarity search in extremely high-dimensional data sets", in Proc. 21st IEEE International Conference on Data Engineering (ICDE 2005), pp. 619-630, Tokyo, Japan, 2005.