A Database Clustering Methodology and Tool
Tae-Wan RyuDepartment of Computer Science
CaliforniaStateUniversity, Fullerton
Fullerton, California92834
/ Christoph F. Eick
Department of Computer Science
University of Houston
Houston, Texas77204-3010
Abstract
Clustering is a popular data analysis and data mining technique. However, applying traditional clustering algorithms directly to a database is not straightforward due to the fact that a database usually consists of structured and related data; moreover, there might be several object views of the database to be clustered, depending on a data analyst’s particular interest. Finally, in many cases, there is a data model discrepancy between the format used to store the database to be analyzed and the representation format that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current research.
This paper focuses on identifying those discrepancies and on analyzing their impact on the application of clustering techniques to databases. We are particularly interested in the question on how clustering algorithms can be generalized to become more directly applicable to real-world databases. The paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set representation framework for database clustering that characterizes objects to be clustered through sets of tuples, and introduce preprocessing techniques and tools to generate object views based on this framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are suitable for our proposed data set representation framework. We also demonstrate that our approach is capable of dealing with relationship information commonly found in databases through the bag-oriented clustering. We also argue that our bag-oriented data representation framework is more suitable for database clustering than the commonly used flat file format and produce better quality of clusters.
Keywords and Phrases: database clustering, preprocessing in KDD, data mining,
data model discrepancy, similarity measures for bags.
1Introduction
Current technologies for collecting data such as scanners and other data collection tools have generated a huge amount of data, and the volume of data is growing rapidly every year. Database systems provide tools and an environment that manage and access the large volume of data systematically and efficiently. However, extracting useful knowledge from databases is very difficult without additional computer assistance and more powerful analytical tools. In general, there is a significant gap between data generation and data understanding. Consequently, automatic powerful analytical tools for discovering useful and interesting patterns from databases are desirable. Knowledge discovery in data (KDD) is such a generic approach to analyze and extract useful knowledge from databases using fully automated techniques. Recently, many techniques and tools [HK01] have been proposed for this purpose. Popular KDD-tasks include classification, data summarization, dependency modeling, deviation detection, etc., are the popular techniques used in KDD.
The focus of this paper is database clustering. The goal of database clustering is to take a database that stores information concerning a particular type of objects (e.g., customers or purchases) and identify subgroups of those objects, such that objects belonging to the same subgroup are very similar to each other, and such that objects belonging to different subgroups are quite different from each other.
Figure 1.Example of Database Clustering
Suppose that a restaurant owner has a database that contains customer information and he wants to obtain a better understanding of his main customer groups for marketing purposes. In order to accomplish this goal, as depicted in Figure 1, the restaurant database will be first preprocessed for clustering and a clustering algorithm is applied to the preprocessed data set; for example the algorithm might reveal that there are three clusters in the customer database. Finally, characteristic knowledge that summarizes each cluster can be generated, telling the restaurant owner that his major customer groups are young people that come at midnight, white collar people that come for dinner, and retirees that come for lunch. This knowledge definitely will be useful for marketing purposes, and for designing his menu.
The paper is organized as follows. Section 2 introduces the different steps that have to be taken when clustering a database, and explains how database clustering is different from traditional flat file data clustering. Based on the discussion of Section 2, Section 3 introduces a “new” data set representation framework for database clustering that characterizes objects through sets of tuples. Moreover, preprocessing techniques for generating object views based on this framework are introduced. In Section 4 similarity measures for our bag-oriented knowledge representation framework are introduced. Section 5 introduces the architecture and the components of a database clustering environment we developed. Moreover, the problems of generalizing traditional clustering algorithms for database clustering will be addressed in this section. Section 6 reviews the related literature and Section 7 summarizes the findings and contributions of the paper.
2Database Clustering
2.1Steps of Database Clustering
Due to the fact that database clustering has not been discussed very much in the literature, we think it is useful to discuss the different steps of database clustering first. In general, we consider database clustering to be an activity that is conducted by passing through the following seven steps:
(1)Define Object-View
(2)Select Relevant Attributes
(3)Generate Suitable Input Format for the Clustering Tool
(4)Define Similarity Measure
(5)Select Parameter Settings for the Chosen Clustering Algorithm
(6)Run Clustering Algorithm
(7)Characterize the Computed Clusters
The first three steps of the suggested database clustering methodology center on preprocessing the database and on generating a data set that can be processed by the employed clustering algorithm(s). In these steps, a decision has to be made what objects in the database (usually databases contain multiple types of objects) and which of their properties will be used for the purpose of clustering; moreover, the relevant information has to be converted to a format that can be processed by the selected clustering tool(s). In the fourth step similarity measures for the objects to be clustered have to be defined. Finally, in steps 5-7 the clustering algorithm has to be run, and summaries of the obtained clusters are generated.
2.2Differences between Database Clustering and Ordinary Clustering
Data collections are stored in many different formats such as flat files, relational or object-oriented databases. The flat file format is the simplest and most frequently used format in the traditional data analysis area. When using flat file format, data objects (e.g., records, cases, examples) are represented through vectors in then-dimensional space, each of which describes an object, and the object is characterized by n attributes, each of which has a single value. Almost all existing data analysis and data mining tools, such as clustering tools, inductive learning tools, and statistical analysis tools, assume that data sets to be analyzed are represented in a flat file format. The well-known inductive learning environment C4.5 [Quin93] and similar decision tree based rule induction algorithm [Domi96], conceptual clustering algorithms such as COBWEB [Fish87], AutoClass [Chee96], ITERATE [Bisw95], statistical packages, etc. make this assumption.
Due to the fact that databases are more complex than flat files, database clustering faces additional problems that do not exist when clustering flat files; these problems include:
- Databases contain objects that belong to different types; consequently, it has to be defined what objects in the database need to be clustered.
- Databases contain 1:1, 1:n and n:m relationships between objects of the same and different types.
- The definition of object similarity is more complex due to the presence of bags of values (or related information) that characterize an object.
- Attributes of objects have different types which makes the selection of an appropriate similarity measure more difficult.
The first two problems will be analyzed in more detail in the next two subsections; the third and fourth problem will be addressed in Section 4.
2.3Support for Object Views for Database Clustering
Because databases usually contain objects belonging to different classes, there can be several ways of viewing a database depending on what classes of objects need to be clustered. To illustrate the problems of database clustering, let us use the following simple relational database that consists of a Customer and a Purchase table; a particular state of this database is shown in Figure 2 (a). The underlined attributes in each relation represent the primary key in the relation.
It is not possible to directly apply a clustering algorithm to a relational database, such as the one that is depicted in Figure 2 (a). Before a clustering algorithm can be applied to a database it has to be determined what classes of objects should be clustered: should customers or purchases be clustered? After it has been decided which objects have to be clustered, in the next step relevant attributes have to be associated with the particular objects to be clustered. The availability of preprocessing tools that facilitate the generation of such object-views is highly desirable for database clustering, because generating such object-views manually can be quite time consuming.
2.4Problems with Relationships
In general, a relational database usually consists of several related relations (or of related classes when using the object-oriented model), which frequently describe many to one, and many to many relationships between objects. For example, let us assume that we are interested in clustering the customers belonging to the relational database that was depicted in Figure 2 (a). It is obvious that the attributes found in the Customer relation alone are not sufficient to accomplish this goal, because many important characteristics of persons are found in other “related” relations, such as the Purchase relation. Prior to clustering customers, the relevant information has to be extracted from the relational database and associated with each customer object. We call a data structure that stores the results of this process an object view. An example of such an object view is depicted in Figure 2 (c). The depicted data set was generated by grouping related tuples into a unique object (based on cid). The attributes p.pid, p.plocation, p.ptype, and p.amount are called related attributes with respect to the Customer relation because they had to be imported from a foreign relation, the Purchase relation in this particular case.
(a)A data collection consisting of two relations, Customer and Purchase. The underlined attributes are the keys in each relation. cid (customer id) is a foreign key in the relation Purchase. oid is an order id, pgid is a product group id, ptype is a payment type (e.g., 1 for cash, 2 for credit card, and 3 for check). The cardinality ratio between two relations is 1:n.
Customer Purchase
(b) A single-valued data set created by performing an outer join on cid. The related attributes (from Purchase relation) have prefix p. For example, p.pgid is a pgid in Purchase relation.
(c) Amulti-valued data set created by grouping related tuples into an object. For example, the three tuples that charactizeJohnny are grouped into one “Johny”object using separate bags for his payment type, product group, and amount spent on each product group.
(d) Asingle-valued data set created by averaging of multi-valued attributes in (c). For the symbolic multi-valued attributes such as p.pgid, p.location, and p.ptye, we picked the first value in the set (arbitrarily) since we cannot calculate the averages.
Figure 2. Various representations of a data set consisting of two related relations
In general, as the example shows, object views frequently contain bags of values if the relationship cardinality between the two relations is 1:n. Note that in a relational database n:m relationships are typically designed to two 1:n relationships. Unlike a set, a bag allows for duplicate elements, but the elements must take values in a same domain. For example, the bag {400, 70, 200} for the amount attribute might represent three purchases, 400, 70, and 200 dollars by the customer “Johny”. Ryu and Eick [Ryu98c] call such a data set in Figure 1 (c), a multi-valued data set and use term single-valued data set for the traditional flat files in (a) or (b). They use curly brackets to represent a bag of values with the cardinality of the bag greater than one (e.g., {1,2,3}), null to denote an empty bag, and give its element, if the bag has one element. Most traditional similarity measures for single-valued attributes cannot deal with multi-valued attributes such as p.pid, p.plocation, p.ptype, and p.amount. Measuring similarity between bags of values requires group similarity measures. For example, how do we compute similarity between a pair of objects, “Andy” and “Post” for a multi-valued attribute p.amount, {390,100}:30, or between “Andy” and “Johny”, {390,100}:{400,70,200}? One simple idea may be to replace the bag of values for multi-valued attributes by a single value by applying certain aggregate function (e.g., average, sum or count), as depicted in Figure 2 (d). Another alternative would be to use an outer join with respect to cid attribute to obtain a single-valued data set, as depicted in Figure 2 (b).
The problem with the first approach is that by applying the aggregate function frequently valuable information may be lost. For example, if the average purchase amount is used to replace the bag of individual purchase amounts, this approach does not consider other potentially relevant information, such as total amount, and the number of purchases, in computing similarity. Another problem is that aggregate functions are only applicable to numerical attributes. Using aggregate functions for symbolic attributes, such as the attribute locationor ptypein the example database, does not make sense at all. In summary, the approach of replacing bag of values by a single value faces a lot of technical difficulties.
If we look at the single-valued data set in Figure 2 (b) which has been generated by using an outer join approach we observe a different problem.A clustering algorithm would treat each tuple in the obtained single-valued data set as a separate object (e.g. Johny’s 3 purchases would be considered to be different objects, and not as data that are related to the customer “Johny”), which means no longer the 4 customers would be clustered, but rather the 7 purchases; obviously, if our goal is to cluster customers, clustering purchases instead seems to be quite confusing.
3A Data Set Representation Framework for Database Clustering
In the following a data set representation framework for database clustering is proposed; similarity measures that are suitable in the context of the proposed framework will then be introduced in Section 4. In general, the framework consists of the following mechanisms:
- An object identification mechanism that defines what classes of objects will be clustered and how those objects will be uniquely identified.
- Mechanisms to define modular units based on object similarity have to be provided; each modular unit represents a particular perspective of the objects to be clustered; similarity of different modular units is measured independently. In the context of the relational data model modular units are defined as procedures that associate a bag of tuples with a given object. Using this framework, objects to be clustered are characterized by a set of bags of tuples, one bag for each modular unit.
- The similarity between two objects is measured as a weighted sum of the similarity of all its modular units. To be able to do that a weight and a (bag) similarity measurehave to be provided for each modular unit.
Figure 3. An example of the bag-oriented clustering framework
To illustrate this framework, let us assume that we are still interested in clustering customers. In this case the attribute cid of the relation Customer that uniquely identifies customers serves as our object identification mechanism. After the object identification mechanism has been selected, relevant attributes to define similarity between customers have to be selected. In the particular case, we assume that we consider the customer’s age/gender information, the amount of money they spend on various product groups, and the customer’s daily spending pattern to be relevant for defining customer similarity. In the next step, modular units to measure customer similarity have to be defined. In this particular example, we identify three modular units each of which characterizes customers through a set of tuples. For example, the customer with cid 1 is characterized as a 43 years old male, who spent 400, 70, and 200 dollars on product groups p1, p2, and p3, and who purchased all his goods in a single day of the reporting period, spending total 670 dollars. There are different approaches to define modular units. When the relational data model is used, modular units can be defined using SQL queries that associate customers (using cid) with a set tuples that are specific for the modular unit. In the example, depicted in Figure 3, the following three SQL queries associate customers with the characteristic knowledge with respect to each modular unit:
Modular Unit 1 := SELECT cid, age, gender
FROM Customer;
Modular Unit 2 := SELECT Customer.cid, pgid, amount
FROM Customer, Purchase
WHERE Customer.cid=Purchase.cid;
Moduler Unit 3 := SELECT Customer.cid, sum(amount), date
FROM Customer, Purchase
WHERE Customer.cid=Purchase.cid
GROUPED BY Customer.cid, date;
As we have seen throughout the discussions of the last two sections, many different object views can be constructed from a given database. There are “simple” object views based on flat file format, such as those in Figure 2 (b) and Figure 2 (d); in this section, a more complicated scheme for defining object views has been introduced that characterizes object through sets of bags of tuples. We claim that this data set representation framework is more suitable for database clustering, and will present arguments to support our claim in Section 5.