Paper-2CS590LWS2002
Topic: Distributed Classification Based on Association rules (DCBA) algorithm
By Shuanghui Luo
This paper is organized as follows
Introduction
- Trend and significance of data mining
- Basic concept about association rule mining and classification rule mining
-Association rule mining:
- Basic concept, transaction model
- Apriori algorithm,
- FP-growth
-Classification rule mining
- Associative classification
- Recent researches on the integration of Association rule mining and
Classification rule mining
-Classification Based on Associations (CBA) algorithm
-Classification Based on Multiple Association Rules (CMAR) algorithm
- Problem and high-level solution
Related work
- CBA algorithm
- CMAR algorithm
- Distributed boosting algorithm
- Parallel computing
- Our approach:
-Distributed CBA (DCBA) algorithm and Distributed CMAR (DCMAR) algorithm
-Issues: accuracy, efficiency, and scalability
-Feasibility
Introduction
Trend and significance of data mining
Data mining refers to search for valuable information hidden in large quantities of data using statistical and machine learning techniques. It is a multidisciplinary field with a major impact in scientific and commercial environment. The major reason that data mining has attracted a great deal of attention is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Data mining has been used in market data analysis, catalog design, web log sequence, DNA analysis, high throughput drug design, and etc. J. Han and M. Kamber have discussed the concepts and various techniques in their recent book Data Mining: Concepts and Techniques[1]. Association rule mining and classification rule mining are two important data mining techniques. This paper will focus on the integration of association rule mining and classification rule mining.
Basic concept about association rule mining and classification rule mining
Current researches on data mining are based on simple transaction data models. Given an item set {itemi} and a transaction set {transi}, an association rule is defined as an implication of the form, XY, where X and Y are non-overlap subsets of {itemi}. In classification data set, an item can be viewed as {attribute, value} pair. Two important related quantities are confidence c, which is the percentage of transactions including X and Y to transactions including X, and support s, which is the percentage of transactions including X and Y to all transactions. Classification association rule (CAR) is then Xci where ci is a class label. A training data set is such a set of data items that for each item, there exists a class label associated with it. A classifier is a function that maps attributes to class labels. In general, given a training data set, classification is to build a class model from the training data set such that it can be used to predict the class labels of unknown items with high accuracy.
Association rule mining finds interesting association or correlation relationships among a large set of data items. It first discovers frequent itemsets satisfying user-defined minimum support, and then from which generates strong association rules satisfying user-defined minimum confidence. The most famous algorithm for association rule mining is Apriori algorithm.[2] Most of the previous studies on association rule mining adopt the Apriori-like candidate set generation-and-test approach.Apriori algorithm uses frequent (k – 1)-itemsets to generate candidate frequent k-itemsets and use database scan and pattern matching to collect counts for the candidateitemsets
. Recently, J. Han et al critiqued that the bottleneck of Apriori algorithm is the cost of the candidate generation and multiple scans of database. Han’s group developed another influential method for discovering frequent pattern without candidate generation, which is called frequent pattern growth (FP-growth). It adopts divide-and-conquer strategy and constructs a highly compact data structure (FP-tree) to compress the original transaction database. It focuses on the frequent pattern (fragment) growth and eliminate repeated database scan. The performance study by Han’s group shows that FP-growth is more efficient than Apriori algorithm.[3]
Classification rule mining is to build a class model or classifier by analyzing predetermining training data and apply the model to predict the future cases. Besides other techniques for data classification such as decision tree induction, Bayesian classification, neural network, classification based on data warehousing technology, and etc, the associative classification or classification based on association rules is an integrated technique that applies the methods of association rule mining to the classification. It typically consists of two steps. The first step finds the subset of association rules that are both frequent and accurate using association rule techniques. The second step employs the rules for classification.
We can view association rule mining and classification rule mining as the complementary approaches. Association rule mining is to find all rules in the data that satisfy some user-specified constraints such as minimum support and minimum confidence. For association rule mining, the target of discovery is not pre-determined. Classification rule mining is to find a small set of rules in the database to build an accurate class model (classifier) and apply the model to classify new data. There is one and only one pre-determined target: class for classification rule mining. Association is an unbiased way to obtain all the correlations between data items with no external knowledge needed, while classification is biased ways to pay attentions only to the small set of rules with the help of external knowledge. Naturally, people turn to combine the virtues of classification rule mining and associate rule mining.
Recent researches on the integration of association rule mining and classification rule mining
Recently, Bing Liu et al proposed Classification Based on Association rules (CBA) algorithm as an integration of classification rule mining and association rule mining.[4]
The integration was done by finding a special subset of association rules called class association rules (CARs) and building a classifier from the CARs. The main strength of CBA algorithm is its ability to use the most accurate rules for classification, which explains its better performance compared with some original classification algorithms such as C4.5. Liu’s research group also proposed some methods to deal with the problems of the original CBA algorithm such as single minimum support and not being able to generate long rules for many datasets. The performance of the algorithm was improved by using multiple minimum support (Smin) instead of a single Smin, and combining CBA algorithm with other techniques such as decision tree method.[5,6] More recently, Wenmin Li et al critiqued some weakness of Liu’s approach as follows: (1) simply selection a rule with a maximal user-defined measure may affect the classification accuracy, (2) the efficiency problem of storing, retrieve, pruning, and sorting a large number of rules for classification when there exist a huge number of rules, large training data sets, and long pattern rules. They proposed a new associative classification algorithm: Classification based on Multiple Association Rules (CMAR). The experimental results shows that CMAR provides better efficiency and accuracy compared with CBA algorithm. The accuracy of CMAR is achieved by using multiple association rules for classification. The efficiency of CMAR is achieved by extension of efficient frequent pattern method, FP-growth, construction of a class distribution-associated FP-tree, and applying a CR-tree structure to store and retrieve mined association rules.[7] (Both CBA algorithm and CMAR algorithm will be discussed in detail later in the section of related work.)
Problem and high-level solution
The main issues on the integration of association and classification are: (1) efficiency and accuracy: how to efficiently find out the high quality rules using association rule mining and how to generate more accurate classifier, (2) scalability: it is important when there exist large training data sets, huge number of rules and long pattern rules. The efficiency and accuracy typically affect each other. We need to balance these two issues.
CBA algorithm adopts Apriori-like algorithm to find association rules. The challenging problem of combinational explosion of rules still exists when there are huge amounts of data items and huge number of rules. Although CMAR algorithm has improved the performance, the scalability problem still exists. Moreover, in today’s Internet environment, the databases may be scattered over different locations.
This research is to combine the existing data mining algorithms and distributed techniques to develop a distributed CBA algorithm and distributed CMAR algorithm and apply them to mine very large and distributed databases. The need for parallelization of the Apriori algorithm or FP-growth method comes from the transaction database being too large, or the possible number of frequent itemsets being too large (because of the high dimensionality of the database), or both. Correspondingly, in order to achieve concurrency either the candidates need to be counted in parallel or they need to be generated in parallel, or both these phases need to be done in parallel. Not only is this distributed nature of data but also there are some other considerations, which we have to look into for taking the parallel computing approaches. Many practical applications of association rules involve huge transaction databases, which contain a large number of distinct items. In such situations the serial algorithms like Apriori running on single processor machines may take unacceptably large time. The primary reasons are memory, CPU speed and IO bandwidth limitations faced by a single processor machines. As an example, in the Apriori algorithm if the number of candidate itemsets become too large, then they might not all fit in the main memory and multiple database passes would be required within each iteration, incurring expensive IO cost. This implies that, even with the highly efficient pruning method of Apriori, the task of finding all association rules can require a lot of computational and memory resources, especially when the data is enormous and high dimensional (large number of distinct items). This motivates the development of parallel computing.
Use Parallel computing to improve the performance
For efficient parallel data mining, it requires to minimize the interprocess communication. For example, CBA algorithm takes two steps to construct classifiers. The first step is to find the CARs. CARs are built on top of the global data space. Direct computing of support and confidence of each candidate rule requires formidable interprocess communicates. Therefore, construction of CARs based on the CARs sets selected by each process is very desirable. The second is the building of classifier based on the CARs set. This is basically a sorting algorithm that can be relatively easily parallelized.
Here is the high-level proposed distributed CBA algorithm
(1)Construct CARs using the same requirements of minimum support and confidence for each process. CARs satisfy the global requirements should fall into the sum set of all individual CARs
(2)Prune the unqualified CARs on each individual set. This is the most difficult part of distributed CBA algorithm.
Simple method: each process chooses the CAR of the lowest support/confidence and broadcast to all process; each process returns the support/confidence; delete the CAR with unqualified overall support/confidence.
(3)Parallel sorting to build the classifier.
For distributed CMAR algorithm, the basic idea is similar. We can apply the divide-and-conquer strategy to the two phases of CMAR especially. For example, scan each database or segment of training data and build a FP-tree for each database or segment, and then combine the FP-tree and broadcast to all the segments.
Related works
CBA algorithm
As mentioned earlier, Bing Liu’s group proposed CBA algorithm as the integration of association rule mining and classification rule mining.[4]
The Problem for CBA algorithm can be stated as follows:
(1)Assume a relational table D with n attributes. An attribute can be discrete or continuous. (CBA also works with transactional data)
(2)There is a discrete class attribute (for classification).
(3)For a continuous attribute, it is discretized into intervals.
(4)Item: (attribute, value)
(5)Let I be the set of all items in D, and Y be the set of class labels.
(6)A class association rule (CAR) is an implication of the form: X y, where X I, and y Y.
(7)A rule X y holds in D with confidence c and support s (similar to normal association rule mining).
CBA algorithm was carried out in two main stages:
(1)Find the complete set of CARs based on Aprior-like algorithm.
(2)Build a classifier using CARs set based on training data and apply the classifier for new data cases.
Therefore, the main focus to improve CBA algorithm is to
(1)How can we better choose the CARs set using association rule mining
(2)How can we generate more accurate classifier
Liu’s group has attacked on both frontiers. One of the key parameters in association rule mining is the smin, which controls how many rules and what kinds of rules will be generated. The original CBA system uses a single smin in its rule generation. A single smin is inadequate for unbalanced class frequency distribution since using a single smin will result in one of the following two problems: if the smin value is set too high, sufficient rules of infrequent classes may not be discovered; if the smin value is set too low, many useless and overfitting rules for frequent classes may be produced. To solve the problems, Liu et al adopted multiple smin as follows:
For each class ci, a different minimum class support (smin,i) is assigned. The user-defined total minimum support, t_ smin, is distributed to each class according to their class distributions:
smin,i = t_ smin × freqDistr(ci)
The formula gives frequent classes higher smin and infrequent classes lower smin. This ensures that sufficient rules for infrequent classes will be generated and too many overfitting rules for frequent classes will not be produced. Liu’s group consider that cmin has less impact on the classifier quality as long as it is not set too high since they always choose the most confident rules.
To produce the best classifier out of the whole set of rules is infeasible since the number of possible subsets of rules exponentially grows with the number of rules. Liu’s group presented a heuristic algorithm for building a classifier using CARs. They defined a total order (>) on the generated rules and used it in selecting the rules for the classifier.
Given two rules, ri and rj, ri > rj (also called ri precedes rj or ri has a higher precedence than rj) if (1) the confidence of ri is greater than that of rj, or (2) their confidences are equal, but the support of ri is greater than that of rj, or (3) both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.
Classifier (C) = < r1, r2, …, rn, default_class)
Liu et al proposed several versions of algorithms to build a classifier. The simplest version based on naive algorithm includes three steps: (1) Sort the set of generated rules R according to the relation “>”, (2) Select rules for C from rule set R following the sorted sequence. For each rule r, go through the database to find those cases covered by r, (3) Discard those rules in C that do not improve the accuracy. This simple algorithm needs to make many passes over the database, which is inefficient when the database is not resident in the main memory and when the database is huge. They also presented an improved version of the algorithm, which finds the best rule in R to cover each data case instead of making one pass over the remaining data for each rule.
Liu’s original CBA algorithm is not suitable for huge number of rules or long rules due to combination explosion problem. T deal with this problem, Liu’s group has developed the association data tree (ADT) algorithm. They divide the training data into segments and intelligently choose different classifier for each segment according to a set of conditions.[5, 6]
Liu’s CBA algorithm is a good example that applies association rule mining techniques to classification tasks. It adopted Apriori algorithm to generate the complete set of classification rules (CARs). They have presented a new way to construct classifiers and built a more accurate classifier compared with C4.5. It helps to solve some problems in the existing classification systems such as understandability problem, discovery of interesting and useful rules, and the residency of database on disk rather than in main memory. However, CBA algorithm itself is a compromise between subjective and indiscriminative. CBA algorithm framework has been established but there is still plenty of room for improvement in the CBA such as following:
(1)In Liu’s CBA algorithm there is a step of discretion of data item to be admitted for CARs generation. This can be undesirable because the choice of discretion algorithm can be arbitrary which can affect the final outcome.