A BUILDING BLOCK APPROACH TO GENETIC PROGRAMMING FOR RULE DISCOVERY

AP Engelbrecht, Department of Computer Science, University of Pretoria, South Africa

SE Rouwhorst, Department of Informatics & Mathematics, Vrije Universiteit Amsterdam, The Netherlands

L Schoeman, Department of Computer Science, University of Pretoria, South Africa

INTRODUCTION

The cost of storing data has decreased significantly in recent years, with a consequent explosion in the growth of stored data. These vast quantities of data encapsulate hidden patterns and characteristics of the problem domain. To remain competitive, there is now an increasing demand for techniques to extract these hidden patterns from large databases. Utilizing such knowledge can substantially increase a company's revenue and reduce its costs.

The extraction of useful information from databases – a process commonly known as Knowledge Discovery in Databases (KDD) – has become a widely researched area. Large amounts of data can now be analysed to find patterns that can be used for tasks such as prediction, visualization and classification. Traditional statistical methods have been shown to be insufficient for many real-world problems. For example, statistical analysis can determine co-variances and correlations between attributes, but cannot characterize these dependencies at an abstract, conceptual level. Also, statistical techniques fail to produce causal explanations of why dependencies among attributes exist. “Intelligent” methods are needed for such tasks.

Recently developed knowledge extraction tools have their origins in artificial intelligence. These new tools combine and refine approaches such as artificial neural networks, genetic algorithms, genetic programming, fuzzy logic, clustering and statistics. While several tools have been developed, this chapter concentrates on a specific evolutionary computing approach, namely genetic programming (GP).

Evolutionary computing approaches to knowledge discovery have been shown to be successful in knowledge extraction applications. They are, however, computationally expensive by nature, because evolution starts with large, complex, structured individuals. This is especially true for genetic programming, where complex decision trees are evolved. This chapter presents a building-block approach to genetic programming, where conditions (or conjuncts) and sub-trees are added to a tree only when needed.

The building-block approach to genetic programming (BGP) starts evolution with a population of the simplest possible individuals. That is, each individual consists of only one condition (the root of the tree) and the associated binary outcomes – thus representing two simple rules. These simple individuals evolve in the same way as in standard GP. When the simplicity of the BGP individuals fails to account for the complexity of the data, a new building block (condition) is added to the individuals in the current population, thereby increasing their representational complexity. This building-block approach differs from standard GP mainly in that standard GP starts with an initial population of complex individuals.

The remainder of the chapter is organized as follows: The next section offers background on current knowledge extraction tools, and gives a motivation for the building-block approach. A short overview of standard GP is also given. The section that follows discusses the building-block approach in detail, with experimental results in the last section.

BACKGROUND

This section gives a short overview of state-of-the-art knowledge extraction tools, motivates the building-block approach and presents a summary of standard GP.

Knowledge Extraction Tools

Several knowledge extraction tools have been developed and shown to be efficient. The first tools came from the machine learning community, namely C4.5 (Quinlan, 1993) and CN2 (Clark et al, 1989). C4.5 builds a decision tree, from which if-then rules are extracted. CN2 uses a beam strategy to induce rules directly. Several rule extraction algorithms for neural networks (NN) have been developed, of which the KT-algorithm of Fu (1994) and the N-of-M algorithm of Craven and Shavlik (1994) are popular. The NN rule extraction algorithms transform the knowledge embedded in the numerical weights of the network into symbolic rules.

Several evolutionary computing algorithms have also been developed to evolve decision structures. Two ways exist in which a Genetic Algorithm (GA) can play a role in data mining on classification problems. The first way is to let a GA optimise the parameters of other data mining algorithms, for example by using the GA to select a subset of features from the total set of available features. After a subset has been chosen, a machine-learning algorithm such as C4.5 or AQ15 is used to build a classifier from the feature subset. The studies by Cherkauer et al (1996) are examples of research using this approach. They show that the combination of a GA and a traditional induction algorithm gives better results than the traditional induction algorithm alone, improving both the accuracy and the number of features used by the resulting classifier.
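
As an illustration of this first approach, the following minimal sketch evolves a bit-mask over the available features with a simple GA, scoring each mask by the cross-validated accuracy of a decision-tree learner. Here scikit-learn's DecisionTreeClassifier stands in for C4.5, and the population size, rates and parsimony penalty are illustrative assumptions, not values from the cited studies.

```python
import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.02):
    """Evolve a bit-mask over features; fitness is the accuracy of a
    tree induced on the selected subset (C4.5 stand-in)."""
    n = X.shape[1]
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def fitness(mask):
        cols = [i for i, bit in enumerate(mask) if bit]
        if not cols:                      # an empty subset is useless
            return 0.0
        score = cross_val_score(DecisionTreeClassifier(),
                                X[:, cols], y, cv=3).mean()
        return score - 0.001 * len(cols)  # small parsimony pressure

    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        pop = scored[:2]                  # elitism: keep the two best masks
        while len(pop) < pop_size:
            p1, p2 = random.sample(scored[:10], 2)   # select among the fitter half
            cut = random.randrange(1, n)             # one-point crossover
            child = p1[:cut] + p2[cut:] if random.random() < crossover_rate else p1[:]
            pop.append([1 - b if random.random() < mutation_rate else b
                        for b in child])             # bit-flip mutation
    return max(pop, key=fitness)
```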

The second way of using a GA is to let the GA do all the searching, instead of combining the GA with another search or induction algorithm. One example of this approach is GABIL (De Jong et al, 1991), which performs an incremental search for a set of classification rules. Classification rules are represented by fixed-length bit-strings, and GABIL uses only features with nominal values. The bit-strings used by the GA represent a rule using k_1 + k_2 + ... + k_n + 1 bits, where n is the total number of features and k_i is the number of values of feature i, 1 <= i <= n. The last bit of the string is used to store the classification. GABIL initially accepts a single instance from a pool of instances and searches for as perfect a rule set as possible for this example within the given time/space constraints. This rule set is then used to predict the classification of the next instance. If the prediction is incorrect, the GA is invoked to evolve a new rule set using the two instances. If the prediction is correct, the instance is simply stored with the previous instance and the rule set remains unchanged.
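
To make the encoding concrete, here is a minimal sketch of a GABIL-style bit-string for a single rule: one bit per value of each nominal feature (a set bit means that value is permitted by the rule), plus one trailing bit for a binary classification. The features and values are invented for illustration.

```python
# GABIL-style rule encoding: one bit per value of each nominal feature,
# plus one trailing bit for the (binary) classification.
features = {                       # illustrative features and their values
    "outlook":  ["sunny", "overcast", "rain"],   # k_1 = 3
    "humidity": ["high", "normal"],              # k_2 = 2
}

def encode_rule(allowed, label):
    """allowed maps feature -> set of permitted values; label is 0 or 1.
    Total length = k_1 + k_2 + ... + k_n + 1 bits."""
    bits = []
    for name, values in features.items():
        bits += [1 if v in allowed[name] else 0 for v in values]
    return bits + [label]

# "IF outlook in {sunny, rain} AND humidity = high THEN class = 1"
rule = encode_rule({"outlook": {"sunny", "rain"}, "humidity": {"high"}}, 1)
print(rule)   # [1, 0, 1, 1, 0, 1] -> 3 + 2 + 1 = 6 bits
```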

Other GA-based knowledge discovery tools include SET-Gen (Cherkauer et al, 1996), REGAL (Giordana et al, 1994) and GA-MINER (Flockhart et al, 1995).

LOGENPRO (Wong et al, 2000) combines the parallel search power of GP with the knowledge representation power of first-order logic. LOGENPRO stands for LOgic grammar based GENetic PROgramming system. It takes advantage of existing inductive logic programming and GP systems, while avoiding their disadvantages. The framework facilitates the generation of the initial population of individuals and the operation of various genetic operators such as crossover and mutation. A suitable grammar to represent rules has been designed, and modifications of the grammar to learn rules with different formats have been studied. The advantages of a grammar-based GP are that it uses domain knowledge and avoids the need for a closed function set. Bot (1999) used GP to evolve oblique decision trees, where the functions in the nodes of the trees use one or more variables.

The building-block approach to GP differs from the above GP approaches in that standard decision trees are evolved, from simple trees to complex trees.

Ockham’s Razor and Building Blocks

William of Ockham (1285 – 1347/49) was a leading figure in the fourteenth-century golden age of Oxford scholasticism (McGrade, 1992). He became well known for his work in theology and philosophy, but also highly controversial for his criticism of earlier Christian principles, as well as his dispute with Pope John XXII. Currently, however, his name is mostly associated with the so-called ‘principle of parsimony’ or ‘law of economy’ (Hoffmann et al, 1997). Although versions of this principle are to be found in Aristotle and in the works of various other philosophers preceding Ockham, he employed it so frequently and judiciously that it came to be associated with his name. Some centuries were to elapse before the principle of parsimony became known as ‘Ockham’s razor’ (the earliest reference appears to be in 1746). The metaphor of a razor cutting through complicated scholastic and theological arguments to reach the core of truth is probably responsible for the general appeal of the principle and for associating it with Ockham’s name.

The principle of parsimony can be stated in several ways, for example:

  • It is futile to do with more what can be done with fewer. [Frustra fit per plura quod potest fieri per pauciora]
  • Plurality should not be assumed without necessity. [Pluralitas non est ponenda sine necessitate]
  • Entities are not to be multiplied beyond necessity. [Non sunt multiplicanda entia praeter necessitatem]

Although the principle of parsimony was formulated to guide the evaluation of symbolic reasoning systems, it is frequently quoted in other scientific disciplines. Ockham’s razor has, among others, inspired the generalization of neural networks with as few connections as possible (Thodberg, 1991), and fitness evaluation based on a simplicity criterion (Bäck et al, 2000b, p. 15).

In evolutionary computation the idea of building blocks is primarily associated with genetic algorithms. The building-block hypothesis states that GAs produce fitter partial solutions by combining building blocks – short, low-order, highly fit schemas – into more highly fit, higher-order schemas (Hoffmann et al, 1997). In this chapter building blocks are used with genetic programming, where the search space consists of all possible computer programs composed of functions and terminals. The initial population usually consists of a large number of randomly generated computer programs, each of which is executed to determine the fitness value assigned to that program (Bäck et al, 2000a, p. 103).

In keeping with the economy principle, or principle of parsimony, as embodied by Ockham’s razor, the building-block approach to genetic programming starts off with an initial population of very simple programs of one node each. Building blocks, such as decision nodes, are added gradually to increase the complexity. At no stage will the population of programs be more complex than is absolutely necessary; thus no plurality is assumed without necessity.

GENETIC PROGRAMMING FOR DECISION TREES

Genetic programming (GP) is viewed as a specialization of genetic algorithms (Bäck et al, 2000). Similar to GAs, GP concentrates on the evolution of genotypes. The main difference lies in the representation scheme used: where GAs use string representations, GP represents individuals as executable programs (represented as trees). The objective of GP is therefore to evolve computer programs to solve problems. In each generation, every evolved program (individual) is executed to measure its performance, and this performance is then used to quantify the fitness of that program.

In order to design a GP, a grammar needs to be defined that accurately reflects the problem and all its constraints. Within this grammar, a terminal set and a function set are defined. The terminal set specifies all the variables and constants, while the function set contains all the functions that can be applied to the elements of the terminal set. These functions may include mathematical, arithmetic and/or boolean functions. Decision structures such as if-then-else can also be included in the function set. Using tree terminology, elements of the terminal set form the leaf nodes of the evolved tree, and elements of the function set form the non-leaf nodes.

In terms of data mining, an individual represents a decision tree. Each non-leaf node represents a condition, and each leaf node represents a class. Thus, the terminal set specifies all the classes, while the function set specifies the relational operators and the attributes of the problem. Rules are extracted from the decision tree by following each path from the root to a leaf node, taking the conjunction of the conditions encountered along that path.
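
The following is a minimal sketch of this representation and of rule extraction, assuming a simple binary-outcome node structure; the class and field names are illustrative, not from the chapter.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Node:
    # Leaf node: cls is set. Internal node: the condition fields are set.
    cls: Optional[str] = None
    attribute: Optional[str] = None
    op: Optional[str] = None
    threshold: Union[float, str, None] = None
    true_branch: Optional["Node"] = None
    false_branch: Optional["Node"] = None

def extract_rules(node, conjuncts=()):
    """Walk every root-to-leaf path, AND-ing the conditions along the way."""
    if node.cls is not None:                       # leaf: emit one rule
        body = " AND ".join(conjuncts) or "TRUE"
        return [f"IF {body} THEN class = {node.cls}"]
    cond = f"{node.attribute} {node.op} {node.threshold}"
    return (extract_rules(node.true_branch, conjuncts + (cond,)) +
            extract_rules(node.false_branch, conjuncts + (f"NOT({cond})",)))

# A one-node individual, as in the initial BGP population:
tree = Node(attribute="Income", op=">", threshold=10000,
            true_branch=Node(cls="good"), false_branch=Node(cls="bad"))
print("\n".join(extract_rules(tree)))
# IF Income > 10000 THEN class = good
# IF NOT(Income > 10000) THEN class = bad
```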

The fitness of a decision tree is usually expressed as the coverage of that tree, i.e. the number of instances correctly covered. Crossover occurs by swapping randomly selected sub-trees of the parent trees. Several mutation strategies can be implemented; a sketch of crossover and the mutation operators follows the list:

  • Prune mutation: A non-leaf node is selected randomly and replaced by a leaf node reflecting the class that occurs most frequently.
  • Grow mutation: A node is randomly selected and replaced by a randomly generated sub-tree.
  • Node mutation: The content of a node is mutated in any of the following ways: (1) the attribute is replaced with one selected randomly from the set of attributes, (2) the relational operator is replaced with a randomly selected one, or (3) the threshold is perturbed with Gaussian noise in the case of continuous-valued attributes, or replaced with a randomly selected value in the case of discrete-valued attributes.
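
Below is a hedged sketch of crossover and two of these mutation operators, reusing the Node structure from the earlier sketch; the helper names are assumptions, and grow mutation (splicing a randomly generated sub-tree into place) is omitted for brevity.

```python
import copy
import random

def internal_nodes(tree):
    """Collect every non-leaf node of a tree built from the Node class above."""
    if tree is None or tree.cls is not None:
        return []
    return ([tree] + internal_nodes(tree.true_branch)
                   + internal_nodes(tree.false_branch))

def crossover(parent1, parent2):
    """Swap randomly selected sub-trees of copies of the two parents."""
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    n1 = random.choice(internal_nodes(c1))
    n2 = random.choice(internal_nodes(c2))
    n1.__dict__, n2.__dict__ = n2.__dict__, n1.__dict__  # swaps whole subtrees
    return c1, c2

def prune_mutation(tree, majority_class):
    """Replace a randomly chosen internal node by a leaf for the most
    frequently occurring class."""
    t = random.choice(internal_nodes(tree))
    t.attribute = t.op = t.threshold = t.true_branch = t.false_branch = None
    t.cls = majority_class

def node_mutation(tree, attributes, ops, values, sigma=1.0):
    """Mutate the content of one randomly chosen internal node in place."""
    t = random.choice(internal_nodes(tree))
    part = random.choice(["attribute", "op", "threshold"])
    if part == "attribute":
        t.attribute = random.choice(attributes)
    elif part == "op":
        t.op = random.choice(ops)
    elif isinstance(t.threshold, float):  # continuous: perturb with Gaussian noise
        t.threshold += random.gauss(0.0, sigma)
    else:                                 # discrete: replace with a random value
        t.threshold = random.choice(values)
```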

For standard genetic programming, the initial population consists of complete, randomly created decision trees.

BUILDING-BLOCK APPROACH TO GENETIC PROGRAMMING (BGP)

This section describes the building-block approach to genetic programming for evolving decision trees. The assumptions of BGP are first given, after which the elements of BGP are discussed. Finally, a complete pseudo-code algorithm is given.

Assumptions

BGP assumes complete data, meaning that instances may not contain missing or unknown values. Also, each instance must have a target classification, making BGP a supervised learner. BGP assumes attributes to be of one of four data types (a small type-tag sketch follows the list):

  • Numerical and discrete (which implies an ordered attribute), for example the age of a patient.
  • Numerical and continuous, for example length.
  • Nominal (not ordered), for example the attribute colour.
  • Boolean, which allows an attribute to have a value of either true or false.
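
A minimal sketch of how these four types might be tagged (the enum and its names are assumptions):

```python
from enum import Enum, auto

class AttrType(Enum):
    DISCRETE_NUMERIC = auto()    # ordered, e.g. the age of a patient
    CONTINUOUS_NUMERIC = auto()  # e.g. length
    NOMINAL = auto()             # unordered, e.g. colour
    BOOLEAN = auto()             # true or false

# The attribute type determines which relational operators a condition may
# use (see the operator sets in the next subsection).
```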

Elements of BGP

The proposed knowledge discovery tool is based on the concept of a building block. A building block represents one condition, or node in the tree. Each building block consists of three parts: <attribute> <relational operator> <threshold>. An <attribute> can be any of the attributes of the database. The <relational operator> can be any of the set {=, ≠, <, ≤, >, ≥} for numerical attributes, or {=, ≠} for nominal and boolean attributes. The <threshold> can be a value or another attribute. Allowing the threshold to be an attribute makes it possible to extract rules such as

IF Income > Expenditure THEN…
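
A building block can be sketched as such a triple, with the threshold field allowed to name another attribute, which is what yields rules like the one above; the field names and the sampling probability below are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Union

NUMERIC_OPS = ["==", "!=", "<", "<=", ">", ">="]   # {=, !=, <, <=, >, >=}
NOMINAL_OPS = ["==", "!="]                         # {=, !=}

@dataclass
class BuildingBlock:
    attribute: str                  # e.g. "Income"
    op: str                         # drawn from the operator set for the type
    threshold: Union[float, str]    # a constant OR another attribute's name

def random_block(numeric_attrs, value_range=(0.0, 100.0)):
    """Sample one condition; with some probability the threshold is
    another attribute rather than a constant."""
    attr = random.choice(numeric_attrs)
    if random.random() < 0.2:       # attribute-vs-attribute comparison
        threshold = random.choice([a for a in numeric_attrs if a != attr])
    else:
        threshold = random.uniform(*value_range)
    return BuildingBlock(attr, random.choice(NUMERIC_OPS), threshold)

print(random_block(["Income", "Expenditure", "Age"]))
# e.g. BuildingBlock(attribute='Income', op='>', threshold='Expenditure')
```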

The initial population is constructed such that each individual consists of only one condition node (the root) and two leaf nodes corresponding to the two outcomes of the condition. The class assigned to a leaf node depends on two distributions: the distribution of the training instances over the classes in the whole training set, and the distribution of the instances propagated to that leaf node. To illustrate this, consider a training set consisting of 100 instances which are classified into four classes A, B, C and D. Of these instances, 10 belong to class A, 20 to class B, 30 to class C and 40 to class D. Thus, the distribution of the classes in this example is skewed: [10, 20, 30, 40]. If the classes were evenly distributed, there would be 25 instances belonging to class A, 25 belonging to class B, and so on. Now suppose that 10 of the 100 instances are propagated to the leaf node whose classification we want to determine, and that these 10 instances are distributed as [1, 2, 3, 4]. Which class should be put into the leaf node when the distribution over the classes in the leaf node is the same as in the training set? In this case we choose to put the class with the highest number of instances into the leaf node, which is class D.

What happens if the two distributions are dissimilar? Let's say the overall class distribution is again [10, 20, 30, 40] and the distribution of the instances propagated to the leaf node is this time [2, 2, 2, 2]. A correction factor that accounts for the overall distribution is first determined for each class. The correction factors are 25/10, 25/20, 25/30 and 25/40 for classes A, B, C and D respectively, where 25 is the number of instances per class under an even distribution. The correction factors are then multiplied with the distribution of the instances in the leaf node, and the class corresponding to the highest number is chosen: [(25/10)*2, (25/20)*2, (25/30)*2, (25/40)*2] equals [5, 2.5, 1.67, 1.25], which means class A is chosen.
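
The leaf-labelling scheme of the two worked examples above, as a minimal sketch (the function name and the tie-breaking tolerance are assumptions):

```python
def leaf_class(overall_counts, leaf_counts, classes):
    """Choose the class for a leaf node, correcting for a skewed overall
    class distribution in the training set."""
    even = sum(overall_counts) / len(overall_counts)      # 100/4 = 25 above
    corrected = [(even / o) * l for o, l in zip(overall_counts, leaf_counts)]
    best = max(corrected)
    tied = [i for i, c in enumerate(corrected) if abs(c - best) < 1e-9]
    if len(tied) > 1:   # leaf mirrors the overall distribution: raw majority wins
        return classes[max(tied, key=lambda i: leaf_counts[i])]
    return classes[tied[0]]

# The two worked examples from the text:
print(leaf_class([10, 20, 30, 40], [1, 2, 3, 4], "ABCD"))  # -> D (proportional case)
print(leaf_class([10, 20, 30, 40], [2, 2, 2, 2], "ABCD"))  # -> A (corrected maximum)
```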

A first choice for the fitness function could be the classification accuracy of the decision tree represented by an individual. However, when the distribution of the classes among the instances is skewed, classification accuracy as a measure of fitness for a set of rules does not account for the significance of rules that predict a poorly represented class. Instead of using the accuracy of the complete rule set, BGP evaluates the accuracy of each rule independently and determines which rule has the lowest accuracy on the training instances covered by that rule. In other words, the fitness of the complete rule set is determined by the weakest element in the set. This type of fitness evaluation has the advantage that even rules that cover only a few instances must meet the same standard as rules that cover many instances.

In the equation below, the function C(i) returns 1 if instance i is correctly classified by rule R, and 0 if not. If rule R covers P instances out of the whole training set, then the accuracy of rule R is

accuracy(R) = ( Σ_{i=1}^{P} C(i) ) / P
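
A minimal sketch of this fitness computation, assuming a rule is held as a (predicate, predicted class) pair:

```python
def rule_accuracy(rule, instances):
    """Accuracy of one rule over the training instances it covers:
    the sum of C(i) over the P covered instances, divided by P."""
    predicate, predicted = rule
    covered = [x for x in instances if predicate(x)]
    if not covered:                     # a rule covering nothing is worthless
        return 0.0
    correct = sum(1 for x in covered if x["class"] == predicted)
    return correct / len(covered)

def tree_fitness(rules, instances):
    """BGP scores a rule set by its weakest rule, so rules for rare
    classes must meet the same standard as rules for frequent ones."""
    return min(rule_accuracy(r, instances) for r in rules)

# Illustrative usage:
data = [{"Income": 5, "class": "bad"}, {"Income": 20, "class": "good"}]
rules = [(lambda x: x["Income"] > 10, "good"),
         (lambda x: x["Income"] <= 10, "bad")]
print(tree_fitness(rules, data))        # 1.0: each rule is correct on its cover
```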