Multi-Relational Rule Discovery

Progress Report

November 2002

Mahmut Uludag

Eastern Mediterranean University

Supervisor: Prof. Dr. Mehmet Tolun

Summary

This report presents the current status of the multi-relational rule discovery research. As part of the research, a data mining system is being developed that allows discovery of multi-relational rules using data from a relational database. The basic assumption of the system is that objects analysed are stored in a set of tables.

The discovered multi-relational rules can either be used to predict an object attribute value (by looking at the other attribute values of the object), or to reveal hidden relationships between the object attribute values. The second way of using the discovered rules can be described as ‘data summarisation’.

The system developed so far is not yet able to use data from every possible ‘connected’ schema; it requires a central table that is directly connected to all other tables. However, this report gives details of the general solution for a ‘connected’ schema, which we plan to implement during December 2002.

The report also gives details of the possible usage of the rule discovery algorithm in SRS, the data integration system of LION bioscience [14]. Basically, when a relational library is defined in SRS, most of the input required by the rule discovery algorithm becomes automatically defined. The rule discovery process can then be started from a button on the SRS query results page that allows the user to link to “hidden relations in the results”.

1. Introduction

Most current data mining algorithms are designed to use datasets from a single table. They require that each object be described by a fixed set of attributes. Compared to a single-table dataset, a relational database containing multiple tables makes it possible to represent more complex and structured data. In addition, a significant amount of scientific data, such as gene expression data, is today stored in relational databases. For these reasons, it is important to have discovery algorithms that run on relational data in its natural form, without requiring the data to be viewed as one single table.

Previous work on multi-relational data mining concentrated on producing decision tree classifiers [11] [5] or on association rule mining [7]. However, in a more recent study [12], multi-relational data mining was applied to a rule discovery algorithm as a test case for a new paradigm.

In [10], a multi-relational data mining framework was described that can be used to analyse data in relational databases containing multiple relations. In this study, we have applied the described framework to a specific rule induction algorithm, namely ILA (Inductive Learning Algorithm). ILA is a ‘covering’ type machine learning algorithm that takes each class in turn and seeks a way of covering all instances of it while excluding instances not in the class [2]. The concepts suggested in the framework, such as selection graphs, the target table and the target attribute, all helped considerably in building the multi-relational version of the ILA algorithm. However, we needed to extend the framework with a notion of distinction between covered and not yet covered objects.

A relational data model consisting of multiple tables may represent several object classes: within a schema, one set of tables may represent one class of objects while a different set of tables represents another class. Before starting the multi-relational data mining process, one should analyse the schema and select the list of tables that represent the kind of objects to be analysed. One of the selected tables is central for the object, and each row in this table should correspond to a single object in the database. In previous multi-relational data mining publications, this central table is named the ‘target table’ in [11] and [5], the ‘primary table’ in [7] and [8], the ‘master relation’ in [13], and the ‘hub table’ in [14]. The part of the object information that is stored in the other selected tables can be found by following the associations between the target table and those tables.

The target table in multi-relational data mining terminology has a similar meaning to the fact table in data warehouses, where the database usually consists of a fact table, storing all measures, with dimension tables around it. Having one table per dimension leads to a star schema, and by normalising the dimension tables a snowflake schema is obtained [9]. The fact table is related to several dimension tables through a foreign key relationship to each dimension table; that is, each row inserted into the fact table has values in the foreign key columns that correspond to primary key values in the various dimension tables.

In many cases a database may contain several kinds of objects. In such cases multiple target tables can be selected. However, Relational-ILA can analyse examples of only one kind of object at a time, so target tables should be selected one at a time, one per execution of Relational-ILA.

ILA requires a particular feature of the object under consideration to be used as the dependent attribute for classification. For this reason, one of the columns of the target table should be defined as the target attribute.

The most apparent difference between a standard rule and a multi-relational rule is that, in a multi-relational rule, attribute names in conditions are annotated with the name of the relational table to which the attribute belongs. However, this table name detail does not need to be presented to the user; see for example Figure 1.

Figure 1. An example multi-relational rule, from the PharmaDM web site [15].
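To make the table name annotation concrete, the following is a constructed example (not the rule shown in Figure 1), assuming a hypothetical target table customer with target attribute status and a related table order:

    IF order.payment_type = 'credit' AND customer.region = 'north' THEN customer.status = 'gold'

When such a rule is presented to the user, the 'customer.' and 'order.' prefixes can simply be dropped.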

2. The Algorithm

This section summarises the basic multi-relational ILA algorithm. Input to the algorithm can be listed as follows.

The list of table names,

The name and type of each column in each table,

The list of foreign key relationships between the tables,

The name of the target table,

The name of the target attribute.

Given the input, the multi-relational ILA algorithm generates a list of multi-relational rules that cover all objects stored in the given instances of the list of tables. It is assumed that the target table is connected to all other tables through foreign key relations.
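Most of this input can be read directly from the database catalogue. The following is a minimal sketch, assuming JDBC; the class name and connection URL are hypothetical, and the actual input handling in Relational-ILA may differ.

    import java.sql.*;

    public class SchemaInput {
        public static void main(String[] args) throws SQLException {
            // hypothetical connection URL; assumes the JDBC driver is already registered
            Connection con = DriverManager.getConnection("jdbc:driver://host/database");
            DatabaseMetaData meta = con.getMetaData();

            // (1) the list of table names
            ResultSet tables = meta.getTables(null, null, "%", new String[] { "TABLE" });
            while (tables.next()) {
                String table = tables.getString("TABLE_NAME");

                // (2) the name and type of each column in the table
                ResultSet cols = meta.getColumns(null, null, table, "%");
                while (cols.next()) {
                    System.out.println(table + "." + cols.getString("COLUMN_NAME")
                            + " : " + cols.getString("TYPE_NAME"));
                }

                // (3) the foreign key relationships from this table to other tables
                ResultSet fks = meta.getImportedKeys(null, null, table);
                while (fks.next()) {
                    System.out.println(table + "." + fks.getString("FKCOLUMN_NAME")
                            + " -> " + fks.getString("PKTABLE_NAME") + "."
                            + fks.getString("PKCOLUMN_NAME"));
                }
            }
            // (4) the target table and (5) the target attribute are chosen by the user
            con.close();
        }
    }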

Relational-ILA starts by selecting the first value of the target attribute and then attempts to generate rules that correctly classify the objects having this attribute value, in other words the objects belonging to the current class. For this purpose, the following steps are repeated until all examples of the current class are covered, i.e. correctly classified. The same steps are then repeated for the other class values.

2.1 Candidate generation

The first step in the ILA algorithm is to generate the hypotheses. In ILA terminology, ‘hypotheses’ means a set of potentially interesting rules. Relational-ILA first attempts to cover all the class examples by finding hypotheses with only one condition. After exploring all possible size-one hypotheses, ILA starts to refine the set of hypotheses by adding further conditions one at a time. The refinement process continues until either all examples are covered or the maximum rule size is reached.

The following is the current version of the function used to build the initial hypotheses.

    Vector buildHypothesis(String classValue) {
        Vector hypothesis = new Vector();
        For each table {
            For each column in the selected table {
                If (column is not the target attribute
                    & column is not a primary key
                    & column is not a foreign key) {
                    Check whether the table is the target table
                    Check whether the column is numeric
                    Select the proper SQL template and generate SQL
                    resultSet = execute generated SQL
                    hypothesis += generate hypothesis using resultSet
                    Execute SQL in the database      // second template: frequencies in the other classes
                    Retrieve hypotheses and their frequency values
                    For each hypothesis {
                        If the false values are less than a predefined threshold
                        Then append the hypothesis to the current hypothesis set
                    }
                }
            }
        }
        return the hypotheses set
    }

For the general solution, we need to change this function so that it does not search the table space by simply iterating over the table list, but instead searches by following the relations between tables. We therefore need a table name as a second argument, and the loop that iterates over the table list is replaced by a loop that iterates over the foreign key relations from/to the current table. The general solution is as follows; when the function is first called, it should be given the name of the target table.

    Vector buildHypothesis(String classValue, String tableName) {
        Vector hypothesis = new Vector();
        For each column in the selected table {
            If (column is not the target attribute
                & column is not a primary key
                & column is not a foreign key) {
                Check whether the table is the target table
                Check whether the column is numeric
                Select the proper SQL template and generate SQL
                resultSet = execute generated SQL
                hypothesis += generate hypothesis using resultSet
                Execute SQL in the database      // second template: frequencies in the other classes
                Retrieve hypotheses and their frequency values
                For each hypothesis {
                    If the false values are less than a predefined threshold
                    Then append the hypothesis to the current hypothesis set
                }
            }
        }
        For each table linked by a foreign key relation {
            call buildHypothesis(classValue, linkedTableName)
        }
        return the hypotheses set
    }
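The sketch below illustrates in Java how this recursive traversal could be organised. The ForeignKey class, the searchColumns method and the visited set are assumptions made here for illustration; in particular, the visited set guards against revisiting a table when the foreign key graph of a ‘connected’ schema contains cycles. The first call would then be buildHypotheses(classValue, targetTableName, new HashSet(), new ArrayList()).

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    class SchemaTraversal {

        // one foreign key relation between two tables, as given in the input
        static class ForeignKey {
            String fromTable;
            String toTable;
        }

        List foreignKeys = new ArrayList();   // all relations in the schema

        // Search the current table for size-one hypotheses, then follow every
        // foreign key relation from/to this table to the neighbouring tables.
        void buildHypotheses(String classValue, String tableName,
                             Set visited, List hypotheses) {
            visited.add(tableName);
            searchColumns(classValue, tableName, hypotheses);
            for (Iterator it = foreignKeys.iterator(); it.hasNext();) {
                ForeignKey fk = (ForeignKey) it.next();
                String next = null;
                if (fk.fromTable.equals(tableName)) next = fk.toTable;
                else if (fk.toTable.equals(tableName)) next = fk.fromTable;
                if (next != null && !visited.contains(next)) {
                    buildHypotheses(classValue, next, visited, hypotheses);
                }
            }
        }

        // placeholder for the per-column work: generate and execute the SQL
        // templates of Sections 2.1.1 and 2.1.2 and collect the hypotheses
        void searchColumns(String classValue, String tableName, List hypotheses) {
        }
    }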

2.1.1 Candidates in the target table

A list of hypotheses is generated by first searching the target table using the following SQL template. The template is applied for each column other than the class and the primary key column.

Select attr, count(attr) from targetTable, covered where targetAttr = currentClass and covered.id = targetTable.pk and covered.mark=0 group by attr;

Here,

attr is one of the column names in the target table which cannot be the primary key column(s) or the class column, i.e. the target attribute,

targetTable is the central table which has one row for each object for which classification rules are being searched,

targetAttr is the class column which is to be used in the right side of the classification rules,

pk is the name of the primary key column in the target table.

In the template, the table ‘covered’ is a temporary table generated by Relational-ILA. In order to reduce the complexity of communication between Relational-ILA and the database system, the information about covered objects is stored in the temporary table rather than in an internal data structure. An alternative approach would be to keep this information in the Relational-ILA main memory. This requires the hypothesis generation SQL templates to be extended such that the list of identifiers of the covered elements is included in the SQL. The alternative approach was not selected because it can make the SQL statements too long when the number of marked objects is large.

The temporary table ‘covered’ has two attributes, named ‘id’ and ‘mark’. When ILA starts processing a class, this table is initialised as follows: a new row is inserted for each object belonging to the current class, the ‘id’ field gets the value of the object’s primary key, and the ‘mark’ field is set to zero. When a new rule is generated, the ‘mark’ fields of the rows that refer to the newly covered objects are set to one.

The above SQL template generates the hypotheses together with their frequency values. However, ILA also needs to know the frequency of each hypothesis in the other classes. For this reason, the following SQL template is used to calculate these frequency values.

Select attr, count(attr) from targetTable where targetAttr <> currentClass group by attr;
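A minimal sketch of how the ‘covered’ table described above could be maintained through JDBC is given below; the table and column names (‘covered’, ‘id’, ‘mark’) follow the text, while the method names and the exact SQL statements are assumptions.

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    class CoveredTable {

        // Re-initialise 'covered' for the current class: one row per object of the
        // class, with mark = 0 (assumes the temporary table has already been created).
        static void reset(Connection con, String targetTable, String pk,
                          String targetAttr, String currentClass) throws SQLException {
            Statement st = con.createStatement();
            st.executeUpdate("delete from covered");
            st.executeUpdate("insert into covered (id, mark) select " + pk + ", 0 from "
                    + targetTable + " where " + targetAttr + " = '" + currentClass + "'");
            st.close();
        }

        // Set mark = 1 for every object selected by the left-hand side of a new rule.
        // fromList and whereClause are assumed to come from the SQL translation of the
        // rule's hypothesis (the toSQL method described later in Section 2.1.2).
        static void markCovered(Connection con, String targetTable, String pk,
                                String fromList, String whereClause) throws SQLException {
            Statement st = con.createStatement();
            st.executeUpdate("update covered set mark = 1 where id in (select "
                    + targetTable + "." + pk + " from " + fromList
                    + " where " + whereClause + ")");
            st.close();
        }
    }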

2.1.2 Candidates in the other tables

After generating the list of all possible size-one hypotheses in the target table, the next step is to find the size-one hypotheses using the other tables. The following SQL template is used for this purpose. The SQL queries are repeated for each table and for each column except the foreign key column(s).

Select attr, count(attr) from table, covered, targetTable where table.fk = targetTable.pk and targetTable.targetAttr = currentClass and covered.id = targetTable.pk and covered.mark=0 group by attr

Here,

table refers to the current table where the hypothesis is being searched,

fk is the foreign key column which is connected to the target table.

As in the target table case, we need to calculate the frequency of the hypotheses in the other classes. These frequency values are calculated using SQL statements derived from the following SQL template.

Select attr, count(attr) from table, targetTable where table.fk = targetTable.pk and targetTable.targetAttr <> currentClass group by attr
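As an illustration with a hypothetical schema, suppose the target table is patient (primary key id, target attribute diagnosis), and prescription is a second table with a foreign key column patient_id and an attribute column drug. For the class value 'diabetes', the above template would be instantiated as:

    Select drug, count(drug) from prescription, patient where prescription.patient_id = patient.id and patient.diagnosis <> 'diabetes' group by drug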

Each time the above queries are executed, ILA uses the returned result set to build an internal list of hypotheses based on the Hypothesis class, which has the following attributes.

The table attribute is used to store the table name information.

The column attribute is used to store the column name information.

The value attribute refers to the column value for which the hypothesis stands.

The trueValues attribute is used to store the number of times the hypothesis is true in the not yet covered elements of the training data.

The falseValues attribute is used to store the number of times the hypothesis is false.

The Hypothesis class also has an attribute called connectedTo, which relates hypotheses to each other so that it is possible to form multi-relational hypotheses. A multi-relational hypothesis is composed of simple hypotheses all connected by the logical ‘and’ operator. Similar to selection graphs [10], multi-relational hypotheses can be translated into SQL in a straightforward manner. For this purpose we have implemented the toSQL method of the Hypothesis class. The method simply traverses the hypotheses connected to each other and prepares the SQL equivalent of the hypothesis object.
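The following is a sketch of the Hypothesis class as described above. The attribute names follow the text; the body of toSQL is an assumption about how the chain of connected conditions could be turned into a WHERE fragment (the join conditions over the foreign keys are omitted here for brevity).

    class Hypothesis {
        String table;            // name of the table the condition refers to
        String column;           // name of the column the condition refers to
        String value;            // column value for which the hypothesis stands
        int trueValues;          // frequency among not yet covered objects of the class
        int falseValues;         // frequency among objects of the other classes
        Hypothesis connectedTo;  // next condition of a multi-relational hypothesis

        // Translate the chain of connected conditions into an SQL condition
        // such as "t1.a = 'x' and t2.b = 'y'".
        String toSQL() {
            StringBuffer sql = new StringBuffer();
            for (Hypothesis h = this; h != null; h = h.connectedTo) {
                if (sql.length() > 0) sql.append(" and ");
                sql.append(h.table).append(".").append(h.column)
                   .append(" = '").append(h.value).append("'");
            }
            return sql.toString();
        }
    }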

2.2 Hypothesis Evaluation and Rule Generation

At this stage, the list of hypotheses is ready, together with frequency values showing how each hypothesis classifies the training data. This is the typical information needed by many rule discovery algorithms for selecting rules. This means that the results of this study are not bound to the ILA algorithm but can also be used to build multi-relational versions of other rule discovery algorithms.

The rule selection criterion of ILA is simple: select the hypothesis with the maximum number of trueValues among those whose falseValues is zero. If there is a hypothesis satisfying this criterion, it is used to generate a new rule. Rules in Relational-ILA are represented using the following attributes.

The hypothesis attribute refers to the hypothesis object from which the rule was derived. The hypothesis object forms the left-hand side of a rule.

The targetTable attribute is used to store the name of the target table.

The targetAttr attribute is used to store the name of the target attribute.

In summary, the hypothesis attribute forms the left-hand side of a multi-relational rule while the targetTable and targetAttr attributes form the right-hand side. Unlike a conventional classification rule, a multi-relational rule includes the name of the relational table to which the target attribute belongs.
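A corresponding sketch of the rule representation and of the ILA selection criterion is shown below. The Rule fields follow the text, while the classValue field and the selectRule helper are assumptions added for illustration; the Hypothesis class is the one sketched in the previous section.

    import java.util.Iterator;
    import java.util.List;

    class Rule {
        Hypothesis hypothesis;  // left-hand side of the rule
        String targetTable;     // table to which the target attribute belongs
        String targetAttr;      // target attribute predicted by the rule
        String classValue;      // assumed: class value forming the right-hand side
    }

    class RuleSelector {

        // Select the hypothesis with the maximum trueValues among those whose
        // falseValues is zero; return null if no hypothesis qualifies.
        static Rule selectRule(List hypotheses, String targetTable,
                               String targetAttr, String classValue) {
            Hypothesis best = null;
            for (Iterator it = hypotheses.iterator(); it.hasNext();) {
                Hypothesis h = (Hypothesis) it.next();
                if (h.falseValues == 0 && (best == null || h.trueValues > best.trueValues)) {
                    best = h;
                }
            }
            if (best == null) return null;   // refinement will be needed
            Rule rule = new Rule();
            rule.hypothesis = best;
            rule.targetTable = targetTable;
            rule.targetAttr = targetAttr;
            rule.classValue = classValue;
            return rule;
        }
    }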

After a new rule has been generated, the objects classified by the new rule are marked as covered in the table ‘covered’ as described in the previous section. Then, in the further steps of the algorithm, the new set of hypotheses is generated only from the objects that have not yet been covered.

The hypothesis generation and rule generation steps are repeated until either no valid rule can be found in the current hypothesis set or all objects of the current class are covered by the already generated rules. If no valid rule can be produced but some objects remain uncovered, the active set of hypotheses is refined as described in the next section. Before describing the refinement algorithm, the basic algorithm can be summarised as follows.

    For each class {
        Reset covered examples table (classValue);
        Repeat {
            Vector hypothesis = Build Hypothesis (classValue);
            Rule rule = Select Rules (hypothesis, classValue);
            If no rule was selected {
                int size = 1;
                Refine Search (hypothesis, classValue, size);
                Stop processing for the current class
            }
            Else {
                Update covered examples information (rule);
                If all examples are covered
                Then stop processing for the current class
            }
        }
    }

2.3 Refinement of Hypothesis

Refinement of a multi-relational hypothesis means extending the actual description of the hypothesis. It results in a new selection of objects that is a subset of the selection associated with the original multi-relational hypothesis.