Principles for Measuring Association Rules

Victor Shi, Shan Duanmu and William Perrizo

Computer Science Dept., North Dakota State University

Fargo, ND 58105

{Victor.Shi, Shan.Duanmu, William.Perrizo}@ndsu.nodak.edu

1  Introduction

Association rule mining searches for interesting relationships among items in a given data set. Such interesting relationships are typically expressed in an association rule in the form of , where and are sets of items. It can be read that, whenever a transaction T contains X, it probably will contain Y. The probability is defined as the percentage of transactions containing Y in addition to X with respect to the overall number of transactions containing X. This probability is called confidence (or strength). While the confidence measure represents the certainty of a rule, support is used to represent the usefulness of the rule [1]. Formally, the support of a rule is defined as the percentage of transactions containing both X and Y with respect to the number of transactions in the database.

The purpose of association rule mining is to find interesting rules. A rule is considered to be interesting if its confidence and support exceed certain thresholds. Such thresholds are generally assumed to be given by domain experts. Many algorithms use the support threshold to reduce the algorithmic complexity and the number of rules generated for human analysis [2, 3], since the number of rules grows exponentially with the number of items.

While the support-confidence framework has been widely used for measuring the interestingness of association rules, it is known that the resulting rules may be misleading [4-8]. A rule with high support and high confidence may still not indicate that X and Y are dependent. The use of thresholds of support and confidence for pruning may obscure important rules, and also many unimportant rules may remain in the resulting rule set. S. Brin et al [4, 5] proposed Chi-square test, interest and conviction measures to overcome the weaknesses of the support-confidence framework. S. Ahmed [6] defined a reliability measure to fix the symmetry problems of Brin’s framework, namely that Interest gives the same value to rules and , and that Conviction gives that the same value to rules and .

Giving the same value to rules and is a serious problem. For example, if X is BUY_COMPUTER and Y is BUY_SOFTWARE and if , and , by definition [4] the rules and have the same Interest (2). This is counter-intuitive, since we know a customer will likely buy software when he buys a computer. It is much less likely that a customer will buy computer when he buys software. In fact, in the example the confidence of rule is 100%, which is much larger that the confidence of rule (33.3%). S. Ahmed et al [6] claim that Reliability is a better choice than Conviction by saying that Conviction gives the same value to rules and while Reliability does not. Logically, is equivalent to [9], so it could be argued that an interesting measure should respect this logical equivalence. However, just because they are logically equivalent (have the same truth value when the confidence is 100%), that doesn't necessarily mean they are equally interesting.

In the literature, many other measures [8, 10] have been proposed to capture the correlation property of X and Y. Are there other properties that need to be captured? What measures should be used together to fully characterize the interestingness properties of association rules? Is there a single measure that incorporates all the properties and thus can be sorted in total order to rank rules for presenting the top N rules to users? Few papers have addressed these questions comprehensively and systematically. In this paper we present six objective principles for evaluating interestingness measures. With the proposed principles, we try to answer the above questions based on the analysis of 5 interestingness measures. We will also answer the question whether Reliability is a better interestingness measure than Conviction by detailed analysis. The major contributions of this paper are:

·  We propose 6 principles, compare and contrast them with previous work.

·  We conclude that conviction or reliability is a good measure for ranking rules if only positively-correlated rules are concerned.

·  We point out that the support-confidence framework discussed in [11] for mining the most interesting rules through partial order is not adequate in capturing all the properties a good measuring system must have for rule ranking. We conclude that a three measure framework must be used.

·  We conclude that the support-confidence-interest framework is the best among existing measures if all principles proposed in this paper must be satisfied.

·  We report that the interest of a rule also reflects the rule’s novelty.

·  We use theoretical reasoning instead of experiments on artificially generated data or manually collected data when making conclusive statements, which avoids the possibility of making incorrect or inconclusive statements due to the intrinsic nature of incompleteness in data collection or data generation.

The paper is organized as follows. We present the six principles in section 2. In section 3 we choose five interestingness measures and compare them using the proposed principles. Section 4 discusses possible pruning techniques. Section 5 discusses related work. We conclude the paper in Section 5.

2  Objective principles

An association rule can be interpreted as the presence of itemset X in a transaction would imply the occurrence of itemset Y. Logically (100% confidence) is equivalent to [9]. Also it can be rewritten as . Thus (confidence) and (conviction) both would be good measures with regard to implication. Further, we impose a constraint on the implication measure that, when then the implication of rule should be larger than the implication of rule . This can be explained from the example in section 1, where , and . Definitely the implication of should be larger than the implication of . We will see in the next section both confidence and conviction follow this constraint. Thus we write the implication principle as follows.

Principle 1 (implication principle): If a set of measures is defined to reflect the interestingness of an association rule , then at least one measure in the set should satisfy the constraint when .

To mine a rule , we assume there is certain relationship between X and Y. It makes no sense to say that we have a rule but its antecedent and consequent are independent. Confidence measure does have this drawback, though it satisfies the Implication Principle. Much research effort has been devoted to development of new measures that reflect the strength of the correlation between the antecedent and the consequent of a rule [9]. Theoretically, the covariance of X and Y is the best way to reflect the correlation strength. In practice, we also hope that the measure may have some sort of closure property so that we can utilize it to reduce computation complexity. In general, we only need the correlation measure to be proportional to the covariance.

Principle 2 (correlation principle): If a set of measures is defined to reflect the interestingness of an association rule , then at least one measure mi in the set should be directly proportional to the covariance of X and Y.

When presenting rules to user, it is desirable that the rules contain new information. Most efforts in this regard are devoted to removing redundant rules from generated rule set [1]. One of two rules is considered to be redundant if the two are “very close” to each other. Here we define novelty from a different perspective. In general, we say a rule is less novel if it is closer to common knowledge. The difficulty is how to quantitatively measure “closer to common knowledge”. We define “common knowledge” to be the rules we can get from the database when (or ). This is reasonable, since we are sure there is a rule if , no matter how small is, and vice versa. The closer is to 1, the less novel the rule is. Also the closer ) is to 1, the less novel the rule is, since X will be more likely to co-occur with Y if ) is to 1. We then draw the following novelty principle.

Principle 3 (novelty principle): If a set of measures is defined to reflect the interestingness of an association rule, then for a given , at least one measure mi in the set should reflect its novelty. The novelty measure mi should be inversely proportional to p=max{,}.

In this paper we deliberately use , and instead of occurrence frequencies of X, Y and . This is because rule has statistical significance only when X, Y and occur “frequently enough”. The frequency of may have to exceed a certain threshold for the rule to be intriguing to the user. The occurrence frequency threshold of is therefore determined by the “statistical” significance or user’s “financial” interest, whichever is larger. Thus we have the following utility principle.

Principle 4 (utility principle): If a set of measures is defined to reflect the interestingness of an association rule, then at least one measure mi in the set should reflect its utility, i.e., mi is a monotone increasing function with respect to .

In practice, association rule mining may produce too many rules to be examined by humans. It is desirable to present only “the most interesting N rules” to user for further examination. A synthetic measure with respect to which we can sort the rules in descending order can help meet this need.

Principle 5 (top-N-rule principle): If a synthetic measure is defined to sort the rules for presenting the top N rules to users, then it is desirable that this measure obeys the principles 1-4.

Arithmetic complexity is an important issue in association rule mining. It is impossible to mine large databases without efficient algorithms. Enormous amounts of effort in the past have been devoted to reduce computation complexity. We thus define the efficiency principle as follows.

Principle 6 (efficiency principle): If a set of measures is defined to reflect the interestingness of an association rule, then it is desirable that thresholds used with those measures help reduce computation complexity.

Among the proposed principles, the implication principle and correlation principle are most important, Compared to principles 5 and 6, principles of novelty and utility are more fundamental. While potentially there are many measures that may satisfy principles 1-4, the efficiency principle and the top-N-rule principle may help decide which measure will be most successful. In the next section we will see examples in this regard.

3  Evaluating measures

In the preceding section we elaborate 6 principles for characterizing association rules. Are these principles complete? Is there redundancy in these principles? We see Implication, Correlation, Novelty and Utility as four fundamental principles. Does that mean we need only four measures to complete the measuring set? So far we do not have a measure that can incorporate the four fundamental principles, what should we do then? What is the best measure? In this section we select five example measures to analyze using the principles we defined. You will see most of these questions can be answered after the analysis is done. For convenience, we rewrite the definitions of the selected measures as follows:

Support (3.1)

Confidence (3.2)

Interest (3.3)

Conviction (3.4)

Reliability (3.5)

3.1  Support

Formula (3.1) defines support as the joint occurrence probability of X and Y in a transaction. Obviously it is a utility measure as many researchers have pointed out – it satisfies principle 4. Also it satisfies the Efficiency principle (principle 6) because of its downward closure property [4]. Further examination shows that support does not satisfy the other principles.

3.2  Confidence

Formula (3.2) defines confidence as the conditional probability of Y given X. Here we prove that it satisfies the Implication principle (principle 1) when :

Thus confidence is an implication measure as we expected. Similarly, we know that confidence can not reflect the correlation principle and the utility principle, and has no closure property we can use to reduce complexity [4]. It also does not satisfy the novelty principle.

3.3  Interest

Formula (3.3) defines Interest. It is known that the interest is defined to capture the strength of correlation between X and Y [4, 8]. Interestingly, we can prove interest satisfies the novelty principle.

3.4  Conviction

Formula (3.4) defines conviction. It can be rewritten as . If conviction is a good implication measure, we must have –>0 when , as the implication principles requires. We can prove that conviction can be used as an implication measure when we are only interested in positively correlated rules. Also it can be used as a correlation measure as the authors of [8] concluded. Conviction does not do well with regard to principle 3.

3.5  Reliability

Formula (3.5) defines Reliability. It reflects correlation between X and Y as discussed in [6]. We can prove that Reliability satisfies the implication principle when we are only interested in rules that have positive correlations.

We summarize the discussion of this section in table 1. All the proofs are omitted from this paper due to space limits. Interested readers are referred to an expanded version of this paper for details [15].

Table 1: the properties of interestingness measures

Support / confidence / Interest / Conviction / Reliability
Implication / X / X (when positively correlated) / X (when positively correlated)
Correlation / X / X / X
Novelty / X / X (when negatively correlated)
Utility / X

A few conclusions can be made from table 1.

1.  No measure is absolutely better than the others for obtaining the Top-N ruleset.

2.  When using a measure such as reliability or conviction, support is still an important utility measure. Interest still should be used as a novelty measure in order to fully characterize rules.