Methods of Discretization

Dean L. Zeller, Chibuke Muoh

{dzeller,cmuoh}@cs.kent.edu

Department of Computer Science

Kent State University

Spring, 2006



Abstract – The following is a discussion of several methods of data discretization. Standard discretization methods, including user-supplied cutpoints, equal-width intervals, and equal-frequency intervals, produce results that are fragile, unreliable, or labor-intensive for the user. There are numerous data mining algorithms available for this purpose as well; this paper discusses implementations of ChiMerge and MDLP. Examples of the algorithms applied to a consistent set of data are included.

Index terms – data mining, discretization

The control of large numbers is possible, and like unto that of small numbers, if we subdivide them.

Sun Tze (544-496 BC)

Chinese military strategist

Everything should be made as simple as possible, but not simpler.

Albert Einstein (1879-1955)

German/U.S. physicist

He who can properly define and divide is to be considered a god.

Plato (ca. 427-347 BC)

Greek philosopher

I. Introduction

Discretization of continuous data is a controversial issue in the world of statistics. Honorable men can disagree as to whether discretization is necessary, and if so, which method to use. For some studies the answer is clear-cut; for others, there is no “optimal” answer to strive for. That makes for a rich publishing topic.

It is often beneficial in research to categorize continuous datasets. Discretization is a process in which a series of cutpoints is applied to a set of data (continuous or interval), creating a number of discrete groups (the number of groups is hereafter referred to with the variable k). Table 1 shows one possible discretization for an age variable within a dataset. The discretization in Table 1 does not represent all significant age categories. For example, age 19 is considered an adult but is not of drinking age (21), and this is not differentiated in the discretization; alcohol as a potential factor in a study is therefore not represented. User-supplied-cutpoints (USC) represent the simplest of discretization methods: the user manually decides the cut-values (c1..ck-1) for the k groups based on some external meaning attached to the values. The remaining discretization methods produce cut-values through an algorithm rather than a human decision.

Table 2 is a discretization of actual salary data from 1998, adjusted for 2006 inflation. This data has the convenience of group frequencies being approximately equal, and of cutpoints easily recognizable to humans. Statisticians may disagree on the “correct” cutpoints for this dataset. Table 3 presents an alternate discretization of the same data, this time with the requirement that all groups are equal in frequency. Which discretization is better, Table 2 or Table 3? It depends on the context of how the researcher intends to use the results. It is much easier to express the results in Table 2 via written or verbal communication, as the ranges are simple. It is certainly more concise to describe the upper class as “having a six-figure salary” rather than “having a salary of $94,834 or more” or “more than approximately 95 thousand dollars.” However, if having equal group frequencies is vital for some purpose in a research study, then Table 3 is a “better” discretization.

One must also consider the potential systematic problems with the Table 2 discretization. The mere existence of recognizable salaries may influence the data. Many employees may purposely seek jobs that pay at least some round number (such as $20,000) as a career goal, creating a significant portion of employees making just over $20,000 a year. It could be statistically beneficial to categorize these employees in group 1 (LC) instead of group 2 (LMC). A similar situation can occur at $40,000 or $100,000. This discussion is theoretical in nature, presenting two possible viewpoints of the same data.

User-supplied cut-values are generally used when there is some external meaning to the value. In the age example, there is a legal significance to 18 as the minimum cutpoint for adult. For reasonably-sized datasets, a good statistician can analyze the data for a few minutes and come up with reasonable cut-values. However, in the age of information, one cannot assume datasets are “reasonably-sized.” Sufficiently large datasets require an algorithm for analysis, and that is the motivation for this paper.

It should be noted that the discretization process itself causes a loss of accuracy within the data. An age of 23 contains far more information than “adult.” The advantage discretization provides is a categorical representation of the data, which can be easier to work with. However, it also allows a small increase in a variable to have drastic effects on the group selection process. At age 44 years and 364.9 days, one is an “adult,” in the same group as someone who just turned 18; the very next day he is “elderly.” The same can be said for the worker making $39,998 a year who gets a 10 cent/hour raise and is pushed into the middle-class category. Discretization is an option for researchers, and is not suited for every study.

Discretization is more of a support method for future statistical analysis. Rarely is discretization itself useful to anyone other than statisticians. To a layman, splitting data into ranges with distinct patterns is very nice, but not terribly interesting or glamorous. Once the discretization cutpoints are determined for continuous data, a researcher can use categorical statistical analysis on the data. The challenge remains – what is the best discretization method?

II. Problem Statement

The general format for most discretization algorithms is the same. The input is a set of continuous or interval data. The output is a set of k-1 cutpoints, creating groups g1…gk. Along with the continuous variable, there may be an associated discrete class value, as in the iris classification data used later in this paper. Some algorithms use only the continuous value to determine the cutpoints, while others use both the continuous and discrete values. The exact algorithms differ in purpose and nature.
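As a concrete illustration of this common format, the following Python sketch (the name and signature are our own, purely illustrative) shows the interface shared by the methods discussed below: a list of continuous values, a desired group count k, and an optional list of class labels go in; a list of k-1 cutpoints comes out.

def discretize(values, k, labels=None):
    """Return k-1 cutpoints for 'values', defining k groups.

    'labels' carries the associated class attribute when one exists;
    unsupervised methods (EWG, EFG) ignore it, while supervised
    methods (MDLP, ChiMerge) require it.
    """
    raise NotImplementedError  # each method below fills this in differently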

Discretization poses an interesting and rich problem. Usually in statistical algorithms there is an “optimal” answer to strive for, and comparisons between algorithms can be made on specific examples by measuring effectiveness against this optimal value. Discretization has no such optimal answer: the “best” answer depends on the context in which the data will be used. As such, comparisons must rely on some measure of discretization quality, which can be very subjective, yet some notion of quality is needed to design and evaluate an algorithm. A discretization method appropriate for one context may be inappropriate in another.

III. Algorithms

A. Equal-Width-Groups (EWG)

This first method is preferred by many statisticians. The range of the data is the only parameter required to create the cutpoints. EWG divides the data-range into k groups of equal width within the range of the data. The equal-width groups make for easier statistical analysis. While statisticians may prefer this method, it has significant disadvantages. Group frequencies have no bearing on the cutpoints, so one group could contain most of the members while others remain empty.

Algorithm 1 – Equal Width Groups

mn = min(D)

mx = max(D)

range = mx – mn

size = range/k

for i = 1 to k-1

cut[i] = mn + i * size

return cut
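For concreteness, a minimal Python rendering of Algorithm 1 follows (a sketch of our own; D is assumed to be a non-empty list of numbers and k the desired number of groups).

def equal_width_cutpoints(D, k):
    """Equal-Width-Groups: return k-1 cutpoints of equal spacing."""
    mn, mx = min(D), max(D)
    size = (mx - mn) / k                 # width of each group
    # cutpoints at mn + size, mn + 2*size, ..., mn + (k-1)*size
    return [mn + i * size for i in range(1, k)]

# Example: equal_width_cutpoints([1, 2, 4, 7, 9, 10], 3) returns [4.0, 7.0]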

B. Equal-Frequency-Groups (EFG)

Another common discretization method involves creating cutpoints with approximately equal numbers of members within each group. The basic idea is simple: read the stored values in sorted order until a calculated group size is reached. The statistical advantage of having equal-frequency groups is clear.

Algorithm 2 – Equal Frequency Groups

sort(D)

size = n / k

for i = 1 to k-1

cut[i] = D[i*size]

return cut
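A corresponding Python sketch of Algorithm 2 (again our own illustration): like the pseudocode, it simply takes every (n/k)-th sorted value as a cutpoint and makes no attempt to resolve the repeated-value problem discussed below.

def equal_frequency_cutpoints(D, k):
    """Equal-Frequency-Groups: return k-1 cutpoints giving roughly n/k values per group."""
    data = sorted(D)
    size = len(data) // k                # target frequency per group
    # the i-th cutpoint is the first value of group i+1
    return [data[i * size] for i in range(1, k)]

# Example: with 20 sorted values and k = 4, cutpoints fall at positions 5, 10, 15.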

The concept of “continuous” is somewhat of a misnomer in statistical data. In “well-behaved continuous” data, no two values are equal, and the algorithm above works perfectly fine on such data. However, due to precision limitations, all stored values are discrete, whether that is the intent or not. In reality, height is continuous: no two people are exactly the same height. It is only due to measurement limitations that person A and person B could both be 5 foot 9 inches tall. Or 5 foot 9.3 inches. Or 5 foot 9.38 inches. The mathematical term continuous implies that no two members will have the exact same value. In digital data, that implication is false: multiple occurrences of the same value are common, and when the data is sorted these equal values are grouped together. With this revised definition of “continuous,” members with equal values must go in the same group. The purpose of discretizing data is to choose appropriate cut-values, possibly for future incoming data, so it must be clear to which group a data element belongs.

If the data contains a series of equal values that happens to overlap with an optimal cutpoint, a decision must be made to determine whether the actual cutpoint should contain or exclude those members. Observe the datasets in table 4. The goal is to create a 4-group discretization for the data using three optimal cutpoints, A, B, and C.

EFG divides the data into groups of equal size. To calculate the optimal group size, divide the size of the dataset by the number of groups. For the dataset in Table 4, the optimal size is 20/4 = 5 data elements. Notice that dataset A has no repeated values. For well-behaved data with no repeated values, the equal-frequency-groups algorithm works perfectly for any number of groups.

Dataset B in Table 4 poses a difficult problem. The first optimal cutpoint is valid because the last element in group 1 is different from the first element in group 2. The second cutpoint is not valid, as the two values of 1.98 must be in the same group. Either both 1.98-values should be in group 2 (resulting in six members for group 2), or both should be put in group 3 (resulting in four members for group 2). The decision to include or exclude overlapping values can have drastic effects on future decisions. The third cutpoint presents a more difficult problem, as all five 2.09-values must be in the same group. If they are put in group 3, it will be larger than optimal, and group 4 will have only two members! If they are put in group 4, then group 3 will have only two members, and group 4 will have seven members. It is not difficult to create a dataset that raises even more difficult questions for the algorithm.

Table 5 presents four possible discretizations for dataset B in table 4. The only difference between them is the decision to include or exclude equal values that overlap an optimal cutpoint. Which is the best? It is clear that the group sizes are closest to the optimal in the first set of results. But these are minor decisions that drastically affect the outcome.

There are further disadvantages to the EFG method. For large datasets with many cutpoints, the same algorithm can produce two different, yet equally valid, cutpoint arrays from the same data. If the decision to include or exclude is made arbitrarily whenever a proposed cutpoint falls in the middle of a run of equal values, a “domino effect” can alter the remainder of the cutpoints: one minor decision can have great effects on the outcome. Stability and robustness are key in any statistical measure, yet here the decision to include or exclude is made from localized choices about the surrounding data.
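To make the include-or-exclude decision concrete, here is one possible tie-aware variant of EFG in Python (a sketch of our own, not one of the published algorithms): each ideal cut position is snapped to the nearer end of any run of equal values straddling it, so a run is never split. This rule is exactly the kind of localized choice criticized above; the sketch merely makes the decision explicit.

def equal_frequency_cutpoints_tie_aware(D, k):
    """EFG variant that never splits a run of equal values (assumes 1 <= k <= len(D))."""
    data = sorted(D)
    n = len(data)
    cuts = []
    for i in range(1, k):
        pos = i * n // k                  # ideal boundary: between pos-1 and pos
        if data[pos - 1] != data[pos]:
            cuts.append(data[pos])        # clean cut, use it directly
            continue
        lo, hi = pos, pos
        while lo > 0 and data[lo - 1] == data[pos]:
            lo -= 1                       # lo = start of the run of equal values
        while hi < n and data[hi] == data[pos]:
            hi += 1                       # hi = first index past the run
        snap = lo if pos - lo <= hi - pos else hi   # nearer end of the run
        if 0 < snap < n:                  # a run reaching either end yields no cut
            cuts.append(data[snap])
    return cuts

Note that a sufficiently long run can still absorb a cutpoint entirely, which illustrates how fragile such localized rules are.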

While EFG is a valid method of discretization, stronger methods are necessary to properly break up continuous data. The major drawback of EFG is its lack of stability. As such, data miners have developed more robust methods, in which minor differences in decisions do not lead to drastically different results.

Another disadvantage of EFG is that only the continuous variable is considered. Many datasets (including the iris classification dataset [7] used in this paper) have an associated class attribute as a discrete value for each instance in the data. EFG does not consider a class attribute in the frequencies.

Mathematically simple discretization algorithms such as EWG and EFG were initially used for most discretization tasks. While found to be generally effective, they were fragile, occasionally producing discretizations with obvious and serious deficiencies. Statisticians were seldom confident that a given discretization was reasonable for the given research context. Oftentimes analysts would create several discretizations and use human decisions to choose which was best. For large datasets, this is quite difficult.

C. Minimum Description Length Principle (MDLP)

Typically, when confronted with the problem of discretizing continuous values without class labels, unsupervised methods such as EFG or EWG are natural choices. In the presence of a class attribute for each instance in the data, however, such approaches are inadequate: they do not take the class attribute into account, and thus produce a less meaningful partitioning of the continuous range.

Supervised discretization methods consider the class labels when discretizing a continuous range. The goal is to produce discrete intervals that convey maximal information about the given range. One such method of conveying information about the intervals is to measure the class information entropy [1], E(A,T;S), and compare it with the entropy of the (whole) continuous range.

The goal of supervised discretization methods that use entropy is to induce a split of the set S into S1 and S2 via a cutpoint T such that the class information entropy of the partition,

E(A,T;S) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2),

is minimized. The problem then is to find the cutpoint TA (where TA is a cutpoint on a boundary of attribute A) that results in the minimum E(A,TA;S). Algorithm 3 shows a simple algorithm for selecting such a cutpoint. The intuition is to find a cut that partitions S into two parts such that the weighted sum of the partition entropies is less than the entropy of S.

The cutpoint TA returned by the MDLP algorithm splits S into two partitions with minimum class information entropy, thus inducing a binary discretization of the set S. To achieve more than two discretization intervals, one can recursively apply the binary discretization algorithm to the resulting partitions, selecting the best cutpoints returned. But one is then faced with a different problem: deciding when to refrain from splitting a partition further.

Algorithm 3 -- MDLP

S = sort(dataset, feature)

GetBestCutPoint (S)

{

cuts = getCandidateCuts(S)

min = E(A,cuts[0];S)

T = cuts[0]

for i = 1 to |cuts|-1

    if ( E(A,cuts[i];S) < min )

        T = cuts[i]

        min = E(A,cuts[i];S)

return T

}
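To make this search concrete, the following Python sketch (our own illustration, not taken from ChiZ or any published implementation) computes Ent(S), the class information entropy E(A,T;S) of a candidate split, and returns the candidate boundary that minimizes it. The data are assumed to be (value, class) pairs, and for simplicity every midpoint between distinct adjacent values is treated as a candidate cut.

from math import log2
from collections import Counter

def class_entropy(labels):
    """Ent(S): entropy of the class distribution among 'labels'."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partition_entropy(pairs, t):
    """E(A,T;S): frequency-weighted entropy of the split induced by cutpoint t."""
    left  = [c for v, c in pairs if v <= t]
    right = [c for v, c in pairs if v > t]
    n = len(pairs)
    return (len(left) / n) * class_entropy(left) + \
           (len(right) / n) * class_entropy(right)

def get_best_cutpoint(pairs):
    """Return the candidate boundary with minimum class information entropy."""
    pairs = sorted(pairs)                          # sort by the continuous value
    cuts = [(a[0] + b[0]) / 2                      # midpoints between distinct values
            for a, b in zip(pairs, pairs[1:]) if a[0] != b[0]]
    return min(cuts, key=lambda t: partition_entropy(pairs, t))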

Fayyad and Irani introduce MDLP in [5] as a measure which can be used as a stopping criterion when performing discretization using entropy splitting. In simple terms, MDLP can be considered an optimization problem of encoding a data set into partitions using a minimal number of bits. The optimal encoding for the partitions could be found using Huffman coding, but in this case we are only interested in the average length of the minimal encoding of each partition. This average code length is easily obtained by measuring the entropy value of the partition. Thus a cost function Φ for the optimization problem measures the average code length needed to encode the data set, and the partition induced by a cutpoint T is accepted if and only if Φ(partition-accepted) is less than Φ(partition-not-accepted). The MDLP stopping criterion in [5] thus states that a partition on T is accepted if and only if

Gain(A,T;S) > log2(N-1)/N + Δ(A,T;S)/N

where

Gain(A,T;S) = Ent(S) - E(A,T;S)

and

Δ(A,T;S) = log2(3^k - 2) - [k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)],

with N the number of instances in S and k, k1, k2 the number of distinct classes present in S, S1, and S2 respectively.

Algorithm 4 – Entropy Splitting

S = sort(dataset, feature)

Splitting (S)
{
    T = GetBestCutPoint(S)
    if MDLP-StopCriterion(S,T) == true
        return
    S1 = GetLeftPart(S,T)
    S2 = GetRightPart(S,T)
    Splitting(S1)
    Output T
    Splitting(S2)
}

Algorithm 4 performs discretization using entropy as the splitting criterion and MDLP as a stopping criterion. It utilizes a divide and conquer approach to obtain multiple discretization intervals on the set S by recursively splitting the partitions induced by some cutpoint and then using MDLP to determine if the partition should be accepted or not.
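The following Python sketch (our own, not the ChiZ code or the exact procedure from [5]) shows one way Algorithm 4 can be realized, using the standard form of the Fayyad-Irani criterion given above: split recursively at the best cutpoint and keep a cut only when its information gain exceeds the MDLP threshold. Compact versions of the entropy and best-cut helpers from the previous sketch are restated so the block stands alone.

from math import log2
from collections import Counter

def ent(labels):
    """Class entropy Ent(S)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(pairs):
    """Candidate boundary with minimum class information entropy E(A,T;S)."""
    cuts = [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:]) if a[0] != b[0]]
    def e(t):
        s1 = [c for v, c in pairs if v <= t]
        s2 = [c for v, c in pairs if v > t]
        return (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(pairs)
    return min(cuts, key=e) if cuts else None

def mdlp_accept(pairs, t):
    """MDLP test: accept the cut at t iff Gain > (log2(N-1) + Delta) / N."""
    n  = len(pairs)
    s  = [c for _, c in pairs]
    s1 = [c for v, c in pairs if v <= t]
    s2 = [c for v, c in pairs if v > t]
    gain = ent(s) - (len(s1) / n) * ent(s1) - (len(s2) / n) * ent(s2)
    k, k1, k2 = len(set(s)), len(set(s1)), len(set(s2))
    delta = log2(3 ** k - 2) - (k * ent(s) - k1 * ent(s1) - k2 * ent(s2))
    return gain > (log2(n - 1) + delta) / n

def mdlp_split(pairs, cutpoints):
    """Recursively split 'pairs' (sorted (value, class) tuples),
    appending each accepted cutpoint to 'cutpoints'."""
    t = best_cut(pairs)
    if t is None or not mdlp_accept(pairs, t):
        return                                    # MDLP stopping criterion
    mdlp_split([p for p in pairs if p[0] <= t], cutpoints)
    cutpoints.append(t)                           # output T in sorted order
    mdlp_split([p for p in pairs if p[0] > t], cutpoints)

# Usage sketch:
#   pairs = sorted(zip(values, labels))
#   cuts = []
#   mdlp_split(pairs, cuts)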

D. ChiMerge

The ChiMerge algorithm was written in 1992 by Randy Kerber [2]. It was later modified by Huan Liu and Rudy Setiono, creating Chi2 [3]. The following is a discussion of ChiZ, a C++ implementation of ChiMerge, written by Dean Zeller in 2006. While there were other implementations in C++ and Lisp available on the Internet, this version was written completely from scratch. The process of “reinventing the wheel” proved to be quite valuable in learning how the ChiMerge algorithm works. ChiZ was written based only on the general descriptions of the algorithm presented in Kerber’s paper. There are discrepancies between the results of ChiMerge and ChiZ on the well-known iris classification problem [7]. These discrepancies will be shown and analyzed.

χ² (pronounced “kīskwâr”) is a statistical test that measures the distance between actual frequencies and expected frequencies in a given table. ChiMerge uses χ² to choose between two options:

1. the relative class frequencies of adjacent intervals are distinctly different and the intervals should not be merged, or

2. the relative class frequencies of adjacent intervals are similar enough to justify merging them into a single interval.

χ² tests the hypothesis that two discrete attributes are statistically independent. The basic calculations for the test are not complex: it measures the difference between the frequencies expected from the data as a whole and the frequencies observed in a specific row. If the data is completely random, group sizes will be approximately what the overall frequencies predict, and χ² will be near zero. In the case of discretization, the test addresses the hypothesis that the class attribute is independent of which of two adjacent intervals an example belongs to. If χ² indicates the class is independent of the intervals, then they should be merged. If χ² concludes they are not independent, the difference in relative class frequencies is statistically significant and the intervals should therefore remain separate. χ² is calculated as follows [2]:

χ² = Σ(i=1..m) Σ(j=1..k) (A_ij - E_ij)² / E_ij

where

m = 2 (there are two intervals being compared)