UNIT 3

PREDICTIVE MODELING

Introduction

Classification is a supervised learning method. It attempts to discover the relationship between the input attributes and the target attribute. Clustering is a method of grouping data to form clusters. This chapter focuses on the design, implementation and analysis of classification and clustering processes.

Learning Objectives

  • To explore the types of classifiers
  • To explore classifier types such as decision trees, Bayesian classifiers and other classifiers
  • To study the evaluation of classifiers
  • To explore clustering algorithms
  • To analyze and evaluate clustering algorithms

3.1 Classification and Prediction Model

Classification is a supervised learning method. The input attributes of a classification algorithm are called independent variables, and the target attribute is called the dependent variable. The relationship between the input and target variables is represented in the form of a structure called a classification model. The classification problem can be stated formally as follows.

Formal Definition

Given a database D consisting of tuples ti, where each tuple has a set of input attributes and a nominal target attribute Y drawn from an unknown fixed distribution over a labelled space, the classification problem is to define a mapping function that maps D to a set of classes C such that each tuple ti is assigned to a class with minimum generalization error.

The classes must be

  • Predefined
  • Non-overlapping
  • Such that they partition the entire database

For example, an email classifier may classify incoming mail as valid or invalid (spam). It may classify an incoming email as spam if

  • It comes from an unknown address.
  • It possibly comes from a marketing source from which the user does not want to receive mail.
  • It may carry viruses or Trojans that the user does not want to receive.
  • It may contain content that is not suitable for the user.

An email classification program may classify mails as valid or spam based on these attributes.

A classification model is both descriptive and predictive. As a descriptive model it can explain the given dataset, and as a predictive model it can be used to classify unseen records.

Classification models can also be constructed using regression models. Regression models map the input space into a real-valued domain; they can therefore be used to predict not only class labels but also other variables that are real valued. Estimation and prediction can be viewed as types of classification. For example, guessing the grade of a student is a classification problem. Prediction is the task of assigning an instance to one of a set of possible classes. While prediction may involve continuous variables, classification is generally restricted to discrete variables.

The classification model is implemented in two phases.

  1. Training phase

  2. Testing phase

In phase 1, a suitable model is created from a large set of training data. The training data should cover as many combinations of attribute values as possible. Based on this data, a data-driven classification model is created. Once the model is developed, it is applied to classify the tuples of the target database.
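As an illustration, the following is a minimal sketch of this two-phase workflow using scikit-learn; the decision tree classifier, the 70/30 split and the accuracy measure are illustrative choices and not part of the original text.

# A sketch of the two-phase workflow, assuming the data is already encoded
# as numeric feature vectors X with class labels y.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def train_and_test(X, y):
    # Phase 1 (training): fit a classification model on the training portion.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Phase 2 (testing): apply the trained model to unseen tuples and measure accuracy.
    predictions = model.predict(X_test)
    return model, accuracy_score(y_test, predictions)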

3.2 Issues Regarding Classification and Prediction

The major issues associated with the implementation of this model are

1. Overfitting of the model

The quality of the model depends on the amount of good-quality data. However, if the model fits the training data too exactly, it may not be applicable to a broader population of data. A classifier should therefore not be fitted to the training data more closely than necessary; otherwise it leads to generalization error.

2. Missing data

Missing values may cause problems during both the training and testing phases. Missing data can force classifiers to produce inaccurate results, and this is a perennial problem for classification models. Hence suitable strategies should be adopted to handle missing data.

3. Performance

The performance of a classifier is determined by evaluating its accuracy. The process of classification itself can be fuzzy; for example, classification of emails requires extensive domain knowledge and the involvement of domain experts. Hence the performance of the classifier is crucial.

3.3 Types of Classifiers

There are different methodologies for constructing classifiers. Some of the methods are given below.

  • Statistical methods use statistical information for classification. Typical examples are Bayesian classifiers and linear regression methods.
  • Distance based methods use distance measures or similarity measures for classification.
  • Decision tree methods are another popular category; they perform classification using decision trees.
  • Rule based methods use IF-THEN rules for classification.
  • Soft computing techniques like neural networks, genetic algorithms and rough set theory are also suitable for classification purposes.

In this chapter, the statistical methods and decision tree methods are explored in detail, and the other methods are briefly outlined.

3.3.1 Decision Tree Induction

A decision tree is a classifier that recursively partitions the tuples. For example, to determine whether a mail is valid or spam, a series of questions is asked repeatedly, such as

Is it from a legitimate source?

Does it contain any illegal program content?

Each question results in a partition. For example, Figure 3.1 shows a split.

Figure 3.1: Example of a Split

The set of possible questions and answers is organized in the form of a decision tree, which is a hierarchical structure of nodes and edges. Figure 3.2 shows the structure of a decision tree. It can be noted that the tree has three types of nodes.

Figure 3.2: Sample Decision Tree

Root:

The root node is a special node of the tree. It has no incoming edges. It can have zero or more outgoing edges.

Internal nodes:

These nodes, like the root node, are non-terminal nodes. They contain the attribute test conditions used to separate the records.

Terminal nodes:

Terminal nodes are also known as leaf nodes. These represent classes and have one incoming edge and no outgoing edges.

The basic questions involved in inducing a decision tree are

  1. How should the records be split?

A suitable attribute test condition, based on some objective measure of quality, should be selected to divide the records into smaller subsets.

  2. What is the stopping criterion?

A natural answer is to repeat the procedure until all the records in a leaf belong to the same class. Sometimes the procedure can be terminated earlier to gain some advantage.

Some of the attribute test conditions and their corresponding outcomes for different attribute types are given below.

Binary attributes:

The test condition for a binary attribute generates two outcomes. Checking gender, for example, is such an attribute test. A split on a binary attribute always results in two children. Figure 3.3 shows an example of a binary attribute.

Figure 3.3: Binary Attribute

Nominal attributes:

Nominal attributes can have many values. A split on a nominal attribute can be shown as a multi-way split, where the number of outcomes depends on the number of distinct values of the attribute. It can also be shown as a binary split; for an attribute with k distinct values there are 2^(k-1) - 1 ways of forming such binary splits (for three values, 2^2 - 1 = 3 ways). Some examples of the multi-way split are shown in Figure 3.4 and Figure 3.5.

Figure 3.4: A Multi-way split

Or as a binary split as

Figure 3.5: Multiway Split as Binary Split

Ordinal attributes:

Ordinal attributes produce binary or multi-way splits. Ordinal values can be grouped as long as the basic order is not violated.

For example, consider an attribute with the values {low, medium, high}. The different valid ways of grouping these values are shown in Figure 3.6.

Figure 3.6: Ordinal Type

A grouping such as {low, high} is invalid because it violates the order.

Continuous attributes:

Test conditions on continuous attributes can be expressed as comparisons with binary outcomes or as multi-way splits. For example, a split on student test marks can be expressed as a binary split or a multi-way split, as shown in Figures 3.7a and 3.7b.

Figure 3.7a: Split of Continuous Attribute.


Figure 3.7b: Split of Continuous Attribute.

Splitting criteria

The measures developed for selecting the best split are based on the degree of impurity of the child nodes. A smaller degree of impurity means a more skewed class distribution. If an attribute test condition distributes the records evenly among all the classes, the node is considered to exhibit the highest impurity.

The degree of impurity is quantified by measures such as entropy, the Gini index and information gain.

1. Entropy

The entropy of a node t is measured as

Entropy(t) = - Σ p(i | t) log2 p(i | t)

where the sum runs over the c classes and p(i | t) is the fraction of records at node t that belong to class i.

Entropy values range from zero to one. If the value is zero, then all the instances belong to a single class; the entropy reaches its maximum value (one, in the two-class case) when the instances are equally distributed among all the classes.

2. Gini index

The Gini index of a node t is expressed as

Gini(t) = 1 - Σ [p(i | t)]^2

where the sum again runs over the c classes.

3. Information gain

If a node t is partitioned into r subsets t1, t2, ..., tr, then the gain of the split is measured as

Gain(t1, t2, ..., tr) = I(t) - Σ (|tj| / |t|) I(tj), for j = 1, ..., r

where |t| is the cardinality of t and I(t) can be either the entropy or the Gini index.
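To make these measures concrete, the following is a minimal Python sketch (not part of the original text) that computes entropy, the Gini index and the gain of a split from class counts.

import math

def entropy(class_counts):
    # Entropy of a node, given the number of records in each class.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def gini(class_counts):
    # Gini index of a node, given the number of records in each class.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gain(parent_counts, children_counts, impurity=entropy):
    # Gain of a split: parent impurity minus the weighted impurity of the children.
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

For instance, entropy([4, 5]) returns 0.9911, the value used in the ID3 example of the next section.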

3.3.2 ID3 Algorithm

The ID3 algorithm is a top-down approach that develops a tree recursively. It uses the information-theoretic measure of information gain to construct the decision tree. At every node, the procedure selects the attribute that provides the greatest gain, that is, the greatest decrease in entropy.

The algorithm can be stated as follows

  1. Calculate the initial value of the entropy

Entropy(t) = - Σ p(i | t) log2 p(i | t)

  2. Select the attribute that results in the maximum decrease of entropy, that is, the maximum gain in information. This attribute serves as the root of the decision tree.

  3. The next level of the decision tree is constructed using the same criterion.

  4. Steps 2 and 3 are repeated until a decision tree is obtained in which every instance is assigned to a class, that is, the entropy of every leaf is zero.
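The following is a compact, illustrative Python sketch of this recursive procedure (the function and variable names are my own, not part of the original text); each record is a dictionary mapping attribute names to values.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def id3(records, labels, attributes):
    # Stop when all records belong to one class (entropy of the node is zero).
    if len(set(labels)) == 1:
        return labels[0]
    # Stop when no attributes remain: return the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Weighted entropy of the children produced by splitting on attr;
    # the attribute with the smallest value gives the largest information gain.
    def split_entropy(attr):
        groups = {}
        for record, label in zip(records, labels):
            groups.setdefault(record[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)

    # Build one subtree per distinct value of the chosen attribute.
    tree = {best: {}}
    for value in {record[best] for record in records}:
        sub = [(r, l) for r, l in zip(records, labels) if r[best] == value]
        sub_records, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_records, sub_labels, remaining)
    return tree

Applied to the data of Table 3.1 below, this sketch selects A1 at the root, matching the hand calculation that follows.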

Example: Consider the data set shown in Table 3.1.

Instance / A1 / A2 / Class
1 / Yes / Yes / C1
2 / Yes / Yes / C1
3 / Yes / No / C2
4 / No / No / C1
5 / No / Yes / C2
6 / No / Yes / C2
7 / No / No / C2
8 / Yes / No / C1
9 / No / Yes / C2

Table 3.1: Sample Data Set.

The decision tree can be constructed as follows.

There are four C1 tuples and five C2 tuples. Hence the class probabilities are

P(C1) = 4/9 and P(C2) = 5/9

The entropy of the training sample is given as

Entropy = -4/9 log2 (4/9) – 5/9 log2 (5/9)

= 0.9911

The Information gain of the attributes A1 and A2 can be calculated as follows.

For the attribute A1, the corresponding contingency table is shown in Table 3.2.

A1 / C1 / C2
Yes / 3 / 1
No / 1 / 4

Table 3.2: Contingency Table of A1

The entropy of the split on attribute A1 is

= 4/9 [ -(3/4) log2(3/4) - (1/4) log2(1/4) ] + 5/9 [ -(1/5) log2(1/5) - (4/5) log2(4/5) ]

= 0.7616

Therefore the information gain for the attribute A1 is

= 0.9911 – 0.7616

= 0.2294

The same procedure can be repeated for the attribute A2

For the attribute A2, the corresponding contingency table is shown in Table 3.3.

A2 / C1 / C2
Yes / 2 / 3
No / 2 / 2

Table 3.3: Contingency Table of A2

The entropy of the split on attribute A2 is

= 5/9 [ -(2/5) log2(2/5) - (3/5) log2(3/5) ] + 4/9 [ -(2/4) log2(2/4) - (2/4) log2(2/4) ]

= 0.9839

Therefore the information gain for the attribute A2 is

= 0.9911 - 0.9839

= 0.0072

Since A1 yields the larger information gain, the best attribute to split on is A1.
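The calculation above can be checked with a short Python snippet (illustrative only), using the class counts from Tables 3.2 and 3.3.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class distribution of Table 3.1: four C1 tuples and five C2 tuples.
parent = entropy([4, 5])                                              # 0.9911

# Contingency table of A1 (Table 3.2): Yes -> (3, 1), No -> (1, 4).
gain_A1 = parent - (4/9 * entropy([3, 1]) + 5/9 * entropy([1, 4]))    # 0.2294

# Contingency table of A2 (Table 3.3): Yes -> (2, 3), No -> (2, 2).
gain_A2 = parent - (5/9 * entropy([2, 3]) + 4/9 * entropy([2, 2]))    # 0.0072

print(gain_A1, gain_A2)   # A1 has the larger gain, so A1 is chosen for the split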

Gini Index Calculation

The Gini index for the attribute A1 is

= 4/9 [ 1 - (3/4)^2 - (1/4)^2 ] + 5/9 [ 1 - (1/5)^2 - (4/5)^2 ]

= 0.3444

The Gini index for the attribute A2 is

= 5/9 [ 1 - (2/5)^2 - (3/5)^2 ] + 4/9 [ 1 - (2/4)^2 - (2/4)^2 ]

= 0.4889

Since the Gini index for the attribute A1 is smaller, A1 is chosen first. The resulting decision tree is shown in Figure 3.8.

Figure 3.8: Decision Tree

One of the inherent benefits of a decision tree model is that it can be understood by many users because it resembles a flowchart. The advantages of the decision tree are its graphical format and the interactivity it offers the user in exploring the tree.

Decision tree output is often presented as a set of rules of the form IF-THEN. Rules provide a more concise way of expressing the knowledge, especially when the tree is too large and difficult to understand.

Decision trees can be both predictive and descriptive models. They can help us predict on a case-by-case basis by navigating the tree. More often, prediction is carried out automatically, with little interference from the user: many new cases are run through the tree or rule set and an output file with the predicted values is generated. In an exploratory mode, the decision tree gives insight into the relationships between the independent and dependent variables, thus aiding data investigation.

Overfitting:

A decision tree that performs well on the given training set but fails on test tuples is said to lack the necessary generalization. This is called the overfitting problem. Overfitting is caused by the presence of noise or irrelevant data in the training set, or by too small a number of training records. Overfitting reduces classifier accuracy drastically.

After a decision tree is formed, it must be explored. Exploration may reveal nodes or subtrees that are undesirable; this happens because of overfitting. Pruning is a common technique used to remove the splits and subtrees created by overfitting. Pruning is controlled by user-defined parameters that cause splits to be pruned. By adjusting these parameters, the user can experiment with tree induction to obtain an optimal tree suited for effective prediction.

The methods that are commonly used to avoid overfitting are

  1. Pre-pruning: In this strategy the growth of the tree beyond a certain point is stopped, based on measures such as chi-square or information gain. The decision tree is then assessed for goodness of fit.

  2. Post-pruning: In this technique the tree is allowed to grow fully, and then post-pruning techniques are applied to remove the unnecessary branches. This procedure leads to a compact and reliable tree. The post-pruning measures involve

  • cross validation of the data
  • Minimum Description Length (MDL)
  • compilation of statistical bounds

In practice, pre-pruning and post-pruning are combined to achieve the desired result. Post-pruning requires more effort than pre-pruning.
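As an illustration only, the sketch below shows how pre-pruning and post-pruning might look with scikit-learn's decision tree on its built-in iris data; here cost-complexity pruning stands in for the post-pruning measures listed above, and the parameter values are arbitrary.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: bound the tree size while it is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune subtrees via cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # one candidate; normally chosen by validation
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))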

Although pruned trees are more compact than the originals, they may still be large and complex. Two problems typically encountered are repetition and replication. Repetition is the repeated testing of an attribute along a path (such as testing if age < 30 and then if age < 15, and so on). In replication, duplicate subtrees are present in the tree. Good use of multivariate splits based on combined attributes can solve these problems.

Testing a Tree

A decision tree should be tested and validated before being integrated with a decision support system. It is also desirable to test the tree periodically over a longer period to ensure that it maintains the desired accuracy, because classifiers are known to be unstable.

3.4 Bayes Classification

A Bayesian classifier is a probabilistic model that estimates the class of new data. Bayes classification is both descriptive and predictive.

Let us assume that the training data has d attributes, which can be numeric or categorical. Let each point xi be the d-dimensional vector of attribute values xi = {xi1, xi2, ..., xid}.

The Bayes classifier maps xi to ci, where ci is one of the k classes

C = {c1, c2, ..., ck}

Let p(ci | xi) be the conditional probability of class ci given xi; the classifier assigns xi to the class with the highest such probability.

Bayes theorem can be given as

p(cj | xi) = p(xi | cj) p(cj) / p(xi)

From the training set, the values that can be determined are p(xi), p(xi | cj) and p(cj). From these, Bayes theorem helps us determine the posterior probability p(cj | xi).

The procedure is given as

  • First determine the prior probability p(cj) for each class by counting how often that class appears in the data set.
  • Similarly, the number of occurrences of xi can be counted to determine p(xi).
  • Likewise, p(xi | cj) can be estimated by counting how often xi occurs among the instances of class cj.

From these derived probabilities, a new tuple is classified using the Bayesian method of combining the effects of different attribute values.

Suppose the tuple ti has p independent attribute values {xi1, xi2, ..., xip}. Then we can estimate p(ti | cj) by

p(ti | cj) = Π p(xik | cj), the product taken over k = 1, ..., p

In other words, ti is assigned to class cj

iff

p(cj | ti) > p(ci | ti) for every other class ci

Advantages:

  1. It is easy to use.
  2. Only one scan of the database is used.
  3. Missing values do not cause much of a problem, as they are ignored.
  4. In a dataset with simple relationships, this technique would produce good results.

Disadvantages:

  1. The Bayes classifier assumes that the attributes are always independent. This assumption does not hold for many real-world datasets.
  2. Bayes classification, in this simple form, does not work for continuous data.

Example:

Consider the simple data set shown in Table 3.4. Let us apply the Bayesian classifier to predict the class of X = (1, 1).

a1 / a2 / class
1 / 0 / c1
0 / 1 / c1
1 / 2 / c2
0 / 1 / c2
3 / 1 / c2

Table 3.4: Sample Data set

Here the class c1 has 2 tuples and the class c2 has 3 tuples, so the prior probabilities are

p(c1) = 2/5 ; p(c2) = 3/5

The conditional probabilities are estimated from the table:

p(a1 = 1 | c1) = 1/2 ; p(a2 = 1 | c1) = 1/2

p(a1 = 1 | c2) = 1/3 ; p(a2 = 1 | c2) = 2/3

Assuming the attributes are independent,

p(x | c1) = p(a1 = 1 | c1) * p(a2 = 1 | c1)

= 1/2 * 1/2

= 1/4

p(x | c2) = p(a1 = 1 | c2) * p(a2 = 1 | c2)

= 1/3 * 2/3

= 2/9

These values are used to evaluate the posteriors (up to the common factor p(x)):

p(c1 | x) = 1/4 * 2/5

= 1/10

p(c2 | x) = 2/9 * 3/5

= 2/15

p(c1 | x) < p(c2 | x)

Hence the sample is predicted to be in class c2.
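This hand calculation can be reproduced with a short, illustrative Python snippet that estimates the priors and conditional probabilities directly from Table 3.4.

from collections import Counter

# Training tuples of Table 3.4: ((a1, a2), class)
data = [((1, 0), 'c1'), ((0, 1), 'c1'), ((1, 2), 'c2'), ((0, 1), 'c2'), ((3, 1), 'c2')]
x = (1, 1)   # tuple to classify

class_counts = Counter(label for _, label in data)
scores = {}
for c, count in class_counts.items():
    prior = count / len(data)
    likelihood = 1.0
    for i, value in enumerate(x):
        # p(ai = value | c): fraction of class-c tuples having that attribute value
        matches = sum(1 for attrs, label in data if label == c and attrs[i] == value)
        likelihood *= matches / count
    scores[c] = prior * likelihood   # posterior up to the common factor p(x)

print(scores)                        # {'c1': 0.1, 'c2': 0.1333...}
print(max(scores, key=scores.get))   # c2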

Bayesian Network

A Bayesian network is a graphical model that shows the probabilistic relationships among the variables of a system. A Bayesian network consists of vertices and edges: the vertices represent the variables, and the edges represent their interrelationships together with associated probability values. Using the conditional probabilities, we can reason about the system and calculate unknown probabilities. Figure 3.9 shows a simple Bayesian network.


Figure 3.9: Simple Bayesian Network.

Here the nodes or vertices are variables, and an edge connects two nodes. If a causal relationship exists, that is, A is the cause of B, then the edge is directed. Otherwise the relationship is just a correlation, and the edge is undirected. In the above example there is a correlation between A and B, and between B and C. There is no direct relationship between A and C, but there may still be some influence, since both are connected to B.

In addition, each node has an associated conditional probability table which stores the probabilities that may be used to reason or make inferences within the system.

The probability here is a Bayesian probability, a subjective measure that indicates the degree of belief in an event. The difference between physical probability and Bayesian probability is that Bayesian probability does not require repeated trials. For example, consider a game whose outcome cannot be determined from earlier trials and whose trials cannot be repeated to make frequency-based probability calculations; here the probability is a degree of belief.

Apart from the Bayesian probability, an edge can represent a causal relationship between variables. If manipulating a node X by some action sometimes changes the value of Y, then X is said to be a cause of Y.

Causal relationships also have a strength, which is expressed by associating a number with the edge.

The conditional probability for this model is P(Xi | Pai), where Pai is the set of parents of Xi, that is, the nodes that render Xi independent of all its other predecessors. The joint probability distribution over the variables X1, X2, ..., Xn can then be given as

P(X1, X2, ..., Xn) = Π P(Xi | Pai), the product taken over i = 1, ..., n
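As a small illustration of this factorization, the snippet below computes a joint probability for a chain of three binary variables A -> B -> C; all the conditional probability values are hypothetical and chosen only for the example.

# Hypothetical conditional probability tables for the chain A -> B -> C.
p_a = {True: 0.3, False: 0.7}                    # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},    # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},    # P(C | B)
               False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    # P(A=a, B=b, C=c) as the product of each node's probability given its parents.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

print(joint(True, True, False))   # 0.3 * 0.8 * 0.5 = 0.12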