MALINI RAJAH
ID# 26402822
CSE 7331 Fall 2007 FINAL
DUE DATE: 12/15/07 8:00 am
General Description:
The study of biodegradation of compounds in nature is an important research area in Environmental Engineering. However, the accurate prediction of which compounds actually biodegrade and the speed with which they degrade is a difficult problem yet to be solved. Previous prediction algorithms tend to rely on structural properties of compounds to do the prediction and create somewhat simplistic linear regression models. (Part of the problem with previous prediction algorithms is the lack of large amounts of reliable data. Unfortunately this will also be a problem with this project.)
Your project requires the development and comparison of data mining classification algorithms to predict biodegradability of compounds. You are provided a small dataset from which machine learning can take place. (If possible, a separate validation dataset will be provided.)
Requirements:
You are to implement and compare the performance of three different classification algorithms using three different sets of attributes. You must choose the algorithms, and decide whether to use an existing data mining tool to perform the classification or write your own algorithms. Each of the three algorithms is to be trained using the provided data. You then need to compare the performance of the algorithms using two different metrics you have chosen. You may choose algorithms discussed in class, but this isn’t necessary.
The provided dataset has many attributes. Some of these are related to each other and some may appear to be more related to the prediction problem. You are to run experiments with at least three subsets of the attributes (as predictors for the bio-degradation rate):
- First set of attributes are the final four attributes with probability values obtained from previous prediction models. You are to conduct three sets of experiments building classifiers using these attributes alone.
- Second set of attributes are: Freeze Point, Boiling Point, Solubility, and Melting Point. Use the same classification algorithms (as used in step one) to build classification models which have these four attributes as input.
- The final set of three experiments is to be conducted with your choice of attribute subset different from 1 and 2. Again use the same classification algorithms (as used in step one) to build classification models which have these attributes as input.
Submission:
1)(10 pts) For all programs you write yourself: submit program listings and output of actual runs. For all tools you use: indicate and describe tool and provide output of actual runs.
The tool that I am using here for all the runs is WEKA.
ALGORITHMS CHOSEN:
The most common data mining tasks involve instances labeled with some distinguished attribute (the target), and the goal is to predict the target for new, unlabeled instances. The problem given here, however, requires using attribute values to predict a numeric bio-degradation rate, so it is more a prediction (regression) task than a classification task. Hence, after careful analysis of the data, the three algorithms taken into consideration are:
1) Linear Regression
2) Neural Network
3) Support Vector Machines
WHY LINEAR REGRESSION?
- Target attribute is numeric
- Can visualize the predicted bio-degradation value
- Can check the accuracy of the model built using the correlation coefficient
WHY NEURAL NETWORK?
- Numeric values can be used
- Can easily adjust the weights, hidden layers, and number of nodes per layer to see which model fits the data best.
- Can work reasonably well even with little tuning of the network before the test data is applied.
- Can be used in prediction as well as classification.
WHY SVM?
WEKA uses an algorithm called SMOreg to implement Support Vector Machines.
SMOreg implements Alex J. Smola and Bernhard Schölkopf's sequential minimal optimization algorithm for training a support vector regression model using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default. (Note that the coefficients in the output are based on the normalized/standardized data, not the original data.)
In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors.
- Model defines a hyperplane that separates classes
- Target is defined by a linear equation.
- SVM requires that each instance is represented as a vector of real numbers.
- Different kernel choices: linear, RBF, polynomial, sigmoid
- Automatically handles missing values
- Transforms nominal attributes into binary ones.
- Can be used in classification as well as prediction
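The separating-hyperplane idea above can be sketched in a few lines. The weight vector, bias, and points below are hypothetical, purely for illustration:

```python
import numpy as np

# Toy sketch of the hyperplane idea: cases with f(x) = w.x + b > 0 fall on one
# side of the plane, cases with f(x) < 0 on the other. w and b are hypothetical.
w = np.array([1.0, -1.0])
b = 0.0

def side(x):
    return "A" if w @ x + b > 0 else "B"

print(side(np.array([2.0, 1.0])), side(np.array([1.0, 2.0])))  # A B
```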
PRE-PROCESSING OF DATA:
The data that I wish to analyze here using classification techniques is
- Incomplete,
- Noisy (containing errors and outlier values that deviate from the expected), and
- A very small dataset.
How are '0' values treated?
From the second parameter list we can see that the compound PYRROLE contains a '0' in the Freezing Point attribute. We cannot use an interpolation method (referring to the value of the compound above or below) here, since each of the compounds is different and the dataset is very small. Hence we use a linear regression method to find the value.
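Regression-based imputation of this kind can be sketched as follows: fit on the complete rows, then predict the gap. The numbers below are hypothetical, not the actual compound data (the real fit over all 20 compounds appears later in the Linear Regression - Prediction run):

```python
import numpy as np

# Hypothetical sketch: impute a missing Freeze Pt from Boil Pt and Melt Pt by
# least-squares fitting on the complete rows, then predicting the missing one.
boil   = np.array([300.0, 320.0, 410.0, 455.0, 350.0])
melt   = np.array([250.0, 270.0, 330.0, 380.0, 290.0])
freeze = np.array([248.0, 268.0, 331.0, 378.0, np.nan])  # last value missing

known = ~np.isnan(freeze)
X = np.column_stack([boil[known], melt[known], np.ones(known.sum())])
coef, *_ = np.linalg.lstsq(X, freeze[known], rcond=None)
imputed = float(coef @ [boil[-1], melt[-1], 1.0])
print(round(imputed, 1))
```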
Uniform Metric:
The Melting Point values are in Centigrade, whereas Freezing Pt and Boiling Pt are in Kelvin. Thus Melting point values are converted to Kelvin values to make a uniform comparison.
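The conversion itself is just a fixed offset; a minimal helper:

```python
def celsius_to_kelvin(t_celsius):
    """Convert a temperature in degrees Celsius to Kelvin."""
    return t_celsius + 273.15

# e.g. a Melting Point recorded as 25 degrees C becomes 298.15 K
print(celsius_to_kelvin(25.0))
```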
Normalize:
Since attributes measured on different scales can dominate neural network training, the values are normalized to make sure all attributes are represented on a uniform scale.
To normalize the values, the data (after treating missing values) is loaded via the Open file tab. Under the Filter button, Weka > Filters > Unsupervised > Attribute > Normalize is chosen, and the Apply button is then pressed to apply normalization to the selected attributes.
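WEKA's unsupervised Normalize filter scales each attribute to [0, 1] by default; the same transformation can be sketched as follows (the Boil Pt values below are hypothetical):

```python
import numpy as np

def min_max_normalize(column):
    """Scale an attribute to [0, 1], as WEKA's Normalize filter does by default."""
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)

boil_pt = np.array([373.0, 391.0, 453.0, 523.0])  # hypothetical values
scaled = min_max_normalize(boil_pt)
print(scaled)
```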
PART 1: PROBABILITY VALUES OF ATTRIBUTES
TRANSFORMATION OF DATA:
The value of a probability can never be greater than 1, but the dataset contains 3 values that are greater than 1. There are two ways to transform this data:
- Since the greatest legal value is 1, we can simply replace the values greater than 1 with the global constant '1', and then use any method to predict the value of the bio-degradation rate.
OR
- We can change those values to '0' and treat them as missing values, use the other probability values to find them by linear regression, fill in those values, and then use any method to predict the value of the bio-degradation rate.
We select the first method and simply replace the out-of-range values with '1'.
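The chosen transformation is a simple clip at 1; the probability column below is hypothetical:

```python
def clip_probability(p):
    """Replace out-of-range probability estimates (> 1) with the constant 1."""
    return min(p, 1.0)

# hypothetical probability column containing values greater than 1
print([clip_probability(p) for p in [0.93, 1.07, 0.58, 1.21]])  # [0.93, 1.0, 0.58, 1.0]
```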
METHOD 1: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
LINEAR REGRESSION:
PARAMETERS CHOSEN:
- In the attributeSelectionMethod, 'No attribute selection' is selected.
- Biodegradation whose value is to be predicted is chosen from the drop down list
- The out-of-range values of Biowin BIODEG 2 are replaced by 1, and the attribute is renamed Biowin BIODEG 2 -1 to distinguish it from the original attribute.
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 1.0E-8
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Biodegradation Rate =
1.0254 * Biowin BIODEG 2 -1 +
0.3207 * Biowin BIODEG 2 +
-0.4167 * Biowin MITI 1 +
0.6325 * Biowin MITI 2 +
0.3027
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.766 -0.158
2 2.076 1.892 -0.184
3 1.903 1.815 -0.088
4 1.73 1.695 -0.035
5 1.74 1.695 -0.045
6 1.74 1.695 -0.045
7 1.894 1.852 -0.042
8 1.801 1.882 0.081
9 1.279 1.246 -0.033
10 1.531 1.564 0.033
11 1.428 1.701 0.273
12 1.742 1.732 -0.01
13 1.767 1.89 0.123
14 1.777 1.766 -0.011
15 1.929 1.862 -0.067
16 1.851 1.772 -0.079
17 1.716 1.673 -0.043
18 1.62 1.865 0.245
19 1.643 1.66 0.017
20 1.477 1.544 0.067
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.8003
Mean absolute error 0.084
Root mean squared error 0.1118
Relative absolute error 59.804 %
Root relative squared error 59.9595 %
Total Number of Instances 20
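The summary statistics above can be reproduced directly from the printed actual/predicted pairs; for instance, the mean absolute error and root mean squared error:

```python
import math

# Actual vs. predicted Biodegradation Rate, copied from the run above.
pairs = [(1.924, 1.766), (2.076, 1.892), (1.903, 1.815), (1.73, 1.695),
         (1.74, 1.695), (1.74, 1.695), (1.894, 1.852), (1.801, 1.882),
         (1.279, 1.246), (1.531, 1.564), (1.428, 1.701), (1.742, 1.732),
         (1.767, 1.89), (1.777, 1.766), (1.929, 1.862), (1.851, 1.772),
         (1.716, 1.673), (1.62, 1.865), (1.643, 1.66), (1.477, 1.544)]

errors = [predicted - actual for actual, predicted in pairs]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rmse)  # close to the WEKA summary values 0.084 and 0.1118
```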
METHOD 2: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
NEURAL NETWORK
Parameters Chosen:
- The autoBuild is set to 'True', so the hidden layers are added and connected up automatically.
- The parameters learningRate and momentum are set to 0.3 and 0.2 respectively to perform slow learning. These parameters can be overridden in the graphical interface.
- The decay parameter causes the learning rate to decrease with time; it divides the starting value by the epoch number to obtain the current rate. This sometimes improves performance and may stop the network from diverging. It is set to "False".
- The reset parameter automatically resets the network with a lower learning rate and begins training again if it diverges from the answer. It is set to "False".
- The trainingTime parameter sets the number of training epochs. It is set to different values to see at which value the network converges.
- validationSetSize is set to 0, which means no validation set is used.
- normalizeAttributes is set to 'True', which normalizes all the attributes and requires no normalization preprocessing beforehand.
- The other parameters are left at their default values.
Under WEKA we choose MultiLayerPerceptron to run the dataset. The attributes chosen for this run are Biowin BIODEG 2 -1, Biowin BIODEG 2, Biowin MITI 1, Biowin MITI 2.
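The configuration used here (one hidden layer of sigmoid nodes feeding a linear output node, trained by backpropagation with learning rate 0.3 and momentum 0.2) can be sketched in miniature. The training data below is hypothetical, and this is a simplified full-batch version, not WEKA's exact implementation:

```python
import numpy as np

# Miniature sketch of the MultilayerPerceptron setup: one hidden layer of
# sigmoid nodes, a linear output node, learning rate 0.3, momentum 0.2.
rng = np.random.default_rng(0)
X = rng.random((20, 4))                           # 20 instances, 4 attributes
y = X @ np.array([0.5, 0.3, -0.4, 0.6]) + 0.3     # stand-in numeric target

n_hidden, lr, momentum = 10, 0.3, 0.2
W1 = rng.normal(0.0, 0.5, (4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, n_hidden);      b2 = 0.0
vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
vW2 = np.zeros_like(W2); vb2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_mse():
    pred = sigmoid(X @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

mse_before = model_mse()
for epoch in range(2000):
    h = sigmoid(X @ W1 + b1)                  # hidden activations
    err = h @ W2 + b2 - y                     # linear output node error
    # gradients of the mean squared error, backpropagated with momentum
    gW2 = h.T @ err / len(y);            gb2 = float(err.mean())
    dh = np.outer(err, W2) * h * (1 - h)
    gW1 = X.T @ dh / len(y);             gb1 = dh.mean(axis=0)
    vW2 = momentum * vW2 - lr * gW2;     W2 = W2 + vW2
    vb2 = momentum * vb2 - lr * gb2;     b2 = b2 + vb2
    vW1 = momentum * vW1 - lr * gW1;     W1 = W1 + vW1
    vb1 = momentum * vb1 - lr * gb1;     b1 = b1 + vb1

mse_after = model_mse()
print(mse_before, "->", mse_after)
```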
NETWORK CONVERGES:
The Error per Epoch does not converge until 500000 epochs.
- The number of nodes in hidden layer is 10 and there is only 1 hidden layer
- Learning rate is set at 0.3 and momentum is set at 0.2
- normalizeAttributes is set to ‘True’
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500000 -V 0 -S 0 -E 20 -H 10 -G -B -C -R
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold 3.933594631828762
Node 1 -0.9783199689103592
Node 2 0.7224941433805308
Node 3 2.7231318365967154
Node 4 -0.29364362180108144
Node 5 -3.1570067670437387
Node 6 -7.257240444556116
Node 7 3.6356026420461536
Node 8 1.883024482863157
Node 9 -2.156884214803949
Node 10 -1.337750821693855
Sigmoid Node 1
Inputs Weights
Threshold 0.21465281942625822
Attrib Biowin BIODEG 2 -1 5.0138196265829835
Attrib Biowin BIODEG 2 -3.2145565960064557
Attrib Biowin MITI 1 1.9631616499694475
Attrib Biowin MITI 2 -4.792619029618996
Sigmoid Node 2
Inputs Weights
Threshold 4.030017024711695
Attrib Biowin BIODEG 2 -1 -3.027303900286876
Attrib Biowin BIODEG 2 -4.61615542896227
Attrib Biowin MITI 1 -4.39740759901025
Attrib Biowin MITI 2 3.3223816094720684
Sigmoid Node 3
Inputs Weights
Threshold -1.255278689756961
Attrib Biowin BIODEG 2 -1 -0.1350383581698951
Attrib Biowin BIODEG 2 -4.89063610524344
Attrib Biowin MITI 1 -0.5683913720758256
Attrib Biowin MITI 2 1.4711038260088778
Sigmoid Node 4
Inputs Weights
Threshold -2.2824294178232476
Attrib Biowin BIODEG 2 -1 -0.48897386084433153
Attrib Biowin BIODEG 2 -1.6744306970725502
Attrib Biowin MITI 1 -0.6955691484542303
Attrib Biowin MITI 2 -0.3618185102341199
Sigmoid Node 5
Inputs Weights
Threshold -1.213514551110898
Attrib Biowin BIODEG 2 -1 -6.94082531690281
Attrib Biowin BIODEG 2 10.59543076588969
Attrib Biowin MITI 1 2.962308498021926
Attrib Biowin MITI 2 -5.336457513173256
Sigmoid Node 6
Inputs Weights
Threshold 6.42729094594343
Attrib Biowin BIODEG 2 -1 11.86119768519412
Attrib Biowin BIODEG 2 -24.23406709359751
Attrib Biowin MITI 1 -4.832458024488896
Attrib Biowin MITI 2 6.955548596351002
Sigmoid Node 7
Inputs Weights
Threshold -1.1739379376604704
Attrib Biowin BIODEG 2 -1 0.48667009587622273
Attrib Biowin BIODEG 2 -5.372199269555603
Attrib Biowin MITI 1 -0.9916289739264834
Attrib Biowin MITI 2 1.9820724980439903
Sigmoid Node 8
Inputs Weights
Threshold -4.122573603114901
Attrib Biowin BIODEG 2 -1 3.905180928827441
Attrib Biowin BIODEG 2 -1.104460993115406
Attrib Biowin MITI 1 -7.6272313364149
Attrib Biowin MITI 2 -5.256928873603537
Sigmoid Node 9
Inputs Weights
Threshold -1.4606352748167564
Attrib Biowin BIODEG 2 -1 -1.8041566307782526
Attrib Biowin BIODEG 2 -1.042551971873019
Attrib Biowin MITI 1 0.615783213465384
Attrib Biowin MITI 2 -1.3471402098611278
Sigmoid Node 10
Inputs Weights
Threshold -1.159271543801691
Attrib Biowin BIODEG 2 -1 -0.20089837948088868
Attrib Biowin BIODEG 2 -2.4574257719754184
Attrib Biowin MITI 1 0.2616092866435466
Attrib Biowin MITI 2 -1.5176060264186448
Class
Input
Node 0
Time taken to build model: 467.64 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.926 0.002
2 2.076 2.076 0
3 1.903 1.915 0.012
4 1.73 1.741 0.011
5 1.74 1.741 0.001
6 1.74 1.741 0.001
7 1.894 1.894 0
8 1.801 1.802 0.001
9 1.279 1.292 0.013
10 1.531 1.542 0.011
11 1.428 1.439 0.011
12 1.742 1.743 0.001
13 1.767 1.768 0.001
14 1.777 1.788 0.011
15 1.929 1.93 0.001
16 1.851 1.851 0
17 1.716 1.723 0.007
18 1.62 1.62 0
19 1.643 1.644 0.001
20 1.477 1.478 0.001
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.9997
Mean absolute error 0.0044
Root mean squared error 0.0066
Relative absolute error 3.1242 %
Root relative squared error 3.5321 %
Total Number of Instances 20
No. of Epochs / Error per Epoch / Correlation Coefficient
10000 / 0.00796 / 0.9101
150000 / 0.0000121 / 0.9997
200000 / 0.000012 / 0.9997
250000 / 0.0000119 / 0.9997
300000 / 0.0000118 / 0.9997
350000 / 0.0000017 / 0.9997
500000 / 0.0000114 / 0.9997
METHOD 3: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
SUPPORT VECTOR MACHINES:
Parameters Chosen:
c -- The complexity parameter C. It is set to 3.0, as shown by -C 3.0 in the run information.
checksTurnedOff -- Turns time-consuming checks off - use with caution.
debug -- If set to true, classifier may output additional info to the console. It is set to “False”
eps -- The epsilon for round-off error. It is left at default value.
epsilon -- The amount up to which deviations are tolerated. Watch out, the value of epsilon is used with the (normalized/standardized) data.
filterType -- Determines how/if the data will be transformed.
kernel -- The kernel to use is chosen to be RBF Kernel.
toleranceParameter -- The tolerance parameter is left at default value 0.0010.
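The model SMOreg produces is a support vector expansion: a weighted sum of RBF kernel evaluations between the support vectors and the query point, plus a bias (as shown in the output below). Evaluating such an expansion can be sketched as follows; the support vectors, coefficients, and query point here are hypothetical, not the ones from the actual run:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.01):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    d = x - y
    return np.exp(-gamma * (d @ d))

# Hypothetical support vectors, coefficients, and bias for illustration.
support_vectors = np.array([[0.9, 0.8, 0.7, 0.6],
                            [0.2, 0.1, 0.4, 0.3],
                            [0.5, 0.5, 0.5, 0.5]])
alphas = np.array([3.0, -3.0, 2.0])
bias = 0.5254

def predict(x):
    return sum(a * rbf_kernel(sv, x)
               for a, sv in zip(alphas, support_vectors)) + bias

print(round(predict(np.array([0.5, 0.5, 0.5, 0.5])), 4))  # 2.5254
```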
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.SMOreg -S 0.09 -C 3.0 -T 0.0010 -P 1.0E-12 -N 0 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
SMOreg
Kernel used:
RBF kernel: K(x,y) = e^-(0.01* <x-y,x-y>^2)
Support Vector Expansion :
(normalized) Biodegradation Rate =
3 * K[X(0), X]
+ 3 * K[X(1), X]
+ 3 * K[X(2), X]
+ 2.0091 * K[X(6), X]
+ -3 * K[X(8), X]
+ -3 * K[X(9), X]
+ -3 * K[X(10), X]
+ 3 * K[X(14), X]
+ 3 * K[X(15), X]
+ -3 * K[X(17), X]
+ -2.0091 * K[X(18), X]
+ -3 * K[X(19), X]
+ 0.5254
Number of support vectors: 12
Number of kernel evaluations: 210 (100 % cached)
Time taken to build model: 0.02 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7517
Mean absolute error 0.1133
Root mean squared error 0.1443
Relative absolute error 80.6594 %
Root relative squared error 77.4072 %
Total Number of Instances 20
PART 2: USING 4 ATTRIBUTES
TRANSFORMATION OF DATA:
SCATTER PLOT:
Looking at the plot matrix, we can see that the Freezing Point and Boiling Point values can be used to determine the Melting Point value, as they tend to show a linear relationship. We use a linear regression method to predict the missing value of the Melting Point attribute.
PRINCIPAL COMPONENT ANALYSIS:
=== Run information ===
Evaluator: weka.attributeSelection.PrincipalComponents -R 0.95 -A 5
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: degrade4-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Freeze Pt (K)
Boil Pt ( K)
Solubility
Melt Pt (K)
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (unsupervised):
Principal Components Attribute Transformer
Correlation matrix
1 0.75 -0.42 0.98
0.75 1 -0.11 0.75
-0.42 -0.11 1 -0.42
0.98 0.75 -0.42 1
eigenvalue  proportion  cumulative
2.83154     0.70788     0.70788    0.579 Melt Pt (K) + 0.579 Freeze Pt (K) + 0.493 Boil Pt ( K) - 0.296 Solubility
0.90903     0.22726     0.93514   -0.893 Solubility - 0.446 Boil Pt ( K) - 0.044 Melt Pt (K) - 0.033 Freeze Pt (K)
0.23652     0.05913     0.99427   -0.747 Boil Pt ( K) + 0.41 Freeze Pt (K) + 0.399 Melt Pt (K) + 0.338 Solubility
Eigenvectors
 V1       V2       V3
 0.5786  -0.0326   0.4103  Freeze Pt (K)
 0.4926  -0.4461  -0.7472  Boil Pt ( K)
-0.2956  -0.8933   0.3385  Solubility
 0.579   -0.044    0.3985  Melt Pt (K)
Ranked attributes:
0.29212  1   0.579 Melt Pt (K) + 0.579 Freeze Pt (K) + 0.493 Boil Pt ( K) - 0.296 Solubility
0.06486  2  -0.893 Solubility - 0.446 Boil Pt ( K) - 0.044 Melt Pt (K) - 0.033 Freeze Pt (K)
0.00573  3  -0.747 Boil Pt ( K) + 0.41 Freeze Pt (K) + 0.399 Melt Pt (K) + 0.338 Solubility
Selected attributes: 1,2,3 : 3
This output also shows that the Freezing Point, Boiling Point, and Melting Point attributes are strongly correlated (the Freezing Point/Melting Point correlation is 0.98) and load together on the first principal component, making them the most significant. Hence the Boiling Point value can be used to calculate the missing Freezing Point value. (This method is used mainly because the compound PYRROLE has a lot of missing information.)
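The principal components above come from an eigendecomposition of the attribute correlation matrix. A sketch using the matrix as printed (rounded to two decimals, so the eigenvalues differ slightly from WEKA's):

```python
import numpy as np

# Correlation matrix of Freeze Pt, Boil Pt, Solubility, Melt Pt, as printed above.
corr = np.array([[ 1.00,  0.75, -0.42,  0.98],
                 [ 0.75,  1.00, -0.11,  0.75],
                 [-0.42, -0.11,  1.00, -0.42],
                 [ 0.98,  0.75, -0.42,  1.00]])

eigenvalues = np.linalg.eigvalsh(corr)[::-1]      # sorted descending
proportion = eigenvalues / eigenvalues.sum()      # variance explained per component
print(np.round(eigenvalues, 3), np.round(proportion, 3))
```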
LINEAR REGRESSION – PREDICTION
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 5.0
Relation: degrade2-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R3
Instances: 20
Attributes: 3
Freeze Pt (K)
Boil Pt ( K)
Melt Pt (K)
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Freeze Pt (K) =
0.3387 * Boil Pt ( K) +
0.6416 * Melt Pt (K) +
-63.5711
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 183.9 170.489 -13.411
2 216 248.092 32.092
3 314 264.332 -49.668
4 304.1 288.023 -16.077
5 285.4 280.269 -5.131
6 307.9 295.372 -12.528
7 404 382.563 -21.437
8 395.6 367.348 -28.252
9 267 262.786 -4.214
10 0 233.162 233.162 Predicted Value for the missing Freezing Point
11 262.7 233.033 -29.667
12 298 275.072 -22.928
13 278.7 234.869 -43.831
14 353.5 329.469 -24.031
15 291 314.046 23.046
16 146.9 156.232 9.332
17 184.7 174.632 -10.068
18 260.2 262.664 2.464
19 252.5 238.621 -13.879
20 242 237.025 -4.975
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7669
Mean absolute error 30.0097
Root mean squared error 56.8357
Relative absolute error 48.9522 %
Root relative squared error 65.0816 %
Total Number of Instances 20
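The fitted model above is just a linear equation, so the imputation can be written as a one-line function. The inputs below are hypothetical, purely to show how an imputed Freeze Pt value (such as the 233.162 for instance 10 above) would be produced from a compound's known Boil Pt and Melt Pt:

```python
def predict_freeze_pt(boil_pt_k, melt_pt_k):
    """The linear regression model fitted above for imputing Freeze Pt (K)."""
    return 0.3387 * boil_pt_k + 0.6416 * melt_pt_k - 63.5711

# hypothetical inputs, not PYRROLE's actual attribute values
print(round(predict_freeze_pt(300.0, 300.0), 4))
```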
METHOD 1: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
LINEAR REGRESSION:
PARAMETERS CHOSEN:
- The value of the missing Freezing Point, which was found to be 233.162, was inserted, and the linear regression analysis is done.
- From the test options choose "Training Set". Performing cross-validation on this dataset does not yield the right results because each tuple is a different compound and the dataset as such is very small.
- Choose "Biodegradation Rate" (Numeric) from the list of attributes.
- 'No attribute selection' is chosen for the attributeSelectionMethod.
- eliminateCollinearAttributes is set to True.
The output is depicted as follows:
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 1.0E-8
Relation: degrade4-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Freeze Pt (K)
Boil Pt ( K)
Solubility
Melt Pt (K)
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Biodegradation Rate =
0.0012 * Freeze Pt (K) +
0.0016 * Boil Pt ( K) +
0 * Solubility +
-0.0021 * Melt Pt (K) +
1.2702
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.777 -0.147
2 2.076 1.725 -0.351
3 1.903 1.794 -0.109
4 1.73 1.729 -0.001
5 1.74 1.764 0.024
6 1.74 1.739 -0.001
7 1.894 1.795 -0.099
8 1.801 1.735 -0.066
9 1.279 1.751 0.472
10 1.531 1.661 0.13
11 1.428 1.632 0.204
12 1.742 1.684 -0.058
13 1.767 1.573 -0.194
14 1.777 1.724 -0.053
15 1.929 1.895 -0.034
16 1.851 1.723 -0.128
17 1.716 1.666 -0.05
18 1.62 1.777 0.157
19 1.643 1.694 0.051
20 1.477 1.729 0.252
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.3562
Mean absolute error 0.1289
Root mean squared error 0.1742
Relative absolute error 91.7856 %
Root relative squared error 93.4393 %
Total Number of Instances 20
METHOD 2: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
NEURAL NETWORK
Parameters Chosen:
- The autoBuild is set to 'True', so the hidden layers are added and connected up automatically.
- The parameters learningRate and momentum are set to 0.3 and 0.2 respectively to perform slow learning. These parameters can be overridden in the graphical interface.
- The decay parameter causes the learning rate to decrease with time; it divides the starting value by the epoch number to obtain the current rate. This sometimes improves performance and may stop the network from diverging. It is set to "False".
- The reset parameter automatically resets the network with a lower learning rate and begins training again if it diverges from the answer. It is set to "False".
- The trainingTime parameter sets the number of training epochs. It is set to different values to see at which value the network converges.
- validationSetSize is set to 0, which means no validation set is used.
- normalizeAttributes is set to 'True', which normalizes all the attributes and requires no normalization preprocessing beforehand.
- The other parameters are left at their default values.
Under WEKA we choose MultiLayerPerceptron to run the dataset.