MALINI RAJAH
ID# 26402822
CSE 7331 Fall 2007 FINAL
DUE DATE: 12/15/07 8:00 am
General Description:
The study of biodegradation of compounds in nature is an important research area in Environmental Engineering. However, the accurate prediction of which compounds actually biodegrade and the speed with which they degrade is a difficult problem yet to be solved. Previous prediction algorithms tend to rely on structural properties of compounds to do the prediction and create somewhat simplistic linear regression models. (Part of the problem with previous prediction algorithms is the lack of large amounts of reliable data. Unfortunately this will also be a problem with this project.)
Your project requires the development and comparison of data mining classification algorithms to predict biodegradability of compounds. You are provided a small dataset from which machine learning can take place. (If possible, a separate validation dataset will be provided.)
Requirements:
You are to implement and compare the performance of three different classification algorithms using three different sets of attributes. You must choose the algorithms, and decide whether to use an existing data mining tool to perform the classification or write your own algorithms. Each of the three algorithms is to be trained using the provided data. You then need to compare the performance of the algorithms using two different metrics you have chosen. You may choose algorithms discussed in class, but this isn’t necessary.
The provided dataset has many attributes. Some of these are related to each other and some may appear to be more related to the prediction problem. You are to run experiments with at least three subsets of the attributes (as predictors for the bio-degradation rate):
- First set of attributes are the final four attributes with probability values obtained from previous prediction models. You are to conduct three sets of experiments building classifiers using these attributes alone.
- Second set of attributes are: Freeze Point, Boiling Point, Solubility, and Melting Point. Use the same classification algorithms (as used in step one) to build classification models which have these four attributes as input.
- The final set of three experiments is to be conducted with your choice of attribute subset different from 1 and 2. Again use the same classification algorithms (as used in step one) to build classification models which have these attributes as input.
Submission:
1)(10 pts) For all programs you write yourself: submit program listings and output of actual runs. For all tools you use: indicate and describe tool and provide output of actual runs.
The tool that I am using here for all the runs is WEKA.
ALGORITHMS CHOSEN:
The most common data mining tasks involve instances labeled with some distinguished attribute (the target), and the goal is to predict the target for new, unlabeled instances. The problem given here, however, requires using attribute values to predict a numeric bio-degradation rate, so it is more a prediction (regression) task than a classification task. Hence, after careful analysis of the data, the three algorithms taken into consideration are:
1) Linear Regression
2) Neural Network
3) Support Vector Machines
WHY LINEAR REGRESSION?
- Target attribute is numeric
- Can visualize the predicted bio-degradation value
- Can check the accuracy of the model built using the correlation coefficient
WHY NEURAL NETWORK?
- Numeric values can be used
- Can easily adjust the weights, hidden layers, and number of nodes per layer to see which model fits the data best.
- Can work reasonably well even with little tuning of the network before the test data is applied.
- Can be used in prediction as well as classification.
WHY SVM?
WEKA uses an algorithm called SMOreg to implement Support Vector Machines.
SMOreg implements Alex J. Smola and Bernhard Schölkopf's sequential minimal optimization algorithm for training a support vector regression model using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default. (Note that the coefficients in the output are based on the normalized/standardized data, not the original data.)
In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors.
- Model defines a hyperplane that separates classes
- Target is defined by a linear equation.
- SVM requires that each instance is represented as a vector of real numbers.
- Different kernel choices: linear, RBF, polynomial, sigmoid
- Automatically handles missing values
- Transforms nominal attributes into binary ones.
- Can be used in classification as well as prediction
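The separating-hyperplane idea above can be sketched in a few lines. The weight vector, bias, and points below are hypothetical, purely for illustration:

```python
import numpy as np

# Toy sketch of the hyperplane idea: cases with f(x) = w.x + b > 0 fall on one
# side of the plane, cases with f(x) < 0 on the other. w and b are hypothetical.
w = np.array([1.0, -1.0])
b = 0.0

def side(x):
    return "A" if w @ x + b > 0 else "B"

print(side(np.array([2.0, 1.0])), side(np.array([1.0, 2.0])))  # A B
```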
PRE-PROCESSING OF DATA:
The data that I wish to analyze here using classification techniques is
- Incomplete,
- Noisy (containing errors and outlier values that deviate from the expected), and
- A very small dataset.
How are '0' values treated?
From the second parameter list we can see that the compound PYRROLE contains a '0' in the Freezing Point attribute. We cannot use an interpolation method (referring to the value of the compound above or below) here, since each of the compounds is different and the dataset is very small. Hence we use a linear regression method to find the value.
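Regression-based imputation of this kind can be sketched as follows: fit on the complete rows, then predict the gap. The numbers below are hypothetical, not the actual compound data (the real fit over all 20 compounds appears later in the Linear Regression - Prediction run):

```python
import numpy as np

# Hypothetical sketch: impute a missing Freeze Pt from Boil Pt and Melt Pt by
# least-squares fitting on the complete rows, then predicting the missing one.
boil   = np.array([300.0, 320.0, 410.0, 455.0, 350.0])
melt   = np.array([250.0, 270.0, 330.0, 380.0, 290.0])
freeze = np.array([248.0, 268.0, 331.0, 378.0, np.nan])  # last value missing

known = ~np.isnan(freeze)
X = np.column_stack([boil[known], melt[known], np.ones(known.sum())])
coef, *_ = np.linalg.lstsq(X, freeze[known], rcond=None)
imputed = float(coef @ [boil[-1], melt[-1], 1.0])
print(round(imputed, 1))
```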
Uniform Metric:
The Melting Point values are in Centigrade, whereas Freezing Pt and Boiling Pt are in Kelvin. Thus Melting point values are converted to Kelvin values to make a uniform comparison.
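The conversion itself is just a fixed offset; a minimal helper:

```python
def celsius_to_kelvin(t_celsius):
    """Convert a temperature in degrees Celsius to Kelvin."""
    return t_celsius + 273.15

# e.g. a Melting Point recorded as 25 degrees C becomes 298.15 K
print(celsius_to_kelvin(25.0))
```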
Normalize:
Since attributes measured on different scales can dominate neural network training, the values are normalized to make sure all attributes are represented on a uniform scale.
To normalize the values, the data (after treating missing values) is loaded via the Open file tab. Under the Filter button, Weka > Filters > Unsupervised > Attribute > Normalize is chosen, and the Apply button is then pressed to apply normalization to the selected attributes.
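WEKA's unsupervised Normalize filter scales each attribute to [0, 1] by default; the same transformation can be sketched as follows (the Boil Pt values below are hypothetical):

```python
import numpy as np

def min_max_normalize(column):
    """Scale an attribute to [0, 1], as WEKA's Normalize filter does by default."""
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)

boil_pt = np.array([373.0, 391.0, 453.0, 523.0])  # hypothetical values
scaled = min_max_normalize(boil_pt)
print(scaled)
```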
PART 1: PROBABILITY VALUES OF ATTRIBUTES
TRANSFORMATION OF DATA:
The value of a probability can never be greater than 1, but the dataset contains 3 values that are greater than 1. There are two ways to transform this data:
- Since the greatest legal value is 1, we can simply replace the values greater than 1 with the global constant '1', and then use any method to predict the value of the bio-degradation rate.
OR
- We can change those values to '0' and treat them as missing values, use the other probability values to find them by linear regression, fill in those values, and then use any method to predict the value of the bio-degradation rate.
We select the first method and simply replace the out-of-range values with '1'.
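The chosen transformation is a simple clip at 1; the probability column below is hypothetical:

```python
def clip_probability(p):
    """Replace out-of-range probability estimates (> 1) with the constant 1."""
    return min(p, 1.0)

# hypothetical probability column containing values greater than 1
print([clip_probability(p) for p in [0.93, 1.07, 0.58, 1.21]])  # [0.93, 1.0, 0.58, 1.0]
```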
METHOD 1: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
LINEAR REGRESSION:
PARAMETERS CHOSEN:
- In the attributeSelectionMethod, 'No attribute selection' is selected.
- Biodegradation whose value is to be predicted is chosen from the drop down list
- The out-of-range values of Biowin BIODEG 2 are replaced by 1, and the attribute is renamed Biowin BIODEG 2 -1 to distinguish it from the original attribute.
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 1.0E-8
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Biodegradation Rate =
1.0254 * Biowin BIODEG 2 -1 +
0.3207 * Biowin BIODEG 2 +
-0.4167 * Biowin MITI 1 +
0.6325 * Biowin MITI 2 +
0.3027
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.766 -0.158
2 2.076 1.892 -0.184
3 1.903 1.815 -0.088
4 1.73 1.695 -0.035
5 1.74 1.695 -0.045
6 1.74 1.695 -0.045
7 1.894 1.852 -0.042
8 1.801 1.882 0.081
9 1.279 1.246 -0.033
10 1.531 1.564 0.033
11 1.428 1.701 0.273
12 1.742 1.732 -0.01
13 1.767 1.89 0.123
14 1.777 1.766 -0.011
15 1.929 1.862 -0.067
16 1.851 1.772 -0.079
17 1.716 1.673 -0.043
18 1.62 1.865 0.245
19 1.643 1.66 0.017
20 1.477 1.544 0.067
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.8003
Mean absolute error 0.084
Root mean squared error 0.1118
Relative absolute error 59.804 %
Root relative squared error 59.9595 %
Total Number of Instances 20
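The summary statistics above can be reproduced directly from the printed actual/predicted pairs; for instance, the mean absolute error and root mean squared error:

```python
import math

# Actual vs. predicted Biodegradation Rate, copied from the run above.
pairs = [(1.924, 1.766), (2.076, 1.892), (1.903, 1.815), (1.73, 1.695),
         (1.74, 1.695), (1.74, 1.695), (1.894, 1.852), (1.801, 1.882),
         (1.279, 1.246), (1.531, 1.564), (1.428, 1.701), (1.742, 1.732),
         (1.767, 1.89), (1.777, 1.766), (1.929, 1.862), (1.851, 1.772),
         (1.716, 1.673), (1.62, 1.865), (1.643, 1.66), (1.477, 1.544)]

errors = [predicted - actual for actual, predicted in pairs]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rmse)  # close to the WEKA summary values 0.084 and 0.1118
```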
METHOD 2: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
NEURAL NETWORK
Parameters Chosen:
- The autoBuild is set to 'True', so the hidden layers are added and connected up automatically.
- The parameters learningRate and momentum are set to 0.3 and 0.2 respectively to perform slow learning. These parameters can be overridden in the graphical interface.
- The decay parameter causes the learning rate to decrease with time; it divides the starting value by the epoch number to obtain the current rate. This sometimes improves performance and may stop the network from diverging. It is set to "False".
- The reset parameter automatically resets the network with a lower learning rate and begins training again if it diverges from the answer. It is set to "False".
- The trainingTime parameter sets the number of training epochs. It is set to different values to see at which value the network converges.
- validationSetSize is set to 0, which means no validation set is used.
- normalizeAttributes is set to 'True', which normalizes all the attributes and requires no normalization preprocessing beforehand.
- The other parameters are left at their default values.
Under WEKA we choose MultiLayerPerceptron to run the dataset. The attributes chosen for this run are Biowin BIODEG 2 -1, Biowin BIODEG 2, Biowin MITI 1, Biowin MITI 2.
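The configuration used here (one hidden layer of sigmoid nodes feeding a linear output node, trained by backpropagation with learning rate 0.3 and momentum 0.2) can be sketched in miniature. The training data below is hypothetical, and this is a simplified full-batch version, not WEKA's exact implementation:

```python
import numpy as np

# Miniature sketch of the MultilayerPerceptron setup: one hidden layer of
# sigmoid nodes, a linear output node, learning rate 0.3, momentum 0.2.
rng = np.random.default_rng(0)
X = rng.random((20, 4))                           # 20 instances, 4 attributes
y = X @ np.array([0.5, 0.3, -0.4, 0.6]) + 0.3     # stand-in numeric target

n_hidden, lr, momentum = 10, 0.3, 0.2
W1 = rng.normal(0.0, 0.5, (4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, n_hidden);      b2 = 0.0
vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
vW2 = np.zeros_like(W2); vb2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_mse():
    pred = sigmoid(X @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

mse_before = model_mse()
for epoch in range(2000):
    h = sigmoid(X @ W1 + b1)                  # hidden activations
    err = h @ W2 + b2 - y                     # linear output node error
    # gradients of the mean squared error, backpropagated with momentum
    gW2 = h.T @ err / len(y);            gb2 = float(err.mean())
    dh = np.outer(err, W2) * h * (1 - h)
    gW1 = X.T @ dh / len(y);             gb1 = dh.mean(axis=0)
    vW2 = momentum * vW2 - lr * gW2;     W2 = W2 + vW2
    vb2 = momentum * vb2 - lr * gb2;     b2 = b2 + vb2
    vW1 = momentum * vW1 - lr * gW1;     W1 = W1 + vW1
    vb1 = momentum * vb1 - lr * gb1;     b1 = b1 + vb1

mse_after = model_mse()
print(mse_before, "->", mse_after)
```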
NETWORK CONVERGES:
The Error per Epoch does not converge until 500000 epochs.
- The number of nodes in hidden layer is 10 and there is only 1 hidden layer
- Learning rate is set at 0.3 and momentum is set at 0.2
- normalizeAttributes is set to ‘True’
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500000 -V 0 -S 0 -E 20 -H 10 -G -B -C -R
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Node 0
Inputs Weights
Threshold 3.933594631828762
Node 1 -0.9783199689103592
Node 2 0.7224941433805308
Node 3 2.7231318365967154
Node 4 -0.29364362180108144
Node 5 -3.1570067670437387
Node 6 -7.257240444556116
Node 7 3.6356026420461536
Node 8 1.883024482863157
Node 9 -2.156884214803949
Node 10 -1.337750821693855
Sigmoid Node 1
Inputs Weights
Threshold 0.21465281942625822
Attrib Biowin BIODEG 2 -1 5.0138196265829835
Attrib Biowin BIODEG 2 -3.2145565960064557
Attrib Biowin MITI 1 1.9631616499694475
Attrib Biowin MITI 2 -4.792619029618996
Sigmoid Node 2
Inputs Weights
Threshold 4.030017024711695
Attrib Biowin BIODEG 2 -1 -3.027303900286876
Attrib Biowin BIODEG 2 -4.61615542896227
Attrib Biowin MITI 1 -4.39740759901025
Attrib Biowin MITI 2 3.3223816094720684
Sigmoid Node 3
Inputs Weights
Threshold -1.255278689756961
Attrib Biowin BIODEG 2 -1 -0.1350383581698951
Attrib Biowin BIODEG 2 -4.89063610524344
Attrib Biowin MITI 1 -0.5683913720758256
Attrib Biowin MITI 2 1.4711038260088778
Sigmoid Node 4
Inputs Weights
Threshold -2.2824294178232476
Attrib Biowin BIODEG 2 -1 -0.48897386084433153
Attrib Biowin BIODEG 2 -1.6744306970725502
Attrib Biowin MITI 1 -0.6955691484542303
Attrib Biowin MITI 2 -0.3618185102341199
Sigmoid Node 5
Inputs Weights
Threshold -1.213514551110898
Attrib Biowin BIODEG 2 -1 -6.94082531690281
Attrib Biowin BIODEG 2 10.59543076588969
Attrib Biowin MITI 1 2.962308498021926
Attrib Biowin MITI 2 -5.336457513173256
Sigmoid Node 6
Inputs Weights
Threshold 6.42729094594343
Attrib Biowin BIODEG 2 -1 11.86119768519412
Attrib Biowin BIODEG 2 -24.23406709359751
Attrib Biowin MITI 1 -4.832458024488896
Attrib Biowin MITI 2 6.955548596351002
Sigmoid Node 7
Inputs Weights
Threshold -1.1739379376604704
Attrib Biowin BIODEG 2 -1 0.48667009587622273
Attrib Biowin BIODEG 2 -5.372199269555603
Attrib Biowin MITI 1 -0.9916289739264834
Attrib Biowin MITI 2 1.9820724980439903
Sigmoid Node 8
Inputs Weights
Threshold -4.122573603114901
Attrib Biowin BIODEG 2 -1 3.905180928827441
Attrib Biowin BIODEG 2 -1.104460993115406
Attrib Biowin MITI 1 -7.6272313364149
Attrib Biowin MITI 2 -5.256928873603537
Sigmoid Node 9
Inputs Weights
Threshold -1.4606352748167564
Attrib Biowin BIODEG 2 -1 -1.8041566307782526
Attrib Biowin BIODEG 2 -1.042551971873019
Attrib Biowin MITI 1 0.615783213465384
Attrib Biowin MITI 2 -1.3471402098611278
Sigmoid Node 10
Inputs Weights
Threshold -1.159271543801691
Attrib Biowin BIODEG 2 -1 -0.20089837948088868
Attrib Biowin BIODEG 2 -2.4574257719754184
Attrib Biowin MITI 1 0.2616092866435466
Attrib Biowin MITI 2 -1.5176060264186448
Class
Input
Node 0
Time taken to build model: 467.64 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.926 0.002
2 2.076 2.076 0
3 1.903 1.915 0.012
4 1.73 1.741 0.011
5 1.74 1.741 0.001
6 1.74 1.741 0.001
7 1.894 1.894 0
8 1.801 1.802 0.001
9 1.279 1.292 0.013
10 1.531 1.542 0.011
11 1.428 1.439 0.011
12 1.742 1.743 0.001
13 1.767 1.768 0.001
14 1.777 1.788 0.011
15 1.929 1.93 0.001
16 1.851 1.851 0
17 1.716 1.723 0.007
18 1.62 1.62 0
19 1.643 1.644 0.001
20 1.477 1.478 0.001
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.9997
Mean absolute error 0.0044
Root mean squared error 0.0066
Relative absolute error 3.1242 %
Root relative squared error 3.5321 %
Total Number of Instances 20
No. of Epochs / Error per Epoch / Correlation Coefficient
10000 / 0.00796 / 0.9101
150000 / 0.0000121 / 0.9997
200000 / 0.000012 / 0.9997
250000 / 0.0000119 / 0.9997
300000 / 0.0000118 / 0.9997
350000 / 0.0000017 / 0.9997
500000 / 0.0000114 / 0.9997
METHOD 3: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
SUPPORT VECTOR MACHINES:
Parameters Chosen:
c -- The complexity parameter C. It is set to 3.0, as shown by -C 3.0 in the run information.
checksTurnedOff -- Turns time-consuming checks off - use with caution.
debug -- If set to true, classifier may output additional info to the console. It is set to “False”
eps -- The epsilon for round-off error. It is left at default value.
epsilon -- The amount up to which deviations are tolerated. Watch out, the value of epsilon is used with the (normalized/standardized) data.
filterType -- Determines how/if the data will be transformed.
kernel -- The kernel to use is chosen to be RBF Kernel.
toleranceParameter -- The tolerance parameter is left at default value 0.0010.
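The model SMOreg produces is a support vector expansion: a weighted sum of RBF kernel evaluations between the support vectors and the query point, plus a bias (as shown in the output below). Evaluating such an expansion can be sketched as follows; the support vectors, coefficients, and query point here are hypothetical, not the ones from the actual run:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.01):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    d = x - y
    return np.exp(-gamma * (d @ d))

# Hypothetical support vectors, coefficients, and bias for illustration.
support_vectors = np.array([[0.9, 0.8, 0.7, 0.6],
                            [0.2, 0.1, 0.4, 0.3],
                            [0.5, 0.5, 0.5, 0.5]])
alphas = np.array([3.0, -3.0, 2.0])
bias = 0.5254

def predict(x):
    return sum(a * rbf_kernel(sv, x)
               for a, sv in zip(alphas, support_vectors)) + bias

print(round(predict(np.array([0.5, 0.5, 0.5, 0.5])), 4))  # 2.5254
```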
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.SMOreg -S 0.09 -C 3.0 -T 0.0010 -P 1.0E-12 -N 0 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
Relation: probability-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Biowin BIODEG 2 -1
Biowin BIODEG 2
Biowin MITI 1
Biowin MITI 2
Test mode: evaluate on training data
=== Classifier model (full training set) ===
SMOreg
Kernel used:
RBF kernel: K(x,y) = e^-(0.01* <x-y,x-y>^2)
Support Vector Expansion :
(normalized) Biodegradation Rate =
3 * K[X(0), X]
+ 3 * K[X(1), X]
+ 3 * K[X(2), X]
+ 2.0091 * K[X(6), X]
+ -3 * K[X(8), X]
+ -3 * K[X(9), X]
+ -3 * K[X(10), X]
+ 3 * K[X(14), X]
+ 3 * K[X(15), X]
+ -3 * K[X(17), X]
+ -2.0091 * K[X(18), X]
+ -3 * K[X(19), X]
+ 0.5254
Number of support vectors: 12
Number of kernel evaluations: 210 (100 % cached)
Time taken to build model: 0.02 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7517
Mean absolute error 0.1133
Root mean squared error 0.1443
Relative absolute error 80.6594 %
Root relative squared error 77.4072 %
Total Number of Instances 20
PART 2: USING 4 ATTRIBUTES
TRANSFORMATION OF DATA:
SCATTER PLOT:
Looking at the plot matrix, we can see that the Freezing Point and Boiling Point values can be used to determine the Melting Point value, as they tend to show a linear relationship. We use a linear regression method to predict the missing value of the Melting Point attribute.
PRINCIPAL COMPONENT ANALYSIS:
=== Run information ===
Evaluator: weka.attributeSelection.PrincipalComponents -R 0.95 -A 5
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: degrade4-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Freeze Pt (K)
Boil Pt ( K)
Solubility
Melt Pt (K)
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (unsupervised):
Principal Components Attribute Transformer
Correlation matrix
1 0.75 -0.42 0.98
0.75 1 -0.11 0.75
-0.42 -0.11 1 -0.42
0.98 0.75 -0.42 1
eigenvalue  proportion  cumulative
2.83154     0.70788     0.70788    0.579 Melt Pt (K) + 0.579 Freeze Pt (K) + 0.493 Boil Pt ( K) - 0.296 Solubility
0.90903     0.22726     0.93514   -0.893 Solubility - 0.446 Boil Pt ( K) - 0.044 Melt Pt (K) - 0.033 Freeze Pt (K)
0.23652     0.05913     0.99427   -0.747 Boil Pt ( K) + 0.41 Freeze Pt (K) + 0.399 Melt Pt (K) + 0.338 Solubility
Eigenvectors
 V1       V2       V3
 0.5786  -0.0326   0.4103  Freeze Pt (K)
 0.4926  -0.4461  -0.7472  Boil Pt ( K)
-0.2956  -0.8933   0.3385  Solubility
 0.579   -0.044    0.3985  Melt Pt (K)
Ranked attributes:
0.29212  1   0.579 Melt Pt (K) + 0.579 Freeze Pt (K) + 0.493 Boil Pt ( K) - 0.296 Solubility
0.06486  2  -0.893 Solubility - 0.446 Boil Pt ( K) - 0.044 Melt Pt (K) - 0.033 Freeze Pt (K)
0.00573  3  -0.747 Boil Pt ( K) + 0.41 Freeze Pt (K) + 0.399 Melt Pt (K) + 0.338 Solubility
Selected attributes: 1,2,3 : 3
This output also shows that the Freezing Point, Boiling Point, and Melting Point attributes are strongly correlated (the Freezing Point/Melting Point correlation is 0.98) and load together on the first principal component, making them the most significant. Hence the Boiling Point value can be used to calculate the missing Freezing Point value. (This method is used mainly because the compound PYRROLE has a lot of missing information.)
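The principal components above come from an eigendecomposition of the attribute correlation matrix. A sketch using the matrix as printed (rounded to two decimals, so the eigenvalues differ slightly from WEKA's):

```python
import numpy as np

# Correlation matrix of Freeze Pt, Boil Pt, Solubility, Melt Pt, as printed above.
corr = np.array([[ 1.00,  0.75, -0.42,  0.98],
                 [ 0.75,  1.00, -0.11,  0.75],
                 [-0.42, -0.11,  1.00, -0.42],
                 [ 0.98,  0.75, -0.42,  1.00]])

eigenvalues = np.linalg.eigvalsh(corr)[::-1]      # sorted descending
proportion = eigenvalues / eigenvalues.sum()      # variance explained per component
print(np.round(eigenvalues, 3), np.round(proportion, 3))
```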
LINEAR REGRESSION – PREDICTION
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 5.0
Relation: degrade2-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R3
Instances: 20
Attributes: 3
Freeze Pt (K)
Boil Pt ( K)
Melt Pt (K)
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Freeze Pt (K) =
0.3387 * Boil Pt ( K) +
0.6416 * Melt Pt (K) +
-63.5711
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 183.9 170.489 -13.411
2 216 248.092 32.092
3 314 264.332 -49.668
4 304.1 288.023 -16.077
5 285.4 280.269 -5.131
6 307.9 295.372 -12.528
7 404 382.563 -21.437
8 395.6 367.348 -28.252
9 267 262.786 -4.214
10 0 233.162 233.162 Predicted Value for the missing Freezing Point
11 262.7 233.033 -29.667
12 298 275.072 -22.928
13 278.7 234.869 -43.831
14 353.5 329.469 -24.031
15 291 314.046 23.046
16 146.9 156.232 9.332
17 184.7 174.632 -10.068
18 260.2 262.664 2.464
19 252.5 238.621 -13.879
20 242 237.025 -4.975
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.7669
Mean absolute error 30.0097
Root mean squared error 56.8357
Relative absolute error 48.9522 %
Root relative squared error 65.0816 %
Total Number of Instances 20
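The fitted model above is just a linear equation, so the imputation can be written as a one-line function. The inputs below are hypothetical, purely to show how an imputed Freeze Pt value (such as the 233.162 for instance 10 above) would be produced from a compound's known Boil Pt and Melt Pt:

```python
def predict_freeze_pt(boil_pt_k, melt_pt_k):
    """The linear regression model fitted above for imputing Freeze Pt (K)."""
    return 0.3387 * boil_pt_k + 0.6416 * melt_pt_k - 63.5711

# hypothetical inputs, not PYRROLE's actual attribute values
print(round(predict_freeze_pt(300.0, 300.0), 4))
```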
METHOD 1: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
LINEAR REGRESSION:
PARAMETERS CHOSEN:
- The value of the missing Freezing Point, which was found to be 233.162, was inserted, and the linear regression analysis is done.
- From the test options choose "Training Set". Performing cross-validation on this dataset does not yield the right results because each tuple is a different compound and the dataset as such is very small.
- Choose "Biodegradation Rate" (Numeric) from the list of attributes.
- 'No attribute selection' is chosen for the attributeSelectionMethod.
- eliminateCollinearAttributes is set to True.
The output is depicted as follows:
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 1 -R 1.0E-8
Relation: degrade4-weka.filters.unsupervised.attribute.Remove-R1
Instances: 20
Attributes: 5
Biodegradation Rate
Freeze Pt (K)
Boil Pt ( K)
Solubility
Melt Pt (K)
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
Biodegradation Rate =
0.0012 * Freeze Pt (K) +
0.0016 * Boil Pt ( K) +
0 * Solubility +
-0.0021 * Melt Pt (K) +
1.2702
Time taken to build model: 0 seconds
=== Predictions on training set ===
inst#, actual, predicted, error
1 1.924 1.777 -0.147
2 2.076 1.725 -0.351
3 1.903 1.794 -0.109
4 1.73 1.729 -0.001
5 1.74 1.764 0.024
6 1.74 1.739 -0.001
7 1.894 1.795 -0.099
8 1.801 1.735 -0.066
9 1.279 1.751 0.472
10 1.531 1.661 0.13
11 1.428 1.632 0.204
12 1.742 1.684 -0.058
13 1.767 1.573 -0.194
14 1.777 1.724 -0.053
15 1.929 1.895 -0.034
16 1.851 1.723 -0.128
17 1.716 1.666 -0.05
18 1.62 1.777 0.157
19 1.643 1.694 0.051
20 1.477 1.729 0.252
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.3562
Mean absolute error 0.1289
Root mean squared error 0.1742
Relative absolute error 91.7856 %
Root relative squared error 93.4393 %
Total Number of Instances 20
METHOD 2: PREDICTING THE VALUE OF BIO-DEGRADATION RATE USING
NEURAL NETWORK
Parameters Chosen:
- The autoBuild is set to 'True', so the hidden layers are added and connected up automatically.
- The parameters learningRate and momentum are set to 0.3 and 0.2 respectively to perform slow learning. These parameters can be overridden in the graphical interface.
- The decay parameter causes the learning rate to decrease with time; it divides the starting value by the epoch number to obtain the current rate. This sometimes improves performance and may stop the network from diverging. It is set to "False".
- The reset parameter automatically resets the network with a lower learning rate and begins training again if it diverges from the answer. It is set to "False".
- The trainingTime parameter sets the number of training epochs. It is set to different values to see at which value the network converges.
- validationSetSize is set to 0, which means no validation set is used.
- normalizeAttributes is set to 'True', which normalizes all the attributes and requires no normalization preprocessing beforehand.
- The other parameters are left at their default values.
Under WEKA we choose MultiLayerPerceptron to run the dataset.