Pecan Analysis Data

10/26/18 9:53 AM

Got excel file NAmPecanAll.xls

Made into ARFF format

Deleted columns that were categorical

SOIL

SOIL2

Made all cols numeric except a few

SEASON 1,2,3,4

CLASS D,W

Pecan 0,1

Made pecans.arff

Header from the “variable names” worksheet

Exported the data as data.csv

Combined (in Word) as pecans.arff
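The header-plus-data merge done by hand in Word could also be scripted. This is only a rough sketch: `build_arff` and the two sample rows are made up for illustration, the real column names come from the “variable names” worksheet, and it assumes every column is numeric except the three nominal ones listed above.

```python
import csv
import io

# Nominal columns and their allowed values, taken from the notes above.
NOMINAL = {"SEASON": ["1", "2", "3", "4"], "CLASS": ["D", "W"], "Pecan": ["0", "1"]}

def build_arff(relation, names, csv_text):
    """Return ARFF text: one @attribute line per column, then the CSV rows."""
    lines = ["@relation " + relation, ""]
    for name in names:
        if name in NOMINAL:
            kind = "{" + ",".join(NOMINAL[name]) + "}"
        else:
            kind = "numeric"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip empty lines
            lines.append(",".join(field.strip() for field in row))
    return "\n".join(lines) + "\n"

# Hypothetical two-row example, not real station data.
sample = build_arff("PresenceOfPecans", ["Site", "SEASON", "MWM", "Pecan"],
                    "1,2,24.5,0\n2,3,27.1,1\n")
print(sample)
```

Generating the file this way would avoid Word touching the numbers or the quoting.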

Trouble at line 191

If we include thru case 4, it’s OK (pecans-small.arff)

If thru 100, not OK (pecans-small2.arff)

1 – 89 pecans-small3 no good

1-79 pecans-small4 – OK

therefore the problem is between cases 80 & 89

looked in the Excel file

starting at #83 a column that should hold 1-4 (presumably SEASON) holds a real number instead

fixed in data.csv only
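The cut-the-file-down search above is a bisection on the number of cases kept. A minimal sketch of the same idea, assuming that once the bad case is included every longer file also fails; `first_bad_row` and `fake_prefix_ok` are hypothetical names, and the fake check stands in for "save cases 1..k as an ARFF file and open it in Weka".

```python
def first_bad_row(n_rows, prefix_ok):
    """Return the 1-based index of the first row that makes loading fail.

    prefix_ok(k) must report whether a file holding only rows 1..k loads.
    Assumes prefix_ok(0) is true and prefix_ok(n_rows) is false.
    """
    lo, hi = 0, n_rows  # invariant: rows 1..lo load, rows 1..hi do not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi

# Stand-in check: pretend case 83 is the first bad one, as found in the notes.
def fake_prefix_ok(k):
    return k < 83

bad = first_bad_row(100, fake_prefix_ok)
print(bad)
```

With 100 cases this needs about 7 trial files instead of guessing cut points by hand.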

blew up on line 2898

fixed RCORR on station 2790

was blank

made it 1.0000

fixed only in pecans.arff

blew up on 2921 = station 2813

it was formatted as 1,032.67

removed the commas in data.csv

also fixed 2790

re-made pecans.arff
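Both blow-ups (the blank RCORR and the 1,032.67 value) are cells that do not parse as plain numbers, so they could be found in one pass before building the ARFF. A rough sketch, assuming the only legitimate non-numeric values are the CLASS codes D and W; `find_bad_cells` and the sample rows are made up for illustration.

```python
import csv
import io

def find_bad_cells(csv_text, nominal_values=("D", "W")):
    """Yield (row_number, column_number, raw_value) for cells that are not plain numbers."""
    for r, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        for c, cell in enumerate(row, start=1):
            value = cell.strip()
            if value in nominal_values:
                continue  # allowed nominal code
            try:
                float(value)  # fails on blanks and on "1,032.67"
            except ValueError:
                yield r, c, cell

# Hypothetical rows imitating the two problems seen above.
sample = '2790,27.1,,0\n2813,"1,032.67",0.98,1\n2814,25.0,1.0,D\n'
bad_cells = list(find_bad_cells(sample))
print(bad_cells)
```

Row and column numbers are 1-based so they can be looked up directly in Excel.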

ran with weka explorer J48 -C 0.25 -M 2

remember to give extra memory

java -Xmx300m -jar weka.jar

J48 classifier: ran 10-fold cross-validation in about 5 mins

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: PresenceOfPecans-weka.filters.unsupervised.attribute.Remove-R1

Instances: 4637

Attributes: 103

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

------

MWM <= 24.28: 0 (2358.0/7.0)

MWM > 24.28

| RLOW <= 20.32: 0 (710.0/3.0)

| RLOW > 20.32

| | LPTOAE <= 0.068

| | | TRANGE <= 24.63

| | | | PERWRET <= 0.5714

| | | | PERWRET > 0.5714

| | | | | | | | | MWM <= 27.67: 0 (5.0)

| | | | | | | | | MWM > 27.67: 1 (6.0/1.0)

| | | TRANGE > 24.63

| | | | LPTOWATR <= 1.0465

| | | | | | | | | | MCM <= -1.17

| | | | | | | | | | MCM > -1.17

| | | | LPTOWATR > 1.0465

| | LPTOAE > 0.068

| | | LCOKLM <= 1.8716: 0 (46.0)

| | | LCOKLM > 1.8716

| | | | ELEV <= 1205

| | | | | | | | | | MWM <= 25.11: 0 (2.0)

| | | | | | | | | | MWM > 25.11

| | | | | | | | | | | | | | MWM <= 27.5: 0 (6.0)

| | | | | | | | | | | | | | MWM > 27.5

| | | | ELEV > 1205

| | | | | | LET <= 1.1751: 0 (52.0)

| | | | | | LET > 1.1751

| | | | | | | MWM <= 27.83: 0 (9.0)

| | | | | | | MWM > 27.83

Number of Leaves : 96

Size of the tree : 185

Time taken to build model: 13.38 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4368 94.1988 %

Incorrectly Classified Instances 269 5.8012 %

Kappa statistic 0.6889

Mean absolute error 0.0635

Root mean squared error 0.2306

Relative absolute error 33.6934 %

Root relative squared error 75.1416 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.969     0.287     0.966       0.969    0.968       0
0.713     0.031     0.73        0.713    0.721       1

=== Confusion Matrix ===

a b <-- classified as

4020 129 | a = 0

140 348 | b = 1

Conclusion:

94% accurate!!!

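Kappa (0.69) is well below the raw accuracy because pecans are rare in the data set: 4149 of the 4637 instances are class 0, so always guessing 0 would already be right about 89% of the time.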
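To see where the kappa value comes from, it can be recomputed from the confusion matrix above. A small sketch using the standard Cohen's kappa formula (observed agreement corrected for the agreement expected by chance); `cohen_kappa` is just an illustrative helper, not part of Weka.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# J48 cross-validation confusion matrix from the run above.
confusion = [[4020, 129], [140, 348]]
kappa = cohen_kappa(confusion)
print(round(kappa, 4))
```

Because class 0 dominates, the chance agreement is already about 0.81, which is why 94% accuracy only gives a kappa near 0.69.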

Should be able to do this on the command line and get the classified instances

(looked in the Weka tutorial)

in the weka directory

java -mx300m weka.classifiers.trees.J48 -C 0.25 -M 2 -t ../PecanData/pecans.arff -d ../PecanData/J48-classifier.model

doesn’t work from command line

can’t find class weka/classifiers/trees/J48

hmmm...

try putting weka.jar on the classpath with -cp weka.jar

and also add -i -k to get more info

java -cp weka.jar -mx300m weka.classifiers.trees.J48 -C 0.25 -M 2 -t ../PecanData/pecans.arff -i -k -d ../PecanData/J48-classifier.model

worked!

Time taken to build model: 12.72 seconds

Time taken to test model on training data: 0.1 seconds

=== Error on training data ===

Correctly Classified Instances 4587 98.9217 %

Incorrectly Classified Instances 50 1.0783 %

Kappa statistic 0.9427

K&B Relative Info Score 412419.0911 %

K&B Information Score 2004.0102 bits 0.4322 bits/instance

Class complexity | order 0 2250.7576 bits 0.4854 bits/instance

Class complexity | scheme 277.1448 bits 0.0598 bits/instance

Complexity improvement (Sf) 1973.6128 bits 0.4256 bits/instance

Mean absolute error 0.019

Root mean squared error 0.0974

Relative absolute error 10.0721 %

Root relative squared error 31.7479 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.994     0.051     0.994       0.994    0.994       0
0.949     0.006     0.949       0.949    0.949       1

=== Confusion Matrix ===

a b <-- classified as

4124 25 | a = 0

25 463 | b = 1

=== Stratified cross-validation ===

Correctly Classified Instances 4373 94.3067 %

Incorrectly Classified Instances 264 5.6933 %

Kappa statistic 0.6949

K&B Relative Info Score 268786.582 %

K&B Information Score 1305.6899 bits 0.2816 bits/instance

Class complexity | order 0 2250.7629 bits 0.4854 bits/instance

Class complexity | scheme 131711.8722 bits 28.4045 bits/instance

Complexity improvement (Sf) -129461.1092 bits -27.9192 bits/instance

Mean absolute error 0.0629

Root mean squared error 0.2301

Relative absolute error 33.3937 %

Root relative squared error 74.9854 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.969     0.281     0.967       0.969    0.968       0
0.719     0.031     0.734       0.719    0.727       1

=== Confusion Matrix ===

a b <-- classified as

4022 127 | a = 0

137 351 | b = 1

looks good!
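The per-class numbers in the cross-validation table follow directly from the confusion matrix, which is a handy check when copying results around. A minimal sketch; `class_metrics` is a made-up helper and it assumes every class is predicted at least once (otherwise precision would divide by zero).

```python
def class_metrics(matrix, k):
    """Return (precision, recall, f_measure) for class k.

    Rows are actual classes, columns are predicted classes.
    """
    true_pos = matrix[k][k]
    actual = sum(matrix[k])                    # everything that really is class k
    predicted = sum(row[k] for row in matrix)  # everything labelled class k
    precision = true_pos / predicted
    recall = true_pos / actual
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Cross-validation confusion matrix from the command-line J48 run above.
cv = [[4022, 127], [137, 351]]
p1, r1, f1 = class_metrics(cv, 1)
print(round(p1, 3), round(r1, 3), round(f1, 3))
```

For the pecan class this reproduces the 0.734 / 0.719 / 0.727 row, so about 28% of the real pecan stations are still being missed.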

now have classifier J48-classifier.model

try to get it to classify the data

labuser% java -cp weka.jar weka.classifiers.trees.J48 -l ../PecanData/J48-classifier.model -T ../PecanData/pecans.arff -p 1

works and gives data lines like

4633 0 0.9970313825275657 0 (4634)

the values are

the instance number (0-indexed)
the predicted value
the confidence in the prediction
the actual value
(the first attribute) – in this case, the station ID

ran to put results into J48-output.txt

opened in excel and made J48output.xls

need to fix

since the station ID comes in wrapped in parentheses, e.g. (4634), Excel enters it as a negative #!

multiplied by -1 and copied values
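The Excel sign problem could be avoided by parsing the prediction lines before they ever reach a spreadsheet. A rough sketch, assuming every line has exactly the five whitespace-separated fields described above; `parse_prediction` is a hypothetical helper.

```python
def parse_prediction(line):
    """Split one prediction line into named fields, stripping the parentheses from the ID."""
    index, predicted, confidence, actual, station = line.split()
    return {
        "instance": int(index),          # 0-indexed instance number
        "predicted": predicted,
        "confidence": float(confidence),
        "actual": actual,
        "station_id": int(station.strip("()")),  # "(4634)" -> 4634
    }

row = parse_prediction("4633 0 0.9970313825275657 0 (4634)")
print(row)
```

Writing these rows back out as CSV gives Excel a positive station ID with no manual fix.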

Tried IB1 (lazy single nearest neighbor); took about 20 mins

=== Run information ===

Scheme: weka.classifiers.lazy.IB1

Relation: PresenceOfPecans

Instances: 4637

Attributes: 104

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

IB1 classifier

Time taken to build model: 0.16 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4392 94.7164 %

Incorrectly Classified Instances 245 5.2836 %

Kappa statistic 0.7212

Mean absolute error 0.0528

Root mean squared error 0.2299

Relative absolute error 28.0327 %

Root relative squared error 74.9065 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.97      0.244     0.971       0.97     0.97        0
0.756     0.03      0.745       0.756    0.751       1

=== Confusion Matrix ===

a b <-- classified as

4023 126 | a = 0

119 369 | b = 1

looks a little better

try K-nearest neighbors – K = 3 (3 nearest neighbors)

=== Run information ===

Scheme: weka.classifiers.lazy.IBk -K 3 -W 0

Relation: PresenceOfPecans

Instances: 4637

Attributes: 104

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

IB1 instance-based classifier

using 3 nearest neighbour(s) for classification

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4415 95.2124 %

Incorrectly Classified Instances 222 4.7876 %

Kappa statistic 0.7449

Mean absolute error 0.0602

Root mean squared error 0.1951

Relative absolute error 31.9603 %

Root relative squared error 63.5862 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.974     0.232     0.973       0.974    0.973       0
0.768     0.026     0.775       0.768    0.772       1

=== Confusion Matrix ===

a b <-- classified as

4040 109 | a = 0

113 375 | b = 1

slightly better still

It might be worth trying a “reduced error pruned tree”

it is supposed to make smaller trees

see if it is better.

runs in less than 10 mins!!

=== Run information ===

Scheme: weka.classifiers.trees.J48 -R -N 3 -Q 1 -M 2

Relation: PresenceOfPecans

Instances: 4637

Attributes: 104

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree

------

MWM <= 24.5: 0 (1662.0/10.0)

MWM > 24.5

| RLOW <= 20.57: 0 (434.0/2.0)

| RLOW > 20.57

| | LPTOAE <= 0.0575

| | | TRANGE <= 26.33: 0 (437.0/27.0)

| | | TRANGE > 26.33

| | | | EXPREY <= 540.4335

| | | | | MCM <= -3.5: 0 (13.0/2.0)

| | | | | MCM > -3.5

| | | | | | BIO5 <= 25632.9578: 1 (25.0/2.0)

| | | | | | BIO5 > 25632.9578: 0 (12.0/3.0)

| | | | EXPREY > 540.4335: 0 (32.0/4.0)

| | LPTOAE > 0.0575

| | | LCOKLM <= 1.9957: 0 (41.0)

| | | LCOKLM > 1.9957

| | | | WATDGRC <= 3

| | | | WATDGRC > 3

| | | | | LET <= 1.1957

| | | | | LET > 1.1957

Number of Leaves : 27

Size of the tree : 53

Time taken to build model: 7.96 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4348 93.7675 %

Incorrectly Classified Instances 289 6.2325 %

Kappa statistic 0.6524

Mean absolute error 0.0729

Root mean squared error 0.2247

Relative absolute error 38.6837 %

Root relative squared error 73.2403 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.972     0.35      0.959       0.972    0.965       0
0.65      0.028     0.729       0.65     0.687       1

=== Confusion Matrix ===

a b <-- classified as

4031 118 | a = 0

171 317 | b = 1

not quite as good as the full tree but it is very fast

try other rule-generating things because they give interpretable output

try JRip

ran in 15 mins

=== Run information ===

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1

Relation: PresenceOfPecans

Instances: 4637

Attributes: 104

[list of attributes omitted]

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

JRIP rules:

======

(MWM >= 26.5) and (BAR5 >= 14.6915) and (PTOAE >= 1.1925) and (ELEV <= 300) => Pecan=1 (82.0/4.0)

(AE >= 652.4943) and (PTOAE >= 1.1295) and (WATDGRC <= 3) and (WRET >= 104.8334) and (ELEV <= 625) => Pecan=1 (72.0/4.0)

(MWM >= 24.6) and (CVRAIN <= 44.3185) and (WSTORAGE >= 181.796) and (ELEV <= 1030) => Pecan=1 (165.0/50.0)

(MWM >= 24.3) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (PTOWATR >= 10.8738) and (Site <= 1517) => Pecan=1 (51.0/4.0)

(AE >= 622.0895) and (COKLM >= 506.9) and (EXPREY <= 520.5728) and (PTOWATR >= 8.7045) => Pecan=1 (59.0/13.0)

(MWM >= 24.8) and (TRANGE >= 24.7) and (RLOW >= 25.91) and (RLOW <= 46.74) => Pecan=1 (52.0/24.0)

(MWM >= 27.22) and (RLOW >= 71.88) and (EXPREY <= 439.1472) and (WRET >= 102.7854) and (TEMP <= 56.0959) => Pecan=1 (15.0/1.0)

(MWM >= 27.44) and (CVRAIN <= 34.6388) and (WSTORAGE <= 161.2) => Pecan=1 (77.0/37.0)

=> Pecan=0 (4064.0/52.0)

Number of Rules : 9

Time taken to build model: 69.58 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4394 94.7595 %

Incorrectly Classified Instances 243 5.2405 %

Kappa statistic 0.7153

Mean absolute error 0.0744

Root mean squared error 0.2155

Relative absolute error 39.4775 %

Root relative squared error 70.2129 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.974     0.275     0.968       0.974    0.971       0
0.725     0.026     0.765       0.725    0.744       1

=== Confusion Matrix ===

a b <-- classified as

4040 109 | a = 0

134 354 | b = 1

about as good as the J48.
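Part of the appeal of JRip is that the rule list can be applied outside Weka: a station gets Pecan=1 if any rule matches, otherwise the default Pecan=0. A minimal sketch that encodes only the first two rules from the output above (the other six are left out for brevity, so this is not the full model); the station values in `example` are invented.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# First two JRip rules from the output above; each rule is a list of conditions.
RULES = [
    [("MWM", ">=", 26.5), ("BAR5", ">=", 14.6915),
     ("PTOAE", ">=", 1.1925), ("ELEV", "<=", 300)],
    [("AE", ">=", 652.4943), ("PTOAE", ">=", 1.1295), ("WATDGRC", "<=", 3),
     ("WRET", ">=", 104.8334), ("ELEV", "<=", 625)],
]

def predict_pecan(station):
    """Return 1 if any rule fires, else the default class 0."""
    for conditions in RULES:
        if all(OPS[op](station[name], limit) for name, op, limit in conditions):
            return 1
    return 0

# Invented station values, for illustration only.
example = {"MWM": 27.0, "BAR5": 15.0, "PTOAE": 1.2, "ELEV": 250,
           "AE": 600.0, "WATDGRC": 2, "WRET": 100.0}
print(predict_pecan(example))
```

Keeping the rules as data rather than nested if-statements makes it easy to paste in the remaining rules later.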

for comparison purposes, do the “null model” = ZeroR (always pick the majority class)

=== Classifier model (full training set) ===

ZeroR predicts class value: 0

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4149 89.476 %

Incorrectly Classified Instances 488 10.524 %

Kappa statistic 0

Mean absolute error 0.1885

Root mean squared error 0.3069

Relative absolute error 100 %

Root relative squared error 100 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
1         1         0.895       1        0.944       0
0         0         0           0        0           1

=== Confusion Matrix ===

a b <-- classified as

4149 0 | a = 0

488 0 | b = 1

only 89% agreement.

so the others are an improvement
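The null-model number needs no classifier at all: it is just the share of the most common class. A small sketch using the class counts from the confusion matrix above; `majority_baseline` is an illustrative helper, not a Weka call.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Class counts from the ZeroR confusion matrix: 4149 without pecans, 488 with.
labels = [0] * 4149 + [1] * 488
label, accuracy = majority_baseline(labels)
print(label, round(100 * accuracy, 3))
```

Any classifier worth keeping has to beat this figure, which is the real yardstick for the 94-95% results above.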

Get some scored data sets for mapping

1) “J48 reduced” = the one from page 11 – using “reduced error pruning”

java -cp weka.jar -mx300m weka.classifiers.trees.J48 -R -N 3 -Q 1 -M 2 -t ../PecanData/pecans.arff -i -k -d ../PecanData/J48-reduced-classifier.model

got this result

Options: -R -N 3 -Q 1 -M 2

J48 pruned tree

------

MWM <= 24.5: 0 (1662.0/10.0)

MWM > 24.5

| RLOW <= 20.57: 0 (434.0/2.0)

| RLOW > 20.57

| | LPTOAE <= 0.0575

| | | TRANGE <= 26.33: 0 (437.0/27.0)

| | | TRANGE > 26.33

| | | | EXPREY <= 540.4335

| | | | | MCM <= -3.5: 0 (13.0/2.0)

| | | | | MCM > -3.5

| | | | | | BIO5 <= 25632.9578: 1 (25.0/2.0)

| | | | | | BIO5 > 25632.9578: 0 (12.0/3.0)

| | | | EXPREY > 540.4335: 0 (32.0/4.0)

| | LPTOAE > 0.0575

| | | LCOKLM <= 1.9957: 0 (41.0)

| | | LCOKLM > 1.9957

| | | | WATDGRC <= 3

| | | | WATDGRC > 3

| | | | | LET <= 1.1957

| | | | | LET > 1.1957

Number of Leaves : 27

Size of the tree : 53

Time taken to build model: 6.6 seconds

Time taken to test model on training data: 0.11 seconds

=== Error on training data ===

Correctly Classified Instances 4453 96.0319 %

Incorrectly Classified Instances 184 3.9681 %

Kappa statistic 0.781

K&B Relative Info Score 287357.2097 %

K&B Information Score 1396.3146 bits 0.3011 bits/instance

Class complexity | order 0 2250.7576 bits 0.4854 bits/instance

Class complexity | scheme 19037.9527 bits 4.1057 bits/instance

Complexity improvement (Sf) -16787.1951 bits -3.6203 bits/instance

Mean absolute error 0.0644

Root mean squared error 0.1852

Relative absolute error 34.1766 %

Root relative squared error 60.3369 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.983     0.232     0.973       0.983    0.978       0
0.768     0.017     0.841       0.768    0.803       1

=== Confusion Matrix ===

a b <-- classified as

4078 71 | a = 0

113 375 | b = 1

=== Stratified cross-validation ===

Correctly Classified Instances 4348 93.7675 %

Incorrectly Classified Instances 289 6.2325 %

Kappa statistic 0.6524

K&B Relative Info Score 233687.7747 %

K&B Information Score 1135.1897 bits 0.2448 bits/instance

Class complexity | order 0 2250.7629 bits 0.4854 bits/instance

Class complexity | scheme 83502.5671 bits 18.0079 bits/instance

Complexity improvement (Sf) -81251.8041 bits -17.5225 bits/instance

Mean absolute error 0.0729

Root mean squared error 0.2247

Relative absolute error 38.6837 %

Root relative squared error 73.2403 %

Total Number of Instances 4637

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.972     0.35      0.959       0.972    0.965       0
0.65      0.028     0.729       0.65     0.687       1

=== Confusion Matrix ===

a b <-- classified as

4031 118 | a = 0

171 317 | b = 1

looks the same as when run from explorer – good!

now, classify the pecan data

java -cp weka.jar weka.classifiers.trees.J48 -l ../PecanData/J48-reduced-classifier.model -T ../PecanData/pecans.arff -p 1 > ../PecanData/J48-reduced-output.txt

open in excel & fix to make J48-reduced-output.xls
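Since the same train-then-classify pair of commands gets repeated for each classifier, the command lines could be generated rather than retyped. A rough sketch that only assembles the argument lists, using the same flags as the commands above; `train_command` and `predict_command` are hypothetical helpers, and actually running them (for example with `subprocess.run`) is left out.

```python
def train_command(scheme, options, arff, model, jar="weka.jar", heap="-mx300m"):
    """Argument list for training a classifier and saving the model with -d."""
    return ["java", "-cp", jar, heap, scheme, *options,
            "-t", arff, "-i", "-k", "-d", model]

def predict_command(scheme, model, arff, jar="weka.jar"):
    """Argument list for loading a saved model (-l) and printing predictions (-p 1)."""
    return ["java", "-cp", jar, scheme, "-l", model, "-T", arff, "-p", "1"]

train = train_command("weka.classifiers.trees.J48",
                      ["-R", "-N", "3", "-Q", "1", "-M", "2"],
                      "../PecanData/pecans.arff",
                      "../PecanData/J48-reduced-classifier.model")
predict = predict_command("weka.classifiers.trees.J48",
                          "../PecanData/J48-reduced-classifier.model",
                          "../PecanData/pecans.arff")
print(" ".join(train))
print(" ".join(predict))
```

Passing the options as separate list items also prevents two flags from running together on the command line.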

2) Do this for the JRip from page 13 as well

java -cp weka.jar -mx300m weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1 -t ../PecanData/pecans.arff -i -k -d ../PecanData/JRip-classifier.model

it gave this output:

Options: -F 3 -N 2.0 -O 2 -S 1