RapidMiner: Online Tutorial + Image Processing

Overview:

• RapidMiner is an open-source environment for data mining and machine learning. It can be used to extract meaning from a dataset. There are hundreds of machine learning operators to choose from, helpful pre- and post-processing operators, descriptive graphical visualizations, and many other features. The environment has a steep learning curve, especially for someone who does not have a background in data mining. This is an example-based tutorial that works through some common data mining tasks with RapidMiner.

RapidMiner (RM, formerly YALE) is available as a stand-alone application for data analysis, and it can also be integrated as a data mining engine into other applications. RapidMiner is a free, flexible, open-source platform implemented in Java, so it runs on every major platform and operating system, and it is easy to integrate your own code into it. RapidMiner represents a new approach to application design in which even very complicated and complex problems become simple.

RapidMiner is today one of the most widely used data mining and predictive analytics solutions worldwide. It has many advantages, including:

  • It is provided freely as open-source software, but also under a commercial license suitable for developing closed-source commercial applications
  • An experienced free community as well as highly professional paid support
  • A mature and growing user and developer community
  • Relatively low development cost compared with other data mining solutions; as open-source software it is especially suitable for research purposes
  • An intuitive, well-arranged, and user-friendly graphical user interface
  • Access to many databases and the ability to read many file formats
  • Modular and well-arranged code
  • A large number of built-in operators for data mining, with more than 250 learning algorithms and many preprocessing, post-processing, and meta-operators
  • A multi-layered data view concept (different views of the same data)
  • A data core similar to a standard DBMS (Database Management System)
  • A powerful scripting language: data mining processes are designed as operator trees with a structure defined in an XML file, and the dataflow ordinarily follows a depth-first traversal (a minimal skeleton of such a file is sketched below)
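As a small illustration of the last point, the following is a minimal sketch of the XML skeleton of a RapidMiner 5 process file; the names and attribute values shown are placeholders rather than content taken from a real process:

<?xml version="1.0" encoding="UTF-8"?>
<process version="5.0">
  <!-- the root operator of the operator tree -->
  <operator activated="true" class="process" name="Root">
    <!-- the nested <process> element contains the child operators -->
    <process expanded="true">
      <!-- each child <operator> may itself nest further operators, so the
           whole design forms an operator tree; <connect> elements describe
           the dataflow between operator ports -->
    </process>
  </operator>
</process>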

RAPIDMINER

  • Created by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence unit of the University of Dortmund, Germany, in 2001
  • In 2006, Mierswa and Klinkenberg founded the company Rapid-I (the main contributor)
  • International developers contribute to further development in partnership with Rapid-I
  • A world-leading open-source solution for knowledge discovery and data mining
  • Runs as a stand-alone application for data mining and as a data mining engine
  • Notable users of the product include IT companies such as IBM, HP, and Cisco, as well as Bank of America, Ford, Honda, and others
  • In 2009, ranked second among data mining/business analytics tools

RapidMiner capabilities

  • Provides an environment for machine learning
  • Data mining
  • Text mining
  • Predictive analytics
  • Business analytics
  • Rapid prototyping, research, training, industrial applications, and application development
  • Data loading and transformation (ETL)
  • Data preprocessing and visualization
  • Modeling, evaluation, and development
  • Analyzing data generated by high-throughput instruments used in genotyping, proteomics, and mass spectrometry
  • Various plugins available
  • Can be called from programs written in other languages via its Java API
RapidMiner provides comprehensive data mining capabilities, including:
  • Data Integration, Analytical ETL, Data Analysis, and Reporting in a single suite
  • A powerful but intuitive graphical user interface for the design of analysis processes
  • Repositories for process, data, and metadata handling
  • The only solution with metadata transformation: instead of trial and error, expected results can be inspected at design time
  • The only solution that supports on-the-fly error recognition and quick fixes
  • Complete and flexible: hundreds of data loading, data transformation, data modeling, and data visualization methods

RapidMiner Image Processing Extension

  • This add-on package provides capabilities including: local image feature extraction, segmentation-based feature extraction, global image feature extraction, extracting features from single or multiple images, rotation-invariant template detection in images, point of interest detection, image transforms, color mode transforms, noise reduction, image segmentation, and object detection with object detector training (Haar-like features)
  • Trainable image segmentation
  • 3D image segmentation
  • Point of interest detection
  • Trainable segment selection
  • Training image object detector

Selected features

  • Feature extraction: local, segment, and global level
  • Object detection
  • Image classification
  • Image segmentation
  • Trainable image segmentation
  • Similarity measurement between images
  • Point of interest detection
  • Image processing
  • and many others…
Why use it?
  • The free version has enough functionality for a small business to avoid the big-name options
  • It is a quality tool, as its ranking among the commercial products shows
  • The GUI is very user friendly and is used to design data mining processes, which are stored as XML files
  • XML standardization is great for utilizing various data sources
  • Ease of use and available tutorials
  • Works on any operating system

Why not use it?

  • Some features are not available in the free product, but you can upgrade
  • Possibly less customer support is available for the free version
  • There can be some restrictions on customized use

Conclusion

For a small business, RapidMiner might provide the competitive edge needed to compete with larger organizations that have deeper pockets for BI/DW (Business Intelligence and Data Warehousing).

For students, it is a great tool and a good opportunity to see BI/DW tools in action.

There are affordable tools out there, but you still have to learn the underlying concepts of BI/DW.

INDEX

Installation.

RapidMiner 5.0 Tutorial

◦ Example 1: Decision Tree.

◦ Example 2: Association Rules.

◦ Example 3: Stacking.

◦ Example 4: K-Means.

◦ Example 5: Visualization of SVM.

◦ Example 6: Filling missing values.

◦ Example 7: Noise generator.

◦ Example 8: Union of example sets.

◦ Example 9: Numerical Cross Validation.

◦ Example 10: Cost-sensitive learning and ROC plots.

◦ Example 11: Learning with asymmetric costs.

◦ Example 12: Cost-Sensitive Learning.

◦ Example 13: Principal Component Analysis.

◦ Example 14: Forward Selection.

◦ Example 15: Multi-objective Selection.

◦ Example 16: Wrapper Validation.

◦ Example 17: Yagga.

◦ Example 18: Setting attributes resulting from Yagga.

◦ Example 19: Generation of User Defined Features.

◦ Example 20: Evolutionary Balancing.

◦ Example 21: Viewing the Data Set and Weights.

◦ Example 22: Optimizing Parameters.

◦ Example 23: Enabler Operators.

◦ Example 24: Threshold Weighting.

◦ Example 25: Test of Significance.

◦ Example 26: Calculations based on groups.

Appendix: Description of the operators used in the RM5 Tutorial

◦ 1. Data Transformation → Aggregation → Aggregate

◦ 2. Data Transformation → Attribute Set Reduction and Transformation → Generation → Generate Attributes

◦ 3. Data Transformation → Attribute Set Reduction and Transformation → Generation → Generate ID

◦ 4. Data Transformation → Attribute Set Reduction and Transformation → Generation → Optimization → Optimize by Generation (Yagga)

◦ 5. Data Transformation → Attribute Set Reduction and Transformation → Principal Component Analysis

◦ 6. Data Transformation → Attribute Set Reduction and Transformation → Selection → Optimization → Optimize Selection

◦ 7. Data Transformation → Attribute Set Reduction and Transformation → Selection → Optimization → Optimize Selection (Evolutionary)

◦ 8. Data Transformation → Attribute Set Reduction and Transformation → Selection → Select Attributes

◦ 9. Data Transformation → Attribute Set Reduction and Transformation → Selection → Select by Weights

◦ 10. Data Transformation → Attribute Set Reduction and Transformation → Selection → Work on Subset

◦ 11. Data Transformation → Attribute Set Reduction and Transformation → Transformation → Singular Value Decomposition

◦ 12. Data Transformation → Data Cleansing → Replace Missing Values

◦ 13. Data Transformation → Filtering → Filter Examples

◦ 14. Data Transformation → Name and Role Modification → Rename

◦ 15. Data Transformation → Name and Role Modification → Rename by Replacing

◦ 16. Data Transformation → Name and Role Modification → Set Role

◦ 17. Data Transformation → Set Operations → Append

◦ 18. Data Transformation → Set Operations → Join

◦ 19. Data Transformation → Sorting → Sort

◦ 20. Data Transformation → Type Conversion → Discretization → Discretize by Frequency

◦ 21. Data Transformation → Type Conversion → Discretization → Nominal to Binominal

◦ 22. Data Transformation → Value Modification → Numerical Value Modification → Normalize

◦ 23. Evaluation → Attributes → Performance (Attribute Count)

◦ 24. Evaluation → Attributes → Performance (CFS)

◦ 25. Evaluation → Performance Measurement → Classification and Regression → Performance (Binominal Classification)

◦ 26. Evaluation → Performance Measurement → Classification and Regression → Performance (Classification)

◦ 27. Evaluation → Performance Measurement → Classification and Regression → Performance (Regression)

◦ 28. Evaluation → Performance Measurement → Performance

◦ 29. Evaluation → Performance Measurement → Performance (Min-Max)

◦ 30. Evaluation → Performance Measurement → Performance (User-Based)

◦ 31. Evaluation → Significance → ANOVA

◦ 32. Evaluation → Significance → T-Test

◦ 33. Evaluation → Validation → Split Validation

◦ 34. Evaluation → Validation → X-Validation

◦ 35. Evaluation → Validation → Wrapper-X-Validation

◦ 36. Export → Attributes → Write Constructions

◦ 37. Export → Attributes → Write Weights

◦ 38. Export → Other → Write Parameters

◦ 39. Import → Attributes → Read Constructions

◦ 40. Import → Attributes → Read Weights

◦ 41. Import → Other → Read Parameters

◦ 42. Modeling → Association and Item Set Mining → Create Association Rules

◦ 43. Modeling → Association and Item Set Mining → FP-Growth

◦ 44. Modeling → Attribute Weighting → Optimization → Optimize Weights (Evolutionary)

◦ 45. Modeling → Attribute Weighting → Weight by Chi Squared Statistic

◦ 46. Modeling → Classification and Regression → Bayesian Modeling → Naive Bayes

◦ 47. Modeling → Classification and Regression → Function Fitting → Linear Regression

◦ 48. Modeling → Classification and Regression → Lazy Modeling → k-NN

◦ 49. Modeling → Classification and Regression → Meta Modeling → MetaCost

◦ 50. Modeling → Classification and Regression → Meta Modeling → Stacking

◦ 51. Modeling → Classification and Regression → Support Vector Modeling → Support Vector Machine

◦ 52. Modeling → Classification and Regression → Support Vector Modeling → Support Vector Machine (LIBSVM)

◦ 53. Modeling → Classification and Regression → Tree Induction → Decision Tree

◦ 54. Modeling → Clustering and Segmentation → k-Means

◦ 55. Modeling → Model Application → Apply Model

◦ 56. Modeling → Model Application → Group Models

◦ 57. Modeling → Model Application → Thresholds → Apply Threshold

◦ 58. Modeling → Model Application → Thresholds → Find Threshold

◦ 59. Modeling → Model Application → Ungroup Models

◦ 60. Process Control → Branch → Select Subprocess

◦ 61. Process Control → Loop → Loop Attributes

◦ 62. Process Control → Loop → Loop Values

◦ 63. Process Control → Parameter → Optimize Parameters (Grid)

◦ 64. Process Control → Multiply

◦ 65. Process Control → Parameter → Set Parameters

◦ 66. Repository Access → Retrieve

◦ 67. Repository Access → Store

◦ 68. Utility → Data Generation → Add Noise

◦ 69. Utility → Data Generation → Generate Data

◦ 70. Utility → Logging → Log

◦ 71. Utility → Macros → Extract Macro

◦ 72. Utility → Macros → Set Macro

◦ 73. Utility → Miscellaneous → Free Memory

◦ 74. Utility → Miscellaneous → Materialize Data

◦ 75. Utility → Subprocess

REFERENCES

Installation

RapidMiner webpage:

To download RapidMiner CE (community edition) go to this link:

For Windows: RapidMiner-0.0.000x00-install.exe

For Unix environments: RapidMiner-0.0.000.zip

To view an online manual about installing RapidMiner, follow this link: i.com/content/view/17/211/lang in /

To run RapidMiner you basically have to follow the steps below: open a console or terminal, change to the RapidMiner home directory, and run:

java -jar lib/rapidminer.jar

RapidMiner 5.0 Tutorial

This tutorial shows the basics of RapidMiner and the configuration of simple processes that can be performed with it. The user should have some knowledge of data mining and ETL.

Whenever this tutorial refers to the "RapidMiner Tutorial", it means the available printed version.

You should first read the RapidMiner Tutorial chapter for better motivation, but you can also try to start with the online tutorial without reading the printed version. Please read the text carefully and try at least the suggested steps.

Please note:

Most RapidMiner elements provide additional information if you rest the mouse on them for a few moments (tooltip texts). This also applies to all operators and parameters. At the end of this tutorial there is an appendix with descriptions of the operators used in and referenced by it.

Below, a series of examples is presented, each of which requires that you create a new document. To do this, select the corresponding icon in the menu bar.

This will open a new document window, which will show the following:

Here you can select the repository location in which to save the document, as well as the name it will have. Alternatively, you can press the "Cancel" button to start working temporarily without saving the document.

Example 1: Decision Tree.

This process begins with the data load. After the input operator completes, a typical learning step is performed. Here we use an implementation of a decision tree learner that can also handle numeric values (similar to the well-known C4.5 algorithm).

Each operator may require some inputs and deliver some outputs. These inputs and outputs are passed between operators. In this example the first operator (Retrieve) requires no input and delivers an example set as output. This example set is taken by the learner, which delivers the final output: the learned model.

Because this data flow is linear, the designed process is called an "operator chain." Later we will see more sophisticated processes in the form of an operator tree.

1. In the left pane, select the "Operators" tab. Then select the operator Repository Access → Retrieve and drag it into the work area.

2. In the "Parameters" in the right pane, use the browser to the right of the parameter

repository entry to locate the file / / Samples / data / Golf.

This image shows some of the views available in RapidMiner. To enable or disable a view, use the menu View → Show View; to restore the default view, select View → Restore Default Perspective.

3. In the left pane, select the operator Modeling → Classification and Regression → Tree Induction → Decision Tree and drag it into the work area.

4. Connect the output of the Retrieve operator to the input of the Decision Tree operator by left-clicking the out (output) connector of the first and then clicking the tra (training set) connector of the second.

5. Similarly, connect the mod (model) output of the Decision Tree operator to the res (result) port of the work area.

6. Press the "Run" icon in the icon bar at the top of the frame. The process should start, and after a short time the message display at the bottom of the frame shows that the process completed successfully. The main frame switches to the "Results" perspective, which shows the learned decision tree (a hypothesis, which RapidMiner calls a Model).

7. Return to edit mode either via the menu item View → Perspectives → Design, via the corresponding icon in the icon bar, or by pressing the F8 function key.

In this example we built a predictive model that tells us whether or not to play tennis, based on data collected from past experience. To view this data, double-click the "Golf" table in the "Repositories" tab on the right. Another tab, called "ExampleSet (//Samples/data/Golf)", appears between the "Result Overview" and "Tree (Decision Tree)" tabs in the results view. Select the Data View.

The first column is the case identifier, the second is the target attribute, and the remaining columns are predictor attributes.

Now you can use this model to predict whether or not you should play tennis. For instance, for (Outlook = sunny, Temperature = 82, Humidity = 90, Wind = true) the answer is NO.
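For reference, the process designed in steps 1 to 7 is stored by RapidMiner as an XML file. The following is a minimal sketch of such a file, assuming RapidMiner 5 operator class names; the compatibility attributes and GUI layout coordinates that a real saved process contains are omitted, so treat it as an illustration rather than a file to import verbatim:

<?xml version="1.0" encoding="UTF-8"?>
<process version="5.0">
  <operator activated="true" class="process" name="Root">
    <process expanded="true">
      <!-- load the Golf sample data from the repository -->
      <operator activated="true" class="retrieve" name="Retrieve">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <!-- learn a decision tree (C4.5-like) from the example set -->
      <operator activated="true" class="decision_tree" name="Decision Tree"/>
      <!-- wiring: Retrieve out -> Decision Tree tra, Decision Tree mod -> res -->
      <connect from_op="Retrieve" from_port="output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
    </process>
  </operator>
</process>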

8. Replace the learner with another learning scheme for classification tasks. Right-click the Decision Tree operator and select Replace Operator → Modeling → Classification and Regression → Rule Induction → Rule Induction. After running the process changed in this way, the new model is presented:

Example 2: Association Rules.

This process uses two important preprocessing operators: first, the frequency discretization operator, which discretizes numeric attributes by placing their values into intervals of equal size; second, the nominal-to-binominal filter operator, which creates for each possible nominal value of a polynominal attribute a new binominal (binary) feature that is true if the instance has that particular nominal value.

These preprocessing operators are needed because certain learning schemes cannot handle attributes with certain types of values. For example, the highly efficient frequent item set mining operator FP-Growth used in this process setup can only handle binominal features, not numerical or polynominal ones.

The next operator is the frequent item set mining operator FP-Growth. This operator efficiently calculates sets of attribute values that frequently occur together. From these so-called frequent item sets, the most confident rules are calculated with the Create Association Rules operator.
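As an illustration, this operator chain could look roughly like the following process XML sketch. It ignores the Subprocess ("Preprocessing") wrapper used in the steps below and omits layout details; the operator class names and port names are assumed from RapidMiner 5 conventions, so verify them against your installation:

<?xml version="1.0" encoding="UTF-8"?>
<process version="5.0">
  <operator activated="true" class="process" name="Root">
    <process expanded="true">
      <!-- load the Iris sample data -->
      <operator activated="true" class="retrieve" name="Retrieve">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <!-- put numeric values into equal-sized intervals -->
      <operator activated="true" class="discretize_by_frequency" name="Discretize by Frequency"/>
      <!-- turn each nominal value into a binominal (binary) feature -->
      <operator activated="true" class="nominal_to_binominal" name="Nominal to Binominal"/>
      <!-- find attribute values that frequently occur together -->
      <operator activated="true" class="fp_growth" name="FP-Growth"/>
      <!-- derive the most confident rules from the frequent item sets -->
      <operator activated="true" class="create_association_rules" name="Create Association Rules"/>
      <connect from_op="Retrieve" from_port="output" to_op="Discretize by Frequency" to_port="example set input"/>
      <connect from_op="Discretize by Frequency" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
      <connect from_op="Nominal to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
    </process>
  </operator>
</process>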

Note: To easily locate an operator in the operator tree, you can type its name into the filter box of the "Operators" view.

1. Add the Retrieve operator to the work area and use the browser of the repository entry parameter to locate the file //Samples/data/Iris.

2. Add the Utility → Subprocess operator. Rename it to "Preprocessing" by right-clicking and selecting "Rename", or by pressing <F2>.

3. Connect the output of the Retrieve operator to the input of the Preprocessing (Subprocess) operator and then double-click the latter (note the button at the top of the frame, next to "Process", that lets you switch between the process and its subprocesses). In the Nested Chain panel of the sublevel, add the following operators:

3.1 Data Transformation → Type Conversion → Discretization → Discretize by Frequency.