Data Mining: e-learning course

General introduction

1.  Slide: Title

Data Mining, or from OLAP to CBR as an optimization problem

2009/2010, 1st semester

Lecturer: L.P.

Supported by miau (http://miau.gau.hu)

2.  Slide: Greeting

Welcome, ladies and gentlemen!

It is a great pleasure to welcome you to the online data mining course at the virtual university of miau (my-x). You are in the right place if you have come from abroad to study interdisciplinary topics in English, or if you want to write a study of high quality. This course also supports you in developing online knowledge transfer solutions, from OLAP to online similarity analysis. Each keyword in this course is highlighted so that its background explanation is immediately available.

3.  Slide: About us I-II.

About the University of the Future: This course was created to approximate the vision of a sustainable university. Students at the University of the Future have to work! Led by their teachers, they work on projects defined by potential employers. Good performance should be rewarded just as it would be on the market. Students can also bring in real problems of their own; in this case, marketing topics become part of the interdisciplinary education. The realized income ensures two pillars of sustainability: first, students should learn hardly anything that is irrelevant for them; second, the education is stable from an economic point of view.

References: For 20 years we have been collecting studies and demos by students within the online services of miau. We have a self-managed wiki module including a special online encyclopedia. We provide online analysis tools and OLAP solutions for working online in groups.

4.  Slide: Aims of this course

Building on previous projects, you can now follow how to prepare (e.g. with pivot tables) and manage (see OLAP) the necessary project databases, how to define similarity problems including their controlling aspects, how to run online analyses, and how to interpret and describe the calculated results in correct text and (preferably) in an online expert system. By the end of the course you will know each step needed to manage planning, decision making and forecasting successfully.

5.  Slide: 1001 nights – to be continued

If you are all ready to go on, let us set off into the world of fables…

Once upon a time, in the far Orient, there lived a Sultan and his 5 Scholars. In his time there was no greater Sultan than he…

The Sultan created a function. This function had 5 variables: one input channel for each Scholar. Understanding the problem can be supported by a visual aid:

“Let us bring a bowl filled with 100 identical diamonds! Each Scholar should take a handful of diamonds. The last one takes the rest! Each Scholar has to count his gems and say how many he has.”


Once the Sultan knows the 5 numbers (the distribution ratio), he uses his function to calculate a corresponding response value. This process can be repeated, e.g. 20 times. In each further iteration the Scholars should bring a distribution of the diamonds that leads, through the function, to a new maximal response value.

If the Scholars cannot bring the required solution, they have not proved to be good experts. It does not matter who among them has the best approximation: the Sultan evaluates the Scholars collectively. If they bring a good solution, they may take all the diamonds; otherwise they will die the next day.

Would you like to play this game? Or would you find the risk too high?
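The Scholars' task in the fable can be read as a search problem: find the split of 100 diamonds among 5 inputs that maximizes an unknown function, using only a limited number of tries. The following is a minimal sketch under stated assumptions. The fable does not reveal the Sultan's function, so `sultan_function` and its weights are invented purely for illustration, and the simple "keep only improvements" strategy is one possible way to play, not the method taught in this course.

```python
import random

TOTAL = 100      # diamonds in the bowl
SCHOLARS = 5     # one input channel per Scholar


def sultan_function(shares):
    """Hypothetical response function (assumption: the real one is unknown)."""
    weights = [1, 3, 2, 5, 4]  # invented for illustration only
    # Square-root terms reward spreading the diamonds over several Scholars.
    return sum(w * s ** 0.5 for w, s in zip(weights, shares))


def propose(shares, rng):
    """Move up to 10 diamonds from one Scholar to another; the total stays 100."""
    new = list(shares)
    giver, taker = rng.sample(range(SCHOLARS), 2)
    moved = rng.randint(0, min(10, new[giver]))
    new[giver] -= moved
    new[taker] += moved
    return new


def play(rounds=20, seed=0):
    """Run the game for a fixed number of rounds and return the best split found."""
    rng = random.Random(seed)
    best = [TOTAL // SCHOLARS] * SCHOLARS      # start from an equal split
    best_value = sultan_function(best)
    for _ in range(rounds):
        candidate = propose(best, rng)
        value = sultan_function(candidate)     # the Sultan announces the response
        if value > best_value:                 # keep only improvements
            best, best_value = candidate, value
    return best, best_value


if __name__ == "__main__":
    shares, value = play()
    print(shares, round(value, 2))
```

With only 20 rounds the result is merely the best split seen so far; nothing guarantees that it is the true maximum, which is exactly the Scholars' risk.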

6.  Slide: Do you know?

Do you know that this fable describes the challenges of decoding, or even the holistic logic of science itself?

Do you know whether a prediction should in general be better for the shorter term or for the longer term? <jump to demo>

Do you know whether an analysis based on more data records should be more correct? <jump to demo>

Do you know whether an analysis tested on a larger number of cases should fit better than one without testing? <jump to demo>

7.  Slide: Theoretical background or heresy?

A phenomenon can only be labeled science if it can be transformed into program code (e.g. a chess automaton). Every other phenomenon belongs to artistic performance (e.g. studies, lectures).

Human intuition brings the good ideas. But human intuition does not seem to be the only kind (cf. K. Lorenz). All living creatures on earth have sensors to measure their (inner and outer) environment. The measured values are continuously interpreted in order to find connections between causes and reactions. “Heureka!” was already cried at the very beginning of life!

Data mining has to deliver possible connections based on the measured records. On the one hand, the problem is to infer, as far as possible, what kind of connections (functions) will bring the most correct predictions in the future. On the other hand, the functions that are best in general will not necessarily deliver appropriate values in the next application/decision. What should be defined as the fittest function has not yet been clarified. It seems that the truth (the good decision) often has to be approximated by wrong functions…

What is more: nobody is worried, yet hardly anyone looks for production functions in the libraries… Why? Is not the whole of economics (incl. business planning, or studies searching for impacts) based on these functions?

But before we start any analysis, everybody has to learn why the products of the old Gutenberg Galaxy (the books) should be transformed, as far as possible, into rule systems instead of artistic text streams, which from a combinatorial point of view are in most cases incorrect (cf. overlapping effects, gaps – not to be customized).

8.  Slide: Didactical background

To reach real-time speed in analysis (without special analysis tools), students first have to learn the most robust CBR logic. Case-based reasoning (CBR) can also be used for calculating predictions, for building explanations/simulations, or for benchmarking (e.g. price-performance analysis). A CBR algorithm can be defined offline as an MS Excel solution (based on Solver), or as an online solution (e.g. with LPS). Decision trees <jump to demo>, artificial neural networks <jump to demo> (as function types), MCM <jump to demo> and genetic algorithms <jump to demo> (as search strategies) will be introduced at the end of this course; on the market they are often offered as black-box solutions (cf. not free to adapt).
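To make the idea of predicting from similar past cases concrete, here is a minimal sketch. It is not the Solver-based or LPS-based CBR logic of this course, whose details are not given here; a plain nearest-neighbour retrieval is swapped in for illustration. The function names, the case base and all numbers are assumptions invented for the example.

```python
def distance(a, b):
    """Euclidean distance between two attribute vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def predict(case_base, query, k=2):
    """Average the known outcomes of the k stored cases most similar to the query."""
    ranked = sorted(case_base, key=lambda case: distance(case[0], query))
    nearest = ranked[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)


# Hypothetical case base: (attribute vector, known outcome). Invented numbers.
CASES = [
    ((1.0, 1.0), 10.0),
    ((2.0, 1.0), 14.0),
    ((5.0, 5.0), 40.0),
    ((6.0, 5.0), 44.0),
]

if __name__ == "__main__":
    print(predict(CASES, (1.5, 1.0)))
```

In practice the attributes would first have to be brought to comparable scales, otherwise the attribute with the largest numbers dominates the distance.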

9.  Slide: Outline

Chapters of the course:

- The world can be interpreted in the form of Object-Attribute Matrices (OAM)!

- Anomalies of data asset management (Why is the preparation of an OAM so slow?)

- Preparing OLAP databases (do it yourself if nobody else wants to)

- Using OLAP-techniques for OAM (efficiency as the highest priority)

- How it is made: expert systems (rules as a universal solution)

- CBR patterns (OAM from time series, in benchmarking, or for production functions)

- Solver (be free offline)

- COCO (component based object comparison // be free online)

- Interpretation of results (chess automata for context-free situations)

- Standard expectations of studies (What may you not do, and what do you have to do, for a good study?)

10.  Slide: Online tutorial

See http://miau.gau.hu

11.  Slide: Forum

Use miau-wiki for discussions

12.  Slide: Online literature

See http://miau.gau.hu

13.  Slide: Online tests

See: http://gik.hu/regioktatas

14.  Chapter 1: The universal OAM

Do you believe that most critical articles or reports in the media implicitly point to an OAM? E.g.: Is the central bank's interest rate in Hungary too high? Do we have too many unemployed people in the region? Etc. <jump to demo>

Comparison! This is the keyword of human knowledge! We continuously make similarity analyses. However, we do so very spontaneously. This is the first phenomenon that we should examine now. How consistent are we? Can we always see immediately which offer has a trivial inconsistency in its price-performance relation? Another problem is subjectivity: what might, e.g., the importance of attributes mean? What do the weights of attributes do? Can they be set arbitrarily? <jump to demo>

In general, everybody is able to say whether a VW is better than a DACIA. However, many such simple (pairwise) comparisons can lead to illogical series of evaluations, e.g. A>B, B>C, C>A… <jump to demo>
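Such an illogical series can be found mechanically. As a minimal sketch, assume each pairwise judgement is stored as a (better, worse) pair; the judgements then form a directed graph, and any cycle in it is an inconsistency. The function name `find_cycle` and the data layout are assumptions made for this illustration.

```python
def find_cycle(preferences):
    """Return a list of items forming a preference cycle, or None if there is none.

    preferences: iterable of (better, worse) pairs.
    """
    graph = {}
    for better, worse in preferences:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []  # items on the current depth-first search path

    def visit(node):
        state[node] = ACTIVE
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == ACTIVE:           # back to an item on the path: cycle
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(find_cycle([("A", "B"), ("B", "C"), ("C", "A")]))
    print(find_cycle([("A", "B"), ("B", "C"), ("A", "C")]))
```

For the series A>B, B>C, C>A the sketch reports the cycle A, B, C, A; for a consistent series it reports nothing.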

On the other hand, it is clear to everyone that one does not buy an offer which is overall not better but more expensive than the competing offer. <jump to demo>
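This rule of thumb can be written down as a simple check. In the sketch below, an offer counts as dominated if some rival is at least as good on every performance attribute and is cheaper. The offer names, attributes and numbers are invented for illustration, and "higher is better" is assumed for every performance attribute.

```python
def is_dominated(offer, rival):
    """True if the offer is nowhere better than the rival but more expensive."""
    no_better = all(
        own <= other
        for own, other in zip(offer["performance"], rival["performance"])
    )
    return no_better and offer["price"] > rival["price"]


def dominated_offers(offers):
    """Names of all offers that are dominated by at least one other offer."""
    return [
        offer["name"]
        for offer in offers
        if any(is_dominated(offer, rival) for rival in offers if rival is not offer)
    ]


# Hypothetical offers; performance attributes are "higher is better".
OFFERS = [
    {"name": "A", "price": 100, "performance": (5, 5)},
    {"name": "B", "price": 120, "performance": (5, 4)},
    {"name": "C", "price": 150, "performance": (9, 7)},
]

if __name__ == "__main__":
    print(dominated_offers(OFFERS))
```

Offer B is nowhere better than A yet costs more, so it is the one nobody should buy; C is more expensive than A but also better, so the check alone cannot judge it.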

Most people do not have the level of “autism” needed to always understand what system might lie behind a lot of numbers.

What can we do if we are able to draw the necessary logical conclusions for solving comparison problems only in special cases?

As far as I can see, there are two options. First, we believe that we always do it well instinctively if the situation is important enough. Second, we concentrate on writing a computer program for the problem; afterwards we use this program as our “conserved knowledge” and concentrate on identifying new problems and transforming them into program code, or on continuously improving old code. <jump to demo>

As you surely know and (I hope) accept, I am for option 2. Why? There is no longer any living (human) being capable of overpowering a chess automaton. The best chess automata include all the human knowledge that has accumulated since the invention of the game.

As said earlier, we humans are first of all spontaneous. This is our genetic advantage. We are not robots (though we can construct them <jump to demo>) made to repeat already known steps; we are born to be intuitive as well. Intuitive life is a kind of adrenaline bath for our brain.

Here we should change direction and clarify which disadvantages explain why ordinary people in the 21st century still cannot live more as Creators than as Pretenders.

15.  Slide: Chapter 2. – Data assets management

Constitutional and practical problems in the field of non-profit data assets …

16.  Slide:

17.  Slide:

18.  Slide:

19.  Slide:

20.  Slide:

21.  Slide: