Boris Mirkin

School of Computer Science and Information Systems, Birkbeck University of London UK

Introduction to Data Analysis:

Correlation, Summarization and Visualization

Contents:

0. Introduction 3

0.1. Computational data analysis: Main problems

0.2. Data visualization

0.3. Case study problems

1. Summarizing and visualizing a single feature: 1D analysis 19

1.1. Quantitative feature: Distribution and histogram

1.2. Further summarization: centers and spreads

1.3. Binary and categorical features

Project 1.1 Analysis of a multimodal distribution

Project 1.2 Data mining with a confidence interval: Bootstrap

Project 1.3 K-fold cross-validation

1.4. Modeling uncertainty: Interval and fuzzy sets

2. Correlation, summarization and visualization in 2D 43

2.1. Two quantitative features

2.1.1. Scatter plot

2.1.2. Linear regression

2.1.3. Validation

2.1.4. Evolutionary regression

Project 2.1. 2D analysis, linear regression and bootstrapping

Project 2.2 Non-linear evolutionary regression versus its linearized version

2.2. Mixed scale case: Nominal and quantitative features

2.3. Case of two nominal features

2.3.1. Contingency table

2.3.2. Logical statements from contingency tables

2.3.3. Quetelet indices and visualization of X²

3. Multivariate correlation with decision rules 78

3.1. Decision structure and decoder

3.2. Linear regression

3.3. Linear discrimination and SVM

3.3.1. Linear discrimination and Bayesian decision rule

3.3.2. Support vector machine criterion

3.3.3. Kernels

3.4. Decision trees

3.5. Learning neural networks

3.5.1. Artificial neuron, perceptron and neural network

3.5.2. Learning multilayer networks

3.5.3. Back-propagation algorithm

Project: Learning petal sizes from sepal sizes in Iris

4. Multivariate summarization: Principal Component Analysis 107

4.1. General

4.1.1. Decision structure and decoder

4.1.2. Data recovery criterion

4.1.3. Data standardization

4.2. Principal component analysis

4.2.1. PCA method and its usage

Mathematical model and its properties

4.2.2. Applications: Latent semantic analysis

4.2.3. Applications: Correspondence analysis

4.2.4. Is there a right number of components?

5. Categorical summarization: K-Means clustering 145

5.1. K-Means clustering

5.1.1. Batch K-Means partitioning

5.1.2. Incremental K-Means

5.1.3. Nature inspired algorithms for K-Means

5.1.4. Partition around medoids (PAM)

5.1.5. Initialization of K-Means

5.1.6. iK-Means: Iterated Anomalous pattern to precede K-Means

5.2. Cluster interpretation aids

5.3. Extensions of K-Means to different cluster structures 185

5.3.1. Fuzzy clustering

5.3.2. Mixture of distributions and Expectation-Maximization EM algorithm

5.3.3. Kohonen’s self-organizing maps (SOM)

6. Categorical summarization: Hierarchical clustering

6.1. General

6.2. Agglomerative clustering.

6.3. Divisive and conceptual clustering

6.4. Single linkage clustering, connected components and MST 206

7. Categorical summarization: Approximate and spectral clustering for network data 214

7.1. Approximate clusters

7.1.1. Additive clusters

7.1.2. One cluster clustering

7.1.3. Additive and disjunctive clusters

7.2. Spectral clustering

7.2.1. Partitioning criteria as Rayleigh ratios

7.2.2. Laplacian transformation

7.2.3. Spectral clusters

Appendix

A.1. Basics of multivariate entities

A.2. Basic optimization

A.3. Basic MatLab

A.4. Computational validation

Introduction

0.1. Computational data analysis

This course is unique in that it looks at data from the inside rather than from the outside, which is the conventional perspective in Computer Science.

The term Data Analysis has been around for quite a while, starting from earlier developments in cluster analysis and other multivariate techniques, which brought forth the concepts of “exploratory” data analysis and “confirmatory” data analysis (see, for example, Tukey (1977)). The former covered a set of techniques for finding patterns in data, and the latter covered classical statistics approaches to hypothesis testing. “A possible definition of data analysis is the process of computing various summaries and derived values from the given collection of data,” states D. Hand in Berthold and Hand (2003), p. 3, adding that the process may become more intelligent if attempts are made to automate some of the reasoning of skilled data analysts and/or to utilize approaches developed in Artificial Intelligence.

This author maintains that a discipline should be defined by its goals rather than by its methods. Here is a modernized translation of the view expressed in Mirkin (1980), p. 10. The goal of data analysis is amending or enhancing existing knowledge of the specific domain to which the data being analyzed refer. Structurally, this knowledge can be represented as a set of (a) categories and (b) statements relating them, which leads to two main sets of data analysis techniques: (a) those for developing new categories by summarizing data, and (b) those for deriving new relations between categories by analyzing correlations between various aspects of the data. By using the adjective “Computational” in the title, I stress the algorithmic, rather than the applied, aspects of data analysis.

Key concepts in this course are those related to the computational and structural aspects of the methods that transform data into bits of knowledge:

-  Data:

Typically, in the sciences and in statistics, a problem comes first, and only then does the investigator turn to data that might be useful for the problem. This is also the case in computational data analysis, but here a problem can be very broad, for example: take a look at this data set; what sense can we make of it? This is more reminiscent of a traveler’s view of the world than of a scientist’s. The traveler deals with whatever occurs on their way. Helping the traveler make sense of the data is the task of data analysis. Rather than attending to individual problems, computational data analysis focuses on learning broad patterns. It should be pointed out that this view differs much from that accepted in the conventional sciences, including classical statistics, in which the main goal is to identify a pre-specified model of the world, and data is but a vehicle for achieving this goal.

Any data set comprises two parts: metadata and data entries. Metadata may involve names for the entities and their features. Depending on the origin, entities may be alternatively, but synonymously, referred to as individuals, objects, cases, instances, patterns, or observations. Entity features may be synonymously referred to as variables, attributes, states, or characters. Depending on the way they are assigned to entities, features can be of elementary structure [e.g., age, sex, or income of individuals] or of complex structure [e.g., a picture, a statement, or a cardiogram]. Metadata may also involve relations between entities or other relevant information, which we are not going to deal with further on.
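For illustration, here is a minimal sketch of how such a data set may be held in MatLab, the computing environment used in this text (see Appendix A.3); the entity and feature names and all values below are made up purely for this illustration:

% Metadata: names of the entities and of their features (made up)
entityNames  = {'Ann'; 'Ben'; 'Cat'};
featureNames = {'Age', 'Income', 'Sex'};
% Data entries: an entity-to-feature matrix, one row per entity,
% one column per feature (Sex coded here as 1 or 0)
X = [35  42000  1;
     51  58000  0;
     28  31000  1];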

- Knowledge:

Knowledge is a complex concept, not quite well understood yet, which relates to understanding things. Structurally, knowledge can be thought of as a set of categories and statements of relation between them. Categories are aggregations of similar entities, such as apples or plums, or more general categories, such as fruit, comprising apples, plums, etc. When created over data objects or features, these are referred to as clusters or factors, respectively. Statements of relation between categories express regularities relating different categories. These can be of causal or correlational character. We say that two features correlate when a co-occurrence of specific patterns in their values is observed, as, for instance, when the value of one feature tends to be the square of the other. The observation of a correlation pattern is thought to be a prerequisite to inventing a theoretical framework from which the correlation would follow. It is useful to distinguish between quantitative correlations, such as functional dependencies between features, and categorical ones, expressed conceptually, for example, as logical production rules or more complex structures such as decision trees. These may be used for both understanding and prediction. In industrial applications, the latter is by far the more important. Moreover, the prediction problem is much easier to make sense of operationally, so that the sciences have so far paid much more attention to it; the notion of understanding, meanwhile, remains rather vague.

We are going to study methods for enhancing knowledge by producing rules for finding either

(a) Correlation of features (Co) or

(b) Summarization of entities or features (Su),

each in either of two ways: quantitative (Q) or categorical (C). Combining them makes four major groups of methods: CoQ, CoC, SuQ, and SuC.

A rule involves a postulated mathematical structure whose parameters are to be learnt from the data. We will be dealing mostly with the following mathematical structures in the rules:

- linear combination of features;

- neural network mapping a set of input features into a set of target features;

- decision tree built over a set of features;

- partition of the entity set into a number of non-overlapping clusters.

A fitting method relies on a computational model involving a function that scores the adequacy of the mathematical structure underlying the rule, referred to as a criterion, and, typically, visualization aids.

The criterion measures either the deviation from the target (to be minimised) or the fitness to the target (to be maximised). Currently available computational approaches to optimising the criterion can be partitioned into three major groups:

-  global optimisation, sometimes computationally feasible for linear quantitative and simple discrete structures;

-  local improvement, using such general approaches as the following (a small gradient-descent sketch is given after this list):

gradient descent

alternating optimization

greedy neighbourhood search (hill climbing)

-  evolution of a population, an approach involving relatively recent advancements in computing capabilities, of which the following will be used in some problems:

genetic algorithms

evolutionary algorithms

particle swarm optimization
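To give a flavour of the local improvement idea, here is a minimal MatLab sketch of gradient descent applied to a least-squares criterion for fitting a straight line y = a + b*x; the toy data, starting point and step size are made up for illustration and do not come from any method described later:

% Gradient descent for the least-squares fit of y = a + b*x (toy data)
x = (1:10)';  y = 3 + 2*x + randn(10,1);   % noisy data generated for the example
p = [0; 0];                                % initial guess for [a; b]
step = 0.001;                              % step size of the descent
for iter = 1:5000
    r = y - (p(1) + p(2)*x);               % current residuals
    grad = [-2*sum(r); -2*sum(r.*x)];      % gradient of sum(r.^2) over a and b
    p = p - step*grad;                     % move against the gradient
end
fprintf('Fitted a = %.2f, b = %.2f\n', p(1), p(2));

Each iteration moves the parameters a small step against the gradient of the criterion, so the residual sum of squares can only improve locally; nothing guarantees a global optimum, which is why the global and population-based approaches are listed separately above.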

It should be pointed out that currently no systematic description is available of all possible combinations of problems, data types, mathematical structures, criteria, and fitting methods. Here we rather focus on the generic and better explored problems in each of the four groups, which can be safely claimed to be prototypical within their groups:

Su (summarization):
    Quant: Principal component analysis
    Categ: Cluster analysis

Co (correlation):
    Quant: Regression analysis
    Categ: Supervised classification

The four methods on the right have emerged in different frameworks and are usually considered unrelated. However, they are related in the context of computational intelligence. Moreover, they can be unified by the so-called least-squares criterion that will be accepted for all the main methods described in this text. In fact, the criterion will be part of a unifying data-recovery perspective. The data recovery approach involves three stages (a small numerical sketch is given after the list):

(1) fitting a model to the data (sometimes referred to as “coding”),

(2) deriving data from the model in the format of the data used to build the model (sometimes referred to as “decoding”), and

(3) looking at the discrepancies between the observed data and those recovered from the model. The smaller the discrepancies, the better the fit; this principle leads to natural model fitting criteria.
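To make these three stages concrete, here is a minimal MatLab sketch for the simplest case, a straight-line regression fitted by least squares; the toy data are made up for illustration:

% Data-recovery view of a straight-line regression (toy data)
x = (1:20)';  y = 5 + 1.5*x + randn(20,1);  % observed data
p = polyfit(x, y, 1);                       % (1) coding: least-squares fit of a line
yRecovered = polyval(p, x);                 % (2) decoding: data recovered from the model
discrepancy = sum((y - yRecovered).^2);     % (3) residual sum of squares
fprintf('Least-squares discrepancy: %.2f\n', discrepancy);

Here polyfit performs the coding stage, polyval decodes the fitted model back into the format of the observed data, and the residual sum of squares measures the discrepancy that the least-squares criterion minimises.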

At least three different levels of studying a computational data analysis approach can be distinguished. One can be interested in learning the approach at the level of concepts only: what it is for, why it should be applied at all, etc. A somewhat more practically oriented take is that of an information system or tool that can be utilized without any knowledge beyond the structure of its input and output. A more technically oriented way is to study the method involved and its properties. The comparative advantages and disadvantages of these three levels are as follows.

Level        Pro                     Con
Concepts     Awareness               Superficial
Systems      Usable now; simple      Short-term; stupid
Techniques   Workable; extendable    Technical; boring

Many in the Computer Sciences rely on Systems, assuming that good methods have been developed and put in there already. Although this is largely true for well-defined mathematical problems, such as those implemented in Matlab, the situation is quite different in Data Analysis, because there are no well-posed problems here: the basic formulations rest on intuition and are not supported by theoretical results. This is why, in many respects, the intelligence of currently available “intelligent methods” is rather superficial and may lead to wrong results and decisions.

Consider, for instance, a very popular concept, the power law. Many say that in unconstrained social processes, such as those on Web networks, this law, expressed by the formula y = a*x^(-b), where x and y are some features and a, b > 0 are constants, dominates: the number of people who read news stories on the web decays with time according to a power law; the distribution of page requests on a web site according to their popularity obeys the power law, as does the distribution of web-site interconnections, etc. According to a very popular recipe, to fit a power law (that is, to estimate a and b from the data), one needs to fit the logarithm of the power-law equation, that is, log(y) = c - b*log(x) where c = log(a), which is much easier to fit because it is linear. Therefore, this recipe advises: take logarithms of x and y first and then use any popular linear regression program to find the constants. This recipe works well when the regularity is observed with no noise, which is impossible in social processes because too many factors affect them. If the data are not that exact, the recipe may lead to big errors. For example, I generated x (between 0 and 10) and y related by the power law y = 2*x^1.07, which can be interpreted as growth at a rate of approximately 7% per time moment, with added Gaussian noise whose standard deviation is 2. The recipe above led to estimates of a = 3.08 and b = 0.8, suggesting that the process does not grow with x but rather decays. In contrast, when I applied an evolutionary optimization method, which will be introduced later, I obtained realistic estimates of a = 2.03 and b = 1.076.
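The contrast can be reproduced in outline with a few lines of MatLab. This is only a sketch: the direct fit is parameterized as y = a*x^b with the exponent written explicitly, and the built-in fminsearch stands in for the evolutionary method introduced later, so the estimates will not reproduce the figures quoted above:

% Fitting a power law: linearized recipe versus a direct least-squares fit
x = 0.1:0.1:10;                             % avoid x = 0 because of the logarithm
y = 2*x.^1.07 + 2*randn(size(x));           % power law plus Gaussian noise, std 2
ok = y > 0;                                 % logarithms need positive values
pl = polyfit(log(x(ok)), log(y(ok)), 1);    % linearized fit: log(y) = c + slope*log(x)
aLin = exp(pl(2));  bLin = pl(1);           % back-transformed estimates
sse = @(p) sum((y - p(1)*x.^p(2)).^2);      % least-squares criterion for y = a*x^b
pd = fminsearch(sse, [1 1]);                % direct fit by Nelder-Mead search
fprintf('Linearized: a = %.2f, b = %.2f; direct: a = %.2f, b = %.2f\n', ...
    aLin, bLin, pd(1), pd(2));

On noisy data the two procedures can disagree considerably, because taking logarithms changes the weighting of the errors: the linearized recipe minimises squared errors of log(y), not of y itself, so points with small y values distort the fit most.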

This is a relatively simple example, in which a correct procedure can be applied. However, in more complex situations of clustering or categorization, the very idea of a correct method seems rather debatable; at the least, the methods implemented in existing systems can be, and frequently are, of rather poor quality.

One may compare the situation here with that of getting the services of an untrained medical doctor or car driver: the results can be just as devastating. This is why it is important to study not only the How’s but also the What’s and Why’s of Computational Data Analysis, which are addressed in this course by focusing on Concepts and Techniques rather than Systems. In a typical case, the exposition follows the structure of a data analysis application and comprises the following seven steps:

(i) formulating a specific data-related problem, then

(ii) developing a model and

(iii) method that are going to be used for advancing into the problem, then

(iv) application of the method to the data, sometimes preceded by

(v) data standardization, and sometimes followed by