Outlier detection vs. similarity analysis

László PITLIK, January 2012

Introduction

The following source document (Ben-Gal I., "Outlier Detection", in: Maimon O. and Rokach L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005, ISBN 0-387-24435-2, http://www.eng.tau.ac.il/%7Ebengal/outlier.pdf) gives an excellent overview of outlier detection. The next chapter of this article was created in order to demonstrate what kind of differences can be observed between the "classic" approaches and similarity analysis used for suspicion generation (c.f. http://miau.gau.hu/miau2009/index.php3?x=e0&string=susp). The above-mentioned articles had real data assets in the background (for example: http://miau.gau.hu/miau/160/suspicions_v2.xlsx). Besides introducing the theoretical possibilities, similarity analysis has already been compared with other methods (e.g. WizWhy: http://miau.gau.hu/miau2009/index.php3?x=e0&string=wiz).

Detailed analysis

This chapter follows a simple logic: relevant text parts of the source document are cited (and visually inverted) in order to be commented on, focusing on the relevant details:

Definition of outliers

Outlier detection needs hidden assumptions! In the case of similarity analysis these assumptions are simple and flexible: the user only has to declare a set of tendencies (c.f. directions). These tendencies describe, in a very "raw" way (e.g. the-more-the-more), a ceteris paribus connection between a variable and the phenomenon of suspicion. Therefore similarity analysis does not expect any positive patterns or exact thresholds. Based on these ideas (c.f. Plato), similarity analysis can derive suspicion without any further declaration (c.f. non-declarative approach); a minimal sketch of such a direction declaration follows below.
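To make the direction declaration tangible, the following minimal Python sketch (with hypothetical variable names and toy data that are not part of the original article) shows how declared tendencies could be turned into ranks per variable:

```python
# Minimal sketch of a direction declaration (hypothetical names and toy data).
# direction = 1 means "the more, the more suspicion"; 0 means "the more, the less".

toy_objects = {
    "obj_A": {"transaction_count": 12, "avg_amount": 500.0},
    "obj_B": {"transaction_count": 3,  "avg_amount": 1200.0},
    "obj_C": {"transaction_count": 7,  "avg_amount": 800.0},
}

directions = {"transaction_count": 1, "avg_amount": 0}  # declared tendencies

def to_ranks(objects, directions):
    """Convert raw values into ranks per variable, following the declared direction."""
    names = list(objects)
    ranks = {name: {} for name in names}
    for var, direction in directions.items():
        # descending order if "the more, the more", ascending otherwise
        ordered = sorted(names, key=lambda n: objects[n][var], reverse=bool(direction))
        for position, name in enumerate(ordered, start=1):
            ranks[name][var] = position  # rank 1 = most suspicion-prone end
    return ranks

print(to_ranks(toy_objects, directions))
```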

Big deviation = suspicion! Similarity analysis also creates deviations from a neutral benchmark, derived from so-called staircase functions. The greater the volume of deviation, the greater the volume of suspicion. But: similarity analyses are able to detect their own (model-internal) errors, so the volume of deviation is not the only measure used to identify risk/suspicion. The errors of similarity analyses can be derived from the symmetry problems of inverse LP tasks (see the sketch below).
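The next sketch (again with hypothetical numbers; it assumes that staircase estimates from a direct and an inverted run are already available) illustrates how the deviation volume and a naive symmetry-based internal error could be read:

```python
# Sketch: deviation from a neutral benchmark as suspicion volume, plus a naive
# symmetry check between a direct and an inverted run (hypothetical numbers).

Y0 = 1000.0  # the constant benchmark (antidiscrimination value)

# Hypothetical staircase estimates per object from the direct task ...
direct_estimates   = {"obj_A": 1040.0, "obj_B": 990.0, "obj_C": 970.0}
# ... and from the inverse task (directions inverted); ideally the mirror image.
inverted_estimates = {"obj_A": 965.0,  "obj_B": 1010.0, "obj_C": 1025.0}

for name in direct_estimates:
    deviation = direct_estimates[name] - Y0        # signed suspicion
    suspicion_volume = abs(deviation)              # the bigger, the more suspicious
    # If the two runs were perfectly symmetric, the deviations would cancel out;
    # the residual is read here as a model-internal error of the estimate.
    internal_error = abs((direct_estimates[name] - Y0) + (inverted_estimates[name] - Y0))
    print(name, round(deviation, 1), round(suspicion_volume, 1), round(internal_error, 1))
```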

Inconsistency = suspicion! Similarity analysis can be seen as a type of consistency-control mechanism. Similarity analyses try to optimize the parameters of staircase functions, and optimization is a way of ensuring consistency. The antidiscrimination approach in similarity analysis has to identify stairs that ensure a suspicion-free approximation of the whole data asset. If this aim cannot be achieved, then inconsistent objects exist (see the LP sketch below).
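The following simplified LP sketch (using scipy.optimize.linprog, a toy rank matrix and a hypothetical benchmark Y0; it is not the original COCO implementation) illustrates the antidiscrimination idea: find non-increasing stair values so that every additive estimate comes as close as possible to the same constant, and read the remaining deviations as inconsistency/suspicion:

```python
# Simplified antidiscrimination LP sketch (hypothetical data, not the original COCO code).
import numpy as np
from scipy.optimize import linprog

Y0 = 1000.0
# Toy object-attribute matrix given as ranks (1 = most suspicion-prone end),
# already ordered according to the declared directions.
ranks = np.array([
    [1, 3],   # obj_A
    [2, 1],   # obj_B
    [3, 2],   # obj_C
])
m, k = ranks.shape                 # m objects, k attributes
n_stairs = m                       # one stair value per rank position and attribute
n_s = k * n_stairs                 # number of stair variables
n_vars = n_s + 2 * m               # stairs + positive/negative deviations

def s_index(attr, rank):
    return attr * n_stairs + (rank - 1)

# Objective: minimise the total absolute deviation from Y0.
c = np.concatenate([np.zeros(n_s), np.ones(2 * m)])

# Equality constraints: estimate_i - d_plus_i + d_minus_i = Y0
A_eq = np.zeros((m, n_vars))
b_eq = np.full(m, Y0)
for i in range(m):
    for j in range(k):
        A_eq[i, s_index(j, ranks[i, j])] = 1.0
    A_eq[i, n_s + i] = -1.0          # d_plus_i
    A_eq[i, n_s + m + i] = 1.0       # d_minus_i

# Inequality constraints: stair values must not increase with the rank position.
rows = []
for j in range(k):
    for r in range(1, n_stairs):
        row = np.zeros(n_vars)
        row[s_index(j, r + 1)] = 1.0
        row[s_index(j, r)] = -1.0    # S[j][r+1] - S[j][r] <= 0
        rows.append(row)
A_ub = np.array(rows)
b_ub = np.zeros(len(rows))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_vars, method="highs")
# A deviation of 0 means the object fits the suspicion-free (antidiscrimination) state;
# non-zero deviations mark inconsistent, i.e. suspicious, objects.
deviations = res.x[n_s:n_s + m] - res.x[n_s + m:]
print("signed deviations from Y0:", np.round(deviations, 2))
```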

Taxonomy of methods

Univariate vs. multivariate methods: Similarity analyses can be executed with an arbitrary number of variables. In real projects the number of variables is practically unlimited.

Known/unknown preconditions: Similarity analyses do not need a known pattern, but they are able to interpret arbitrary patterns. Statistical measures (e.g. distributions) are not directly involved in similarity analyses. However, there is one simple model assumption in similarity analyses in the case of an unknown suspicion pattern: theoretically, each object can lead to the same suspicion level.

Model-free approach: Similarity analyses always create a lot of models ensuring multilayer consistency. Each error in this model system can be interpreted as a type of risk/suspicion.

Distance-based methods: Similarity analyses create staircases, which are special multidimensional distance measures.

Clustering techniques: Similarity analysis can be seen as a special clustering approach. The basic philosophy (approximation of the expected antidiscrimination) is unchanged, but each task can use a unique direction vector to describe the business logic of the particular suspicion. Therefore similarity analysis does not suffer from aimlessness and "vonnegutisms" (= reading meaning into arbitrary results) like clustering techniques in general.

At least 2*3 non-overlapping sets per direction setting: Similarity analysis (without a real learning/suspicion pattern) has a default classification capacity of three sets: neutral objects, non-suspected objects and suspected objects. Each of them can be validated or not (based on symmetry errors). More sophisticated consistency-control logics can further increase the number of risk levels. Direction settings are responsible for creating interpretable suspicions. From a technical point of view the number of potential suspicion phenomena is 2^n, where n = the number of input variables in the learning pattern (OAM); see the counting sketch below.
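A small counting sketch (hypothetical variable names, and an assumed sign convention for which side counts as "suspected") of the 2^n direction settings and the default three-set classification:

```python
# Sketch: enumerating the 2^n possible direction settings of an OAM with n input
# variables, and the default three-way classification of objects by signed deviation.
from itertools import product

variables = ["transaction_count", "avg_amount", "night_ratio"]   # n = 3 (hypothetical)
direction_settings = list(product((0, 1), repeat=len(variables)))
print(len(direction_settings), "potential suspicion phenomena (2^n)")

def classify(signed_deviation, tolerance=1e-6):
    """Default 3-set classification: neutral / non-suspected / suspected.
    Which sign counts as 'suspected' is an assumption; it depends on the task's
    direction setting."""
    if abs(signed_deviation) <= tolerance:
        return "neutral"
    return "suspected" if signed_deviation > 0 else "non-suspected"

for dev in (0.0, 35.2, -18.4):
    print(dev, "->", classify(dev))
```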

Neighbouring relations: Similarity analysis can handle space and time constructs. Neighbouring objects can form special consistent layers.

Univariate methods

Seemingly independent variables (used alone) lead to an information deficit, in contrast to multivariate models, where connecting (combining) variables through potential operators creates a lot of new variables. Similarity analyses can be built on additive, multiplicative or hybrid connection types (see the aggregation sketch below). Therefore univariate methods have a lower potential to detect any type of risk/suspicion…
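A minimal sketch of the three connection types, with hypothetical partial effects and an assumed (not canonical) reading of "hybrid":

```python
# Sketch: additive, multiplicative and hybrid aggregation of partial (per-variable)
# staircase effects into one estimate (hypothetical partial values).
from math import prod

partial_effects = {"transaction_count": 520.0, "avg_amount": 480.0, "night_ratio": 505.0}

additive = sum(partial_effects.values())
multiplicative = prod(partial_effects.values())
# One possible hybrid reading (an assumption, not the article's definition):
# multiply some effects and add the rest.
hybrid = (partial_effects["transaction_count"] * partial_effects["night_ratio"]
          + partial_effects["avg_amount"])

print(additive, multiplicative, hybrid)
```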

Multivariate methods

Unique outlier: Similarity analysis always searches for the neutral objects; outliers are "side effects". Outliers should mostly occur in pairs or in groups, because the negative and positive suspicion potentials should always be in equilibrium (c.f. well-balanced). Therefore a single outlier in the lower left corner makes it possible to detect (even less relevant) outliers on the opposite side (along the diagonal towards the upper right corner). The equilibrium check is sketched below.
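A toy equilibrium check (hypothetical signed deviations) showing how one strong outlier on one side points to counterparts on the opposite side:

```python
# Sketch: the positive and negative suspicion potentials should roughly balance
# each other, so a strong outlier on one side points to counterparts on the
# opposite side (hypothetical signed deviations).
signed_deviations = {"obj_A": -120.0, "obj_B": 45.0, "obj_C": 40.0, "obj_D": 35.0}

balance = sum(signed_deviations.values())
print("overall balance (should be close to 0):", balance)

strongest = min(signed_deviations, key=signed_deviations.get)   # lower-left-corner outlier
counterparts = [n for n, d in signed_deviations.items() if d > 0]
print("single strong outlier:", strongest, "-> opposite-side counterparts:", counterparts)
```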

Relative suspicion: Similarity analyses search for relative suspicions. Relative suspicion can be assumed if every input impact has been involved in the model, but some of the examined cases still seem to be irregular compared to each other.

Models vs. indicators: A model (like a staircase system of a similarity analysis) is a construct of raw variables and/or simple indicators. Indicators are relatively simple combinations of raw variables; they are already part of multivariate logic, but they are less complex than models. There is no exact rule for classifying models and indicators. The larger the number of input variables (given the same number of observations), the smaller the chance of identifying any suspicion, because the models exploring suspicion can find more and more possible combinations of the inputs in order to declare: each observation is the same, just in its own specific way.

Intuitive understanding (kingmaker objects): Similarity analysis is able to detect kingmaker objects (c.f. swamping effects). Without the kingmakers, new suspicions can be made visible (c.f. masking effects); a leave-one-out sketch is given below.
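A leave-one-out sketch of kingmaker detection; the function fit_suspicions below is only a placeholder assumption standing in for a full similarity-analysis run:

```python
# Sketch: rerun the (here simulated) suspicion model without one candidate object
# and see which suspicions appear only then (c.f. swamping/masking effects).
def fit_suspicions(objects):
    """Placeholder for a full similarity-analysis run (toy rule: flag only the
    object farthest from the mean, if it is farther than 50 units)."""
    mean = sum(objects.values()) / len(objects)
    worst = max(objects, key=lambda n: abs(objects[n] - mean))
    return {worst} if abs(objects[worst] - mean) > 50 else set()

toy_values = {"obj_A": 1000, "obj_B": 1010, "obj_C": 1200, "obj_D": 1600}

baseline = fit_suspicions(toy_values)
for candidate in sorted(toy_values):
    reduced = {n: v for n, v in toy_values.items() if n != candidate}
    rerun = fit_suspicions(reduced)
    unmasked = rerun - baseline      # suspicions that only show up without the candidate
    if unmasked:
        print(candidate, "behaves as a kingmaker; newly visible suspicions:", unmasked)
```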

"Relatively far": Without any distributional approach, similarity analysis also creates a type of distance measure. The antidiscrimination point (the constant Y-value) is always the default benchmark. This sort of distance declaration is as purposive (goal-oriented) as possible: the goal of an antidiscrimination calculation is always to detect a suspicion and not "only" to use some coordinate system that exists in a context-free way…

Parameter-free or non-parametric: Similarity analysis does not deal with distribution parameters, but each input value is converted into a partial suspicion parameter. The aggregation (addition, multiplication or hybridisation) of the partial suspicion effects leads to the holistic suspicion.

Ranking of outliers: Similarity analyses try to find staircases that are able to eliminate any suspicion (c.f. antidiscrimination). However, this special situation is not always given. If suspicions can be detected, then the volume of suspicion makes it possible to rank them. But: the number of potential suspicion classes is mostly relatively small, because the staircases of the antidiscrimination states allow only a limited number of suspicion levels, based on combinatorial rules (see the sketch below).
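A small sketch (hypothetical stair values) of why the number of distinct suspicion levels stays limited: a fixed stair table can only produce a handful of different additive estimates:

```python
# Sketch: with a fixed stair table the additive estimates can only take a limited
# number of distinct values, so the number of rankable suspicion classes stays small.
from itertools import product

# Non-increasing stair values per attribute (rank 1, 2, 3), hypothetical numbers.
stairs = {
    "transaction_count": [600.0, 500.0, 500.0],
    "avg_amount":        [500.0, 500.0, 400.0],
}
Y0 = 1000.0

# Every combinatorially possible rank profile and its additive estimate:
estimates = {sum(stairs[attr][r] for attr, r in zip(stairs, combo))
             for combo in product(range(3), repeat=len(stairs))}
levels = sorted({abs(e - Y0) for e in estimates})
print("distinct suspicion levels:", levels)   # only a handful of possible classes
```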

Clustering and spatial techniques: The relevant comparison has already been handled above.

Pre/post-processing procedures: Similarity analysis is able to use aggregated/selected inputs and is also able to disaggregate aggregated outputs (c.f. COCO-ZOOM: http://miau.gau.hu/miau2009/index.php3?x=e39). Similarity analysis can eliminate variables as well as observations that are irrelevant to the search for suspicions.

Factor analysis: Similarity analysis (http://miau.gau.hu/miau/157/faktoranalizis.docx) is always bound to the detection of suspicion. Factor analysis can only help to identify some suspicions (if human intuition is involved in the interpretation process).

Expert systems (decision trees): Rule systems, decision trees and expert systems (c.f. WEKA: http://miau.gau.hu/miau/122/wekafa_cocolepcso.xls, WizWhy: http://miau.gau.hu/miau2009/index.php3?x=e0&string=wizwhy) produce classification systems that are able to describe even fully polynomial ceteris paribus connections between inputs and outputs. Ceteris paribus figures in similarity analyses (especially COCO MCM: http://miau.gau.hu/miau/160/saltseer.doc) can be formed in a quasi-arbitrary way.

Comparison of methods: There are real difficulties in comparing the above-mentioned methods/techniques. But one aspect should be highlighted: methods that involve human intelligence to interpret the results always bring more danger than approaches where each layer of the interpretation can be written into source code independently of the analysed content. Similarity analyses can be automated at every interpretation layer.

Neural networks: Artificial neural networks in general require positive learning cases. Similarity analysis can work both with and without them. Learning processes in neural networks are mostly stopped ad hoc (c.f. back-propagation), whereas similarity analyses are closed within the framework of (non-)linear programming processes.

Conclusion

The task of outlier detection (= suspicion generation) cannot always be evaluated in an objective way. In the case of bank transactions, the known log files do not include every positive case, only the validly detected ones. So suspicion generation has to offer new ideas as far as possible. Hacker logic is always changing. Context-free methods can deliver possible new behaviour patterns of system hackers, but interpreting these constructions always needs human intuition afterwards. Similarity analyses can be used as a type of context-free approach, but (and this is their main benefit) similarity analysis makes it possible to convert loosely structured expert opinions into LP tasks in a flexible way.

Summary

When introducing a new method, it seems necessary and beneficial to compare the potentially competing approaches in order to see which type of task can use which of them in an effective way. Such comparisons are mostly difficult: there is hardly any well-formed definition of the frequently used terms (c.f. buzzwords). The paragraphs of this article try to characterize similarity analysis according to the terminology of the well-known methods involved in the comparison. This article is not going to give a detailed description of the whole family of similarity analyses (this was already done in earlier articles: c.f. miau.gau.hu). The comparison might seem incomplete and brief, yet one declaration should be made to close it: similarity analysis (c.f. intuition generation) was created directly to find suspicions like a human expert!