The ANN and Learning Systems in Brains and Machines
Leonid Perlovsky
Harvard University
1. The Chapter Preface
This chapter overviews mathematical approaches to learning systems and recent progress toward mathematical modeling and understanding of the mechanisms of the mind, including higher cognitive and emotional functions and the cognitive functions of the emotions of the beautiful and music. It is clear today that any algorithmic idea can be realized in a neural-network-like architecture. Therefore, from a mathematical point of view there is no reason to differentiate between ANNs and other algorithms, and the words "algorithms," "neural networks," "machine learning," "artificial intelligence," and "computational intelligence" are used in this chapter interchangeably. It is more important to acknowledge in full honesty that the brain-mind is still a much more powerful learning device than popular ANNs and algorithms. Correspondingly, this chapter searches for the fundamental principles that set the brain-mind apart from popular algorithms, for mathematical models of these principles, and for algorithms built on these models.
What does the mind do differently from ANNs and algorithms? We are still far from mathematical models deriving high cognitive abilities from the properties of neurons; therefore mathematical modeling of the "mind" is often more appropriate than modeling the details of the brain when the goal is to achieve "high" human-like cognitive abilities. The reader will derive a more informed opinion by reading this entire book. This chapter devotes attention to fundamental cognitive principles of the brain-mind and to their mathematical modeling in parallel with solving engineering problems. I also discuss experimental evidence from psychological, cognitive, and brain-imaging experiments about mechanisms of the brain-mind to the extent that it helps the aim: identifying fundamental principles of the mind.
The search for the fundamental principles of the brain-mind and their mathematical models begins with identifying the mathematical difficulties behind hundreds of ANNs and algorithms. Once we understand the fundamental reasons why decades of efforts in artificial intelligence, machine learning, ANNs, and other approaches have failed to model the human mind or to develop mathematical techniques with the power of the human mind, we turn to discussing the fundamental principles of the brain-mind.
The next section analyzes and identifies several mathematical difficulties common to wide groups of algorithms. Then we identify a single fundamental mathematical reason for computers lagging behind the brain-mind: the reliance of computational intelligence on classical logic.
This is an ambitious statement, and we analyze the mathematical as well as cognitive reasons for logic being the culprit of decades of mathematical and cognitive failures. Fundamental mathematical as well as psychological reasons are identified for how and why logic, which used to be considered a cornerstone of science, turned out to be inadequate for understanding the mind.
Then we formulate a mathematical approach that overcomes the limitations of logic. It has resulted in improvements of hundreds of times in solving many classical engineering problems, and it has solved problems that remained unsolvable for decades. It has also explained what had been psychological mysteries for a long time. In several cognitive and brain-imaging experiments it has been demonstrated to be an adequate mathematical model of brain-mind processes. Amazingly, this mathematical technique is simpler to code and use than many popular algorithms.
2. A short summary of learning systems and difficulties they face
Mathematical ideas for learning invented in the 1950s and 1960s are still used today in many algorithms; therefore let me briefly overview these ideas and identify the sources of their mathematical difficulties.
Computational approaches to solving complex engineering problems by modeling the mind began almost as soon as computers appeared. These approaches in the 1950s followed the known neural structure of the brain. In 1949 Donald Hebb published what became known as the Hebb rule: neuronal synaptic connections grow in strength when they are used in the process of learning. Mathematicians and engineers involved in developing learning algorithms and devices in the early 1950s were sure that computers would soon far surpass the human mind in its abilities. Everybody knows today that Frank Rosenblatt developed the first ANN capable of learning, called the Perceptron. The Perceptron, however, could only learn to solve fairly simple problems. In 1969 Marvin Minsky and Seymour Papert mathematically proved the limits of Perceptron learning.
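To make the Hebb rule concrete, here is a minimal sketch of a rate-based Hebbian update for a single linear unit (the learning rate and the toy input are illustrative assumptions, not part of Hebb's formulation):

```python
import numpy as np

def hebbian_update(w, x, y, lr=0.01):
    """One Hebbian step: a weight grows with correlated pre- and postsynaptic activity.

    w  : weight vector (one weight per input)
    x  : presynaptic activity vector
    y  : postsynaptic (output) activity of a simple linear unit
    lr : learning rate (hypothetical value, for illustration only)
    """
    return w + lr * y * x

# toy usage: a linear unit repeatedly exposed to the same input pattern
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)
x = np.array([1.0, 0.5, -0.2])
for _ in range(10):
    y = w @ x                     # postsynaptic response
    w = hebbian_update(w, x, y)
print(w)                          # weights aligned with the repeated input grow in strength
```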
Statistical pattern recognition algorithms were developed in parallel (Duda, Hart, & Stork, 2000). They characterized patterns by features. D features formed a D-dimensional classification space; features from learning samples formed distributions in this space, and statistical methods were used to characterize the distributions and derive a classifier. One approach to classifier design defined a plane, or a more complex surface, in the classification space that separated the classes. Another approach became known as the nearest-neighbor or kernel method. In this case neighborhoods in the classification space near known examples of each class are assigned to that class. The neighborhoods are usually defined using kernel functions (often bell-shaped curves, Gaussians). The use of Gaussian mixtures to define neighborhoods is a powerful method; the first attempts to derive such algorithms were complex, and efficient convergence was a problem (Titterington, Smith, & Makov, 1985). Eventually good algorithms were derived (Perlovsky & McManus, 1991). Today Gaussian mixtures are widely used (Mengersen, Robert, & Titterington, 2011). Yet these methods turned out to be limited by the dimensionality of the classification space.
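The kernel idea described above can be sketched in a few lines. The sketch below scores each class by summing Gaussian kernels centered on that class's training samples (a Parzen-window-style classifier; the bandwidth and toy data are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel_class_scores(x, train_X, train_y, bandwidth=1.0):
    """Score each class by summing Gaussian kernels centered on its training samples.

    x         : query point, shape (D,)
    train_X   : training samples, shape (N, D)
    train_y   : integer class labels, shape (N,)
    bandwidth : kernel width (illustrative choice)
    """
    d2 = np.sum((train_X - x) ** 2, axis=1)           # squared distances to all samples
    k = np.exp(-d2 / (2.0 * bandwidth ** 2))          # Gaussian kernel values
    return {int(c): k[train_y == c].sum() for c in np.unique(train_y)}

# toy usage: two 2-D classes centered at (0, 0) and (3, 3)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(gaussian_kernel_class_scores(np.array([2.5, 2.5]), X, y))
```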
The problem with dimensionality was discovered by Richard Bellman (1962), who called it “the curse of dimensionality.” The number of training samples has to grow exponentially (or combinatorially) with the number of dimensions. The reason is in the geometry of high-dimensional spaces: there are “no neighborhoods,” and most of the volume is concentrated on the periphery (Perlovsky, 2001). Whereas kernel functions are defined so that the probability of belonging to a class rapidly falls with the distance from a given example, in high-dimensional spaces volume growth may outweigh the fall of the kernel function; if kernels fall exponentially (like a Gaussian), the entire "neighborhood" resides on a thin shell where the fall of the kernel is matched by the rise of the volume. Simple problems have been solved efficiently, but learning more complex problems seemed impossible.
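This "thin shell" geometry can be verified numerically. The sketch below (with arbitrarily chosen dimensions and sample sizes) estimates how far standard Gaussian samples lie from their mean as the dimension D grows; almost no probability mass remains "near" the center:

```python
import numpy as np

# The distance of a standard Gaussian sample from its mean concentrates near sqrt(D):
# in high dimensions essentially no mass lies in a fixed-radius neighborhood of the center.
rng = np.random.default_rng(0)
for D in (1, 2, 10, 100, 500):
    samples = rng.standard_normal((10_000, D))
    r = np.linalg.norm(samples, axis=1)
    print(f"D={D:4d}  mean distance={r.mean():6.2f}  "
          f"fraction within r<1 of center={np.mean(r < 1.0):.4f}")
```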
Marvin Minsky (1965) and many colleagues suggested that learning was too complex and premature; artificial intelligence should instead use knowledge stored in computers. Systems storing knowledge in the form of "if… then…" rules are called expert systems and are used to this day. But when learning is attempted, rules often depend on other rules and grow into combinatorially large trees of rules.
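A toy illustration of the "if… then…" rule mechanism, a hypothetical forward-chaining sketch rather than any particular expert system, shows how the conclusions of some rules become the premises of others:

```python
# Each rule: if all premises are among the known facts, add the conclusion.
# The facts and rules below are made up purely for illustration.
rules = [
    ({"has_fever", "has_cough"}, "possible_flu"),
    ({"possible_flu", "short_of_breath"}, "see_doctor"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules until no new facts can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"has_fever", "has_cough", "short_of_breath"}, rules))
```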
A general approach attempting to combine existing knowledge and learning was model-based learning, popular in the 1970s and 1980s. This approach used parametric models to account for existing knowledge, while learning was accomplished by selecting appropriate values of the model parameters. This approach is simple when all the data come from only one mode, for example, estimation of a Gaussian distribution from data, or estimation of a regression equation. When multiple processes have to be learned, algorithms have to split the data among models and then estimate the model parameters. I briefly describe one such algorithm, multiple hypothesis testing (MHT), which is still used today (Singer, Sea, & Housewright, 1974). To fit model parameters to the data, MHT uses multiple applications of a two-step process. First, an association step assigns data to models. Second, an estimation step estimates the parameters of each model. Then a goodness of fit is computed (such as likelihood). This procedure is repeated for all assignments of data to models, and at the end the model parameters corresponding to the best goodness of fit are selected. The number of associations is combinatorially large; therefore MHT encounters combinatorial complexity and can only be used for very simple problems.
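A schematic sketch of the MHT two-step procedure for two one-dimensional Gaussian models is given below (an illustrative simplification; real MHT implementations differ in detail). The loop over all assignments of data to models has 2^N iterations, which is exactly the combinatorial complexity discussed above:

```python
import itertools
import numpy as np

def gaussian_loglik(pts, mu, sigma):
    """Log-likelihood of 1-D points under a Gaussian N(mu, sigma^2)."""
    return (-0.5 * np.sum(((pts - mu) / sigma) ** 2)
            - len(pts) * np.log(sigma * np.sqrt(2 * np.pi)))

def mht_two_models(data):
    """Exhaustive MHT for two 1-D Gaussian models.

    Association step: try every assignment of points to the two models (2^N of them).
    Estimation step : fit each model by maximum likelihood on its assigned points.
    Keep the assignment with the best total log-likelihood (goodness of fit)."""
    best_ll, best_assignment = -np.inf, None
    for assignment in itertools.product((0, 1), repeat=len(data)):
        a = np.array(assignment)
        ll = 0.0
        for m in (0, 1):
            pts = data[a == m]
            if len(pts) < 2:              # skip degenerate assignments
                ll = -np.inf
                break
            mu, sigma = pts.mean(), max(pts.std(), 1e-3)
            ll += gaussian_loglik(pts, mu, sigma)
        if ll > best_ll:
            best_ll, best_assignment = ll, a
    return best_ll, best_assignment

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 5), rng.normal(5, 1, 5)])
print(mht_two_models(data))               # already 2^10 = 1024 associations for 10 points
```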
In the 1970s the idea of self-learning neural systems became popular again. Since the 1960s Stephen Grossberg had continued research into the mechanisms of the brain-mind. He led a systematic exploitation of perceptual illusions for deciphering the neural-mathematical mechanisms of perception, similarly to how I. Kant used typical errors in judgment for deciphering the a priori mechanisms of the mind. But Grossberg's ideas seemed too complex for a popular following. Adaptive Resonance Theory (ART) became popular later (Carpenter & Grossberg, 1987); it incorporated ideas of interaction between bottom-up (BU) and top-down (TD) signals, considered later.
Popular attention was attracted by the idea of Backpropagation, which overcame the earlier difficulties of the Perceptron. It was first invented by Arthur Bryson and Yu-Chi Ho in 1969 but was ignored. It was reinvented by Paul Werbos in 1974, and again in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams. The Backpropagation algorithm is capable of learning the connection weights in multilayer feedforward neural networks. Whereas the original single-layer Perceptron could only learn a hyperplane in a classification space, two-layer networks could learn multiple hyperplanes and therefore define multiple regions, and three-layer networks could learn classes defined by multiple regions.
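A minimal two-layer feedforward network trained by backpropagation is sketched below (a from-scratch illustration with arbitrarily chosen network size and learning rate, not the algorithm of any of the cited authors); it learns the XOR problem, which a single-layer Perceptron cannot solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not linearly separable, so a single-layer Perceptron cannot learn it,
# but a two-layer network trained by backpropagation can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0                                        # illustrative learning rate

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                    # forward pass, hidden layer
    y = sigmoid(h @ W2 + b2)                    # forward pass, output layer
    # backward pass: propagate the error derivative layer by layer
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

print(np.round(y, 2))                           # should approach [[0], [1], [1], [0]]
```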
Multilayer networks with many weights faced the problem of overfitting. Such networks can learn (fit) classes of any geometrical shape and achieve good performance on training data. However, on test data that were not part of the training procedure, the performance could drop significantly. This is a general problem of learning algorithms with many free parameters learned from training data. A general approach to this problem is to train and test a neural network or classifier on large amounts of training and test data. As long as both training and testing performance continue improving with an increasing number of free parameters, learning is valid; but when increasing the number of parameters results in poorer test performance, this is a definite indication of overfitting. A valid training-testing procedure can be exceedingly expensive in research effort and computer time.
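A small numerical illustration of this training-testing logic (with illustrative data and model sizes): polynomial fits of increasing degree to noisy data show the training error falling steadily, while the test error eventually turns up, signaling overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)             # underlying "true" relation

x_train = rng.uniform(0, 1, 15); y_train = f(x_train) + rng.normal(0, 0.2, 15)
x_test  = rng.uniform(0, 1, 200); y_test  = f(x_test)  + rng.normal(0, 0.2, 200)

for degree in (1, 3, 6, 10, 14):
    coeffs = np.polyfit(x_train, y_train, degree)        # degree + 1 free parameters
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}   test MSE {test_err:.3f}")
```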
A step toward addressing the overfitting problem in an elegant and mathematically motivated way has been undertaken in Statistical Learning Theory (SLT) (Vapnik, 1999). SLT seems to be one of the very few theoretical breakthroughs in learning theory. SLT promised to find a valid performance without overfitting in a classification space of any dimension from training data alone. The main idea of SLT is to find the few most important training data points (support vectors) needed to define a valid classifier in a classification subspace of small dimension. Support Vector Machines (SVM) became very popular, likely due to a combination of elegant theory, relatively simple algorithms, and good performance.
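For reference, a minimal SVM usage sketch (assuming the scikit-learn library is available; the toy data and kernel choice are illustrative): the fitted classifier is determined by a subset of the training points, the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# two toy 2-D classes centered at (0, 0) and (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("training samples :", len(X))
print("support vectors  :", len(clf.support_vectors_))   # a subset of the training points
print("prediction for [1.5, 1.5]:", clf.predict([[1.5, 1.5]]))
```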
However, SVM did not realize the theoretical promise of a valid optimal classifier in a space of any dimension. A complete theoretical argument for why this promise has not been realized is beyond the scope of this chapter. A simplified summary is that for complex problems a fundamental parameter of the theory, the Vapnik-Chervonenkis dimension, turns out to be near its critical value. I would add that SLT does not rely on any cognitive intuition about brain-mind mechanisms, and it does not seem that the SLT principles are used by the brain-mind. It could have been expected that if SLT were indeed capable of a general optimal solution of any problem using a simple algorithm, its principles would have been discovered by biological algorithms during billions of years of evolution.
The problem of overfitting due to a large number of free parameters can be approached by adding a penalty function to the objective function to be minimized (or maximized) in the learning process (Setiono, 1997; Nocedal & Wright, 2006). A simple and efficient method is to add a weighted sum of squares of the free parameters to the log likelihood, or alternatively to the sum of squared errors; this method is called Ridge regression. In practice Ridge regression often achieves performance similar to SVM.
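A minimal closed-form Ridge regression sketch (the penalty weight lam is an illustrative choice): the weighted sum of squared parameters added to the squared-error objective shrinks the fitted weights toward zero.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Minimize ||X w - y||^2 + lam * ||w||^2 via the closed-form normal equations.

    lam : penalty weight on the sum of squared parameters (illustrative value).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# toy usage: many features, few samples -- the penalty keeps the weights small
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
w_true = np.zeros(10); w_true[:2] = [2.0, -1.0]
y = X @ w_true + rng.normal(0, 0.5, 20)
print(np.round(ridge_fit(X, y, lam=5.0), 2))
```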
Recently, progress on a certain class of problems has been achieved using gradient boosting methods (Friedman, Hastie, & Tibshirani, 2000). The idea of this approach is to use an ensemble of weak classifiers, such as trees or stumps (short trees), and to combine them as long as performance continues improving. These classifiers are weak in that their geometry is very simple; yet a large number of trees or stumps can achieve good performance. Why does a large number of classifiers with many parameters not necessarily overfit the data? This can be understood from SLT: one SLT conclusion is that overfitting occurs not just due to a large number of free parameters, but due to an overly flexible classifier parameterization, when a classifier can fit every little "wiggle" in the training data. It follows that a large number of weak classifiers can potentially achieve good performance. A cognitively motivated variation of this idea is Deep Learning, which uses a standard backpropagation algorithm with standard feedforward multilayer neural networks with many layers (hence "deep"). Variations of this idea under the names of gradient boosting, ensembles of trees, and deep learning are useful when a very large amount of labeled training data is available (millions of training samples) while no good theoretical knowledge exists about how to model the data. Such problems are encountered in data mining, speech recognition, and handwritten character recognition (Hinton, Deng, Yu, et al., 2012; Meier & Schmidhuber, 2012).
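A schematic sketch of boosting with stumps for a regression problem (least-squares boosting in its simplest form; the stump search, learning rate, and toy data are illustrative simplifications): each weak learner fits the residual left by the ensemble built so far.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split of a 1-D feature that best fits the residual."""
    best = (np.inf, None)
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        err = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if err < best[0]:
            best = (err, (thr, pred_l, pred_r))
    return best[1]

def boost(x, y, n_stumps=100, lr=0.1):
    """Least-squares boosting: each stump is fit to the current residual."""
    pred, stumps = np.full_like(y, y.mean()), []
    for _ in range(n_stumps):
        thr, pl, pr = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= thr, pl, pr)
        stumps.append((thr, pl, pr))
    return pred, stumps

# toy usage: many very simple stumps together fit a smooth nonlinear relation
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
pred, _ = boost(x, y)
print("residual MSE after boosting:", round(np.mean((y - pred) ** 2), 3))
```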
3. Computational complexity and Gödel
Many researchers have attempted to find a general learning algorithm applicable to a wide range of problems. These attempts have continued from the 1950s until today. Hundreds of smart people have spent decades perfecting particular algorithms for specific problems, and when they achieve success they are often convinced that they have found a general approach. The desire to believe in the existence of a general learning algorithm is supported by the fact that the human mind indeed can solve a lot of problems. Therefore cognitively motivated algorithms, such as Deep Learning, can seem convincing to many people. If the developers of an algorithm succeed in convincing many followers, their approach may flourish for five or even ten years, until gradually researchers discover that the promise of a general learning algorithm has not been fulfilled.
Other researchers have been inspired by the fact that the mind is much more powerful than machine learning algorithms, and they have studied the mechanisms of the mind in full honesty. Several principles of mind operation have been discovered; nevertheless, mathematical modeling of the mind faced the same problems as artificial intelligence and machine learning: mathematical models of the mind have not achieved cognitive power comparable to the mind's. Apparently the mind's learning mechanisms differ from existing mathematical and engineering ideas in some fundamental way.
It turned out that there is indeed a fundamental mathematical principle explaining in a unified way the previous failures of attempts to develop a general learning algorithm and to model the learning mechanisms of the mind. This fundamental principle has been lying bare and well known to virtually everybody, in full view of the entire mathematical, scientific, and engineering community. Therefore, in addition to explaining this fundamental mathematical reason, I will also have to explain why it has not been noticed long ago. It turns out that this explanation reveals a fundamental psychological reason preventing many great mathematicians, engineers, and cognitive scientists from noticing "the obvious" (Perlovsky, 2013d).
The relationships between logic, cognition, and language have been a source of longstanding controversy. The widely accepted story is that Aristotle founded logic as a fundamental mechanism of the mind, and that only during recent decades has science overcome this influence. I would like to emphasize the opposite side of this story. Aristotle thought that logic and language are closely related. He emphasized that logical statements should not be formulated too strictly and that language inherently contains the necessary degree of precision. According to Aristotle, logic serves to communicate already made decisions (Perlovsky, 2007c). The mechanism of the mind relating language, cognition, and the world Aristotle described as forms. Today we call similar mechanisms mental representations, or concepts, or simulators in the mind (Perlovsky, 2007b; Barsalou, 1999). Aristotelian forms are similar to Plato’s ideas, with a marked distinction: forms are dynamic; their initial states, before learning, are different from their final states as concepts (Aristotle, 1995). Aristotle emphasized that the initial states of forms, forms-as-potentialities, are not logical (i.e., they are vague), but their final states, forms-as-actualities, attained as the result of learning, are logical. This fundamental idea was lost during millennia of philosophical arguments. It is interesting to add that the Aristotelian idea of vague forms-potentialities has been resurrected in fuzzy logic by Zadeh (1965), and the dynamic logic described here is an extension of fuzzy logic to a process "from vague to crisp." As discussed below, the Aristotelian process of dynamic forms can be described mathematically by dynamic logic; it corresponds to processes of perception and cognition, and it might be the fundamental principle used by the brain-mind and missed by ANNs and algorithms.