On the Numerical Solutions of Stochastic Optimization Problem[1]

by

Yu-Chi Ho

Division of Engineering and Applied Sciences

Harvard University

A cursory look at the programs of recent CDC or IFAC meetings reveals that the control systems community is searching for a new paradigm. Gone are the days when a linear-system-based framework, be it LQG, H∞, feedback linearization, or robust control, dominated and defined our research. Instead, acronyms such as DEDS, IC (Intelligent Control), and HS (Hybrid Systems), and non-real variables, discrete, symbolic, or linguistic, pepper our papers and mathematics. This search for new domains to conquer is healthy and invigorating. However, as the saying goes, "In times of stress, people revert to the tools they are comfortable with", i.e., it is easy and natural for us to attempt to use the same calculus- and differential-equation-based tools to attack these new problems. The purpose of this note is (i) to bring up some unpleasant and no doubt controversial (to some) facts concerning the limitations on how far we can push these tools, given the nature of the new problems we wish to solve, and (ii) to speculate that a softened approach involving computational intelligence [Jang-Sun-Mizutani 1997], such as fuzzy logic, neural networks, genetic algorithms, and ordinal optimization, has a place in our arsenal of tools. Traditionally, publications like the AC Transactions, which pride themselves as standard bearers of rigor and orthodoxy, have not published many AI-type papers. It is, nevertheless, this author's belief, after years of subscribing to the same standard, that at least an open-minded discussion of these issues in the Transactions pages is desirable for the benefit of the general readership. After all, one of the professed goals of the society is to solve real-world problems. The fundamental limitations described below, with respect to the nature of these problems, force us to rethink our paradigm.

A general problem of stochastic optimization can be defined as

    min_{q ∈ Q} J(q) ≡ E[L(q, x)]                                        (1)

where Q, the search space, is an arbitrary, huge, but finite set; q, the design alternatives; J, the performance criterion, which is the expectation of L, the sample performance, a functional of q and of x, the randomness in the system. This definition is sufficiently general to include, either exactly or approximately:

i. The Stochastic Optimal Control Problem — In this case, Q is the space of all possible feedback control laws which is uncountably infinite when the state and control variables are real. But in numerical solutions, we must approximate Q by a finite set.

ii. The Decision-Tree Analysis Problem — This is simply the finite version of (i) with Q the set of all decision rules.

iii. Other Parametric Stochastic Optimization Problems — q in this case is simply a vector of design parameters, continuous or discrete.

Of course, if J(q) can be evaluated in closed form and the solution found by analytically solving the necessary and/or sufficient conditions for an optimum, then no further discussion is needed. In the real world, unfortunately, problems rarely if ever fall into this class. The advent of computers brought forth another possibility, namely, the numerical solution of (1). We shall discuss in this note some basic limitations related to the numerical solution of (1) and approaches to circumvent them. There are two aspects to this problem: the "stochastic" part and the "optimization" part.

The "stochastic" aspect has to do with computing the expectation numerically, since the sample performance L(x(t; q, x)) is available only as the outcome of a complex calculation. The standard approach is to estimate E[L(q, x)] by sampling, i.e.,

    J_est(q) = (1/N) Σ_{i=1}^{N} L(q, x_i)                               (2)

where x_i represents the i-th sample of the randomness. Before the reader gets the impression that we are implying that (2) is all there is to the numerical calculation of the expectation of L, we hasten to add that analysis still plays a crucial part in such solutions. Devising algorithms to improve the accuracy, to prove the convergence of the estimate J_est(q), to de-correlate samples, etc., are important analysis problems, and a huge literature exists. However, when all is said and done, the ultimate accuracy (confidence interval) of the estimate cannot be improved upon faster than 1/√N, the result of averaging i.i.d. noise. This rate can be quite adequate or even very good for many problems. The best known example is the Kalman filter: the estimation of the state variables of stochastic dynamic systems. The essence there is to "subtract out the deterministic dynamic core and average the remaining innovation (i.i.d.) residues". The state estimate accuracy improves as 1/√N, which is very good since in most applications continuous measurements are present, i.e., thousands of samples can be observed in milliseconds. On the other hand, there are, increasingly, problems for which the ultimate rate of 1/√N is not good enough. These classes of problems typically involve the performance evaluation of DEDS via Monte Carlo simulation. Here each sample of L(q, x_i) requires one simulation run, which for complex DEDS can take seconds or minutes even on the fastest computer, not to mention the setup cost of programming the simulation. The millions or billions of samples of L(q, x_i) required for different values of q may become an infeasible burden. Yet there is no getting away from this fundamental limit of 1/√N, i.e., one order of magnitude increase in accuracy requires two orders of magnitude increase in sampling cost.
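To make the 1/√N limit concrete, here is a minimal Python sketch of the sampling estimate (2). The toy sample performance L(q, x) and all the numbers are invented purely for illustration; in practice each evaluation of L would be one expensive simulation run.

```python
# Minimal sketch of the estimate (2): average N i.i.d. samples of L(q, x)
# and watch the confidence interval shrink only as 1/sqrt(N).
import numpy as np

rng = np.random.default_rng(0)

def L(q, x):
    # Hypothetical sample performance: a simple function of the design q
    # corrupted by the randomness x of one "simulation run".
    return (q - 0.3) ** 2 + x

def J_est(q, N):
    # Sample mean of L(q, x_i), plus an approximate 95% confidence half-width.
    samples = L(q, rng.standard_normal(N))
    return samples.mean(), 1.96 * samples.std(ddof=1) / np.sqrt(N)

for N in (100, 10_000, 1_000_000):
    est, ci = J_est(q=0.7, N=N)
    print(f"N={N:>9,d}  J_est={est:+.4f}  +/- {ci:.4f}")
# Each 100-fold increase in N buys roughly one more decimal digit of accuracy.
```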

The second limitation has to do with the "optimization" part. Traditionally, analysis again plays an important part in numerical optimization, whether it is convergence analysis, gradient computation, or general hill climbing. Again, these tools, such as perturbation analysis (PA), take advantage of the structure and real-variable nature of the problem to devise effective algorithms for optimization. The best known example is of course linear programming. However, when we exhaust such advantages and the remaining problem becomes structureless, with Q totally arbitrary as in (i) and (ii) above, we encounter the limitation introduced by combinatorial explosion, which we shall call the NP-hard limitation[2]. In such cases, more or less blind search is the only alternative. While it is often easy to prove probabilistic convergence of random search, it is very hard to improve the search efficiency. An example will illustrate this:

Suppose Q is of size 10^10, which is small by combinatorial standards. We are allowed to take N = 10,000 samples of J(q) uniformly in Q. What is the probability that at least one of these samples will fall in the top-50, top-500, or top-5000 designs of Q? A simple calculation yields

    P{at least one of the N samples falls in the top-g of Q} = 1 − (1 − g/|Q|)^N        (3)

which, for |Q| = 10^10, N = 10,000, and g = 50, 500, and 5000, turns out to be 0.00004999, 0.0004998, and 0.004988 respectively, i.e., forget it! This means that for simulation-based optimization problems of any complexity, simple random search is not an effective approach: too much computation for too little return.
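For readers who wish to verify these figures, the short Python snippet below simply evaluates (3) for the three cases; the variable names are of course arbitrary.

```python
# Evaluate (3): probability that at least one of N uniform samples from a
# space of size |Q| = 10^10 falls among the top-g designs.
N, Q_size = 10_000, 10**10
for g in (50, 500, 5000):
    p = 1.0 - (1.0 - g / Q_size) ** N
    print(f"top-{g:<4d}  P = {p:.7f}")
# Prints roughly 0.00005, 0.0005, and 0.005, the figures quoted above.
```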

These two difficulties, the 1/√N limit and the NP-hard limitation, are basic for general stochastic optimization problems. Each one alone can induce an insurmountable computational burden. To make matters worse, the effect of these limitations is multiplicative rather than additive. No amount of theoretical analysis can help once we are faced with these limitations. To circumvent them, two avenues appear possible at present. First, we must bring structure back into the problem. This is, however, easier said than done. We must break the habit of always striving for the most general result and instead attempt more specific problems. This is a cultural and "pecking order" thing. To divine the RIGHT structure from a series of specific examples is not always easy[3]. The fact that differential equations turned out to be the correct model for most dynamic systems in the physical world took hundreds of years of experimentation and development. There is no reason to believe the right DEDS structure is just over the horizon and can be gotten by mere introspection. But this does not mean we should not try. This author views this as the main pragmatic rationale for continued theoretical research and as the direction such effort should take. Until such a breakthrough comes along, the second, shorter-run avenue for breaking the stranglehold of the 1/√N and NP-hard limitations is to effect a strategic redirection of the attack[4].

Instead of estimating the best performance for sure, we settle for a good enough alternative with high probability. (G)

This is in the spirit of the soft computing being promoted here. As an example, take the recent work on ordinal optimization, which involves two ideas (a numerical sketch follows after point (b)):

(a) "Order" converges exponentially fast while "value" converges at rate 1/√N. In other words, it is much easier to determine whether alternative A is better than alternative B than to determine by how much (A − B = ?). This is intuitively reasonable and can be made precise [Dai 1995, Xie 1996]. In fact, when human beings are faced with complex and difficult decision problems involving marriage, career, or life-and-death medical treatments, we do not attempt to evaluate the expected utility of the alternatives as decision theory dictates. We simply try to determine which is the better alternative, without necessarily knowing in detail the consequences each alternative entails, i.e., we emphasize the choice (order) rather than the estimation of the utility (value) of the choices.

(b) The probability of getting something "good enough" increases exponentially with the size of the "good enough" set. Traditional optimization selects a single alternative and hopes that it coincides with the true optimum. The metaphor of hitting one speeding bullet with another comes to mind. Instead, if we are willing to settle for the possibility that our "selected" set of alternatives overlaps in some sense with the set of "good enough" alternatives, then this probability improves dramatically with the size of these sets. It is like trying to hit a truck with a shotgun. Again, this can be made precise and quantitative [Lau-Ho 1997, Lee-Lau-Ho 1995].
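As a flavor of point (a), the toy Python simulation below, with numbers chosen purely for illustration, compares how quickly we can rank two noisy alternatives correctly versus how quickly we can pin down the size of their performance gap.

```python
# Toy illustration of (a): ordering two alternatives correctly becomes near
# certain long before the estimate of their performance gap is accurate.
import numpy as np

rng = np.random.default_rng(1)
true_gap, sigma, trials = 0.2, 1.0, 2000   # assumed gap J(A) - J(B) and noise level

for N in (10, 100, 1000):
    # Average N noisy observations of each alternative, repeated over many trials.
    A = rng.normal(true_gap, sigma, size=(trials, N)).mean(axis=1)
    B = rng.normal(0.0, sigma, size=(trials, N)).mean(axis=1)
    order_ok = np.mean(A > B)                      # fraction of trials ranked correctly
    gap_err = np.mean(np.abs((A - B) - true_gap))  # average error in the estimated gap
    print(f"N={N:>4d}  P(correct order)={order_ok:.3f}  mean |gap error|={gap_err:.3f}")
```

Point (b), in the simplest case of blind uniform sampling, is exactly the kind of counting done in (3) and in the example that follows below.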

The details of assertions (a) and (b), which can be found in the references, are not important here. The point is that by adopting the softer goal of settling for a good enough choice with high probability, we bring within computational reach many simulation-based optimization problems heretofore thought to be infeasible. Several orders of magnitude of computational savings are possible [e.g., Ho-Sreenivas-Vakili 1992, Ho-Larson 1995, Chen-Larson-Patsis 1996]. If not actually solving the original optimization problem, we have at least provided a means for narrowing down the search. We call this approach Ordinal Optimization; it is totally consistent with and complementary to the soft computing ideas of the granulation of information and the use of linguistic variables in fuzzy logic and control [Zadeh 1996]. Such computational AI tools also trade off optimality for tractability.

However, even such computational reduction may not be enough. There remain at least two issues. First, being able to get within, say, the top 1% in "order" does not mean getting within 1% in "performance value". Ordinal optimization is of no help when faced with the "needle in a haystack" problem; such a problem is inherently difficult, since anything other than THE optimum is not "good enough". Second, and more important, to be within the top 1% of a search space of size 10^10 is still possibly 10^8 designs away from the optimum. This is often of scant comfort. Let us illustrate this with the above example of sampling in a space of size 10^10. The probability of landing within the top 1% (10^8 out of 10^10) of this space with only 1,000 samples is 0.99996, and essentially ONE with 10,000 samples, i.e., fairly easy. On the other hand, as we have shown above, the probability of getting within the top-5000 (0.00005%) with 10,000 samples is practically zero. To bridge this gap, we must do something more. Suppose by some knowledge we can narrow the location of the top-50, top-500, or top-5000 designs to a subset of 1,000,000 elements of this search space; then with 10,000 samples the probability improves to 0.393, 0.993, and 1 respectively, i.e., we can hardly fail! One can easily interpolate/extrapolate from these extreme examples when our knowledge is not perfect or precise. The point here is to illustrate the importance of knowledge and the need for successive iteration, even with the savings from ordinal optimization.
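The narrowed-subset figures quoted above follow from the same counting as (3), with the space size 10^10 replaced by the subset size 10^6; a quick check (again only a sketch):

```python
# Same counting as (3), after knowledge narrows the search to a subset of
# 10^6 designs that still contains the top-g.
N, subset_size = 10_000, 10**6
for g in (50, 500, 5000):
    p = 1.0 - (1.0 - g / subset_size) ** N
    print(f"top-{g:<4d}  P = {p:.3f}")
# Prints roughly 0.393, 0.993, and 1.000, i.e., we can hardly fail.
```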

This is where learning and adaptation in Soft Computing (SC) and Computational Intelligence (CI) come in. The term "soft computing" advocates the same goal as (G) above but deals with the larger issues of intelligent decision and control. The term CI is used to distinguish the field from traditional artificial intelligence [Jang-Sun-Mizutani 1997, Table 1.1] and to stress the fact that these subjects are by and large quantitative and can be made precise, despite abuses to the contrary. Heuristics, structural knowledge, and search representation all play important roles [see, for example, Ruml et al. 1996] and, in this author's opinion, are the heart of Artificial Intelligence and Intelligent Control. Experience with the use of genetic algorithms and simulated annealing on real problems substantiates this belief. It is also the raison d'être of human intelligence and of our continuing role in problem solving. How to systematize such human intelligence and transfer it to machines is the challenge of computational intelligence and a task for the control system community. We end this note on this optimistic but challenging conclusion.