The Relationship Between Least Squares and Likelihood

George P. Smith

Division of Biological Sciences

Tucker Hall

University of Missouri

Columbia, MO 65211-7400

(573) 882-3344;

AUTHOR’S FOOTNOTE

George P. Smith is Professor, Division of Biological Sciences, University of Missouri, Columbia, MO 65211 (e-mail: ).

ABSTRACT

Under broadly applicable assumptions, the likelihood of a theory on a set of observed quantitative data is proportional to 1/D^n, where D is the root mean squared deviation of the data from the predictions of the theory and n is the number of observations.

KEYWORDS

Bayesian statistics; Ignorance prior probability; Root mean squared deviation

1. INTRODUCTION

One of the commonest settings for statistical analysis involves a series of n quantitative observations x1, x2, …, xn and a series of competing explanatory theories θ, each of which specifies a theoretical value μi corresponding to each of the actual observations xi. The degree to which the observations fit the expectations of a given theory is usually gauged by the sum of the squares of the deviations for that theory, S = Σ(xi − μi)², or equivalently by the root mean squared (RMS) deviation D = √(S/n); D has the advantage of being on the same scale as the observations themselves and for that reason will be used here. The theory for which D is minimized is the best fit to the data according to the least-squares criterion.
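As a concrete illustration of the least-squares criterion, the short Python sketch below computes S and D for two competing theories. The observations and theoretical predictions are made-up numbers chosen only for illustration; they are not taken from the article.

```python
# Compare competing theories by their RMS deviation D from the data.
import numpy as np

x = np.array([2.1, 3.9, 6.2, 7.8, 10.1])            # hypothetical observations x_i
theories = {                                         # hypothetical predictions mu_i
    "theory_A": np.array([2.0, 4.0, 6.0, 8.0, 10.0]),
    "theory_B": np.array([2.5, 4.5, 6.5, 8.5, 10.5]),
}

n = len(x)
for name, mu in theories.items():
    S = np.sum((x - mu) ** 2)     # sum of squared deviations
    D = np.sqrt(S / n)            # RMS deviation, on the same scale as the observations
    print(name, "D =", float(D))
# The least-squares criterion picks the theory with the smallest D.
```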

Least-squares analysis is on firm theoretical grounds when it can reasonably be assumed that the deviations of the observations from the expectations of the true theory are independently, identically and normally distributed (IIND) with standard deviation σ. In those circumstances, it is well known (and will be demonstrated below) that the theory that minimizes D (or equivalently, S) also maximizes likelihood. The purpose of this article is to explain a deeper relationship between likelihood and RMS deviation that holds under broadly applicable assumptions.

2. ANALYSIS

2.1 RMS Deviation and the Likelihood Function

In consequence of the assumed normal distribution of deviations, the probability density for observing datum xi at data-point i, given standard deviation σ and a theory θ that predicts a value of μi at that point, is

$$\Pr(x_i \mid \theta, \sigma) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_i-\mu_i)^2}{2\sigma^2}\right]. \qquad \text{Eq. 1}$$

Here and throughout this article, the generic probability function notation Pr(·) and the summation sign Σ will be used for both continuous and discrete variables; it is to be understood from the context when a probability density function is intended, and when summation is to be accomplished by integration. This notational choice preserves the laws of probability in their usual form while allowing both kinds of random variable to be accommodated in a simple, unified framework. Because of the IIND assumption, the joint probability density for obtaining the ensemble X of observations {x1, x2, x3,…, xn} is the product of all n such probability densities:

$$\Pr(X \mid \theta, \sigma) \;=\; \prod_{i=1}^{n}\Pr(x_i \mid \theta, \sigma) \;=\; (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{S}{2\sigma^2}\right) \;=\; (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{nD^2}{2\sigma^2}\right). \qquad \text{Eq. 2}$$
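The factorization in Eq. 2 is easy to check numerically. The sketch below uses hypothetical data, predictions, and σ (chosen only for illustration) and confirms that the product of the n individual normal densities of Eq. 1 equals the joint density written in terms of D alone.

```python
import numpy as np
from scipy.stats import norm

x  = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # hypothetical observations
mu = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical theoretical values
sigma = 0.3                                 # assumed standard deviation
n = len(x)

# Left-hand side: product of the n individual normal densities (Eq. 1)
product_of_densities = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Right-hand side: the same joint density written in terms of the RMS deviation D (Eq. 2)
D = np.sqrt(np.mean((x - mu) ** 2))
joint_via_D = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-n * D**2 / (2 * sigma**2))

print(np.isclose(product_of_densities, joint_via_D))   # True
```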

It will be useful in what follows to gauge dispersion in the normal distribution in terms of ln σ rather than σ itself, in which case the above distribution can be written in the form

$$\Pr(X \mid \theta, \ln\sigma) \;=\; (2\pi)^{-n/2}\,D^{-n}\cdot\left[\left(\frac{D}{\sigma}\right)^{\!n}\exp\!\left(-\frac{nD^2}{2\sigma^2}\right)\right]. \qquad \text{Eq. 3}$$

The right-hand factor in this expression is a peak-shaped function of ln σ whose peak value occurs when ln σ = ln D but whose size and shape are independent of D and therefore of both data X and theory θ.
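This shift invariance is easy to verify numerically. The short sketch below evaluates the right-hand factor of Eq. 3 on a grid of ln σ values for several assumed values of D (with n = 12, the value used in Fig. 1) and confirms that the peak always occurs at ln σ = ln D with the same height and shape; only its location shifts.

```python
import numpy as np

n = 12                                         # assumed number of data-points, as in Fig. 1

def right_hand_factor(ln_sigma, D):
    sigma = np.exp(ln_sigma)
    return (D / sigma) ** n * np.exp(-n * D**2 / (2 * sigma**2))

u = np.linspace(-4.0, 4.0, 8001)               # grid of ln(sigma) - ln(D)
for D in (0.01, 1.0, 100.0):                   # assumed RMS deviations
    vals = right_hand_factor(u + np.log(D), D)
    # peak sits at ln(sigma) = ln(D), with the same height for every D
    print(D, u[np.argmax(vals)], vals.max())
```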

The foregoing probability is related to other key probabilities via Bayes’s theorem:

$$\Pr(\theta, \ln\sigma \mid X) \;=\; \frac{\Pr(X \mid \theta, \ln\sigma)\,\Pr(\theta, \ln\sigma)}{\Pr(X)} \;=\; \frac{\Pr(X \mid \theta, \ln\sigma)\,\Pr(\ln\sigma)\,\Pr(\theta)}{\Pr(X)}, \qquad \text{Eq. 4}$$

where we assume in the second equality that the prior probability distributions for ln σ and θ are independent. Summing over all possible values of ln σ (from minus to plus infinity) gives

$$\Pr(\theta \mid X) \;=\; \frac{\Pr(\theta)}{\Pr(X)}\sum_{\ln\sigma}\Pr(X \mid \theta, \ln\sigma)\,\Pr(\ln\sigma). \qquad \text{Eq. 5}$$

In the Bayesian view, the laws of probability underlying the foregoing relationships embody the fundamental logic of science. In particular, Bayesians interpret the preceding equation as the rule for rationally updating our opinions of the competing theories θ in light of the new evidence embodied in the observations X. The prior distribution Pr(θ) and posterior distribution Pr(θ|X) gauge rational degrees of belief in the theories before and after obtaining (or considering) evidence X, respectively. Updating is achieved by multiplying the prior probability of each theory θ by Pr(X|θ)—the probability, given θ, that we would obtain the evidence we actually did obtain. Considered as a function of θ for fixed evidence X—the data actually observed—Pr(X|θ) is the likelihood function L(θ). It captures, exactly and quantitatively, the relative weight of the evidence X for the competing theories, allowing a profound arithmetization of empirical judgment in those situations when it can be calculated. In summary, the likelihood function for this problem can be written

$$L(\theta) \;\equiv\; \Pr(X \mid \theta) \;=\; \sum_{\ln\sigma}\Pr(X \mid \theta, \ln\sigma)\,\Pr(\ln\sigma). \qquad \text{Eq. 6}$$
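A minimal sketch of the updating rule expressed by Eqs. 4–6: for a discrete set of competing theories, posterior probabilities are obtained by multiplying each prior probability by the corresponding likelihood and renormalizing. The prior and likelihood values below are illustrative assumptions, not derived from any particular data set.

```python
import numpy as np

prior      = np.array([0.5, 0.3, 0.2])      # Pr(theta) for three competing theories
likelihood = np.array([1e-4, 5e-4, 1e-6])   # Pr(X | theta) for the observed data X

posterior = prior * likelihood              # numerator of the updating rule
posterior /= posterior.sum()                # dividing by Pr(X) normalizes
print(posterior)                            # Pr(theta | X)
```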

2.2 When We Are Sufficiently Ignorant of the Standard Deviation, Likelihood Is a Simple Function of RMS Deviation

The likelihood function in Eq. 6 itself contains a prior probability distribution: Pr(ln σ). Regardless of the form of this distribution, it is obvious from the expression for Pr(X|θ, σ) in Eq. 2 that the theory that maximizes likelihood is the one with the smallest RMS deviation D (the same is true, though less obviously, of the expression for Pr(X|θ, ln σ) in Eq. 3); this confirms the well-known fact stated above that least-squares analysis pinpoints the maximum likelihood theory under the IIND assumption.

But if the likelihood function is to be used to weigh the strength of the evidence for the competing theories quantitatively, rather than merely to identify the maximum likelihood theory, the prior distribution Pr(ln σ) must be specified. Occasionally it happens that extensive prior information entails a particular special form for this function. Much more often, though, we are essentially ignorant of ln σ in advance of the data. That is the case that will be considered here.

What probability distribution properly expresses prior ignorance of the value of this parameter? Jaynes (1968; 2003, pp. 372–386) argues compellingly that ignorance, taken seriously, imposes strong constraints on prior probability distributions. In particular, the appropriate distribution for the logarithm of a scale parameter like σ is the uniform distribution Pr(ln σ) = const, or equivalently Pr(σ) ∝ 1/σ; this is the only distribution that remains invariant under a change of scale—a transformation that converts the original inference problem into another that should look identical to the truly ignorant observer. Substituting that ignorant prior distribution into Eq. 6, the likelihood function can be written

$$L(\theta) \;\propto\; \int_{-\infty}^{+\infty}\Pr(X \mid \theta, \ln\sigma)\; d(\ln\sigma) \;\propto\; D^{-n}\int_{-\infty}^{+\infty}\left(\frac{D}{\sigma}\right)^{\!n}\exp\!\left(-\frac{nD^2}{2\sigma^2}\right) d(\ln\sigma), \qquad \text{Eq. 7}$$

where constants that don’t depend on the variables of interest θ and X are suppressed, because it is only the relative values of the likelihood function that matter. As remarked above under Eq. 3, the integrand in the third part of the equation has the same size and shape regardless of the value of D. The integral in that equation is therefore itself a constant Q that doesn’t depend on θ and X. Dividing the last part of Eq. 7 by Q, the likelihood function further simplifies to

$$L(\theta) \;=\; \frac{1}{D^{\,n}} \qquad \text{Eq. 8}$$

under the specified conditions. This likelihood function was previously derived in a different context by Zellner (1971, pp. 114–117). This expression does much more than simply remind us that the maximum likelihood theory is the one that minimizes D; it makes it easy for us to articulate numerically the weight of the evidence X for each contending theory θ.
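The key step, that the integral Q in Eq. 7 does not depend on D, can be checked numerically. The sketch below (with an assumed n = 12 and arbitrary values of D) evaluates the integral over ln σ and compares it with the closed form one obtains analytically by substituting t = D/σ; neither the numerical values nor the closed form appear in the article and are offered only as a check.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

n = 12   # assumed number of data-points

def Q_numeric(D):
    integrand = lambda u: (D / np.exp(u)) ** n * np.exp(-n * D**2 / (2 * np.exp(u) ** 2))
    # integrate over u = ln(sigma); the interval is centered on ln(D), where the peak sits
    return quad(integrand, np.log(D) - 15, np.log(D) + 15)[0]

print([Q_numeric(D) for D in (0.1, 1.0, 10.0)])   # the same value for every D
print(0.5 * (2 / n) ** (n / 2) * gamma(n / 2))    # closed form (1/2)(2/n)^(n/2) Gamma(n/2)
```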

2.3 How Much Ignorance Is Enough?

The following extreme case might be put forward as a counterexample to the above reasoning. Suppose one of the competing theories happens to fit the data exactly. D vanishes altogether for such a theory, and according to Eq. 8 that theory’s likelihood would be infinitely greater than the likelihood of a theory that deviates even infinitesimally from the observed data. But common sense rebels at thus according infinite weight to an infinitesimal deviation.

This “counterexample” serves not to undermine the reasoning above, but rather to warn us that in using the “ignorant” prior distribution we are pretending to more ignorance than we actually possess. In any situation we choose to analyze in terms of distributions of deviations, we surely must have some vague prior information that convinces us that there is at least some error—that is, that the standard deviation σ is not infinitesimally small. Likewise, there is ordinarily some limit to how large the standard deviation can plausibly be. If we are making measurements with a light microscope, for instance, we wouldn’t credit a standard deviation as low as 1 femtometer or as high as 1 kilometer. This vague state of actual prior knowledge is sketched schematically in the mesa-shaped prior probability distribution in the upper part of Fig. 1. This curve is characterized by a broad central “domain of ignorance” where the curve is flat, and where the scale invariance that underlies the ignorant prior distribution Pr(ln σ) = const holds to a high degree of approximation. On either side of that domain the prior credibility of extreme values of ln σ descends gradually as shown, though in most cases we would be hard-pressed to describe that descent numerically. The ignorant prior distribution Pr(ln σ) = const, on which the simple likelihood function Eq. 8 is based, is represented by the dashed curve in the figure; it corresponds to an idealized domain of ignorance that extends without limit in both directions. For ease of comparison, both curves are shown normalized to the same plateau level of 1—a harmless transformation, since it’s only relative values of the likelihoods that matter.
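Scale invariance, the property that motivates the flat prior on ln σ within the domain of ignorance, can be checked directly. In the minimal sketch below, the interval of σ values and the rescaling factor are arbitrary illustrative choices; the point is that the prior Pr(σ) ∝ 1/σ assigns the same mass to an interval before and after a change of scale, which is the sense in which it looks the same to a truly ignorant observer.

```python
from scipy.integrate import quad

prior = lambda sigma: 1.0 / sigma            # un-normalized ignorance prior Pr(sigma) ~ 1/sigma
a, b, c = 0.2, 5.0, 7.3                      # an interval of sigma values and an arbitrary rescaling

mass_original = quad(prior, a, b)[0]         # prior mass on [a, b]       = ln(b/a)
mass_rescaled = quad(prior, c * a, c * b)[0] # prior mass on [c*a, c*b]   = ln(b/a) again
print(mass_original, mass_rescaled)          # equal
```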

Figure 1. Defining conditions under which the simple likelihood function Eq. 8 is valid. Use of the figure is explained in the text. UPPER GRAPH: The mesa-shaped “actual” prior probability distribution Pr(ln σ) represents a typical vague state of prior knowledge of the ln σ parameter. The flat part of the curve spans a “domain of ignorance” in which scale invariance holds to a high degree of approximation; the lower and upper bounds ln σmin and ln σmax are chosen to lie unarguably within the domain of ignorance. The flat dashed line represents an idealized prior distribution in which the domain of ignorance extends indefinitely in both directions. The three Pr(X|θ, ln σ) curves are plots of the integrand in the third part of Eq. 7, using n = 12 as the number of data-points; the curves differ in the value of ln D, which is the value of ln σ at which the curve peaks. The value of ln D for the left-hand Pr(X|θ, ln σ) curve, ln Dmin, is chosen so that the tail area to the left of ln σmin is 10% of the total area under the curve. Similarly, the value of ln D for the right-hand Pr(X|θ, ln σ) curve, ln Dmax, is chosen so that the tail area to the right of ln σmax is 10% of the total area under the curve. LOWER GRAPH: The ratios Dmin/σmin and Dmax/σmax, as defined above, are plotted against the number of data-points n.

Also shown in the upper part of the figure are peak-shaped distributions Pr(X|θ, ln σ) for three different theories θ, corresponding to three different values of ln D. The three curves have been normalized to the same arbitrary peak height by multiplying each by a factor proportional to D^n; they are thus graphs of the integrand in the third part of Eq. 7. The peak value of each curve occurs when ln σ = ln D. The middle curve corresponds to a theory θ whose ln D value lies well within the domain of ignorance. In many applications, all contending theories θ are like that: they all correspond to values of ln D that clearly lie within the domain of ignorance. In those cases, the relative values of their likelihoods L(θ) will be the same whether we use the actual or the idealized prior distribution for Pr(ln σ); that’s because both those prior distributions are uniform over all values of ln σ that are of practical relevance to the inference at hand.
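This agreement can be mimicked numerically. The sketch below compares likelihood ratios computed with an idealized flat prior on ln σ and with an invented mesa-shaped prior whose plateau spans an assumed domain of ignorance; for theories whose ln D lies well inside the plateau, the two computations agree and match the 1/D^n rule of Eq. 8. All numbers are illustrative assumptions, not taken from the article.

```python
import numpy as np
from scipy.integrate import quad

n = 12   # assumed number of data-points

def pr_X_given_theta_lnsigma(u, D):
    # Eq. 3 without the constant (2*pi)^(-n/2), as a function of u = ln(sigma)
    return np.exp(-n * u) * np.exp(-n * D**2 / (2 * np.exp(2 * u)))

def mesa_prior(u, lo=-6.0, hi=6.0, ramp=2.0):
    # flat plateau (value 1) on [lo, hi]; gradual descent outside it
    if u < lo:
        return np.exp(-((lo - u) / ramp) ** 2)
    if u > hi:
        return np.exp(-((u - hi) / ramp) ** 2)
    return 1.0

def likelihood(D, prior=lambda u: 1.0):
    f = lambda u: prior(u) * pr_X_given_theta_lnsigma(u, D)
    return quad(f, np.log(D) - 12, np.log(D) + 12)[0]

D1, D2 = 0.5, 1.5    # ln(D) of both theories lies well inside the plateau [-6, 6]
print(likelihood(D1) / likelihood(D2))                          # idealized flat prior
print(likelihood(D1, mesa_prior) / likelihood(D2, mesa_prior))  # mesa-shaped prior
print((D2 / D1) ** n)                                           # the 1/D^n rule of Eq. 8
```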

What values of ln D lie safely within the domain of ignorance, as it has rather vaguely been described so far? To put the question in another way: for what range of theories θ, corresponding to what range of RMS deviations D, is the fractional error in likelihood incurred by using the idealized prior distribution in place of the actual prior distribution acceptably low—say, 10% or less? Answering that question precisely is typically either prohibitively laborious or beyond our powers altogether. However, it is usually feasible to put upper bounds on the fractional error, which can be written

$$\text{fractional error} \;=\; \frac{\bigl|\,L_{\text{ideal}}(\theta)-L_{\text{actual}}(\theta)\,\bigr|}{L_{\text{ideal}}(\theta)},$$

where L_ideal(θ) is the likelihood computed with the idealized uniform prior (leading to Eq. 8) and L_actual(θ) is the likelihood computed from Eq. 6 with the actual prior Practual(ln σ).
That will be so if, without undue mental effort, we can specify a lower limit ln σmin and an upper limit ln σmax that lie unarguably within the domain of ignorance, as shown in the upper part of Fig. 1. For values of ln D that lie near ln σmin, as for the left-hand peak in the upper part of the figure, it is easy to prove that the fractional error can be no more than

$$\text{fractional error} \;\le\; \frac{\displaystyle\int_{-\infty}^{\ln\sigma_{\min}} \Pr(X \mid \theta, \ln\sigma)\; d(\ln\sigma)}{\displaystyle\int_{-\infty}^{+\infty} \Pr(X \mid \theta, \ln\sigma)\; d(\ln\sigma)} \;=\; \frac{1}{Q}\int_{-\infty}^{\ln\sigma_{\min}} \left(\frac{D}{\sigma}\right)^{\!n}\exp\!\left(-\frac{nD^{2}}{2\sigma^{2}}\right) d(\ln\sigma),$$

where in the second part of the equation we use the fact that, by Eq. 3, numerator and denominator share the common factor (2π)^(−n/2) D^(−n), so that the denominator reduces to the constant Q defined under Eq. 7. For the left-hand peak, this error corresponds to the ratio of the blackened tail area to the total area under the curve. Similarly, for values of ln D that lie near ln σmax, as for the right-hand peak, the fractional error can be no more than

$$\text{fractional error} \;\le\; \frac{1}{Q}\int_{\ln\sigma_{\max}}^{+\infty} \left(\frac{D}{\sigma}\right)^{\!n}\exp\!\left(-\frac{nD^{2}}{2\sigma^{2}}\right) d(\ln\sigma)$$

(blackened tail area over total area for the right-hand peak). As indicated in the figure, we can define values of RMS deviation Dmin and Dmax such that these tail areas are only 10%—usually an acceptable error level, given all the other uncertainties that beset quantitative inference in practice. When the RMS deviations D of all theories θ in contention lie between Dmin and Dmax, we are adequately ignorant to warrant use of the simplified likelihood function Eq. 8. The lower part of Fig. 1 graphs the ratios Dmin/σmin and Dmax/σmax against the number of data-points n, allowing Dmin and Dmax to be computed from σmin and σmax.
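The lower graph of Fig. 1 is not reproduced here, but a plausible way to compute such ratios is sketched below: after the substitution t = D/σ, the 10% tail areas of the integrand in Eq. 7 reduce to regularized incomplete gamma functions, which can be inverted directly. This reconstruction is an assumption about how the curves could be generated, offered only for illustration.

```python
import numpy as np
from scipy.special import gammaincinv, gammainccinv

def d_ratios(n, tail=0.10):
    # With t = D/sigma, the tail areas become regularized incomplete gamma
    # functions of n*t^2/2, so a 10% tail can be inverted in closed form.
    d_min_over_sigma_min = np.sqrt(2.0 * gammainccinv(n / 2.0, tail) / n)  # left tail  = 10%
    d_max_over_sigma_max = np.sqrt(2.0 * gammaincinv(n / 2.0, tail) / n)   # right tail = 10%
    return d_min_over_sigma_min, d_max_over_sigma_max

for n in (3, 6, 12, 24, 48):
    print(n, d_ratios(n))
```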

Large percentage errors in calculating the likelihoods of the “winning” theories—those with the smallest RMS deviations D—are intolerable because they give rise to serious errors of judgment. As the RMS deviation for the winning theory gets smaller and smaller—i.e., as ln D moves farther and farther to the left of the domain of ignorance—its likelihood asymptotically approaches an upper limit that doesn’t depend on D but does depend sensitively on the exact form of the prior probability distribution Practual(ln σ). On the scale of Eq. 8, that limit is

$$L_{\max} \;=\; \frac{1}{Q}\int_{-\infty}^{+\infty}\Pr_{\text{actual}}(\ln\sigma)\;\sigma^{-n}\; d(\ln\sigma),$$

where the constant Q is defined under Eq. 7, and where the distribution Practual(ln σ) is assumed to be normalized to a plateau value of 1 (as in Fig. 1). Substituting the simplified likelihood Eq. 8—a likelihood that increases without bound in proportion to 1/D^n—is a gross misrepresentation of the data, vastly overstating the weight of the evidence for the winners. That is precisely the situation in the “counterexample” with which this subsection began.
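The saturation described above can be illustrated numerically. In the sketch below, an invented mesa-shaped prior on ln σ stands in for Practual; as D shrinks below the assumed domain of ignorance, the likelihood computed with that prior (on the scale of Eq. 8) levels off, while 1/D^n keeps exploding. Every number here is an illustrative assumption.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

n = 12
Q = 0.5 * (2.0 / n) ** (n / 2.0) * gamma(n / 2.0)    # the constant defined under Eq. 7

def mesa_prior(u, lo=-3.0, hi=3.0, ramp=1.0):        # invented "actual" prior on u = ln(sigma)
    if u < lo:
        return np.exp(-((lo - u) / ramp) ** 2)
    if u > hi:
        return np.exp(-((u - hi) / ramp) ** 2)
    return 1.0

def likelihood_actual(D):
    # likelihood on the scale of Eq. 8, but computed with the mesa prior
    f = lambda u: mesa_prior(u) * np.exp(-n * u) * np.exp(-n * D**2 / (2 * np.exp(2 * u)))
    return quad(f, -15, 15, limit=200)[0] / Q

for D in (1.0, 1e-2, 1e-4, 1e-6):
    print(D, likelihood_actual(D), 1.0 / D**n)       # the former saturates; the latter explodes
```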

In contrast, large percentage errors in the likelihoods of the “losing” theories—those with the largest RMS deviations—are frequently harmless, even when those losers’ RMS deviations lie far beyond the domain of ignorance. That’s because we’re not interested in the losers individually, but only in how they collectively affect our judgment of the winners. Although the losers’ likelihoods may be greatly exaggerated by Eq. 8, those likelihoods are so small that the losers’ posterior probabilities may be collectively negligible in comparison to those of the entire ensemble of contending theories. In that case, the exaggeration will have no significant impact on our judgment of the winners. Again taking 10% as an acceptable error level, this condition will be met if

$$\sum_{\theta:\,D_{\theta} > D_{\max}} \Pr(\theta)\, D_{\theta}^{-n} \;\le\; 0.1\sum_{\text{all}\ \theta} \Pr(\theta)\, D_{\theta}^{-n}, \qquad \text{Eq. 9}$$

where Dθ denotes the RMS deviation of theory θ and the sum on the left runs over the losing theories.

In summary, sufficient conditions for use of the simple likelihood function Eq. 8 are that the lowest RMS deviation be greater than or equal to Dmin and that inequality Eq. 9 be valid. These conditions will be met in the large majority of cases encountered in practice.
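A minimal sketch of how these two summary conditions might be checked in practice, under the assumption that Eq. 9 takes the reconstructed form given above; the RMS deviations, prior probabilities, and bounds are illustrative inventions.

```python
import numpy as np

n = 12
D     = np.array([0.8, 1.0, 1.3, 45.0, 90.0])    # RMS deviations of the contending theories
prior = np.ones_like(D) / D.size                 # Pr(theta), here taken uniform
D_min, D_max = 0.5, 20.0                         # assumed bounds obtained from sigma_min, sigma_max

weights = prior / D**n                           # prior times the Eq. 8 likelihood 1/D^n
losers  = D > D_max

print(D.min() >= D_min)                              # condition 1: smallest D is at least D_min
print(weights[losers].sum() <= 0.1 * weights.sum())  # condition 2: Eq. 9 (as reconstructed above)
```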

3. REFERENCES

Jaynes, E. T. (1968), “Prior Probabilities,” IEEE Transactions on Systems Science and Cybernetics, SSC-4, 227–241.

Jaynes, E. T. (2003), Probability Theory: The Logic of Science, Cambridge, UK: Cambridge University Press.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, New York: Wiley.
