GENERATION OF WEIGHTS FOR COMBINED FORECASTS

BASED ON THE MUTUAL INFORMATION FUNCTION

William J. Cosgrove, California State Polytechnic University, Pomona

Abstract

The mutual information function from information theory is employed to measure redundancy among distributions from a group of forecasts which are combined into a single forecast. Distributions within the group showing redundancy with the actual demand convey similar information as the actual demand, indicating their contribution as a predictor of actual demand. Weights for combined forecasts in subsequent periods are generated based on this contribution. An example is developed to illustrate the concept. The study assumes that forecasts within the group are stochastic and independently generated.

Keywords: Forecasting; Entropy; Information Theory; Management Science

Introduction

The benefits of combined time series forecasts from multiple independent sources over a single forecast from a single source are strongly supported in the forecasting literature (e.g., Georgoff and Murdick, 1986; Makridakis et al., 1998; Armstrong, 2001). Combined forecasting procedures draw on several conventional analytical procedures to capture the historical characteristics of demand, while also integrating managerial opinion into the forecast. After examining 30 empirically based comparisons, Armstrong (2001) reported that the reduction in error for evenly (equally) weighted combined forecasts, compared to conventional non-combined forecasts, averaged about 12.5% and ranged from 3% to 24%.

As reported by Bunn and Wright (1991), research in the area of combined forecasting has been sparse, and a review of the forecasting literature since their study suggests that this remains the case. The studies most relevant to the present work are those that apply unequal weights to the component forecasts. Bates and Granger (1969) structured the combined forecast as a network formulation and used linear least squares to optimize the selection of weights with the single criterion of forecasting accuracy. Fralicx and Raju (1982) proposed canonical correlations to determine weights based on multiple criteria. Cosgrove (2003) employed simulation to combine forecasts using weights derived from Bayes’ Theorem, and a second study by Cosgrove (2004) generated best-fit weights through optimization based on preemptive priority goal programming.

The purpose of this study is to derive forecasting weights for a single time period based on the mutual information function (MIF) (e.g., Jelinek, 1968; Jones, 1979). The MIF is applied to two random variables, X (the forecasting source) and Y (the forecast), and measures the information that X conveys about Y, where X is treated as stochastic only for the purpose of assigning pre-weighting factors. Since the forecasts within the group are assumed to be stochastic and independently generated, the MIF can measure the redundancy (i.e., the extent to which forecast distributions from different forecasting sources overlap) of the group forecasts, as well as the added redundancy contributed by each forecast that makes up the combined forecast. The MIF permits various combinations of forecasts within a group to be compared to the actual demand. The greater the redundancy between a forecast and the actual demand, the better its distribution serves as a predictor of actual demand for the time period under consideration.

In the sections that follow, the MIF is developed based on the entropy function from information theory. A procedure to determine the weights is proposed followed by a numerical example.

Entropy and Mutual Information Functions

Consider a bivariate structure described by X={xi|i=1,2,...I} and Y={yj|j=1,2,...J}, with xi the ith state of X and yj the jth state of Y, and distribution functions P(X) and P(Y). The entropy function of Y is given by (e.g., Jones, 1979)

H(Y) = -Σ_{yj∈Y} p(yj) log p(yj),  (1)

where the logarithm can be taken to any base, with min[H(Y)]=0 corresponding to p(yj)=1 for a single state (i.e., if Y is deterministic), and max[H(Y)]=log J corresponding to p(y1)=p(y2)=...=p(yJ). Hence, the state of maximum entropy, or uncertainty, occurs when all states of Y are equiprobable.
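As an illustrative aside (not part of the original derivation), Expression (1) can be computed directly from a probability vector. The function name below and the choice of natural logarithms are assumptions made only for illustration; any base is admissible.

```python
import math

def entropy(p):
    # H per Expression (1); natural logarithm assumed (any base would do).
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# The uniform distribution over J states attains the maximum, log J ...
J = 6
print(entropy([1.0 / J] * J), math.log(J))   # both approximately 1.7918
# ... and a deterministic distribution attains the minimum of zero.
print(entropy([1.0, 0.0, 0.0]))              # 0.0
```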

Other relationships of interest include the partial conditional entropy of Y for a single state xi and the conditional entropy of Y given X, defined as follows:

H(Y|xi) = -Σ_{yj∈Y} p(yj|xi) log p(yj|xi)  (2)

and

H(Y|X) = Σ_{xi∈X} p(xi) H(Y|xi),  (3)

with H(Y) ≥ H(Y|X). P(Y|xi) is the distribution function corresponding to Expression (2).

A theorem from information theory expresses the mutual information function (MIF) as (e.g., Jones, 1979)

I(Y;X) = I(X;Y) = H(Y) - H(Y|X), (4)

where 0≤I(Y;X)≤min[H(Y),H(X)]. Expressions for H(X), H(X|yj), and H(X|Y) parallel Expressions (1), (2), and (3). Expression (4) can be illustrated as the intersection of H(Y) and H(X) in the Venn diagram of Figure 1.
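Continuing the illustrative sketch above (and assuming the entropy helper just defined is in scope), Expressions (3) and (4) can be computed from the prior p(xi) and the conditional distributions P(Y|xi); the marginal distribution of Y used here anticipates Expression (5). Function names are illustrative only.

```python
def conditional_entropy(p_x, p_y_given_x):
    # H(Y|X) per Expression (3): probability-weighted partial entropies of Expression (2).
    return sum(px * entropy(cond) for px, cond in zip(p_x, p_y_given_x))

def mutual_information(p_x, p_y_given_x):
    # I(Y;X) per Expression (4): H(Y) - H(Y|X), with the marginal P(Y) as in Expression (5).
    J = len(p_y_given_x[0])
    p_y = [sum(px * cond[j] for px, cond in zip(p_x, p_y_given_x)) for j in range(J)]
    return entropy(p_y) - conditional_entropy(p_x, p_y_given_x)
```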

Intuitive interpretations from information theory for the MIF as given in Expression (4) follow from basic statistics and information theory texts (e.g., Papoulis, 1984; Hays and Winkler, 1975; Jelinek, 1968). Generally, the MIF is described as a measure of the reduction in uncertainty, or equivalently, as a measure of information gain, of Y given a knowledge of X. When Y is measured on an interval/distance scale and X on a nominal/categorical scale, the MIF provides a continuous measure of discrimination of the conditional distributions associated with the interval/distance scale. For example, consider the marginal distribution P(Y) specified by

p(yj) = Σ_{xi∈X} p(xi) p(yj|xi),  (5)

with entropy H(Y) determined from Expression (1). Figures 2 through 4 illustrate three patterns reflecting reductions in the MIF as given in Expression (4). Figure 2 assumes all p(xi) are approximately equal, generating the marginal distribution in a somewhat symmetrical pattern. However, as p(x1) increases as shown in Figure 3, H(Y) and I(Y;X) decrease according to Expression (4), and the marginal distribution P(Y) begins to skew to the right, shifting its mode toward the mode of P(Y|x1). The skewness is more pronounced in Figure 4 as p(x1) continues to increase with decreasing p(x2) and p(x3), to the extent that the marginal distribution begins to take the shape of the now dominating conditional distribution, P(Y|x1). Hence, as p(x1) increases (with x1∈X having the greatest influence), knowledge of X provides progressively less additional information about Y, and the MIF decreases. In the extreme case, when p(x1)=1, the distributions P(Y|x1) and P(Y) are identical, and the MIF attains its minimum value of zero.
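The progression from Figure 2 to Figure 4 can be reproduced numerically with the helpers sketched above. The conditional distributions below are invented purely for illustration; only the qualitative behavior, the MIF shrinking as p(x1) grows, is the point.

```python
# Three overlapping conditional distributions, invented for this sketch.
p_y_given_x = [
    [0.10, 0.60, 0.30, 0.00],   # P(Y|x1)
    [0.00, 0.30, 0.50, 0.20],   # P(Y|x2)
    [0.20, 0.40, 0.30, 0.10],   # P(Y|x3)
]

# As p(x1) grows toward 1, P(Y) collapses onto P(Y|x1) and the MIF falls to zero,
# mirroring the progression described for Figures 2 through 4.
for p1 in (1.0 / 3.0, 0.70, 0.95, 1.0):
    rest = (1.0 - p1) / 2.0
    print(round(p1, 2), round(mutual_information([p1, rest, rest], p_y_given_x), 4))
```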

In the sections that follow, the study will employ the patterns of behavior for distributions in a bivariate structure as described above.

Determination of Weights

Assume a set of N independent forecasting sources represented by N={n|n=1,2,...N}, each generating a forecast for a given period t, where t is an integer and t≥1. If Fn,t is the nth forecast with corresponding weight wn,t, the combined forecast from the N forecasting sources is given by

CFt = Σ_{n∈N} wn,t Fn,t,  (6)

where

Σ_{n∈N} wn,t = 1,  (7)

with 0≤wn,t≤1 for all n∈N, and Fn,t either a random variable or a point estimate.
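A combined point forecast per Expressions (6) and (7) is then a weighted sum; the small check below is a sketch only, with hypothetical weights and point forecasts.

```python
def combined_forecast(weights, forecasts):
    # CF_t per Expression (6); the weights must satisfy Expression (7).
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * f for w, f in zip(weights, forecasts))

# Hypothetical point forecasts from three sources for a single period.
print(combined_forecast([0.45, 0.55, 0.0], [103.5, 104.2, 106.0]))   # 103.885
```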

The process to determine wn,t for the forecasted period t requires that the actual demand for period t-1 be known (deterministic). If x1 refers to the actual demand, it follows that H(Y|x1)=0 and P(Y|x1) is deterministic. For other xi, let

p(xi)=[1-p(x1)]/(I-1) for xi≠x1 (8)

with

1>p(x1)≥½ . (9)

In this manner, xi plays a dual role: it represents the actual demand (i=1) for the current period t-1, and the I-1 forecasting sources for the subsequent period t. Also, setting p(x1)≥½ assures that P(Y|x1) is the dominating conditional distribution shaping the marginal distribution, as discussed in reference to Figure 4. All p(xi) for xi≠x1 that originate from the forecasting sources are equiprobable, as shown in Expression (8). The MIF as calculated in Expression (4) includes all I conditional distributions, where P(Y|x1) is treated as a conditional distribution as a matter of convenience. This permits consideration of the impact on the MIF from each of the I-1 conditional distributions from the forecasting sources.
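Expressions (8) and (9) fix the prior over x1 (the actual demand) and the forecasting sources. A minimal sketch follows, with an assumed helper name:

```python
def source_priors(p_x1, n_sources):
    # p(x) per Expressions (8)-(9): actual demand receives p(x1),
    # and the n_sources forecasting sources share the remainder equally.
    assert 0.5 <= p_x1 < 1.0, "Expression (9) requires 1 > p(x1) >= 1/2"
    return [p_x1] + [(1.0 - p_x1) / n_sources] * n_sources

print(source_priors(0.95, 3))   # [0.95, 0.0167, 0.0167, 0.0167] (approximately)
```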

To measure the impact on MIF from each forecasting source, a ratio is developed based on changes in the MIF when a single forecasting source is removed. If a forecasting source is contributing to a reduction in I(Y;X), it follows that its removal will lead to an increase in the MIF and H(Y) for the remaining forecasting sources (i.e., an increase in the uncertainty of Y given X with the actual demand included as the deterministic P(Y|x1)). However, if a forecasting source is removed and the MIF decreases, the combined forecast is a better predictor without such a source. Consequently, it would be assigned a weight of zero because it fails to contribute to making the actual demand more certain. Such a situation is not accommodated with other weighting methods.

To account for changes in the MIF when a forecasting source is removed, let xi’ specify the removed element in X. An altered form of the MIF less xi’ based on Expression (4) is given by

I(Y;X|xi’∉X) = H(Y)’ - H(Y|X|xi’∉X)  for all xi’ ≠ x1.  (10)

Note that xi’≠ x1 is required since x1 corresponds to the actual demand, which is not a forecasting source. H(Y)’ follows from Expression (1) with

p(yj) = Σ_{xi∈X|xi’∉X} p(xi) p(yj|xi)  for xi’ ≠ x1,  (11)

which is the marginal distribution corresponding to x1 and the remaining I-2 forecasting sources, with the p(xi) for the remaining sources recomputed from Expression (8) with I reduced by one (as reflected in Table 2); and H(Y|X|xi’∉X) follows from Expressions (2) and (3) with the summation in Expression (3) taken over all xi∈X|xi’∉X. The relative change in the MIF from the elimination of xi’ is defined by

∆MIF(Y;X|xi’∉X) = [I(Y;X|xi’∉X) - I(Y;X)]/I(Y;X)  for xi’ = x2, x3, ..., xI.  (12)

To eliminate forecasting sources that do not contribute to the predictability of x1 (the actual demand), let

R(Y;X|xi’∉X) = ∆MIF(Y;X|xi’∉X)   if ∆MIF(Y;X|xi’∉X) > 0
R(Y;X|xi’∉X) = 0                 if ∆MIF(Y;X|xi’∉X) ≤ 0,  (13)

where R(Y;X|xi’∉X)=0 corresponds to a forecasting source that does not contribute to improving the predictability of the actual demand.

If

R(Y;X) = Σ_{i’=2}^{I} R(Y;X|xi’∉X),  (14)

the nth forecasting weight for period t can be expressed by

wn,t = [R(Y;X) - R(Y;X|xi’∉X)]/R(Y;X)   if R(Y;X|xi’∉X) > 0
wn,t = 0                                if R(Y;X|xi’∉X) ≤ 0,  (15)

with n = i’-1 for i’ = 2, 3, ..., I.
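The full leave-one-out procedure of Expressions (10) through (15) can be sketched compactly as follows, reusing the entropy, mutual_information, and source_priors helpers from the earlier sketches (all names are illustrative, not the author's notation):

```python
def mif_weights(p_x1, actual_demand_dist, source_dists):
    # Weights w_{n,t} per Expressions (10)-(15). actual_demand_dist is the
    # deterministic P(Y|x1); source_dists holds the I-1 conditional
    # distributions P(Y|x2), ..., P(Y|xI) from the forecasting sources.
    n = len(source_dists)
    conds = [actual_demand_dist] + source_dists
    full_mif = mutual_information(source_priors(p_x1, n), conds)          # I(Y;X)

    # Expressions (10)-(12): leave-one-out MIF and its relative change.
    delta = []
    for i in range(n):
        kept = [actual_demand_dist] + [d for k, d in enumerate(source_dists) if k != i]
        loo = mutual_information(source_priors(p_x1, n - 1), kept)        # I(Y;X|xi'∉X)
        delta.append((loo - full_mif) / full_mif)

    # Expression (13): zero out sources whose removal does not raise the MIF.
    r = [d if d > 0.0 else 0.0 for d in delta]
    r_total = sum(r)                                                      # Expression (14)

    # Expression (15): a larger retained R yields a smaller share of the weight.
    return [(r_total - ri) / r_total if ri > 0.0 else 0.0 for ri in r]
```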

Example

This section develops an example with three forecasting sources and the actual demand as shown in Table 1. The actual demand is expressed as y3=104 units with a corresponding conditional distribution (i.e., P(Y|x1)) that is deterministic. Table 2 gives results based on the expressions from the previous section. The high setting of p(x1) at .9500 (from Expression (9)) in Table 2 is intended to shape the marginal distribution as a close approximation to the deterministic P(Y|x1), as shown in the second and last columns of Table 1. Conditional distributions representing the forecasting sources that have the most overlap (or redundancy) with P(Y|x1) make a greater contribution to the predictability of the actual demand than those with little or no overlap. In this example, the small overlap of the conditional distribution of the third forecasting source (with n=i’-1=3) is insufficient to improve the predictability of actual demand; in fact, removing the third forecasting source improves the predictability obtained from the remaining two sources, indicating that there is a disadvantage to retaining it. Consequently, w3,t in the last row of Table 2 is set to 0, as required by the calculations based on Expressions (13) and (15).

Table 1: Example with Three Forecasting Sources

 j   yj    Actual Demand   Forecasting    Forecasting    Forecasting     Marginal
           P(Y|x1)         Source n=1     Source n=2     Source n=N=3    Distribution
                           P(Y|x2)        P(Y|x3)        P(Y|x4)         P(Y)
 1   102   0               .1664          0              0               .0028
 2   103   0               .2341          .1764          .0625           .0079
 3   104   1.0             .5084          .4004          .2500           .9693
 4   105   0               .0911          .3295          .3750           .0133
 5   106   0               0              .0937          .2500           .0057
 6   107   0               0              0               .0625          .0010

Table 2: Results for Entropy, MIF, and Weight Calculations

 i/i' →           1/-      2/2      3/3      4/4
 p(x1)            .9500    .9500    .9500    .9500
 p(x2)            .0167    -        .0250    .0250
 p(x3)            .0167    .0250    -        .0250
 p(x4)            .0167    .0250    .0250    -
 H(Y)             .1788    -        -        -
 H(Y)'            -        .1859    .1835    .1588
 H(Y|X)           .0645    -        -        -
 H(Y|X|xi'∉X)     -        .0667    .0652    .0615
 I(Y;X)           .1143    -        -        -
 I(Y;X|xi'∉X)     -        .1192    .1183    .0973
 R(Y;X)           .0779    -        -        -
 R(Y;X|xi'∉X)     -        .0429    .0350    0
 n = i'-1         -        1        2        3
 wn,t             -        .4493    .5507    0
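For reference, a minimal sketch of how Table 2 follows from Table 1, assuming the mif_weights helper sketched earlier and natural logarithms; the printed weights should match the last row of Table 2 up to rounding.

```python
# Conditional distributions from Table 1 (rows j = 1..6, demand levels 102..107).
actual  = [0.0,    0.0,    1.0,    0.0,    0.0,    0.0   ]   # P(Y|x1), demand = 104
sources = [
    [0.1664, 0.2341, 0.5084, 0.0911, 0.0,    0.0   ],        # P(Y|x2), source n=1
    [0.0,    0.1764, 0.4004, 0.3295, 0.0937, 0.0   ],        # P(Y|x3), source n=2
    [0.0,    0.0625, 0.2500, 0.3750, 0.2500, 0.0625],        # P(Y|x4), source n=3
]

print(mif_weights(0.95, actual, sources))   # approximately [0.449, 0.551, 0.0]
```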

Conclusion

As illustrated in the methodology and demonstrated in the example, p(x1) in Expression (9) is the key parameter in determining the extent that the distributions from the forecasting sources overlap with the deterministic actual demand. High settings of p(x1) will increase the weights from forecasting sources with conditional distributions that overlap with the deterministic P(Y|x1), and diminish or even eliminate weights with little or no overlap. Weights generated from this study depend on the MIF as a continuous measure of predictability only to the extent that forecasting sources contribute to predicting actual demand. However, once they cross a threshold where they do not, they are eliminated regardless of how close their mean or mode may be to that of the actual demand. This suggests that the approach in this study is more suitable for situations where the forecasted item is subject to high levels of volatility, as reflected by conditional distributions exhibiting high levels of dispersion.

The study was limited to deriving weights for a single period which were independent of the weights from past periods. A proposed extension in the methodology is to move beyond the bivariate form to multivariate forms of the MIF. This extension could accommodate additional sets of categorical variables corresponding to past forecasting periods with the same forecasting sources. Such an extension can be visually depicted in terms of adding additional spheres to the Venn diagram in Figure 1. However, the additional variables would significantly add to the size and complexity of the problem, leading to several forms of MIF for multiple combinations of the variables. These forms would reflect the various combinations of overlap from the spheres of a multivariable Venn diagram.

References

Armstrong, J.S., “Integrating, Adjusting, and Combining Procedures: Combining Forecasts,” in Principles of Forecasting: A Handbook for Researchers and Practitioners, edited by J.S. Armstrong, Wharton School, University of Pennsylvania, Boston: Kluwer Academic Publishers, 2001, pp. 417-439.

Bates, J.M. and C.W.J. Granger, “The Combination of Forecasts,” Operational Research Quarterly, 20, 1969, pp. 451-468.

Bunn, D. and G. Wright, “Interaction of Judgmental and Statistical Forecasting Methods: Issues and Analysis,” Management Science, 37, 1991, pp. 501-518.

Cosgrove, W., “Combined Time-Series/Judgmental Forecasting Procedure Utilizing Probability Tree Structures,” Proceedings of the International Decision Sciences Institute, Shanghai, People’s Republic of China, July 2003.

Cosgrove, W., “A Combined Forecasting Procedure Based on Network Simulation and Optimization,” Proceedings of the Asian-Pacific Decision Sciences Institute, Seoul, Korea, July, 2004.

Fralicx, R.D. and N.S. Raju, “A Comparison of Five Methods for Combining Multiple Criteria into a Single Composite,” Educational and Psychological Measurement, 42, 1982, pp. 823-827.

Georgoff, D.M. and R.G. Murdick, “Manager’s Guide to Forecasting,” Harvard Business Review, January-February, 1986, pp. 110-120.

Hays, W.L. and R.L. Winkler, Statistics: Probability, Inference, and Decision, 2nd ed., NY: Holt, Rinehart, and Winston, 1975.

Jelinek, F., Probabilistic Information Theory, NY: McGraw-Hill, 1968.

Jones, D.S., Elementary Information Theory, Oxford: Clarendon Press, 1979.

Makridakis, S.G., S.C. Wheelwright, and R.J. Hyndman, Forecasting: Methods and Applications, 3rd ed., NY: John Wiley and Sons, 1998.

Papoulis, A., Probability, Random Variables, and Stochastic Processes, 2nd ed., NY: McGraw-Hill, 1984.