Spring 2008Mike JonesLiterature Search
Literature Search
“A Method for Improved Modeling of Input Data Using Stochastic-Deterministic Decomposition and Synthesis”
Submitted: March6, 2008
Prepared for:
Dr. Fredrick McKenzie
Dr. Michael Bailey
Dr. Ghaith Rabadi
Dr. Zia-ur Rahman
OldDominionUniversity
Norfolk, VA
Prepared by:
Michael C. Jones
OldDominionUniversity
Norfolk, VA
1
Spring 2008Mike JonesLiterature Search
1.0:IntrODUCTION
Modeling and simulation (M&S) is pervasive across the majority of modern industries. Key executives and engineers now accept M&S as a primary tool to support analysis before making critical decisions. The validity of these models is closely scrutinized through meticulous industry-developed standards and procedures (Department of Defense 1996). Even if the model itself is perfect, which is rarely achievable, the output of the model can only be as accurate as the data provided to the model. Current methods employed to model and generate that input data focus on sophisticated analytical techniques based on well-studied stochastic systems. In practice, modelers focus on identifying the probability distributions which produce histograms which closely resemble the histogram of the observed or expected input data (Alexopoulos et al., 2006).
Unfortunately, the majority of the data observed in the actual systems are not purely stochastic. Many of the techniques used in modeling input data assume independent, identically distributed (iid) data streams while the majority of data streams are not iid. In many cases, observed data streams have cycles with some variability about the cycle. They are, in effect, a combination of deterministic and stochastic components. The deterministic component forms the cyclic structure of the data stream while the stochastic component forms the variability, or noise, about the deterministic component (Li et al., 2007). Consider the case of a restaurant. Modeling the arrivals of customers at a restaurant as an exponential random variable with a constant mean does not provide a realistic simulation since it disregards the impact of rush hour. Consider also the case of a hardware store manager ordering air conditioners. Should the manager order as many air conditioners in May as he does in October? A model of the store’s inventory which assumes the arrival of customers demanding air conditioners is an exponential random variable with a constant mean would lead to the conclusion that the orders in May and October should be the same.
A second oversight of modeling input data as iid random variables is neglecting the correlation between input data streams. Again, many of the methods in use today attempt to reproduce each input stream independently. Frequently, observed data will exhibit correlation between input streams. Consider the hardware store manager mentioned above. The reorder policy will be based on several factors, including demand and the expected lag time between placing an order and receiving the merchandise. The demand and lag time would be modeled as separate data streams. Those quantities may not be independent of each other. Perhaps at the beginning of the summer, when demand for new air conditioners peaks, the factory has a large inventory on hand and,therefore,the lag time is short. An order placed in August may arrive to an empty warehouse and have to wait to be filled. Since both inputs, demand and lag time, vary as a function of time, they will be correlated and should be modeled as such.
Modeling input data is a basic skill which must be understood by any practitioner. Therefore, the topic is covered in varying degrees by most introductory texts (eg. Kelton et al., 2004, Law & Kelton, 2000, and Ziegler et al., 2000). This research will address the issues of cyclic arrivals and correlated inputs through the development of a new algorithm for modeling input data streams by analyzing and generating representative input data which considers the correlation of input data as well as the deterministic nature of the data. The emphasis will be on exploring methods of decomposing input data streams into deterministic and stochastic components and methods of synthesizing new data streams which are representative of the observed data streams. Additionally, methods of independently manipulating the deterministic and stochastic components to produce data streams which are significantly different from the observed data streams but still plausible will be explored.
2.0:Background and LITERATURE review
2.1: Framework for review.
Figure 1: Leemis' Taxonomy
The topic of input data modeling is well covered in the proceedings of a variety of conferences and in most modeling and simulation text books. This review will explore both the breath and the depth of the available literature. In order to first demonstrate the breadth of the literature, the taxonomy suggested by Leemis (2001) will be used as a framework. Leemis’ Taxonomy is provided as Figure 1 for reference.The literature pertaining to each branch on Lemmis’ taxonomy will be summarized. Once the breadth of the literature is explored, the literature on nonhomogeneous poisson processes will be covered in greater depth.
2.2: importance of IID conditions.
Leemis initially divides input models into two categories, which he labels stochastic and time-independent. Most modeling professionals define Monte Carlo as “a scheme employing random numbers… which is used for solving certain stochastic or deterministic problems where the passage of time plays no substantive role” (Law & Kelton 2000). Therefore, the time-independent branch is sometimes referred to as the Monte Carlo branch. In these simulations, each value in the input stream is independent of all previous and future values. Additionally, each value of a particular data stream is considered a draw from an identically distributed distribution. Collectively these two requirements are the necessary and sufficient conditions to show that in input stream is iid. The time-independent branch includes the methods most commonly addressed in the literature, including the majority of the methods covered by Kuhl, et al. (2006). The time-independent methods focus on matching the distribution function of the data stream with the theoretical distribution of one of a family of probability functions. The best match is selected and the probability function is used to generate input data to drive the model.
2.3: Determining iid.
When modeling input data, the first task is to determine if the data is iid. (Law & Kelton 2000, Lemis 2001, and Kuhl, et al. 2006). This analysis will determine if the time-independent or stochastic branch of the taxonomy is appropriate. The simplest tool forassessing if a data stream is iid is to plot the data as a time series and use linear regression to plot the best fit line through the data. If the line has a significant slope, the data is not iid (Leemis, 2001). A slightly more complex, and powerful, method is to plot the autocorrelation function of the data. If the autocorrelation exhibits a statistically significant spike at any location other than lag = 0, the data is not iid (Law & Kelton 2000). Another common method is to create a scatter plot of the data. This is a two dimensional plot on which each observation is represented by a point. The horizontal coordinate of the point is the value of the observation and the vertical coordinate is the value of the following observation. Correlated data will tend to fall along a line on a scatter plot.
Li et al (2007) suggest two additional approaches. The first is to plot a Discrete Fourier Transform (DFT) of the data. The DFT is analogous to an autocorrelation plot and the point on the DFT at normalized frequency = 0 is mapped to the point on the autocorrelation plot at lag = 0. Spikes in the DFT indicate a cyclic correlation in the data stream. From the literature on digital signal processing, this cyclic correlation is called a spectral component. In some cases, the force which is creating the spectral component is itself slowly varying. This causes the energy in the signal to be dispersed across the spectrum which results in lower and broader peaks in the DFT. To detect this type of spectral component, Li et al. recommend creating a spectrogram. A spectrogram is a series of DFTs, each created from adjacent subsets of the data. The DFTs are then displayed adjacent to each other. This method highlights nonstationary spectral components. A statistically significant spike at any point on a DFT or on a spectrogram, other than at the point where the normalized frequency = 0, indicates that the data is correlated and the time-independent branch should not be selected.
2.4: Time Independent Methods.
Once the modeler is convinced the data is iid, the modeler chooses the time independent branch of the taxonomy. The next decision is made based on whether the data is a single variable, or multiple variables, of interest. The final branch decision on the taxonomy tree is made based on whether the data is discrete, continuous, or a mix of both continuous and discrete. These decisions each eliminate classes of distribution functions from consideration. After these eliminations are made, the modeler must select the most appropriate from the remaining functions. The selection is made with the aid of graphical tools such as histograms and box plots, statistical moments such as mean, variance, skew, and kurtosis, and experience. Based on the distribution selected, the modeler will manipulate parameters such as the mean to achieve the best match possible.
While an exhaustive review of the possible families of random distributions would not be appropriate here, the case of the Poisson process will be presented both as an example and to make the presentation of the stochastic branch easier to comprehend. A detailed list with explanations of each can be found in Law & Kelton (2000) or Kelton et al (2004).
The arrival of customers in an iid system is frequently modeled as an exponential random variable defined by:
For X > 0
Otherwise
B is the mean interarrival time and λ, the arrival rate, is . This function is frequently selected because it has many favorable properties in addition to the accurate representation of the arrival. First, an exponential process is “memoryless,” meaning that the time since the last arrival has no impact on the probability distribution of the time until the next arrival.The exponential distribution is the only distribution which is memoryless (Kulkarni (1995) p. 190). The minimum of two exponential random variables, such as time to the failure of first system in redundant systems, is another exponential random variable. Also, the sum or superposition of two exponential random variables is another exponential random variable. This allows an analyst to apply statistical techniques which depend on independence to analyze the output of a model when the input is composed of exponential random variables.
2.4.1: Homogeneous Poisson Processes.
It has been shown (Law & Kelton (2000), p 390, and Kulkarni (1995) p. 199) that a Poisson process is necessarily composed of exponential random variables, and the reverse,that exponential random variables constitutes a Poisson process, is also true. A Poison process satisfies three conditions (Alexopoulos 2008):
- The probability of simultaneous arrival is zero.
- Arrivals within any disjointed portions of the observation are statistically independent.
- The pattern of arrival does not change as a function of time.
Law & Kelton define the conditions more rigorously as (Law & Kelton (2000) p. 389):
- “Customers arrive one at a time.
- N(t + s) – N(t) (the number of arrivals in the time interval (t, t + s]) is independent of {N(u), 0 ≤ u ≤ t}
- The distribution of N(t + s) – N(t) is independent of t for all t, s ≥ 0.”
To ensure the chosen distribution adequately models the observed data, The goodness of the fit can be measured by any of several methods including the Anderson-Darling, Kolmogorov-Smirnov, or Cramer-von Mises tests. The goodness can be graphically assessed using histograms, P-P plots, or Q-Q plots. (Law & Kelton 2000, Leemis 2001).
2.5: PUTTING IID IN CONTEXT.
Before transitioning to the stochastic branch of Leemis’ taxonomy, it is useful to revisit the determination of whether or not observed data is iid in light of the conditions of a Poisson process.Harrod & Kelton (2006) caution that “Ignoring nonstationarity in input processes to simulation models can lead to model invalidity and thus largely erroneous results.”
Alexopoulos et al (2008) contend that the necessary and sufficient conditions for a true Poisson process can rarely be met. The first condition, that the probability of simultaneous arrival is zero, is not likely since customers routinely arrive in clusters. Consider the case of an analyst studying a restaurant. It is likely that members of a family will arrive simultaneously, or nearly simultaneously. The second condition, that arrivals within any disjointed portions of the observation are statistically independent, is also frequently violated. Consider the case of arrivals of students for a class. A traffic jam near the university may cause all of the students to be delayed. The fact that few students have arrived by the beginning of class is a strong indicator that several students will arrive shortly after the class starts. The third condition, the pattern of arrival does not change as a function of time, is perhaps the most frequently violated condition. Arrival rates of customers almost always vary throughout the day, even when the arrivals are scheduled to be uniformly distributed. These observations indicate that a time-independent method of modeling input should only be selected after careful consideration.
2.6: Stochastic Methods.
If the observed data is shown not to be iid, or when a time varying statistical description can not be reasonably ruled out, the modeler is directed to the stochastic branch of Leemis’ taxonomy. As in the case of the time-independent branch, Leemis divides the stochastic branch into subsequently smaller branches based on the statistical description of the observed data. While the order of the decisions represented in the tree structure is arbitrary, the net result is significant. These decisions can be made in any order and be technically correct. Leemis first divides the stochastic methods based on how time is measured in the data, either using discrete time or continuous time. Next, Leemis divides the branches based on whether the data exists over a continuous range or only takes on discrete values. Finally, Leemis divides the models based on whether the underlying statistical description is stationary or nonstationary. The spectrogram discussed above illustrates the difference between a stationary and nonstationary random variable.
2.6.1: Nonhomogeneous Poisson Processes.
To continue the example of the Poisson process above, the nonhomogeneous Poisson process is a natural extension of the homogeneous or stationary Poisson process. In the case of the nonhomogeneous Poisson process, the first two conditions of the Poisson process remain the same. The arrival rate in the case of a homogeneous Poisson process, λ, is replaced with λ(t) in the case of the nonhomogeneous Poisson process to indicate that the arrival rate is a function of time.The arrival rate, λ(t), is also referred to as the intensity function. The cumulative intensity function, analogous to a cumulative distribution function, is defined as:
t> 0
The cumulative intensity function is interpreted as the expected total number of arrivals to the system by time t. Since either the arrival rate or the cumulative intensity function may be derived from the other, a process’ statistical description may be fully and equivalently described by either function. Many approaches have been presented in the literature to both estimate λ(t) and to use a given λ(t) to generate input data for use in a simulation experiment. The methods for estimating λ(t) will be explored first.
Law & Kelton (2000) describe a simplistic and straight-forward approach to estimating λ(t). The modeler divides the total observation period into smaller adjacent subsets. For each subset, the modeler calculates the average arrival rate over the subset and uses that average in the approximation. This method results in a piecewise linear approximation to λ(t). The accuracy of this method depends upon the modelers ability divide the observed data into enough smaller subsets and to determine where the changes in arrival rate take place.
2.6.2: Nonparametric Estimate Of The Cumulative Intensity Function.
Leemis (1991) proposed an intuitive method which is guaranteed to converge to the actual function as the number of observations increases.Leemis proposed a piece-wise continuous approximation to the cumulative intensity function which is updated for each observed arrival time. Assume the modeler has k replications of the system under study, each covering a time interval from t = 0 to t = S. For each replication, the time of each arrival is recorded. Let ni be the number of arrivals recorded in replication i and be the total number of arrivals observed in all of the replications combined. The expected number of arrivals at time t = S will be . Leemis’ method superimposes all of the replications into one stream of arrivals and sorts them from earliest to latest, such that the time of earliest arrival observed in any replication is designated as t(1) and the time of the next earliest arrival observed in any replication is designated as t(2), and so on until the latest time observed in any replication is designated as t(n). The last possible time, t = S, will be designated t(n+1). Each of these times represents a point along the horizontal axis of the graph of the cumulative intensity function. Obviously, since arrivals are not evenly spaced, these points will not be evenly spaced along the horizontal axis. Since there are n observations, plus points for t = 0 and t = S, there will be n+1 line segments. Each line segment represents an observed arrival, so the points will be equally spaced along the vertical axis. The largest value for Λ(t) will be , therefore the points will be apart along the vertical axis. Connecting the points with straight lines yields the graph below. (I have provided critical comments on Leemis’ paper as an addendum.)