Empirical Distributions in Simulation

Q:What is an “empirical” distribution?

A:The opposite of a theoretical distribution?

Q:Was that supposed to help?

A:Not really, but you do need to know it. An empirical distribution is one based directly on observed data.

Q:So what is a theoretical distribution?

A:The kind you are used to working with: normal, uniform, Poisson, exponential, F, Gamma, the list goes on and on.

Q:What makes those theoretical?

A:The fact that we have a formula already in hand that perfectly describes them.

Q:Does that mean we don’t have a formula for an empirical distribution?

A:Correct.

Q:Aren’t all distributions based on observed data?

A:Yes and no. With a theoretical distribution, you use observations to estimate things like the mean, but not the shape of the distribution. The shape has already been determined by the formula.

Q:What do we need to observe to create an empirical distribution?

A:You need to observe what is happening and how many times it happens.

Q:What do you mean “observe what is happening?” What do we observe?

A:Whatever you are interested in. When people arrive, how long service takes, how much people buy, the number of times people trip over a curb, how many people take up more than one parking spot, ANYTHING that you want to know about.

Q:Would this be easier if you showed us an example?

A:Probably, so here goes:

You need to know what size bags to provide to your customers, so you started a count of how many items each customer buys. The data you collected is shown in the following table:

# of items purchased / Count
1 / 50
2 / 120
3 / 80
4 / 210
5 / 40

Table 1: Collected Data

You want to simulate demand for customers and test various levels of inventory for bags to see what you need.

Q:What do we do first?

A:The first step is to build the empirical distribution.

Q:Don’t we have the empirical distribution in Table 1?

A:No, you have the data for it, but it needs to be rearranged to make it usable.

Q:What do we need to do to the data?

A:Change the observations (count) into percentages, then into cumulative percentages, and then into random number ranges.

Q:Maybe we should go one step at a time: how do we turn the counts into percentages?

A:Simply add up the total number of observations (sum the “Count” column) and divide each observation by the sum. I have put the results in a new column in Table 2:

# of items purchased / Count / Probability
1 / 50 / .10
2 / 120 / .24
3 / 80 / .16
4 / 210 / .42
5 / 40 / .08
Total: / 500 / 1.00

Table 2: Probabilities
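If you would rather let a computer do the division, here is a minimal Python sketch of the same step. The variable names (counts, total, probabilities) are my own choices for illustration; only the numbers come from Table 1.

```python
# Counts from Table 1: number of items purchased -> number of customers observed.
counts = {1: 50, 2: 120, 3: 80, 4: 210, 5: 40}

# Sum the "Count" column.
total = sum(counts.values())

# Divide each count by the sum to get its probability.
probabilities = {items: count / total for items, count in counts.items()}

for items, p in probabilities.items():
    print(f"{items} / {counts[items]} / {p:.2f}")
print(f"Total: / {total} / {sum(probabilities.values()):.2f}")
```

The printed rows should match the Probability column of Table 2.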

Q:Very nice. Now how do we set up a cumulative percentage?

A:Add the probabilities as you go down the rows, like this:

# of items purchased / Count / Probability / Cumulative Probability
1 / 50 / .10 / .10
2 / 120 / .24 / .34
3 / 80 / .16 / .50
4 / 210 / .42 / .92
5 / 40 / .08 / 1.00
Total: / 500 / 1.00

Table 3: Cumulative Probabilities
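The running sum can be sketched in Python as well. This version is an assumption on my part about how you might code it: it accumulates the whole-number counts first and divides at the end, which sidesteps small floating-point drift from adding decimals. The names running and cumulative are hypothetical.

```python
from itertools import accumulate

# Counts from Table 1: number of items purchased -> number of customers observed.
counts = {1: 50, 2: 120, 3: 80, 4: 210, 5: 40}
total = sum(counts.values())

# Running totals of the counts as you go down the rows.
running = list(accumulate(counts.values()))

# Divide each running total by the grand total to get the cumulative probability.
cumulative = {items: r / total for items, r in zip(counts, running)}

for items, c in cumulative.items():
    print(f"{items} / {c:.2f}")
```

The last cumulative probability should always come out to 1.00; if it does not, something was miscounted.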

Q:That wasn’t too bad. What is this about “random number ranges?”

A:No doubt you remember that simulations use random numbers to make the simulated customers different from one another. Rather than having the computer do this, you are going to do it.

Q:How do we know which random numbers go with which observation?

A:That is what the random number ranges are for.

Q:How big should each range be?

A:Just wide enough to cover the probability of each observation occurring.

Q:If we do that, though, won’t the ranges overlap, for example 0 to 10 means the customer purchased one item, and 0 to 24 means the customer purchased two items? How do we know which one a random number of 1 is telling us?

A:You have to set up the ranges so that they don’t overlap, and we can use the cumulative distribution for that. Starting with 0 causes a problem, however, because “0 to 10” actually has an 11% chance of occurring. Let’s set some ground rules:

1)We are going to treat the random numbers as if they were whole numbers

2)We are going to treat the random numbers as if they had only two digits

We don’t really need those rules, but they do make the explanation a little easier. Now we need to figure out how many two-digit random numbers there are between 1 and 100 (or if you prefer, 0 to 99).

Q:Aren’t there 90?

A:No, there are actually 100, but you have to realize that 0 to 9 can be written 00, 01, 02, …, 09.

Q:Does this matter?

A:It does if you remember how many discrete probability points lie in a probability distribution.

Q:OK, so how many “discrete probability points” (whatever they are) lie in a probability distribution?

A:A discrete probability point just means we talk about whole number probabilities (1%, 2%, etc.) rather than continuous probabilities (3.4758203%). Once again this isn’t really necessary, but it makes the explanation easier. Most (I believe all) simulations actually use continuous random numbers, but we won’t bother, for now. It works the same way, though.
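For the curious, here is a rough sketch of what the continuous version might look like in Python. This is not the method used in the rest of this explanation; it is one common way to do it, assuming a random number u that is at least 0 and less than 1, and using the cumulative probabilities from Table 3 as cut-off points. The function name items_for is hypothetical.

```python
import random
from bisect import bisect_right

items = [1, 2, 3, 4, 5]
# Cumulative Probability column of Table 3.
cumulative = [0.10, 0.34, 0.50, 0.92, 1.00]

def items_for(u):
    """Map a continuous random number u in [0, 1) to a number of items.

    bisect_right counts how many cumulative probabilities are <= u,
    which is the index of the first row whose cumulative probability exceeds u.
    """
    return items[bisect_right(cumulative, u)]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
sample = [items_for(rng.random()) for _ in range(10_000)]

# The observed shares should land close to the Probability column.
for n in items:
    print(n, sample.count(n) / len(sample))
```

With enough simulated customers, the shares printed should sit near .10, .24, .16, .42 and .08.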

Q:Is it important that there are 100 probability points and 100 random numbers?

A:Yes, very important, because that means we can assign each random number to one probability point and thereby eliminate the overlap problem we had above.

Q:Doesn’t that just push the problem back one level, because we still have to decide how to match the random numbers to the probabilities, don’t we?

A:No, we don’t. The essence of random numbers is that each one has the same chance of coming next (each two-digit random number has the same 1 in 100 chance of coming up next). It doesn’t matter which random number is assigned to which probability, so we can simply do them in order.

Q:Do you mean the random number 01 is assigned to whatever has the probability 1% of occurring?

A:With one slight change, yes. The random number 01 is assigned to whatever has the cumulative probability of 1%.

Q:Doesn’t that create a bias, because the lower probabilities go with the lowest random numbers?

A:It doesn’t because all random numbers have the same chance of occurring, so low or high, the chance is 1 in 100.
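If you want to convince yourself, here is a small Python sketch of the argument. It hands out the 100 two-digit numbers to the outcomes twice, once in order and once after shuffling them, and counts how many numbers each outcome receives. The helper names assign and chance are made up for this illustration, and the percentages are the ones from Table 2.

```python
import random

# Probability of each outcome, as whole-number percentages from Table 2.
percent = {1: 10, 2: 24, 3: 16, 4: 42, 5: 8}

def assign(numbers):
    """Hand out the given two-digit numbers to the outcomes, in the order given."""
    assignment = {}
    position = 0
    for items, share in percent.items():
        for n in numbers[position:position + share]:
            assignment[n] = items
        position += share
    return assignment

def chance(assignment, items):
    """Share of the 100 equally likely numbers that lead to this outcome."""
    return sum(1 for v in assignment.values() if v == items) / 100

in_order = assign(list(range(1, 101)))

scrambled_numbers = list(range(1, 101))
random.Random(0).shuffle(scrambled_numbers)  # any scrambling would do
scrambled = assign(scrambled_numbers)

for items in percent:
    print(items, chance(in_order, items), chance(scrambled, items))
```

Both columns of output should be identical, because what matters is how many numbers each outcome gets, not which ones.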

Q:So all we do is list the random numbers for each observation, starting at the top row?

A:That’s all you have to do. You are free to scramble the random numbers around any way you like, but you don’t gain anything by it and it makes your life harder. Start with 01 and work your way up, using the cumulative probability number as the upper limit of each range. Table 4 has this done.

# of items purchased / Count / Probability / Cumulative Probability / Random # Ranges
1 / 50 / .10 / .10 / 01 – 10
2 / 120 / .24 / .34 / 11 – 34
3 / 80 / .16 / .50 / 35 – 50
4 / 210 / .42 / .92 / 51 – 92
5 / 40 / .08 / 1.00 / 93 – 00
Total: / 500 / 1.00

Table 4: Random Number Ranges

Notice that the beginning number (for each range after the first) is just the previous cumulative probability plus 1 (or plus .01 and times 100, if you like to pick nits). Also notice the closing number is 00.

Q:Why did you put 00 last instead of first?

A:As explained above, it doesn’t matter where it goes and if we put it first we have to subtract one to get each upper range limit. It is just easier my way.
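The ranges can be built mechanically, too. The following Python sketch is one way to do it, under the assumption that the cumulative percentages come out as whole numbers (they do for this data). It treats 00 as 100 internally and only writes it as 00 when printing; the helper two_digit is a name I invented for that.

```python
from itertools import accumulate

# Counts from Table 1: number of items purchased -> number of customers observed.
counts = {1: 50, 2: 120, 3: 80, 4: 210, 5: 40}
total = sum(counts.values())

# Cumulative probabilities as whole-number percentages: the upper limit of each range.
uppers = [round(100 * r / total) for r in accumulate(counts.values())]

# Each range starts one above the previous upper limit; the first starts at 01.
ranges = {}
lower = 1
for items, upper in zip(counts, uppers):
    ranges[items] = (lower, upper)
    lower = upper + 1

def two_digit(n):
    """Write n with two digits, showing 100 as 00."""
    return f"{n % 100:02d}"

for items, (lo, hi) in ranges.items():
    print(f"{items} / {two_digit(lo)} - {two_digit(hi)}")
```

The output should reproduce the Random # Ranges column of Table 4.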

Q:Are we finished yet?

A:Almost. We have our empirical distribution (either the Probability column or the Cumulative Probability column) and we are all ready to use it.

Q:How do we use it?

A:We are doing a very simple simulation – determining how many items each customer bought. Once we have a random number, we match it to the appropriate range and read across to find out how many items that simulated customer purchased. Keep track of the results, and you can calculate any output that is reasonable.

Q:Where do we get two-digit random numbers from?

A:You can use a random number table and read the digits off two at a time, or you can use Excel and take just the first two digits of the numbers resulting from the =RAND() function, or you can roll dice, draw numbers from a hat, or ask me.
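If Python is closer to hand than Excel, its standard random module can do the same job. This is only a sketch: random.randrange(100) returns a whole number from 0 to 99, which plays the role of the two-digit numbers 00 to 99, and the seed is fixed purely so the run can be repeated.

```python
import random

def two_digit_random(rng):
    """Return a whole number from 0 to 99, standing in for the two-digit numbers 00 to 99."""
    return rng.randrange(100)

rng = random.Random(1)  # fixed seed only so the sketch is repeatable
draws = [two_digit_random(rng) for _ in range(5)]

# Print with a leading zero so 7 shows up as 07.
print(" ".join(f"{d:02d}" for d in draws))
```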

Q:Could you please give us some two-digit random numbers?

A:Sure, here are five: 99, 07, 18, 59, 49.

Q:So our first customer bought 5 items (99 falls into the range 93 to 00), while our second customer bought 1 item (07 is in 01 to 10), the third customer bought 2 items (18 is in 11 to 34), our fourth customer bought 4 items (59 is in 51 to 92) and our fifth customer bought 3 items (49 is in 35 to 50)?

A:Yes. It surprised me that we didn’t have any duplications, particularly in the large range (51 to 92), but that is random numbers for you. Have fun with them.
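Once you tire of reading across the table by hand, the lookup can be sketched in Python like this. The function name items_purchased is hypothetical, the upper limits are taken from Table 4 with 00 treated as 100, and the five draws are the ones used above.

```python
# (upper limit of the random number range, items purchased), from Table 4.
# The range 93 to 00 is stored with 100 as its upper limit.
upper_limits = [(10, 1), (34, 2), (50, 3), (92, 4), (100, 5)]

def items_purchased(random_number):
    """Return the number of items for a two-digit random number (0 means 00)."""
    if not 0 <= random_number <= 99:
        raise ValueError("random number must be between 00 and 99")
    # 00 sits at the end of the last range, so treat it as 100.
    value = 100 if random_number == 0 else random_number
    for upper, items in upper_limits:
        if value <= upper:
            return items

draws = [99, 7, 18, 59, 49]
results = [items_purchased(d) for d in draws]
print(results)  # [5, 1, 2, 4, 3]
```

From here you can keep a tally of results over as many simulated customers as you like and compute whatever output you need.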