Data Selection and Preparation

Orrery Software – Note To File

Conservative Economy – Distribution of Net Worth

Note To File

May 17th, 2011

G H Boyle

Revised May 30th.

Revised June 2nd.

References:

[1] Boyle, G H; Larson, I; Liu, S; The Distribution of Wealth in an Absolutely Conservative Economy in ModEco; 27 April 2011.

[2] Boyle, G H; Note To File; Conservative Economy – Transient vs Steady-State Behaviour; May 16th, 2011.

Purpose:

  • To re-confirm the distribution of the net worth of a conservative ModEco-based economy as being fairly represented by the Fisk distribution; and
  • To accumulate stylized facts respecting a ModEco conservative economy.

At [1] the Fisk distribution was described as the best empirical fit to the distribution of net worth in an absolutely conservative ModEco-based economy. Due to time constraints, the process used to analyze the data was not completely recorded, and questions arose later as to what data had been included or excluded, and why. This exercise will attempt to replicate those results, while documenting the steps taken to do so.

At [2], using macro-economic data, the duration of transient behaviour was identified as approximately 16,000 ticks of the ModEco clock for the scenario being analyzed. This conclusion will be further tested using the micro-economic data drawn from cross-sectional samples of populations of agents.

Scenario:

ModEco V1.20A, Crowded Model, Seed=1, Price Negotiations Off (Pn=Off)

Art01Ed file name:

Art01Ed_110515_074916.CSV

The Art01Ed file contains data from a cross-section of the complete population of agents taken once every generation, i.e. once every 800 ticks.

Time Range:

0 ticks through 1,236,000 ticks. Cross-sectional data was collected at tick 0 and every 800 ticks thereafter, so the population was sampled 1,546 times.

At [2] it was determined that data from the first 16,000 ticks did not yet represent a population of agents in a steady-state economy. In the process used below, we assume that data drawn from any one cross-sectional sample is equivalent to data from any other sample. This is only true if the transitional period is eliminated and all data is drawn from the steady-state economy. To underscore the point, in the very first cross-sectional sample all Corps have the same net worth and all Prsns have the same net worth, causing an unnatural bump in the distribution. In the next generation, the net worth values are still clustered quite closely around the original common values. Once steady-state has been achieved, it is assumed that such transient forms in the overall distribution of net worth have disappeared.

So, the first 21 sets of cross-sectional data will be excluded from use. This will leave 1,525 valid cross-sectional samples.
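For reference, the same exclusion can be sketched outside of Excel. The following is a minimal, hypothetical pandas sketch; it assumes the agent data records have already been loaded into a DataFrame carrying the Time field described in this note, since parsing the raw CSV is complicated by the meta-data rows and the separate Corps and Prsns header rows.

    # Sketch: exclude the transient cross-sections (Time <= 16,000 ticks).
    import pandas as pd

    def split_transient(art01ed: pd.DataFrame, cutoff: int = 16000):
        """Return (transient, steady) record sets split at the given tick."""
        transient = art01ed[art01ed["Time"] <= cutoff]  # 21 cross-sections: ticks 0, 800, ..., 16,000
        steady = art01ed[art01ed["Time"] > cutoff]      # 1,525 cross-sections: ticks 16,800 onward
        return transient, steady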

Number of records:

In addition to the four meta-data records and two header records (one for Corps and one for Prsns), there were 185,359 data records, each representing the financial micro-data for one sampled agent in one generation. Approximately 121 agents were represented in each cross-sectional sample.

Processing Steps:

MS Excel 2010 can handle over 1,000,000 rows in a single worksheet. The maximum size of the ModEco export files was selected to fit within this limitation. The data in the Art01Ed file was processed using MS Excel 2010 as follows:

  • Windows Explorer was used to locate the .CSV file. The file name was right-clicked, the “Open with” option was selected, and MS Excel was chosen. The .CSV file opened in Excel.
  • Four new sheets were created in the spreadsheet file, and named “Corps U16000”, “Corps 16000+”, “Prsns U16000” and “Prsns 16000+”.
  • The header record in row 4 was copied to row 1 of both Corps sheets.
  • The header record in row 5 was copied to row 1 of both Prsns sheets.
  • Data in rows 6 through to the end (skipping the tombstone data at the top) was sorted by agent type (field Subtype, in column D). This separates the Corp data from the Prsn data.
  • The Corp data was cut and pasted into the new sheet named “Corps U16000”. Then all Corp records collected after tick 16,000 (i.e. starting at Time = 16800) were cut and pasted into the sheet named “Corps 16000+”.
  • The Prsn data was cut and pasted into the new sheet named “Prsns U16000”. Then all Prsn records collected after tick 16,000 (i.e. starting at Time = 16800) were cut and pasted into the sheet named “Prsns 16000+”.
  • The header records, rows 4 and 5, were now deleted from the original sheet, as they no longer serve a purpose there. The original sheet now contains only the tombstone meta-data at the top.
  • The above steps enable processing of Corp and Prsn data separately for the Time = 0-16000 range and the Time = over 16000 range; four sets of records in all.
  • There are now 36,862 Corp records, not counting the header record: 727 Corps U16000 records plus 36,135 Corps 16000+ records. For Prsns, there are now 148,491 Prsn records, not counting the header record: 2,900 Prsns U16000 records plus 145,591 Prsns 16000+ records.
  • The file was then saved in native MS Excel format, having the same name but:
  • with a serial number “-01” appended; and
  • with a different extension. The “.CSV” became “.xlsx”. This enables the inclusion of graphs or pivot tables in the saved files.
  • Unnecessary fields in the Corps and Prsns sheets were pared away, and the file saved as serial number “-02”.
  • Fields retained in the Corps sheet are Time, RdgNo, C_SerNo and C_NW. Fields retained in the Prsns sheet are Time, RdgNo, C_SerNo and C_NW.
  • Next, the assumptions about transients were tested. That is, as per [2], it was assumed that transient behaviour had died away by 16,000 ticks. Scatter graphs of net worth vs time were produced for all four sets of data.
  • Next, the data for histograms of agent counts vs net worth was created. There are several ways to do this in MS Excel, but most give only limited control over the shape of the output. Because the results must be comparable across several sets of data, and possibly several scenarios, pivot tables and “Data Analysis” add-ins were avoided, and the FREQUENCY() function was used instead (an equivalent script-based construction is sketched after this list).
  • First, the control table was constructed. The column references below assume that D is the ‘Net Worth’ column, G is the control table column, and J is the BinEnds column; if not, substitute the proper columns in the following formulas.
  • MinNW: use the formula =MIN(D:D)
  • MaxNW: use the formula =MAX(D:D)
  • NoOfBins: 30 (or 100, depending on the histogram being made)
  • BinWidth: use the formula =G3/G4
  • Count: use the formula =COUNT(D:D)
  • Next, the histogram table was constructed. Bin numbers from 0 to 100 were enumerated (0 to 30 for a 30-bin histogram). Note that there is one more bin number than required; the first bin end does not actually represent a bin, but it is needed to make the FREQUENCY() function work correctly. The fields are constructed as follows:
  • Bin # - initial value = 0; following values are + 1;
  • Bin Ends – initial value is 0; following values are + BinWidth
  • BinRanges – use the formula =CONCATENATE( TEXT(J2,"0"),"-",TEXT(J3,"0"))
  • BinLabels – use the formula =(J2+J3)/2
  • Freqs – use the formula =FREQUENCY(D:D,J2:J32) (or J2:J102 for a 100 bin histogram)
  • Probs – use the formula =M3/$G$7
  • Copy the values down for the appropriate number of bins. Copying the FREQUENCY formula requires a little arcane foolery. Enter the formula in the second row. Then select all of the rows into which it is to be copied. Then click in the formula bar at the top of the screen, to the right of the FREQUENCY formula, and press Ctrl+Shift+Enter. The formula will then appear with curly brackets around it, meaning it is an array formula; the same formula occupies all cells of the selected range.
  • The above histogram building process was repeated for all four sheets.
  • The file was saved again as serial number “-02”.
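
As a cross-check on the Excel construction above, the bin ends, FREQUENCY() counts and probabilities for one sheet can be reproduced with a short script. The following Python/numpy sketch is hypothetical: it assumes a one-dimensional array of net worth values (e.g. the C_NW column of the ‘Corps 16000+’ sheet), and its bin-edge handling differs slightly from FREQUENCY() at the boundaries.

    # Sketch: equivalent of the control table, BinEnds, Freqs and Probs columns
    # for one sheet's worth of net worth values.
    import numpy as np

    def histogram_table(net_worth, n_bins=30):
        """Return bin labels (mid-points), counts and probabilities for the data."""
        net_worth = np.asarray(net_worth, dtype=float)
        max_nw = net_worth.max()                            # Excel: =MAX(D:D)
        bin_ends = np.linspace(0.0, max_nw, n_bins + 1)     # 0, BinWidth, ..., MaxNW
        counts, _ = np.histogram(net_worth, bins=bin_ends)  # Excel: =FREQUENCY(...)
        bin_labels = (bin_ends[:-1] + bin_ends[1:]) / 2.0   # Excel: =(J2+J3)/2
        probs = counts / net_worth.size                     # Excel: =M3/$G$7
        return bin_labels, counts, probs

    # e.g. labels, counts, probs = histogram_table(corps_16000_plus_nw, n_bins=100)
    # (corps_16000_plus_nw is a hypothetical array of the Corps 16000+ net worth values)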

Observations:

The first issue is a test of the assumption that all significant transient behaviour has subsided by 16,000 ticks. Four scatter graphs showing the changing distribution over time were produced, as shown below.
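
A minimal, hypothetical matplotlib sketch of the kind of scatter graph described below, assuming a DataFrame holding the retained Time and C_NW fields for one of the four sheets:

    # Sketch: net worth vs time scatter graph for one set of records,
    # e.g. the 'Corps U16000' sheet.
    import matplotlib.pyplot as plt

    def scatter_net_worth(records, title):
        # 'records' is assumed to hold the Time and C_NW fields retained earlier.
        plt.scatter(records["Time"], records["C_NW"], s=4, alpha=0.4)
        plt.xlabel("Time (ticks)")
        plt.ylabel("Net worth ($)")
        plt.title(title)
        plt.show()

    # e.g. scatter_net_worth(corps_u16000, "Corps, first 21 generations")
    # (corps_u16000 is a hypothetical DataFrame of the 'Corps U16000' records)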

The first graph is a scatter graph of net worth vs time for the first 21 generations of Corps; note the 21 columns of dots, one per generation. We see here that a concentration of net worth at $25,000 has appeared by about 5,000 ticks, but the first really high net worth, of approximately $170,000, does not appear until 12,000 ticks.

Continuing to look at the Corps data, but for subsequent generations, relatively few agents achieve a higher net worth. While some extremes are always possible, there does not appear to be a trend towards higher and higher extremes of value.

The situation is a little different with Prsns. We see the same sort of concentration of net worth, between $3,500 and $4,000, appearing by 5,000 ticks, and the first extremely high net worth appearing at about 14,000 ticks.

However, at roughly $16,000, that extreme value is small compared to the eventual maximum of over $30,000 in net worth, reached some 100,000 ticks later. The extremes show signs of getting more extreme as time passes. One must wonder why such extremes of high net worth are not achieved until after 100,000 ticks. Is this a one-off emergent event, or is it an indication that a very-long-timescale transient is still operating?

Hypothesis: The distribution of net worth for Prsns achieves steady-state by 16,000 ticks, as it does for Corps, but short-time-scale variations (bursts) with substantial extremes of value occur from time to time. The apparent increase of extreme values over time is contingent on the choice of seed for this run, and not on a long-time-scale transient. This hypothesis needs to be tested.

Furthermore, whereas the long-term Corps distribution appears relatively stable, with a little burstiness, the distribution of net worth for Prsns displays clear structure, with brief periods in which many Prsns have high net worth, broken up by brief periods of low net worth.

Hypothesis: I would speculate that the population of Prsns suffers some synchronization in the birth dates, resulting in a significant portion of the population dropping their net worth by half (due to fission) on or about the same date. This hypothesis needs to be tested.

Moving on to the ‘distribution of net worth’ issue, all Corps and all Prsns start with common values for net worth: $11,500 and $2,520 respectively. It takes several generations to smooth this spike. In the Corps U16000 and Prsns U16000 sheets we see the initial spike.

The U16000 (under 16,000 ticks) records were discarded because of this initial transient spike. The remaining data, in the 16000+ sheets, was used to create the following histograms.

Note that the curves for Corps and Prsns have a very similar shape. Dividing the count in each bin by the total number of agents of that type gives us a normalized curve of probabilities. When comparing these curves, we see how similar they actually are.

Hypothesis: Possibly these curves could be made to overlay each other more closely if one or two outlier points in the Corps curve were deleted, shifting the curve to the right slightly. This hypothesis needs to be tested.
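
The overlay of the two normalized curves could be produced with a sketch like the following. This is a hypothetical Python sketch; because the dollar ranges for Corps and Prsns differ greatly, the two shapes are compared on a common bin-number axis rather than a dollar axis, which may differ from how the original comparison chart was built.

    # Sketch: compare the normalized (probability) histograms for Corps and Prsns.
    import numpy as np
    import matplotlib.pyplot as plt

    def normalized_hist(net_worth, n_bins=30):
        """Per-bin probabilities over equal-width bins from 0 to the maximum value."""
        net_worth = np.asarray(net_worth, dtype=float)
        edges = np.linspace(0.0, net_worth.max(), n_bins + 1)
        counts, _ = np.histogram(net_worth, bins=edges)
        return counts / net_worth.size

    def compare_normalized(corps_nw, prsns_nw, n_bins=30):
        """Overlay the per-bin probabilities for Corps and Prsns on a bin-number axis."""
        c_probs = normalized_hist(corps_nw, n_bins)
        p_probs = normalized_hist(prsns_nw, n_bins)
        plt.plot(range(n_bins), c_probs, marker="o", label="Corps")
        plt.plot(range(n_bins), p_probs, marker="o", label="Prsns")
        plt.xlabel("Bin number")
        plt.ylabel("Probability per bin")
        plt.legend()
        plt.show()

    # e.g. compare_normalized(corps_16000_plus_nw, prsns_16000_plus_nw)
    # (both arguments are hypothetical arrays of the 16000+ net worth values)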

The following search for a ‘best fit’ curve assumes that all of the agent data is independent. In essence, we are taking data from many generations of agents, and treating it as if it is from a single very large generation. Since it is possible for a poor and therefore infertile agent to live an extra 800 ticks after it reaches reproductive age, it is quite possible that many poor agents will be counted twice, once among their immature cohorts, and again among their older infertile cohorts. Dependence arises in two ways:

  • The younger agent is very similar to its birth twin, as they start out with precisely the same economic assets. This could be addressed by identifying and eliminating the second twin if it is under some arbitrary age. But if we remove the record of a young twin, we deflate a low net worth bin, because all recently born agents have only half of their parent’s net worth, potentially introducing a bias.
  • The older agent is very similar to its younger self, having had the same benefit or deficit of assets at the beginning of its life. This could be addressed by removing one record for those agents represented twice. But if we remove the record for the younger self, we are removing a necessarily low net worth record. And if we remove the record for the later age, little is gained, because the dependence between the younger and older versions of the agent is already dramatically reduced by the effects of 800 ticks’ worth of transactions.

Hypothesis: The decision is to leave the data as collected. The overall effect will be more statistical dependence among those agents with net worth below the median value (because net worth is divided evenly between two birth twins) and relatively less statistical dependence among those agents with net worth above the median (who have climbed back up to the top).

Having constructed the histograms for the distribution of net worth for Corps and Prsns, it remains to test them against mathematical formulae such as the Maxwell, Pareto, Fisk and Burr formulae.

Note that the Maxwell formula, borrowed directly from physics, represents binary conservative transactions (collisions) between idealized atoms in a gas. Compared with the Maxwell fit, the tail of the ModEco data is too fat. This ‘fat tail’ problem has been widely discussed in the economics literature.

The Pareto distribution has historically been used to describe the tail of wealth distributions in economics. Compared with the Pareto fit, the ModEco data is too fat near the peak and too skinny further out in the tail. So, while the Pareto distribution would seem to give a better fit for the tail, it is still not a great fit.

The next distribution, the Fisk distribution, has a bit of a murky history. I don’t know why it’s called the Fisk distribution, but it is said to be used to describe the distribution of wealth, the distribution of income, and end-of-life statistics. The Maxwell and Pareto distributions both have two controlling parameters; the Fisk distribution has three. It is not unreasonable to expect that the Fisk will provide a better fit, and this expectation is borne out empirically. But careful examination shows that the ModEco data is too fat on the slope, and too skinny for extreme net worth values.

The Burr distribution is a generalization of the Fisk distribution, having four controlling parameters. It too is said to be used for wealth, income, and end-of-life statistics. Again, because it has four parameters, we would expect a better fit than that given by the Fisk distribution, and this is borne out empirically. Looking over the graphs visually, I can see no systematic variation of the ModEco data from the best-fit Burr distribution.
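
For reference, the standard textbook density functions for these four candidates are written out below in LaTeX (the Fisk distribution is also known as the log-logistic distribution). These are the basic one-, two- and three-parameter forms; the fits described in this note appear to use variants with additional location and/or amplitude parameters, which would account for the higher parameter counts quoted above.

    f_{\text{Maxwell}}(x;a) = \sqrt{\frac{2}{\pi}}\;\frac{x^{2}}{a^{3}}\,e^{-x^{2}/(2a^{2})}, \qquad x \ge 0

    f_{\text{Pareto}}(x;x_m,\alpha) = \frac{\alpha\,x_m^{\alpha}}{x^{\alpha+1}}, \qquad x \ge x_m

    f_{\text{Fisk}}(x;\alpha,\beta) = \frac{(\beta/\alpha)\,(x/\alpha)^{\beta-1}}{\left(1+(x/\alpha)^{\beta}\right)^{2}}, \qquad x \ge 0

    f_{\text{Burr}}(x;c,k,\lambda) = \frac{(ck/\lambda)\,(x/\lambda)^{c-1}}{\left(1+(x/\lambda)^{c}\right)^{k+1}}, \qquad x \ge 0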

The ‘best fit’ curve for each of the above graphs was found using a specially prepared Excel spreadsheet implementing a manual ‘least squares’ process.

For each of the sheets ‘Corps 16000+’ and ‘Prsns 16000+’ the BinLabel and Freqs fields of data were copied into the appropriate least squares spreadsheet; 101 data points for each field.

The construction of the ‘least squares’ spreadsheets required a little trial and error to avoid division by zero, or excessively large numbers that the spreadsheet was unable to handle. The steps were:

  • A linear transformation was applied to the BinLabels (the x values) to convert them to the numbers from 0 to 101;
  • Each frequency was divided by the total number of agents to convert it to a probability, a number between 0 and 1; zeros were replaced with very small probabilities;
  • A set of reasonable parameters was selected arbitrarily for the estimating formula (e.g. the Maxwell, Pareto, Fisk or Burr formula).
  • Using the normed values for X and Y (BinLabels and Freqs), and the selected parameters for the formula, a preliminary set of estimated values for Y was calculated, one for each bin.
  • The difference between the normed Y and the estimated Y (the error in Y) was calculated for each bin, and those error values were squared.
  • The sum of the squared errors was computed and stored in a control table, along with the parameters.
  • An initial delta value was arbitrarily selected for each parameter.
  • Composite keystroke routines (commonly called macros) were written for each parameter, applying the delta to the parameter and checking whether the sum of squared errors of Y had been reduced.
  • By carefully monitoring the effects of a particular delta, and by adjusting the size and sign of the delta as progress was made, I gradually reduced the sum of squared errors until an acceptable level of error had been reached.

This approach would be called a ‘numerical technique’, as opposed to an ‘analytic technique’. In an analytic technique, one writes a formula describing the sum of the squared errors as a function of the fit parameters, takes its partial derivatives with respect to those parameters, sets them to zero, and solves for the values of the parameters. The formulae of interest here (Maxwell, Pareto, Fisk and Burr), as well as others that may be of interest later in this project, are complicated, and finding the analytic solution to a least squares problem might be difficult in the extreme. Therefore, to put all such calculations on an equal footing, a numerical approach using Excel spreadsheets was selected.
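
The same numerical search can also be carried out with an off-the-shelf optimizer instead of hand-tuned deltas and macros. The following Python sketch is hypothetical: it assumes the BinLabels and Probs arrays described above, and it fits an amplitude-scaled two-parameter Fisk (log-logistic) form, which is not necessarily the exact parameterization used in the spreadsheet.

    # Sketch: numerical least-squares fit of a Fisk-type curve to the binned
    # probabilities, minimizing the sum of squared errors directly.
    import numpy as np
    from scipy.optimize import minimize

    def fisk_model(x, amp, alpha, beta):
        # Amplitude-scaled log-logistic (Fisk) density; the amplitude absorbs
        # the bin width so the model can be compared with per-bin probabilities.
        z = (x / alpha) ** beta
        return amp * (beta / alpha) * (x / alpha) ** (beta - 1) / (1.0 + z) ** 2

    def sum_squared_errors(params, x, y):
        amp, alpha, beta = params
        return np.sum((y - fisk_model(x, amp, alpha, beta)) ** 2)

    # bin_labels and probs are hypothetical arrays holding the BinLabels and
    # Probs columns for one sheet; the starting guesses are arbitrary.
    # result = minimize(sum_squared_errors, x0=[1.0, 1000.0, 2.0],
    #                   args=(bin_labels, probs), method="Nelder-Mead")
    # best_amp, best_alpha, best_beta = result.x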