THE CIA FACTBOOK

Processed by HCE and GeoVista

Ray Chen and Nico Zazworka

{rchen,nico}@cs.umd.edu

Introduction

We parsed and analyzed the CIA World Factbook 2007 with two well-known tools from class: the Hierarchical Clustering Explorer (HCE) and GeoVista. Our goal was to provide two points of view: the former for finding relationships between attributes of countries, and the latter for visualizing attributes on a map so that users can partition the world into meaningful regions.

Data Description and Difficulties

The CIA Factbook provides web pages with up to 189 attributes for every country in the world. Unfortunately, the data is not provided in any raw form; it is published as HTML pages tailored for a human reader using a web browser. The attributes were not easy to extract from the 246 individual country files and, once extracted, they were not easy to interpret either. Most of the attributes were unusable text fields, such as background descriptions or current environmental issues. Even after separating the unusable text fields from the salvageable numeric fields, our job wasn’t done. Numbers were stored with comma separators and textual modifiers, such as “2,433,244” or “$1.532 billion.” Writing a Perl script that properly extracted the information we wanted took far longer than expected. In the end, we were able to produce a comma-separated table of 64 attributes for 246 countries.
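The number cleanup described above can be sketched as follows. This is a minimal illustration in Python rather than our Perl script; the function name parse_value and the list of recognized modifiers are assumptions made for this example.

```python
import re

# Multipliers for textual modifiers such as "billion".
# The set of words handled here is an assumption for this sketch.
_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}

# A run of digits with optional comma separators and decimal part,
# optionally followed by one of the scale words.
_NUMBER = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|trillion)?",
    re.IGNORECASE,
)

def parse_value(text):
    """Return the first number found in text as a float, or None."""
    match = _NUMBER.search(text)
    if match is None:
        return None  # pure text field, e.g. "NA"
    digits, scale = match.groups()
    value = float(digits.replace(",", ""))  # drop comma separators
    if scale:
        value *= _SCALE[scale.lower()]
    # Treat a leading minus sign (as in "-$3.1 billion") as negative.
    if text.lstrip().startswith("-"):
        value = -value
    return value

if __name__ == "__main__":
    for raw in ["2,433,244", "$1.532 billion", "NA"]:
        print(repr(raw), "->", parse_value(raw))
```

Only the first number in a field is returned, so fields that carry several values (for example a figure followed by a year) would need additional handling.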

Tool Improvements

Several small improvements would make the tools more productive to work with.

  • HCE - Some dialog boxes, e.g. the hierarchical clustering dialog, do not remember the last choice made. This is disruptive when trying out several possibilities in a short time.
  • HCE - By default, the tool assumes that attributes are mapped to rows and entities to columns, e.g. in the open file dialog and the clustering dialog. This is contrary to many database tools, and even the first chapter of the tool documentation shows it the other way around.
  • GeoVista - There was little to no documentation for the file formats GeoVista expected. We had to reverse engineer five data files from the tutorial example before we could start using the tool for data analysis.
  • GeoVista - The self-organizing map coloring scheme hurt more than it helped. More often than not, there was no clear indication of what a color actually meant for the data.

Findings

States with a large industrial sector spend more money on military

At first glance, the scatter plot ordering scoreboard (presenting correlation coefficient and uniformity) showed the expected relationships between several attributes in the economic data. On closer inspection, however, some seemingly unrelated attributes caught our interest. Military expenditure as a percentage of GDP was essentially uncorrelated (-0.01 < correlation coefficient < 0.2) with all other attributes except one: the percentage of GDP produced by the industrial sector (the secondary sector of the economy), with a coefficient of 0.411. For the other two sectors, agriculture and services, the coefficients were low: 0.102 and 0.104.
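For readers who want to reproduce such coefficients outside of HCE, the sketch below computes the standard Pearson correlation coefficient between two attribute columns. We assume, but have not verified, that this is the measure HCE reports; the function name pearson and the handling of missing values (None) are choices made for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long columns.

    Rows where either value is missing (None) are skipped, since many
    Factbook attributes are not available for every country.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete rows")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Made-up illustrative values, not Factbook data.
    industry_share = [10.0, 25.0, 40.0, None, 30.0]
    military_share = [1.0, 2.5, 3.0, 2.0, 2.2]
    print(pearson(industry_share, military_share))
```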

Figure 1: HCE's Scatter Plot Ordering Scoreboard

One possible interpretation is that countries shifting from an agricultural to an industrial society increase their military forces to defend themselves or to expand. Later, when they shift toward a society with a larger service sector, this build-up stops or reverses.

For further investigation, it would be useful to have time series for countries spanning periods long enough to capture changes in industry structure.

Responsible Countries Have Less Need for Internet

One unexpected finding involves the relationship between a country’s current account balance and its number of Internet hosts. According to the CIA’s website, a country's current account balance is its net trade in goods and services, plus net earnings from rents, interest, profits, and dividends, and net transfer payments (such as pension funds and worker remittances) to and from the rest of the world. In other words, it is a rough indicator of how fiscally responsible a country is.

Figure 2: GeoVista's Parallel Component Plot

There is, however, a strong inverse correlation between these two attributes (Figure 2). This is surprising: one would expect the countries that earn more than they spend to be the ones whose populations can afford luxuries like Internet hosts. One possible explanation is that, on a macroeconomic scale, accruing national debt is healthy for a country’s economy, and that this is reflected in the flourishing of such luxuries.

The CIA Can’t Be Trusted

Another interesting way to use these tools is to verify what others claim about the data. The CIA World Factbook’s description of its sex ratio attribute explains that, “Sex ratio at birth has recently emerged as an indicator of certain kinds of sex discrimination in some countries. This will affect future marriage patterns and fertility patterns.” If this is the case, there should be some correlation between the sex ratio and the total fertility rate of a country. We put this to the test.

Figure 3: GeoVista's Parallel Component Plot

As Figure 3 shows, there is little correlation between these two attributes. Moving from one axis to the other, just as many lines have a negative slope as a positive one. GeoVista confirms this in two ways: the conditional entropy between the two variables is a high 0.7923, and the correlation between them is 0.5381, a medium correlation at best.
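To make the conditional entropy measure more concrete, the sketch below estimates H(Y|X) in bits from two columns after equal-width binning. It is only an illustration of the general idea: the binning scheme, the number of bins, and the lack of normalization are assumptions of this example, and GeoVista's own computation may differ, so the values are not directly comparable to the 0.7923 reported above.

```python
import math
from collections import Counter

def bin_index(value, low, high, bins):
    """Map value in [low, high] to an equal-width bin in 0..bins-1."""
    if high == low:
        return 0  # constant column: everything falls in one bin
    index = int((value - low) / (high - low) * bins)
    return min(index, bins - 1)  # the maximum goes into the last bin

def conditional_entropy(xs, ys, bins=4):
    """Estimate H(Y|X) in bits from paired samples (assumed binning)."""
    low_x, high_x = min(xs), max(xs)
    low_y, high_y = min(ys), max(ys)
    cells = [
        (bin_index(x, low_x, high_x, bins), bin_index(y, low_y, high_y, bins))
        for x, y in zip(xs, ys)
    ]
    joint = Counter(cells)                    # counts per (x bin, y bin)
    marginal_x = Counter(bx for bx, _ in cells)  # counts per x bin
    n = len(cells)
    entropy = 0.0
    for (bx, _), count in joint.items():
        # p(x, y) * log2 p(y | x), summed over occupied cells
        entropy -= (count / n) * math.log2(count / marginal_x[bx])
    return entropy

if __name__ == "__main__":
    # Made-up illustrative values, not Factbook data.
    sex_ratio = [1.02, 1.05, 1.06, 1.10, 1.03, 1.07]
    fertility = [1.8, 2.9, 4.1, 2.2, 5.0, 1.6]
    print(conditional_entropy(sex_ratio, fertility))
```

A value near zero means that knowing the bin of the first variable nearly determines the bin of the second; larger values mean the first variable says little about the second.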

Men Detrimental to Men’s Health

After the previous verification of the sex ratio variable, we became curious as to what actually was correlated with it. Using GeoVista’s Feature Selection tool, it didn’t take long for us to find the combination of sex ratio and the life expectancy at birth for males.

Figure 4: GeoVista's Parallel Component Plot

This seems to imply that the higher the ratio of men to women in a country, the lower the life expectancy for men. The two variables are correlated at a remarkably strong -0.8412. The larger implications of such a finding are left as an exercise for the reader.