Part I: Introduction to Data Analysis

The true foundation of theology is to ascertain the character of God. It is by the aid of Statistics that law in the social sphere can be ascertained and codified, and certain aspects of the character of God thereby revealed. The study of statistics is thus a religious service. — Florence Nightingale (1820-1910).

Statistical Thinking is understanding variation and how to deal with it. In this course we explore methods for moving as far as possible to the right on this continuum:

Ignorance / --> / Uncertainty / --> / Risk / --> / Certainty

We will also stress tools and skills for communicating variation and risk-related ideas and conclusions to other people. In other words, we view statistics as an important part of a managerial tool kit, aimed at making good decisions on the basis of solid scientific information.

Types of Data

Categorical vs. Numerical

Discrete vs. Continuous

Nominal Data are the weakest type of measurement for statistical methods. They can be numbers, but really are just names or labels (not quantities). Same as Categorical.

Ordinal Data, by their size, rank or order observations on some basis. The intervals between these numbers, and their ratios, are meaningless.

Interval Data also rank observations according to some dimension, but the interval or distance between observations has a constant meaning. Readings on the Fahrenheit temperature scale are examples of interval data; the zero point is somewhat arbitrary, but a difference of, say, ten degrees means the same thing everywhere on the scale. We can do addition and subtraction with interval data, but not multiplication or division.

Ratio Data are the most useful type for statistical analysis. Ratio data are numbers which by their size rank observations in order of importance and between which intervals as well as ratios are meaningful. All types of arithmetic operations can be performed with ratio data.

Example 1

1998 New York Yankees Roster

No. / Last / First / Position / Bats / Throws / Ht. / Wt. / Born
2 / Jeter / Derek / Infield / R / R / 6-3 / 185 / 6/26/74
11 / Knoblauch / Chuck / Infield / R / R / 5-9 / 170 / 7/7/68
14 / Irabu / Hideki / Pitcher / R / R / 6-4 / 240 / 5/5/69
18 / Brosius / Scott / Infield / R / R / 6-1 / 202 / 8/15/66
19 / Sojo / Luis / Infield / R / R / 5-11 / 175 / 1/3/66
20 / Posada / Jorge / Catcher / S / R / 6-2 / 205 / 8/17/71
20 / Davis / X-Chili / Outfield / S / R / 6-3 / 220 / 1/17/60
21 / O'Neill / Paul / Outfield / L / L / 6-4 / 215 / 2/25/63
22 / Bush / Homer / Infield / R / R / 5-10 / 175 / 11/12/72
24 / Martinez / Tino / Infield / L / R / 6-2 / 210 / 12/7/67
25 / Girardi / Joe / Catcher / R / R / 5-11 / 195 / 10/14/64
26 / Hernandez / Orlando / Pitcher / R / R / 6-2 / 190 / 10/11/69
26 / Spencer / Shane / Outfield / R / R / 5-11 / 210 / 2/20/72
27 / Lloyd / Graeme / Pitcher / L / L / 6-7 / 234 / 4/9/67
28 / Curtis / Chad / Outfield / R / R / 5-10 / 185 / 11/6/68
29 / Stanton / Mike / Pitcher / L / L / 6-1 / 215 / 6/2/67
31 / Raines / Tim / Outfield / S / R / 5-8 / 186 / 9/16/59
33 / Wells / David / Pitcher / L / L / 6-4 / 225 / 5/20/63
36 / Cone / David / Pitcher / L / R / 6-1 / 190 / 1/2/63
39 / Strawberry / Darryl / Outfield / L / L / 6-6 / 215 / 3/12/62
40 / Holmes / X-Darren / Pitcher / R / R / 6-0 / 202 / 4/25/66
42 / Rivera / Mariano / Pitcher / R / R / 6-2 / 168 / 11/29/69
43 / Nelson / X-Jeff / Pitcher / R / R / 6-8 / 235 / 11/17/66
46 / Pettitte / Andy / Pitcher / L / L / 6-5 / 235 / 6/15/72
51 / Williams / Bernie / Outfield / S / R / 6-2 / 205 / 9/13/68
54 / Borowski / Joe / Pitcher / R / R / 6-2 / 225 / 5/4/71
55 / Mendoza / Ramiro / Pitcher / R / R / 6-2 / 154 / 6/15/72
58 / Jerzembeck / Mike / Pitcher / R / R / 6-1 / 185 / 5/18/72

Example 2

  1. Ballard Power Systems, Inc. stock has risen in price by $107 per share in five years.
  2. Ballard Power Systems, Inc. stock has risen in price from $8 to $115 per share in five years.

Operational Definitions

An important concept, perhaps difficult to measure (e.g. the overall health of the U.S. equity market), is often operationalized with an easy-to-measure proxy (e.g. the Dow Jones Industrial Average).

Sampling

One of the fundamental principles of statistics is that we can learn a great deal about a complete population of data by looking at a smaller subset, or sample, from the population.

Statistics

We will learn to represent an entire population (or sample) of numbers with one or more statistics, which attempt to summarize many numbers with one number. The most important types of these summary measures are:

  • measures of central tendency (e.g. averages),
  • measures of dispersion (e.g. ranges, quartiles, standard deviations),
  • measures of association (e.g. coefficients of correlation and covariance),
  • measures of relative distance (e.g. z-stats and t-stats), and
  • measures of probability or risk (e.g. p-values).

Getting Started in Microsoft Excel

Frequency Distribution

Focus / Count
Consulting / 276
Manufacturing / 98
Financial / 69
Other / 57

Percentage Distribution

Focus / Count / Percent
Consulting / 276 / 55.20%
Manufacturing / 98 / 19.60%
Financial / 69 / 13.80%
Other / 57 / 11.40%
Total / 500 / 100.00%
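
The percentage distribution above is just each count divided by the total. As a minimal sketch (in Python purely for illustration, since the course itself works in Excel; the variable names are our own), the same table can be reproduced like this:

```python
# Counts by focus area, copied from the frequency distribution above.
counts = {"Consulting": 276, "Manufacturing": 98, "Financial": 69, "Other": 57}

total = sum(counts.values())

# Each category's share of the total, expressed as a percentage.
percents = {focus: 100 * count / total for focus, count in counts.items()}

for focus, count in counts.items():
    print(f"{focus:<15}{count:>5}{percents[focus]:>9.2f}%")
print(f"{'Total':<15}{total:>5}{sum(percents.values()):>9.2f}%")
```

Checking that the percentages add up to 100% is a quick way to catch a miscounted category.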

Graphs and Charts

History

Johann Heinrich Lambert (1728-1777) was a Swiss-German scientist and mathematician. He is generally recognized as the inventor of the time series graph, in which the values of some variable of interest are plotted against the vertical axis and time is plotted on the horizontal axis.

William Playfair (1759-1823) was a Scottish political economist. He advocated the use of charts instead of tables of data, because "a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read". Playfair also invented the bar graph.

Florence Nightingale (1820-1910) was a British Army nurse in the Crimean War (1854). She used graphical tools to convince army officers to improve conditions in military hospitals. In 1860 she offered to fund a chair in applied statistics at Oxford, and was turned down.

Edward Tufte (1942- ) is a professor of political science, statistics, and computer science at Yale. He has written several excellent books about statistics and graphic design.

Personal Computers and Integrated Software, such as the Microsoft Excel, PowerPoint, and Word programs used by most students in this class, have greatly simplified the creation of graphs and their use in documents and multimedia presentations. An unfortunate side effect has been to limit people's creativity in creating graphs.

Types of Charts

Pareto Diagram

Category / Cost / Cumulative Cost / Cumulative %
DeBurr / $ 8,181.25 / $ 8,181.25 / 52.5%
Cut / $ 5,950.00 / $ 14,131.25 / 90.7%
Engrave / $ 848.75 / $ 14,980.00 / 96.2%
Grind / $ 446.25 / $ 15,426.25 / 99.0%
Weld / $ 148.75 / $ 15,575.00 / 100.0%
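
The cumulative columns of a Pareto table can be derived from the costs alone. The following is a rough Python sketch of that arithmetic (illustrative only; the variable names are assumptions, and the course itself would do this in Excel):

```python
from itertools import accumulate

# Cost by category, copied from the Pareto table above.
costs = {"DeBurr": 8181.25, "Cut": 5950.00, "Engrave": 848.75,
         "Grind": 446.25, "Weld": 148.75}

# A Pareto diagram lists the categories from largest to smallest.
ordered = sorted(costs.items(), key=lambda item: item[1], reverse=True)

total = sum(cost for _, cost in ordered)
running = list(accumulate(cost for _, cost in ordered))   # cumulative cost
cumulative_pct = [100 * r / total for r in running]       # cumulative percent

for (name, cost), cum, pct in zip(ordered, running, cumulative_pct):
    print(f"{name:<8}{cost:>11,.2f}{cum:>12,.2f}{pct:>7.1f}%")
```

The first two categories account for roughly nine tenths of the total cost, which is the kind of concentration a Pareto diagram is meant to expose.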

Histogram

Scatter Plot

Here is a time-series graph from March, 1999, showing the growth of the Dow Jones Industrial Average during the 1990s. Note how the minimum value on the vertical axis has been set to accentuate the Dow's growth — a mild example of lying with charts.

Two graphs that appeared in the H.J. Heinz Company’s 2003 proxy statement:

Here is another example of lying with charts. The proportion of the number of titles in the Barnes and Noble database to the number in Amazon's is evidently 8,000,000 to 4,700,000, or about 170%. But this one-dimensional relationship is distorted in the two-dimensional graph.

The area of Barnes and Noble's black bar is 2700 square centimeters, while the area of Amazon's gray bar is 800 square centimeters. This gives the visual impression that the proportion of titles is more like 340%. The distortion is augmented by the choice of color: Barnes and Noble looks bold, clear and strong, while Amazon looks washed-out, pale, and weak.

Here's an example of a graphical technique that you can't do with Excel. In this New York Times map of Kosovo, colors and shapes are used creatively to communicate complicated quantitative information simply and clearly (e.g. the volume and direction of refugee movements over time).

Two interactive web graphs, from MSNBC.com and from Yahoo Finance:

Juran’s version of the same Dow Jones Industrial Average data:

Juran's Suggestions for Good Charts

General

Label all axes with the variable name and units.

Don't use a legend for univariate charts (charts with only one variable).

Put the dependent variable on the vertical (Y) axis and the independent variable on the horizontal (X) axis. (We will discuss dependent and independent variables in greater detail later in the course.)

Let horizontal and vertical axes start at zero unless you have a good reason not to.

Keep your scales, colors, patterns, and symbols consistent.

Eschew fancy effects that do not contribute to the reader's understanding (e.g. 3-D effects, distracting colors or patterns).

Watch your ink-to-information ratio (see Tufte).

Keep it simple. Don't present data that aren't central to the point you are making.

Don't rely on the reader to infer the point of your chart; state your point explicitly in the text.

Pareto Charts

Let the left vertical axis show the values for the various categories, and be scaled so the maximum value corresponds to the total of all categories. Let the right vertical axis show the cumulative percent, and be scaled so that the maximum value is 100%.

Histograms

Don't let Excel decide what values to use for the class boundaries (a.k.a. bin or bucket boundaries). Specify them yourself.

The proper number of classes is subjective; try to use between six and ten.

Don't use the upper class boundary as the category label on the X-axis. Use the class midpoint to avoid confusion.

The default Excel column chart has gaps between the columns; these make a histogram harder to read. Double-click on one of the columns, select "Options", and reduce the gap width to zero.
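
To make the advice about class boundaries and midpoints concrete, here is a small Python sketch that bins the player weights from the Yankees roster in Example 1. The boundaries (seven classes, each 15 pounds wide) are our own illustrative choice, not something prescribed by the course:

```python
# Player weights (lb) from the 1998 Yankees roster in Example 1.
weights = [185, 170, 240, 202, 175, 205, 220, 215, 175, 210, 195, 190, 210, 234,
           185, 215, 186, 225, 190, 215, 202, 168, 235, 235, 205, 225, 154, 185]

# Class boundaries chosen by hand (an assumption for illustration).
boundaries = [150, 165, 180, 195, 210, 225, 240, 255]

counts = []
midpoints = []
for lower, upper in zip(boundaries, boundaries[1:]):
    # Each class includes its lower boundary and excludes its upper boundary.
    counts.append(sum(lower <= w < upper for w in weights))
    midpoints.append((lower + upper) / 2)

# Label each class by its midpoint, as suggested above.
for mid, count in zip(midpoints, counts):
    print(f"{mid:>6.1f}  {'#' * count}  ({count})")
```

Stating the convention for boundary values (here, a weight of exactly 195 falls in the 195 to 210 class) avoids ambiguity about where such observations are counted.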

Descriptive Statistics

Measures of Central Tendency

1) Average or Arithmetic Mean.

Example: The annual salaries (in $1000s) of the seven employees of a small government department are as follows:

48, 90, 46, 42, 40, 46, 49.

The mean is:

 / = (48 + 90 + 46 + 42 + 40 + 46 + 49)/7
= (361/7)
= 51.571

The mean salary is therefore $51,571. We use the Greek letter mu () to symbolize the mean.

Notation: We will sometimes use a mathematical shorthand notation called Summation Notation. It is easy to use and should not scare anyone; ask for help if you need it.

If we have 7 data points, we can abstractly write these numbers X_1, X_2, ..., X_7 (where X_1 = 48, X_2 = 90, ..., X_7 = 49). Then we write the average of N = 7 numbers as:

\[ \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

We can also write the average:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} X_i \]

where

\[ \sum_{i=1}^{N} X_i = X_1 + X_2 + \cdots + X_N = 48 + 90 + \cdots + 49 = 361, \]

so the average or mean is 361/7 = 51.571, or $51,571.

2) Median

The median of a data set is the “middle” value: the value such that half of the population lies above it and half lies below it.

To find the median salary, first arrange the salaries in ascending order:

40, 42, 46, 46, 48, 49, 90.

The median salary is the middle value. In this case, it is $46,000, which (at least here) seems more representative of a typical salary than the mean value ($51,571).

This worked nicely because we had an odd number of observations. Suppose we want to find the median of the following:

48, 90, 46, 42, 40, 46, 49, 51.

For an even number of observations, the median is the average of the two middle values. In this case, the average of 46 and 48, that is $47,000.

3) Mode

The mode of a data set is the “most popular” value or the value with highest frequency.

Example: The manager of a men's store observes that the 10 pairs of trousers sold yesterday have the following waist sizes (in inches): 31, 34, 36, 33, 28, 34, 30, 34, 32, 40. The mode of these waist sizes is 34 inches, and this fact is undoubtedly of more interest to the manager than are the facts that the mean waist size is 33.2 inches and the median is 33.5 inches.
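
The three measures of central tendency above can be checked with Python's standard `statistics` module. This is only an illustrative sketch using the salary and waist-size data from the examples; the course itself uses Excel:

```python
from statistics import mean, median, mode

salaries = [48, 90, 46, 42, 40, 46, 49]              # in $1000s
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]    # in inches

print(round(mean(salaries), 3))    # mean of the seven salaries
print(median(salaries))            # middle value of an odd-sized list
print(median(salaries + [51]))     # even-sized list: average of the two middle values
print(mean(waists), median(waists), mode(waists))
```

Note how the single $90,000 salary pulls the mean well above the median, while the median is unaffected by how large that one value is.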

Measures of Dispersion

1) Range = maximum value - minimum value.

In the salary example above, the range is 90 - 40 = 50, or $50,000.

2) Quartiles, Interquartile Range

Top 20 Domestic Film Grosses

($ not adjusted)

Rank / Movie / Year / Gross ($ Millions)
1 / Titanic / 1997 / 600.8
2 / Star Wars / 1977 / 460.9
3 / E.T. / 1982 / 434.9
4 / Star Wars: Episode I / 1999 / 431.1
5 / Spider Man / 2002 / 403.7
6 / Jurassic Park / 1993 / 357.1
7 / Lord of the Rings: Two Towers / 2002 / 337.9
8 / Forrest Gump / 1994 / 329.7
9 / Harry Potter: Sorcerer’s Stone / 2001 / 317.6
10 / Lord of the Rings: Fellowship / 2001 / 313.4
11 / Lion King / 1994 / 312.9
12 / Star Wars: Episode II / 2002 / 310.7
13 / Return of the Jedi / 1983 / 309.1
14 / Independence Day / 1996 / 306.1
15 / Sixth Sense / 1999 / 293.5
16 / Empire Strikes Back / 1980 / 290.2
17 / Home Alone / 1990 / 285.8
18 / Shrek / 2001 / 267.7
19 / Harry Potter: Chamber Secrets / 2002 / 262.0
20 / Jaws / 1975 / 260.0

As Of 2003; source:

Note: If figures were corrected for inflation the picture would be very different. For example, Star Wars (1977) would have made $1.027 Billion in today’s dollars.

Quartiles are used to divide a data set into four pieces; they can be thought of as statistical dividing lines between these pieces. You will discover that there are differences in the way statisticians calculate these dividing lines; here we will illustrate the method used in the Excel QUARTILE function.

For a list of n numbers, first sort the numbers in increasing order and count how many observations there are. In this case, n = 20. In the Excel method, the first quartile is the number that is three quarters of the way from the fifth observation (from the bottom) to the sixth. The fifth is 290.2 (Empire Strikes Back), the sixth is 293.5 (Sixth Sense), and the first quartile is:

\[ Q_1 = 290.2 + 0.75 \times (293.5 - 290.2) = 292.675 \]

or $292.675 million.

The second quartile is the number that is half way from the tenth observation (from the bottom) to the eleventh. The tenth is 312.9 (Lion King), the eleventh is 313.4 (Lord of the Rings: Fellowship), so the second quartile is:

\[ Q_2 = 312.9 + 0.5 \times (313.4 - 312.9) = 313.15 \]

or $313.15 million.

The third quartile is the number that is one quarter of the way from the fifteenth observation (from the bottom) to the sixteenth. The fifteenth is 357.1 (Jurassic Park), the sixteenth is 403.7 (Spider Man), so the third quartile is:

\[ Q_3 = 357.1 + 0.25 \times (403.7 - 357.1) = 368.75 \]

or $368.75 million.

The interquartile range is the difference between the third and first quartile:

368.75 – 292.675 = $76.075 million.

Percentiles are like quartiles, except they are dividing lines between hundredths of the data instead of fourths. The 25th percentile is the same as the 1st quartile, the 50th percentile is the same as the 2nd quartile, and the 75th percentile is the same as the 3rd quartile.
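
The interpolation rule described above for the quartiles can be written out explicitly. The function below is a minimal Python sketch of that rule (the name `quartile` and its signature are our own; it is meant to mirror the method described here, not to reproduce Excel's code):

```python
import math

# Top 20 domestic grosses in $ millions, copied from the table above.
grosses = [600.8, 460.9, 434.9, 431.1, 403.7, 357.1, 337.9, 329.7, 317.6, 313.4,
           312.9, 310.7, 309.1, 306.1, 293.5, 290.2, 285.8, 267.7, 262.0, 260.0]

def quartile(data, q):
    """Interpolated quartile q (1, 2, or 3) of a list of numbers."""
    ordered = sorted(data)
    # Zero-based fractional position: q quarters of the way through the n - 1 gaps.
    position = (len(ordered) - 1) * q / 4
    lower = math.floor(position)
    fraction = position - lower
    upper = min(lower + 1, len(ordered) - 1)
    # Move the fractional part of the way from one observation to the next.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

q1, q2, q3 = (quartile(grosses, q) for q in (1, 2, 3))
print(round(q1, 3), round(q2, 3), round(q3, 3), round(q3 - q1, 3))
```

With n = 20 the positions work out to 5.75, 10.5, and 15.25 when counted from one, matching the "three quarters of the way from the fifth to the sixth" description. Python's `statistics.quantiles(grosses, n=4, method="inclusive")` uses the same interpolation and should agree up to rounding.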

Quartiles can be used to create a type of chart called a Box Plot, or Box and Whisker Plot, as in this example from the MBA graduate data:

Notice that the box plot allows us to compare central tendency and dispersion across several variables in one chart. Here we can see how starting salaries vary across the different groups of students. Unfortunately, Excel can't help you with box plots very well (these were created in Minitab, a popular statistics software package).

3) Variance: The average of the squared deviations of values from the arithmetic mean.

Example: To calculate the variance of the above 7 governmental salaries, first calculate the mean; it is 51.571. Then for each number, calculate its deviation from the mean, so we get

48 - 51.571 = -3.57,

90 - 51.571 = 38.43, and so forth...,

49 - 51.571 = -2.57.

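Add the squares of these together, and we get (-3.57)^2 + (38.43)^2 + ... + (-2.57)^2 = 1,783.71. Then dividing by 7 we get 254.82. Because the salaries are in $1000s, the variance of the above salaries is 254.82 (in $1000s squared). Using summation notation this is:

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2 = \frac{1783.71}{7} = 254.82 \]

(Beware of the units of the variance: it is in the original units squared.)

4) Standard deviation = √Variance = σ. (We use the Greek letter sigma, σ, to represent the standard deviation.) This can be thought of as the “average” deviation from the mean. It is simply the square root of the variance:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{254.82} = 15.96 \]

Since the salaries are in $1000s, this is about $15,960.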
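
The variance and standard deviation of the seven salaries can be verified step by step. This is an illustrative Python sketch (the variable names are ours) that follows the definition directly and then compares the result with the standard library's population versions:

```python
from statistics import fmean, pstdev, pvariance

salaries = [48, 90, 46, 42, 40, 46, 49]   # in $1000s

mu = fmean(salaries)
squared_deviations = [(x - mu) ** 2 for x in salaries]

# Population variance: the average squared deviation (divide by N).
variance = sum(squared_deviations) / len(salaries)
sigma = variance ** 0.5

print(round(sum(squared_deviations), 2), round(variance, 2), round(sigma, 2))

# Cross-check against the library's population variance and standard deviation.
print(round(pvariance(salaries), 2), round(pstdev(salaries), 2))
```

Because the inputs are in $1000s, the variance is in $1000s squared, while the standard deviation is back in $1000s.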

Example: A school system employs teachers at salaries between $28,000 and $50,000. The teachers' union and the school board are negotiating the form of next year's salary increases.

1. If every teacher is given a flat $1000 raise, what will this do to the mean salary?

2. To the median salary?

3. To the range?

4. To the quartiles of the salary distribution?

5. What would a flat $1000 raise do to the standard deviation of teachers' salaries?

6. If, instead, each teacher receives a 5% raise, what will this do to the mean salary?

7. To the median salary?

8. Will the 5% raise increase the standard deviation of the salaries?
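
One way to check your answers to these questions is to try both kinds of raise on a small data set. The sketch below uses seven made-up salaries in the stated $28,000 to $50,000 range; the numbers are hypothetical and chosen only for illustration:

```python
from statistics import mean, median, pstdev

# Hypothetical teacher salaries in dollars (invented for this illustration).
salaries = [28000, 31000, 35000, 35000, 40000, 44000, 50000]

flat_raise = [s + 1000 for s in salaries]       # every teacher gets $1000
percent_raise = [s * 1.05 for s in salaries]    # every teacher gets 5%

def summary(data):
    """Mean, median, range, and population standard deviation of a list."""
    return mean(data), median(data), max(data) - min(data), pstdev(data)

for label, data in [("original", salaries), ("+$1000", flat_raise), ("+5%", percent_raise)]:
    m, med, rng, sd = summary(data)
    print(f"{label:<10}{m:>12.2f}{med:>12.2f}{rng:>12.2f}{sd:>12.2f}")
```

Comparing the three printed rows shows which summary measures shift with an added constant and which are stretched by a multiplied constant.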

Population versus Sample

  • A population is usually a group we want to know something about: e.g., all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc....
  • A population parameter is a number relevant to the population that is of interest to us: e.g., the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack....
  • A sample is a subset of the population that we actually do know about (by taking measurements of some kind): e.g., a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line....
  • A sample statistic is often the only practical estimate of a population parameter. In practice we will use sample statistics as proxies for population parameters, but it is important to remember the difference.

Sample Mean and Variance: To determine the average amount of money spent in the Central Mall, a Central City official randomly samples 12 people as they exit the mall. He asks them the amount of money spent and records the data. The official is trying to estimate mean and variance of the population from a sample of 12 data points. Here are the data for the 12 people:

Person / $ spent / Person / $ spent / Person / $ spent
1 / $132 / 5 / $123 / 9 / $449
2 / $334 / 6 / $5 / 10 / $133
3 / $33 / 7 / $6 / 11 / $44
4 / $10 / 8 / $14 / 12 / $1

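Sample Means, Variances and Standard Deviations: A sample (x_1, x_2, ..., x_n) has sample mean, sample variance, and sample standard deviation as follows:

Sample Mean:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Sample Variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Note: The denominator of the sample variance formula is n - 1, not n. This is because of the aforementioned distinction between population parameters and sample statistics. The n - 1 formula for s^2 tends to give a better estimate of σ^2, especially for small sample sizes.

Sample Standard Deviation:

\[ s = \sqrt{s^2} \]

Example:

The sample mean is

\[ \bar{x} = \frac{132 + 334 + \cdots + 1}{12} = \frac{1284}{12} = 107 \]

The sample variance is

\[ s^2 = \frac{(132 - 107)^2 + (334 - 107)^2 + \cdots + (1 - 107)^2}{11} = \frac{229394}{11} = 20854 \]

The sample standard deviation is

\[ s = \sqrt{20854} \approx 144.4 \]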

Without checking every shopper who has ever bought anything at the mall, we estimate that people’s spending at the mall has a mean of $107 and a standard deviation of $144.41. These are just estimates of the population parameters, based on sample statistics.
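
The mall estimates can be reproduced from the twelve data points. The following Python sketch (illustrative only; the variable names are ours) computes the sample statistics from the definitions and then compares them with the standard library's sample functions:

```python
from statistics import mean, stdev, variance

# Dollars spent by the 12 sampled shoppers, in person order 1 through 12.
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
x_bar = sum(spent) / n

# Sample variance divides by n - 1 rather than n.
s_squared = sum((x - x_bar) ** 2 for x in spent) / (n - 1)
s = s_squared ** 0.5

print(x_bar, round(s_squared, 2), round(s, 2))

# The library's sample versions use the same n - 1 denominator.
print(mean(spent), variance(spent), round(stdev(spent), 2))
```

Dividing by n instead of n - 1 would give a smaller variance for the same data, which is why it matters to know whether a tool is computing the population or the sample version.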

In the later parts of this course we will be working almost exclusively with sample data, because in real life population data frequently are difficult, expensive, time-consuming, or impossible to obtain.

Other Measures of Dispersion

Example:

Consider these summary statistics from month-end stock prices for 50 months (from November 1999 to December 2003):

Company / Ticker / Mean / StDev
Minnesota Mining & Manufacturing (3M) / MMM / $105.41 / $ 18.57
International Business Machines / IBM / $ 96.15 / $ 15.96
Procter & Gamble / PG / $ 80.04 / $ 13.48
Disney / DIS / $ 25.82 / $ 8.05
Microsoft / MSFT / $ 30.97 / $ 7.85
McDonald's / MCD / $ 27.15 / $ 6.92

One important consideration in investing is volatility, because high volatility (which is really the same thing we have been calling dispersion) implies high risk. We might infer from the standard deviations here that MMM, IBM, and PG stock prices are quite volatile compared with DIS, MSFT, and MCD.

However, we also note that the average prices for those three stocks are much higher, and an investor might be more interested in volatility relative to the mean than in absolute terms. The coefficient of variation (CV), which is the standard deviation divided by the mean, might be useful here:

\[ CV = \frac{\text{StDev}}{\text{Mean}} \]

Using the CV, we would infer that Disney was in fact the most volatile stock over these 50 months:

Company / Ticker / Mean / StDev / CV
Minnesota Mining & Manufacturing (3M) / MMM / $105.41 / $ 18.57 / 0.176
International Business Machines / IBM / $ 96.15 / $ 15.96 / 0.166
Procter & Gamble / PG / $ 80.04 / $ 13.48 / 0.168
Disney / DIS / $ 25.82 / $ 8.05 / 0.312
Microsoft / MSFT / $ 30.97 / $ 7.85 / 0.253
McDonald's / MCD / $ 27.15 / $ 6.92 / 0.255
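
The CV column is simply each standard deviation divided by the corresponding mean. As an illustrative Python sketch (the dictionary layout and names are our own), the ranking by relative volatility can be reproduced as follows:

```python
# (mean, standard deviation) of month-end prices, copied from the table above.
stocks = {
    "MMM": (105.41, 18.57),
    "IBM": (96.15, 15.96),
    "PG": (80.04, 13.48),
    "DIS": (25.82, 8.05),
    "MSFT": (30.97, 7.85),
    "MCD": (27.15, 6.92),
}

# Coefficient of variation: standard deviation divided by mean.
cv = {ticker: sd / avg for ticker, (avg, sd) in stocks.items()}

# List the stocks from most to least volatile relative to their mean price.
for ticker, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{ticker:<5}{value:.3f}")
```

Because the CV is a ratio of two dollar amounts, it has no units, which is what makes it comparable across stocks with very different price levels.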

In finance, it is conventional to avoid these scaling issues by using the concept of return on investment, which is calculated as shown here: