Descriptive Statistics (ASW Chapter 3)

MGMT 201: Statistics

¨ Numerical Methods: We are interested in describing data using simple, easy-to-understand numbers.

¨ Measures of Location

· By “location”, we mean “where are the numbers, generally”. For example, suppose most of our observations are around 100 in one group and around 150 in another. We want some way to quickly and easily communicate this to others.

· Example 1: Consider the following sample data collected from a population of 500 items:

128 / 129 / 142 / 130 / 128
128 / 128 / 128 / 130 / 125
120 / 126 / 133 / 143 / 129
128 / 129 / 125 / 126 / 139
127 / 125 / 128 / 130 / 130
126 / 138 / 131 / 137 / 130
123 / 128 / 132 / 130 / 118
129 / 124 / 134 / 125 / 135
131 / 137 / 129 / 128 / 127
120 / 127 / 135 / 129 / 138
131 / 129 / 128 / 128 / 136
127 / 130 / 137 / 128 / 129
128 / 129 / 134 / 128 / 127
132 / 124 / 131 / 135 / 127
142 / 124 / 145 / 140 / 132
129 / 129 / 128 / 130

§ A histogram of the data can be found at the end of the notes. How might we describe the distribution depicted here?

§ The histogram shows that the distribution is not symmetric and is skewed to the right. The distribution peaks at 128, which occurs 15 times in the data. The observations range from 118 to 145. These numbers are descriptive, but what else can we say?

§ Mean

· Sample Mean: \(\bar{x} = \frac{\sum x_i}{n}\). Here, n is the number of items in the sample and the summation covers all items in the sample.

§ In our example, \(\bar{x} = 10272/79 \approx 130.03\).

· Population Mean: \(\mu = \frac{\sum x_i}{N}\). Here, N is the number of items in the population and the summation covers all items in the population.

§ In our example, we do not know what the population mean is. We do know that N = 500, but we do not know what the sum of the items is.

§ Note: The sample mean does not have to be equal to the population mean. In fact, they are the same only by coincidence.
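A minimal sketch of the sample-mean calculation in Python, assuming the 79 observations listed above are tallied into a frequency table. The names `counts`, `sample`, and `sample_mean` are illustrative choices, not part of the course materials.

```python
# Frequency table tallied from the 79 observations in Example 1:
# value -> number of times it appears in the sample.
counts = {
    118: 1, 120: 2, 123: 1, 124: 3, 125: 4, 126: 3, 127: 6, 128: 15,
    129: 11, 130: 8, 131: 4, 132: 3, 133: 1, 134: 2, 135: 3, 136: 1,
    137: 3, 138: 2, 139: 1, 140: 1, 142: 2, 143: 1, 145: 1,
}

# Expand the table back into a sorted list of observations.
sample = [value for value, k in sorted(counts.items()) for _ in range(k)]

n = len(sample)                 # number of items in the sample
sample_mean = sum(sample) / n   # x-bar = (sum of x_i) / n

print(n, sum(sample), round(sample_mean, 2))  # 79 10272 130.03
```

The population mean cannot be computed the same way here, because only 79 of the N = 500 items were observed.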

§ Median ≡ the middle value in the distribution. To find the median, we simply order the numbers and choose the one in the middle.

· In our example, we have 79 items, so the middle item is the 40th item. Counting from the left, we see that there are 35 items at 128 or less and 11 at 129. The median is thus 129.

· If a sample has an even number of items, we average the two middle values.

§ Example: Suppose our data set is {5,7,8,10}. The two middle values are 7 and 8, so the median is 7.5.

· When the median is above the sample mean, the data is likely to be skewed to the left. If the median is below the sample mean, the data is likely to be skewed to the right. (Think about why this is true).

· Medians are more descriptive than means when there are extreme “outliers” that might dramatically impact the mean. For example, suppose that the first item in our data set is 10,000 instead of 128. This changes the sample mean to 254.99. Citing this might mislead people about the nature of our data. Note: we will consider such outliers later.
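The effect of an outlier on the mean and the median can be checked with Python's standard `statistics` module. This is a sketch under stated assumptions: the frequency table `counts` is tallied from the Example 1 data, and `contaminated` is a hypothetical copy of the sample in which one observation of 128 is replaced by 10,000.

```python
from statistics import mean, median

# Frequency table tallied from the 79 observations in Example 1.
counts = {
    118: 1, 120: 2, 123: 1, 124: 3, 125: 4, 126: 3, 127: 6, 128: 15,
    129: 11, 130: 8, 131: 4, 132: 3, 133: 1, 134: 2, 135: 3, 136: 1,
    137: 3, 138: 2, 139: 1, 140: 1, 142: 2, 143: 1, 145: 1,
}
sample = [value for value, k in sorted(counts.items()) for _ in range(k)]

# Replace one 128 with 10,000. Which 128 is replaced does not matter,
# because neither the mean nor the median depends on the order of the data.
contaminated = sample.copy()
contaminated[contaminated.index(128)] = 10_000

print(median(sample), round(mean(sample), 2))              # 129 130.03
print(median(contaminated), round(mean(contaminated), 2))  # 129 254.99

# Even number of items: the two middle values are averaged.
print(median([5, 7, 8, 10]))                               # 7.5
```

The median stays at 129 while the mean nearly doubles, which is why the median is often the safer summary when extreme values are present.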

§ Mode ≡ the value that occurs most often in the sample.

· In our example, the number 128 occurs 15 times. This is the mode of the distribution.
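As a quick check, the mode can be found by counting occurrences in Python. This sketch copies the Example 1 data as listed above; the variable names are illustrative.

```python
from collections import Counter

# The 79 observations of Example 1, in the order listed in the notes.
rows = """
128 129 142 130 128
128 128 128 130 125
120 126 133 143 129
128 129 125 126 139
127 125 128 130 130
126 138 131 137 130
123 128 132 130 118
129 124 134 125 135
131 137 129 128 127
120 127 135 129 138
131 129 128 128 136
127 130 137 128 129
128 129 134 128 127
132 124 131 135 127
142 124 145 140 132
129 129 128 130
"""
sample = [int(x) for x in rows.split()]

# most_common(1) returns the single most frequent value with its count.
mode_value, frequency = Counter(sample).most_common(1)[0]
print(mode_value, frequency)  # 128 15
```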

§ Percentiles: The pth percentile is the number such that at least p percent of the values take that number or less and at least (100-p) percent take that value or more.

· In our example, what is the 10th percentile?

§ Consider the following rank ordering of the data:

Rank / Value / Rank / Value / Rank / Value / Rank / Value / Rank / Value
1 / 118 / 17 / 127 / 33 / 128 / 49 / 130 / 65 / 135
2 / 120 / 18 / 127 / 34 / 128 / 50 / 130 / 66 / 135
3 / 120 / 19 / 127 / 35 / 128 / 51 / 130 / 67 / 135
4 / 123 / 20 / 127 / 36 / 129 / 52 / 130 / 68 / 136
5 / 124 / 21 / 128 / 37 / 129 / 53 / 130 / 69 / 137
6 / 124 / 22 / 128 / 38 / 129 / 54 / 130 / 70 / 137
7 / 124 / 23 / 128 / 39 / 129 / 55 / 131 / 71 / 137
8 / 125 / 24 / 128 / 40 / 129 / 56 / 131 / 72 / 138
9 / 125 / 25 / 128 / 41 / 129 / 57 / 131 / 73 / 138
10 / 125 / 26 / 128 / 42 / 129 / 58 / 131 / 74 / 139
11 / 125 / 27 / 128 / 43 / 129 / 59 / 132 / 75 / 140
12 / 126 / 28 / 128 / 44 / 129 / 60 / 132 / 76 / 142
13 / 126 / 29 / 128 / 45 / 129 / 61 / 132 / 77 / 142
14 / 126 / 30 / 128 / 46 / 129 / 62 / 133 / 78 / 143
15 / 127 / 31 / 128 / 47 / 130 / 63 / 134 / 79 / 145
16 / 127 / 32 / 128 / 48 / 130 / 64 / 134

§ We have 79 items in the sample, so we want to look at the 7.9th item (79 × 10/100 = 7.9). Since this is not an integer, we round up to the 8th item and use that as the 10th percentile. In this case, the 10th percentile is 125.

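· Here, we created an index \(i = (p/100)\,n\), where p is the percentile of interest and n is the number of items in the sample.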

· If the index i is an integer, we average the values in positions i and i+1. Suppose, for example, that the sample had an 80th item equal to 150. What is the 25th percentile?

§ i = (25/100) × 80 = 20. We then average the items in positions 20 and 21 to get (127+128)/2 = 127.5. This is the 25th percentile.
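The index rule above can be sketched as a small Python function. This is an illustration under stated assumptions: `percentile` is a hypothetical helper, p is a whole number strictly between 0 and 100, and the data are already sorted. Other software may use different interpolation conventions and return slightly different values.

```python
def percentile(sorted_data, p):
    """p-th percentile by the index rule i = (p/100) * n."""
    n = len(sorted_data)
    # Integer arithmetic avoids floating-point trouble when testing
    # whether i = p*n/100 is a whole number.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is an integer: average the items in positions i and i+1 (1-based).
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # i is not an integer: round up to the next position (1-based whole+1).
    return sorted_data[whole]

# Frequency table tallied from the rank ordering above.
counts = {
    118: 1, 120: 2, 123: 1, 124: 3, 125: 4, 126: 3, 127: 6, 128: 15,
    129: 11, 130: 8, 131: 4, 132: 3, 133: 1, 134: 2, 135: 3, 136: 1,
    137: 3, 138: 2, 139: 1, 140: 1, 142: 2, 143: 1, 145: 1,
}
sample = [value for value, k in sorted(counts.items()) for _ in range(k)]

print(percentile(sample, 10))          # 125   (i = 7.9, round up to item 8)
print(percentile(sample + [150], 25))  # 127.5 (i = 20, average items 20 and 21)
```

With p = 50 the same rule returns the median, 129.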

§ n-tiles ≡ specific percentiles. For example, quartiles are divided by the 25th, 50th, and 75th percentiles. Quintiles are divided by the 20th, 40th, 60th, and 80th percentiles.

¨ Measures of Variability

· Range ≡ the difference between the largest and smallest values in the sample = largest – smallest

§ To calculate the range in Excel, we can use the MIN and MAX functions, which tell us the smallest and largest numbers respectively.

§ This is the simplest measure, but it also is problematic. Consider the following data:

Date / Sales ($000,000) / Advertising Expense ($000,000)
January / $15.2 / $0.71
February / $16.4 / $0.65
March / $16.2 / $0.70
April / $17.0 / $0.74
May / $18.8 / $0.80
June / $17.6 / $0.75
July / $17.2 / $0.75
August / $18.1 / $0.80
September / $18.6 / $0.85
October / $16.1 / $0.81
November / $19.2 / $0.90
December / $31.2 / $1.34

§ The range of monthly sales is $31.2 - $15.2 = $16.0 million. This is misleading and might lead people to believe that monthly sales are highly volatile. Similarly, monthly advertising expenses have a range of $1.34 - $0.65 = $0.69 million, which is also a misleading measure of variability.

§ ⇒ The range is misleading when there are extreme values in the data.

· Interquartile Range ≡ the difference between the third quartile (Q3) and the first quartile (Q1) = Q3 – Q1.

§ Consider the following rank orderings of our data (notice that the dates do not matter here):

Sales / Advertising Expense
15.2 / 0.65
16.1 / 0.70
16.2 / 0.71
16.4 / 0.74
17.0 / 0.75
17.2 / 0.75
17.6 / 0.80
18.1 / 0.80
18.6 / 0.81
18.8 / 0.85
19.2 / 0.90
31.2 / 1.34

§ For sales, Q1 = $16.3 and Q3 = $18.7, so the interquartile range is $18.7 - $16.3 = $2.4 million. This is much more descriptive of the data than the range and suggests that our monthly sales figures have been fairly consistent.

§ For advertising expense, Q1 = $0.725 and Q3 = $0.83, so the interquartile range is $0.83 - $0.725 = $0.105 million. Again, this is much more descriptive of the data than the range and suggests that our monthly advertising expenses have been fairly consistent.

· Five-Number Summary ≡ {smallest value, Q1, Q2, Q3, largest value}

§ We often provide a very informative description of the data by citing the five-number summary.

§ In our example, the five-number summary for sales is {$15.2,$16.3,$17.4,$18.7,$31.2}.

§ Notice that we can quickly infer the presence of an observation well above most of the others.
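The range, interquartile range, and five-number summary can be reproduced with a short Python sketch. The helpers `percentile` and `five_number_summary` are hypothetical names. They assume the index rule from the percentile discussion (round up a non-integer index, average positions i and i+1 for an integer index), which is not necessarily what a spreadsheet or statistics library uses by default.

```python
def percentile(sorted_data, p):
    """p-th percentile (whole number p, 0 < p < 100) by the index rule."""
    whole, remainder = divmod(p * len(sorted_data), 100)
    if remainder == 0:
        # Integer index: average the items in positions i and i+1 (1-based).
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # Non-integer index: round up to the next position.
    return sorted_data[whole]

def five_number_summary(data):
    """Return (smallest, Q1, Q2, Q3, largest)."""
    ordered = sorted(data)
    return (ordered[0], percentile(ordered, 25), percentile(ordered, 50),
            percentile(ordered, 75), ordered[-1])

# Monthly figures from the table above, in $ millions.
sales = [15.2, 16.4, 16.2, 17.0, 18.8, 17.6, 17.2, 18.1, 18.6, 16.1, 19.2, 31.2]
advertising = [0.71, 0.65, 0.70, 0.74, 0.80, 0.75, 0.75, 0.80, 0.85, 0.81, 0.90, 1.34]

for name, data in (("sales", sales), ("advertising", advertising)):
    low, q1, q2, q3, high = five_number_summary(data)
    print(name, [round(v, 3) for v in (low, q1, q2, q3, high)],
          "range:", round(high - low, 3), "IQR:", round(q3 - q1, 3))
# sales [15.2, 16.3, 17.4, 18.7, 31.2] range: 16.0 IQR: 2.4
# advertising [0.65, 0.725, 0.775, 0.83, 1.34] range: 0.69 IQR: 0.105
```

In both series the gap between Q3 and the largest value is far wider than the interquartile range, which is the signal that one observation sits well above the rest.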

· Variance

§ population variance ≡ \(\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}\)

· The Excel function for the population variance is VARP.

§ sample variance ≡ \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}\)

· The Excel function for the sample variance is VAR.

· In our example…

§ \(\bar{x}\) = $18.467 million for sales and \(\bar{x}\) = $0.8167 million for advertising expenses.

§ \(s^2\) = 17.56 for sales and \(s^2\) = 0.0318 for advertising expenses

§ population standard deviation ≡ \(\sigma = \sqrt{\sigma^2}\)

· The Excel function for the population standard deviation is STDEVP.

§ sample standard deviation ≡ \(s = \sqrt{s^2}\)

· The Excel function for the sample standard deviation is STDEV.

· In our example, s=$4.19 million for sales and s=$0.178 million for advertising expenses.

· Note: We typically prefer the standard deviation because it has the same units as the items in the sample. In our example, the variance would be in units of dollars squared. The standard deviation would be in units of dollars. This makes interpretation much easier.
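The figures above can be checked with Python's standard `statistics` module, offered here as a sketch alongside the Excel functions. In that module, `variance` and `stdev` divide by n - 1 (like VAR and STDEV), while `pvariance` and `pstdev` divide by N (like VARP and STDEVP). The variable names are illustrative.

```python
from statistics import mean, pvariance, stdev, variance

# Monthly figures from the table above, in $ millions.
sales = [15.2, 16.4, 16.2, 17.0, 18.8, 17.6, 17.2, 18.1, 18.6, 16.1, 19.2, 31.2]
advertising = [0.71, 0.65, 0.70, 0.74, 0.80, 0.75, 0.75, 0.80, 0.85, 0.81, 0.90, 1.34]

for name, data in (("sales", sales), ("advertising", advertising)):
    print(f"{name}: mean={mean(data):.3f} "
          f"s2={variance(data):.4f} s={stdev(data):.3f}")
# sales: mean=18.467 s2=17.5570 s=4.190
# advertising: mean=0.817 s2=0.0318 s=0.178

# Treating the 12 months as the entire population divides by N instead,
# which gives a smaller variance.
print(round(pvariance(sales), 2))  # 16.09
```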

· The numbers suggest that sales are far more volatile than advertising expense. What is wrong with this conclusion?

§ coefficient of variation ≡ \(\frac{s}{\bar{x}} \times 100\%\) = the size of the standard deviation relative to the mean.

· The coefficient of variation allows us to compare standard deviations on a logical basis. It essentially adjusts for the size of the numbers in each sample.

· In our example…

§ coefficient of variation for sales = 22.7%. We say that the standard deviation of sales is 22.7% of the mean monthly sales.

§ coefficient of variation for advertising expense = 21.8%. We say that the standard deviation of advertising expenses is 21.8% of the mean monthly advertising expense.

§ We find that sales are indeed more volatile than advertising expense, but not by much.
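A short Python sketch of the comparison, assuming the sample standard deviation (n - 1 divisor) is the one intended. The function name `coefficient_of_variation` is an illustrative choice.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(data) / mean(data) * 100

# Monthly figures from the table above, in $ millions.
sales = [15.2, 16.4, 16.2, 17.0, 18.8, 17.6, 17.2, 18.1, 18.6, 16.1, 19.2, 31.2]
advertising = [0.71, 0.65, 0.70, 0.74, 0.80, 0.75, 0.75, 0.80, 0.85, 0.81, 0.90, 1.34]

print(round(coefficient_of_variation(sales), 1))        # 22.7
print(round(coefficient_of_variation(advertising), 1))  # 21.8
```

Because the units cancel in the ratio, the two percentages can be compared directly even though sales are roughly twenty times larger than advertising expenses.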

¨ Measures of Relative Location

· We are interested in logically examining extreme values in data. Furthermore, we are interested in developing a measure of any item’s location relative to the mean.

· Z-score ≡ the number of standard deviations a particular item is away from the mean: \(z_i = (x_i - \bar{x})/s\).

§ Z-scores are quite useful because they allow us to generally state where a particular item falls in a distribution. In fact, we can draw specific inferences about the nature of the rest of the sample simply by looking at the z-score.

§ In our example, consider the March sales of $16.2 million. The z-score for this item is
z = ($16.2 - $18.467)/$4.19 = -0.541. This implies that the observation is 0.541 standard deviations below the mean.

§ Consider the December sales of $31.2 million. z = ($31.2 – $18.467)/$4.19 = 3.04.
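The two z-scores can be reproduced with a brief Python sketch. It assumes the sample mean and sample standard deviation of all twelve months, as in the calculations above. The dictionary `sales_by_month` and the other names are illustrative.

```python
from statistics import mean, stdev

# Monthly sales from the table above, in $ millions.
sales_by_month = {
    "January": 15.2, "February": 16.4, "March": 16.2, "April": 17.0,
    "May": 18.8, "June": 17.6, "July": 17.2, "August": 18.1,
    "September": 18.6, "October": 16.1, "November": 19.2, "December": 31.2,
}

values = list(sales_by_month.values())
x_bar = mean(values)   # sample mean
s = stdev(values)      # sample standard deviation (n - 1 divisor)

# z = (x - x_bar) / s for every month.
z = {month: (x - x_bar) / s for month, x in sales_by_month.items()}

print(round(z["March"], 3))     # -0.541
print(round(z["December"], 2))  # 3.04
```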

· Chebyshev’s Theorem: For any z > 1, at least \(1 - 1/z^2\) of the items in any data set must be within z standard deviations of the mean.

§ In our example, at least \(1 - 1/3^2\) = 0.889 = 88.9% of the items must fall within 3 standard deviations of the mean.

§ Similarly, at least \(1 - 1/1.3^2\) = 0.408 = 40.8% of the items must fall within 1.3 standard deviations of the mean.

§ Chebyshev’s Theorem is nice because it is true for any distribution. We can have extremely bizarre distributions and the theorem will still be true.
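A small Python sketch of the bound, with a check against the monthly sales data. The function name `chebyshev_lower_bound` is illustrative, and the check treats the twelve months as the whole data set, so it uses the population standard deviation (`pstdev`); that choice is an assumption made here, not something stated in the notes.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(z):
    """Minimum fraction of items within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only informative for z > 1")
    return 1 - 1 / z**2

print(round(chebyshev_lower_bound(3), 3))    # 0.889
print(round(chebyshev_lower_bound(1.3), 3))  # 0.408

# Check on the monthly sales figures ($ millions).
sales = [15.2, 16.4, 16.2, 17.0, 18.8, 17.6, 17.2, 18.1, 18.6, 16.1, 19.2, 31.2]
mu, sigma = mean(sales), pstdev(sales)
within_3 = sum(abs(x - mu) <= 3 * sigma for x in sales) / len(sales)
print(round(within_3, 3))  # 0.917, which is at least 0.889
```

Eleven of the twelve months fall within three standard deviations, so the observed fraction comfortably satisfies the guarantee.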

§ For reasonable distributions, we can say even more.

· Consider the bell curve (aka, the normal probability density function):

§ Approximately 68% of the items will fall within one standard deviation of the mean.

§ Approximately 95% of the items will fall within two standard deviations of the mean.

§ Approximately 99.7% of the items will fall within three standard deviations of the mean.
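These three percentages can be recovered from the normal distribution itself. The sketch below uses the identity P(|Z| < k) = erf(k/√2) for a standard normal variable Z; the function name `normal_fraction_within` is an illustrative choice.

```python
import math

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Compare the value for k = 3 with Chebyshev's guarantee of at least 88.9%: knowing that the data are bell-shaped allows a much stronger statement.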

· Detecting Outliers

§ What is an outlier? Technically, an outlier is simply an item that comes from a different distribution than the other items in the sample. Practically speaking, our task is to infer whether a given item is likely to be an outlier.

§ We often use the z-score to detect outliers. In our example, z = 3.04 for December sales. Given the small sample size, it is highly unlikely that December sales come from the same underlying distribution as the sales from other months. We might consequently decide to remove the December observation from our sample.

· This, of course, requires us to then recalculate the means, standard deviations, etc. based on the new sample.

· Note that the high December sales figure is likely the result of the holiday season, so we would be perfectly justified in removing the observation from our sample. Alternatively, we might use econometric methods to control for the fact that December sales are unique.
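As a sketch of the recalculation step, the following Python drops the December observation and recomputes the sample mean and standard deviation for sales. The names `sales_by_month` and `trimmed` are illustrative, and removing the point is shown only as one of the options discussed above.

```python
from statistics import mean, stdev

# Monthly sales from the table above, in $ millions.
sales_by_month = {
    "January": 15.2, "February": 16.4, "March": 16.2, "April": 17.0,
    "May": 18.8, "June": 17.6, "July": 17.2, "August": 18.1,
    "September": 18.6, "October": 16.1, "November": 19.2, "December": 31.2,
}

full = list(sales_by_month.values())
# Keep every month except the suspected outlier.
trimmed = [x for month, x in sales_by_month.items() if month != "December"]

print(round(mean(full), 2), round(stdev(full), 2))        # 18.47 4.19
print(round(mean(trimmed), 2), round(stdev(trimmed), 2))  # 17.31 1.27
```

A single month accounts for most of the measured variability: the standard deviation falls from about $4.19 million to about $1.27 million once December is excluded.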