Summary Descriptive Measures

Location is a measure of where the center of the data lies.

Scale is a measure of how “spread out” data is.

Criteria for Measures of Location and Scale

Must be well defined for: Raw Data, Grouped Data, and Theoretical Curves

For Business Purposes: Must be arithmetic

Measures of Location

Mode

Simply the most frequent value in a data set.

Problems:

Raw Data: Many data sets have no repeated values, so the mode does not exist.

Grouped Data: Mode is taken as midpoint of the bin with the greatest frequency.

But consider the data discussed in the last lecture.

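Theoretical Data: Mode may not exist; consider the theoretical distribution of random numbers, which (if the numbers are uniformly distributed) should look flat, so that no single value is more frequent than any other.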

Median

The median is that data value which has approximately the same percentage of observations below it as above it (for large data sets this proportion will approach 50%).

The word “median” comes from the Latin word “medius”, meaning “middle”.

Raw Data:

Finding the median from raw data is a two step process. First you must put the data in order, then you need to find the middle value.

Example: Data = 3, -1, 6, 10, 11

Ordered Data = -1, 3, 6, 10, 11

Median = 6

If the sample size n is odd, the median is the value occupying position (n+1)/2 in the ordered data. Here n = 5, so the median is the 3rd value.

Example: Data = 3, -1, 6, 10, 11, 7

Ordered Data = -1, 3, 6, 7, 10, 11

Median = any value between 6 and 7. The usual convention is to average the two middle points, giving 6.5.

If the sample size n is even, the median is the arithmetic average of the values occupying positions n/2 and (n/2)+1 in the ordered data.

Notice: The median is not computed, it is found. For example, replace the value of 11 in the above example by 12,000. The median remains 6.5.

Cannot be manipulated algebraically.
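The two-step procedure (order the data, then find the middle) can be sketched in Python as follows. This is only an illustration of the rule described above; the function name `median` and the variable names are chosen here and are not part of the lecture material.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # step 1: put the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: 1-based position (n+1)/2 is 0-based index n//2
        return ordered[mid]
    # even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([3, -1, 6, 10, 11])
even_result = median([3, -1, 6, 10, 11, 7])
# Replacing 11 by 12,000 leaves the middle of the ordered data unchanged.
outlier_result = median([3, -1, 6, 10, 12000, 7])
print(odd_result, even_result, outlier_result)
```

The last call repeats the outlier experiment from the text: only the position of the values matters, not their size.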

Finding the Median of Raw Data Using EXCEL

Open the file “thickdat.xls” in the MBA Mod 1 folder.

Find an empty cell and type in =median(

Then highlight the range of the data. You should see something that looks like the following:

Finally, type in the right parenthesis.

The result is 355, which is the average of the 30th and 31st ordered values, both of which happen to be 355.

Finding the Median from Grouped Data

Suppose you did not have the raw data for steel thickness, but only had the data grouped as shown below:

Interval lower / Interval upper / Midpoint m(i) / Freq f(i) / Cumulative Freq F
341.5 / 344.5 / 343 / 1 / 1
344.5 / 347.5 / 346 / 3 / 4
347.5 / 350.5 / 349 / 8 / 12
350.5 / 353.5 / 352 / 8 / 20
353.5 / 356.5 / 355 / 20 / 40
356.5 / 359.5 / 358 / 13 / 53
359.5 / 362.5 / 361 / 5 / 58
362.5 / 365.5 / 364 / 2 / 60

Using the column labeled “F”, it is clear that the 30th and 31st observations lie in the interval [353.5 to 356.5].

Altogether there are 20 observations in the interval [353.5 to 356.5].

Since there are 20 observations below 353.5, we need 10 more to get to the 30th value.

ASSUMPTION: The data points in the interval are equally spaced throughout the interval.

To get the 30th value, we need to go 10/20ths (or .5) into the interval. Since the bin is 3 units wide, we need to go a distance of (10/20)*3 = 1.5 into the interval. Therefore we estimate the 30th value as 353.5 + 1.5 = 355.

To get the 31st value, we need to go 11/20ths (or .55) into the interval. Since the bin is 3 units wide, we need to go a distance of (11/20)*3 = 1.65 into the interval. Therefore we estimate the 31st value as 353.5 + 1.65 = 355.15.

The median is estimated as median = (355 + 355.15)/2 = 355.075.
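The interpolation above can be sketched in Python. The bin boundaries and frequencies are copied from the steel thickness table; the helper names `kth_value` and `grouped_median` are hypothetical, and the code relies on the same equal-spacing assumption stated in the text.

```python
# (lower bound, upper bound, frequency) for each bin of the steel thickness data
bins = [
    (341.5, 344.5, 1), (344.5, 347.5, 3), (347.5, 350.5, 8), (350.5, 353.5, 8),
    (353.5, 356.5, 20), (356.5, 359.5, 13), (359.5, 362.5, 5), (362.5, 365.5, 2),
]


def kth_value(bins, k):
    """Estimate the k-th ordered observation (1-based) by linear interpolation."""
    below = 0  # cumulative frequency of the bins before the current one
    for lower, upper, freq in bins:
        if k <= below + freq:
            fraction = (k - below) / freq          # e.g. 10/20 for the 30th value
            return lower + fraction * (upper - lower)
        below += freq
    raise ValueError("k exceeds the number of observations")


def grouped_median(bins):
    """Median from grouped data, using the odd/even rule for raw data."""
    n = sum(freq for _, _, freq in bins)
    if n % 2 == 1:
        return kth_value(bins, (n + 1) // 2)
    return (kth_value(bins, n // 2) + kth_value(bins, n // 2 + 1)) / 2


value_30 = kth_value(bins, 30)
value_31 = kth_value(bins, 31)
estimate = grouped_median(bins)
print(round(value_30, 3), round(value_31, 3), round(estimate, 3))
```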

Finding the Median From Theoretical Probability Distributions

If f(x) is the probability density function of x, the median is the value m_med satisfying the integral equation:

$$\int_{-\infty}^{m_{med}} f(x)\,dx = 0.5$$

Problems with the Median

Suppose you had two groups of people. In Group 1 you had 50 people with a median hourly wage of $15.00 per hour. In Group 2 you had 100 people with a median hourly wage of $17.00 per hour. Given this information, can you determine the median hourly wage of all 150 people? You cannot: the group medians alone do not determine the median of the combined group.

Consider the following data:

Time 1 / Time 2 / Change
5 / 4 / -1
10 / 12 / 2
15 / 18 / 3
20 / 19 / -1
25 / 23 / -2
Median / 15 / 18 / -1

Change in medians: 18 - 15 = 3
Median of changes: -1

The change in the medians is not the same as the median of the changes.

Averages

Averages can be tricky.

Consider:

Rate of Return
Year 1 / Year 2 / Year 3 / Year 4 / Year 5
0.07 / 0.1 / 0.12 / 0.3 / 0.15

What is the average rate of return over the five year period?

Arithmetic average = .148

Correct average = .145321

The correct average is the rate r which, when compounded for 5 years, gives the same result as the observed compounding rates. In other words, r is the solution to the equation:

$$(1+r)^5 = (1.07)(1.10)(1.12)(1.30)(1.15)$$
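A short Python check of both averages is sketched below. The rates are the five values from the table; the variable names are chosen for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math

returns = [0.07, 0.10, 0.12, 0.30, 0.15]

# Arithmetic average of the yearly rates
arithmetic = sum(returns) / len(returns)

# Growth of one unit invested over the five years
growth = math.prod(1 + r for r in returns)

# Constant yearly rate that produces the same five-year growth
compound = growth ** (1 / len(returns)) - 1

print(round(arithmetic, 3), round(compound, 6))
```

The compound rate is lower than the arithmetic average because compounding depends on the product of the growth factors, not on their sum.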

Consider:

Dallas and Fort Worth are approximately 30 miles apart. On a round trip from Dallas to Fort Worth and back, you average 30 mph on the first leg from Dallas to Fort Worth. How fast do you have to travel on the return leg from Fort Worth to Dallas so that you average 60 mph for the round trip?

Usual answer: 90 mph

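Correct answer: it is impossible. At 60 mph the 60-mile round trip must take 1 hour in total, and the first 30 miles at 30 mph have already used that full hour.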

Both of the above are common errors.
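The round-trip arithmetic can be checked with a few lines of Python. This is a sketch using the 30-mile distance quoted above; the function name `round_trip_average` is hypothetical. The key point is that average speed is total distance divided by total time, not the mean of the two speeds.

```python
leg_miles = 30.0
first_leg_mph = 30.0
target_round_trip_mph = 60.0

time_used = leg_miles / first_leg_mph                   # hours spent on the first leg
time_allowed = 2 * leg_miles / target_round_trip_mph    # hours allowed for the whole trip
time_left = time_allowed - time_used                    # hours remaining for the return leg


def round_trip_average(speed_out, speed_back, leg=leg_miles):
    """Average speed over both legs: total distance divided by total time."""
    return 2 * leg / (leg / speed_out + leg / speed_back)


# The "usual answer" of 90 mph on the way back does not give 60 mph overall.
average_if_90_back = round_trip_average(30.0, 90.0)
print(time_left, average_if_90_back)
```

With no time left for the return leg, no finite speed reaches the 60 mph target.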

The Arithmetic Average

The arithmetic average of a set of values is the sum of the values divided by the number of values.

If x_1, x_2, ..., x_n represent the n numerical values from a random sample, then the formula for the sample mean is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

To find the average using EXCEL (when I use this term subsequently, I will mean the arithmetic average), one uses the function “average”. It is used just like the “median” function.

Specifically, one types “=average(range of data)”. For the data on steel thickness, you would have something that looks like the following:

By closing the parentheses, you get the average for the data as 354.55.

Computation of the Arithmetic Mean

From Grouped Data

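If we do not have the raw data but only the frequency distribution of the data, the formula for the sample mean becomes:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n}\sum_{i=1}^{k} f_i m_i$$

where m_i is the midpoint of bin i, f_i is its frequency, k is the number of bins, and n is the total number of observations.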

EXCEL does not compute this formula directly. To compute this in EXCEL for the steel thickness data, one can use the following procedure:

Interval lower / Interval upper / Midpoint m(i) / Freq f(i) / f(i)*m(i)
341.5 / 344.5 / 343 / 1 / 343
344.5 / 347.5 / 346 / 3 / 1038
347.5 / 350.5 / 349 / 8 / 2792
350.5 / 353.5 / 352 / 8 / 2816
353.5 / 356.5 / 355 / 20 / 7100
356.5 / 359.5 / 358 / 13 / 4654
359.5 / 362.5 / 361 / 5 / 1805
362.5 / 365.5 / 364 / 2 / 728
Sum / 60 / 21276
Average / 21276 / 60 = 354.6
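For readers working outside EXCEL, the same column arithmetic can be sketched in Python. The midpoints and frequencies are taken from the table above; the variable names are illustrative.

```python
midpoints = [343, 346, 349, 352, 355, 358, 361, 364]   # m(i)
freqs = [1, 3, 8, 8, 20, 13, 5, 2]                      # f(i)

n = sum(freqs)                                               # total number of observations
weighted_sum = sum(f * m for f, m in zip(freqs, midpoints))  # sum of f(i)*m(i)
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)
```

The grouped estimate differs slightly from the raw-data average of 354.55 because every observation in a bin is treated as if it sat at the bin midpoint.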

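If one defines the proportion of observations in bin i (with frequency f_i out of n observations) as

$$p_i = \frac{f_i}{n}$$

then the formula for the mean from grouped data with bin midpoints m_i (and also the formula for the mean of a discrete probability distribution) is:

$$\bar{x} = \sum_{i=1}^{k} p_i m_i$$

Using the above, it is then possible to generalize the definition of the mean for data from a continuous distribution with probability density function f(x) as:

$$\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$$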

Computation with the Average

Consider the problem of having two groups of people: 50 people in Group 1 with an average hourly wage of $15.00 and 100 people in Group 2 with an average hourly wage of $17.00. Can one find the mean of the pooled group of 150 people?

The average of the pooled group is just the total hourly wages of all 150 people divided by the 150 people. Using the formula for the arithmetic average, one can show that:

$$\sum_{i=1}^{n} x_i = n\bar{x}$$

Therefore the sum of the hourly wages in the first group is 50 x 15 = 750.

The sum of the hourly wages in the second group is 100 x 17 = 1700. Finally the mean of the pooled group is:

pooled average = (750 + 1700)/ (50 + 100) = $16.33

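This can be written in formula terms as:

$$\bar{x}_{pooled} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$$

This is a special case of the formula for multiple groups (g groups, with sizes n_j and means \bar{x}_j):

$$\bar{x}_{pooled} = \frac{\sum_{j=1}^{g} n_j\bar{x}_j}{\sum_{j=1}^{g} n_j}$$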
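A minimal Python sketch of the pooled mean for any number of groups follows. The function name `pooled_mean` is hypothetical; the inputs are the group sizes and group averages from the wage example.

```python
def pooled_mean(sizes, means):
    """Mean of the combined group, given each group's size and mean."""
    total = sum(n * m for n, m in zip(sizes, means))   # sum of all individual values
    return total / sum(sizes)


wage = pooled_mean([50, 100], [15.00, 17.00])
print(round(wage, 2))
```

Unlike the median, the mean of the pooled group can be recovered from the group summaries alone.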

Consider the following example which we discussed previously in connection with the median:

Group 1 / Group 2 / Change
5 / 4 / -1
10 / 12 / 2
15 / 18 / 3
20 / 19 / -1
25 / 23 / -2
Average / 15 / 15.2 / 0.2

Notice that the change in the means is the same as the mean of the changes.
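The contrast between the mean and the median on this data can be checked with Python's standard `statistics` module. This is an illustrative sketch; the list names are chosen here.

```python
from statistics import mean, median

first = [5, 10, 15, 20, 25]
second = [4, 12, 18, 19, 23]
changes = [b - a for a, b in zip(first, second)]

# For the mean, the two routes agree.
change_in_means = mean(second) - mean(first)
mean_of_changes = mean(changes)

# For the median, they do not.
change_in_medians = median(second) - median(first)
median_of_changes = median(changes)

print(round(change_in_means, 1), mean_of_changes, change_in_medians, median_of_changes)
```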

Summary

Criterion / Median / Mean
Ease of Understanding / High / Reasonable
Computation / Moderate / Easy
Effect of Outliers / None / High
Use in Further Computation / None / Easy
Accuracy for Inference to Population for a fixed sample of size n / 25% worse than mean / Baseline

Simpson’s Paradox

Consider the following data found in the file “meandemo.xls”:

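Rank / Number of Males / Male Average Salary / Number of Females / Female Average Salary
Prof / 35 / 60,000 / 5 / 65,000
Assoc Prof / 25 / 50,000 / 20 / 55,000
Asst Prof / 15 / 40,000 / 15 / 45,000
Overall Average / Males: 52,667 / Females: 52,500

Within every rank the female average is higher than the male average, yet the overall male average is higher, because a larger share of the males are in the highest-paid rank.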
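The overall averages in the salary table are pooled means weighted by headcount, which can be verified with a short Python sketch. The counts and rank averages are copied from the table; the dictionary layout and the function name `overall_average` are assumptions made for illustration.

```python
# rank -> {sex: (headcount, average salary)}
ranks = {
    "Prof": {"male": (35, 60_000), "female": (5, 65_000)},
    "Assoc Prof": {"male": (25, 50_000), "female": (20, 55_000)},
    "Asst Prof": {"male": (15, 40_000), "female": (15, 45_000)},
}


def overall_average(sex):
    """Headcount-weighted average salary across all ranks for one sex."""
    total_pay = sum(r[sex][0] * r[sex][1] for r in ranks.values())
    headcount = sum(r[sex][0] for r in ranks.values())
    return total_pay / headcount


male_avg = overall_average("male")
female_avg = overall_average("female")

# True when the female average exceeds the male average in every single rank
female_higher_in_every_rank = all(
    r["female"][1] > r["male"][1] for r in ranks.values()
)

print(round(male_avg), round(female_avg), female_higher_in_every_rank)
```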

Or the following data also found in the file “meandemo.xls”:

Group / Time 1 Values / Time 1 Median / Time 2 Values / Time 2 Median / Change in Median
Group 1 / 30, 35, 48 / 35 / 31, 32, 75 / 32 / -3
Group 2 / 14, 85, 98 / 85 / 60, 83, 85 / 83 / -2
Group 3 / 60, 63, 65 / 63 / 61, 62, 98 / 62 / -1
All Groups / all 9 values / 60 / all 9 values / 62 / 2
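The same reversal can be reproduced for the medians in Python. The values below are read from the table as three observations per group at each time (the middle one being the group median), which is an interpretation of the table layout; the variable names are illustrative.

```python
from statistics import median

# group -> (Time 1 values, Time 2 values), as read from the table
groups = {
    "Group 1": ([30, 35, 48], [31, 32, 75]),
    "Group 2": ([14, 85, 98], [60, 83, 85]),
    "Group 3": ([60, 63, 65], [61, 62, 98]),
}

# Change in median within each group
group_changes = {
    name: median(t2) - median(t1) for name, (t1, t2) in groups.items()
}

# Change in median when all nine values are pooled
all_t1 = [x for t1, _ in groups.values() for x in t1]
all_t2 = [x for _, t2 in groups.values() for x in t2]
overall_change = median(all_t2) - median(all_t1)

print(group_changes, overall_change)
```

Every group median falls, yet the median of the pooled data rises.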

Measures of Scale

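The simplest way to measure scale is to find the average distance of each data point from the measure of location (in our case the arithmetic mean). Symbolically this can be written:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$$

This quantity is always zero, because the positive and negative deviations cancel. The fact that some deviations are positive and some negative can be corrected in one of two ways:

1) Use the absolute value to compute the mean absolute deviation (MAD), which in formula terms is:

$$MAD = \frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$$

or 2) Use the square of the deviations, which in formula terms gives the sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

and the sample standard deviation,

$$s = \sqrt{s^2}$$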

In EXCEL, the function “stdev” uses the above formula (with the n-1 divisor) for computing the sample standard deviation.

For the steel thickness data, you would type “=stdev(range)” as shown below:

This yields the value of s=4.492549.

EXCEL does not automatically compute the standard deviation if the data is grouped. The computing formula to use in this case is given by:

$$s^2 = \frac{\sum_{i=1}^{k} f_i m_i^2 - \frac{\left(\sum_{i=1}^{k} f_i m_i\right)^2}{n}}{n-1}$$

and then taking the square root.

The necessary terms can be computed in EXCEL as shown in the following table for the steel data:

Interval lower / Interval upper / Midpoint m(i) / Freq f(i) / f(i)*m(i) / f(i)*m(i)*m(i)
341.5 / 344.5 / 343 / 1 / 343 / 117,649
344.5 / 347.5 / 346 / 3 / 1,038 / 359,148
347.5 / 350.5 / 349 / 8 / 2,792 / 974,408
350.5 / 353.5 / 352 / 8 / 2,816 / 991,232
353.5 / 356.5 / 355 / 20 / 7,100 / 2,520,500
356.5 / 359.5 / 358 / 13 / 4,654 / 1,666,132
359.5 / 362.5 / 361 / 5 / 1,805 / 651,605
362.5 / 365.5 / 364 / 2 / 728 / 264,992
Sum / 60 / 21,276 / 7,545,666

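Dividing the corrected sum of squares, 7,545,666 - (21,276)^2/60 = 1,196.4, by n-1 = 59 and taking the square root yields an estimate of s = 4.5031.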

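If only the proportion of observations in each bin, p_i, is available, then the following approximate formula may be used:

$$s^2 \approx \sum_{i=1}^{k} p_i m_i^2 - \left(\sum_{i=1}^{k} p_i m_i\right)^2$$

which, after taking the square root, in this case yields the value of s = 4.465423.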
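Both grouped-data estimates can be reproduced with a short Python sketch. The midpoints and frequencies come from the steel thickness table. The first estimate divides by n-1, and the second is the proportion-based form (equivalent to dividing by n), which is consistent with the two values quoted in the text. Variable names are illustrative.

```python
import math

midpoints = [343, 346, 349, 352, 355, 358, 361, 364]   # m(i)
freqs = [1, 3, 8, 8, 20, 13, 5, 2]                      # f(i)

n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))        # sum of f(i)*m(i)
sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))   # sum of f(i)*m(i)^2

# Frequency form with the n-1 divisor
s_grouped = math.sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

# Proportion form: only p(i) = f(i)/n is needed
props = [f / n for f in freqs]
mean_p = sum(p * m for p, m in zip(props, midpoints))
s_props = math.sqrt(sum(p * m * m for p, m in zip(props, midpoints)) - mean_p ** 2)

print(round(s_grouped, 4), round(s_props, 6))
```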

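The standard deviation for data following a theoretical distribution with probability density function f(x) and mean \mu can also be defined as:

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

and,

$$\sigma = \sqrt{\sigma^2}$$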

Further Uses of the Mean and Standard Deviation

The Mound Rule:

For data which is “mound” shaped, approximately:

Percent of Data / Region
68% / mean +/- one standard deviation
95% / mean +/- two standard deviations
99.7% / mean +/- three standard deviations

For the steel thickness data (which is mound shaped) the exact results are:

Region / Values / % of Data
mean +/- 1 sd / 350.1 to 359.0 / 73.0%
mean +/- 2 sd / 345.6 to 363.5 / 96.7%
mean +/- 3 sd / 341.1 to 368.0 / 100.0%

Chebyshev’s Inequality

For any distribution, at least 100(1 - 1/k^2)% of the data must lie in the region mean +/- k standard deviations.

Specifically, for k=2, at least 75% of the data must lie in the range mean +/- 2 standard deviations.

For k=3, at least 88.9% of the data must lie in the range mean +/- 3 standard deviations.
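The bound is easy to tabulate. The sketch below uses a hypothetical helper, `chebyshev_lower_bound`, that returns the guaranteed minimum fraction of data within k standard deviations of the mean; for k at or below 1 the guarantee is simply zero.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)


bound_2 = chebyshev_lower_bound(2)
bound_3 = chebyshev_lower_bound(3)
print(f"{bound_2:.1%}", f"{bound_3:.1%}")
```

These guarantees hold for any distribution, which is why they are much weaker than the mound rule percentages.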

Measures of Relative Position

Class / Mean / Standard Deviation
Monday / 85 / 6
Wednesday / 90 / 8

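A student from the Monday night class takes the Wednesday exam and scores 92. To what score in the Monday night class does this score correspond?

Define the standardized score of a value x in a class with mean \bar{x} and standard deviation s:

$$t = \frac{x - \bar{x}}{s}$$

and the score corresponding to t in another class with mean \bar{x}' and standard deviation s':

$$x' = \bar{x}' + t\,s'$$

For the example,

t = (92-90)/8 = .25

x_Monday = 85 + .25 x 6 = 86.5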
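The conversion between classes can be sketched in Python. The function names `standard_score` and `equivalent_score` are hypothetical; the class means and standard deviations are those in the table above.

```python
def standard_score(x, mean, sd):
    """Number of standard deviations that x lies above the class mean."""
    return (x - mean) / sd


def equivalent_score(x, from_mean, from_sd, to_mean, to_sd):
    """Score in the target class with the same relative position as x."""
    t = standard_score(x, from_mean, from_sd)
    return to_mean + t * to_sd


t = standard_score(92, 90, 8)                    # Wednesday exam score of 92
monday = equivalent_score(92, 90, 8, 85, 6)      # same relative position on Monday
print(t, monday)
```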