Sihyun Kim February 16, 2006

MAT 120.7688

Statistics

Project

Data Set 8: New York City Marathon Finishers.

1. Describe the data you selected and explain why they are important.

The data depict a sample of 150 runners selected from the population of 29,733 runners who finished the New York City Marathon in a recent year. The data show the time and rank in which these runners completed the marathon, and they also list the age and sex of each runner in the sample. Because the sample is fairly large, a careful analysis of the statistics may help us determine whether patterns exist among New York City Marathon runners. For example: does gender substantially affect how quickly a runner is likely to finish? Through the investigative techniques that we have encountered so far in our Statistics course, we may be able to answer such questions and draw inferences concerning not just the sample of 150 runners, but the general population of 29,733 New York City Marathon runners as a whole. Only through a better understanding of the special individuals who participate in the sporting event can we truly appreciate the incredible feat of running 42.195 km nonstop.

2. Using SPSS, compute descriptive statistics on your data. That is, you will find mean, median, mode, quartiles, variance, standard deviation, maximum, and minimum, range and mid-range.

Please see the following page

Frequencies of Marathon Runners

Statistics

Statistic / Order / Age / Gender / Time (sec)
N Valid / 150 / 150 / 150 / 150
N Missing / 0 / 0 / 0 / 0
Mean / 14309.53 / 38.87 / - / 15878.84
Std. Error of Mean / 705.693 / .828 / - / 256.725
Median / 13279.50(a) / 37.73(a) / - / 15322.00(a)
Mode / 130(b) / 30(b) / - / 9631(b)
Std. Deviation / 8642.936 / 10.144 / - / 3144.226
Variance / 74700337.674 / 102.895 / - / 9886157.974
Skewness / .016 / .443 / - / .645
Std. Error of Skewness / .198 / .198 / - / .198
Kurtosis / -1.175 / -.377 / - / .551
Std. Error of Kurtosis / .394 / .394 / - / .394
Range / 28915 / 49 / - / 16267
Minimum / 130 / 19 / - / 9631
Maximum / 29045 / 68 / - / 25898
Sum / 2146429 / 5830 / - / 2381826
Percentile 25 / 7093.00(c) / 31.00(c) / - / 13854.00(c)
Percentile 50 / 13279.50 / 37.73 / - / 15322.00
Percentile 75 / 21017.00 / 46.25 / - / 17397.00

a Calculated from grouped data.

b Multiple modes exist. The smallest value is shown

c Percentiles are calculated from grouped data.

Descriptive Statistics Between Genders

Case Processing Summary

Gender / Cases
Valid / Missing / Total
N / Percent / N / Percent / N / Percent
Time (sec) / M / 111 / 100.0% / 0 / .0% / 111 / 100.0%
F / 39 / 100.0% / 0 / .0% / 39 / 100.0%

Descriptives

Gender / Statistic / Std. Error
Time (sec) / M / Mean / 15415.23 / 288.236
95% Confidence Interval for Mean / Lower Bound / 14844.02
Upper Bound / 15986.45
5% Trimmed Mean / 15297.23
Median / 14942.00
Variance / 9221888.290
Std. Deviation / 3036.756
Minimum / 9631
Maximum / 24384
Range / 14753
Interquartile Range / 3844.00
Skewness / .582 / .229
Kurtosis / .214 / .455
F / Mean / 17198.33 / 497.545
95% Confidence Interval for Mean / Lower Bound / 16191.11
Upper Bound / 18205.56
5% Trimmed Mean / 17008.16
Median / 16792.00
Variance / 9654503.123
Std. Deviation / 3107.170
Minimum / 12047
Maximum / 25898
Range / 13851
Interquartile Range / 3503.00
Skewness / .984 / .378
Kurtosis / 1.277 / .741

3. Explain in your words what each of the above statistics mean, in general. Also explain what it means as related to your project. Compare the different measures of central tendency, measures of dispersion and explain what they tell you about your individual data.

Mean- The mean is the measure of center in a set of values that is found by adding the values and dividing the total by the number of values. The mean is generally considered the most important of all numerical measurements used to describe data.
The sample mean of time needed by male New York City Marathon runners was 15415.23 seconds (4 hours 16 minutes 55 seconds), while the sample mean of time needed by female New York City Marathon runners was 17198.33 seconds (4 hours 46 minutes 38 seconds). In other words, the average male runner in the sample finished the event about 30 minutes before the average female runner.
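
The conversions above can be reproduced with a short Python sketch. The helper name to_hms is our own and is not part of the SPSS output; the two means are taken from the Descriptives table.

```python
def to_hms(seconds):
    """Convert a duration in seconds to (hours, minutes, seconds),
    rounding to the nearest whole second first."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


# Sample means from the SPSS Descriptives table (seconds).
male_mean = 15415.23
female_mean = 17198.33

male_hms = to_hms(male_mean)
female_hms = to_hms(female_mean)

# Gap between the two sample means, expressed in minutes.
gap_minutes = (female_mean - male_mean) / 60

print(male_hms, female_hms, round(gap_minutes, 1))
```

The gap comes out slightly under 30 minutes, which is why the text describes it as "about 30 minutes".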

Median- The median of a data set is the measure of center in a set of values that is found by choosing the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. The usefulness of the median is especially clear when a set of values contains extreme values (in such a case, the mean can be dramatically affected, while the median is not).

The sample median time for male New York City Marathon runners was 14942.00 seconds (4 hours 9 minutes 2 seconds), while the sample median time for female runners was 16792.00 seconds (4 hours 39 minutes 52 seconds). In other words, half of the male runners in the sample finished faster than 14942.00 seconds and the other half finished slower. Likewise, half of the female runners in the sample finished faster than 16792.00 seconds and the other half finished slower. In addition, we can observe that the median male runner finished about 31 minutes (1850 seconds) ahead of the median female runner.

Mode- The mode of a data set is the value that occurs most frequently. The mode is the only measure of center that can be used with nominal data, but it is less helpful for measurements that rarely repeat.

Consequently, the mode tells us little about the data under consideration here. Finishing times are recorded to the second, so few, if any, times appear more than once in the sample; indeed, SPSS reports that multiple modes exist and shows only the smallest (9631 seconds).

Quartiles- Just as the median divides the data into two equal parts, the three quartiles, denoted by Q1 (first quartile), Q2 (second quartile), and Q3 (third quartile) divide the sorted values into four equal parts. The first quartile separates the bottom 25% of the sorted values from the top 75%, the second quartile, which is the same as the median, separates the bottom 50% of the sorted values from the top 50%, and the third quartile separates the bottom 75% of the sorted values from the top 25%.

Quartiles are especially helpful in the analysis of the data in question. For the male New York City Marathon runners, the quartiles of time needed to complete the event are as follows: Q1 = 12907 seconds, Q2 = 14942 seconds, and Q3 = 16977 seconds. For the female New York City Marathon runners, the quartiles are as follows: Q1 = 14711 seconds, Q2 = 16792 seconds, and Q3 = 18874 seconds. We shall see later in the box plot analysis that these statistics give us a quick, but firm, understanding of how finishing times are distributed for males and for females.
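
As a quick consistency check on the quartiles quoted above, the following Python sketch compares the interquartile range implied by each pair of quartiles (Q3 - Q1) with the interquartile range printed in the SPSS Descriptives table. The dictionary layout and variable names are our own illustration; only the numbers come from the project. A gap between the two may simply reflect a different quartile method, so the sketch flags the difference without deciding which figure is right.

```python
# Quartiles quoted in the text and the IQR printed by SPSS (seconds).
reported = {
    "M": {"q1": 12907, "q2": 14942, "q3": 16977, "spss_iqr": 3844.00},
    "F": {"q1": 14711, "q2": 16792, "q3": 18874, "spss_iqr": 3503.00},
}

iqr_gap = {}
for sex, r in reported.items():
    implied_iqr = r["q3"] - r["q1"]           # IQR implied by the quoted quartiles
    iqr_gap[sex] = implied_iqr - r["spss_iqr"]  # difference from the SPSS figure
    print(sex, implied_iqr, r["spss_iqr"], iqr_gap[sex])
```

For both groups the implied interquartile range is larger than the SPSS value, so the quoted Q1 and Q3 are worth rechecking against the raw data before they are used in the box plots.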

Standard Deviation- The standard deviation of a set of sample values is a measure of variation of values about the mean. Values close together will yield a small standard deviation, whereas values spread farther apart will yield a larger standard deviation. The standard deviation is the measure of variation that is generally considered the most important and useful in statistical analysis.

The standard deviation of time needed to complete the New York City Marathon for male runners is 3036.76 seconds. Because the mean is 15415.23 seconds, we can interpret this statistic as follows: if the times are approximately normally distributed, about 68% of the observations in the sample fall within the interval 15415.23 ± 3036.76 seconds (within 1 standard deviation of the mean), and about 95% fall within the interval 15415.23 ± 6073.52 seconds (within 2 standard deviations of the mean).

The standard deviation of time needed to complete the New York City Marathon for female runners is 3107.17 seconds. Because the mean is 17198.33 seconds, we can interpret this statistic as follows: if the times are approximately normally distributed, about 68% of the observations in the sample fall within the interval 17198.33 ± 3107.17 seconds (within 1 standard deviation of the mean), and about 95% fall within the interval 17198.33 ± 6214.34 seconds (within 2 standard deviations of the mean).
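
The interval endpoints can be computed with a minimal Python sketch. The function name sd_interval is our own, and the 68% and 95% figures are the usual empirical rule for bell-shaped data, which is an assumption about these times rather than something the code verifies.

```python
def sd_interval(mean, sd, k):
    """Return (lower, upper) for the interval mean +/- k standard deviations."""
    return mean - k * sd, mean + k * sd


# (mean, standard deviation) in seconds, rounded as quoted in the text.
groups = {"M": (15415.23, 3036.76), "F": (17198.33, 3107.17)}

intervals = {
    sex: {k: sd_interval(mean, sd, k) for k in (1, 2)}
    for sex, (mean, sd) in groups.items()
}

for sex, by_k in intervals.items():
    for k, (low, high) in by_k.items():
        print(f"{sex}: within {k} SD -> {low:.2f} to {high:.2f} seconds")
```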

Variance- The variance of a set of values is a measure of variation equal to the square of the standard deviation. The usefulness of variance is especially clear when we deal with sampling distributions: the sample variance is an unbiased estimator of the population variance, so it tends to yield good results when we use a sample statistic to estimate a population parameter. The sample standard deviation, on the other hand, is a biased estimator that tends to underestimate the population standard deviation, particularly when the sample size is small.

The variances of the time needed to complete the New York City Marathon for male and female runners are 9221888.3 and 9654503.1 seconds squared, respectively. Because we are using the sample to estimate parameters of the general population of the 29,733 runners who finished the New York City Marathon, the sample variance serves as a better estimator of its population counterpart than the sample standard deviation does.
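
The relationship between the two measures can be checked directly against the SPSS output. The sketch below, with variable names of our own choosing, takes the square root of each reported variance and compares it with the reported standard deviation; small differences are expected because SPSS prints both figures rounded to three decimals.

```python
import math

# (variance, standard deviation) as printed in the SPSS Descriptives table.
spss = {
    "M": (9221888.290, 3036.756),
    "F": (9654503.123, 3107.170),
}

sd_error = {}
for sex, (variance, sd) in spss.items():
    # The standard deviation should equal the square root of the variance.
    sd_error[sex] = abs(math.sqrt(variance) - sd)
    print(sex, round(math.sqrt(variance), 3), sd)
```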

Maximum and Minimum- The maximum is the highest observed value in a given set of data. Conversely, the minimum represents the lowest observed value in a given set of data. Both values depict extreme values in a given data set. Together, the maximum and the minimum can show us examples of observations that are very unusual in a set of values.

The minimum time (in the sample) needed to complete the New York City Marathon for male runners was 9631 seconds (2 hours 40 minutes 31 seconds), whereas the maximum time (in the sample) needed was 24384 seconds (6 hours 46 minutes 24 seconds). These values deviate significantly from the mean of 15415.23 seconds (4 hours 16 minutes 55 seconds), and, therefore, certainly do not represent the time needed by the typical male runner to finish the marathon. Similarly, the minimum time (in the sample) needed to complete the New York City Marathon for female runners was 12047 seconds (3 hours 20 minutes 47 seconds), whereas the maximum time (in the sample) needed was 25898 seconds (7 hours 11 minutes 38 seconds). Again, these values certainly do not represent the time needed by the typical female runner to finish the marathon.

Range- The range of a set of data is the difference between the highest value and the lowest value. Although the range is very easy to compute, it is not as useful as the other measures of variation due to the fact that it depends on only the extremes of a given data set.

The range of time needed to complete the New York City Marathon for male runners was 14753 seconds. This means that the slowest male in the sample took 4 hours 5 minutes 53 seconds longer than the fastest male in the sample. Likewise, the range of time needed to complete the New York City Marathon for female runners was 13851 seconds. In other words, the slowest female in the sample took 3 hours 50 minutes 51 seconds longer than the fastest female in the sample.

Midrange - The midrange is the measure of center that is the midway value between the highest and lowest values in a given data set. It is calculated by dividing the sum of the maximum and the minimum by two. Although the midrange is rarely used (like the range, it is also dependent on the extremes of a given data set), it has some advantages: (1) it is easily computed; (2) it helps to reinforce the important concept that there are many ways to define the center of a data set.

The midrange of the times needed by male New York City Marathon runners to complete the event was 17007.5 seconds, whereas the midrange for female runners was 18972.5 seconds. These statistics give us yet another way to interpret the data in question: a male runner who required about 17007.5 seconds to complete the marathon would have finished exactly halfway between the fastest and slowest men in the sample, both of whom required very unusual periods of time to accomplish the very same thing. Likewise, a female runner who required about 18972.5 seconds to complete the marathon would have finished exactly halfway between the fastest and slowest women in the sample.
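
Both the range and the midrange depend only on the two extremes, as the following Python sketch shows. The function name range_and_midrange is our own; the minimum and maximum times are those reported by SPSS for each sex.

```python
def range_and_midrange(minimum, maximum):
    """Return (range, midrange) computed from the two extreme values."""
    return maximum - minimum, (minimum + maximum) / 2


# Minimum and maximum finishing times in seconds, from the Descriptives table.
male = range_and_midrange(9631, 24384)
female = range_and_midrange(12047, 25898)

print("Male range and midrange:", male)
print("Female range and midrange:", female)
```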

4. Construct as many of the graphical representations of your data as possible. Discuss the results as they pertain to your set of data.

Histogram: Frequency of Time (Male Marathon Runners)

Histogram: Frequency of Time (Female Marathon Runners)

From the histograms that depict the frequency distribution of time needed by male and female runners to complete the New York City Marathon, we can see that the times for both sexes are approximately normally distributed. More specifically, the distribution is, for the most part, symmetrical: the mean and median for both males (mean = 15415.23 seconds, median = 14942.00 seconds) and females (mean = 17198.33 seconds, median = 16792.00 seconds) are fairly close to each other. In both groups, however, the mean exceeds the median and the skewness reported by SPSS is positive (.582 for males, .984 for females), which indicates a tail toward slower times.
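
The degree of symmetry can also be examined numerically. The Python sketch below divides each skewness statistic by its standard error, both taken from the SPSS Descriptives table, and expresses the gap between mean and median in standard deviation units. Treating a ratio beyond roughly 2 in absolute value as a sign of noticeable skew is a common rule of thumb that we assume here; it is not part of the SPSS output, and the variable names are our own.

```python
# Values from the SPSS Descriptives table, by sex.
stats = {
    "M": {"mean": 15415.23, "median": 14942.00, "sd": 3036.756,
          "skew": 0.582, "se_skew": 0.229},
    "F": {"mean": 17198.33, "median": 16792.00, "sd": 3107.170,
          "skew": 0.984, "se_skew": 0.378},
}

skew_ratio = {}
center_gap = {}
for sex, s in stats.items():
    # Skewness relative to its standard error (rule-of-thumb cutoff near 2).
    skew_ratio[sex] = s["skew"] / s["se_skew"]
    # How far the mean sits above the median, in standard deviations.
    center_gap[sex] = (s["mean"] - s["median"]) / s["sd"]
    print(sex, round(skew_ratio[sex], 2), round(center_gap[sex], 3))
```

Under that assumed rule of thumb, both ratios point to a modest right skew, so the normal model is best read as an approximation for these times.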