Math Analysis: Statistics
Class Notes (Page 1) Name:
Statistics: The mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.
The Arithmetic Mean is the most widely used measure of location.
It is calculated by summing the values and dividing by the number of values (the average).
The Median is the midpoint of the values after they have been ordered from the smallest to the largest. There are as many values above the median as below it in the data array. For an even number of values, the median is the arithmetic average of the two middle numbers.
Example: The top five movies had budgets (in millions) of 125, 195, 110, 80 and 205.
Arranging the data in ascending order gives: 80, 110, 125, 195, 205. Thus the median is 125.
Example: Given the budgets of the top 10 movies: 80, 110, 125, 125, 145, 150, 180, 195, 200, 205 which is already in order, the median is (145+150)/2 = 147.5.
The Mode is the value of the observation that appears most frequently.
There can be more than one mode per set of data, and if all of the data elements occur the same number of times, then there is ‘no mode’.
Example: Given the data set: 80, 110, 125, 125, 145, 150, 180, 195, 200, 205
the mode is 125, because it occurs the most times.
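The three measures of location above can be checked with a short Python sketch. It uses the standard library `statistics` module and the movie budgets from the examples; the variable names `top5` and `top10` are only illustrative.

```python
from statistics import mean, median, multimode

# Movie budgets (in millions) from the examples above
top5 = [125, 195, 110, 80, 205]
top10 = [80, 110, 125, 125, 145, 150, 180, 195, 200, 205]

print(median(top5))      # 125   (middle value after sorting)
print(median(top10))     # 147.5 (average of the two middle values)
print(multimode(top10))  # [125] (a list, since there can be more than one mode)
print(mean(top10))       # 151.5 (sum of the values divided by 10)
```

`multimode` returns every value tied for the highest count, so a data set in which all values occur equally often returns all of them; these notes call that case 'no mode'.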
Practice 1: Determine the mean, median, and mode for the following data set: 22, 30, 42, 33, 25, 30, 40, 38, 29.
The Range is the difference between the largest (maximum) and the smallest (minimum) value.
Example: The range for the data set from the previous example is 205-80 = 125.
Stem-and-leaf display: A statistical technique for displaying a set of data. Each numerical value is divided into two parts: the leading digits become the stem and the trailing digits the leaf.
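As a rough illustration, here is a minimal Python sketch of a stem-and-leaf display for the movie budget data. It assumes non-negative whole numbers, with the last digit as the leaf and all leading digits as the stem; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted list of leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 125 -> stem 12, leaf 5
        stems[stem].append(leaf)
    return dict(stems)

budgets = [80, 110, 125, 125, 145, 150, 180, 195, 200, 205]
for stem, leaves in stem_and_leaf(budgets).items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

This sketch skips stems that have no leaves (such as 9 or 13 here); a hand-drawn display usually lists those stems with an empty leaf row.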
Practice 2: The top 10 movies ran for: 130, 154, 117, 102, 136, 130, 133, 106, 129, 115 minutes. Construct a stem-and-leaf chart from this data.
The Percentile gives us the location, or ranking, of a data point in relation to the data set.
For instance, the 9th percentile is the value that is above exactly 9% of all the data points.
A special type of percentile is called the Quartile.
The first quartile, Q1, is the value that is above one quarter, or 25% of the data values.
The third quartile, Q3, is the value that is above three quarters, or 75% of the data values.
The first quartile, Q1, is essentially the median for the first half of the data.
The third quartile, Q3, is essentially the median for the second half of the data.
The Inter-quartile range is the distance between the third quartile Q3 and the first quartile Q1.
This distance will include the middle 50 percent of the observations.
Inter-quartile range = Q3 - Q1
Example: The gross incomes of the top 9 movies are: 352, 281, 184, 254, 241, 207, 381, 209, 191.
Arranging the data in ascending order gives: 184, 191, 207, 209, 241, 254, 281, 352, 381. Thus the median is 241, Q1 is (191 + 207)/2 = 199, and Q3 is (281 + 352)/2 = 316.5. The inter-quartile range is Q3 - Q1 = 316.5 - 199 = 117.5.
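The quartile calculation can be sketched in Python using the "median of each half" description from these notes. This is one convention among several (software packages may use slightly different quartile rules), and the function name `quartiles` is only illustrative. It needs at least two data values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3), with Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the middle
    upper = data[-half:]   # values above the middle (middle value left out when n is odd)
    return median(lower), median(data), median(upper)

gross = [352, 281, 184, 254, 241, 207, 381, 209, 191]
q1, med, q3 = quartiles(gross)
print(q1, med, q3, q3 - q1)  # 199.0 241 316.5 117.5
```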
A Box Plot (sometimes called a ‘box and whisker plot’) is a graphical display, based on quartiles, that helps to picture a set of data.
Five pieces of data are needed to construct a box plot: the Minimum Value, the First Quartile, the Median, the Third Quartile, and the Maximum Value.
Practice 3: Construct a box plot for the data set from the previous example:
A box plot sometimes includes an Outlier.
An outlier is an extreme value that is more than 1.5 times the inter-quartile range beyond the upper or lower quartiles. If an outlier exists, it is marked by a single point, and each whisker is extended to the last value of the data that is not an outlier.
Example: In the data from the previous example, Q1 = 199 and Q3 = 316.5, so the inter-quartile range is 316.5 - 199 = 117.5.
(1.5)(117.5) = 176.25. Q3 + 176.25 = 316.5 + 176.25 = 492.75. Since the highest value in the set was 381, there is no outlier on that side. Q1 - 176.25 = 199 - 176.25 = 22.75. Since the lowest value in the set was 184, there is no outlier on that side either. So there are no outliers for this set of data.
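The outlier check can be written as a small Python sketch. It takes Q1 = 199 and Q3 = 316.5 for the movie gross data as given, and applies the 1.5 times inter-quartile range rule; the function names are made up for illustration.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs: 1.5 * IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """List every value that lies outside the two cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

gross = [352, 281, 184, 254, 241, 207, 381, 209, 191]
print(outlier_fences(199, 316.5))        # (22.75, 492.75)
print(find_outliers(gross, 199, 316.5))  # [] (no outliers)
```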
Practice 4: Determine if there are any outliers for the movie run times: 130, 154, 117, 102, 136, 130, 133, 106, 129, 115.
To find the location of the pth percentile in a data set containing n data points, first order the data from smallest to largest. Then find the location L in the ordered set with the following formula:
$L = (n + 1)\frac{p}{100}$
If the location falls between two data points, you will find a value between those data points.
Example: Here is the process for finding the 18th percentile for the following data set:
25, 32, 80, 93, 110, 125, 125, 130, 140, 145, 150, 180, 195, 200, 205
In this problem, n = 15. Therefore the location of the 18th percentile is $L = (15 + 1)\frac{18}{100} = 2.88$
and is between the 2nd and 3rd data points.
To find the exact value of the 18th percentile, we take the difference in the values of the two data points: 80 - 32 = 48. Then add the fractional part of the location times the difference to the lower data point.
p18 = 32 + 0.88*48 or p18 = 74.24 ≈ 74
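The location-and-interpolation procedure can be sketched in Python. This assumes the location formula L = (n + 1) * p / 100; the function name `percentile` and the choice to return the smallest or largest value when the location falls off either end of the list are assumptions made for this example.

```python
def percentile(values, p):
    """Estimate the p-th percentile by locating L = (n + 1) * p / 100 and interpolating."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100
    whole = int(location)        # position (counting from 1) of the lower data point
    fraction = location - whole  # how far to move toward the next data point
    if whole < 1:
        return data[0]           # assumption: clamp to the smallest value
    if whole >= n:
        return data[-1]          # assumption: clamp to the largest value
    lower, upper = data[whole - 1], data[whole]
    return lower + fraction * (upper - lower)

data = [25, 32, 80, 93, 110, 125, 125, 130, 140, 145, 150, 180, 195, 200, 205]
print(percentile(data, 18))
print(percentile(data, 50))  # the 50th percentile is the median
```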
Standard Deviation and Variance:
Just as the median is one measure of the middle of a set of data, so is the mean. The mean is defined as the sum of the data divided by the number of elements.
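The formula is: $\mu = \frac{\sum x}{n}$, where the sum runs over all data values x and n is the number of elements.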
Just as the quartiles measure the dispersion or spread around the median, variance and standard deviation measure the spread around the mean.
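The formulas are: $\sigma^2 = \frac{\sum (x - \mu)^2}{n}$ for the variance and $\sigma = \sqrt{\sigma^2}$ for the standard deviation, where $\mu$ is the mean and n is the number of data points.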
Example: Consider these test scores: 70, 98, 88, 97, 84, 75, 100, 76
Here is the process for finding the Standard Deviation:
1. Find the mean.
2. Find the difference between each data point and the mean. This is the deviation from the mean. The sum of all these differences equals zero (some are positive and some are negative).
3. To do away with the problem of negative numbers, square each difference. Find the sum of the squares and divide by the number of data points. This is the VARIANCE.
4. The variance is hard to compare with the data, as it is in squared units rather than on the same scale as the original data. Take the square root of the variance. This is the STANDARD DEVIATION.
Find the mean first: μ =
Fill in the chart below to find the variance and standard deviation
List the data
Variance σ² =
Standard Deviation σ =
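The four steps can also be followed in a short Python sketch, which may be useful for checking a completed chart. It divides by the number of data points, as in step 3 (the population variance); the variable names are only illustrative.

```python
from math import sqrt

scores = [70, 98, 88, 97, 84, 75, 100, 76]

# Step 1: the mean
mean = sum(scores) / len(scores)

# Step 2: deviation of each data point from the mean (these add up to zero)
deviations = [x - mean for x in scores]

# Step 3: square each deviation, add them up, divide by the number of data points
variance = sum(d ** 2 for d in deviations) / len(scores)

# Step 4: the square root of the variance
std_dev = sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

The standard library function `statistics.pvariance` uses the same divide-by-n definition and should agree with the variance computed here.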
A Frequency Distribution is a grouping of data into mutually exclusive categories showing the number of observations in each class.
Some concepts associated with Frequency Distributions are the following:
Class frequency: The number of observations in each class.
Class interval: The class interval is obtained by subtracting the lower limit of a class from the lower limit of the next class. The class intervals used in the frequency distribution should be equal.
Class Mark: The midpoint of a class interval.
Number of Classes: Use at least k classes, where k is the smallest whole number such that $2^k > n$ (n is the number of data points).
(This is the $2^k$ rule.)
Determine a suggested class interval (i) by using the formula:
$i = \frac{\text{Highest value} - \text{Lowest value}}{\text{Number of classes}}$
Note: this is a suggested class interval; if the computed class interval is ‘97’, for instance, it may be better to use ‘100’.
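As a rough sketch, the number of classes and the suggested class interval can be computed in Python. It assumes the rule means the smallest k with 2^k greater than n, and it uses the 15 movie lengths from the example below; the function names are made up for illustration.

```python
def number_of_classes(n):
    """Smallest whole number k such that 2**k is greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def suggested_interval(values, k):
    """(Highest value - lowest value) divided by the number of classes."""
    return (max(values) - min(values)) / k

lengths = [130, 154, 117, 102, 136, 130, 133, 106, 129, 115, 124, 105, 146, 125, 91]
k = number_of_classes(len(lengths))
print(k)                               # 4, since 2**4 = 16 > 15
print(suggested_interval(lengths, k))  # 15.75
```

Both results are only suggestions: k is a minimum, and the interval is normally rounded to a convenient number such as 10 or 20.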
A Relative Frequency Distribution shows the percent of observations in each class.
Example: Length of top 15 grossing movies of 2011:
130, 154, 117, 102, 136, 130, 133, 106, 129, 115, 124, 105, 146, 125, 91
Movie Length (minutes) / 85-95 / 95-105 / 105-115 / 115-125 / 125-135 / 135-145 / 145-155
Frequency / 1 / 1 / 2 / 3 / 5 / 1 / 2
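The class frequencies in this table can be reproduced with a small Python sketch, which also gives the relative frequency of each class. It assumes each class includes its lower limit and excludes its upper limit (so 105 is counted in 105-115), which is the convention that matches the counts shown; the function name is only illustrative.

```python
def frequency_distribution(values, start, width, num_classes):
    """Count the values in each class; a class includes its lower limit only."""
    counts = [0] * num_classes
    for v in values:
        index = (v - start) // width  # which class the value falls in
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts

lengths = [130, 154, 117, 102, 136, 130, 133, 106, 129, 115, 124, 105, 146, 125, 91]
counts = frequency_distribution(lengths, start=85, width=10, num_classes=7)
relative = [round(100 * c / len(lengths), 1) for c in counts]

print(counts)    # [1, 1, 2, 3, 5, 1, 2]
print(relative)  # percent of observations in each class
```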
A Histogram is a graph in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. It is a visual representation of the Frequency Distribution. The class frequencies are represented by the heights of the bars and the bars are drawn adjacent to each other. An example is drawn below:
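A quick text version of the histogram for the movie length table can be printed with the Python sketch below. Note that it is turned on its side: the classes run down the page and the bars run across, whereas a drawn histogram puts the classes on the horizontal axis. The function name `text_histogram` is made up for this example.

```python
classes = ["85-95", "95-105", "105-115", "115-125", "125-135", "135-145", "145-155"]
frequencies = [1, 1, 2, 3, 5, 1, 2]

def text_histogram(labels, counts):
    """Return one line per class: the class label, then a bar of '#' marks."""
    width = max(len(label) for label in labels)
    return [f"{label:>{width}} | {'#' * count}" for label, count in zip(labels, counts)]

for line in text_histogram(classes, frequencies):
    print(line)
```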