AP Statistics Chapter 1: Exploring Data
Section 1.2: Describing Distributions with Numbers
Shape: compare x̅ and M. x̅ < M suggests skewed left, x̅ > M suggests skewed right, x̅ ≈ M suggests roughly symmetric.
Center:
Mean
Median
Mode – (the only measure of center for categorical data; if there are 2 modes, the distribution is bimodal)
Spread:
Range
Interquartile Range (IQR, ALWAYS state even if not drawing a box plot)
Standard Deviation
Mean:
Is the arithmetic average.
Formula: $\bar{x} = \frac{\sum x_i}{n}$
Not resistant to the influence of extreme observations.
Sample mean: x̅ Population mean: μ (mu)
Median - M
The midpoint of a distribution
The number such that half the observations are smaller and the other half are larger.
If the number of observations n is odd, the median is the center of the ordered list.
If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
Is resistant to the influence of extreme observations.
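The difference in resistance can be checked numerically. The sketch below uses Python's standard `statistics` module (my choice of tool; the notes themselves use a calculator) with the Roger Maris totals from the exercise in these notes, computing the mean and median with and without the extreme value 95.

```python
from statistics import mean, median

# Roger Maris' home run totals from the exercise in these notes
# (95 is the made-up extreme value).
maris = [8, 16, 23, 14, 26, 33, 39, 95, 28, 13]
maris_without_95 = [x for x in maris if x != 95]

# Mean: sum of the observations divided by n.
mean_with = mean(maris)
mean_without = mean(maris_without_95)

# Median: center of the ordered list (n odd) or the mean of the
# two center observations (n even).
median_with = median(maris)
median_without = median(maris_without_95)

print("mean:  ", mean_with, "->", round(mean_without, 2))
print("median:", median_with, "->", median_without)
```

Dropping the single extreme value moves the mean by about 7 home runs but moves the median by only 1.5, which is what "resistant" means in practice.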
Parameters vs. Statistics:
Parameters describe populations – Greek letters (μ, σ)
Statistics describe samples of data – English letters (x̅, s)
How to load data into a list
How to find statistics about a list of data
For the following data, create a back-to-back stemplot and describe the plot both visually and numerically. Which measure of center makes more sense?
Babe Ruth’s HR totals: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
Roger Maris’ HR totals: 8, 16, 23, 14, 26, 33, 39, 95, 28, 13 (95 is not a real HR fyi)
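For checking a hand-drawn plot, here is a minimal Python sketch of a back-to-back stemplot. The function names `leaves_by_stem` and `back_to_back_stemplot` are hypothetical helpers written for this illustration, and the layout (tens digit as stem, ones digit as leaf, unsplit stems) is an assumption about how the plot is meant to be drawn.

```python
from collections import defaultdict

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 16, 23, 14, 26, 33, 39, 95, 28, 13]

def leaves_by_stem(data):
    """Group each value's ones digit (leaf) under its tens digit (stem)."""
    groups = defaultdict(list)
    for value in sorted(data):
        groups[value // 10].append(value % 10)
    return groups

def back_to_back_stemplot(left, right):
    """Return the rows of a back-to-back stemplot as a list of strings."""
    left_groups = leaves_by_stem(left)
    right_groups = leaves_by_stem(right)
    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    stems = range(lowest, highest + 1)  # keep empty stems so gaps stay visible
    width = max(len(left_groups[s]) for s in stems)
    rows = []
    for stem in stems:
        # Left-hand leaves are reversed so they increase outward from the stem.
        left_leaves = "".join(str(d) for d in reversed(left_groups[stem]))
        right_leaves = "".join(str(d) for d in right_groups[stem])
        rows.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
    return rows

# Maris on the left, Ruth on the right.
rows = back_to_back_stemplot(maris, ruth)
print("\n".join(rows))
```

Keeping the empty stems 7 and 8 in the output makes the gap between 95 and the rest of the Maris data easy to see.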
Measuring Spread: The Quartiles
Why is spread so important? Create 2 dot plots of the following data. State the mean, median, and range.
1) 3, 3, 3, 3, 3
2) 2, 2, 3, 4, 4
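A quick numerical check of the two data sets, sketched in Python with the standard `statistics` module (the helper name `describe` is hypothetical):

```python
from statistics import mean, median

set_1 = [3, 3, 3, 3, 3]
set_2 = [2, 2, 3, 4, 4]

def describe(data):
    """Return the mean, median, and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }

summary_1 = describe(set_1)
summary_2 = describe(set_2)
print(summary_1)
print(summary_2)
```

Both sets have the same mean and the same median, so only a measure of spread tells them apart.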
Range
Difference between largest and smallest points.
Not resistant to the influence of extreme observations.
Interquartile Range (IQR)
Measures the spread of the middle half of the data.
Is resistant to the influence of extreme observations.
Quartile 3 minus Quartile 1.
Quartiles: 25th, 50th, 75th percentiles
Percentiles: Q1 and Q3 are the 25th and 75th percentiles respectively.
Q1 - 25th percentile - means 25% of data falls at or below this value
Q2 - median - 50th percentile
Q3 - 75th percentile - means 75% of data falls at or below this value
95th percentile means 95% of data is at or below this value.
The five number summary
Consists of the
min, Q1, median, Q3, max
Offers a reasonably complete description of center and spread.
Used to create a boxplot.
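The five-number summary can be sketched in Python as follows. One assumption to flag: quartile rules differ between textbooks and software, and the notes do not spell one out. This sketch takes Q1 and Q3 as the medians of the lower and upper halves of the ordered list, leaving the median itself out of both halves when n is odd. The function name `five_number_summary` is a hypothetical helper.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves of the ordered data; when n is odd, the middle observation is
    excluded from both halves. Needs at least two observations.
    """
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: n // 2]           # observations left of the median's position
    upper = ordered[n // 2 + n % 2 :]   # observations right of the median's position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(five_number_summary(ruth))
```

Other quartile conventions can give slightly different Q1 and Q3 for the same data, so compare against the calculator's 1-Var Stats output before relying on this.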
Boxplot
Shows less detail than histograms or stem plots.
Best used for side-by-side comparison of more than one distribution.
Gives a good indication of symmetry or skewness of a distribution.
Regular boxplots conceal outliers.
Modified boxplots put outliers as isolated points.
For the following data, create a side-by-side boxplot, describe the plots, and identify any possible outliers.
Babe Ruth’s HR totals: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
Roger Maris’ HR totals: 8, 16, 23, 14, 26, 33, 39, 95, 28, 13 (95 is not a real HR fyi)
Min Q1 M Q3 Max
Ruth:
Maris:
Can use the calculator to find the median and quartiles using
1-Var Stats (scroll down to second page of results)
Outlier
An observation that falls outside the overall pattern of the data
Calculated by using the IQR - interquartile range (Q3 - Q1)
Anything smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR) is an outlier. These are thresholds.
Do NOT draw thresholds on box plot – they are not data values!
Why 1.5 x IQR? John Tukey, inventor of boxplots, answered that one was not enough and two was too much.
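The 1.5 × IQR rule above can be sketched in Python. The helper names (`quartiles`, `outlier_thresholds`, `find_outliers`) are hypothetical, and the quartile convention (medians of the lower and upper halves, excluding the middle value when n is odd) is an assumption, since the notes do not state one.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[n // 2 + n % 2 :]
    return median(lower), median(upper)

def outlier_thresholds(data):
    """Return the lower and upper thresholds Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data):
    """Return the observations that fall strictly outside the thresholds."""
    low, high = outlier_thresholds(data)
    return [x for x in data if x < low or x > high]

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 16, 23, 14, 26, 33, 39, 95, 28, 13]

print("Ruth: ", outlier_thresholds(ruth), find_outliers(ruth))
print("Maris:", outlier_thresholds(maris), find_outliers(maris))
```

The thresholds are returned separately from the outliers to mirror the point in the notes: they are cutoffs used for identification, not data values to be plotted.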
TI-83/84 can display up to three boxplots on the same scale. Use StatPlot. Can select quick or full boxplot. Identify the list where the data is located and the frequency list if used. Use Zoom 9 to draw the graph. (Remember: clear out any equations in “Y=” first!)
To draw a “full” or “modified” boxplot (modified boxplot: modified to show outliers as isolated points)
Identify thresholds: start with the 5-number summary and IQR. “Thresholds” are the values above (or below) which any data values are considered outliers.
Don’t draw thresholds on boxplot—they are not data values. Only use them to identify outliers.
How to determine shape of distribution from a box plot
For Ruth and Maris create a modified side by side box plot, describe the plots, identify any outliers.
Homework: p. 74 #27 – 32; p. 82 #33, 35 – 38
Measuring Spread:
Why might the Range not give a good picture of the spread of data?
Create two dot plots.
1) 3, 3, 3, 3, 3, 9
2) 2, 2, 3, 4, 6, 6, 8
Variance (sample data – s²; population data – σ²)
The average of the squares of the deviations of the observations from their mean.
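In symbols, the variance of n observations x1, x2, …, xn is
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
or
$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$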
Standard deviation (sample data – s pop. data – σ (sigma))
The square root of variance.
Compares observations to their mean
Degrees of Freedom is n – 1.
How to find the mean and standard deviation from their definitions.
With the list of numbers, calculate the variance and standard deviation using the formula. 5, 6, 7, 8, 10, 12
1. Find the mean (x̅):
2. Plot the points on a number line:
3. Find s² and s
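The three steps can be followed line by line in a short Python sketch that works straight from the definition (squared deviations from the mean, divided by n − 1). The variable names are my own.

```python
from math import sqrt

data = [5, 6, 7, 8, 10, 12]
n = len(data)

# Step 1: the mean.
x_bar = sum(data) / n

# Step 2: each observation's deviation from the mean
# (the distances you would see on the number line).
deviations = [x - x_bar for x in data]

# Step 3: average the squared deviations using n - 1, then take the square root.
squared_deviations = [d ** 2 for d in deviations]
variance = sum(squared_deviations) / (n - 1)
s = sqrt(variance)

print("mean:", x_bar)
print("deviations:", deviations)
print("s^2:", variance, " s:", round(s, 4))
```

For this list the mean is 8, the squared deviations add up to 34, and 34 / 5 gives s² = 6.8.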
Use the calculator to find the variance and standard deviation of these metabolic rates for 7 men who took part in a study:
1792, 1666, 1362, 1614, 1460, 1867, 1439.
1. Find the mean (x̅):
2. Plot the points on a number line:
3. Find s² and s
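As a cross-check on the calculator, the same quantities can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) formulas. This is a sketch for checking answers, not a replacement for the calculator steps.

```python
from statistics import mean, stdev, variance

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

rate_mean = mean(rates)          # sample mean
rate_variance = variance(rates)  # sample variance, divides by n - 1
rate_sd = stdev(rates)           # square root of the sample variance

print("mean:", rate_mean)
print("s^2:", round(rate_variance, 2))
print("s:", round(rate_sd, 2))
```

On the TI-83/84, the value to compare against is Sx (the sample standard deviation), not σx.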
Properties of Variance:
Uses squared deviations from the mean because the sum of all the deviations not squared is always zero.
Has square units.
Found by taking an average but dividing by n-1.
The sum of the deviations is always zero, so the last deviation can be found once the other n-1 deviations are known.
Means only n-1 of the squared deviations can vary freely, so the average is found by dividing by n-1.
n-1 is called the degrees of freedom.
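The degrees-of-freedom argument can be seen numerically. This sketch reuses the list 5, 6, 7, 8, 10, 12 from the exercise above and shows that the deviations sum to zero, so the last one is fixed by the other n − 1.

```python
data = [5, 6, 7, 8, 10, 12]
x_bar = sum(data) / len(data)
deviations = [x - x_bar for x in data]

# The deviations from the mean always add up to zero.
total = sum(deviations)

# So the last deviation can be recovered from the first n - 1 of them.
last_from_others = -sum(deviations[:-1])

print("sum of deviations:", total)
print("last deviation:", deviations[-1], "recovered:", last_from_others)
```

With data whose mean is not exactly representable in floating point, the sum may come out as a tiny nonzero number rather than exactly 0; that is rounding error, not a counterexample.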
Properties of Standard Deviation
1) s measures spread about the mean; only use when using mean as center.
2) s = 0 when all observations are equal.
3) s is not resistant
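Properties 2 and 3 can be checked with a short Python sketch using data already in these notes: the all-equal list from the dot plot exercise, and the Maris totals with and without the extreme value 95.

```python
from statistics import stdev

all_equal = [3, 3, 3, 3, 3]
maris = [8, 16, 23, 14, 26, 33, 39, 95, 28, 13]
maris_without_95 = [x for x in maris if x != 95]

sd_equal = stdev(all_equal)           # no spread at all
sd_with = stdev(maris)                # includes the extreme value
sd_without = stdev(maris_without_95)  # extreme value removed

print("all equal:", sd_equal)
print("Maris with 95:", round(sd_with, 2))
print("Maris without 95:", round(sd_without, 2))
```

Removing one observation cuts s by more than half, which is why s is a poor choice when strong outliers are present.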
How to choose between the 5 number summary and x̅ with s to describe center and spread:
1) Use the 5 number summary when the distribution is skewed or has strong outliers.
2) Use x̅ and s when the distribution is reasonably symmetric.
Summary:
Always plot your data!!!!! Always show title and labels!!!
Graphs
Give the best overall picture of a distribution.
Numerical measures of center and spread
Only give specific facts about a distribution.
Do not describe its entire shape.
Can give a misleading picture of a distribution or the comparison of two or more distributions.
Homework: p. 89 #39 – 43
Linear Transformations
Why would we want to transform the data??
When univariate data span several orders of magnitude, you have a problem with the large or small values “dominating” the distribution.
How would you create a stem plot of the following data?
3066, 34, 39769, 1396, 902, 307, 7, 10, 540
How do you think the stem plot would look? Useful??
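The goal is to change skewed univariate data to be approximately normal and bivariate data to be more linear across a scatter plot. (Changing shape in this way requires a nonlinear transformation; a linear transformation does not change the shape of a distribution.)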
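To see the "orders of magnitude" problem concretely, the sketch below applies a base-10 logarithm to the data above. The choice of log10 is my assumption for illustration; the notes do not name a specific transformation at this point, and a logarithm is not a linear transformation.

```python
from math import log10

values = [3066, 34, 39769, 1396, 902, 307, 7, 10, 540]

# On the raw scale the largest value is thousands of times the smallest.
raw_ratio = max(values) / min(values)

# On the log scale every value lands between 0 and 5.
logs = [round(log10(v), 2) for v in values]
log_range = max(logs) - min(logs)

print("largest / smallest:", round(raw_ratio))
print("log10 values:", sorted(logs))
print("range on log scale:", round(log_range, 2))
```

A stemplot of the logged values needs only a handful of stems, whereas a stemplot of the raw values would be dominated by 39769.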
Linear Transformations
Changes the original variable x into the new variable xnew (x*).
x* = a + bx
Does not change the shape of a distribution.
Can change the center, the spread, or both.
The effects of the changes follow a simple pattern.
Adding the constant (a) shifts all values of x upward or downward by the same amount.
Adds (a) to the measures of center and to the quartiles but does not change measures of spread.
Multiplying by the positive constant (b) changes the size of the unit of measurement.
Multiplies both the measures of center (mean and median) and the measures of spread (standard deviation and IQR) by (b).
Example: you have 5 observations; 3, 5, 10, 12, 15
Find x̅, M, and s:
1) For the obs
2) Obs + 10
3) Obs * 6
4) Obs * 6 + 10
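The four cases in this example can be checked with a short Python sketch. The helper names `summarize` and `transform` are hypothetical; `transform` applies x* = a + bx to every observation.

```python
from statistics import mean, median, stdev

obs = [3, 5, 10, 12, 15]

def summarize(data):
    """Return (mean, median, sample standard deviation)."""
    return mean(data), median(data), stdev(data)

def transform(data, a, b):
    """Apply the linear transformation x* = a + b*x to each observation."""
    return [a + b * x for x in data]

results = {
    "obs": summarize(obs),
    "obs + 10": summarize(transform(obs, a=10, b=1)),
    "obs * 6": summarize(transform(obs, a=0, b=6)),
    "obs * 6 + 10": summarize(transform(obs, a=10, b=6)),
}

for label, (x_bar, m, s) in results.items():
    print(f"{label:>12}: mean={x_bar}, median={m}, s={s:.3f}")
```

Adding 10 moves the mean and median by 10 and leaves s alone; multiplying by 6 multiplies all three by 6.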
For the transformation x*= a+bx:
1) All measures are multiplied by b
2) a is added to x̅ and 5 number summary
(Measures of spread are unaffected by a)
x̅* = a + bx̅    M* = a + bM    Q1* = a + bQ1
s* = bs    min* = a + b(min)    Q3* = a + bQ3
max* = a + b(max)    IQR* = b(IQR)
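A short derivation, assuming b > 0 as in these notes, of why the constant a drops out of the standard deviation. It uses only the definitions of the mean and the sample variance given earlier.

```latex
\begin{align*}
\bar{x}^* &= \frac{1}{n}\sum (a + b x_i) = a + b\,\bar{x} \\
x_i^* - \bar{x}^* &= (a + b x_i) - (a + b\,\bar{x}) = b\,(x_i - \bar{x}) \\
(s^*)^2 &= \frac{1}{n-1}\sum \left(x_i^* - \bar{x}^*\right)^2
         = \frac{b^2}{n-1}\sum (x_i - \bar{x})^2 = b^2 s^2 \\
s^* &= b\,s \qquad (b > 0)
\end{align*}
```

The constant a cancels in the second line, which is why measures of spread are unaffected by it. The same cancellation explains IQR* = b(IQR), since Q3* − Q1* = (a + bQ3) − (a + bQ1).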
Chapter review
D) Items used for describing plots: