Math Q114 Chapter 1 supplementary materials © Mark Pawlak & Maura Mast
Frequency Histograms vs. Relative Frequency Histograms
We have been analyzing the ages of the 20 or so students in this class to determine the mean and median ages, as well as to look at how ages are distributed among us. Suppose we now ask: in terms of age, is our class typical of all students enrolled in Math Q114 this semester? What does it mean that there are 15 students in our class between the ages of 17 and 22 while there are, let’s say, 163 students in all QR classes in the same age range? How can we compare the two groups when they are of vastly different sizes?
When comparing the distributions of two populations of different sizes, a relative frequency histogram, which shows the percent of each population in each interval, is most useful.
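The arithmetic behind a relative frequency histogram can be sketched in a few lines of Python. This is only an illustration: the counts 15 and 163 come from the question above, while the class size of exactly 20, the total of 250 students in all sections, and the function name relative_frequencies are assumptions made up for the example.

```python
def relative_frequencies(counts):
    """Convert the count in each interval into a percent of the group total."""
    total = sum(counts.values())
    return {interval: 100 * n / total for interval, n in counts.items()}

# The 15 and 163 come from the text. Every other number is hypothetical,
# chosen only so that each group has a total to divide by.
our_class = {"17-22": 15, "23 and over": 5}        # assumes exactly 20 students
all_sections = {"17-22": 163, "23 and over": 87}   # assumes 250 students in all

class_pct = relative_frequencies(our_class)
all_pct = relative_frequencies(all_sections)

for interval in our_class:
    print(f"{interval}: class {class_pct[interval]:.1f}%, "
          f"all sections {all_pct[interval]:.1f}%")
```

Under these assumed totals, 15 out of 20 is 75% while 163 out of 250 is about 65%, so the percents can be compared directly even though the raw counts cannot.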
See below how the relative frequency distributions of the populations of China and the USA allow you to compare the two countries in ways that the frequency distributions do not.
The 1998 population of China was 1,236,915,000 people, while the US population that same year was only 270,290,000 people.
Except for people age 85 years and older, there were more Chinese than Americans living in each 5-year age interval. From this graph, however, it is hard to compare the way each country’s population was distributed in 1998. Is the majority of the US population under the age of 35, as appears to be the case in China? The relative frequency histogram below helps us see this clearly, since it shows what percent of each country’s population falls in each 5-year interval.
Data Intervals & Data Compression
When constructing a histogram, there are several things you must bear in mind having to do with the data intervals represented on the horizontal axis. First, remember that the intervals must all be the same size. They cannot overlap, say 15-19 years of age and 19-23 years of age, because then anyone who is age 19 will be counted twice, once in the first interval and once in the second. You also need to make sure that there are no numerical gaps between intervals, say ages 15-19 and then ages 22-26, skipping ages 20 and 21. The intervals have to be contiguous, i.e., where one ends, the next begins.
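The three rules above (same size, no overlaps, no gaps) can be checked mechanically. The following Python sketch is one possible way to do it, assuming ages are whole numbers and each interval includes both of its endpoints. The function name check_intervals is made up for this illustration.

```python
def check_intervals(intervals):
    """Return a list of problems found in a list of (low, high) age intervals.

    Each interval includes both endpoints, so (15, 19) covers ages 15 through 19.
    """
    problems = []

    # Rule 1: every interval must cover the same number of ages.
    widths = {high - low + 1 for low, high in intervals}
    if len(widths) > 1:
        problems.append("intervals are not all the same size")

    # Rules 2 and 3: compare each interval with the one that follows it.
    ordered = sorted(intervals)
    for (low1, high1), (low2, high2) in zip(ordered, ordered[1:]):
        if low2 <= high1:
            problems.append(f"{low1}-{high1} and {low2}-{high2} overlap")
        elif low2 > high1 + 1:
            problems.append(f"gap between {low1}-{high1} and {low2}-{high2}")
    return problems

print(check_intervals([(15, 19), (19, 23)]))   # age 19 falls in both intervals
print(check_intervals([(15, 19), (22, 26)]))   # ages 20 and 21 are skipped
print(check_intervals([(15, 19), (20, 24)]))   # contiguous, so no problems
```

The first two calls use the faulty examples from the paragraph above. The third shows a corrected pair of intervals, for which the list of problems comes back empty.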
Having said all this about intervals, it is important to note, however, that the interval size you use is entirely your choice. Trial and error is the best approach to choosing the interval size. The characteristics of each data set will be best shown by a particular interval size for grouping the data.
Data compression can be achieved by changing the size of the intervals in a histogram. If the intervals are too small (see the 1-year interval graph), the histogram may be very busy, making it hard for the reader to detect patterns. If the intervals are too large, the data become so compressed that essential details about patterns are lost (see the 10-year interval graph). Choosing the right interval size is a matter of trial and error until you select a size that best illustrates the patterns that exist in the distribution you are studying.
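To see how changing the interval size compresses data, here is a small Python sketch that regroups one-year counts into wider intervals. The counts are hypothetical (they are not Census figures), and the function name regroup is made up for this illustration.

```python
def regroup(counts_by_age, width):
    """Combine one-year counts into intervals that are `width` years wide."""
    grouped = {}
    for age, n in counts_by_age.items():
        low = (age // width) * width          # lowest age in this interval
        label = f"{low}-{low + width - 1}"
        grouped[label] = grouped.get(label, 0) + n
    return grouped

# Hypothetical one-year counts with two peaks, one at age 21 and one at age 27.
counts_by_age = {20: 3, 21: 5, 22: 4, 23: 2, 24: 1,
                 25: 2, 26: 6, 27: 7, 28: 3, 29: 1}

print(regroup(counts_by_age, 5))    # 5-year intervals
print(regroup(counts_by_age, 10))   # 10-year intervals
```

With 5-year intervals the two groups (15 and 19 people) are still visible. With a 10-year interval everything collapses into a single bar of 34 people, and the two peaks in the one-year data can no longer be seen.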
Here are three examples of different interval sizes for the population distribution of the U.S. (using the 1990 Census).
Kinds of Distributions:
uniform, symmetric, skewed right or left, (bimodal?)
Maura’s lecture and examples go here followed by the Worksheet.