MLI Pretest/Posttest Questions – Statistics

  1. Give a detailed description of the data that is illustrated in the following graphs.
  1. The histogram below shows the sugar content (as a percentage of weight) of 58 brands of breakfast cereals.

    The data is bimodal. There appear to be two distinct groups of cereals: perhaps sugary cereals (28% - 64%) aimed at kids and non-sugary cereals (0% - 24%) aimed at adults. Before doing much more analysis, we might want to disaggregate the data.
    The median is about 16%. The data is skewed to the right, so the mean is somewhat higher. Neither is a particularly good representative of the data, since the distribution is bimodal.
  1. The boxplot below illustrates running speeds of various land mammals.

There are two outliers, one at about 70 mph (cheetah?) and one at about 10 mph. The rest of the data lies between about 11 mph and about 50 mph. The data looks somewhat symmetrical; if you ignore the outliers, it looks slightly skewed to the left. The median is about 37 mph. The mean is probably about the same, or maybe slightly lower. Either looks like a good representative of the data. The distribution looks roughly normal, with the middle 50% of the data between about 30 mph and 42 mph.
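As a rough check on the outlier reading, the sketch below applies the common 1.5 × IQR rule to the quartiles read off the plot (about 30 and 42 mph). Whether this boxplot was actually drawn with that rule is an assumption, and the function name `tukey_fences` is only illustrative.

```python
# Quartiles read approximately off the boxplot (values from the answer above).
q1, q3 = 30.0, 42.0


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; points outside them are flagged as outliers.

    Assumes the usual 1.5 * IQR convention, which the source does not confirm.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(q1, q3)
print("fences:", lower, upper)

# The two points identified as outliers in the answer above.
for speed in (10, 70):
    print(speed, "mph flagged as outlier:", speed < lower or speed > upper)
```

Since the quartiles are only eyeballed, the fences are approximate; a value near a fence could fall on either side.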

  1. The scatterplot below shows the number of twins born in the U.S. from 1981 to 2005 (source: National Vital Statistics Reports, Volume 56, No. 6), as well as the least-squares regression line.

  1. Describe the trend in the number of twins born, using the above model.
    In general, the number of twins born grows by about 2,685 each year.
  2. Using the above model, how many twins would we expect to be born in the year 2010? Describe how confident you are in this prediction, and why.
    In the year 2010, we expect 2684.8(2010) – 5,252,000 = 144,448 twins to be born. I'm fairly confident about this prediction. This is an extrapolation, but it's not very far outside the domain of our data, and the linear model fits quite well (r² = 0.98, which is very high). In general, growth such as this is not linear; it's logistic or exponential. However, a linear model works well for short intervals.
  3. Use the above model to estimate the number of twins that were born in the year 1900. Describe how confident you are in this prediction, and why.

In the year 1900, the model estimates that 2684.8(1900) – 5,252,000 = -150,880 twins were born. This is obviously nonsense, since a negative number of births has no meaning in this context. The problem is that we are extrapolating far outside the domain of our data, and as stated above, linear models generally don't work when describing any sort of population growth over a long interval.
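The two predictions above can be reproduced with a short sketch. The slope and intercept are the values quoted in the answers (the intercept is presumably rounded), and the function name `predicted_twins` is just an illustrative choice.

```python
# Coefficients of the least-squares line as quoted in the answers above.
SLOPE = 2684.8          # additional twins per year
INTERCEPT = -5_252_000  # presumably rounded in the source


def predicted_twins(year):
    """Evaluate the linear model at a given calendar year."""
    return SLOPE * year + INTERCEPT


# 2010 is a short extrapolation; 1900 is far outside the 1981-2005 data.
for year in (2010, 1900):
    print(year, round(predicted_twins(year)))
```

The negative value for 1900 comes straight from the linear form: the line crosses zero well before the data begin, which is why long extrapolations with it are meaningless.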

  1. Ten high school students participated in a survey in which they estimated their GPA and how many hours of television they watched in a week. The results are below:

Participant / GPA / TV hours per week
#1 / 3.1 / 14
#2 / 2.4 / 10
#3 / 2.0 / 20
#4 / 3.8 / 7
#5 / 2.2 / 25
#6 / 3.4 / 9
#7 / 2.9 / 15
#8 / 3.2 / 13
#9 / 3.7 / 4
#10 / 3.5 / 21

Sketch a scatterplot of the data, and calculate the formula for the least-squares regression line, where the TV hours are the explanatory (independent) variable and the GPA is the response (dependent) variable. Comment on whether or not you think it is a good model. What does this model tell us about high school students, their GPAs, and their television watching habits?

The coefficient of determination (r²) is 39%, which is not particularly high. It means that TV watching hours explain about 39% of the variance in GPA, while the other 61% is unexplained. For that reason, the model is not that great.

This model shows a negative correlation between TV watching hours and GPA. In particular, a one-hour weekly increase in TV watching predicts a decrease of about 0.06 in GPA, so a 17-hour weekly increase in TV watching predicts a decrease of about 1 point in GPA.

This correlation should not be used to imply causation, however. This model does not say that watching TV will lower a GPA. It also does not say that raising your GPA will make you watch less TV.

The model tells us that, in general, high school students who watch less TV tend to have higher GPAs, with a predicted increase of about 0.06 GPA points for every hour less of weekly TV watching. Other variables may also affect both TV watching and GPA.
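The slope and r² quoted in this answer can be checked directly from the survey table. The following is a minimal sketch using the textbook least-squares formulas in plain Python; the function and variable names are illustrative, not part of the original answer.

```python
# Survey data from the table above, in participant order #1..#10.
tv_hours = [14, 10, 20, 7, 25, 9, 15, 13, 4, 21]
gpa = [3.1, 2.4, 2.0, 3.8, 2.2, 3.4, 2.9, 3.2, 3.7, 3.5]


def least_squares(x, y):
    """Return (slope, intercept, r_squared) for the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = least_squares(tv_hours, gpa)
print(f"GPA = {intercept:.2f} + ({slope:.4f}) * TV hours, r^2 = {r_squared:.2f}")
```

The result should agree with the slope of about -0.06 and the r² of about 39% discussed above. With only ten self-reported data points, a single participant (such as #10, with a high GPA and 21 TV hours) can shift the fit noticeably.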

  1. The Astros host the New York Mets. The Astros generally play better with the stadium roof closed, whereas the Mets play better with it open. If the roof is closed, then the Astros have a 60% probability of winning the game. If the roof is open, then the Astros have a 45% probability of winning the game.
    There is a 30% chance of rain tomorrow. If it rains, then the roof will be closed. If it does not rain, then the probability that the roof is closed is 25% (it might be really hot).
  2. What is the probability that the Astros win the game tomorrow?
  3. Given that the Astros win the game, what is the probability that it did not rain?
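One way to approach both parts is the law of total probability followed by Bayes' rule. The sketch below is an illustration rather than an official answer: it assumes the game outcome depends on the weather only through whether the roof is closed, and all variable names are hypothetical.

```python
# Probabilities stated in the problem.
p_rain = 0.30
p_closed_given_rain = 1.0   # the roof is always closed when it rains
p_closed_given_dry = 0.25
p_win_given_closed = 0.60
p_win_given_open = 0.45


def p_win_given_roof_prob(p_closed):
    """P(Astros win) when the roof is closed with probability p_closed.

    Assumes the win probability depends only on the roof state.
    """
    return p_closed * p_win_given_closed + (1 - p_closed) * p_win_given_open


# Joint probabilities of winning with rain and winning without rain.
p_win_and_rain = p_rain * p_win_given_roof_prob(p_closed_given_rain)
p_win_and_dry = (1 - p_rain) * p_win_given_roof_prob(p_closed_given_dry)

# Law of total probability, then Bayes' rule.
p_win = p_win_and_rain + p_win_and_dry
p_dry_given_win = p_win_and_dry / p_win

print(f"P(win) = {p_win:.5f}")
print(f"P(no rain | win) = {p_dry_given_win:.4f}")
```

Under these assumptions, the probability of an Astros win comes out to roughly 0.52, and the probability that it did not rain given a win comes out to roughly 0.65.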