The Joy of Stats
Chapter One
Claudio is studying palm tree heights in the state of Pernambuco, Brazil. In this study, which of the following is a variable?
a.Location = Pernambuco.
b.Tree height.
c.The height of any specific tree—for example, 3 metres.
(Answer: b; height)
In this study, which of the following is the value of a variable?
a.Location = Pernambuco.
b.Tree height.
c.The height of any specific tree—for example, 3 metres.
(Answer: c; a value for a specific observation)
In this part of the study, what are the cases or units of analysis?
a.Location.
b.Tree heights.
c.Metres.
d.Palm trees.
(Answer: d; the variables are measured on trees.)
Claudio discovers that the average height of palm trees is different in 10 distinct areas depending on the average yearly rainfall of the area. In this part of the study, the units of analysis are:
a.Rainfall in inches.
b.The 10 areas.
c.Pernambuco.
d.Palm trees.
e.Tree height in metres.
(Answer: b; the 10 areas. The concept is place-level vs. individual-level units of analysis.)
In the part of the study in which Claudio looks at the average height of palm trees and the average yearly rainfall in 10 different areas, what is the best statement of his research question?
a.Is there variation in the average yearly rainfall of Pernambuco?
b.Which forest area has the highest average annual rainfall?
c.Is average tree height in an area related to average annual rainfall in the area?
d.Does a heavy rain cause trees to grow higher?
e.Do taller trees cause heavier rains?
(Answer: c; “related”—remaining initially cautious about causality and the direction of effects)
Which of the following is a constant, not a variable?
a.The provinces of Canada, such as Alberta, Manitoba, Quebec, etc.
b.Children’s heights in the nation of Laos.
c.Vietnam.
d.Suicide rates of European countries.
e.5 different species of mould found in Texas.
(Answer: c)
Short Essay
Petra wants to know if countries characterized by a high level of social inequality have higher rates of homicide than countries with low levels of inequality.
- Question 1: Which is the better choice of independent variable for this question and why?
- Question 2: Both of the variables in the study present problems of operationalization. Why? And what would you do to operationalize the variables?
For each pair, identify the independent variable for the research question OR indicate that either variable could be the independent variable:
a.An individual’s sex category at birth and height.
b.Ambient humidity and extent of mould damage in buildings.
c.U.S. cities’ unemployment rates and their poverty rates.
d.Level of truck traffic in cities and level of air pollution.
e.Level of air pollution in cities and level of water pollution in cities.
(Answers: a) sex category, b) humidity, c) either one, d) truck traffic, e) either one)
Short Essay or Discussion Question
For which of the preceding pairs of variables could there be a causal relationship?
(Answer: For a, b, and d, we could make a good case for a causal relationship, but this relationship is less evident for c and e.)
Identify the level of measurement for the following variable: weight of individual human beings measured in kilograms.
a.Nominal.
b.Kilograms.
c.Ordinal.
d.Interval-Ratio.
e.None of the above.
(Answer: d; interval-ratio)
As a Fill-in-the-Blanks Item
Weight measured in kilograms is considered to be at the ______ level of measurement.
(Answer: interval-ratio)
Identify the level of measurement for the following variable: Rank order of cities as tourist destinations in Mexico (first, second, third, etc.).
a.Nominal.
b.Ordinal.
c.Interval-Ratio.
d.Continuous.
e.None of the above.
(Answer: b; ordinal)
Identify the level of measurement for a variable defined on students at the University of Toronto so that all students with a Y chromosome are defined as male and all others as not-male.
a.Nominal.
b.Ordinal.
c.Interval-Ratio.
d.Dichotomous.
e.All of the above.
(Answer: e; all of the above)
Discussion Question
Up until the 1960s Olympic athletes were identified as men or women by an inspection of external genitalia. Over the years, testing procedures have become increasingly complex, multifaceted, and controversial. What are the issues? (For example, is it fair for women to compete against someone who has many male physical characteristics?) What do you think is the best operational definition of sex categories in athletic competition?
Note to the instructor: Students may find it interesting to read about Caster Semenya, a South African runner, featured in an article by Ariel Levy, “Either/Or: Sports, Sex, and the Case of Caster Semenya,” The New Yorker, November 30, 2009.
Fill in the Blanks
Provide examples of variables at each level of measurement.
Franco is studying beauty ideals, and part of his study involves comparing the average age in years of women featured in celebrity magazines in three global regions: Europe, Asia, and South America. In this part of the study, the units of measurement are:
a.Countries.
b.Celebrities.
c.Magazines.
d.Years.
e.Nominal-level data.
(Answer: d; years)
For the study of beauty images in the preceding question, Franco defines a variable called “estimated body mass index” by looking at photos of the celebrities and estimating their body mass index from the information in the photographs. Creating this variable is an example of ______ a variable:
a.Deducing.
b.Operationalizing.
c.Dichotomizing.
d.Theoretically conceptualizing.
e.Maximizing.
(Answer: b; operationalizing)
In the study of beauty ideals, which variable pertaining to the celebrities would be the most difficult to operationalize?
a.Height.
b.Weight.
c.Beauty.
d.Age.
e.Skin tone.
(Answer: c; beauty)
Saba is studying the AIDS death rates of 60 countries. The units of analysis or cases in her study are:
a.Individuals.
b.Places (countries).
c.Number of deaths.
d.Operational variables.
e.All of the above.
(Answer: b; places/countries)
The data file for the study of countries’ AIDS death rates would have how many cases?
a.One for each individual who died.
b.A case for each person in the total population of the 60 countries regardless of whether they died or not.
c.One for each person who contracted AIDS in the 60 countries.
d.60 cases, one for each country in the study.
e.Can’t tell; not enough information is available to answer the question.
(Answer: d; 60 cases. Each country is a case.)
If Teaching SPSS/PASW
All information for a case would be located in a ______ of the data file:
a.Column.
b.Row.
c.Cell.
d.Variable definition.
e.None of the above.
(Answer: b; row)
Chapter Two
In his study of palm trees, Claudio found that, in one cluster of trees, 343 are healthy, 25 are in early stages of a disease, and 20 are in an advanced stage. In the frequency distribution of tree health, what percentage are in the early stage of disease? ______
a.10.1
b.5.7
c.6.4
d.11.2
e.Can’t answer the question with the available information.
(Answer: c; 6.4%)
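Note to the instructor: The arithmetic behind this answer can be checked with a short Python sketch. The counts come from the problem; the dictionary name and category labels are illustrative assumptions, not part of Claudio's data file.

```python
# Frequency counts from the problem statement (labels are illustrative).
tree_health = {"healthy": 343, "early stage": 25, "advanced stage": 20}

# Total number of cases in the cluster.
total = sum(tree_health.values())

# Percentage of cases falling in each category of the frequency distribution.
percentages = {label: 100 * count / total for label, count in tree_health.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct:.1f}%")
```

Students who choose one of the other options have usually divided by the wrong total, so it can help to have them state the denominator (all 388 trees) before computing.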
For the variable tree health, defined in the preceding problem, which summary measures can be obtained?
a.Mean.
b.Median.
c.Mean, median, and mode.
d.Mean, median, variance, and standard deviation.
e.None of the above.
(Answer: b; median. It is ordinal data.)
To display tree health, as defined in the preceding problems, which graphs would be appropriate?
a.Bar chart.
b.Bar chart and histogram.
c.Bar chart and pie chart.
d.Histogram.
e.Histogram and boxplot.
(Answer: c; bar chart and pie chart for nominal and ordinal data)
Which statement best characterizes the variance of a distribution?
a.It is the square of the standard deviation.
b.It is a measure of variability in a distribution.
c.It is the average of the squared deviations of observations from the mean of the distribution.
d.It is never a negative number.
e.All of the above.
(Answer: e; all of the above)
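Note to the instructor: A small numerical check can make the four statements concrete. The sketch below uses made-up observations and the definition in option c, which divides by the number of cases; a sample variance that divides by N - 1 would differ slightly.

```python
import statistics

# Made-up observations, chosen only so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Variance as the average of the squared deviations from the mean (option c).
squared_deviations = [(x - mean) ** 2 for x in values]
variance = sum(squared_deviations) / len(values)

# Population standard deviation; squaring it should reproduce the variance (option a).
std_dev = statistics.pstdev(values)

print(variance, std_dev ** 2)
```

Because every squared deviation is zero or positive, their average cannot be negative, which is the point of option d.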
When is the median a better summary measure than the mean to represent a distribution?
a.When it is larger than the mean.
b.When the distribution is skewed.
c.When a distribution has two long symmetrical tails.
d.When the level of measurement is nominal.
e.When it is smaller than the mean.
(Answer: b; skewed)
A measure of ______ should always be reported along with the measure of central tendency.
a.Similarity.
b.The average of the distribution.
c.Magnitude.
d.Variability.
e.None of the above.
(Answer: d; variability)
The administration at Big U has computed the mean and median of the family incomes of incoming students. In order to represent the income distribution accurately, they should also compute the ______ of family income.
a.Similarities.
b.Standard deviation.
c.50th percentile.
d.Absolute value.
e.Central tendency.
(Answer: b; standard deviation)
Which of the following are measures of central tendency?
a.Median.
b.Mode.
c.Mean.
d.All of the above.
e.None of the above.
(Answer: d; all of the above)
Where is the line indicating the median placed in a boxplot?
a.In the middle of the box.
b.In the box but not necessarily in the middle.
c.Always at the upper edge of the box.
d.Always at the lower edge of the box.
e.It cannot be specified—depends on the shape of the distribution.
(Answer: b; in the box but not necessarily in the middle)
Which statement best describes a stem-and-leaf chart?
a.The raw scores are preserved in the chart.
b.It is organized similarly to a histogram.
c.It is used with interval-ratio data.
d.The “leaves” represent digits that are further to the right in the values of the observations.
e.All of the above.
(Answer: e; all of the above)
In a histogram:
a.The horizontal dimension refers to the frequency of each value, and the vertical dimension shows the range of values of the variable.
b.The vertical dimension refers to the frequency of each value, and the horizontal dimension shows the range of values of the variable.
c.The vertical dimension refers to the percentiles of the distribution.
d.Each bar represents a category of a categoric variable.
e.All of the above.
(Answer: b; vertical displays frequency, horizontal displays values)
Which statement is true of the mean of a distribution?
a.It can never be 0.
b.All values or observations must be included in the calculation.
c.It is a measure of dispersion.
d.It cannot be computed for a binary variable.
e.All of the above.
(Answer: b; all values must be included in the calculation)
Claudio has created a data file for his palm tree data. Each of 1500 trees is characterized by 4 variables: height in metres, state of health (healthy, early stage disease, advanced stage disease), location in one of 10 forest areas, and tree diameter in metres. For which variable(s) would a mean be meaningful?
a.Height and diameter.
b.Height.
c.Height and location.
d.Health and height.
e.All of the above.
(Answer: a; height and diameter)
Claudio creates boxplots of tree height for trees in each of the three states of health. This graph displays:
a.One variable: health.
b.Two variables: health and height.
c.Three variables: one for each of the three states of health.
d.One variable: height.
e.Nothing: it cannot be made because health is not an interval-ratio variable.
(Answer: b; two variables, health and height. However, when this chart is made, there will be three boxplots of tree heights—one for each state of health—although there are only two variables. Before answering this question, students might need help to create these types of boxplots that use a categoric variable and show boxplots of the interval-ratio variable displayed for each category of the categoric variable.)
For which of the following variables would an index of diversity and an index of qualitative variation NOT be appropriate summary measures?
a.Height in metres of students in Canadian high schools.
b.Types of sports in which students in a school choose to participate.
c.Species of predators found in a province of Canada.
d.Types of mental illness diagnosed in a community mental health centre during the course of a month.
e.The two indexes can be computed for all of the above variables.
(Answer: a; height in metres. This is an interval-ratio variable, not a categoric one.)
According to Chebycheff’s Inequality, in a distribution with a finite mean and variance, we would expect a score or value that is five standard deviations from the mean to appear with a probability of no more than:
a.5
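b.1/25
c.0 (never)
d.68%
e..997
(Answer: b; 1/25. Chebycheff’s Inequality bounds the probability of a value lying k or more standard deviations from the mean at 1/k², and 1/5² = 1/25.)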
Claudio discovers the mean height of palm trees in the region is 5 metres with a standard deviation of .5 metres. He finds one tree that is 5.5 metres tall. Its Z-score in the height distribution is:
a.+5.5
b.+.5
c.1
d.0
e.Can’t tell from the information provided.
(Answer: c; 1. Its height is one standard deviation above the mean.)
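Note to the instructor: The calculation can be shown as a short Python sketch. The function name z_score is an illustrative choice; the numbers are those given in the problem.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations a value lies from the mean."""
    return (value - mean) / std_dev

# Claudio's tree: 5.5 metres tall, in a distribution with mean 5 and SD 0.5.
z = z_score(5.5, 5.0, 0.5)
print(z)
```

Option b (+.5) is the raw deviation from the mean before it is divided by the standard deviation, which is a common student slip.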
Which statement is NOT true about Z-scores for any distribution with a finite mean and variance?
a.Their mean is always 0.
b.A Z-score can be a negative number.
c.Their standard deviation is always 1.
d.Values larger than the mean of the distribution have positive Z-scores.
e.Each case’s Z-score is invariant for all distributions in which the case appears.
(Answer: e; Z-scores for a case vary depending on the distribution.)
Easier Version of Preceding Question
Cassie tells us that she has a Z-score of –1 for weight in the weight distribution of her graduating class. Identify the correct statement based on this information:
a.This is not possible because Z-scores can never be negative.
b.Her weight is below the mean of the class’s weight distribution.
c.This Z-score represents her weight in any weight distribution in which she is a case.
d.This indicates that she has an extremely low weight—it is an outlier in the distribution.
e.Her weight is exactly 5 standard deviations below the mean of the class weights.
(Answer: b; her weight is below the class mean weight)
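Note to the instructor: The properties tested in the two preceding questions can be demonstrated with a small sketch. The two lists of weights are hypothetical, and standardize is an illustrative helper that uses the population standard deviation.

```python
import statistics

def standardize(values):
    """Convert every value in a distribution to its Z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical weights in kilograms; a 70 kg person appears in both groups.
class_a = [50, 60, 70, 80, 90]
class_b = [70, 80, 90, 100, 110]

z_a = standardize(class_a)
z_b = standardize(class_b)

# Within each distribution the Z-scores have mean 0 and standard deviation 1.
print(statistics.fmean(z_a), statistics.pstdev(z_a))

# The same 70 kg weight has a different Z-score in each distribution.
print(z_a[class_a.index(70)], z_b[class_b.index(70)])
```

The 70 kg case sits at the mean of the first group and below the mean of the second, so its Z-score changes with the distribution even though the weight does not.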
The algorithm for a Z-score is:
a.Z = unstandardized score minus the mean of the distribution.
b.Z = divide the number of cases into the difference between the unstandardized score and the mean of the distribution.
c.Z = unstandardized score divided by the standard deviation.
d.Z = divide the standard deviation into the difference between the unstandardized score and the mean of the distribution.
e.Z = unstandardized score divided by the variance of the distribution.
(Answer: d; divide the standard deviation into the difference between the unstandardized score and the mean of the distribution)
Another term for Z-score is:
a.Chebycheff number.
b.Standardized score.
c.Mean score.
d.Hinge score.
e.Parameter.
(Answer: b; standardized score)
Shirley weighs 120 pounds and Dalia weighs 140 pounds. We consider their weights in the distribution of weights for 220 women in their graduating class. What can we conclude based on the information provided?
a.Shirley’s weight Z-score is 12 and Dalia’s is 14.
b.Shirley’s weight Z-score is lower than Dalia’s.
c.We cannot say anything about the Z-scores because the information about the mean and standard deviation of the distribution is not provided.
d.Dalia’s score is closer to 0 than Shirley’s.
e.Dalia’s Z-score is closer to +1 than Shirley’s.
(Answer: b; Shirley’s Z-score is lower than Dalia’s)
Difficult Short Essay Question or Question for Class Discussion
How is an SPSS/PASW data file different from a frequency distribution of a variable?
(Answer: Students are sometimes confused about this, so stress that the frequency distribution displays only one variable. When we see it, we cannot identify the cases to which the values pertain; it only provides a summary of the distribution, not the full information or “raw data.” We do not know which value pertains to which case. In the data file we can see this information. The frequency distribution is a summary of information in the data file. It tells us how many cases are associated with a specific value of the variable.)
We have a data set of 1500 cases from a simple random sample of Cape Town residents. The variable is height measured to the nearest millimetre. Which of the following would be a good way of summarizing the variable distribution using SPSS/PASW?
a.A frequency distribution table for heights measured to the nearest millimetre.
b.A pie chart displaying all the heights of the 1500 individuals.
c.A bar chart.
d.Descriptives and a histogram.
e.All of the above.
(Answer: d; descriptives and a histogram. This is interval-ratio data.)
Note to the instructor: Unless the problem is called to their attention, many students would go ahead and print out a frequency table for this type of data, remaining “unfazed” by the fact that the table goes on for page after page! Emphasize that they can obtain a frequency distribution only if the data are categoric or if they are grouped into intervals—SPSS/PASW does this automatically for the histogram but not for the frequency table.
Chapter Three
The formula for the estimated standard error of the mean is:
a.Estimated SE = σ
b.Estimated SE = S/N
c.Estimated SE = S/√N
d.Estimated SE = S²/N
e.Estimated SE = S
(Answer: c; sample standard deviation divided by the square root of sample size)
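Note to the instructor: The formula can be illustrated with a short Python sketch. The sample values are hypothetical, and the sketch assumes that S is the sample standard deviation computed with N - 1 in the denominator, which is what statistics.stdev returns.

```python
import math
import statistics

# Hypothetical sample of tree heights in metres.
sample = [4.6, 5.1, 4.9, 5.4, 5.0, 4.8, 5.3, 5.2, 4.7]

n = len(sample)
s = statistics.stdev(sample)        # sample standard deviation, S

# Estimated standard error of the mean: S divided by the square root of N.
estimated_se = s / math.sqrt(n)

print(f"N = {n}, S = {s:.3f}, estimated SE = {estimated_se:.3f}")
```

Because the divisor is the square root of N, the estimated standard error is smaller than S whenever the sample has two or more cases.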
When sample size is 2 or more, the standard error of the mean is ______ the standard deviation of the underlying variable.
a.Greater than.
b.Less than.
c.Equal to.
d.Equal to 1 – s.
e.Could be any of the above; it depends on the situation.
(Answer: b; less than)
Which of the following is NOT a characteristic of the sampling distribution of the mean?
a.It is normally distributed.
b.Its mean is equal to the mean of the underlying distribution.
c.It presents an exception to Chebycheff’s Inequality.
d.Its standard deviation is sigma divided by the square root of N.
e.Its standard deviation is less than sigma, provided N > 1.
(Answer: c; exception to Chebycheff’s Inequality)
The standard deviation of the sampling distribution is called:
a.Measurement error.
b.Estimated error term.
c.The variance of the sampling distribution.
d.Standard error.
e.Bias.
(Answer: d; standard error)
Standard error represents:
a.Variability in sample outcomes as the result of randomness in sampling.
b.Errors a researcher made in measuring.
c.Errors that result from researcher bias.
d.Errors that result from inaccurate operationalization of variables.
e.All of the above.
(Answer: a; variability in sample outcomes as the result of randomness)
If we roll a pair of dice and sum the numbers that turn up on the upper face of each die, what can we say about the sum?
a.All numbers from 1 to 12 are equally likely as the value of the sum.
b.All numbers from 2 to 12 are equally likely as the value of the sum.
c.7 is the expected value.
d.12 is the expected value of the distribution and the maximum likelihood estimate.
e.We cannot draw any conclusion about the sum because it is a random process.
(Answer: c; 7 is the expected value)
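Note to the instructor: Students can verify this answer by listing all 36 equally likely outcomes for two fair six-sided dice. The sketch below does the enumeration in Python; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered pair of faces for two fair six-sided dice (36 outcomes).
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

# Probability of each possible sum, kept as exact fractions.
counts = Counter(sums)
probabilities = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

# Expected value: each sum weighted by its probability.
expected_value = sum(total * p for total, p in probabilities.items())

for total, p in probabilities.items():
    print(total, p)
print("Expected value:", expected_value)
```

The listing also shows why options a and b fail: a sum of 1 cannot occur, and the sums from 2 to 12 are not equally likely, since 7 can be formed in six ways while 2 and 12 can each be formed in only one.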