CHAPTER 3
#1) We first need to figure out how wide each of our intervals will be; then we can calculate the relative frequency (the height of the rectangle) of each interval. The data range from 0 to 18, so to make five intervals of equal width we need a width of at least 18/5 = 3.6; we use 4 to make things simple, with the convention of placing data on a boundary in the interval to the right.
The ordered data looks like 0,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,5,5,6,6,6,6,7,8,8,8,9,9,10,10,10,11,12,12,13,13,14,16,16,18.
Interval   Frequency   Relative Frequency
0-4        13          13/40 = 0.325
4-8        10          10/40 = 0.25
8-12        9           9/40 = 0.225
12-16       5           5/40 = 0.125
16-20       3           3/40 = 0.075
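The counting in #1 can be checked with a short script. This is only a sketch: the function name relative_frequencies is made up for illustration, and it assumes the boundary convention stated above (a value on a boundary goes to the interval on its right).

```python
from collections import Counter

# Ordered data from problem #1
data = [0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6,
        6, 6, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 14,
        16, 16, 18]
WIDTH = 4  # interval width chosen in the solution


def relative_frequencies(values, width):
    """Relative frequency per interval [k*width, (k+1)*width).

    Integer division sends a boundary value (4, 8, ...) to the
    interval on its right, matching the stated convention.
    """
    counts = Counter(v // width for v in values)
    n = len(values)
    return {(k * width, (k + 1) * width): counts[k] / n
            for k in sorted(counts)}


rel_freq = relative_frequencies(data, WIDTH)
for (lo, hi), rf in rel_freq.items():
    print(f"{lo}-{hi}: {rf:.3f}")
```

The relative frequencies should add up to 1, which is a quick way to catch a miscounted interval.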
#2. As in #1, we need to calculate the relative frequencies so as to determine the height of each rectangle in the histogram.
Amount of
purchase ($)   Frequency   Relative Frequency   Rel. Freq./Width
0 - 20         36          36/200 = 0.18        0.18/20 = 0.009
20 - 30        42          42/200 = 0.21        0.21/10 = 0.021
30 - 40        50          50/200 = 0.25        0.25/10 = 0.025
40 - 50        34          34/200 = 0.17        0.17/10 = 0.017
50 - 60        16          16/200 = 0.08        0.08/10 = 0.008
60 - 70        12          12/200 = 0.06        0.06/10 = 0.006
70 - 100       10          10/200 = 0.05        0.05/30 = 0.00167
Note: The intervals are of unequal width, so we must use the density scale.
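As a rough check on the density-scale column in #2, the heights can be recomputed from the frequencies. The variable and function names below are illustrative only.

```python
# (left edge, right edge, frequency) for each purchase interval, in dollars
intervals = [(0, 20, 36), (20, 30, 42), (30, 40, 50), (40, 50, 34),
             (50, 60, 16), (60, 70, 12), (70, 100, 10)]
total = sum(freq for _, _, freq in intervals)


def density_heights(rows, n):
    """Height on the density scale = relative frequency / interval width."""
    return [(freq / n) / (hi - lo) for lo, hi, freq in rows]


heights = density_heights(intervals, total)
# Area of each rectangle = height x width; the areas should sum to 1.
areas = [h * (hi - lo) for h, (lo, hi, _) in zip(heights, intervals)]

for (lo, hi, _), h in zip(intervals, heights):
    print(f"{lo}-{hi}: {h:.5f}")
```

Because each height is divided by its own width, the total area stays at 1 even though the intervals are unequal.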
#3) Ans: 45%; the median is in the 30-40 hr interval.
Explanation: For the density scale,
Area of a “bar” or rectangle = % of cases that fall in that interval.
% of lifetimes > 40 hrs = area under the histogram to the right of 40 hrs
= area of the last two rectangles.
Area of 1st rectangle = base x height
= (50 hr - 40 hr) x (2.5%/hr)
= 10 hr x 2.5%/hr = 25%
Area of 2nd rectangle = (70 hr - 50 hr) x (1.0%/hr) = 20%
So, the percentage of lifetimes that exceed 40 hrs is 25% + 20% = 45%.
From our work above, the % of lifetimes in the 0-30 and 30-40 intervals together is 100% - 45% = 55%, so the median must be in one of them.
The % of lifetimes in the 0-30 interval = 30 hr x 0.5%/hr = 15%, which is less than 50%, so the
30-40 interval must contain the median.
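The area argument in #3 can be sketched in code. The height of the 30-40 rectangle is not quoted in the solution, so the sketch below assumes its area is whatever remains of 100%; the names are illustrative.

```python
# Heights quoted in the solution, in % per hour, keyed by (left, right) in hours
known_heights = {(0, 30): 0.5, (40, 50): 2.5, (50, 70): 1.0}

# Area of a rectangle = base x height = % of lifetimes in that interval
area = {iv: (iv[1] - iv[0]) * h for iv, h in known_heights.items()}
# Assumption: the 30-40 interval holds the remaining percentage.
area[(30, 40)] = 100 - sum(area.values())

over_40 = area[(40, 50)] + area[(50, 70)]

# Walk the intervals left to right; the median sits where the
# cumulative percentage first reaches 50%.
cumulative = 0.0
median_interval = None
for iv in sorted(area):
    cumulative += area[iv]
    if cumulative >= 50:
        median_interval = iv
        break

print(over_40, median_interval)
```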
#4)
Interval      Frequency   Relative Frequency   Rel. Freq./Width
10-30 cars    12          12/24 = 0.5          0.5/20 = 0.025
30-50 cars     8           8/24 = 0.333        (8/24)/20 = 0.0167
50-70 cars     4           4/24 = 0.167        (4/24)/20 = 0.0083
Both histograms are acceptable: one uses relative frequency, the other uses the density scale.
#5)
Interval      Frequency   Relative Frequency   Rel. Freq./Width
10-20 days     3           3/20 = 0.15         0.015
20-30 days    13          13/20 = 0.65         0.065
30-40 days     4           4/20 = 0.2          0.02
40-50 days     0           0                   0.0
The histogram then looks like:
CHAPTER 4
#1) Ans: Mean = 11.4, Median = 10, SD = 4.58694
Explanation: The ordered data is 5, 9, 10, 15, 18. The exact middle value is our median = 10.
The mean, or average, is (sum of observations)/(# of observations) = 57/5 = 11.4. For the standard deviation,
SD = Sqrt{[∑(observation - mean)^2]/(# of observations)}
= Sqrt([(5-11.4)^2 + (9-11.4)^2 + (10-11.4)^2 + (15-11.4)^2 + (18-11.4)^2]/5)
= Sqrt(105.2/5) = 4.58694
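The same three summaries can be checked with Python's standard library. Note that pstdev divides by the number of observations, which is the SD convention used in these solutions; stdev (dividing by n-1) would give a different number.

```python
from statistics import mean, median, pstdev

data = [5, 9, 10, 15, 18]

avg = mean(data)      # (sum of observations) / (# of observations)
mid = median(data)    # middle value of the ordered data
sd = pstdev(data)     # Sqrt of the average squared deviation from the mean

print(avg, mid, round(sd, 5))
```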
#2)
a) Ans: 68%. The range 40-60 can be written as (50-10) to (50+10), or 50 ± 10. Since 10 = SD and 50 = mean, the range 40-60 is all the data within one SD of the mean, so we expect it to contain roughly 68% of the data.
b) Ans: 70. Your drawing should look like the “bell curves” illustrated in Chapter 4 of your text, with the center, or peak, at the mean = 50, and tapering off at about two SD’s away from the mean on either side, namely 50 - 2x10 = 30 and 50 + 2x10 = 70.
#3) Median = middle entry of the three = 38
Mean = (33+38+40)/3 = 111/3 = 37
SD = Sqrt([(33-37)^2 + (38-37)^2 + (40-37)^2]/3) = Sqrt(26/3) = 2.94392
#4)
a)While the axis may appear a bit different, your dot plot should look similar to that below:
SD = 14. The standard deviation is approximately the range divided by 4 (since two standard deviations to either side of the mean should contain just about all the data), and so a rough estimate of the SD is (88-32)/4 = 56/4 = 14.
#5) As above,
Median = the middle observation of the ordered data (23, 42, 48, 60, 78) = 48
Mean = (sum of observation values)÷(number of observations) = 251÷5 = 50.2
SD = Sqrt([(23-50.2)^2 + (42-50.2)^2 + (48-50.2)^2 + (60-50.2)^2 + (78-50.2)^2]/5) = Sqrt(1680.8/5) = 18.335
#6) Mean = 68/4 = 17
SD = Sqrt([(3)^2 + (2)^2 + (1)^2 + (4)^2]/4) = Sqrt(30/4) = 2.73861
#7) SD ~ 4.25. The histogram looks pretty symmetric, or balanced, about the peak at the value 22 in the middle, so this would be a rough estimate of the mean. The empirical rule says that 95% of the data should fall within two SD’s of the mean, and looking at the histogram, it appears as if about 2-3% falls at or below 12, and about the same at or above 29, so an estimate of the SD is
SD ~ (29-12)/4 = 4.25
#8) The amount of weight John is above the average is 230 lb - 190 lb = 40 lb. This is equal to 40/25 = 1.6 SD’s.
1.4 SD’s = 1.4 x 25 lb = 35 lb, so Mark’s weight is about 190 lb - 35 lb = 155 lb.
1.2 SD’s = 1.2 x 25 lb = 30 lb, so men with weights within 1.2 SD’s of the average have weights between (190-30) lb and (190+30) lb, or between 160 lb and 220 lb.
CHAPTER 5
#1) Ans: 0.106, 9.76 min
Explanation: (30 min - 20 min)/8 min = 1.25, so 30 min is 1.25 SD’s above the mean, or, in standard units, 30 min is equal to 1.25. So,
Prob(squad car takes > 30 minutes to arrive) = (Area under the std. normal curve > 1.25)
= 1 - (area to the left of 1.25)
= 1 - 0.894 = 0.106
Notice that this is the same as the area to the right of 30 under the normal curve with mean = 20 and SD = 8.
To find the arrival time below which 10% of all arrival times fall, we will again refer to the standard normal curve. From the tables, we need to find the specific z-value to the left of which 10% of the area lies.
This is z = -1.28. Since the z value we found is less than zero, we know this corresponds to a value 1.28 SD’s below the mean. So,
Time before which only 10% of squad cars arrive = mean - 1.28 SD’s
= 20 - 1.28x8 = 9.76 min.
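For readers without a normal table at hand, the two lookups in this problem can be sketched with statistics.NormalDist from the Python standard library. The variable names are illustrative; the mean of 20 min and SD of 8 min are taken from the standardization step above.

```python
from statistics import NormalDist

# Arrival times modeled as normal with mean 20 min and SD 8 min
arrival = NormalDist(mu=20, sigma=8)

# Chance that a squad car takes more than 30 minutes
p_over_30 = 1 - arrival.cdf(30)

# Arrival time below which 10% of all arrival times fall
t_10th = arrival.inv_cdf(0.10)

print(round(p_over_30, 3), round(t_10th, 2))
```

The software uses an unrounded z-value (about -1.2816 rather than -1.28), so the 10th-percentile time comes out a hair below the table-based 9.76 min.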
#2) Ans: a) No, 552 is less than the 75th percentile.
Explanation: The 75th percentile = value below which 75% of the data falls
(or 75% of the area under the curve).
From the std. normal curve, this corresponds to z = 0.674.
As in #1), we see that 552 is (552-500)/100 = 0.52 SD’s above the mean. Since the score of 552 is less than 0.674 SD’s above the mean, it is less than the 75th percentile, and hence does not beat the cutoff.
#3) Ans: Passing score = 67; 6.7% score less than 60.
Explanation: The score that 80% of the students should exceed is the same as the score that 20% will fall below, i.e. the 20th percentile. With z_0.20 = -0.842, the passing score should be 75 + (-0.842)x10 = 75 - 8.4 = 66.6 ~ 67.
A score of 60 is (75-60)/10 = 1.5 SD’s below the mean. From the standard normal table, we see that .0668, or approx. 6.7%, of the area under the curve is below -1.5. This means that the same percentage, 6.7%, of the students score less than 60 (1.5 SD’s below the mean).
#4)
a)Ans: 88.5%
As in #3), 56 minutes is (56-50)/5 = 1.2 SD’s above the mean. From the standard normal table, 88.5% of the area under the curve is below (to the left of) z = 1.2, so we can conclude that 88.5% of the students will finish the test within 56 minutes (mean + 1.2 SD’s).
b) Ans: 57 minutes
Following the reasoning in the first part of #3), we want to find the 90th percentile, since we want 90% of the student test times to be at or below this time. From the standard normal table, z_0.90 = 1.28, so the test time needed to ensure that 90% of students finish = mean + 1.28 SD’s = 50 + 1.28x5 = 56.4 ~ 57 minutes.
#5) Ans:0.0007
9.2 volts is (10-9.2)/.25 = 3.2 SD’s below the mean. From the standard normal table, the probability of an observation having a value less than (to the left of) -3.2 (3.2 SD’s below the mean) is approximately 0.0007. So the probability of a battery being defective (having a voltage less than 9.2 volts) is 0.0007.
#6) Ans: 0.586
90 complaints is 0.5 SD’s below the mean, and 125 complaints is 1.25 SD’s above the mean. What we want is the area under the normal curve which falls between the two z values of -0.5 and 1.25. To find this area, note that it is actually
(Area under the curve < 1.25) - (Area under the curve < -0.5) = 0.8944 - 0.3085 = 0.5859
This is illustrated below.
(Area under the curve < 1.25) - (Area under the curve < -0.5)
That is, the probability of having between 90 (mean - 0.5 SD’s) and 125 (mean + 1.25 SD’s) complaints next month is about 0.586.
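The "area between two z values" step can be sketched as a difference of two cumulative areas under the standard normal curve. Only the z values -0.5 and 1.25 from the solution are used; the names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

z_low, z_high = -0.5, 1.25
# Area between = (area to the left of z_high) - (area to the left of z_low)
p_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(round(p_between, 3))
```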
CHAPTER 8
#1) Ans: r = 0.65
Mean X = 5   SDX = 3.16
Mean Y = 9   SDY = 2.92
X   Y    X std. units           Y std. units            Product
1   8    (1-5)/3.16 = -1.266    (8-9)/2.92 = -0.342     (-1.266)(-0.342) = 0.433
3   5    (3-5)/3.16 = -0.633    (5-9)/2.92 = -1.37      (-0.633)(-1.37) = 0.867
7   13   (7-5)/3.16 = 0.633     (13-9)/2.92 = 1.37      (0.633)(1.37) = 0.867
9   10   (9-5)/3.16 = 1.266     (10-9)/2.92 = 0.342     (1.266)(0.342) = 0.433
Sum = 2.600
Mean of the products = 2.600/4 = 0.65
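The standard-units recipe for r (convert each variable to standard units, multiply pairwise, average the products) can be sketched as follows. The helper names are made up for illustration, and the SD divides by the number of observations, as in Chapter 4.

```python
from statistics import mean, pstdev

xs = [1, 3, 7, 9]
ys = [8, 5, 13, 10]


def standard_units(values):
    """(value - mean) / SD for each value, with SD dividing by n."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


def correlation(x_values, y_values):
    """Average of the products of the paired standard units."""
    zx, zy = standard_units(x_values), standard_units(y_values)
    return mean([a * b for a, b in zip(zx, zy)])


r = correlation(xs, ys)
print(round(r, 4))
```

Two negative standard units multiply to a positive product, so it is worth checking the sign of every row when doing this by hand.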
#2)
a) Ans: r = 0.845
Mean X = 6 yrs.   SDX = Sqrt(18/5) = 1.90 yrs.
Mean Y = 26 (x$1000)   SDY = Sqrt(90/5) = 4.24
X   Y    X std units   Y std units   Product
3   20   -1.581        -1.415        2.237
5   24   -0.527        -0.472        0.249
6   27    0.0           0.236        0
8   33    1.054         1.651        1.740
8   26    1.054         0.0          0
Sum = 4.226
Mean of the products = 4.226/5 = 0.845
b) Ans: Y = 14.66 + 1.89*X
Explanation: The slope of the regression line is (correlation coef.) x SDY/SDX = 0.845 x 4.24/1.90 = 1.89.
The intercept of the regression line is Ybar - (slope) x (Xbar) = 26 - 6 x 1.89 = 14.66. So the equation of the regression line is Y = 14.66 + 1.89 x (years of experience), where Y is in units of $1000.
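A sketch of the same computation straight from the five (years, salary) pairs, using the SD convention of Chapter 4 (divide by the number of observations) for both variables. The function name regression_line is hypothetical.

```python
from statistics import mean, pstdev

years = [3, 5, 6, 8, 8]          # X: years of experience
salary = [20, 24, 27, 33, 26]    # Y: salary in units of $1000


def regression_line(xs, ys):
    """Return (r, slope, intercept) for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    sdx, sdy = pstdev(xs), pstdev(ys)   # both SDs divide by n
    r = mean([((x - mx) / sdx) * ((y - my) / sdy)
              for x, y in zip(xs, ys)])
    slope = r * sdy / sdx
    intercept = my - slope * mx
    return r, slope, intercept


r, slope, intercept = regression_line(years, salary)
print(round(r, 3), round(slope, 3), round(intercept, 3))
```

Using the same divisor for SDX and SDY matters here: mixing an n-1 divisor for one variable with an n divisor for the other changes r.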
#3) Ans: r = 0.8 (approximately)
Explanation: Almost all of the X data lies between 30 and 70, symmetrically about 50, with an outlier on both sides, so the average X ~ 50 and the SD ~ 10. Judging by Figure 6 on page 127 of the text, the correlation coefficient looks to be about 0.8.
#4)
The smallest x=~25
The smallest y=~90
The average x=~50
SD of x = ~6. Almost all of the data sits between 35 and 65, that is, between about mean-15 and mean+15. This means 15 ~ 2 SD’s, and the closest to this among the choices is 6.
The r is certainly not 0, and it looks stronger than .2, so correlation = .5 is the only other option.
#5) If r = -.4, then as x increases, y tends to decrease. An r < 0 means the “cloud” in the scatter plot slopes down as x increases, that is, y decreases on average.
Chapter 9
(1) Ans: Probably not equal, most likely a little higher
Explanation: Suppose that Mexicans (represented by the red data “football”) are shorter and weigh less, on average, than the Europeans (represented by the blue). The result of pooling these two sets of data, as illustrated below, is a “stretched” football shape, making it look longer and thinner, which would make it cluster closer about a line.
For example, in the graph below, both groups again have r = .7, but combined, they now have r = .77.
(2) Ans: This conclusion does not appear to be supported by the data at hand. Explanation: While a correlation of -0.3 exists, it is imperative to remember that correlation does not imply causation. There could be other factors, maybe even more significant than computer gaming time and probably confounded with its effects, that account for the variation in GPA’s.
(3)Ans: The manager’s conclusion is not supported by the data.
Explanation: While the manager’s conclusion might make sense intuitively, the data suggests otherwise. From the scatter plot, the drink sales appear to peak around 90 degrees, but then begin to fall off beyond that (maybe fewer people come out to the beach for weather beyond 90 degrees?). In fact, the data appears to follow something other than a linear relationship, most likely a curve. For example if the current day’s temperature were 90 degrees, and the next day was predicted to be 100 degrees, according to the data one would expect to see, on average, a decrease in drink sales, contrary to the manager’s conclusion.
(4)Ans: The data people are correct.
Explanation: Recall that the correlation coefficient is a “pure number”, without units, and is not affected by
1)adding the same number to all the values of one variable
2)multiplying all the values of one variable by the same positive number.
In converting the numbers from kg to lbs, it is a matter of just multiplying by a positive conversion factor, and hence the correlation coefficient is unchanged. Further conversion of degrees Fahrenheit to Celsius does not, as shown in the text, alter r, so the correlation coefficient remains unchanged in the complete conversion.
(5) Ans: The board’s proposal is premature.
Explanation: Because the data is taken from such a broad swath of the population which frequents the farm/garden supply store, the data can be broken down into a number of subcategories which might give stronger correlations, for instance,
a)age( retired, elderly people, who have time to garden a lot may have a lower strength score than younger study participants who have little time to garden)
b) gender (women, who may garden more as a result of not working outside the home, may have lower strength scores than their male spouses)
c)occupation(strong, physical laborer participants may not garden as much, due to employment requirements, as other study participants of less strength but more gardening time)
Chapter 10
(1)Ans: 80th percentile high school GPA => 60th percentile in college GPA
10th percentile high school GPA=>35th percentile in college GPA
Explanation:
High School GPA: 80th percentile has Z= 0.84 (area to the left is 0.80)
So,
College GPA has Z=(0.3)x(0.84)=0.252 ~ 0.25
The area to the left of Z=0.25 is 0.599, which corresponds to the 59.9th
percentile, or ~ 60th percentile.
Similarly,
High School GPA of 10th percentile has Z=-1.28
College GPA has Z=(0.3)x(-1.28)=-.384, corresponding to the 35th percentile.
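The percentile-to-percentile prediction can be sketched in three steps: percentile to z, multiply by r, z back to percentile. The function name is illustrative, and r = 0.3 is the correlation used in the solution above.

```python
from statistics import NormalDist


def predicted_percentile(percentile_x, r):
    """Regression estimate of the y percentile for a given x percentile.

    Assumes both variables follow the normal curve, as in the solution.
    """
    std_normal = NormalDist()
    zx = std_normal.inv_cdf(percentile_x / 100)  # x in standard units
    zy = r * zx                                  # regression estimate for y
    return 100 * std_normal.cdf(zy)


p80 = predicted_percentile(80, 0.3)
p10 = predicted_percentile(10, 0.3)
print(round(p80, 1), round(p10, 1))
```

Both predictions land closer to the 50th percentile than the starting percentile, which is the regression effect at work.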
(2)Ans: False
Explanation: Two regression lines can be drawn/calculated from the data set of interest. One predicts weight from height, the other predicts height from weight. These are two different lines, as shown in the text, and hence whereas a height of 62” predicts a weight of 140 lbs, a weight of 140 lbs does not necessarily predict a height of 62”.
(3)Ans: No
Explanation: The graduate student is guilty of the regression fallacy: the reason for this change in percentile rank of grades is a clear example of the regression effect, especially since we know the grade distributions to be approximately normal, and hence the scatter diagrams to be “football-shaped” clouds of data points.
(4)Ans: Person 2 has the better interpretation
Explanation: The different regression lines are not necessarily evidence of different spending habits. As above, one regression line predicts expenditure from income (store 1), while the other predicts income from expenditure (store 2), so they cannot be compared directly. Also, caution is in order when interpreting regression/correlation done on “averages” versus the raw paired data, as correlation can be deceptively strong when using averages.
(5)Ans: No
Explanation: UFIT is trying to extrapolate about company employees, whose age, fitness, health history and daily-lives will, most likely, vary greatly from that group of people who participated in the study (6th-10th graders). There is little reason to believe that the results of the company employees’ participation in the exercise routine will duplicate those of the test group of 6th-10th graders.
Chapter 11
(1) Ans: (a) 25.93 ~ 26 , (b) 26 ± 2.4 , (c) false , (d) y = 13.68 + 0.533*X
Explanation: In the manner of your course packet,
(a) x = 23 mpg
Zx = (x-Xbar)/SDx = (23-25)/3 = -0.666
Zy = R*Zx = (0.8)*(-0.666) = -0.533
y = Ybar + SDy*Zy = 27 + 2*(-0.533) = 25.93 ~ 26 mpg
So a car with 23 mpg in the first month is predicted to have gas mileage of about 26 mpg in the second month.
(b) The 95% range is mean y (for the given X value) ± 2*R.M.S.
R.M.S. = Sqrt(1-0.8^2)*2 = 0.6*2 = 1.2
95% range = 26 ± 2*1.2 = 26 ± 2.4
(c) This is what is meant by the “regression effect”.
(d) The regression line slope = R*SDY/SDX = 0.8*2/3 = 0.533.
The intercept = Ybar - slope*Xbar
= (mean 2nd month mpg) - (slope)*25
= 27 - (0.533)*25 = 13.68
So the equation of the regression line is y = 13.68 + 0.533*X
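The regression-method steps used throughout this chapter (standardize x, multiply by R, convert back to y units, then attach an R.M.S. error) can be sketched as two small functions. The function names are illustrative; the numbers are the ones given in this problem.

```python
from math import sqrt


def predict(x, x_mean, x_sd, y_mean, y_sd, r):
    """Regression estimate of y for a given x."""
    zx = (x - x_mean) / x_sd      # x in standard units
    zy = r * zx                   # predicted y in standard units
    return y_mean + zy * y_sd     # back to the units of y


def rms_error(r, y_sd):
    """R.M.S. error of the regression line: Sqrt(1 - R^2) * SDY."""
    return sqrt(1 - r ** 2) * y_sd


# First-month mpg: mean 25, SD 3; second-month mpg: mean 27, SD 2; R = 0.8
estimate = predict(23, 25, 3, 27, 2, 0.8)
rms = rms_error(0.8, 2)
slope = 0.8 * 2 / 3
intercept = 27 - slope * 25

print(round(estimate, 2), round(rms, 2), round(2 * rms, 2))
```

Carrying unrounded intermediate values gives an intercept of about 13.67; the 13.68 in the solution comes from rounding the slope to 0.533 first.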
(2) Ans: 38.6 , 38.6 ± 15.8 (note RMS = 9.6)
Explanation: As in #1) part (a),
x = 35 (thousands)
Zx = (35-30)/10 = 5/10 = .5
Zy = (0.6)(0.5) = 0.3
y = 35 + (0.3)(12) = 35 + 3.6 = 38.6.
For the 90% range for 2007 incomes of those who made 35k in 2000, we assume the entire set of data is normally distributed, with
mean = 38.6
SD = RMS = Sqrt(1-0.6^2)*SDY = 0.8*12 = 9.6
Then the 90% range is 38.6 ± 1.645*RMS
= 38.6 ± 1.645*9.6 = 38.6 ± 15.8
(3) Ans: 81.2 ~ 81 , 81 ± 17
Explanation: As above, an entrance exam score of 80 is
x = 80
Zx = (80-60)/15 = 4/3 = 1.33
Zy = (0.7)(1.33) = 0.931
y = 70 + (0.931)*12 = 70 + 11.17 ~ 81.
With an RMS = Sqrt(1-0.7^2)*12 = 8.57,
the interval of interest is 81 ± 2*8.57 ~ 81 ± 17.
(4) Ans: 11.2 , 11.2 ± 6.8 (RMS = 3.4)
Explanation: First, we need to find SDX, SDY and R.
SDX = Sqrt(avg((x - mean x)^2))
= Sqrt([sum{(xi - mean x)^2}]/n)
= Sqrt(840/30) = 5.3
SDY = Sqrt(580/30) = 4.4.
R = COV(x,y)/(SDX*SDY)
= average{(xi - mean x)*(yi - mean y)}/(SDX*SDY)
= ([sum{(xi - mean x)*(yi - mean y)}]/30)/(SDX*SDY)
= (450/30)/(5.3*4.4) = .64
Now,
x = 22”
Zx = (22-18)/5.3 = 0.75
Zy = (0.64)(0.75) = 0.48
y = 9.1 + (0.48)(4.4) = 9.1 + 2.11 = 11.21 ~ 11.2
RMS = Sqrt(1-.64^2)*SDY = 3.4,
95% range = 11.2 ± 2*3.4 = 11.2 ± 6.8
(5) Ans: A and D
Explanation: A and D exhibit a definite pattern. Note in both A and D that all the residuals are positive (the line underestimates y) up until a certain value, then all become negative (the line overestimates y), and then all become positive again. These patterns suggest that the y values are not linearly related to the x values.
(6) Ans:Yes
Explanation: Recall that the RMS of a regression line is calculated as RMS = Sqrt(1-R^2)*SDY. Also recall that R is always between -1 and 1, so R^2 and Sqrt(1-R^2) are always between 0 and 1. This means that RMS is always less than or equal to SDY (equal in the case R = 0: no correlation between X and Y). Since SDY is already less than the accuracy required, the manager knows that the RMS has to be less than that required.
Chapter 12
1. Ans: y = 11.204 + .466 x , 13.068 ± 2.350 (note RMS = 1.428)
Explanation: The slope of the regression line is
Slope = R*SDY/SDX = 0.7*2/3 = 0.466.
Intercept is Ybar - slope*Xbar = 14 - 6*(0.466) = 11.204.
The regression line is y = 11.204 + 0.466x.
To predict the average weight of a sunfish 4 yrs of age,
x = 4
Zx = (4-6)/3 = -0.666
Zy = (0.7)(-0.666) = -0.466
y = 14 + (-0.466)*2 = 13.068.
RMS = Sqrt(1-R^2)*SDY = Sqrt(.51)*2 = 1.428.
So, recalling #2 from Chapter 11 above,
90% range = 13.068 ± (1.645)*1.428 = 13.068 ± 2.35
2. Ans: y = 43.9 - .245 x , RMS = .693
Explanation:
SDX = Sqrt(40/5) = 2.83
SDY = Sqrt(4.8/5) = 0.98.
To find R, we first need
COV(x,y) = average{x*y} - (mean x)*(mean y)
= Sum(xi*yi)/n - 29*36.8
= 5326.2/5 - 1067.2 = -1.96.
Then R = COV/(SDX*SDY) = -1.96/(2.83*0.98) = -.707.
Slope of the regression line = R*SDY/SDX = -.707*0.98/2.83 = -.245.
The intercept = Ybar - slope*Xbar = 36.8 - (-.245)*29 = 43.9.
RMS = Sqrt(1-R^2)*SDY = Sqrt(1-.707^2)*0.98 = .693.
The equation for the regression line is y = 43.9 - .245x.
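When only summary sums are available, as in this problem, the line and its RMS error can still be recovered. The sketch below uses the sums quoted in the solution (sum of squared deviations 40 for x and 4.8 for y, sum of products 5326.2, n = 5); the variable names are illustrative.

```python
from math import sqrt

n = 5
ss_x, ss_y = 40, 4.8          # sums of squared deviations from the means
sum_xy = 5326.2               # sum of the products xi*yi
mean_x, mean_y = 29, 36.8

sd_x = sqrt(ss_x / n)
sd_y = sqrt(ss_y / n)
cov = sum_xy / n - mean_x * mean_y    # average product minus product of averages
r = cov / (sd_x * sd_y)
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x
rms = sqrt(1 - r ** 2) * sd_y

print(round(r, 3), round(slope, 3), round(intercept, 1), round(rms, 3))
```

Note that the slope equals cov / sd_x**2, so it does not depend on how SDY is rounded; R and the RMS error do.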
3. Ans: intercept = 21.6; slope = change in y for a 1-unit change in x; 10.88 ± 3.23
Explanation: The intercept and description of the slope follow directly from the discussion of lines in the text. More specifically, though, the slope means that for each additional year of age, the employee is predicted to be absent, on average, 0.268 fewer days during the calendar year. At age 40, the average employee will be absent 21.6 - 0.268*40 = 10.88 days. The 95% prediction interval is 10.88 ± 2*RMS = 10.88 ± 3.23
4. Ans: No
Explanation: The CEO is making a causal inference from an observational study. “With an observational study, the slope cannot be relied on to predict the results of intervention.” Here, the intervention is a controlled increase in vacation time. There are almost certainly other factors which positively impact sales volumes and which happen to be positively correlated with vacation time, e.g. more established and experienced salesmen might have higher sales volumes, and, due to their seniority, get more vacation time. While we might all like a CEO who thinks like this, he wouldn’t be a CEO for long.
5. Ans: y = 38.13 + .568 x , RMS = 3.41
Explanation: First,
SDX = Sqrt([(-14)^2 + 1^2 + 11^2 + 6^2 + (-4)^2]/5) = 8.6
SDY = Sqrt([(-8)^2 + 2^2 + 2^2 + 9^2 + (-5)^2]/5) = 5.96. Next,
Sum{xi*yi} = 65*75 + 80*85 + … = 32995
COV = average{xi*yi} - (mean x)*(mean y) = 32995/5 - 79*83 = 42.
R = COV/(SDX*SDY) = 42/(8.6*5.96) = 0.82
The slope of the regression line = 0.82*5.96/8.6 = 0.568
The intercept is 83 - 0.568*79 = 38.13
RMS = Sqrt(1-R^2)*SDY = Sqrt(1-0.82^2)*5.96 = 3.41
Chapter 16
(1) Ans: box has one “35” and thirty-seven “-1”.
Explanation: The phrase “pays 35 to 1” means that you get your $1 bet plus $35 back if your number comes up, but lose your $1 bet if any other number comes up. For a box diagram, that means there is one ticket with “35” on it, for your number, and the rest with “-1” on them. The number of tickets with “-1” on them is the total number of “numbers” other than yours. In the case of roulette, this is 36 + 2 (‘0’ and ‘00’) - 1 (your number) = 38 - 1 = 37.
(2) Ans: (a) box has one “35”, 18 “1”, and 19 “-2”.
(b) box has one “36”, 17 “1”, and 20 “-2”.
Explanation: As above, the box will have 38 tickets.
For (a),
Win $35 for the number ‘10’   -> 1 ticket with “35” on it
Win $1 for ‘odd’              -> 18 tickets with “1” on them (# of ‘odd’ in 0-36)
Lose $2 for any other #       -> 19 tickets with “-2” on them (remaining tickets)
Similarly, for (b),
Win $35 + $1 for the number ‘10’    -> 1 ticket with “36” on it
Win $1 for ‘even’ (besides ‘10’)    -> 17 tickets with “1” on them (# of ‘even’ besides ‘10’)
Lose $2 for any other #             -> 20 tickets with “-2” on them (remaining tickets)
(3)Ans: -$11, $1
Roulette #            12   33    6   16    8    4   18   29    5   14
Net gain, 2(a) bet    -2   +1   -2   -2   -2   -2   -2   +1   +1   -2   Total: 3-14 = -11
Net gain, 2(b) bet    +1   -2   +1   +1   +1   +1   +1   -2   -2   +1   Total: 7-6 = 1
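The two totals can be re-added mechanically. The payoff rules below are copied from the box models in (2) as stated in this solution (they are not a general roulette payout table), and the function names are illustrative.

```python
# The ten roulette numbers listed in problem (3)
spins = [12, 33, 6, 16, 8, 4, 18, 29, 5, 14]


def net_gain_a(number):
    """Box model 2(a): 35 for '10', 1 for any odd number, -2 otherwise."""
    if number == 10:
        return 35
    if number % 2 == 1:
        return 1
    return -2


def net_gain_b(number):
    """Box model 2(b): 36 for '10', 1 for any other even number, -2 otherwise.

    0 (and 00, which does not appear in this list) is not counted as even.
    """
    if number == 10:
        return 36
    if number != 0 and number % 2 == 0:
        return 1
    return -2


total_a = sum(net_gain_a(s) for s in spins)
total_b = sum(net_gain_b(s) for s in spins)
print(total_a, total_b)
```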
(4)Ans: you can’t given the current information
Explanation: You need to know if the “10” is black or red.
If ‘10’ is red, the box looks like 2(a), and if black, like 2(b).
(5) Ans: 44/70
Explanation: The SE for the number of baskets made = Sqrt(# free throws)*(SD of the box). For 20 and 70 free throws, SE = Sqrt(20)*(0.5) = 2.236 and Sqrt(70)*(0.5) = 4.183, respectively. Then 14 (out of 20) is (14-10)/2.236 = 1.789 SE’s above the expected number, and 44 (out of 70) is (44-35)/4.183 = 2.151 SE’s above the expected number. Assuming the number of baskets your cousin makes follows the normal curve, the probability of making 44 out of 70 is lower, and hence it is the better bet for you.
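The comparison in (5) can be sketched as a small function that converts a target number of baskets into SE units. The function name is made up, and the default of a 50% shooter (a 0-1 box with SD 0.5) is the assumption implied by the expected values of 10 and 35 above.

```python
from math import sqrt


def se_units(made, shots, p=0.5):
    """How many SEs the target count lies above the expected count."""
    expected = shots * p
    box_sd = sqrt(p * (1 - p))        # SD of a 0-1 box; 0.5 when p = 0.5
    se = sqrt(shots) * box_sd         # SE for the number of baskets made
    return (made - expected) / se


z_14_of_20 = se_units(14, 20)
z_44_of_70 = se_units(44, 70)
print(round(z_14_of_20, 3), round(z_44_of_70, 3))
```

The larger value marks the target that is harder for the shooter to reach.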
Chapter 17
(1) Ans: EV = 200 , SE = 8.16
Explanation: When a game is described as a set of tickets in a box, the amount you expect to win, or expected value (EV), from a number of plays = (number of draws)x(average of the box). In this case, the average of the box = sum{ticket values}/(# of tickets in the box)
= 6/3 = 2.
So the EV = (100)x(2) = 200. The standard error for the amount you win (the sum of the draws) = Sqrt(number of draws)x(SD of box). The SD of the box = Sqrt{[(1-2)^2 + (2-2)^2 + (3-2)^2]/3} = Sqrt(2/3). So the SE of the sum of the draws = Sqrt(100)xSqrt(2/3) = Sqrt(100x2/3) = Sqrt(66.67) = 8.16
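The EV and SE formulas for the sum of draws can be sketched directly from the box. The box of tickets 1, 2, 3 and the 100 draws are taken from the solution above; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

box = [1, 2, 3]     # tickets in the box
draws = 100         # number of draws, made with replacement

ev_sum = draws * mean(box)             # (number of draws) x (average of box)
se_sum = sqrt(draws) * pstdev(box)     # Sqrt(number of draws) x (SD of box)

print(ev_sum, round(se_sum, 2))
```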