Histograms
Example 1
A 30-minute standardized test is given to 28 students, and their completion times (in minutes) were as follows: 17, 19, 21, 24, 24, 24, 25, 25, 25, 27, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30
Clearly, this is left-skewed (the long tail is on the left, towards the shorter times), as is seen in the histogram.
Mean time = 27.107, Median = 29
Note the mean is lower than the median.
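The mean and median quoted above can be checked directly from the listed completion times. The following is a minimal sketch using Python's standard library; the variable names are illustrative only.

```python
from statistics import mean, median

# Completion times (minutes) as listed in Example 1
times = [17, 19, 21, 24, 24, 24, 25, 25, 25, 27, 28, 28, 28,
         29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 30]

avg = mean(times)
med = median(times)

print("number of times:", len(times))
print("mean:", round(avg, 3), "median:", med)
# The few short completion times pull the mean below the median
print("mean below median:", avg < med)
```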
Example 2
The time to failure (in days) of a friction pad, used as a transmission clutch in a machine, was recorded for a random sample of 100. The data is right-skewed (the long tail is on the right, towards the longer failure times), as is seen in the histogram:
Mean failure time = 362.71, Median failure time = 352.96
Note the mean is now higher than the median.
Correlation
Example 1
A movie theatre wants to see if there is a measurable relationship between ticket sales and average temperature during the hottest summer month, August. Data from the last 10 years is as follows:
Avg. Aug. Temp. (x): 94 / 99 / 96 / 93 / 94 / 90 / 98 / 96 / 91 / 95
Ticket sales in $1000 (y): 113 / 122 / 109 / 105 / 120 / 97 / 116 / 106 / 100 / 105
Solution: To calculate the correlation coefficient, the following table is set up. (The table uses nine of the ten pairs, leaving out x = 98, y = 116, so n = 9 in the calculations below.)
x / y / x-xbar / y-ybar / (x-xbar)^2 / (y-ybar)^2 / (x-xbar)*(y-ybar) / x*y
94 / 113 / -0.22222 / 4.444444 / 0.049383 / 19.75309 / -0.98765 / 10622
99 / 122 / 4.777778 / 13.44444 / 22.82716 / 180.7531 / 64.23457 / 12078
96 / 109 / 1.777778 / 0.444444 / 3.160494 / 0.197531 / 0.790123 / 10464
93 / 105 / -1.22222 / -3.55556 / 1.493827 / 12.64198 / 4.345679 / 9765
94 / 120 / -0.22222 / 11.44444 / 0.049383 / 130.9753 / -2.54321 / 11280
90 / 97 / -4.22222 / -11.5556 / 17.82716 / 133.5309 / 48.79012 / 8730
96 / 106 / 1.777778 / -2.55556 / 3.160494 / 6.530864 / -4.54321 / 10176
91 / 100 / -3.22222 / -8.55556 / 10.38272 / 73.19753 / 27.5679 / 9100
95 / 105 / 0.777778 / -3.55556 / 0.604938 / 12.64198 / -2.76543 / 9975
∑: x = 848 / y = 977 / (x-xbar)^2 = 59.55 / (y-ybar)^2 = 570.22 / (x-xbar)*(y-ybar) = 134.88 / x*y = 92190
µ: xbar = 94.22 / ybar = 108.55
SDX=sqrt(59.55/9)=2.57
SDY=sqrt(570.22/9)=7.95
COV=134.88/9=~15
R=15/(2.57*7.95)=.73416
Using z-scores
ZX = (x-xbar)/SDX, ZY = (y-ybar)/SDY
ZX / ZY / ZX*ZY
-0.08647 / 0.55905 / -0.04834
1.859058 / 1.691125 / 3.143899
0.691742 / 0.055905 / 0.038672
-0.47557 / -0.44724 / 0.212695
-0.08647 / 1.439553 / -0.12447
-1.64289 / -1.45353 / 2.387985
0.691742 / -0.32145 / -0.22236
-1.25378 / -1.07617 / 1.349284
0.302637 / -0.44724 / -0.13535
∑=6.6
R=6.6/9=.7333 (The difference due to round off)
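Both routes to R (covariance over the product of the SDs, and the average product of z-scores) can be checked with a short script. This is a sketch only, using the nine pairs from the table and dividing by n = 9 as the hand calculation does; the variable names are illustrative. Without intermediate rounding, both methods give the same value, about 0.732, so the small differences in the hand results come from rounding.

```python
from math import sqrt

# The nine (x, y) pairs that appear in the calculation table
x = [94, 99, 96, 93, 94, 90, 96, 91, 95]
y = [113, 122, 109, 105, 120, 97, 106, 100, 105]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# SDs with divisor n, matching SDX = sqrt(59.55/9) in the text
sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)

# Method 1: covariance divided by the product of the SDs
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
r_cov = cov / (sd_x * sd_y)

# Method 2: average of the products of the z-scores
r_z = sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
          for a, b in zip(x, y)) / n

print(round(sd_x, 2), round(sd_y, 2), round(r_cov, 4), round(r_z, 4))
```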
Regression
Example 1
From the example above, temperature vs. ticket sales, what is the prediction for monthly ticket sales for an August with an average temperature of 92?
Solution: While the data set is very small, a prediction can be made. As in your course packet,
x= 92 degrees
ZX= (92-94.22)/2.57=- 0.864 (0.864 SDX’s below the mean).
ZY= 0.734*(-0.864)=-0.634 (0.634 SDY’s below the mean).
y= 108.55-.634*(7.95) = 103.51 ( in $1000)
So the predicted ticket sales for an August with avg. temp. = 92 are $103.51 thousand (about $103,510).
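The three steps of the prediction (convert x to a z-score, multiply by R, convert back to y units) can be written as one small function. This is a sketch under the rounded summary values used in the solution; the function name `predict_y` is hypothetical.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression prediction via z-scores."""
    z_x = (x - mean_x) / sd_x   # how many SDX's x is from its mean
    z_y = r * z_x               # predicted y is only r times as many SDY's away
    return mean_y + z_y * sd_y  # convert back to the units of y

# Temperature vs. ticket sales, with the rounded values from the text
sales = predict_y(92, mean_x=94.22, sd_x=2.57,
                  mean_y=108.55, sd_y=7.95, r=0.734)
print(round(sales, 2))  # predicted ticket sales, in $1000
```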
Example 2
In the example using height and weight in Ch. 6, given a correlation coefficient of R=0.6, it was shown that an x-value (height) of the 80th percentile was predicted to result in a y-value(weight) of the 69th percentile, while an x-value of the 10th percentile was predicted to result in y-value of the 22nd percentile. Using the same data set, what would happen to heights in the 70th, 90th, 20th and 5th percentiles?
Solution: Below is a table containing the percentiles of concern (the 5th, 20th, 70th and 90th), as well as many other values.
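X% / ZX / ZY / Y%
1 / -2.33 / (0.6)x(-2.33) = -1.4 / ~7
5 / -1.645 / (0.6)x(-1.645) = -0.987 / ~16
10 / -1.28 / -0.768 / ~22
20 / -0.84 / -0.504 / ~31
30 / -0.525 / -0.315 / ~38
40 / -0.25 / -0.15 / ~44
50 / 0.0 / 0.0 / ~50
60 / 0.25 / 0.15 / ~56
70 / 0.525 / 0.315 / ~62
80 / 0.84 / 0.504 / ~69
90 / 1.28 / 0.768 / ~78
95 / 1.645 / 0.987 / ~84
99 / 2.33 / 1.4 / ~93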
SIDE NOTE (Advanced): Looking at this table, notice that the amount of “regression effect” from x percentile to y percentile (towards the mean) changes depending on how far away the initial x-value is from the median, or the 50th percentile (note that since the normal distribution is symmetric, mean = median). For x-values at the 99th and 1st percentiles (49 percentiles away from the 50th), the predicted y-values regress 6 percentiles towards the mean (99->93 and 1->7). Then, at the 90th and 10th percentiles (40 percentiles away from the 50th), the regression effect is 12 percentiles (90->78 and 10->22). And again, at the 60th and 40th percentiles (10 percentiles away from the 50th), the regression effect is only 4 percentiles (60->56 and 40->44). Why is this? The answer lies in the shape of the standard normal curve. The change in percentile is the area under the curve between ZY = (0.6)x(ZX) and ZX. That area depends on two things: the width of the interval, which is (0.4)x|ZX| and so grows as ZX moves away from the mean, and the height of the curve over the interval, which is greatest near the mean and falls off in the tails.
This means that very close to the mean, say 0 < z < 0.25, the interval is too narrow to hold much area, and far from the mean, say z > 2, the curve is too low to hold much area. In between, roughly 0.5 < z < 1.5, the interval is reasonably wide and the curve is still fairly high, so the regression in z-score corresponds to a bigger change in percentiles than it does closer to or further from the mean.
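The percentile table in this example can be regenerated with the standard normal curve from Python's `statistics` module. This is a sketch only; `predicted_percentile` is a hypothetical helper name, and because the table rounds the z-scores first, its entries can differ from the computed values by about a percentile at the extremes.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, SD 1

def predicted_percentile(x_pct, r):
    """Percentile of the predicted y for an x at percentile x_pct."""
    z_x = std_normal.inv_cdf(x_pct / 100)  # percentile -> z-score
    z_y = r * z_x                          # regression shrinks the z-score by r
    return 100 * std_normal.cdf(z_y)       # z-score -> percentile

for x_pct in (5, 20, 70, 90):
    y_pct = predicted_percentile(x_pct, r=0.6)
    # The regression effect is the number of percentiles moved towards the 50th
    effect = abs(x_pct - y_pct)
    print(x_pct, round(y_pct), round(effect))
```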
Example 3
Refer to the data from the 1998 Masters Golf Tournament (Round 1 and Round 2):
First Round (x): mean (xbar) = 76.093, SDX = 4.106
Second Round (y): mean (ybar) = 74.186, SDY = 4.272, R = 0.45
What is the probability of a golfer improving his score (scoring lower) on the second round if his first round score was 73?
Solution: First, we need to find the predicted second round score for a first round score of 73.
x= 73
ZX = (73-76.093)/4.106 = -0.7533
ZY = R*(ZX) = (0.45)*(-0.7533) = -0.339
y = 74.186-0.339*4.272 = 72.74.
In order to improve his score, the golfer will need to score a 72 or below. Recall that we are assuming that the distribution of data points around the predicted score (from regression) is normal, with a center at the predicted value and SD = RMS error from the regression line.
RMS = Sqrt(1-R^2)*SDY = Sqrt(1-0.45^2)*4.272 = 3.81506.
So, for a second round score of
y’ = 72.5 (given a first round score of 73; scores are whole numbers, so “72 or below” means below 72.5)
Zy’ = (y’-y)/RMS = (72.5-72.74)/3.81506 = -0.063 from the point of view of the regression.
The probability of scoring less than or equal to 72 is the area under the normal curve to the left of -0.063, which is about 0.475.
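The whole calculation (regression prediction, RMS error, then a normal-curve area around the prediction) can be sketched as follows. It works from the summary statistics quoted in the example without intermediate rounding. Treating “72 or below” as “below 72.5” is an assumption made here because golf scores are whole numbers; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in Example 3
mean_x, sd_x = 76.093, 4.106   # first round
mean_y, sd_y = 74.186, 4.272   # second round
r = 0.45

x = 73
y_hat = mean_y + r * (x - mean_x) / sd_x * sd_y  # predicted second-round score
rms = sqrt(1 - r ** 2) * sd_y                     # spread around the prediction

# Assumption: whole-number scores, so "72 or below" is taken as below 72.5
cutoff = 72.5
p_improve = NormalDist(y_hat, rms).cdf(cutoff)    # area to the left of the cutoff

print(round(y_hat, 2), round(rms, 3), round(p_improve, 3))
```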
Example 4
Refer to the data from the Highway Mileage vs. City Mileage in Cars
City MPG (x): mean (xbar) = 21.624, SDX = 5.787
HWY MPG (y): mean (ybar) = 28.241, SDY = 5.643, R = 0.95
What is the probability that a car with City MPG of 24 will have a HWY MPG of 30 or more?
Solution: As in Example 3 above,
x = 24 city mpg
ZX = (24-21.624)/5.787 = 0.410575
ZY = 0.95*0.410575 = 0.390046
y = 28.241+0.390046*5.643 = 30.442 hwy mpg.
RMS = Sqrt(1-0.95^2)*5.643 = 1.762.
So, for a hwy MPG of
y’ = 30 (given the city MPG of 24)
Zy’ = (30-30.442)/1.762 = -0.251.
The probability of a Hwy MPG of 30 or more is the area under the normal curve to the right of -0.251, which is about 0.60.
Example 5
Refer to the data of Total SAT vs Percentage eligible students taking the test.
Percentage (x): mean (xbar) = 37.35, SDX = 3.92
Total SAT state average (y): mean (ybar) = 1065.5, SDY = 68.5, R = -0.89
For a state which has a percentage of test takers=30, find a 95% confidence interval for the predicted State Total SAT average.
Solution:
x=30
ZX=(30-37.35)/3.92=-1.875.
ZY = -0.89*(-1.875) = 1.66875
y = 1065.5+1.66875*(68.5) = 1179.81.
RMS = Sqrt(1-(-0.89)^2)*68.5 = 31.23
A 95% range = (predicted value)±2*RMS = 1179.81±2*31.23 = (1117.35, 1242.27)
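The range calculation can be sketched as a function that returns the predicted value plus or minus a chosen number of RMS errors. The function name `prediction_range` and its parameter `k` (the number of RMS errors, 2 for a 95% range) are hypothetical; small differences from the hand result come from rounding.

```python
from math import sqrt

def prediction_range(x, mean_x, sd_x, mean_y, sd_y, r, k=2):
    """Predicted y for the given x, plus or minus k RMS errors."""
    y_hat = mean_y + r * (x - mean_x) / sd_x * sd_y  # regression prediction
    rms = sqrt(1 - r ** 2) * sd_y                    # RMS error of the regression
    return y_hat - k * rms, y_hat + k * rms

# Percentage of test takers vs. state Total SAT average
low, high = prediction_range(30, mean_x=37.35, sd_x=3.92,
                             mean_y=1065.5, sd_y=68.5, r=-0.89)
print(round(low, 2), round(high, 2))
```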
Box Models
Example 1
Suppose a $2 scratch-off game has a total of 10000 tickets to be sold with the following odds
1 in 10 wins $2
1 in 100 wins $7
1 in 1000 wins $50
1 in 2000 wins $100
1 in 10000 wins $1000
What is the expected return of a single ticket from this game?
Solution: The box model has 10000 total tickets. If randomly chosen, 1 in 10 tickets should be a $2 winner, so there ought to be 10000/(10) = 1000 tickets in the box with a “2” on them. Similarly, there should be
10000/100 = 100 tickets with a “7”
10000/1000 = 10 tickets with a “50”
10000/2000 = 5 tickets with a “100”
10000/10000 = 1 ticket with a “1000”
This sums to 1116 tickets, meaning there will also be 8884 tickets with a “0”.
The expected value of each draw from the box is
= ∑(Probability of drawing a ticket)x(# on the ticket)
= (1/10)x(2) + (1/100)x(7) + (1/1000)x(50) + (1/2000)x100 + (1/10000)x1000
= 0.2 + 0.07 + 0.05 + 0.05 + 0.1= 0.47 ( 47 cents)
So the expected return on each ticket is 0.47-2.00 = -1.53 dollars.
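The box model translates directly into a small lookup of ticket values and counts. This is a minimal sketch; the dictionary layout and variable names are illustrative.

```python
# Box for the $2 scratch-off game: {number on the ticket: how many tickets}
box = {2: 1000, 7: 100, 50: 10, 100: 5, 1000: 1}
total_tickets = 10000
ticket_price = 2.00

# Every remaining ticket is a loser with a "0" on it
box[0] = total_tickets - sum(box.values())

# Expected value of one draw = sum of (probability) x (number on the ticket)
expected_prize = sum(value * count for value, count in box.items()) / total_tickets
net_return = expected_prize - ticket_price

print(box[0], round(expected_prize, 2), round(net_return, 2))
```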
Normal Approx. to Sampling Distributions
Example 1
In a particular city, the average household has 2.1 children with an SD=1.2. If a random sample of 100 households is chosen, what is the probability that the total number of children will be greater than 230?
Solution: Recall that
EV of sum = (average of the box)x(number of draws)= (2.1)x(100)=210
SE of sum =(SD of the Box)x(Sqrt(number of draws))=(1.2)X(Sqrt(100))=12
230 children is (230-210)/12=1.66 SE above the expected value. Assuming the sampling distribution is approximately normally distributed, with mean = EV and SD=SE, the area under the normal curve to the right of 1.66 is 0.048, which is the probability desired.
Example 2
In any one week, a small elementary school office uses an average of 12 bandaids with an SD = 7. In a year of 37 weeks, give a 95% range for the total number of bandaids used.
Solution: As above
EV=(37)x(12)=444
SE=(Sqrt(37))x(7)=42.6.
Assuming an approximately Normal distribution, we get the
95% range = EV±2xSE = 444±2x(42.6) = 444±85.2 = (358.8, 529.2)
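Both examples in this section use the same two formulas for the sum of draws, so they can share one helper. This is a sketch with a hypothetical function name `sum_ev_se`. The probability it prints for Example 1 is close to 0.048; the hand calculation rounds the z-score before looking up the area.

```python
from math import sqrt
from statistics import NormalDist

def sum_ev_se(box_avg, box_sd, draws):
    """EV and SE for the sum of random draws from a box."""
    return box_avg * draws, box_sd * sqrt(draws)

# Example 1: total children in 100 households, chance of more than 230
ev_children, se_children = sum_ev_se(2.1, 1.2, 100)
p_over_230 = 1 - NormalDist(ev_children, se_children).cdf(230)

# Example 2: total bandaids in 37 weeks, 95% range = EV +/- 2 SE
ev_band, se_band = sum_ev_se(12, 7, 37)
range_95 = (ev_band - 2 * se_band, ev_band + 2 * se_band)

print(round(p_over_230, 3))
print(round(range_95[0], 1), round(range_95[1], 1))
```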
Sampling distribution for proportions
Example 1
At a state-funded university, approximately 5 out of every 6 students carries a cell phone. Out of a random sample of 300 students, what is the probability of at least 255 of them having cell phones?
Solution: In this case,
the box={1,1,1,1,1,0}
box avg. = 5/6 or 0.833
SD of the box is Sqrt((1/6)x(5/6))=Sqrt(5)/6=0.373.
EV= 300x(0.833)=249.9~250
SE= Sqrt(300)x(.373)=6.46.
That makes 255 cell phones (255-250)/6.46=.774 SE’s above the EV.
The probability of finding at least 255 cell phones on the 300 students is about 0.22.
Example 2
A vending company knows from previous experience that for any one week in the summer in a certain waterfront city, there is a 3% chance of any one of its snack vending machines being vandalized, which causes the technician to have to spend extra time servicing the machines. In a random sample of 40 machines, what is the probability that 2 (5%) or more machines will have been vandalized?
Solution: With a box of 97 “0” and 3 “1”, box average=0.03
The EV of sum of vandalized machines = 40x(.03)=1.2
The SE=Sqrt[(0.97)x(0.03)x(40)]=1.08.
So two machines is (2-1.2)/1.08 = 0.74 SE’s above the expected value, and the probability of having at least this many machines vandalized is 0.23.
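For a box of 0's and 1's, the SD of the box is Sqrt(p x (1-p)), so both examples reduce to the same few lines. This is a sketch; `prob_count_at_least` is a hypothetical helper, and it uses the normal approximation exactly as the solutions do.

```python
from math import sqrt
from statistics import NormalDist

def prob_count_at_least(count, p, n):
    """Normal-approximation chance of at least `count` 1's in n draws from a 0/1 box."""
    box_sd = sqrt(p * (1 - p))  # SD of a 0/1 box with a fraction p of 1's
    ev = n * p                  # EV of the number of 1's
    se = sqrt(n) * box_sd       # SE of the number of 1's
    return 1 - NormalDist(ev, se).cdf(count)

p_phones = prob_count_at_least(255, p=5 / 6, n=300)   # Example 1
p_vandal = prob_count_at_least(2, p=0.03, n=40)       # Example 2

print(round(p_phones, 2), round(p_vandal, 2))
```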
Confidence Intervals for proportions
Example 1
In a random sample of 200 adults (18yrs +) from a large city, 24% did not know whether they had ever contracted measles in their lifetime. Give a 95% range for the proportion of the adult population, in this city, that doesn’t know whether they have ever had the measles.
Solution: In the case of proportions, the sample proportion is our expected value. The only other thing we need to construct the range is the SE = Sqrt[(0.24)x(0.76)÷200] = 0.030. So the range desired is about 0.24±2x(0.030) = (0.18, 0.30).
Example 2
In a small city, from a random sample of 75 families, with children between ages 2-15 yrs, 21 of the families had at least one child visit a public library in the past week. Give a 90% range for the proportion of such families(with children as above), in the city, that had at least one child visit the public library.
Solution:
The EV of the proportion of the population= proportion of the sample= 21/75=0.28,
SE = Sqrt[(0.28)x(0.72)÷75] = 0.052.
So the 90% range is 0.28±1.65x(0.052) = (0.194, 0.366).
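Both ranges follow the same recipe: sample proportion plus or minus a multiple of the SE. The sketch below uses a hypothetical function `proportion_range`, with the multiplier `z` passed in (2 for the 95% range and 1.65 for the 90% range, as in the solutions).

```python
from math import sqrt

def proportion_range(p_hat, n, z):
    """Sample proportion plus or minus z standard errors."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # SE of a sample proportion
    return p_hat - z * se, p_hat + z * se

measles = proportion_range(0.24, 200, z=2)       # Example 1, 95% range
library = proportion_range(21 / 75, 75, z=1.65)  # Example 2, 90% range

print(tuple(round(v, 3) for v in measles))
print(tuple(round(v, 3) for v in library))
```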
Sampling Distribution of a Sample Mean
Example 1
A health care company, which owns a number of hospitals, found that among their RN’s (registered nurses), the average work week = 45 hours and the SD = 3.5 hours. If a random sample of 49 nurses is drawn from the population of those employed at this company, what is the chance that the sample mean is at least 46 hours?
Solution: Recall that the sampling distribution of the mean(if you took many samples and looked at the distribution of the means for each sample) is approximately normal, with
EV= population (box) mean= 45 hrs
SE=(box SD)/Sqrt(number of draws)=3.5/Sqrt(49)= 0.5hrs.
For x= 46 hrs
Zx= (46-45)/0.5=2 SE’s above the EV.
Without looking at the table, we already know that this sample average, for a sample of size 49, corresponds to the right side limit (EV + 2*SE) of the 95% range for the sample average. This means the probability of having the sample average being greater than or equal to this is (1-0.95)/2=0.025.
Example 2
The owner of a large, long-standing, local bar and grill has a pretty good estimate for the average bar bill for a customer on Friday night = $7 with SD = $4. In a sample of 36 customers on a Friday night, what is the probability that the average bar bill will be above $10? What is the probability that a randomly chosen individual customer will have a bar bill above $10?
Solution: As above, the distribution of the sample mean will be approximately normal with an
EV = 7, SE = 4/Sqrt(36) = 0.667.
So, for an average bar bill per customer of
x = $10
Zx = (10-7)/0.667 = 4.5
The chance of having an average at least that large, from a random sample of 36 customers, is less than .001. When considering the bar bill of only a single randomly selected customer, if we can assume the distribution of these to be approximately normal, we do a similar calculation, but with SD = 4. So for
x=$10
Zx=(10-7)/4=0.75.
The probability of a randomly selected individual customer having a bar bill of at least $10 is .227.
*Notice the large difference between the result for the average of a sample of 36 customers and that for an individual customer. Notice also that our assumption of an approximately normal distribution of the bar bill per customer has some difficulties, as it will not be completely symmetric: customers don’t usually have negative bar bills! (mean - 2*SD = 7 - 8 = -1 < 0)*
Example 3
Suppose in a city with 42000 senior citizens, the average monthly prescription drug bill (before insurance) is $75 with SD = $36. For a random sample of 144 senior citizens, give a 95% range for the sample mean of monthly prescription drug costs.
Solution:
SE=36/Sqrt(144)=3,
95% range = EV±2xSE=75±6=(69,81).
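The examples in this section all rest on SE = (box SD)/Sqrt(number of draws). The sketch below, with a hypothetical helper `prob_mean_at_least`, applies it to the nurse and bar bill examples; a "sample" of size 1 gives the single-customer case. Note that the exact normal area to the right of 2 SE's is about 0.023, slightly less than the 0.025 obtained from the 95% rule of thumb.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_at_least(cutoff, box_avg, box_sd, n):
    """Normal-approximation chance that the mean of n draws is at least `cutoff`."""
    se = box_sd / sqrt(n)  # SE of the sample mean
    return 1 - NormalDist(box_avg, se).cdf(cutoff)

p_nurses = prob_mean_at_least(46, box_avg=45, box_sd=3.5, n=49)
p_bar_average = prob_mean_at_least(10, box_avg=7, box_sd=4, n=36)
p_bar_single = prob_mean_at_least(10, box_avg=7, box_sd=4, n=1)

print(round(p_nurses, 3), p_bar_average < 0.001, round(p_bar_single, 3))

# Example 3: 95% range for the sample mean of 144 drug bills
se_drug = 36 / sqrt(144)
range_95 = (75 - 2 * se_drug, 75 + 2 * se_drug)
print(range_95)
```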
Hypothesis Testing
Example 1
An airline wants to know how the flight from one location to another is doing. In order for this route to be considered financially viable, an average occupancy of the flight needs to be at least 65%. In a random sample of 50 flights, the mean occupancy was 59% with an SD=11%. Is this route financially viable?
Solution: To set up a test of hypothesis, since the condition we want to check is whether the average occupancy (µoccup) ≥ 65%,
H0: µoccup ≥ 0.65.
Ha: µoccup < 0.65. (exhaustive of all outcomes and mutually exclusive of H0)
Next, we construct the test statistic, using the SE of the sample mean, SE = SD/Sqrt(number of draws) = 0.11/Sqrt(50) = 0.01556:
z = (xbaroccup-0.65)/SE = -0.06/0.01556 = -3.86.
This test statistic corresponds to a p-value (area under the standard normal curve to the left of z = -3.86) of less than 0.001. This is a very small p-value, so we reject H0, and thus conclude this airline route does not meet the average occupancy rate needed to be financially viable.
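A one-sided z-test for a mean can be sketched in a few lines. The sketch assumes the standard error of the sample mean is SD/Sqrt(n), as in the section on the sampling distribution of a sample mean; the function name `z_test_lower_tail` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_lower_tail(sample_mean, null_mean, sd, n):
    """z statistic and lower-tail p-value for H0: mean >= null_mean."""
    se = sd / sqrt(n)                    # SE of the sample mean
    z = (sample_mean - null_mean) / se   # SE's between observed and hypothesized mean
    p_value = NormalDist().cdf(z)        # area to the left of z
    return z, p_value

# Airline example: 50 flights, mean occupancy 0.59, SD 0.11, H0 mean 0.65
z, p_value = z_test_lower_tail(0.59, 0.65, 0.11, 50)
print(round(z, 2), p_value < 0.001)
```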
Example 2
Suppose in a large city, it is estimated that at least 70% of the workers at a galleria mall used some sort of public transportation to arrive there. From a random sample of 70 workers, 45 had used some sort of public transportation. Is the estimate acceptable?
Solution: If we let ppubtran be the proportion of the population of galleria workers who use public transportation, then
H0: ppubtran≥0.7 and Ha:ppubtran<0.7.
p = 45/70 = 0.64, the observed sample proportion
SE = Sqrt(0.64*0.36/70) = 0.057.
z = (0.64-0.70)/0.057 = -1.05 → p-value ≈ 0.147.
We should not be too hasty to reject H0 based on such a p-value, so we do not reject the estimate: the degree to which the observed sample proportion falls below the estimated population proportion could reasonably be due to chance.
T-test
Example 1
A restaurant chain gets freshly squeezed orange juice from a supplier. The chain suspects that the juice is being watered down before it arrives at the restaurants. The pH of freshly squeezed orange juice is about 3.5. A random sample of 20 shipments is tested for pH, finding a mean of 3.65 and an SD=0.09 Does the restaurant chain have something to address with the supplier?
Solution: Because adding water to juice would lower the acidity, and hence raise the pH, we are only interested in the case where the pH is above 3.5. Hence, we will pick
H0: µpH ≤ 3.5, which will constitute acceptable freshly squeezed orange juice.
Ha: µpH > 3.5, indicating something might be wrong with the juice.
Since we are dealing with a small sample size, the t-test is employed. With SE = SD/Sqrt(sample size) = 0.09/Sqrt(20) = 0.0201, the test statistic will be
t = (3.65-3.5)/0.0201 = 7.45.
For a sample size of 20, we have df = 19,
t = 7.45 → p-value of less than 0.001.
Since this is less than 0.05, we reject H0 and assume µpH > 3.5. While there may be a perfectly acceptable explanation as to why the pH is higher than expected, it would be prudent of the restaurant chain to contact the orange juice supplier with these findings.
Example 2
A commercial real estate salesman is selling a small hotel. In the information to the seller, the owner claims a steady average occupancy of 75% by month. An interested buyer requests previous occupancy records over the last five years. From a random sample of 17 months, the prospective buyer calculates an average monthly occupancy rate of 70% with an SD = 4%. Does the owner’s claim seem valid?
Solution: Since the prospective buyer really doesn’t mind if occupancy is greater than 75%, he really only needs to check if it is less than 75%.
Let H0: µoccup ≥ 0.75, then Ha: µoccup < 0.75.
Again, since our sample size is so small, the t-test is employed, with df = 16 and SE = SD/Sqrt(sample size) = 4/Sqrt(17) = 0.970 (in percentage points).
t = (70-75)/0.970 = -5.15 → a p-value of less than 0.001.
This is small enough to reject H0, so the occupancy records do not support the owner’s claim of a 75% average.
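Python's standard library has no t-table, so the sketch below stands in for one by integrating the t curve numerically (Simpson's rule). This is a substitute technique for illustration, not the table lookup used in the course. It assumes the test statistic is (sample mean - hypothesized mean)/(SD/Sqrt(n)); all function names are hypothetical.

```python
from math import gamma, pi, sqrt

def t_density(t, df):
    """Height of Student's t curve with df degrees of freedom at t."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t (t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3  # half the curve lies to the right of 0

def t_statistic(sample_mean, null_mean, sd, n):
    """Number of SE's between the sample mean and the hypothesized mean."""
    return (sample_mean - null_mean) / (sd / sqrt(n))

# Orange juice: mean pH 3.65, SD 0.09, n = 20, H0 mean 3.5 (upper tail)
t_juice = t_statistic(3.65, 3.5, 0.09, 20)
p_juice = t_upper_tail(t_juice, df=19)

# Hotel: mean 70, SD 4, n = 17, H0 mean 75 (lower tail, found by symmetry)
t_hotel = t_statistic(70, 75, 4, 17)
p_hotel = t_upper_tail(abs(t_hotel), df=16)

print(round(t_juice, 2), p_juice < 0.001)
print(round(t_hotel, 2), p_hotel < 0.001)
```

As a check against a printed t-table, `t_upper_tail(1.729, 19)` should come out close to 0.05.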
Two Sample Hypothesis
Example 1
100 male subjects were given an initial strength test for chest, shoulder and arm strength, where a score from 0-50 was assigned based upon the results of the test. The 100 were then randomly divided into two groups evenly. All the subjects in the first group participated in a strength training program using free weights, while all in the second group participated in a strength training program using isometrics. After eight weeks, the strength test was administered again, and the difference between the first and second test scores, 2nd-1st were recorded. From this data, xbarfree= 16 with SDfree= 3 and xbariso=12 with SDiso= 4. Does one of these two training methods tend to produce a bigger increase in strength than the other?
Solution: This study is a two stage process:
1) you measure “before” and “after”, from which you get a measure of effectiveness by calculating the difference “after-before”, and
2) you compare the two independent samples of these differences by looking at their means and SD’s.
To answer the question above, we will hypothesize that the true average “strength increase” due to free weight training is equal to that for isometric training. That is,
H0: µfree = µiso, with Ha: µfree ≠ µiso.
For the data collected, our test statistic will be
ztest = (xbarfree-xbariso)/SEdiff,
where for two independent samples of 50 subjects each, the standard error of the difference is
SEdiff = Sqrt(SDfree^2/50 + SDiso^2/50) = Sqrt(9/50 + 16/50) = Sqrt(0.5) = 0.707.
ztest = (16-12)/0.707 = 5.66 → p-value of less than 0.001 (doubling the one-tail area, since this is a two-sided test).
Hence, we see a statistically significant difference (in our data) in increase in strength between these two training methods, with free weights producing the bigger average increase.
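A two-sample z-test for a difference in means can be sketched as below. It assumes two independent samples and takes the standard error of the difference as Sqrt(SD1^2/n1 + SD2^2/n2); the function name `two_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic and two-sided p-value for H0: the two population means are equal."""
    se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # SE of the difference in means
    z = (mean1 - mean2) / se_diff
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_two_sided

# Free weights vs. isometrics, 50 subjects in each group
z, p_value = two_sample_z(16, 3, 50, 12, 4, 50)
print(round(z, 2), p_value < 0.001)
```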
Example 2
A local family entertainment center is planning to open a second batting cage and needs to purchase another pitching machine. They are considering buying a different model than the one they currently own, but want to compare to see if it is really any better for their needs. The company is able to set the two models against each other at the supplier’s store, model 1 being identical to what they already have and model 2 being the prospective new model. The pitching machines were placed in adjacent batting cages and were adjusted so as to pitch at the same speed and aimed at the center of identical strike zones. Next, each machine was used to pitch 75 balls, where the number of “strikes” was recorded. The results are summarized as xbar1st= 62, xbar2nd=68. Is there a significant difference between the performances, as indicated by proportion of “strikes” pitched, between these two machines?
Solution: The owners of the entertainment center are really only interested in seeing if the other model performs better than the one they already have. Therefore
H0: pexisting ≥ pnew and Ha: pexisting < pnew .
We will construct test statistic similar to that above, except we will be comparing proportions from two independent samples. Recall that the