STP 420 SUMMER 2005
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 2 – PROBABILITY AND INFERENCE
CHAPTER 6
INTRODUCTION TO INFERENCE
Introduction
We want to be able to draw conclusions from the data collected through a sample.
Statistical inference
1. Confidence intervals – used to estimate a population parameter
2. Tests of significance – used to assess the evidence for a claim
Both are based on the sampling distributions of statistics. They report the probabilities that state what would happen if we used the inference method many times.
Inference is dependent on the probability model of the data and is most reliable when the data are produced by a properly randomized design (random sample or randomized experiment).
6.1 Estimating with confidence
If $\mu = 500$ for a population, and a sample drawn from the population gives a mean of $\bar{x} = 465$, we say that the sample mean 465 is an estimate of the population mean 500. If you take a second sample, you may not get the same mean. It is important to present the variation along with the estimate of the population mean.
Statistical confidence
If repeated samples of size n are taken from a population that has mean $\mu$ and standard deviation $\sigma$, the sample mean satisfies $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$, i.e. the sample mean has mean equal to the population mean and standard deviation equal to the population standard deviation divided by the square root of the sample size. This implies that the bigger the sample size, the smaller the standard deviation of the sample mean, since $\sqrt{n}$ is in the denominator.
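The claim about the sampling distribution can be checked with a small simulation. The sketch below is only illustrative: it assumes a normal population with mean 500 (the value used in the example above), and the standard deviation 100 and sample size 25 are made-up values, not from these notes.

```python
import random
import statistics
from math import sqrt

def simulate_sample_means(mu, sigma, n, reps=5000, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

# Assumed, illustrative values: sigma and n are not from the notes.
mu, sigma, n = 500, 100, 25
means = simulate_sample_means(mu, sigma, n)

print("mean of the sample means:", round(statistics.fmean(means), 2))
print("sd of the sample means:  ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):         ", sigma / sqrt(n))
```

The mean of the simulated sample means should land close to 500, and their standard deviation close to 100/5 = 20.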
Remember the 68-95-99.7 rule ($\mu$ is unknown but $\sigma$ is known). Here "standard deviation" means the standard deviation of the sample mean, $\sigma/\sqrt{n}$.
The probability is 0.68 (68%) that the sample mean will be within 1 standard deviation of the population mean (one standard deviation on each side, an interval 2 standard deviations wide).
The probability is 0.95 (95%) that the sample mean will be within 2 standard deviations of the population mean (two standard deviations on each side, an interval 4 standard deviations wide).
The probability is 0.997 (99.7%) that the sample mean will be within 3 standard deviations of the population mean (three standard deviations on each side, an interval 6 standard deviations wide).
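Eg: if $\bar{x} = 40$ and $\sigma/\sqrt{n} = 4$, then we can say that
$P(\bar{x} - 4 \le \mu \le \bar{x} + 4) = 0.68$, which for $\bar{x} = 40$ gives the interval from 36 to 44
$\bar{x}$ lies within 4 points of $\mu$ is equivalent to $\mu$ is within 4 points of $\bar{x}$
68% of all samples will capture the true $\mu$ in the interval $\bar{x} \pm 4$, here (36, 44)
$P(\bar{x} - 8 \le \mu \le \bar{x} + 8) = 0.95$, which for $\bar{x} = 40$ gives the interval from 32 to 48
$\bar{x}$ lies within 8 points of $\mu$ is equivalent to $\mu$ is within 8 points of $\bar{x}$
95% of all samples will capture the true $\mu$ in the interval $\bar{x} \pm 8$, here (32, 48)
$P(\bar{x} - 12 \le \mu \le \bar{x} + 12) = 0.997$, which for $\bar{x} = 40$ gives the interval from 28 to 52
$\bar{x}$ lies within 12 points of $\mu$ is equivalent to $\mu$ is within 12 points of $\bar{x}$
99.7% of all samples will capture the true $\mu$ in the interval $\bar{x} \pm 12$, here (28, 52)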
Confidence Intervals – used to estimate a population parameter
A 68% confidence interval for $\mu$ is $\bar{x} \pm 4$ or (36, 44)
A 95% confidence interval for $\mu$ is $\bar{x} \pm 8$ or (32, 48)
A 99.7% confidence interval for $\mu$ is $\bar{x} \pm 12$ or (28, 52)
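The three intervals above can be reproduced in a few lines. This is a minimal sketch using the example's sample mean of 40 and a standard deviation of the sample mean of 4; the function names are hypothetical. It also prints the exact normal areas, which the 68-95-99.7 rule rounds.

```python
from statistics import NormalDist

def interval(estimate, margin):
    """Return (estimate - margin, estimate + margin)."""
    return (estimate - margin, estimate + margin)

def area_within(k):
    """Area under N(0, 1) between -k and k standard deviations."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)

x_bar, sd_of_mean = 40, 4  # values from the example in the notes
for k in (1, 2, 3):
    print(k, "sd:", interval(x_bar, k * sd_of_mean),
          "exact area:", round(area_within(k), 4))
```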
estimate $\pm$ margin of error
where
margin of error – shows how accurate we believe our guess is at the chosen confidence level (68%, 95%, 99.7%, etc.); it depends on the variability of the estimate.
Confidence level (C) – how confident we are that the procedure will capture the true population mean
A level C confidence interval for a parameter is an interval computed from sample data by a method that has probability C of producing an interval containing the true value of the parameter.
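The phrase "a method that has probability C of producing an interval containing the true value" can be illustrated by simulation: draw many samples, build an interval from each, and count how often the true mean is captured. The sketch below assumes a normal population with known standard deviation; the values 500, 100, and 25 and the function name `coverage` are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, level, reps=4000, seed=1):
    """Fraction of simulated level-`level` z intervals that contain mu."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    margin = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / reps

# Assumed, illustrative population: mu = 500, sigma = 100, samples of size 25.
print(coverage(mu=500, sigma=100, n=25, level=0.95))
```

The printed fraction should be close to 0.95. Any single interval either contains the true mean or does not; the 95% describes the method over many samples.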
Confidence Interval for a population mean
Choose an SRS of size n from a population having unknown mean $\mu$ and known standard deviation $\sigma$. A level C confidence interval for $\mu$ is
$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
where
$z^*$ – value on the standard normal curve with area C between $-z^*$ and $z^*$
Exact interval if population distribution is normal.
The interval is approximately correct for large n when the population distribution is not normal.
In general: estimate $\pm$ $z^* \sigma_{\text{estimate}}$
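As a worked illustration of the interval for a population mean, the sketch below finds z* from the standard normal distribution and builds the interval. The sample mean 465 comes from the earlier example; the standard deviation 100 and sample size 25 are assumed values, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_star(level):
    """z* such that the area under N(0, 1) between -z* and z* equals `level`."""
    return NormalDist().inv_cdf((1 + level) / 2)

def z_interval(x_bar, sigma, n, level=0.95):
    """Level-`level` confidence interval for the mean, sigma known."""
    margin = z_star(level) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# sigma = 100 and n = 25 are assumptions made for this example only.
low, high = z_interval(x_bar=465, sigma=100, n=25, level=0.95)
print("z* for C = 0.95:", round(z_star(0.95), 3))
print("95% interval:", (round(low, 1), round(high, 1)))
```

With these numbers the margin of error is about 1.96 x 20 = 39.2, so the interval runs from roughly 425.8 to 504.2.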
How Confidence Intervals behave
We can reduce margin of error by
1. Using a smaller C (confidence level)
2. Increasing n
3. Reducing $\sigma$
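The three ways of reducing the margin of error can be compared numerically. In this sketch the baseline (standard deviation 100, n = 25, C = 0.95) is an assumed example, and `margin_of_error` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, level):
    """Margin of error z* * sigma / sqrt(n) for a z interval."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    return z_star * sigma / sqrt(n)

# Assumed baseline: sigma = 100, n = 25, C = 0.95
base = margin_of_error(100, 25, 0.95)
lower_c = margin_of_error(100, 25, 0.90)        # smaller confidence level
bigger_n = margin_of_error(100, 100, 0.95)      # four times the sample size
smaller_sigma = margin_of_error(50, 25, 0.95)   # half the standard deviation

print("baseline:     ", round(base, 2))
print("C = 0.90:     ", round(lower_c, 2))
print("n = 100:      ", round(bigger_n, 2))
print("sigma = 50:   ", round(smaller_sigma, 2))
```

Because n enters through its square root, quadrupling the sample size only halves the margin of error.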
Choosing the sample size
The confidence interval for a population mean will have a specified margin of error m when the sample size is
$n = \left( \frac{z^* \sigma}{m} \right)^2$
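The sample size calculation is easy to script. The sketch below rounds up, since a sample size must be a whole number and rounding down would give a margin of error slightly larger than requested. The standard deviation 100 and the target margins are assumed values for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, m, level=0.95):
    """Smallest n whose z interval has margin of error at most m."""
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    return ceil((z_star * sigma / m) ** 2)

# Assumed example: sigma = 100, 95% confidence
print(sample_size(sigma=100, m=20))  # (1.96 * 100 / 20)^2 is about 96.04
print(sample_size(sigma=100, m=10))  # (1.96 * 100 / 10)^2 is about 384.1
```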
Some cautions
1. Data must be an SRS from the population.
2. The formula is not correct for probability sampling designs more complex than an SRS.
3. There is no correct method for inference from data haphazardly collected with unknown bias. Fancy formulas cannot rescue badly produced data.
4. $\bar{x}$ is not resistant. Outliers can strongly affect confidence intervals and may need to be removed, with justification.
5. If n is small and the population is not normal, the true confidence level will be different from C.
6. You must know the standard deviation $\sigma$ of the population.
Bootstrap – a newer procedure for approximating a sampling distribution when theory cannot tell us its shape. It treats the original sample as a population and draws many samples from it with replacement (each one is a resample).
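A minimal sketch of the bootstrap idea for a sample mean follows. The data values are made up for illustration, and `bootstrap_means` is a hypothetical helper; each resample has the same size as the original sample and is drawn with replacement.

```python
import random
import statistics

def bootstrap_means(sample, reps=2000, seed=2):
    """Means of `reps` resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(reps)]

# Made-up data, for illustration only.
sample = [12, 15, 9, 20, 14, 11, 17, 13]
boot = bootstrap_means(sample)

print("sample mean:           ", statistics.fmean(sample))
print("mean of resample means:", round(statistics.fmean(boot), 3))
print("bootstrap standard deviation of the mean:", round(statistics.stdev(boot), 3))
```

The spread of the resample means approximates the variability of the sample mean without assuming a known population standard deviation.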
6.2 Tests of Significance – used to assess the evidence for a claim
Null hypothesis – statement being tested in a test of significance. Test is designed to assess the strength of the evidence against the null hypothesis. It is a statement of no effect or no difference.
Eg. H0 : $\mu = 187$
Alternative hypothesis – statement we hope or suspect is true instead of H0
Eg. Ha : $\mu > 187$ one sided (right tailed test)
Ha : $\mu < 187$ one sided (left tailed test)
Ha : $\mu \ne 187$ two sided (two tailed test)
A hypothesis always refers to some population or model, not to a particular outcome.
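Test statistic – sample statistic (z) that measures the compatibility between the null hypothesis and the data.
$z = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard deviation of the estimate}}$
where z has the standard normal distribution N(0, 1) when H0 is true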
P-value – the probability, computed assuming that H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed. The smaller the p-value, the stronger the evidence against H0 provided by the data.
Statistical significance – if the p-value is as small as or smaller than $\alpha$, we say that the data are statistically significant at level $\alpha$.
z Test for a population mean
To test the hypothesis H0 : $\mu = \mu_0$ based on an SRS of size n from a population with unknown mean $\mu$ and known standard deviation $\sigma$, compute the test statistic
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
In terms of a standard normal random variable Z, the p-value for a test of H0 against
Ha : $\mu > \mu_0$ is $P(Z \ge z)$
Ha : $\mu < \mu_0$ is $P(Z \le z)$
Ha : $\mu \ne \mu_0$ is $2P(Z \ge |z|)$
These p-values are exact if the population distribution is normal and are approximately correct for large n in other cases (non normal populations)
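The z test can be sketched as a short function that returns the test statistic and the p-value for each kind of alternative. The hypothesized mean 187 is taken from the earlier example; the sample mean 191, standard deviation 12, and sample size 36 are assumed values, and `z_test` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p-value) for a z test of H0: mu = mu0 with sigma known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":      # Ha: mu > mu0
        p = 1 - std_normal.cdf(z)
    elif alternative == "less":       # Ha: mu < mu0
        p = std_normal.cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0
        p = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p

# Assumed data: x_bar = 191, sigma = 12, n = 36; H0: mu = 187
for alt in ("greater", "less", "two-sided"):
    z, p = z_test(191, 187, 12, 36, alternative=alt)
    print(alt, "z =", round(z, 2), "p =", round(p, 4))
```

Here z = (191 - 187) / (12 / 6) = 2, so the right-tailed p-value is about 0.0228 and the two-sided p-value is twice that.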
Confidence Intervals and Two-Sided Tests
A level $\alpha$ two-sided significance test rejects a hypothesis H0 : $\mu = \mu_0$ exactly when the value $\mu_0$ falls outside a level $1 - \alpha$ confidence interval for $\mu$.
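The link between two-sided tests and confidence intervals can be checked numerically: for each hypothesized mean, the test rejects at the 0.05 level exactly when that value lies outside the 95% interval. The data values (sample mean 191, standard deviation 12, n = 36) are assumptions chosen for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def two_sided_p_value(x_bar, mu0, sigma, n):
    """Two-sided p-value for H0: mu = mu0 with sigma known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    return 2 * (1 - Z.cdf(abs(z)))

def confidence_interval(x_bar, sigma, n, level):
    """Level-`level` z interval for the mean."""
    margin = Z.inv_cdf((1 + level) / 2) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Assumed data, for illustration only.
x_bar, sigma, n, alpha = 191, 12, 36, 0.05
low, high = confidence_interval(x_bar, sigma, n, 1 - alpha)
print("95% interval:", (round(low, 2), round(high, 2)))

for mu0 in (187, 188, 191, 195):
    rejected = two_sided_p_value(x_bar, mu0, sigma, n) <= alpha
    outside = not (low <= mu0 <= high)
    print("mu0 =", mu0, "rejected:", rejected, "outside interval:", outside)
```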
P-values (P) versus fixed $\alpha$
p-value – smallest level $\alpha$ at which the data are significant.
Critical value – z* value on the N(0, 1) curve with the specified area to its right/left.
A test is significant when $P \le \alpha$ and not significant otherwise.
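The p-value approach and the critical value approach always reach the same decision. The sketch below shows this for a right-tailed test with an assumed observed statistic z = 2.0; the function names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()

def critical_value(alpha):
    """z* with area alpha to its right under N(0, 1) (right-tailed test)."""
    return Z.inv_cdf(1 - alpha)

def p_value_right(z):
    """P(Z >= z) for a right-tailed test."""
    return 1 - Z.cdf(z)

z = 2.0  # assumed observed test statistic
print("p-value:", round(p_value_right(z), 4))
for alpha in (0.05, 0.01):
    by_p_value = p_value_right(z) <= alpha
    by_critical_value = z >= critical_value(alpha)
    print("alpha =", alpha,
          "z* =", round(critical_value(alpha), 3),
          "significant by p-value:", by_p_value,
          "significant by critical value:", by_critical_value)
```

With z = 2.0 the result is significant at the 0.05 level but not at the 0.01 level, whichever comparison is used.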