STP 420 SUMMER 2005
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 3
PRODUCING DATA
Introduction
Exploratory data analysis – covered in Ch. 1 and Ch. 2
Use of graphs and numbers to uncover the nature of a data set.
Not good enough to provide convincing evidence for its conclusions
Formal statistical inference – answers specific questions with a known degree of confidence.
- it uses the descriptive tools given in the previous chapters along with new kinds of reasoning (numerical rather than graphical)
3.1 First Steps
Major questions when trying to produce data.
1. What individuals shall you study?
2. What variables shall you measure?
Designs – arrangements or patterns used to collect data from many individuals
Some questions addressed by designs
1. How many individuals shall you collect data from?
2. How shall you select the individuals to be studied?
3. How shall you form groups where relevant?
Without a sound design, you may be misled by haphazard or incomplete data, or by confounding (the effects of several variables on the same response are mixed together, so you cannot tell which variable has which effect on the response).
Where to find data: library and internet
Anecdotal evidence – based on haphazardly selected individual cases that come to our attention, but may not be representative of any larger group of cases.
There are many places to find data including:
The annual Statistical Abstract of the United States
US Census Bureau
Website:
Available data – data produced in the past for some other purpose but may help answer a present question
Producing new data is expensive, so available data are used whenever possible.
Statistical designs used for producing data rely on sampling or experiments.
Sample – group of individuals/subjects from which data are gathered, chosen to represent a larger body or population
Census – when information is gathered from the whole population
- time consuming and expensive, hence, the reason samples are used instead
Observational study – observes individuals and measures variables of interest but does not influence the responses
Experiment – deliberately imposes some treatment on individuals in order to observe their responses
3.2 Design of Experiments
Experimental units – individuals/subjects on which experiment is done
Treatment – specific experimental condition applied to the units
Factors – explanatory variables in an experiment
Level of a factor – specific value of a factor/variable
Placebo – dummy treatment
In principle, experiments can give good evidence for causation.
Comparative experiments
A simple design with only a single treatment is:
Treatment → Observe response
Control group – group that is given the dummy treatment (placebo)
It is more beneficial to have an experiment with at least two groups; one given the treatment and the other given the dummy treatment.
The design of a study is biased if it systematically favors certain outcomes.
Randomization – use of chance to divide experimental units into groups
Principles of Experimental Design
The basic principles of statistical design of experiments are
1. Control of the effects of lurking variables on the response, most simply by comparing several treatments.
2. Randomization, the use of impersonal chance to assign experimental units to treatments.
3. Replication of the experiment on many units to reduce chance variation in the results.
Statistically significant – observed effect so large that it would rarely occur by chance.
How to randomize
Drawing names/numbers in a hat
Table of random digits
Computer software
Random digits
Table of random digits – list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:
1. The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
2. The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.
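A table of random digits can be used to choose individuals by labeling them and reading off groups of digits. The following is a minimal sketch in Python, assuming 40 individuals labeled 01 to 40 and a sample of 5. The digit string is simply typed in for illustration (in practice it would come from a line of the table), and the function name is hypothetical.

```python
def choose_with_random_digits(digits, n_labels, n_choose):
    """Read two-digit groups; keep labels 01..n_labels, skipping repeats."""
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        # Ignore groups that are not valid labels or were already chosen.
        if 1 <= label <= n_labels and label not in chosen:
            chosen.append(label)
        if len(chosen) == n_choose:
            return chosen
    raise ValueError("ran out of digits before choosing enough labels")

# Illustrative digits only (an assumption, not a real table line).
line = "27103948150662311907"
print(choose_with_random_digits(line, n_labels=40, n_choose=5))
# prints [27, 10, 39, 15, 6]
```

The groups 48 is skipped because it is not a label in use; reading stops as soon as five labels have been chosen.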
Completely randomized design – experimental design in which all experimental units are allocated at random among all treatments
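With computer software, a completely randomized design amounts to shuffling the units and dealing them out to the treatments. A minimal sketch in Python, assuming 10 hypothetical subjects and two groups; the function name, subject labels, and seed are made up for illustration.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Randomly allocate units among treatments in (nearly) equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    # Deal the shuffled units out to the treatments in turn.
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical example: 10 subjects, treatment versus placebo.
subjects = [f"subject_{i:02d}" for i in range(1, 11)]
assignment = completely_randomized_design(subjects, ["treatment", "placebo"], seed=420)
for name, members in assignment.items():
    print(name, sorted(members))
```

Fixing the seed makes the allocation reproducible; leaving it as None gives a fresh random allocation each run.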
Cautions about experimentation
Double-blind – neither the subjects nor the researchers know which subject got which treatment.
Lack of realism – subjects or treatments or setting of an experiment may not realistically duplicate the conditions we really want to study.
Matched pairs designs
Can produce more precise results than a completely randomized design
Uses principles of comparison of treatments, randomization, and replication on several experimental units
Is an example of a block design
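In a matched pairs design, chance decides which member of each pair gets the treatment. A minimal sketch in Python, assuming four hypothetical pairs of similar subjects; the names and the function are illustrative only.

```python
import random

def matched_pairs_assignment(pairs, seed=None):
    """For each matched pair, toss a fair coin to decide who gets the treatment."""
    rng = random.Random(seed)
    plan = []
    for first, second in pairs:
        if rng.random() < 0.5:
            plan.append({"treatment": first, "control": second})
        else:
            plan.append({"treatment": second, "control": first})
    return plan

# Hypothetical pairs, matched on some characteristic before the experiment.
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2"), ("D1", "D2")]
plan = matched_pairs_assignment(pairs, seed=11)
for row in plan:
    print(row)
```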
Block designs
Block – group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.
In a block design, the random assignment of units to treatments is carried out separately within each block.
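A minimal sketch of randomizing separately within each block, in Python. The blocks (men and women), the unit labels, and the function name are assumptions made for illustration.

```python
import random

def randomized_block_design(units_by_block, treatments, seed=None):
    """Shuffle and assign treatments separately inside each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # randomization happens within this block only
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment

# Hypothetical blocks of four units each.
blocks = {"men": ["M1", "M2", "M3", "M4"], "women": ["W1", "W2", "W3", "W4"]}
plan = randomized_block_design(blocks, ["treatment", "placebo"], seed=3)
for unit, (block, treatment) in sorted(plan.items()):
    print(unit, block, treatment)
```

Because each block is shuffled on its own, every block ends up with the same number of units on each treatment, so the blocking variable cannot be confounded with the treatment.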
3.3 Sampling Design
Population – entire group of individuals that we want information about
Sample – a part of the population that we actually examine in order to gather information
Sample design – method used to choose the sample from the population
Voluntary response sample – consists of people who choose themselves by responding to a general appeal. Biased because people with strong opinions, especially negative ones, are the most likely to respond.
Simple random samples
Simple random sample (SRS) – consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
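In software, an SRS can be drawn by sampling without replacement from a list of the population. A minimal sketch in Python using the standard library's `random.sample`; the population of 30 labeled individuals and the function name are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n individuals without replacement; every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical population of 30 labeled individuals.
population = [f"person_{i:02d}" for i in range(1, 31)]
srs = simple_random_sample(population, n=5, seed=2005)
print(srs)
```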
Stratified samples
Probability sample – gives each member of the population a known chance (> 0) to be selected
Stratified random sample – first divide the population into groups of similar individuals called strata. Choose a separate SRS in each stratum and combine these SRSs to form the full sample.
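A minimal sketch of a stratified random sample in Python: one SRS per stratum, then combine. The strata (students grouped by class year), their sizes, and the per-stratum sample sizes are assumptions made for illustration.

```python
import random

def stratified_random_sample(strata, n_per_stratum, seed=None):
    """Take a separate SRS inside each stratum and combine them."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(list(members), n_per_stratum[name]))
    return sample

# Hypothetical strata: students grouped by class year.
strata = {
    "freshman": [f"F{i}" for i in range(1, 41)],
    "senior": [f"S{i}" for i in range(1, 21)],
}
sample = stratified_random_sample(strata, {"freshman": 4, "senior": 2}, seed=8)
print(sample)
```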
Multistage samples
Ex:
1. Select a sample from 3000 counties in the US
2. Select a sample of townships within each of the counties chosen
3. Select a sample of city blocks or other small areas within each chosen township
4. Take a sample of households within each block chosen.
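Sampling in stages makes a nationwide sample practical: a complete list of households is needed only for the blocks chosen at the last stage, not for the whole country.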
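The staged selection can be sketched in Python as repeated random sampling down a nested structure. This is only an illustration under stated assumptions: the frame below is a small made-up hierarchy with three stages (counties, townships, households) rather than the four stages in the example, and all names and sizes are hypothetical.

```python
import random

def multistage_sample(frame, sizes, rng):
    """Sample sizes[0] items at this stage, then recurse into each chosen item.

    `frame` is a nested dict whose innermost values are lists of final units.
    """
    if isinstance(frame, dict):
        chosen_keys = rng.sample(sorted(frame), sizes[0])
        result = []
        for key in chosen_keys:
            result.extend(multistage_sample(frame[key], sizes[1:], rng))
        return result
    # Innermost stage: frame is a list of final units (households).
    return rng.sample(frame, sizes[0])

# Hypothetical frame: 6 counties, 4 townships each, 5 households each.
frame = {
    f"county{c}": {
        f"town{c}.{t}": [f"hh{c}.{t}.{h}" for h in range(5)] for t in range(4)
    }
    for c in range(6)
}
households = multistage_sample(frame, sizes=[3, 2, 2], rng=random.Random(17))
print(households)
```

Here 3 counties, then 2 townships per chosen county, then 2 households per chosen township are selected, giving 12 households.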
Cautions about sample surveys
Undercoverage – occurs when some groups in the population are left out of the process of choosing the sample.
Nonresponse – occurs when an individual chosen for the sample can’t be contacted or does not cooperate.
Response bias – caused by the behavior of the respondent/subject or the interviewer
Wording of questions – most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias and may even change the outcome of a survey.
3.4 Toward Statistical Inference
Statistical inference – using data from a sample to draw conclusions about the wider population
Parameter – a number that describes the population. It is fixed but unknown.
Statistic – a number that describes a sample.
It is known when we take the sample but may vary from sample to sample.
We use a sample statistic to estimate a population parameter
Sampling variability – the value of a statistic varies from sample to sample in repeated random sampling; for example, the mean of one sample differs from the mean of another.
Simulation – using random digits from a table or computer software to imitate chance behavior since taking many samples may be expensive and time consuming
Sampling distribution of a statistic – the distribution of values taken by the statistic in all possible samples of the same size from the same population. For statistics such as sample means and proportions from reasonably large random samples, it is often approximately normal.
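A sampling distribution can be approximated by simulation: draw many samples and record the statistic each time. A minimal sketch in Python, assuming a made-up population of 10,000 individuals in which 60% answer "yes", samples of size 100, and 1000 repetitions.

```python
import random
import statistics

def simulate_sample_proportions(population, n, n_samples, seed=None):
    """Draw n_samples SRSs of size n and return the sample proportion from each."""
    rng = random.Random(seed)
    return [sum(rng.sample(population, n)) / n for _ in range(n_samples)]

# Hypothetical population: 1 means "yes", 0 means "no"; the parameter is p = 0.6.
population = [1] * 6000 + [0] * 4000
p_hats = simulate_sample_proportions(population, n=100, n_samples=1000, seed=1)

center = statistics.fmean(p_hats)
spread = statistics.stdev(p_hats)
print(f"mean of the 1000 sample proportions: {center:.3f}")
print(f"standard deviation:                  {spread:.3f}")
```

The 1000 values only approximate the sampling distribution, which is defined over all possible samples; a histogram of `p_hats` would show its shape.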
The bias of a statistic
Unbiased estimator – mean of its sampling distribution is equal to the true value of the parameter being estimated
The variability of a statistic – described by the spread of its distribution. The spread is determined by the sampling design and the sample size n. Larger samples have smaller spreads.
As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution for a sample of fixed size n is approximately the same for any population size.
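Both claims can be checked by simulation. The sketch below, in Python, estimates the spread of a sample proportion for two sample sizes and two population sizes; the population sizes, the proportion p = 0.5, and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

def spread_of_p_hat(pop_size, n, n_samples=1000, p=0.5, seed=0):
    """Estimate the standard deviation of the sample proportion by simulation."""
    rng = random.Random(seed)
    n_ones = int(pop_size * p)
    population = [1] * n_ones + [0] * (pop_size - n_ones)
    p_hats = [sum(rng.sample(population, n)) / n for _ in range(n_samples)]
    return statistics.stdev(p_hats)

results = {}
for pop_size in (10_000, 100_000):
    for n in (100, 400):
        results[(pop_size, n)] = spread_of_p_hat(pop_size, n)
        print(f"population {pop_size:>7}, n = {n:>3}: spread = {results[(pop_size, n)]:.4f}")
```

The spread should shrink when n grows from 100 to 400, while changing the population size from 10,000 to 100,000 should make little difference for a fixed n.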
Bias and variability
High bias, low variability – off target but close together
Low bias, high variability – on target but spread out
High bias, high variability – off target and spread out
Low bias, low variability – on target and close together
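The four cases can be imitated numerically. In the sketch below (Python), bias is produced by undercoverage, that is, sampling from only half of a made-up population, and variability is controlled by the sample size. The population, the group means, and the sample sizes are all hypothetical.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: two groups of 2500 with different typical values.
population = [rng.gauss(50, 10) for _ in range(2500)] + \
             [rng.gauss(70, 10) for _ in range(2500)]
true_mean = statistics.fmean(population)
covered = population[:2500]  # a sampling frame that leaves out the second group

def bias_and_spread(frame, n, n_samples=500):
    """Bias and standard deviation of the sample mean over repeated samples."""
    means = [statistics.fmean(rng.sample(frame, n)) for _ in range(n_samples)]
    return statistics.fmean(means) - true_mean, statistics.stdev(means)

cases = {
    "high bias, low variability": (covered, 200),
    "low bias, high variability": (population, 10),
    "high bias, high variability": (covered, 10),
    "low bias, low variability": (population, 200),
}
results = {}
for label, (frame, n) in cases.items():
    results[label] = bias_and_spread(frame, n)
    bias, spread = results[label]
    print(f"{label:28s} bias = {bias:7.2f}, spread = {spread:5.2f}")
```

Taking a larger sample from the incomplete frame reduces the spread but leaves the bias in place.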
Why randomize
It guarantees that the results of analyzing our data are subject to the laws of probability.
It eliminates bias in selecting the sample or in assigning units to treatments.
The shape of the sampling distribution is usually approximately normal and its center lies at the true value of the parameter.