STP 420 SUMMER 2005

STP 420

INTRODUCTION TO APPLIED STATISTICS

NOTES

PART 1 - DATA

CHAPTER 3

PRODUCING DATA

Introduction

Exploratory data analysis – covered in Ch. 1 & Ch. 2

Use of graphs and numbers to uncover the nature of a data set.

Not good enough on its own to provide convincing evidence for its conclusions

Formal statistical inference – answers specific questions with a known degree of

confidence.

- it uses the descriptive tools given in the previous chapters along with new kinds of reasoning (numerical rather than graphical)

3.1 First Steps

Major questions when trying to produce data.

1. What individuals shall you study?

2. What variables shall you measure?

Designs – arrangements or patterns used to collect data from many individuals

Some questions addressed by designs

1. How many individuals shall you collect data from?

2. How shall you select the individuals to be studied?

3. How shall you form groups where relevant?

Otherwise you may be misled by haphazard or incomplete data, or by confounding (the effects of two or more variables on the same response are mixed together, so you cannot tell which variable produces which effect).

Where to find data: library and internet

Anecdotal evidence – based on haphazardly selected individual cases that come to our attention, but may not be representative of any larger group of cases.

There are many places to find data including:

The annual Statistical Abstract of the United States

US Census Bureau

Website:

Available data – data produced in the past for some other purpose but may help answer a present question

Producing new data is expensive and available data is used whenever possible.

Statistical designs used for producing data rely on sampling or experiments.

Sample – group of individuals/subjects from which data is gathered and is representative of a larger body or population

Census – when information is gathered from the whole population

- time consuming and expensive, hence, the reason samples are used instead

Observational study – observes individuals and measures variables of interest but does not influence the responses

Experiment – deliberately imposes treatment on individuals in order to observe their responses

3.2 Design of Experiments

Experimental units – individuals/subjects on which experiment is done

Treatment – specific experimental condition applied to the units

Factors – explanatory variables in an experiment

Level of a factor – specific value of a factor/variable

Placebo – dummy treatment

In principle, experiments can give good evidence for causation.

Comparative experiments

Simple design with only a single treatment is:

TreatmentObserve response

Control group – group that is given the dummy treatment (placebo)

It is more beneficial to have an experiment with at least two groups: one given the treatment and the other given the dummy treatment.

The design of a study is biased if it systematically favors certain outcomes.

Randomization – use of chance to divide experimental units into groups

Principle of Experimental Design

The basic principles of statistical design of experiments are

1. Control of the effects of lurking variables on the response, most simply by comparing several treatments.

2. Randomization, the use of impersonal chance to assign experimental units to treatments.

3. Replication of the experiment on many units to reduce chance variation in the results.

Statistically significant – observed effect so large that it would rarely occur by chance.

How to randomize

Drawing names/numbers in a hat

Table of random digits

Computer software

Random digits

Table of random digits – list of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 that has the following properties:

1. The digit in any position in the list has the same chance of being any one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

2. The digits in different positions are independent in the sense that the value of one has no influence on the value of any other.
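
A minimal sketch (not from the notes) of how computer software can stand in for a table of random digits, using Python's random module; the seed value is arbitrary and only there to make the output reproducible.

import random

random.seed(420)  # arbitrary seed, used only so the "table" can be reproduced

# Property 1: each position is equally likely to hold any digit 0-9.
# Property 2: positions are generated independently of one another.
line_of_digits = "".join(str(random.randint(0, 9)) for _ in range(40))
print(line_of_digits)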

Completely randomized design – experimental design in which all experimental units are allocated at random among all treatments
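
A short illustrative sketch of a completely randomized design (the units and group sizes are made up): chance alone decides which units receive the treatment and which receive the placebo.

import random

random.seed(1)                      # arbitrary seed for reproducibility
units = list(range(1, 21))          # experimental units labeled 1-20
random.shuffle(units)               # put the labels in random order

treatment_group = sorted(units[:10])   # first 10 labels get the treatment
control_group = sorted(units[10:])     # remaining 10 get the placebo
print("Treatment group:", treatment_group)
print("Control group:  ", control_group)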

Cautions about experimentation

Double-blind – neither the subjects nor the researcher knows which subject got which treatment.

Lack of realism – subjects or treatments or setting of an experiment may not realistically duplicate the conditions we really want to study.

Matched pairs designs

Can produce more precise results than a completely randomized design

Uses principles of comparison of treatments, randomization, and replication on several experimental units

Is an example of block design
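
A rough sketch of matched pairs randomization (the pair labels are invented): within each pair, a coin flip decides which member gets the treatment and which gets the placebo.

import random

random.seed(2)  # arbitrary seed
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2"), ("D1", "D2")]  # made-up pairs

for first, second in pairs:
    # within each pair, chance decides who gets the real treatment
    if random.random() < 0.5:
        print(f"treatment: {first}   placebo: {second}")
    else:
        print(f"treatment: {second}   placebo: {first}")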

Block designs

Block – group of experimental units or subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.

In a block design, the random assignment of units to treatments is carried out separately within each block.
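
A small sketch of a block design with invented blocks (men and women): the random assignment to the two treatments is carried out separately within each block.

import random

random.seed(3)  # arbitrary seed
blocks = {
    "men":   ["M1", "M2", "M3", "M4"],
    "women": ["W1", "W2", "W3", "W4"],
}

for block_name, members in blocks.items():
    random.shuffle(members)            # randomize within this block only
    half = len(members) // 2
    print(block_name,
          "Treatment A:", sorted(members[:half]),
          "Treatment B:", sorted(members[half:]))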

3.3 Sampling Design

Population – entire group of individuals that we want information about

Sample - a part of the population that we actually examine in order to gather information

Sample design – method used to choose the sample from the population

Voluntary response sample – consists of people who choose themselves by responding to a general appeal. Biased because people with strong (often negative) opinions are most likely to respond.

Simple random samples

Simple random sample (SRS) – consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
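
A minimal sketch of drawing an SRS by software (the population size and n are made up): random.sample chooses n labels so that every set of n labels is equally likely to be the sample.

import random

random.seed(4)  # arbitrary seed
population = list(range(1, 31))             # individuals labeled 1-30
srs = sorted(random.sample(population, 5))  # an SRS of size n = 5
print("SRS:", srs)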

Stratified samples

Probability sample – gives each member of the population a known chance (> 0) to be selected

Stratified random sample – first divide the population into groups of similar individuals called strata. Choose a separate SRS in each stratum and combine these SRSs to form the full sample.
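
A sketch of a stratified random sample with invented strata and sample sizes: a separate SRS is taken in each stratum and the SRSs are combined into the full sample.

import random

random.seed(5)  # arbitrary seed
strata = {
    "freshmen":   [f"F{i}" for i in range(1, 101)],   # made-up stratum of 100
    "sophomores": [f"S{i}" for i in range(1, 81)],    # made-up stratum of 80
}
sizes = {"freshmen": 10, "sophomores": 8}              # made-up SRS sizes

full_sample = []
for name, members in strata.items():
    full_sample.extend(random.sample(members, sizes[name]))  # SRS within stratum
print(full_sample)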

Multistage samples

Ex:

1. Select a sample from 3000 counties in the US

2. Select a sample of townships within each of the counties chosen

3. Select a sample of city blocks or other small areas within each chosen township

4. Take a sample of households within each block chosen.

This helps in sampling more randomly across the whole country.
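
A rough sketch of the multistage idea above using made-up place names; each stage takes an SRS within the units chosen at the previous stage (the household stage is omitted for brevity).

import random

random.seed(6)  # arbitrary seed
# Made-up hierarchy: county -> townships -> blocks
counties = {
    f"County{i}": {
        f"Twp{i}.{j}": [f"Block{i}.{j}.{k}" for k in range(1, 6)]
        for j in range(1, 4)
    }
    for i in range(1, 11)
}

for county in random.sample(list(counties), 3):               # stage 1: counties
    for twp in random.sample(list(counties[county]), 2):      # stage 2: townships
        blocks = random.sample(counties[county][twp], 2)       # stage 3: blocks
        print(county, twp, blocks)
# Stage 4 (households within each chosen block) would repeat the same pattern.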

Cautions about sample surveys

Undercoverage – occurs when some groups in the population are left out of the process of choosing the sample.

Nonresponse – occurs when an individual chosen for the sample can’t be contacted or does not cooperate.

Response bias – caused by the behavior of the respondent/subject or the interviewer

Wording of questions – most important influence on the answers given to a sample survey. Confusing or leading questions can introduce strong bias and may even change the outcome of a survey.

3.4 Toward Statistical Inference

Statistical inference – using data from a sample in order to draw conclusions about the wider population

Parameter – a number that describes the population. It is fixed but unknown.

Statistic – a number that describes a sample.

It is known when we take the sample but may vary from sample to sample.

We use a sample statistic to estimate a population parameter

Sampling variability – the value of a statistic varies from sample to sample; for example, the mean of one sample differs from the mean of another

Simulation – using random digits from a table or computer software to imitate chance behavior since taking many samples may be expensive and time consuming

Sampling distribution of a statistic – the distribution of values taken by the statistic in all possible samples of the same size from the same population. Usually approximately normal.
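
A simulation sketch (population values and sizes are made up): many samples of the same size are drawn from the same population, and the collection of sample means approximates the sampling distribution of the sample mean.

import random
import statistics

random.seed(7)  # arbitrary seed
population = [random.gauss(100, 15) for _ in range(10_000)]   # made-up population

sample_means = [statistics.mean(random.sample(population, 25))
                for _ in range(1_000)]                        # 1000 samples, n = 25

print("center of sampling distribution:", round(statistics.mean(sample_means), 2))
print("spread of sampling distribution:", round(statistics.stdev(sample_means), 2))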

The bias of a statistic

Unbiased estimator – mean of its sampling distribution is equal to the true value of the parameter being estimated

The variability of a statistic – described by the spread of its distribution. The spread is determined by the sampling design and the sample size n. Larger samples have smaller spreads.

As long as the population is much larger than the sample (>= 10 times as large), the spread of the sampling distribution for a sample of fixed size n is approximately the same for any population size.
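
A sketch (with a made-up population) showing that the spread of the sampling distribution shrinks as the sample size n grows, provided the population is much larger than the sample.

import random
import statistics

random.seed(8)  # arbitrary seed
population = [random.gauss(50, 10) for _ in range(50_000)]   # made-up population

for n in (10, 100, 1000):
    means = [statistics.mean(random.sample(population, n)) for _ in range(500)]
    print(f"n = {n:4d}: spread (std dev) of sample means = {statistics.stdev(means):.3f}")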

Bias and variability

High bias, low variability

Off target but close together

Low bias, high variability

On target but spread out

High bias, high variability

Off target and spread out

Low bias, low variability

On target and close together

Why randomize

It guarantees that the results of analyzing our data are subject to the laws of probability.

It eliminates bias

The shape of the sampling distribution is usually approximately normal and its center lies at the true value of the parameter.