COURSE NOTES - STAT 310 – SPRING 2008

1 – Basic Concepts, Terminology, and Types of Studies

1.1 - Basic Definitions and the Statistical Process

Statistics -

General Approach to Statistical Process

The Cycle of Statistical Investigation

More Definitions/Terms

Population –

Parameter –

Sample –

Statistic –

Descriptive Statistics –

Inferential Statistics –

1.2 - Types of Studies (see Powerpoint on website)

Two Main Types of Studies

Observational – researcher collects info on attributes or measurements of interest, but does not influence results.

Experimental – researcher deliberately influences events and investigates the effects of the intervention, e.g. clinical trials and laboratory experiments.

EXPERIMENTAL STUDIES

In this section we will examine some basic experimental designs and design principles, namely:

I. Completely Randomized Designs

II. Blocking and Block Designs

III. People as Experimental Units (e.g. clinical trials)

I. Completely Randomized Design (CRD)

The treatments are allocated entirely by chance to the experimental units.

Example 1: Tomato Plants

Which of two varieties of tomatoes (A & B) yields a greater quantity of market-quality fruit?

Factors that may affect yield: soil fertility; exposure to wind/sun; soil pH levels; soil water content etc. Divide the field into plots and randomly allocate the tomato varieties (treatments) to each plot (unit).

Situation 1:

8 plots – 4 get variety A
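The random allocation in Situation 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the plot numbering (1 to 8), the function name, and the seed are assumptions made here, not part of the original example.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Allocate treatments to units entirely by chance, with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    reps = len(units) // len(treatments)
    # One label per unit: 4 copies of "A" and 4 copies of "B" for 8 plots.
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)  # random order = random allocation
    return dict(zip(units, labels))

plots = list(range(1, 9))  # 8 plots, numbered 1..8 (the numbering is an assumption)
allocation = completely_randomized_design(plots, ["A", "B"], seed=310)
for plot, variety in allocation.items():
    print(f"plot {plot}: variety {variety}")
```

Whatever the seed, exactly 4 plots receive variety A and 4 receive variety B; only which plots get which variety is left to chance.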

Situation 2: What if the field had a slope to it?

8 plots – 4 get variety A

II. Blocking

Situation 2 above illustrates what is referred to in experimental design as blocking. In blocking we group (or block) experimental units by some known factor and then randomize within each block in an attempt to balance out the unknown factors.
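Randomizing within blocks can be sketched as follows. The two blocks ("upper slope" and "lower slope", four plots each) are an assumed way of blocking the sloped field, chosen here only for illustration; the function name and seed are likewise hypothetical.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Randomize treatments separately within each block of units."""
    rng = random.Random(seed)
    allocation = {}
    for block_name, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError(f"block {block_name!r} cannot be split evenly among treatments")
        reps = len(units) // len(treatments)
        labels = [t for t in treatments for _ in range(reps)]
        rng.shuffle(labels)  # a fresh shuffle for every block
        for unit, label in zip(units, labels):
            allocation[unit] = (block_name, label)
    return allocation

# Assumed blocking factor: position on the slope (known to affect growing conditions).
blocks = {"upper slope": [1, 2, 3, 4], "lower slope": [5, 6, 7, 8]}
design = randomized_block_design(blocks, ["A", "B"], seed=310)
for plot, (block, variety) in sorted(design.items()):
    print(f"plot {plot} ({block}): variety {variety}")
```

Each block ends up with two plots of each variety, so neither variety can land entirely on one part of the slope, which a completely randomized allocation cannot guarantee.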

Example 2: Comparing Three Pain Relievers for Headache Sufferers

How could we design an experiment? How could blocking be used to increase precision of our experiment?

Example 3: Race Horse Leg Wraps

•  17 “boots” tested, each boot is tested n = 5 times. Why?

•  Because of time constraints, not all boots were tested on the same day.

•  8 tested 1st day, 5 tested 2nd day, 4 tested 3rd day.

•  Leg was placed in freezer and thawed before the 2nd and 3rd days of testing.

Questions:

What problems do you foresee with this experimental design?

Example 3 – Race Horse Leg Wraps (cont’d)

What actually happened? Below is a plot of the force readings when no wrap was used on the leg during the three days of testing.

What is the implication of the results shown above?

Final Results of Horse Leg Wrap Study

Q: What should have been done?

III. Using People as Experimental Units (Medical Studies)

Example 4: Cholesterol Drug Study

Suppose we wish to determine whether a drug will help lower the cholesterol level of patients who take it.

Q: How should we design our study?

Important Concepts for Experiments with Human Subjects

•  control group:

–  Receive no treatment or an existing treatment

•  blinding:

–  Subjects don’t know which treatment they receive

•  double blind:

–  Subjects and administrators / diagnosticians are blinded

•  placebo:

–  Inert dummy treatment

•  placebo effect:

–  A common response in humans when they believe they have been treated.

–  Approximately 35% of people respond positively to dummy treatments - the placebo effect

OBSERVATIONAL STUDIES

There are two major types of observational studies: prospective studies and retrospective studies.

I. Prospective Studies

Choose samples now, measure variables and follow up in the future, e.g. choose a group of smokers and non-smokers now and observe their health in the future.

II. Retrospective Studies

Look back at the past, e.g. a case-control study.

Separate samples for cases and controls (non-cases). Why?

Look back into the past and compare histories. For example, we could choose two groups: lung cancer patients and non-lung cancer patients. Compare their smoking histories.


Important Note:

1. Observational studies should use some form of random sampling to obtain representative samples.

2.  Observational studies cannot reliably establish causation. Only well-executed controlled experiments can be used to establish causation.

Controlling for various factors

A prospective study carried out over 11 years on a group of smokers and non-smokers showed that there were 7 lung cancer deaths per 100 000 in the non-smoker sample but 166 lung cancer deaths per 100 000 in the smoker sample. This still does not show that smoking causes lung cancer: it could be that smokers smoke because of stress and that this stress causes lung cancer. To control for this factor we might divide our samples into different stress categories and then compare smokers and non-smokers who are in the same stress category. This is called ______ for a confounding factor, in this case stress level.
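The idea of comparing smokers and non-smokers within the same stress category can be sketched as below. The counts are invented purely to show the mechanics; they are NOT from the study described above, and the two stress categories ("low" and "high") are also an assumption.

```python
# Hypothetical records: (stress category, smoker?, people followed, lung cancer deaths).
# These numbers are made up for illustration only; they are not study data.
records = [
    ("low", False, 200_000, 10),
    ("low", True, 100_000, 150),
    ("high", False, 100_000, 11),
    ("high", True, 200_000, 348),
]

def rate_per_100k(deaths, people):
    """Death rate expressed per 100 000 people."""
    return 100_000 * deaths / people

def rates_by_stratum(records):
    """Return {stress category: {group: rate per 100 000}}."""
    table = {}
    for stress, smoker, people, deaths in records:
        group = "smoker" if smoker else "non-smoker"
        table.setdefault(stress, {})[group] = rate_per_100k(deaths, people)
    return table

table = rates_by_stratum(records)
for stress, rates in table.items():
    print(f"stress = {stress}: "
          f"non-smokers {rates['non-smoker']:.0f}, smokers {rates['smoker']:.0f} per 100 000")
```

If the smoker rate stays well above the non-smoker rate inside every stress category, stress alone cannot account for the difference. Other unmeasured factors could still be at work, which is why an observational study cannot reliably establish causation.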

Example 1 - “Home births give babies a good chance”, NZ Herald, 1990

·  An Australian report was stated to have said that babies are twice as likely to die during or soon after a hospital delivery as those from a home birth.

·  The report was based upon simple random samples of home births and hospital births.

Comments:

Example 2 – Lead Exposure and Tooth Decay (USA Today)

Children exposed to lead are more likely to suffer tooth decay, and vitamin C might help lower blood lead levels, say two recent studies.

In the first of the reports in the Journal of the American Medical Association, researchers estimate that lead exposure could account for tooth decay in 2.7 million children. “Other people may debate that, but that’s our position”, says head researcher Mark Moss of the Univ. of Rochester (NY) School of Medicine and Dentistry.

Prior studies showed that lead exposure can depress a child’s IQ. “There are a lot worse things that lead can do to you than hurt your teeth”, Moss says. He notes, however, that one of the key questions in dentistry is why low-income people experience more tooth decay than higher-income people. “This study suggests lead might be one of the reasons,” he says.

The study involved 24,901 children ages 2 and older. It showed that the greater the child’s exposure to lead, the more decayed or missing teeth. “The risk of getting tooth decay increased as the amount of lead went up,” Moss says.

1. What is the population of interest?

2. What is the sample?

3. What are some possible explanations for the results Moss and the other researchers observed?

In a second study, Joel Simon and his colleagues at the University of California at San Francisco studied 19,578 people who had no history of excess lead exposure. They found that the higher a person’s intake of vitamin C, the lower his or her blood lead level.

4. Does this prove that increasing one’s vitamin C intake will lower blood lead levels? Explain.

Causal and Population Inferences

Only controlled randomized experiments can be used to make statistical inferences about causal effects. Observational studies cannot be used to reliably establish causation. However, if samples are properly drawn at random from the populations involved, observational studies can still be used to make inferences about those populations.

Diagram – Causal and Population Inferences

SURVEYS & POLLS (and the errors inherent in them)

The sampling process introduces two types of error:

• Sampling / Chance / Random Errors

• Nonsampling Errors

Sources of Nonsampling Errors

Selection bias

Population sampled is not exactly the population of interest.

e.g.

Non-response bias

People who have been targeted to be surveyed do not respond.

Self-selection bias

People decide themselves whether to be surveyed or not.

Question effects

Subtle variations in wording can have an effect on responses.

Interviewer effects

Different interviewers asking the same question can obtain different results.

Behavioural considerations

People tend to answer questions in a way they consider to be socially desirable.



Transferring findings

Taking the data from one population and transferring the results to another.

Survey-format effects

Sampling / Chance / Random Errors

•  errors caused by the act of taking a sample
•  have the potential to be bigger in smaller samples than in larger ones
•  possible to determine how large they can be
•  unavoidable (price of sampling)

Nonsampling Errors

•  can be much larger than sampling errors
•  are always present
•  can be virtually impossible to correct for after the completion of survey
•  virtually impossible to determine how badly they will affect the result
•  must try to minimize in design of survey (use a pilot survey etc.)

A pilot survey is a small survey that is carried out before the main survey and is often used to identify any problems with the survey design (such as potential sources of nonsampling errors).
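The statements above that sampling errors tend to be bigger in smaller samples, and that it is possible to determine how large they can be, can be illustrated with a rough calculation. The sketch below uses the common approximate 95% margin of error for a sample proportion, 1.96 × sqrt(p(1 − p)/n). That formula, the simple-random-sampling assumption behind it, and the worst-case choice p = 0.5 are assumptions introduced here rather than something stated in these notes.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample of size n; p = 0.5 gives the largest
    (most conservative) value.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1600):
    moe = margin_of_error(n)
    print(f"n = {n:5d}: margin of error is about {100 * moe:.1f} percentage points")
```

Quadrupling the sample size only halves the margin of error. The calculation also says nothing about nonsampling errors, which do not shrink as the sample gets larger.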

A report on a sample survey/poll should include:

- target population (population of interest)

- sample selection method

- the sample size and the margin of error

- the date of the survey

- the exact question

Example - WSU Student Survey - In order to generate data for use in one of my introductory statistics courses a few years back, I had the class develop a short survey and administer this survey to ten WSU students of their choosing. In the end, survey responses were recorded for a total of 348 WSU students (n=348).

1. What is the population of interest?

2. What is the sample?

3. What are some potential problems with this survey methodology?

2 - Data/Variable Types

There are two main variable types:

Example - WSU Student Survey - The following items comprised the survey. Classify each item (variable) as being either numeric, ordinal, or categorical.
(Data File: WSU Student Survey)

Survey Item/Variable / Variable Type
Gender
Age
Did student have a declared major?
Major, if declared
College major program is in, e.g. College of Liberal Arts (CLA)
Class (Fr, So, Jr, or Sr)
Hours spent studying per day
GPA
Is student involved in extra-curricular activity, e.g. intramural sports or biology club?
Is student living on- or off-campus?
Hours of sleep per night
Number of credits student is currently taking
Does student have a “significant other”?
Does student skip at least one class per week?
If they do skip, what is the most common reason for skipping?
Does student drink alcohol?
If they do drink, what would be a typical number of “drinks” they would have per night that they drink?
Does student smoke cigarettes?
If student is a smoker, how many cigarettes do they smoke per day?
Should President Clinton be impeached for his sexual relations with Monica Lewinsky?
Is student of legal drinking age (21 yrs. old)?
How much did student spend on textbooks this semester?
Does student think the WSU Laptop Program is a good idea?

POTENTIAL QUESTIONS OF INTEREST FROM THE STAT 110 SURVEY

3 – Descriptive Statistics

3.1 - Describing a Single Categorical/Qualitative Variable

Entering Data in JMP


From the JMP Starter window click the New Data Table button. A new spreadsheet will appear with only one column (labeled Column 1). We need two columns in our spreadsheet to enter this table: one for the gas price opinion of the respondents and one for the number of respondents in each opinion category. To add columns to a spreadsheet, simply double click to the right of the first column. Each time you double click to the right of the rightmost column another column will be added. Here we only need one additional column, so we will double click once to the right of the first column.

All columns in JMP are set to receive only numeric input by default. Because the gas price opinion information is non-numeric (categorical) data, we need to change the data type for this column. To do this, right-click at the top of Column 1 and then select Column Info from the pop-up menu. A window will open which will allow you to change the name of the column, the data type, the modeling type, the column width, etc. For the name we will use Gas Price Opinion and for the data type we will use Character because we will be entering the opinion categories. Also, because some of the opinions are quite lengthy, we will increase the Field Width to 30. Notice that changing the data type to Character causes the modeling type to become Nominal (i.e. categorical/qualitative).

For the second column all that is required is a name change. This can be done using the process described above, or by simply clicking on the current column name, Column 2, and changing the name to # of Respondents. These counts are actually frequencies and we wish the computer to interpret them as such. To do this we will use the role assignment pop-up menu to change this column’s role to that of a frequency count. The Preselect Role menu is accessed by right-clicking at the top of a column. From this pull-out menu select Freq to change this column’s role to frequency/count.


We are now ready to input the information contained in the table, but we first need to add 3 rows to our spreadsheet. To do this select Add Rows from the Rows pull-down menu and change the number of rows to add to 3, which is the number of opinion categories considered in this survey. Use the mouse, the return key, and/or the arrow keys to move about the spreadsheet and enter the data. When you are finished your spreadsheet should look like:

Note: Rows will automatically be added each time you hit enter when you are entering data in the last row, so you do not necessarily have to designate the number of rows (observations) in the dataset at the beginning of the data entry process if you do not want to.
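For readers who want to see what the two-column layout represents outside of JMP, here is a small Python sketch of the same idea: one entry per category plus a count that plays the role of the Freq column. The category labels and counts are placeholders, since the original survey table is not reproduced in these notes.

```python
# Placeholder categories and counts: NOT the actual gas price survey results.
opinion_counts = {
    "Opinion category 1": 10,
    "Opinion category 2": 20,
    "Opinion category 3": 10,
}

def relative_frequencies(counts):
    """Convert frequency counts into proportions of the total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

proportions = relative_frequencies(opinion_counts)
for category, count in opinion_counts.items():
    print(f"{category:<30} {count:>5} {100 * proportions[category]:>6.1f}%")
```

Each row stands for as many respondents as its count, which is what assigning the Freq role tells JMP: the table has three rows but summarizes every respondent in the survey.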