Statistics 2014

Among the skills that you will be learning in this course are techniques for graphing data. Below is a type of graph that we will be looking at later in the course. You might find that this graph provides you with some useful information. (Note that in this course, we will consider only straight-line trends, not curvilinear trends.)

[Figure: Relationship Between Class Attendance and Course Grade for a General Biology Course]


Chapter 1 – Data Collection

Definition: Statistics is the science of conducting studies to collect, organize, summarize and analyze data and to draw conclusions from the data.

Definition: A population is the collection of all objects, animals, or persons that are of interest.

Example: A population of interest might be all students at UNF. An individual member of the population would be a student.

Example: A population of interest might be all family units in the United States. An individual member of the population would be a family.

Definition: A variable is a characteristic under study that assumes different values for different elements of the population.

Example: For the population of all students at UNF, a variable might be the GPA of a student.

Example: For the population of all registered voters in the U.S., a variable might be the candidate supported by a voter for the office of President.

Definition: Data are the measured values of one or more variables.

Definition: Variables whose values are determined by chance are called random variables.

Example: If I randomly choose someone from the population of all registered voters in the U.S., that person’s preference for President would be a random variable; its value would depend on which voter was randomly chosen.

Definition: A collection of data values is called a dataset.

Definition: An experiment is a planned activity whose result yields a set of data.

Branches of Statistics

Definition: Descriptive statistics is the branch of statistics concerned with the collection, organization, summarization, and presentation of data.

Definition: A parameter is a numerical characteristic of a population.

Example: For the population of all students at UNF, one parameter would be the average GPA.

Example: For the population of all registered voters in the U.S., one parameter would be the level of support for a particular Presidential candidate.

Definition: A sample is a subset of a population.

Definition: A statistic is a numerical characteristic of a sample.

Example: If I select a random sample of 100 UNF students, a statistic would be the average GPA for the members of the sample.

Example: If I select a random sample of 1000 people from the population of all registered voters in the U.S., a statistic would be the fraction of the sample who support a particular Presidential candidate.
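
Note: The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is made up (5,000 invented GPAs, not real UNF data), and the variable names are chosen only for this illustration.

```python
import random
import statistics

# Hypothetical population: 5,000 made-up GPAs (not real UNF data).
rng = random.Random(2014)
population_gpas = [round(rng.uniform(1.0, 4.0), 2) for _ in range(5000)]

# Parameter: a numerical characteristic of the whole population.
population_mean = statistics.mean(population_gpas)

# Statistic: the same characteristic computed from a sample of 100 students.
sample_gpas = rng.sample(population_gpas, 100)
sample_mean = statistics.mean(sample_gpas)

print(f"parameter (population mean GPA): {population_mean:.3f}")
print(f"statistic (sample mean GPA):     {sample_mean:.3f}")
```

In practice the parameter is usually unknown, because we cannot measure every member of the population; the statistic is what we can actually compute.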

Definition: Inferential statistics is the branch of statistics concerned with inferring the characteristics of populations (i.e., parameter values) based on the information contained in sample data sets. Inferential statistics includes estimation of parameters and hypothesis testing. Included in an inferential procedure is a measure of reliability.

Example: If I want to predict the outcome of the next Presidential election, I might select a random sample of 1000 registered voters and ask for each voter’s preference. The fraction of voters in the sample who support a particular candidate would then estimate the fraction of voters in the entire population who support that candidate.
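
Note: The polling example can be sketched as follows. The population and its 52% level of support are invented for illustration. The margin of error uses a common large-sample approximation that is not derived in this chapter; it is included here only as one possible measure of reliability.

```python
import math
import random

rng = random.Random(56)

# Hypothetical population of 200,000 voters; True means "supports candidate A".
# The 52% support level is invented purely for illustration.
population = [rng.random() < 0.52 for _ in range(200_000)]
true_fraction = sum(population) / len(population)  # the parameter

sample = rng.sample(population, 1000)
estimate = sum(sample) / len(sample)               # the statistic

# Rough 95% margin of error from a common large-sample approximation.
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / len(sample))

print(f"population fraction: {true_fraction:.3f}")
print(f"sample estimate:     {estimate:.3f} +/- {margin:.3f}")
```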

Types of Data

Definition: A categorical, or qualitative, variable categorizes elements of a population or sample based on some characteristic of interest.

Examples: eye color, race, gender, nationality, species of tree

Definition: A quantitative variable is numerical in nature. It represents a count or a measurement made on elements of a population or sample. A quantitative variable is discrete if it can assume only a finite or countably infinite number of possible values. Otherwise, it is continuous.

Examples: 1) Consider the population of families in the United States. The variable is the number of children in the family. This is a discrete quantitative variable (a count) with a finite number of possible values.

2) Consider the population of all adults in the United States. The variable is each person’s height. This is a continuous quantitative variable (a measurement), with an uncountably infinite number of possible values.

Sampling Methods

Definition: A census is a set of data collected for every member of the population.

Definition: A representative sample is one whose characteristics reflect as closely as possible the characteristics of the population from which the sample was selected.

Example: I want to examine the differences in choice of major field of study for male and female students at UNF. I select a random sample of 100 male students and a random sample of 100 female students. I ask each student’s major.

Definition: A simple random sample (SRS) of size n is a sample drawn from a population by a method which makes every possible sample of size n equally likely to be chosen. As a consequence, every member of the population has the same probability of being selected to be in the sample. (The converse does not hold: a method can give every member the same chance of selection without making every sample of size n equally likely.)

Steps in choosing a SRS of size n:

1) Obtain a list of all members of the population; this list is called a sampling frame. (Note: This is the most difficult step in the whole process, and is also error-prone.)

2) Assign a unique ID number to each member of the population.

3) Go to a table of random numbers; choose a convenient starting point; go down the column, recording numbers within the range of the assigned ID numbers, until n distinct numbers are selected.

4) The population members that have the ID numbers obtained by this process make up the SRS of size n.

(Step 3 may also be done using the TI-83 calculator.)

Note: We can never be absolutely certain that our sample is representative, but simple random sampling gives us a good chance.

Example: I want to estimate the average height of the class, without gathering height data for every person in the class. I will select a simple random sample of size 6 and use the average height of the members of the sample as the estimate of the average height of the members of the class. I assign a unique ID number to each person in the class; the first person on the class roll will have the ID number 001, the second person 002, etc. I then go to a table of random numbers, open it, and blindly choose a starting point. Reading down the column from the starting point, I find six distinct three-digit numbers within the range of the values of the ID numbers. The class members with these 6 ID numbers constitute the SRS.
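
Note: Software can take the place of the table of random numbers. The sketch below assumes a class roll of 35 students (the class size is hypothetical) and uses Python's standard `random.sample`, which draws without replacement, so the six ID numbers are automatically distinct.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Return n distinct members of the sampling frame, chosen so that
    every possible sample of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical class roll of 35 students with IDs 001, 002, ..., 035.
class_roll = [f"{i:03d}" for i in range(1, 36)]
srs_ids = simple_random_sample(class_roll, 6, seed=1)
print(sorted(srs_ids))
```

The `seed` argument only makes the draw repeatable for demonstration; leaving it out gives a fresh sample each time.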

Definition: A systematic sample is chosen by randomly choosing one of the first k members on the population list, and then choosing every kth member thereafter.

Note: Systematic sampling has the advantage of simplicity; it has the disadvantage that any regularity in listing the population members could be reflected in the sample, making the sample less likely to be representative of the population.

Example: Coca-Cola wants to poll its 29,500 employees. They begin with a complete employee roster; randomly choose one of the first 50 names on the list as a starting point; then choose every 50th name after that one, obtaining a systematic sample of 590 employees. Suppose that the first person selected is number 3 on the list, and suppose that the original list is arranged so that management people appear in positions 1, 2, 3, 51, 52, 53, 101, 102, 103, etc. Then the systematic sample chosen will consist solely of management people, and will thus not be representative of all employees.
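
Note: The roster example can be checked with a short Python sketch. The roster is represented only by position numbers 1 through 29,500; the function names are chosen for this illustration.

```python
import random

def systematic_sample(frame, k, rng=random):
    """Pick a random start among the first k members, then every kth after."""
    start = rng.randrange(k)          # 0-based index of the first pick
    return frame[start::k]

# Positions 1..29500 on the roster, as in the example above.
roster = list(range(1, 29501))
sample = systematic_sample(roster, 50)
print(len(sample))                    # 590 for any starting point

# The periodic listing from the example: management sits in positions
# 1, 2, 3, 51, 52, 53, 101, ... (remainder 1, 2, or 3 when divided by 50).
def is_management(position):
    return position % 50 in (1, 2, 3)

from_third = roster[2::50]            # the start is fixed at position 3
print(all(is_management(p) for p in from_third))  # True
```

The second check shows how a regular pattern in the list carries straight into the sample: starting at position 3 selects positions 3, 53, 103, and so on, all of which are management.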

Definition: A stratified random sample is chosen by separating the population into mutually exclusive subgroups, or strata, and choosing a SRS from each stratum.

This method is useful if there are identifiable subpopulations about which we want information.

Example: We want to estimate the average height of the members of the class, using a (hopefully) representative sample. Since males tend to be taller than females, and we also want to estimate the average height of the males in the class and of the females in the class, we will first stratify by gender. We will obtain a SRS from the group of males, and a SRS from the group of females. These two SRSs together will make up our stratified random sample from the class.
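
Note: A minimal Python sketch of stratified random sampling follows. The class list of 40 students and the choice of 5 students per stratum are invented for illustration.

```python
import random

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Split units into strata, then draw an SRS from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical class: (student ID, gender) pairs, invented for illustration.
students = [(f"{i:03d}", "M" if i % 2 else "F") for i in range(1, 41)]
chosen = stratified_sample(students, lambda s: s[1], n_per_stratum=5, seed=3)
for gender, members in chosen.items():
    print(gender, [sid for sid, _ in members])
```

Because each stratum is sampled separately, both groups are guaranteed to appear in the sample, which a single SRS from the whole class does not guarantee.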

Definition: A cluster sample is a simple random sample of groups or clusters of elements.

Note: This method is useful when it is difficult or costly to develop a complete list of members of the population, or when the population is widely dispersed geographically.

Example: We want to estimate the average household income in a large city. To use a SRS, we would need a complete list of all households, with unique ID numbers. We would select a SRS of size n, and interview someone from each of these households. If we are sending out interviewers to each household, they may have to travel extensively, going from one chosen household to the next. On the other hand, we could make a list of all residential blocks in the city, choose a SRS of blocks, and have the interviewer go to each house in each selected block. This would involve less travel time. The disadvantage is that this method increases sampling error; we are somewhat less likely to obtain a representative sample than if we use SRS.
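
Note: The block-sampling idea can be sketched in Python as follows. The city of 200 blocks with 8 households each is made up; real blocks would differ in size.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Draw an SRS of clusters and keep every element in each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical city: 200 blocks, each with 8 made-up household IDs.
blocks = {f"block-{b:03d}": [f"hh-{b:03d}-{h}" for h in range(1, 9)]
          for b in range(1, 201)}
selected = cluster_sample(blocks, 10, seed=7)
households = [hh for members in selected.values() for hh in members]
print(len(selected), "blocks,", len(households), "households")
```

Note that the sampling frame here is the list of blocks, not the list of households, which is why no complete household list is needed.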

Definition: A convenience sample is a sample in which the individuals are easily obtained. (Note: It is almost never a good idea to do convenience sampling.)

Example: Online “surveys” often use convenience samples. Almost on a daily basis, I see online “surveys” that ask users for their opinions about current events in the news. The resulting data are based on responses from users who volunteer to answer the questions. Conclusions drawn from such “surveys” are questionable.

Sources of Error in Sampling

Definition: A nonsampling error is an error that results from the survey process. Such errors may be due to 1) nonresponse, 2) inaccurate responses, 3) poorly worded survey questions, etc.

Example: In 1936, the magazine Literary Digest conducted a poll to predict the outcome of the Presidential election. They mailed out sample ballots to a sample of 10 million voters. Based on the results, they predicted that Alfred M. Landon would win with 57% of the votes. Actually, Franklin D. Roosevelt won with 62% of the votes. What happened? The magazine used poor sampling methods. First of all, as a sampling frame, they used a list composed of subscribers to their magazine, households with telephones, and households with automobiles. Since the U.S. was in the middle of the Great Depression, the people listed on the sampling frame were those who were more affluent. A majority of the people on the list were Republicans. However, less affluent Americans, who were not included in the frame, were more likely to favor the New Deal policies of the Roosevelt administration. A second problem with the poll was that, of the 10 million sample ballots mailed out, only about one-fourth were returned.

Some other examples of causes of nonsampling error are given in the text on pp. 34-36.
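
Note: The effect of a poor sampling frame, as in the Literary Digest example, can be imitated with a small simulation. All of the numbers below (the share of affluent voters and the support levels for a candidate labeled R) are invented and are not historical figures.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate: 30% "affluent", 70% "less affluent".
# Support levels for candidate R are invented, not historical figures.
population = []
for _ in range(100_000):
    affluent = rng.random() < 0.30
    supports_r = rng.random() < (0.40 if affluent else 0.70)
    population.append((affluent, supports_r))

def support(voters):
    """Fraction of the given voters who support candidate R."""
    return sum(s for _, s in voters) / len(voters)

# A frame that lists only affluent voters misses most of the electorate.
biased_frame = [v for v in population if v[0]]

good_sample = rng.sample(population, 2000)
biased_sample = rng.sample(biased_frame, 2000)

print(f"true support:          {support(population):.3f}")
print(f"SRS from full list:    {support(good_sample):.3f}")
print(f"SRS from biased frame: {support(biased_sample):.3f}")
```

Even though both samples are chosen at random and are the same size, the sample drawn from the incomplete frame misses badly, because the error comes from the frame and not from chance.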

Definition: Sampling error is the error that results from using information from a sample to make decisions about the characteristics of the population from which the sample was selected. This type of error occurs because a sample gives incomplete information about a population.
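
Note: Sampling error can be seen by drawing several samples from the same population and comparing each sample mean with the population mean. The population of heights below is simulated (the mean of 67 inches and the spread are assumptions made for the illustration).

```python
import random
import statistics

rng = random.Random(126)

# Hypothetical population of 10,000 adult heights in inches (invented values).
heights = [rng.gauss(67, 4) for _ in range(10_000)]
mu = statistics.mean(heights)                     # parameter

# Draw several SRSs of size 25; each gives a different sample mean.
errors = []
for _ in range(5):
    sample = rng.sample(heights, 25)
    errors.append(statistics.mean(sample) - mu)   # sampling error of this sample

print([round(e, 2) for e in errors])

# A census uses every member, so it has no sampling error.
census_error = statistics.mean(heights) - mu
print(census_error)
```

The errors differ from sample to sample even though the sampling method is sound; they arise only because each sample contains part of the population rather than all of it.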

The Design of Experiments

Definition: An observational study draws conclusions about the characteristics of a population by measuring those characteristics of individuals in a sample, but does not attempt to manipulate or influence the variables of interest.

Definition: A designed experiment applies a treatment to individuals (called experimental units) and attempts to isolate the effects of the treatment on a response variable.

Example: In 1954, Jonas Salk tested a vaccine which he hoped would prevent polio. One group of children was given the vaccine, and another group of children was given a placebo (not containing any drug). He then observed whether there was a difference in the rates of polio between the vaccine group and the placebo group. It was found that there was a much lower rate of polio in the vaccine group than in the placebo group. Salk concluded that his vaccine would help to prevent polio.

Example: Suppose there are two methods for teaching a basic astronomy course.

1) One method is a standard chalk-and-talk method, in which the teacher lectures and assigns homework and tests.

2) Another method is self-paced; in addition to lectures from the teacher, the student, at his/her own pace, studies units of material (for example, there may be a unit of material on the internal structure of a star). The student then takes a quiz on the unit of material, and has two opportunities to pass the quiz. The student must successfully complete a certain number of units in order to get a passing grade in the course.

A researcher wants to know whether the self-paced method improves student performance over the standard method. The researcher selects a random sample of students enrolled in astronomy, and randomly assigns some students to the standard instructional method, and other students to the self-paced method. At the end of the semester, the researcher examines student grades.
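
Note: Random assignment to treatments can be sketched in Python as follows. The sample of 30 students and the equal split between the two methods are assumptions made for the illustration; the notes do not specify group sizes.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to the treatments in turn,
    so group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical sample of 30 astronomy students, identified by number.
students = list(range(1, 31))
groups = randomly_assign(students, ["standard", "self-paced"], seed=142)
for method, members in groups.items():
    print(method, sorted(members))
```

Because chance alone decides which method each student receives, characteristics the researcher cannot control tend to be spread evenly across the two groups.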

Definition: The experimental unit (or subject) is a person, object, or some other well-defined item to which a treatment is applied. The experimental unit is analogous to the individual in a survey.

Example: In the Salk vaccine study, an experimental unit was an individual child.

Definition: An experimental treatment is a condition applied by the researcher to the experimental unit.

Example: In the Salk vaccine study, one experimental treatment was the Salk vaccine; the other experimental treatment was the placebo.

Example: In the basic astronomy study, one experimental treatment was the standard instructional method; the other experimental treatment was the self-paced method.

Definition: In an experimental design, a factor is an explanatory variable that could influence the outcome of the experiment.

Example: In the Salk vaccine study, one factor is the experimental factor – whether a child is given the vaccine or a placebo. Another factor (uncontrolled) is opportunity to be exposed to the polio virus.

Definition: In an experimental design, the response variable is the outcome variable about which comparisons are made.

Example: In the Salk vaccine study, the response variable is a random variable with two possible values: 1) the child developed polio, or 2) the child did not develop polio.

Example: In the basic astronomy study, the response variable is a random variable, namely a student’s grade in the course.

Steps in Conducting an Experiment:

1) Identify the problem to be solved, as explicitly and completely as possible. (We want to know whether the Salk vaccine reduces the rate of occurrence of polio in children.)

2) Determine the factors that affect the response variable. (Presence or absence of the vaccine, opportunity for exposure to the virus that causes polio.)

3) Determine the number of experimental units. (How many children should be in the vaccine group? How many in the placebo group?)

4) Determine the level of the predictor variables. (One predictor variable is the experimental treatment – presence or absence of the vaccine. Another predictor variable is opportunity to be exposed to the virus. The first variable is completely under the control of the researcher. The second can only be controlled through randomly assigning children to receive either the vaccine or the placebo.)

5) Collect and process the data. (After a period of time, each child is examined to determine whether polio infection is present.)

6) Test the claim. (Compare the rate of infection in the vaccine group with the rate of infection in the placebo group.)
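
Note: The comparison in step 6 starts with computing the rate of infection in each group. The sketch below uses invented counts (they are NOT the 1954 trial data) and only computes and compares the two rates; a formal hypothesis test is not shown here.

```python
def infection_rate(cases, group_size):
    """Fraction of a group that developed the disease."""
    return cases / group_size

# Hypothetical counts, invented for illustration; NOT the 1954 trial data.
vaccine_cases, vaccine_n = 12, 10_000
placebo_cases, placebo_n = 40, 10_000

vaccine_rate = infection_rate(vaccine_cases, vaccine_n)
placebo_rate = infection_rate(placebo_cases, placebo_n)

print(f"vaccine group: {vaccine_rate * 100_000:.0f} cases per 100,000")
print(f"placebo group: {placebo_rate * 100_000:.0f} cases per 100,000")
print(f"difference (placebo - vaccine): {placebo_rate - vaccine_rate:.4f}")
```

Comparing rates rather than raw counts matters whenever the two groups are not the same size.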