Sampling and Data Collection Ideas

In a very short time we will be ready to apply material from probability theory in order to make statistical inferences based upon samples. We begin by briefly looking at why and how a sample is made.

There are several reasons why sampling is often preferable to a census (examining the entire population). These include:

(1)  Cost: In many situations, the cost of performing a census is so large as to be prohibitive. For example, imagine the cost of having to examine all transactions in a business concern when performing an audit. As another example, consider the cost of a market survey that attempts to take a census of a large customer population.

(2)  Destructive testing: In many industrial applications, the items sampled may have to be destroyed or made economically worthless by the sampling procedure. For example, the reliability of electronic circuits may be evaluated by measuring the time until they fail. A census would destroy the entire population of circuits, so it is not a practical option.

(3)  Speed of analysis: A significant benefit from sampling is that the collected data can be processed and analyzed in a timely fashion. This might be very important in market surveys where customer attitudes can change quickly.

(4)  Accuracy of analysis: As you will see in our study of frequency distributions, at times it is better to concentrate on a small amount of information than to try to integrate a large amount of information. This is especially true where it may be difficult to collect the data accurately. The individual performing the study can then concentrate on a subset of the population and, at times, ask more probing questions.

(5)  Feasibility: In some cases, it is literally impossible to take a census. One very important example is the case of an infinite population.

Types of samples

A.  Nonrandom sampling

1.  Convenience sampling: select the sample with ease of sampling as the primary consideration.

What is wrong (if anything) with taking a telephone survey between 11 a.m. and 1 p.m.?

2.  Judgment sampling: select the sample based on past experience with the population.

Why survey someone who you already know will not respond?

B.  Random sampling

1.  Simple random sample: sample of size n where all subsets of size n have the same chance of being selected.

This may be accomplished by giving each member of the population the same chance of being selected at each draw. This can be done by assigning each member a number and using a random number table, drawing the numbers from a hat, etc.

2.  Systematic random sample: select the members at evenly spaced intervals where the first selection is determined randomly.

This is especially convenient in situations where the members are already ordered and numbered, as in a population of invoices.

3.  Stratified sample: Used for populations with strata or subgroups. It is frequently possible to improve the information drawn from the sample by sampling randomly from each stratum.

4.  Cluster sampling: Select clusters randomly instead of members. A cluster is just a group of members.

Examples: city blocks, multiple items packaged together, etc. This technique is attractive because of convenience, cost, time, etc.
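The four random sampling schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered invoices, the "small"/"large" strata, the blocks of ten, and the fixed seed are all hypothetical and are not taken from the course data.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n members without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Choose a random start in the first interval, then take evenly spaced members."""
    step = len(population) // n      # spacing between selected members
    start = rng.randrange(step)      # the first selection is determined randomly
    return [population[start + i * step] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of the same size from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep every member of each chosen cluster."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(0)               # fixed seed (an assumption) so runs are repeatable
invoices = list(range(1, 101))       # hypothetical population: invoices numbered 1..100

srs = simple_random_sample(invoices, 10, rng)
sys_sample = systematic_sample(invoices, 10, rng)

# Hypothetical strata: the first 60 invoices are "small", the rest "large".
strata = {"small": invoices[:60], "large": invoices[60:]}
strat = stratified_sample(strata, 5, rng)

# Hypothetical clusters: ten blocks of ten consecutive invoices.
blocks = [invoices[i:i + 10] for i in range(0, 100, 10)]
clus = cluster_sample(blocks, 2, rng)

print(sorted(srs))
print(sys_sample)
print(strat)
print(clus)
```

Note that the systematic sample needs only one random number (the starting point), which is part of what makes it convenient for populations that are already ordered and numbered.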

Types of Data

·  Interval data

•  Values are real numbers.

•  Examples: age, weight, time measurements, etc.

•  All calculations (like taking an average) that we will discuss are valid.

·  Ordinal data

•  Values represent ranked ordering.

•  Examples: preferences, grades (!), etc.

•  Calculations involving orderings are valid. In addition, some research indicates that certain preference scales, such as a Likert scale, may be converted to interval data.

·  Nominal data

•  Values represent categories.

•  Examples: marital status, gender, etc.

•  Only calculations based on category frequency are valid.
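As a small illustration of which calculations suit which data type, the following Python sketch computes an average for interval data, a median for ordinal data, and category frequencies for nominal data. All of the values, and the letter-grade ranking, are hypothetical and chosen only for this example.

```python
import statistics
from collections import Counter

# Hypothetical observations of each data type.
ages = [23, 35, 41, 29, 35]                                      # interval
grades = ["B", "A", "C", "B", "B"]                               # ordinal
status = ["single", "married", "single", "single", "divorced"]   # nominal

# Interval data: arithmetic such as the average is meaningful.
mean_age = statistics.mean(ages)

# Ordinal data: only the ordering is meaningful, so use the middle
# observation after sorting by rank (odd sample size assumed here).
rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}   # assumed ranking of grades
ordered = sorted(grades, key=rank.get)
median_grade = ordered[len(ordered) // 2]

# Nominal data: only category frequencies are meaningful.
status_counts = Counter(status)
most_common_status, most_common_count = status_counts.most_common(1)[0]

print(mean_age)
print(median_grade)
print(status_counts)
```

An average of the grades or of the marital statuses is deliberately not computed, since neither has a numeric scale on which addition makes sense.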

Descriptive Statistics—Graphical

We will begin our investigation of statistics by studying how we can describe data. For example, the data set provided in the Excel file entitled GPS.xls does not efficiently convey information or give the reader insight into its characteristics. The raw data are too detailed: there is too much information to take in at once.

You may have already begun to peruse the data, perhaps looking for patterns or common features. In this way you were attempting to reduce the information in the data set into some simple notions. Next we will consider some formal ways to perform this reduction.

The Frequency Distribution

One may group the observations into carefully selected cells and count the number of occurrences in each cell. This yields a frequency distribution. A frequency distribution either in tabular or graphical form (called a histogram) can quickly convey a broader view of the data set to the reader.
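The grouping-and-counting step can be sketched as follows. The observations, the lower limit, the cell width, and the number of cells are hypothetical values chosen for illustration; they are not the contents of GPS.xls.

```python
def frequency_distribution(data, low, width, k):
    """Count observations in k equal-width cells starting at `low`.

    Cell i covers [low + i*width, low + (i+1)*width); the last cell also
    includes its upper endpoint. Assumes every observation is >= low.
    """
    counts = [0] * k
    for x in data:
        i = int((x - low) // width)
        i = min(i, k - 1)            # place the largest value in the last cell
        counts[i] += 1
    return counts

# Hypothetical observations on a 0-to-4 scale.
data = [0.4, 1.2, 1.9, 2.1, 2.3, 2.6, 2.8, 3.0, 3.1, 3.4, 3.7, 4.0]
counts = frequency_distribution(data, low=0.0, width=1.0, k=4)

# Tabular form, with a crude text histogram beside each count.
for i, c in enumerate(counts):
    lower, upper = i * 1.0, (i + 1) * 1.0
    print(f"{lower:.1f} to {upper:.1f}: {c:2d} " + "*" * c)
```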

The way in which one chooses the number of cells is a matter of artistic taste. Typically it is a good idea to choose the cells to be of equal size, with perhaps the exception of the end cells. Too many cells give too much information, while too few cells lose too much information. An extreme case is a frequency distribution with one cell as shown below.

I think you will agree that this distribution does not contain much information.

Sturges has developed the following rule-of-thumb for determining the number of cells to use. Let

n = the number of observations in the data set, and

k = the number of cells to use in the frequency distribution.

Then k = 1 + 3.3 log10 (n).

In our case, k = 1 + 3.3 log10 (100)

= 1 + 3.3 (2)

= 7.6

≈ 8
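The same calculation can be checked with a short Python sketch. The function name sturges_cells is hypothetical, and rounding to the nearest whole number is an assumption that matches the 7.6 ≈ 8 step above; some authors round up instead.

```python
import math

def sturges_cells(n):
    """Rule-of-thumb number of cells: k = 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

k_exact = sturges_cells(100)   # 1 + 3.3 * 2
k = round(k_exact)             # nearest whole number of cells

print(k_exact, k)
```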

It is also important to consider the range of the data (largest value – smallest value). If R is the range of the data, we would like to choose k so that R/k is reasonably simple. Why?

For our problem, R ≈ 4 – 0 = 4 and R/k = 4/8 = 0.5 which makes our analysis easy. Consider what would have happened if we had chosen k = 7.
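One way to explore the effect of k is to list the cell boundaries it produces. The sketch below does this for a range of 0 to 4, as in the example above; the helper name cell_boundaries is hypothetical.

```python
def cell_boundaries(low, high, k):
    """Return the k + 1 boundaries of k equal-width cells covering [low, high]."""
    width = (high - low) / k
    return [low + i * width for i in range(k + 1)]

b8 = cell_boundaries(0, 4, 8)   # cell width R/k = 4/8
b7 = cell_boundaries(0, 4, 7)   # cell width R/k = 4/7

print([round(b, 4) for b in b8])
print([round(b, 4) for b in b7])
```

Comparing the two printed lists shows how the choice of k affects how easy the cell limits are to read and to tabulate.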

Another representation equivalent to the bar chart is called a line graph.

Characteristics of Frequency Distributions

·  Symmetry

·  Skewness

·  Unimodal and Bimodal

·  Bell shape

Cumulative Frequency Distributions

An alternative but equivalent way to represent a data set is with a cumulative frequency distribution. In this representation, the frequency of a given cell represents not only the frequency of observations in that cell, but also the frequency of observations in all “prior” cells.
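A cumulative frequency distribution is simply a running total of the cell frequencies, as the following Python sketch shows. The cell counts are hypothetical and serve only as an illustration.

```python
from itertools import accumulate

# Hypothetical frequencies for four consecutive cells.
counts = [1, 2, 4, 5]

# Each cumulative frequency adds the current cell to all prior cells.
cumulative = list(accumulate(counts))

# Relative cumulative frequencies: the fraction of observations at or below each cell.
total = cumulative[-1]
relative = [c / total for c in cumulative]

print(cumulative)
print(relative)
```

The last cumulative frequency always equals the total number of observations, which is a useful check on the arithmetic.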

Can you get a frequency distribution from a cumulative frequency distribution?

Can you think of a situation where a cumulative frequency distribution conveys more information directly to the reader than the frequency distribution?

Graphical Representations and Data Type

While there are no detailed rules to rely on, we can make the following observations about the use of different types of graphs depending on our data. Pie and bar charts make a good deal of sense when using nominal or ordinal data. Histograms and ogives are effective when using interval data.
