
Terminology

Definition: A population is the set of all elements of interest. (Note: the population size is usually denoted by an uppercase “N”.)

Definition: A sample is a subset of the population. (Note: the sample size is usually denoted by a lowercase “n”.)

Definition: A parameter is a characteristic (usually numeric) of the population.

Definition: A statistic is a characteristic (usually numeric) of the sample.

Basic Concept

The field of statistics can be divided into two related areas: descriptive statistics and inferential statistics. Descriptive statistics consists of numerical and graphical procedures that allow for the organization of information (data) and the extraction of key characteristics from the data. Inferential statistics consists of techniques designed to use sample statistics to make estimates of and decisions about population parameters. Probability is used to measure the error rates and/or level of confidence associated with statistical decisions.

We use a sample because it is usually impossible or impractical to measure each and every element of the population. The parameter(s) is (are) what we want to know. The only way to know the value of the parameter is to measure each and every element of the population without error. Thus, it is rare when we know the value of the parameter. We use the information in the sample to describe and make decisions about the population. We use statistics to describe, estimate, and make decisions about parameters. It is the parameter we want to know; it is the statistic that we “settle for.” Statistics represent our “best guess” for the value of the parameter. The value of the statistic is a function of the sample. Different researchers, taking separate samples from the same population are likely to obtain different values for the statistic. Statistics are used as a decision making tool. The use of statistics does not guarantee correct answers.
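The point that separate samples from the same population give different values of the statistic can be seen in a short simulation. This is only a sketch: the population below is invented for illustration (1000 values between 0 and 18), and the seed, the sample size of 25, and the number of samples are arbitrary assumptions.

```python
import random
from statistics import mean

# Hypothetical population of N = 1000 values, invented purely for illustration.
rng = random.Random(2003)  # fixed seed so the run is repeatable
population = [rng.randint(0, 18) for _ in range(1000)]

# The parameter: computable here only because we hold the entire population.
mu = mean(population)

# Three "researchers" each draw their own sample of n = 25 without replacement.
sample_means = [mean(rng.sample(population, 25)) for _ in range(3)]

print("population mean (parameter):", mu)
print("sample means (statistics):  ", sample_means)
```

Each sample mean is an estimate of the same parameter, yet the three values will generally differ from one another and from the population mean.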

TYPES OF DATA

There are two basic types of data: quantitative and qualitative. Quantitative data categorizes a response by a numeric attribute. Qualitative data categorizes a response by a non-numeric attribute. Numeric refers to the “natural state” of the measurement. A number is not necessarily numeric. Rather, a number can be a symbol representing a non-numeric characteristic (e.g., a ZIP code or the number on a player's jersey). To determine if a number is numeric, answer the following question: can the number be replaced with a letter, word, or symbol with no loss of information? If the answer is yes, the number is non-numeric; if the answer is no, the number is numeric.

A further breakdown of the types of data is the scale of measurement. There are four scales of measurement. Listed in order from the least amount of structure to the most, the scales are: (1) nominal, (2) ordinal, (3) interval, and (4) ratio. Nominal scaled data categorizes a response by a non-numeric, non-ordered attribute (e.g., religious preference, ethnicity). Ordinal scaled data categorizes a response by an ordered non-numeric attribute (e.g., military rank, Likert scale). Interval scaled data categorizes a response by a numeric attribute with no natural “zero” point. Thus, differences have meaning for interval scaled data but ratios do not (e.g., temperature in degrees Celsius or Fahrenheit). Ratio scaled data is numeric and has a natural “zero” point. Thus, ratios have meaning. Most numeric data is ratio scaled (e.g., time to complete a project, number of defective items produced).

KEY PARAMETERS AND STATISTICS

There is a wide variety of parameters and statistics. In general, Greek letters (e.g., $\mu$, $\pi$, $\sigma^2$, $\sigma$) will be used to represent parameters and Latin letters (e.g., $\bar{x}$, $s^2$, $s$) will be used to represent statistics. In some cases a Latin letter will be used to represent a parameter and the same Latin letter with a “hat” (^) over it will be used to represent the statistic (e.g., in some books the population proportion (the parameter) is denoted by $p$ and the sample proportion (the statistic) is denoted by $\hat{p}$).

Measures of Central Tendency

Measures of central tendency attempt to describe the typical response (note that the assumption is made the typical values fall in the “middle” region of a distribution). Two of the most commonly used measures of central tendency are the mean and the median.

Mean

For notation purposes we use the Greek letter µ (“mu”) to represent the population mean and a Latin letter with a “bar” over it to represent the sample mean (e.g., $\bar{x}$). The mean is also referred to as the expected value, which is denoted as E(X), where X represents the characteristic being measured. How the mean is computed depends on the nature of the data. The data can be presented in one of two forms: raw observations or a table of frequencies.

If the data is a set of observations, the sample mean is computed as the sum of the observations divided by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x}{n}$$

Note that the indices were dropped from the summation notation in the last form of the formula. While technically incorrect, it is this last form that will be used. It is assumed that we will use the entire data set and, thus, we drop the subscript on “x” and the indices on the summation notation.

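If the data is a table of frequencies (or relative frequencies), the sample mean is computed as:

$$\bar{x} = \sum x\,f(x) \qquad \text{or, in terms of probabilities,} \qquad E(X) = \sum x\,P(x)$$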

where f(x) is the relative frequency for the specific value of X and P(x) is the probability that the specific value of X will occur. Another notation for P(x) is P(X = x), where the uppercase X represents the characteristic being measured (e.g. number of defective products) and the lowercase x represents specific values that may be obtained upon measuring the characteristic (e.g., 0, 1, etc.).

Example: Suppose a sample of five students yielded the following number of credit hours each student is taking in the 2003 spring semester.

12, 9, 16, 15, 12

The sample mean is:

$$\bar{x} = \frac{\sum x}{n} = \frac{12 + 9 + 16 + 15 + 12}{5} = \frac{64}{5} = 12.8$$

On average, the five sampled students are taking 12.8 credit hours during the 2003 spring semester.
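The same computation can be sketched in a few lines of Python. The variable names are illustrative only.

```python
# Credit hours for the five sampled students (from the example above).
credit_hours = [12, 9, 16, 15, 12]

n = len(credit_hours)            # sample size, n = 5
x_bar = sum(credit_hours) / n    # sum of the observations divided by n

print(x_bar)  # 12.8
```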

Example: Suppose the table below represents the relative frequency of the number of defective items produced per day for a sample of 25 days. Compute the expected number (mean) of defective items produced per day.

D, Number of Defective Items / f, Frequency / Relative Freq., f/n
0 / 7 / 7/25 = 0.28
1 / 9 / 9/25 = 0.36
2 / 5 / 5/25 = 0.20
3 / 3 / 3/25 = 0.12
4 / 1 / 1/25 = 0.04

The expected number of defective items is:

E(D) = df(d) = 0(.28) + 1(.36) + 2(.20) + 3(.12) + 4(.04) = 1.28

On average, 1.2 defective items are produced per day.
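As a check on the arithmetic, the expected value can be computed directly from the frequency table. This is a minimal sketch; storing the table as a dictionary is simply a convenient assumption.

```python
# Frequency table: number of defective items -> number of days observed.
frequencies = {0: 7, 1: 9, 2: 5, 3: 3, 4: 1}

n = sum(frequencies.values())                        # 25 days in the sample
rel_freq = {d: f / n for d, f in frequencies.items()}  # f(d) = f / n

# Expected value: each value weighted by its relative frequency.
expected = sum(d * p for d, p in rel_freq.items())

print(round(expected, 2))  # 1.28
```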

Notes:

(1) The mean is valid only for numeric data.

(2) The mean physically represents the center of gravity of the data.

(3) The mean is sensitive to extreme values and is pulled toward the extreme values.

Median

The sample median, denoted with an uppercase M or a Latin letter with a tilde (~) over it (e.g., $\tilde{x}$), is the middle value in an ordered data set. The median can be found using the following procedure:

(1) Order the data.

(2) Compute the location of the median (denoted by the letter “i”), where $i = (n + 1)/2$.

(3) If “i” is an integer, the median is the value located at the ith position.

If “i” is not an integer, the median is the mean of the two values surrounding the ith position.

Example: Find the median for the following two data sets.

Data set 1: 12, 9, 16, 15, 13

Data set 2: 12, 9, 16, 15, 13, 8

Data set 1: 12, 9, 16, 15, 13

(1) Order the data: 9, 12, 13, 15, 16

(2) Compute the location of the median: $i = (5 + 1)/2 = 3$

(3) The median is the value at the 3rd ordered position, M = 13

Data set 2: 12, 9, 16, 15, 13, 8

(1) Order the data: 8, 9, 12, 13, 15, 16

(2) Compute the location of the median: $i = (6 + 1)/2 = 3.5$

(3) The median is the mean of the two values surrounding position 3.5 (the 3rd and 4th ordered values), M = (12 + 13)/2 = 12.5
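The three-step procedure can be sketched as a small Python function. The function name is hypothetical, and the location rule i = (n + 1)/2 is the one implied by the worked examples (i = 3 for five values; the 3rd and 4th values are averaged for six values).

```python
def median_by_location(values):
    """Median via the ordered-position rule i = (n + 1) / 2."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # step 1: order the data
    n = len(ordered)
    i = (n + 1) / 2                   # step 2: location of the median
    if i.is_integer():                # step 3a: i is an integer
        return ordered[int(i) - 1]    # positions are 1-based, lists are 0-based
    lower = ordered[int(i) - 1]       # value just below position i
    upper = ordered[int(i)]           # value just above position i
    return (lower + upper) / 2        # step 3b: mean of the surrounding values


print(median_by_location([12, 9, 16, 15, 13]))     # 13
print(median_by_location([12, 9, 16, 15, 13, 8]))  # 12.5
```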

Notes:

(1) The median is valid for ordinal, interval, and ratio scaled data.

(2) The median is relatively stable in the presence of extreme values.
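The different behavior of the mean and the median in the presence of an extreme value can be illustrated with Python's standard `statistics` module. The second data set is hypothetical: it is data set 1 with the value 16 replaced by an invented extreme value of 60.

```python
from statistics import mean, median

original = [12, 9, 16, 15, 13]       # data set 1 from the median example
with_extreme = [12, 9, 60, 15, 13]   # hypothetical: 16 replaced by 60

# Original data: mean = 13, median = 13
print(mean(original), median(original))

# With the extreme value: mean = 21.8, median is still 13
print(mean(with_extreme), median(with_extreme))
```

The mean is pulled toward the extreme value while the median does not move.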

MEASURES OF DISPERSION

Statistics is a decision making tool used in the face of uncertainty. Perhaps no concept is more important in statistics than the measurement and understanding of variability. Measures of dispersion (also called measures of spread or measures of variability) are attempts to describe the spread or fluctuation in a data set. Three common measures of dispersion are: (1) the range, (2) the mean absolute deviation, and (3) the standard deviation.

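The range, R, is simply the difference between the maximum and minimum values in the data set.

In notation, the range is $R = H - L$, where H is the highest (maximum) value and L is the lowest (minimum) value.

The mean absolute deviation, MAD, is the average of the absolute deviations from the mean:

$$\text{MAD} = \frac{\sum |x - \bar{x}|}{n}$$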

Example: Suppose a sample of five students yielded the following number of credit hours each student is taking in the 2003 spring semester.

12, 9, 16, 15, 12

The sample mean is:

$$\bar{x} = \frac{64}{5} = 12.8$$

Find the mean absolute deviation.

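x / $x - \bar{x}$ / $|x - \bar{x}|$
12 / -0.8 / 0.8
9 / -3.8 / 3.8
16 / 3.2 / 3.2
15 / 2.2 / 2.2
12 / -0.8 / 0.8
Sum: 64 / 0.0 / 10.8

The sample mean absolute deviation is:

$$\text{MAD} = \frac{\sum |x - \bar{x}|}{n} = \frac{10.8}{5} = 2.16$$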
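The range and the mean absolute deviation for the five credit-hour observations can be checked with a short Python sketch. It takes “average” to mean division by the number of observations n, which is an assumption about the intended formula.

```python
credit_hours = [12, 9, 16, 15, 12]

n = len(credit_hours)
x_bar = sum(credit_hours) / n                       # sample mean, 12.8

# Range: maximum minus minimum (H - L).
data_range = max(credit_hours) - min(credit_hours)  # 16 - 9

# Mean absolute deviation: average distance from the mean (divisor n assumed).
abs_devs = [abs(x - x_bar) for x in credit_hours]
mad = sum(abs_devs) / n

print(data_range, round(mad, 2))  # 7 2.16
```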

The standard deviation, s, is the square root of the “average” of the squared deviations from the mean. Just as with the mean, the formula used to find the standard deviation depends on the format of the data: raw observations or a frequency distribution.

With raw data the standard deviation is computed as:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

With a frequency distribution the standard deviation is computed as:

$$\sigma = \sqrt{\sum (x - \mu)^2 f(x)} = \sqrt{\sum x^2 f(x) - \mu^2}$$

Conceptual Motivation for the Standard Deviation

As a measure of spread, consider, for each observation, the deviation from the mean, $x - \bar{x}$.

For each observation, the deviation from the mean is measuring how far the observation lies from the mean. If the deviation is positive, the observation is greater than the mean; if the deviation is negative, the observation is less than the mean; if the deviation is zero, the observation is equal to the mean. If the data is “tightly grouped” most of the deviations from the mean should be small. If the data is “spread out” more of the deviations from the mean should be large. An intuitive measure of spread is the average of the deviations from the mean. Intuitively, if the data is “tightly grouped” then on the average the observations should be close to the mean and the average deviation from the mean will be small. Conversely, on an intuitive level, if the data is “spread out” then on the average the observations will not be close to the mean and the average deviation from the mean will be large. Good idea, but it does not work. The sum of the deviations from the mean always equals zero. We need a way to ensure that the sum will not equal zero.
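The claim that the deviations always sum to zero follows in one line from the definition of the sample mean, $\bar{x} = \sum x / n$:

$$\sum (x - \bar{x}) = \sum x - n\bar{x} = \sum x - n \cdot \frac{\sum x}{n} = 0$$

For the five credit-hour observations 12, 9, 16, 15, 12 with mean 12.8, for instance, the deviations are -0.8, -3.8, 3.2, 2.2, and -0.8, which sum to zero.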

Try:

• absolute values of the deviations: used for the MAD.

• squares of the deviations: used for the standard deviation.

The use of absolute deviations is intuitively pleasing but it has two drawbacks:

• absolute values are arithmetically tedious to compute

• absolute values lack “nice” statistical properties

The use of squares eliminates these two drawbacks.

The sample variance is:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Problem: The sample variance is in squared units.

Solution: Take the square root.

The sample standard deviation is the positive square root of the sample variance. If the standard deviation is “small” the data is tightly grouped and if the standard deviation is “large” the data is spread out.

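The notation for the population variance is $\sigma^2$ (“sigma squared”) and the notation for the sample variance is $s^2$. When measuring a single, quantitative variable, the sample variance is defined as:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{(definitional formula)}$$

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \quad \text{(computational formula)}$$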

The two formulas above provide the definitional and computational formulas for the sample variance. The definitional formula provides insight into what the sample variance represents. The sample variance is an “average” of the squared deviations from the sample mean. Note that the variance measures deviations from a specified reference point. In the situation described above, the reference point is the sample mean. As the physical reality changes, the reference point used in the computation of the variance will change. However, the general idea is to measure the spread around the reference point.

The general format for the sample variance is “sum of squares” divided by “degrees of freedom.” Degrees of freedom, df, will be the total number of observations used in computing the sum of squares, SS, minus the number of parameters estimated in the computation of the sum of squares.

Example: Suppose a sample of five students yielded the following number of credit hours each student is taking in the 2003 spring semester.

12, 9, 16, 15, 12

The sample mean is:

$$\bar{x} = \frac{64}{5} = 12.8$$

Find the sample variance and standard deviation using both the definitional and computational formulas.

Use of the definitional formula.

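x / $x - \bar{x}$ / $(x - \bar{x})^2$
12 / -0.8 / 0.64
9 / -3.8 / 14.44
16 / 3.2 / 10.24
15 / 2.2 / 4.84
12 / -0.8 / 0.64
Sum: 64 / 0.0 / 30.80

The sample variance is:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{30.80}{4} = 7.70$$

The sample standard deviation is:

$$s = \sqrt{7.70} \approx 2.77$$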

Use of the computational formula.

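x / $x^2$
12 / 144
9 / 81
16 / 256
15 / 225
12 / 144
Sum: 64 / 850

The sample variance is:

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{850 - \frac{(64)^2}{5}}{4} = \frac{850 - 819.2}{4} = \frac{30.8}{4} = 7.70$$

The sample standard deviation is:

$$s = \sqrt{7.70} \approx 2.77$$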
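Both routes to the sample variance can be verified with a brief Python sketch. The variable names are illustrative, and the comparison against `statistics.variance` (which also divides by n - 1) is included only as a cross-check.

```python
from math import isclose, sqrt
from statistics import variance

credit_hours = [12, 9, 16, 15, 12]
n = len(credit_hours)
x_bar = sum(credit_hours) / n

# Definitional formula: sum of squared deviations over n - 1.
ss_def = sum((x - x_bar) ** 2 for x in credit_hours)   # 30.8
var_def = ss_def / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) over n - 1.
sum_x = sum(credit_hours)                  # 64
sum_x2 = sum(x * x for x in credit_hours)  # 850
ss_comp = sum_x2 - sum_x ** 2 / n          # 850 - 819.2
var_comp = ss_comp / (n - 1)

# The two formulas agree, and match the standard library's sample variance.
assert isclose(var_def, var_comp)
assert isclose(var_def, variance(credit_hours))

print(round(var_def, 2), round(sqrt(var_def), 2))  # 7.7 2.77
```

The degrees of freedom here are n - 1 = 4 because one parameter, the mean, is estimated in computing the sum of squares.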

Example: Suppose the table below represents the frequency and relative frequency of the number of defective items produced per day for a sample of 25 days. Compute the standard deviation of the number of defective items produced per day.

D, Number of Defective Items / f, Frequency / Relative Freq., f/n
0 / 7 / 7/25 = 0.28
1 / 9 / 9/25 = 0.36
2 / 5 / 5/25 = 0.20
3 / 3 / 3/25 = 0.12
4 / 1 / 1/25 = 0.04

Recall that the expected number of defective items was $\mu = E(D) = 1.28$.

Using $\sigma^2 = \sum (x - \mu)^2 f(x)$:

D / f / f(x) = f/n / $(x - \mu)^2$ / $(x - \mu)^2 f(x)$
0 / 7 / 7/25 = 0.28 / 1.6384 / 0.458752
1 / 9 / 9/25 = 0.36 / 0.0784 / 0.028224
2 / 5 / 5/25 = 0.20 / 0.5184 / 0.103680
3 / 3 / 3/25 = 0.12 / 2.9584 / 0.355008
4 / 1 / 1/25 = 0.04 / 7.3984 / 0.295936
Sum: 1.241600

Thus $\sigma^2 = 1.2416$ and $\sigma = \sqrt{1.2416} \approx 1.114$.

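Using $\sigma^2 = \sum x^2 f(x) - \mu^2$:

D / f / f(x) = f/n / $x^2$ / $x^2 f(x)$
0 / 7 / 7/25 = 0.28 / 0 / 0.00
1 / 9 / 9/25 = 0.36 / 1 / 0.36
2 / 5 / 5/25 = 0.20 / 4 / 0.80
3 / 3 / 3/25 = 0.12 / 9 / 1.08
4 / 1 / 1/25 = 0.04 / 16 / 0.64
Sum: 2.88

Thus $\sigma^2 = 2.88 - (1.28)^2 = 2.88 - 1.6384 = 1.2416$ and $\sigma = \sqrt{1.2416} \approx 1.114$.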
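The two frequency-table calculations can be reproduced in Python as a check. This is a sketch under the assumption, suggested by the column headings, that the squared deviations are weighted by relative frequency; that weighting amounts to dividing the sum of squares by n rather than n - 1.

```python
from math import isclose, sqrt

# Frequency table: number of defective items -> number of days observed.
frequencies = {0: 7, 1: 9, 2: 5, 3: 3, 4: 1}
n = sum(frequencies.values())                          # 25 days
rel_freq = {d: f / n for d, f in frequencies.items()}  # f(x) = f / n

mu = sum(d * p for d, p in rel_freq.items())           # expected value, 1.28

# First table: sum of (x - mu)^2 * f(x).
var_def = sum((d - mu) ** 2 * p for d, p in rel_freq.items())

# Second table: sum of x^2 * f(x), then subtract mu^2.
sum_x2_f = sum(d ** 2 * p for d, p in rel_freq.items())  # 2.88
var_comp = sum_x2_f - mu ** 2

assert isclose(var_def, var_comp)

sigma = sqrt(var_def)
print(round(var_def, 4), round(sigma, 3))  # 1.2416 1.114
```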

MEASURE OF LOCATION

The z-score is a commonly used measure of location. For an individual observation, x, the z-score indicates how many standard deviations the observation lies from the mean:

$$z = \frac{x - \mu}{\sigma}$$

The z-score is a standardized measure. It converts an observation from its original units (e.g., dollars, number of credit hours, time, etc.) to units of “number of standard deviations.” The term “x - µ” computes the observation’s deviation from the population mean (i.e., in terms of the original units, “x - µ” is how far the observation lies from the population mean). Dividing by the population standard deviation converts the value of the deviation into units of “number of standard deviations.” When computing a z-score, if the population mean or standard deviation is unknown, use the sample mean or standard deviation as an estimate of the respective population parameter.

Example: Suppose a sample of five students yielded the following number of credit hours each student is taking in the 2003 spring semester.

12, 9, 16, 15, 12

Find and interpret the z-score for a student taking 16 credit hours in the 2003 spring semester.

From previous examples the sample mean was found to be $\bar{x} = 12.8$ and the sample standard deviation was determined to be s = 2.77. Thus, the z-score for a student taking 16 credit hours in the 2003 spring semester is

$$z = \frac{x - \bar{x}}{s} = \frac{16 - 12.8}{2.77} \approx 1.16$$

A student taking 16 credit hours in the 2003 spring semester is 1.16 standard deviations above the average course load.
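A short Python sketch of the z-score calculation follows. The helper name `z_score` is hypothetical. Note that the value 1.16 is obtained when s is first rounded to 2.77, as in the example; carrying the unrounded standard deviation through gives approximately 1.15.

```python
from statistics import mean, stdev

credit_hours = [12, 9, 16, 15, 12]


def z_score(x, center, spread):
    """Number of standard deviations that x lies from the center."""
    return (x - center) / spread


x_bar = mean(credit_hours)   # sample mean, 12.8
s = stdev(credit_hours)      # sample standard deviation (n - 1 divisor)

z_rounded = z_score(16, x_bar, round(s, 2))  # uses s = 2.77, as in the text
z_full = z_score(16, x_bar, s)               # uses the unrounded s

print(round(z_rounded, 2))  # 1.16
print(round(z_full, 2))     # 1.15
```

Either way, the interpretation is the same: the observation lies a little more than one standard deviation above the mean.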

Descriptive Statistics © Mitchell J. Muehsam, Ph.D., January 2003