Univariate Statistics
Slide 1
This lecture addresses what most people refer to as descriptive statistical analysis.
Slide 2 (No Audio)
Slide 3
Sadly, many people are either afraid of statistics and statistical analyses or think about statistics in the wrong way. Here are three examples of thinking about statistics in the wrong way. Walkup’s first three laws are:
Law #1: Everything correlates with everything, especially when the same individual defines the variables to be correlated.
Law #2: It won’t help very much to find a good correlation between the variable you are interested in and some other variable that you don’t understand any better.
Law #3: Unless you can think of a logical reason why two variables should be connected as cause and effect, it doesn’t help much to find a correlation between them. In Columbus, OH the mean monthly rainfall correlates very nicely with the number of letters in the names of the months.
Clearly, Walkup is addressing misconceptions about the use of statistics. Law #1—that everything correlates with everything else, especially if the same person defines the variables—reflects the existence of researcher artifacts. Law #2 indicates that it’s nonsensical to search for correlations between variables to help explain one another when you don’t understand the relationships between the variables. Law #3 suggests that there’s always data to show the relatedness between variables, but without a good theoretical basis for believing that the relationship is true and stable, that supposed relationship is meaningless.
Slide 4
This slide makes a key point about providing evidence for drawing a conclusion. Mrs. Fergusen believes her son is a good boy—that’s her null hypothesis—but the police officer gives Mrs. Fergusen repeated reasons and evidence for believing she’s mistaken in that null hypothesis. By the time the officer gets to Thursday’s evidence, Mrs. Fergusen has decided that she should reject her null hypothesis. That’s what’s meant by statistical evidence: the statistical analysis provides evidence that helps to test null hypotheses.
Slide 5
Here’s what is meant by descriptive analysis: the transformation of raw data into a form that makes them easy to understand and interpret. It entails rearranging, ordering, and manipulating data to generate descriptive information. In essence, descriptive analyses summarize data in a more meaningful and reduced form.
Slide 6 (No Audio)
Slide 7
One of the most basic ways to describe data is to summarize it through tabulation, which is the orderly arrangement of data in a table or other summary format. As subsequent examples will show, running tabulations in a statistical package like SPSS creates frequency tables that include both counts and percentages.
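As a concrete illustration, here is a minimal sketch of tabulation in Python with pandas; the income categories and responses are made up for the example, but a package like SPSS would report equivalent counts and percentages.

```python
import pandas as pd

# Hypothetical responses to an income question (categories are invented)
responses = pd.Series(["<$5,000", "$5,000-$9,999", "$5,000-$9,999",
                       "$10,000-$14,999", "<$5,000", "$10,000-$14,999",
                       "$5,000-$9,999", "$15,000 or more"])

# Tabulation: a count and a percentage for each category
counts = responses.value_counts()
percents = (responses.value_counts(normalize=True) * 100).round(1)

print(pd.DataFrame({"Frequency": counts, "Percent": percents}))
```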
Slide 8
More formally, a frequency table contains numerical data arranged in a row and column format that shows the count and percentages of responses or observations for each category assigned to a variable. Those categories are preselected by the researcher. For example, if we assigned respondents to one of four income categories, those categories would be pre-determined. All a statistical program like SPSS does is to report the frequency and percentages of respondents that fall into one of those four categories.
Slide 9
The next two slides show sample frequency tables of the type generated by SPSS. The first table shows the frequency of respondents by income.
- The eight income categories are mutually exclusive and exhaustive; in other words, all possible incomes are represented by a unique category. There is a ninth category, a ‘no answer’ code, which indicates respondents who opted not to answer the income question.
- The column to the right of the code column contains the respondent count—out of 400 total respondents—who fell into each category.
- The percentages column shows the percentage that each count represents out of 400. For example, for income category $5,000 to $9,999, the count is 40, which represents 10% of the 400 respondents.
- It might be useful to adjust those percentages for non-respondents, and that’s what appears in the next column. Those percentages are adjusted and a little higher than the percentages immediately to the left. Consider that same income category; the count is 40 out of 400 total respondents, but because 40 people didn’t answer the question, the net number of responders to the income question is 360. Dividing 40 by 360 yields 11.1%, which appears in the adjusted percentage column.
- The rightmost column contains adjusted cumulative percentages, which are a running total of the percentages in the column immediately to the left. Take 6.9 and add it to 11.1; the sum is 18.0. Take 18.0 and add it to 16.7—the adjusted percentage for respondents in the $10,000 to $14,999 income category; the sum is 34.7%. This summation continues until 99.9% is reached.
In summary, the adjusted cumulative percentage is a running total of the percentages adjusted for non-respondents. The raw numbers and percentages refer to all respondents, whether or not they answered the question. The code is the category in which each respondent could fall, and those numbers are entered into the computer database. The far left column is provided to enhance the readability of this table. The meaning of each code can be entered into an SPSS data file; for example, a code of ‘1’ means an income of less than $5,000.
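The percentage columns just described can be reproduced with a short sketch. Only the counts for the first three income categories (25, 40, and 60) follow from the narration above; the remaining counts are invented so that the totals (400 respondents, 360 answering) come out right.

```python
import pandas as pd

# Counts per income code; code 9 is the 'no answer' category (40 people)
counts = pd.Series({1: 25, 2: 40, 3: 60, 4: 80, 5: 70, 6: 45, 7: 25, 8: 15, 9: 40})
valid = counts.drop(9)                            # the 360 who answered

percent = (counts / counts.sum() * 100).round(1)  # based on all 400
adjusted = (valid / valid.sum() * 100).round(1)   # adjusted for non-response
cumulative = adjusted.cumsum()                    # running total: 6.9, 18.0, 34.7, ... 99.9

# Code 9 shows NaN in the adjusted columns, much like SPSS's missing row
print(pd.DataFrame({"Percent": percent, "Adjusted %": adjusted, "Cum. %": cumulative}))
```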
Slide 10
This slide contains a table that is more reflective of the output you would receive from SPSS. Notice the word ‘Affect’ in the upper left-hand corner; that’s the name for this question. In SPSS you may enter a name that is as long as eight characters. Immediately to the right of the question (or variable) name is ‘Feelings about the commercial you just saw.’ Entering that type of information into SPSS makes the output more readable. The second column—the one titled ‘Value’—contains the numbers 1 through 5, which are the numbers entered into the computer that correspond with respondents’ answers to this question. Immediately to the left of the column ‘Value’ is the column ‘Value Label’. That column also contains information that can be entered into SPSS to enhance the table’s readability. The remaining four columns—headed by Frequency, Percent, Valid Percent, and Cumulative Percent—correspond to the last four columns on the previous slide. The one major difference is ‘Valid cases’ and ‘Missing cases’ at the bottom of the table. In the previous slide, missing cases were designated by the number ‘9’ and were included in the table. In this example, missing data was left blank in the database and SPSS picked up each blank case. The last number in the ‘Frequency’ column (‘5’) corresponds to the five blank cases, the adjacent ‘.9’ is the percent of blank cases, and the ‘Valid Percent’ column contains the word ‘Missing.’ The only difference between this table and the previous table is the handling of missing cases.
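The way SPSS separates valid from missing cases can be mimicked with pandas, where a blank case becomes NaN; the ten responses below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 'Affect' responses coded 1-5; np.nan stands in for the
# blank (missing) cases that SPSS reports at the bottom of the table
affect = pd.Series([1, 2, 3, 3, 4, 5, np.nan, 2, 3, np.nan])

percent = affect.value_counts(dropna=False, normalize=True) * 100  # missing included
valid_percent = affect.value_counts(normalize=True) * 100          # missing excluded

print(percent.round(1))
print(valid_percent.round(1))
```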
Slide 11
SPSS can also generate histograms, which are a pictorial way of showing frequency data.
Slide 12
The issue introduced in the previous slide, regarding valid cases, relates to the notion of a base. A base is the number of respondents or observations in a row or column that is used as a basis for computing percentages. There are several alternative bases that make more or less sense to use depending on what’s being summarized. For example, for questions that all respondents answered, a base of all respondents means taking the number who responded each way and dividing by the total number of respondents. Alternatively, if only a subset of respondents answered a question (that they could validly skip), then it’s more meaningful to divide the number of responses in each category by the number of respondents who were asked that question. That’s what occurs in the previous frequency table; only 535 of the 540 respondents answered the ‘Affect’ question. As a result, all percentages are based on respondents who answered that question.
The major issue about bases relates to multi-response questions. For example, respondents could be asked to check off all the magazines to which they subscribe from a lengthy list. One way to look at that data is ‘What percent of respondents subscribe to each magazine?’ To calculate this percent, it’s necessary to divide the number of subscribers by the total number of respondents. Another way to look at that data is ‘To what degree does each magazine comprise the universe of what people read?’ To calculate that percent, it’s necessary to divide the number of subscriptions for each magazine by the total number of subscriptions. Either is a reasonable way—depending on the research question—to look at the data.
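Here is a sketch of both bases; the magazine titles and counts are made up.

```python
# Multi-response question: which magazines do you subscribe to?
subscribers = {"Time": 120, "Newsweek": 90, "Fortune": 60}
n_respondents = 400                           # everyone was asked the question
n_subscriptions = sum(subscribers.values())   # 270 subscriptions in all

for title, n in subscribers.items():
    print(f"{title}: {n / n_respondents:.1%} of respondents, "
          f"{n / n_subscriptions:.1%} of all subscriptions")
```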
Slide 13
Frequency tables provide an excellent way to summarize survey data, but as the previous tables suggest, such tables contain many numbers. Is there a more efficient way to summarize survey data? By efficient, I mean that survey responses are summarized by a smaller set of numbers. Fortunately, the answer is ‘yes’ because of measures of central tendency. There are three such measures. The mode, which is the value that occurs most often, can summarize nominal or categorical data. For example, if more males than females responded to a ‘Sex’ question, then the mode on that question is ‘male’. The median, which is the midpoint of scores ordered from lowest to highest, can summarize ordinally scaled variables. The median is the same as the 50th percentile. The arithmetic mean, which is the sum of all scores divided by the number of respondents, can summarize metric—intervally or ratio-scaled—data.
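A minimal sketch of the three measures, using Python’s standard statistics module and made-up data:

```python
import statistics

sexes = ["male", "male", "female", "male"]     # nominal data
ratings = [1, 2, 2, 3, 5]                      # ordinal data
incomes = [28000, 35000, 41000, 52000, 60000]  # metric data

print(statistics.mode(sexes))      # 'male' -- the most frequent category
print(statistics.median(ratings))  # 2 -- the midpoint, i.e., the 50th percentile
print(statistics.mean(incomes))    # 43200 -- the arithmetic mean
```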
Slide 14
Although measures of central tendency—like mode, median, and mean—provide the best single summary number for a set of responses to a survey question, that single number lacks information about the degree to which people’s responses differed on that question. It would be useful to know if everyone who answered that question answered similarly, or if they answered that question rather differently. Thus, both (1) a best single summary measure, and (2) a measure of the degree to which that single measure captures all responses, are needed. That second measure would detect dispersion or spread. Five standard measures of dispersion are range, inter-quartile range, mean absolute deviation, variance, and standard deviation.
Slide 15
The next two slides illustrate low versus high dispersion. In both examples, the mean value on the variable is roughly 180. In the first example, all the answers tend to cluster around that 180 score. In the second example, the scores range from 150 to 210. In the first example, the mean is an excellent summary of scores on this variable across all cases or respondents. In the second example, the mean is not reflective of the score for some cases or respondents.
Slide 16 (No Audio)
Slide 17
Although range is a relatively primitive measure of spread, it’s easy to compute and easy to understand. Range is the difference between the smallest and largest values in a set of numbers. Inter-quartile range, which loosely corresponds to plus or minus 1 standard deviation, is the difference between the 75th percentile and the 25th percentile. Range also provides a useful approximation for standard deviation. Often, the lowest score in the range corresponds to three standard deviations below the mean and the highest score corresponds to three standard deviations above the mean; hence, the range often is roughly six standard deviations. Taking the range and dividing by six should give a good estimate of the standard deviation.
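The sketch below checks the range-divided-by-six rule of thumb on simulated data; the sample size and distribution parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=100, scale=15, size=1000)  # simulated metric scores

data_range = scores.max() - scores.min()
iqr = np.percentile(scores, 75) - np.percentile(scores, 25)

print("range:", round(data_range, 1))
print("inter-quartile range:", round(iqr, 1))
print("range/6 estimate of SD:", round(data_range / 6, 1))  # near the true SD of 15
print("sample SD:", round(scores.std(ddof=1), 1))
```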
Slide 18
If we have metric data, so that we can add and subtract, then we can compute deviation scores. These scores are the differences between each observed value and the mean.
Slide 19
The mean is the best single summary number for metric data. If we subtract the mean from each score and sum those deviation scores, the net result is zero (0), which indicates the mean is an unbiased estimator of central tendency. The deviations below the mean exactly balance the deviations above the mean.
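A quick numerical check, using five made-up scores:

```python
import numpy as np

scores = np.array([4, 7, 9, 10, 15])   # mean is 9
deviations = scores - scores.mean()    # deviation scores

print(deviations)        # [-5. -2.  0.  1.  6.]
print(deviations.sum())  # 0.0 -- deviations from the mean always sum to zero
```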
Slide 20
Because the average deviation is always zero (0), one useful way to make deviation scores meaningful is to square them. Squaring both eliminates the zero (0) problem and weights bigger deviations from the mean more than smaller deviations from the mean; because the square of a large number is far larger than the square of a small number, squared deviations emphasize the scores that differ most from the mean.
Slide 21
Continuing with the basic notion of difference from the mean, consider variation from the mean or variance. For a survey, sample variance is the variability of responses to a given question. Sample variance is an estimate of the population variance.
Slide 22
A mean squared deviation is computed by subtracting the mean from each score on a variable, squaring those differences, summing them, and dividing by ‘n’ (the number of sample members). The n-1 in the denominator of the sample variance equation is an adjustment that makes the sample variance an unbiased estimate of the population variance.
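This sketch computes the mean squared deviation with and without the n-1 adjustment, using the same five made-up scores as before; NumPy’s ddof=1 argument applies the n-1 denominator.

```python
import numpy as np

scores = np.array([4, 7, 9, 10, 15])
sq_dev = (scores - scores.mean()) ** 2   # squared deviations, summing to 66

print(sq_dev.sum() / len(scores))        # 13.2 -- divide by n
print(sq_dev.sum() / (len(scores) - 1))  # 16.5 -- divide by n-1 (sample variance)
print(np.var(scores, ddof=1))            # 16.5 -- ddof=1 applies the n-1 adjustment
```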
Slide 23
Ignoring the n-1 adjustment in the denominator, the variance is essentially a mean squared deviation, given in squared units. When it’s more convenient not to present variability in squared units, taking the square root of the variance yields the standard deviation.
Slide 24
To summarize the last several slides, I spoke about central tendency and dispersion. I indicated that the appropriate measure of central tendency depends on the type of scale: nominal, ordinal, or metric data. In addition, the measure of dispersion depends on the type of scale. For nominal data, it’s impossible to create a rank order, so no meaningful measure of dispersion is possible. For ordinal data, dispersion is expressed in percentiles, so measures like inter-quartile range are meaningful. For metric data, standard deviation or variance are appropriate measures of dispersion.
Slide 25 (No Audio)
Slide 26
If the data on one variable are distributed symmetrically, as in this bell-shaped curve, then the mean, the mode, and the median are identical.
Slide 27
In contrast, if data are skewed positively or negatively, then the values of the mode, median, and mean will differ. One way to remember positive versus negative skew is by looking at the direction of the tail. If the tail points to the positive end of the number line, then the data are skewed positively, and if the tail points to the negative end of the number line, then the data are skewed negatively. With positively skewed data, the mode is less than the median, which in turn is less than the mean. With negatively skewed data, the reverse is true.
Consider housing prices mentioned in the news media. Realtors report the median selling price of existing and new homes. House prices are in dollars, which is metric data; hence, the mean is a valid and seemingly appropriate statistic. Nonetheless, the median rather than the mean is reported because the former is more stable over time, which permits more reliable month-to-month comparisons. In contrast, the mean selling price for a month could increase or decrease markedly in a community if one or two expensive homes are or aren’t sold. To establish long-term trends in housing prices, a more stable indicator of prices is preferred. Here, the data is metric, yet the median provides a better summary number. Researchers must consider the distribution of their data and what they’re trying to accomplish before selecting the best measure of central tendency.
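A small sketch with invented sale prices shows why the median is the more stable summary: one luxury sale moves the mean dramatically but leaves the median untouched.

```python
import statistics

june = [180000, 195000, 210000, 220000, 240000]
july = [180000, 195000, 210000, 220000, 2400000]  # one expensive home sold

print(statistics.mean(june), statistics.median(june))  # 209000 210000
print(statistics.mean(july), statistics.median(july))  # 641000 210000
```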
Slide 28
Consider one type of symmetric distribution: the normal distribution. This bell-shaped distribution of values is symmetric about its mean. The mean identifies the highest point on the curve or the most frequent response on that variable. Almost all values are within +/-3 standard deviations from the mean, which is why the range divided by six provides a good estimate of standard deviation. Normal distributions are assumed to be continuous distributions; there could be an infinite number of cases. As survey data contains discrete scores, it cannot form a true normal distribution. Nonetheless, it’s worthwhile to assess how closely survey data conforms to a normal distribution because a skewed distribution of responses has data analysis implications.
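A simulation can confirm how much of a normal distribution falls within one, two, and three standard deviations of the mean; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0, scale=1, size=100_000)  # standard normal draws

for k in (1, 2, 3):
    within = np.mean(np.abs(x) <= k) * 100
    print(f"within +/-{k} SD: {within:.1f}%")  # roughly 68.3%, 95.4%, 99.7%
```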
Slide 29
Inter-quartile range is a good approximation of +/-1 standard deviation from the mean. The inter-quartile range is the middle 50% of the distribution; +/-1 standard deviation from the mean in a normal distribution covers a bit more than 68%.
Slide 30 (No Audio)
Slide 31
For IQ scores, the standard deviation, historically, has been 15 points. The mean score is 100 and one standard deviation above the mean score is 115; three standard deviations above the mean score is 145.
Slide 32
Normally distributed data can be standardized by treating the total area under the bell curve as a probability of 1.0. This sets the mean equal to ‘0’ and the standard deviation equal to ‘1’.
Slide 33 (No Audio)
Slide 34
Normally distributed data can be converted from raw scores into Z scores, which typically range from -3 to +3. A Z score indicates the extent to which a score falls above or below its mean. When dealing with raw metric data, it’s important to know the mean and the difference from that mean; all that information is contained within a Z score.
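A final sketch converts raw scores to Z scores using the IQ example from slide 31 (mean 100, standard deviation 15):

```python
import numpy as np

raw = np.array([85, 100, 115, 145])  # raw IQ-style scores
z = (raw - 100) / 15                 # Z = (score - mean) / SD

print(z)  # [-1.  0.  1.  3.] -- each Z counts standard deviations from the mean
```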