Chapter 2-1. Describing Variables, Levels of Measurement, and Choice of Descriptive Statistics
Statisticians
In the front of M.G. Kendall and A. Stuart’s The Advanced Theory of Statistics, Vol. 2, is a quotation attributed to the fictitious K.A.C. Manderville, The Undoing of Lamia Gurdleneck.
"You haven't told me yet," said Lady Nuttal, "what it is your fiance does for a living."
"He's a statistician," replied Lamia, with an annoying sense of being on the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to her that statisticians entered into normal social relationships. The species, she would have surmised, was perpetuated in some collateral manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia warmly.
"I don't doubt it," said her aunt, who obviously doubted it very much. "To express anything important in mere figures is so plainly impossible that there must be endless scope for well-paid advice on how to do it. But don't you think that life with a statistician would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising depth of emotional possibility which she had discovered below Edward's numerical veneer.
"It's not the figures themselves," she said finally. "It's what you do with them that matters."
Statistics is the Mathematics of Distributions
At the individual level, we can describe features with single numbers (e.g., Fred is 26 years old). At the group level, however, we lose this ability. For example, there is not a single number to describe the age of the people in a college classroom. There are many ages represented, which we call a distribution of ages. Statistics can be said to be the mathematics of distributions. It allows us to describe and test theories in our universe, since our universe can be conceived of as an infinite set of distributions.
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Displaying a Variable Distribution
In Windows Explorer, find the datasets & do files subdirectory of the course manual. Double click on
births_with_missing.dta
to start Stata and read in the data.
If that does not work (for example, because the file type is not associated with Stata), use the following:
File
Open
Find the directory where you copied the course CD
Find the subdirectory datasets & do-files
Single click on births_with_missing.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
births_with_missing.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do files"
use births_with_missing, clear
One way to discover the sample size of a dataset is to use the Stata menus:
Data
Variable utilities
Count observations satisfying condition
OK
or simply execute the following command in the Command Window:
count
Notice that anytime you use the menus, a Stata command is constructed, executed, and shown in the Results Window.
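As a minimal sketch of what count does, with and without a condition, the following do-file lines use a tiny made-up dataset (the five ages are invented for illustration and are not the course data):

```stata
* Toy data, invented for illustration (not births_with_missing.dta)
clear
input matage
34
29
36
31
40
end

count                   // number of observations in memory
count if matage >= 35   // number of observations satisfying a condition
```

The first command reports 5 and the second reports 2. The same pattern applies to the course dataset once it is loaded.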
The crudest way to see the distribution of a variable is to simply list its values. Let’s do this for maternal age:
Data
Describe data
List data
Main tab: Variables: matage
OK
list matage
<hit the abort button (the red dot with a white X) after one screenful, since we know there are 500 lines of data>
Clearly, a list of 500 numbers is difficult to comprehend. What we do in statistics is a process called data reduction, which is to reduce the information into a form that is easier to comprehend.
The first level of data reduction is the frequency table, which we get using
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: matage
OK
tabulate matage
<or abbreviate command to:>
tab matage
<or abbreviate both command and variable name to:>
tab mat
<or use minimum abbreviation:>
ta m
This generates the following frequency table.
   maternal |
        age |      Freq.     Percent        Cum.
------------+-----------------------------------
         23 |          1        0.21        0.21
         24 |          1        0.21        0.41
         25 |          5        1.03        1.44
         26 |          9        1.86        3.30
         27 |         14        2.89        6.19
         28 |         12        2.47        8.66
         29 |         26        5.36       14.02
         30 |         27        5.57       19.59
         31 |         29        5.98       25.57
         32 |         39        8.04       33.61
         33 |         45        9.28       42.89
         34 |         51       10.52       53.40
         35 |         44        9.07       62.47
         36 |         31        6.39       68.87
         37 |         44        9.07       77.94
         38 |         44        9.07       87.01
         39 |         27        5.57       92.58
         40 |         24        4.95       97.53
         41 |          7        1.44       98.97
         42 |          2        0.41       99.38
         43 |          3        0.62      100.00
------------+-----------------------------------
      Total |        485      100.00
The left column is the actual values (scores) for the maternal age variable. The “Freq” column is the number of times that particular value occurred in the data (the frequency). The “Percent” column is the frequency count divided by the sample size (N=485, rather than N=500, which we will discuss shortly). The “Cum” column is the cumulative sum of the Percent column, informing you what percent of the sample had “equal to or less than” that value.
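The Percent and Cum. columns can be checked by hand. The sketch below simply redoes the arithmetic for the first three rows of the table above, using the frequencies 1, 1, and 5 and the non-missing sample size of 485:

```stata
* Percent for age 23: 1 of 485 non-missing observations
display %5.2f 100 * 1/485

* Cum. through age 24: (1 + 1) of 485
display %5.2f 100 * (1 + 1)/485

* Cum. through age 25: (1 + 1 + 5) of 485
display %5.2f 100 * (1 + 1 + 5)/485
```

These display 0.21, 0.41, and 1.44, matching the table.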
By default, Stata commands only operate on non-missing values, which is what you normally want to report in your article. To see the number of missing values with the tabulate command, we simply add the “missing” option:
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: matage
Treat missing values like all other values
OK
tabulate matage , missing
   maternal |
        age |      Freq.     Percent        Cum.
------------+-----------------------------------
         23 |          1        0.20        0.20
         24 |          1        0.20        0.40
         25 |          5        1.00        1.40
         26 |          9        1.80        3.20
         27 |         14        2.80        6.00
         28 |         12        2.40        8.40
         29 |         26        5.20       13.60
         30 |         27        5.40       19.00
         31 |         29        5.80       24.80
         32 |         39        7.80       32.60
         33 |         45        9.00       41.60
         34 |         51       10.20       51.80
         35 |         44        8.80       60.60
         36 |         31        6.20       66.80
         37 |         44        8.80       75.60
         38 |         44        8.80       84.40
         39 |         27        5.40       89.80
         40 |         24        4.80       94.60
         41 |          7        1.40       96.00
         42 |          2        0.40       96.40
         43 |          3        0.60       97.00
          . |         15        3.00      100.00
------------+-----------------------------------
      Total |        500      100.00
This time we see that n=15 observations have a value of “.”, which is Stata’s reserved symbol for a missing value of a numeric variable.
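A quick way to count numeric missing values without a full table is the missing() function. The sketch below uses a small made-up dataset (not the course data) and also illustrates a common pitfall: Stata treats “.” as larger than any number, so a condition such as matage > 30 is also true for missing ages.

```stata
* Toy data, invented for illustration (not births_with_missing.dta)
clear
input matage
34
29
.
31
.
end

count if missing(matage)   // number of missing ages: 2
tabulate matage, missing   // the "." row shows the same count

* Caution: "." sorts above every number, so this also counts the 2 missing ages
count if matage > 30
* Excluding missing values explicitly gives the intended answer
count if matage > 30 & !missing(matage)
```

The last two counts are 4 and 2, respectively.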
Let’s try this for a string variable, which is a variable of letters and/or special symbols (not numbers).
Statistics
Summaries, tables & tests
Tables
One-way tables
Main tab: Categorical variable: sexalph
Treat missing values like all other values
OK
tabulate sexalph , missing
  sex coded |
  as string |      Freq.     Percent        Cum.
------------+-----------------------------------
            |         41        8.20        8.20
     female |        220       44.00       52.20
       male |        239       47.80      100.00
------------+-----------------------------------
      Total |        500      100.00
The missing value looks like a space, but it is actually a null value. If you were entering a missing string value using the Stata data editor, you would hit the tab key or the enter key to input the missing value, rather than the space character.
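The difference between a missing string and a space can be seen directly. This sketch builds a three-observation toy variable (invented for illustration, reusing the name sexalph) and shows that the missing string is the empty string "", not " ":

```stata
* Toy data, invented for illustration (not births_with_missing.dta)
clear
set obs 3
generate str6 sexalph = "male"
replace sexalph = "" in 2          // empty string = missing string value
replace sexalph = "female" in 3

count if sexalph == ""             // 1: the missing value
count if missing(sexalph)          // 1: missing() also works for strings
count if sexalph == " "            // 0: a single space is a different value
```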
When we look at the Freq or Percent columns of the frequency table, we get a sense for what values are most likely.
The next level of data reduction is to see this graphically with a histogram.
Graphics
Histogram
Main tab: Variable: matage
OK
histogram matage
Notice that the tick marks don’t line up precisely with the middle of the bars, and that there appears to be a value of maternal age, perhaps age 33, that has a zero frequency. Looking back at the frequency table, however, reveals no age with a zero frequency.
Stata tries to produce the nicest-looking graph by choosing the number of bins (intervals defined by cut-offs) itself. In the Results Window, we see what it did:
(bin=22, start=23, width=.90909091)
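These numbers can be reproduced by hand. The sketch below assumes the default rule described in the histogram documentation, namely that the number of bins is the smaller of sqrt(N) and 10*ln(N)/ln(10), rounded to the nearest integer; the scalar names are hypothetical and chosen only for this illustration. N is 485 (the non-missing ages), and the range 23 to 43 comes from the frequency table.

```stata
* Assumed default bin rule: min(sqrt(N), 10*ln(N)/ln(10)), rounded
scalar nobs     = 485
scalar nbins    = round(min(sqrt(scalar(nobs)), 10*ln(scalar(nobs))/ln(10)))

* Bin width = (maximum - minimum) / number of bins
scalar binwidth = (43 - 23)/scalar(nbins)

display "bins  = " scalar(nbins)
display "width = " %10.8f scalar(binwidth)
```

With 22 bins of width 0.909 spread over 21 integer ages, at least one bin contains no integer at all, which is why the histogram shows an empty bar even though every age has a nonzero frequency.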
This algorithm generally works for a continuous variable that has a large number of distinct values. For our variable, which has a relatively small number of distinct values, we can bypass this feature using the “discrete” option.
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
OK
histogram matage , discrete
The default scale, “density”, is here the proportion of the total sample of non-missing values that falls at each specific value of matage (the bars are one year wide, so density and proportion coincide). It is a graph of the Percent column of the frequency table, converted to a proportion.
To change the y-axis to percents, use the percent option:
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Percent
OK
histogram matage , discrete percent
To change the y-axis to frequency counts, use the frequency option:
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
OK
histogram matage , discrete frequency
Many people think it’s fun to overlay a normal distribution curve on their histogram, to get a feel for how normally distributed the variable is. (Note: the importance of having a normally distributed variable is overrated, as we will see in a later chapter.)
To get this, we request the normal option:
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
Density Plots tab: Add normal density plot
OK
histogram matage , discrete frequency normal
To see the line that passes through the top center of each bar, but with smoothing, we can overlay on the histogram what is called a “kernel density plot”, using:
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
Density Plots tab: Add kernel density plot
OK
histogram matage , discrete frequency kdensity
To visually compare the histograms for males and females, we use the by( ) option.
Graphics
Histogram
Main tab: Variable: matage
Data is discrete
Y-axis: Frequency
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
histogram matage, discrete frequency by(sexalph)
Notice that it is difficult to tell how much the two histograms overlap, or diverge.
One approach we can take is to overlay the kernel density plots for the two sexes, using
(kdensity matage if sexalph == "male" , lcolor(blue))
Here, the “///” in the command means to continue the command onto the next line, but this only works in the do-file editor. To use that, click on the 5th icon from the right on the menu bar, which looks like a notebook. Type in your command, and hit the “Do” button, which is the last icon on the do-file editor menu bar. In the command window, you would just type the entire command in on one line, without the “///”. The “( ) ( )”, with a command for a different graph in each “( )”, tells Stata to overlay the graphs.
Notice it uses “density”, rather than “frequency”. With frequencies, the two sexes could have different-looking lines simply due to different sample sizes, while using proportions, or densities, makes them directly comparable.
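Only the second parenthesized plot of the overlay command is shown above. The following is a sketch of what a complete do-file version could look like, not a reconstruction of the original command: the twoway prefix and the “///” continuations follow the description in the text, while the first plot, its color (red), and the legend option are assumptions. It runs on a small made-up dataset so that it is self-contained.

```stata
* Toy data, invented for illustration (not births_with_missing.dta)
clear
set obs 8
generate matage = 25 + 2*_n                                  // ages 27, 29, ..., 41
generate str6 sexalph = cond(mod(_n, 2), "male", "female")   // alternate the sexes

* Assumed first plot (female, red) and assumed legend labels;
* only the male plot in blue appears in the text above
twoway (kdensity matage if sexalph == "female", lcolor(red)) ///
       (kdensity matage if sexalph == "male" , lcolor(blue)) ///
       , legend(label(1 "female") label(2 "male"))
```

In the Command Window, the same command would be typed on one line with the “///” markers removed.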
Another graph to show this, although more complicated to create, is
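* set the working directory and reload the data
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "BiostatsCourse\datasets & do files"
use births_with_missing.dta, clear
* keep only observations with both maternal age and sex recorded
drop if matage==. | sex==.
* store the number of males (sex==1) and females (sex==2) as denominators
count if sex==1
scalar Nmale=r(N)
count if sex==2
scalar Nfemale=r(N)
* collapse to one row per sex-by-age combination, holding its frequency
gen one = 1
collapse (count) one , by(sex matage)
rename one frequency
* convert frequencies to within-sex percents
gen percent=frequency/Nmale*100 if sex==1
replace percent=frequency/Nfemale*100 if sex==2
replace percent = -percent if sex==1 // males on bottom
* draw males below and females above the zero line
#delimit ;
twoway (bar percent matage if sex==1)
(bar percent matage if sex==2)
, legend(lab(1 "male") lab(2 "female"))
ylabel(-10 "10" -5 "5" 0 "0" 5 "5" 10 "10")
;
#delimit cr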
The commands used for this graph are taught in the K30 Computer Practicum (Stata) course.
Although fancy, it is still too much information to get your head around.
Finally, a frequently reported graph for comparing two or more distributions is the box and whisker plot, or boxplot for short.
Graphics
Box plot
Main tab: Variables: matage
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
graph box matage, by(sexalph)
The advantage of this graph is that it incorporates a side by side comparison, with just the right amount of data reduction, which makes it the most popular approach for visually comparing distributions.
Before we discuss what this graph represents, let’s add an outlier to the data to see how it is represented.
Go into the data editor (the spreadsheet-with-pencil icon, called “Data Editor (Edit)” in Stata 11, 5th from the right on the menu bar) and change matage from 34 to 70 for the first observation. Click on the X to exit, and click on accept changes in the exit data editor box. When you clicked on accept changes, Stata executed the following command:
replace matage = 70 in 1
Re-creating the boxplot,
Graphics
Box plot
Main tab: Variables: matage
By tab: Draw subgraphs for unique values of variables
Variables: sexalph
OK
graph box matage, by(sexalph)
Notice that the outlier is represented by a dot.
The interpretation of a boxplot is shown in the following box.
Boxplots
This graph is a box-and-whisker plot, or simply a boxplot. The box shows the interquartile range (IQR): the top of the box is the 75th percentile and the bottom of the box is the 25th percentile. The line inside the box is the median (50th percentile). The lines extending beyond the box, which look like error bars, are called the whiskers and represent the minimum and maximum. However, if a data value lies more than 1.5 × IQR beyond the box in either direction, the whiskers exclude that value and the value is shown separately as a point on the graph. These points are called extreme values. Extreme values might represent outliers, an outlier being a data value that appears not to have come from the same population as the rest of the sample.
This graphical approach for identifying outliers was proposed by Tukey (1977). Tukey referred to outliers as “extreme values” to avoid the whole “outlier” issue, it being controversial how outliers should be handled.
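The 1.5 × IQR rule can also be applied numerically. The sketch below uses a made-up set of eight ages (one of them an implausible 70, echoing the outlier added earlier) and hypothetical scalar names; it takes the quartiles from summarize with the detail option and flags any value lying more than 1.5 × IQR beyond the box.

```stata
* Toy data, invented for illustration (not births_with_missing.dta)
clear
input matage
30
31
32
33
34
35
36
70
end

* Quartiles are returned in r(p25) and r(p75)
quietly summarize matage, detail
scalar iqr_age  = r(p75) - r(p25)
scalar hi_fence = r(p75) + 1.5*scalar(iqr_age)
scalar lo_fence = r(p25) - 1.5*scalar(iqr_age)

* Flag values beyond the fences (missing values are excluded explicitly)
generate byte extreme = (matage > scalar(hi_fence) | matage < scalar(lo_fence)) & !missing(matage)
list matage if extreme
```

Only the value 70 is listed, which is the observation a boxplot of these data would draw as a separate dot.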