Basic Biostatistics in Medical Research: What (Not) to Do

November 7, 2013

Leah J. Welty, PhD

Biostatistics Collaboration Center

Welcome to Basic Biostatistics in Medical Research: What (Not) to Do. This is part of a bi-annual lecture series presented by the Biostatistics Collaboration Center.

A laudable goal for today would be for you to come away understanding everything you might want or need to know about biostatistics in medical research. Unfortunately, that’s highly unlikely in an hour, especially with an audience of varied specialties and backgrounds. Even condensing introductory biostatistics into an hour-long lecture would be impossible. So, rather than attempting to cover all the background and methodology that’s out there, I will instead focus on areas in which people are prone to making mistakes, applying biostatistical methods incorrectly or inadequately, or becoming confused about what their results mean.

This lecture is accordingly divided into four sections:

  1. A good picture is worth 1,000 words: the importance of statistical graphics
  2. Not all observations are created independent
  3. What is a p-value really?
  4. How to collaborate with a biostatistician.

For those who get excited about this today, please come back next week. If you’re serious about expanding your own biostatistics repertoire, there are a number of excellent biostatistics courses offered by the graduate school. If instead you’re looking for some guidance on the methods appropriate for your own research, I urge you to listen carefully in section 4, and consider visiting the Biostatistics Collaboration Center.

I. A good picture is worth 1,000 words.

Statistical graphics can do two very important things: (1) guide the appropriate choice of statistical analyses; and (2) provide a powerful illustration of results. My first piece of advice to investigators is to look at their data -- not to “fish” for results -- but to understand how individual variables are distributed and best summarized. Then, once data have been (appropriately) analyzed and are being prepared for publication, my next piece of advice is to think about (creative) ways to graphically display results.

A. Graphics guiding appropriate analysis choices

Example 1: Correlation and Anscombe’s Quartet

Correlation measures the strength of the linear association between two variables. It is often denoted by “r” and takes values between -1 and 1. The values have the following interpretations:

r near -1: Strong negative linear association

r near 0: No linear association

r near 1: Strong positive linear association.
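In R, for example, Pearson correlation is computed with the cor() function. Here is a minimal sketch using simulated data (the variable names are illustrative, not from any study):

    # Simulate two linearly related variables and compute their correlation
    set.seed(123)
    A <- rnorm(50)                       # 50 standard normal observations
    B <- 0.8 * A + rnorm(50, sd = 0.6)   # linearly related to A, plus noise
    cor(A, B)                            # Pearson correlation, between -1 and 1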

Suppose I have two variables A and B, and I tell you that their correlation is 0.82. What impression does that make? Hopefully that A and B are fairly strongly linearly associated. The picture we associate with this relationship might look something like what is shown below (Figure 1), where the variables A and B do in fact have a correlation of 0.82.

Figure 1: Scatterplot of variables A and B showing a strong positive linear association (r = 0.82).

However, it’s also possible for the relationship between the variables A and B to be quite different while their correlation is still 0.82.

First, the variables A and B may be related in a non-linear fashion. Figure 2 illustrates variables A and B that are quadratically related. Although r = 0.82 because there is still a linear trend, correlation is not an accurate description of the strength of the relationship.

Figure 2: Variables A and B are quadratically related, yet r = 0.82.

Second, variables A and B may either have no relationship at all, or we may not have adequate information to capture the relationship, yet still r = 0.82. In Figure 3, for all but one observation, it appears that A is completely unrelated to B, or at least that B may vary substantially without any change in A.

Figure 3: A single influential observation produces r = 0.82, despite no apparent relationship among the remaining points.

The single value on the right side of the plot is what we refer to as an “influential observation.” Correlation is notorious for not being “robust,” in the sense that it can depend heavily on just a few observations. If I were presented with these data, I would recommend two courses of action: (1) investigate the influential point (is it an obvious mistake in coding or measurement?) and (2) if possible, collect more data in which observations don’t all have the same values for A. Don’t throw out the influential observation unless you can determine it was clearly an error (and not just an “error” in the sense that it doesn’t match the rest of the data). Sometimes the most unusual observations end up giving us the most insight. It may well be the case that B increases as A increases; we just can’t determine that from this limited amount of information. Reporting r = 0.82 would be highly misleading for these data.

Our third and final example involves another unusual observation. In Figure 4 below, variables A and B appear to have a perfect linear relationship, minus one observation, and the correlation is 0.82. As above, it would be wise to investigate this observation a bit more -- is it an error, or are there some people (or observational units) for which A and B don’t have the same relationship? It’s also incredibly rare to see such a perfect linear relationship in practice, so I would recommend investigating the points that appear perfectly related as well.

Figure 4: A near-perfect linear relationship plus one discrepant observation, with r = 0.82.

In only one of the above four cases was correlation a reasonable summary measure of the relationship between A and B. Have you ever computed a correlation coefficient without first making sure it’s an appropriate summary measure?
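If you’d like to explore this yourself, R ships with Anscombe’s famous quartet as the built-in data frame anscombe: four x-y pairs with nearly identical correlations (about 0.82) but strikingly different scatterplots, much like the four figures above. A minimal sketch:

    # Anscombe's quartet is included in base R (datasets package)
    data(anscombe)

    # All four pairs have essentially the same Pearson correlation (~0.82)...
    sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                anscombe[[paste0("y", i)]]))

    # ...but plotting reveals four very different relationships
    par(mfrow = c(2, 2))
    for (i in 1:4) {
      plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
           xlab = paste0("x", i), ylab = paste0("y", i))
    }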

As a final note, there are many kinds of correlation. We’ve been discussing the most common version, known as Pearson correlation. Spearman rank correlation and Kendall’s tau are somewhat common as well, but they measure monotonic rather than strictly linear association.
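In R, all three are available through the method argument of cor(); continuing with the simulated A and B from the earlier sketch:

    cor(A, B, method = "pearson")   # linear association (the default)
    cor(A, B, method = "spearman")  # rank-based; monotonic association
    cor(A, B, method = "kendall")   # Kendall's tau, also rank-based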

Example 2: Means, Medians, and Days in Corrections

The most common summary measures of continuous data are undoubtedly the mean and the associated standard deviation. However, the mean may not always be the most accurate or appropriate summary statistic to associate with continuous data. Have you ever computed means and standard deviations without actually looking at the distribution of the variable first? The next example illustrates why that’s not such a good idea.

This example comes from some research I do with the group called Health Disparities and Public Policy in the Department of Psychiatry and Behavioral Sciences. In particular, we have a prospective longitudinal study of juvenile delinquents after detention. A number of participants cycle in and out of the correctional system (jail, prison), and one of the measures we are interested in is the amount of time they spend incarcerated.

For this mock data (a subsample of 1000 participants from an interview 5 years after detention as a juvenile), we found that the average number of days spent in corrections during the past year was 84. However, the median number of days in corrections in the past year was 0. Figure 5, below, illustrates what’s going on. Over half the participants (544) had no correctional stays during the past year, and the next largest chunk of participants (99) were in a correctional facility the entire year. The remaining participants are distributed between 1 and 364 in a fairly uniform way. However, the 99 participants who were in a correctional facility the entire time “pull” the mean to 84.

Figure 5: Distribution of days in corrections during the past year (mean 84, median 0).

The mean is not “robust” to outlying values; the median is. The mean is actually the balance point of the distribution: if you imagine putting the histogram on a fulcrum, you’d need to place the fulcrum at 84 to balance the two ends. Those 99 values are ‘far away’ from the rest of the data, which gives them disproportionate weight.
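A quick simulation in the same spirit illustrates the pull. The exact mock distribution isn’t reproduced here, so the resulting numbers are illustrative rather than an exact match to the 84 reported above:

    set.seed(42)
    # Mimic the shape described above: 544 participants with 0 days,
    # 99 with a full year (365 days), the rest spread between 1 and 364
    days <- c(rep(0, 544),
              rep(365, 99),
              sample(1:364, 357, replace = TRUE))

    mean(days)    # pulled well above zero by the values far from the rest
    median(days)  # 0, since more than half the sample had no stays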

The lesson here is not to blindly compute means without first making sure that they’re an appropriate summary measure. If they’re not, don’t be afraid to report the median. Just as the mean is generally reported with the standard deviation, the median should be reported with the range and quartiles (often the 25th and 75th percentiles, along with the interquartile range, the distance between the two).
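Continuing with the simulated days from the sketch above, R reports all of these quantities at once:

    # Median with range and quartiles, as suggested above
    quantile(days, probs = c(0, 0.25, 0.5, 0.75, 1))
    IQR(days)  # interquartile range: 75th minus 25th percentile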

As a final note, the two histograms below illustrate fake data in which both variables have a mean of 2.0. For the symmetric (and normally distributed) data on the left, the median is also 2.0. For the skewed data on the right, the median is 1.4. For the data on the right, I would pause before reporting just the mean and the standard deviation.

Figure 6: Two simulated distributions, each with mean 2.0: symmetric (median 2.0) on the left, right-skewed (median 1.4) on the right.
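The source of the fake data isn’t stated, but an exponential distribution with mean 2 happens to reproduce the reported median of 1.4 (its median is 2 times log(2), about 1.39), so a sketch along these lines would recreate the comparison:

    set.seed(1)
    n <- 10000

    sym  <- rnorm(n, mean = 2, sd = 0.5)  # symmetric, normally distributed
    skew <- rexp(n, rate = 1/2)           # right-skewed; mean 2, median ~1.39

    c(mean(sym), median(sym))    # both near 2.0
    c(mean(skew), median(skew))  # mean near 2.0, median near 1.4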

B. Graphics providing a powerful illustration of results.

The graphics you use to generate helpful summaries of your data are generally not the same ones you’ll want to use in presentations or publications. Figures are an exciting opportunity to convey your results in a visual fashion, and they may be far more convincing than text or even the numbers themselves. Unfortunately, programs like Microsoft Excel or PowerPoint don’t always provide good guidance on what makes an effective figure for publication. Or, in the event that they can be coerced into making a nice figure, it’s not trivial to figure out how. Biostatisticians hardly ever use Excel to generate figures. Common statistical programs such as R, SAS, Stata, and SPSS all have reasonable graphics packages that can easily produce more appropriate graphical summaries.

The example below illustrates what’s possible with different options and increasing levels of sophistication. As in the previous section, this example uses data on time incarcerated. The purpose of the figure is to illustrate the racial/ethnic differences in time spent incarcerated.

The first example was created using Excel with the help of a graduate student who was highly proficient in Excel from her former life as a management consultant. At first glance, are you overwhelmed by the racial/ethnic differences in incarceration? Can you tell what these differences are? Does this type of figure look familiar?

Although such figures are commonplace, there are a number of ways in which this figure doesn’t work. Criticisms include: (1) the x-axis divides a continuous variable, the number of months in corrections, into categories; (2) the horizontal gridlines are distracting; (3) perhaps most importantly, to understand the relationship between race/ethnicity and months in corrections, you need to digest racial/ethnic comparisons in six different categories -- the first and the last being the most relevant.

This second presentation of the exact same data was generated using Stata, with no alterations to the standard boxplot command:

Side-by-side boxplots are a powerful way of conveying differences in the distributions of continuous variables, but are sadly underused. The boxes span the middle 50% of the data (from the 25th to the 75th percentile), the line within each box marks the median, and the whiskers reach to the upper and lower limits of the data. In the case of non-Hispanic whites, some of the large observations are considered ‘outliers’ (more than 1.5 times the length of the box beyond the 75th percentile), so they’re shown as dots rather than included as parts of the whiskers.

It’s clear from looking at this boxplot that non-Hispanic whites are generally spending less time incarcerated than minorities.
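For those working in R rather than Stata, the base boxplot() function produces a comparable figure. Here is a minimal sketch with hypothetical data (the variable and group names are illustrative, not taken from the study):

    set.seed(7)
    # Hypothetical data: months in corrections by racial/ethnic group
    mock <- data.frame(
      race   = rep(c("African American", "Hispanic", "Non-Hispanic white"),
                   each = 100),
      months = c(rexp(100, 1/6), rexp(100, 1/5), rexp(100, 1/3))
    )

    # One box per group: boxes span the 25th to 75th percentiles,
    # the center line marks the median, and points beyond 1.5 times
    # the box length are drawn individually as potential outliers
    boxplot(months ~ race, data = mock,
            ylab = "Months in corrections in past year")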

Finally, our last version, which is close to what was submitted for publication, shows a slightly different version of the data:

Note that it’s not a conventional plot, but it is effective at demonstrating racial/ethnic and sex differences in two variables: (1) who had spent any time incarcerated, and (2) the length of those incarcerations. This figure was generated using R, which is open source and freely available statistical software with excellent graphics capabilities. For the technically inclined, it is certainly accessible. For others, know that this is the sort of figure your friendly neighborhood biostatistician can create.

C. Good and bad examples of graphics.

Edward Tufte has written extensively (in very accessible language) and elegantly about what makes good statistical graphics. Visit http://www.edwardtufte.com/tufte/ for more information. Much of what follows in this section is influenced by his work.

Here are some ideas to keep in mind when you’re generating graphics:

1. Graphics should have maximum information with minimum ink. The ubiquitous tower and antenna plots (bars with error bars) are horrible offenders in this category. A single tower and antenna uses a lot of ink to illustrate just two numbers. Why not a dot and a line instead (see the sketch following this list)? All the extra ink is distracting. Only use color if color is necessary.

2. Graphics should have no more dimensions than exist in the data. The 3-d bar charts in Excel may look fancy, but they’re horrible when it comes to actually reading information from the plot.

3. Labels should be informative, not distracting, and axes should have a sensible range.

4. Avoid pie charts (especially the 3-d kind). Humans are horrible at comparing areas, and even worse at comparing volumes. I recommend bars instead (see the final figure in the previous section); we’re better at comparing lengths.
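As promised above, here is a sketch of the dot-and-line alternative to the tower-and-antenna plot, using made-up group means and standard errors:

    # Hypothetical summary statistics for three groups
    means  <- c(3.1, 4.5, 2.8)
    ses    <- c(0.4, 0.5, 0.3)
    groups <- c("Group A", "Group B", "Group C")

    # A dot for each mean and a line for +/- 1 SE conveys the same two
    # numbers per group as a bar with an error bar, with far less ink
    plot(1:3, means, pch = 19, xaxt = "n", xlab = "",
         ylab = "Mean (+/- 1 SE)",
         ylim = range(c(means - ses, means + ses)))
    axis(1, at = 1:3, labels = groups)
    segments(1:3, means - ses, 1:3, means + ses)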

If you’re looking for examples of good and creative graphics, check out the New York Times. Below is a picture of an interactive plot illustrating how people spend their time. Although few of us have the technical expertise to generate such a figure, it’s highly illustrative even though it’s not what we’re used to seeing or have likely seen in a statistics textbook. It’s also worth noting that the New York Times graphics department relies heavily on R for the first versions of many of the graphics they create.

In contrast, the example on the left below comes from USA Today. It’s heavily laden with what Tufte refers to as “chartjunk” -- the gratuitous and cartoonish decoration of statistical graphics. The USA Today example is particularly bad because it combines chartjunk with a pie chart drawn in perspective.

II. Not all observations are created independent.

The majority of methods taught in introductory and intermediate biostatistics courses assume that observations are independent. However, in medical research especially, we encounter data in which our observations are not independent. Examples of non-independent data include: