
Correlations

Correlations deal with the relationship between two or more variables. With this measure, we can answer questions such as: Are achievement test scores related to Grade Point Averages? Is counselor empathy related to counseling outcomes? Is student toenail length related to success in graduate school? Correlational studies, however, are limited in that they do not allow us to draw conclusions about causes and effects. No matter how high a correlation we find between two variables, we can only say the two variables are related; we can never say one is a cause of the other.

What do we mean by “cause”? When we claim that a change in one variable causes a change in another variable, we are saying that we have evidence that changing one variable will produce a predictable change in a second variable. We can only make this claim if we have, in fact, changed (manipulated) one variable and observed the effects of this change on a second variable (while holding all other variables constant). Correlational studies do not meet this standard for making causal conclusions. In a correlational study we simply measure the level of two (or more) variables for the same individuals and then statistically determine whether knowledge of a person’s score on one variable (A) allows us to predict their score on another variable (B) better than if we did not have that knowledge. If we find a significant correlation between two variables, there are always three possible causal explanations for why that relationship exists. These explanations can be summarized according to two problems.

1) The Directional Problem - If we find a correlation between two variables (e.g., birth rate in humans [Variable A] and number of storks nesting in a village [Variable B]), there are two possible explanations for this relationship:

That A causes B (i.e., high birth rates in humans cause higher nesting rates of storks; perhaps the storks are attracted to babies).

An equally good explanation of this correlation is that B causes A (i.e., higher numbers of nesting storks cause higher birth rates - perhaps my mother is correct and the storks really do bring babies).

2) The Third Variable Problem - Some outside variable C (or set of variables) may affect both A and B in a manner such that they co-vary in a predictable way, but A does not cause B and B does not cause A. In our example, a possible explanation of the relationship between human birth rates and the number of nesting storks is the population of the community. Communities with higher populations have more people and thus are likely to have higher birth rates. Storks nest in chimneys; communities with higher populations have more homes, which have chimneys, thus providing more opportunities for storks to nest. The two variables co-vary because they are both affected by a common underlying determinant, but A does not cause B, nor does B cause A.
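The third variable problem can be made concrete with a small simulation. The numbers below are entirely made up for illustration (not real census or wildlife data): community population (C) drives both birth counts (A) and stork nesting counts (B), while A and B have no direct causal link to each other. A sketch in Python:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data for 200 communities: population (C) influences
# both births (A) and nesting storks (B); A and B never touch each other.
population = rng.uniform(1_000, 50_000, size=200)
births = 0.012 * population + rng.normal(0, 20, size=200)
storks = 0.0005 * population + rng.normal(0, 3, size=200)

# Despite no direct causal link, births and storks correlate strongly,
# because both are driven by the common cause (population).
r, p = stats.pearsonr(births, storks)
print(f"r = {r:.2f}")  # a strongly positive correlation
```

The correlation emerges purely because both variables share an underlying determinant, which is exactly the situation the paragraph above describes.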

What correlations do allow us to do is conclude that two variables are related. This can be very interesting and valuable information, but it does not justify a conclusion that states which variable is causing the other.

Example: You may well see this type of question on your exam.

Dr. Tippy Bakafew conducted a study looking at the relationship between religion and alcohol consumption. Dr. Bakafew begins by obtaining a list of all villages, towns and cities (we will refer to these as communities) in Wisconsin. From this list, he randomly selects a sample of 20% of the communities. For each of these communities he obtains a count of the number of Bars and the number of Churches listed in the Yellow Pages. He finds that there is a correlation of +0.88 between these two measures. Based on this, Dr. Bakafew concludes that religion causes people to drink. Consider Dr. Bakafew’s conclusion – and present two other possible INTERPRETATIONS OF THIS CORRELATION.

In a correlational study, the researcher measures variables as they naturally occur; nothing is done to manipulate them. Some of the studies that members of this class are conducting are purely, or at least in part, correlational. This does not make them bad studies. You just must be careful not to draw conclusions that are not supported by your results.

The Correlation Statistic ( r ) provides two separate pieces of information. (1) The sign, negative or positive, tells us the direction of the relationship. If a correlation is positive, it indicates that higher levels of one variable predict higher levels of the second variable (and conversely that lower levels of one variable predict lower levels of the other). A negative correlation indicates that higher levels of one variable predict lower levels of the second variable (and vice versa). When interpreting a correlation it is important to keep in mind that a negative correlation is not a negative result. It is just as informative as a positive correlation. For example, in my Introduction to Experimental Psychology course I have found a high positive correlation between class attendance and final grades. The more classes a student attends, the higher their final grade tends to be. Another equally valid way to discuss this correlation is to say that the higher a student’s final grade, the more classes they tended to attend. If instead of measuring the number of classes students attended, I had measured the number of classes they missed, I would find a negative correlation between classes missed and final grades. Both correlations are equally informative.

(2) The second piece of information a correlation statistic gives us is the strength of the relationship -- how accurately we can predict one variable based on knowledge of the other (i.e., how good a predictor one variable is of the other). The value of a correlation always falls between -1 and +1. The closer the absolute value of the correlation is to 1, the stronger the relationship; similarly, the closer the absolute value is to zero, the weaker the relationship.
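Both pieces of information, sign and strength, can be read straight off the statistic. As a sketch using made-up attendance and grade numbers (not real class data), note that measuring classes missed instead of classes attended flips the sign of r but leaves its magnitude, and therefore its informativeness, unchanged:

```python
from scipy import stats

# Hypothetical data: ten students, out of 45 classes.
attended = [40, 38, 45, 30, 25, 42, 35, 28, 44, 33]
grade    = [75, 70, 88, 60, 55, 82, 68, 58, 85, 64]
missed   = [45 - a for a in attended]

r_attend, _ = stats.pearsonr(attended, grade)  # positive: attend more, score higher
r_missed, _ = stats.pearsonr(missed, grade)    # same strength, opposite sign

print(f"attended vs grade: r = {r_attend:+.2f}")
print(f"missed   vs grade: r = {r_missed:+.2f}")
```

Because "missed" is just a linear re-expression of "attended", the two correlations have exactly the same absolute value.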

One way to represent the correlation between two variables is to graph the relationship between them. On the Y axis (vertical axis), we put the scale for one variable, and on the X axis (horizontal axis) we put the second variable. For now, it really does not matter which variable you put on which axis. Each subject’s scores on the two variables are represented as a single data point on this graph, which is called a scatter diagram. On the diagram below, a participant’s scores on the two variables are depicted by one data point. Subject 1 attended 40 out of 45 classes and obtained a final grade of 75%.

Data Points

When we plot the data from a sample of scores, the formation of the points can tell us something about the relationship between the two variables. Two measures that are related to each other will produce scatterplots that approximate a line. The less the scores deviate from the line, the stronger the correlation. If the line slopes upward, the direction of the correlation is positive (higher scores on one axis predict higher scores on the other axis). If the slope is downward, the direction of the correlation is negative (higher scores on one variable predict lower scores on the second variable).

The correlation statistic is a mathematical technique for finding the slope of the straight line that best summarizes the relationship between the two variables. This line is called the “line of best fit”. It fits the data in a manner that reduces the overall distance between itself and all the data points better than any other straight line could. In other words, if you drew a series of lines on your scatterplot and then measured the total amount the data points deviated from each line, you would find that there is one line that produces the lowest total deviation: one line that summarizes the relationship between the two variables best, or that best fits the data.
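The "lowest total deviation" idea can be demonstrated directly. The sketch below (hypothetical attendance/grade numbers again) uses least-squares fitting, which minimizes the total squared vertical distance between the line and the points; any other straight line produces a larger total:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 45, 50)                 # classes attended (hypothetical)
y = 1.2 * x + 30 + rng.normal(0, 8, 50)    # final grade, with noise

# np.polyfit with degree 1 returns the slope and intercept
# of the least-squares line of best fit.
slope, intercept = np.polyfit(x, y, 1)

def total_squared_deviation(m, b):
    """Sum of squared vertical distances from the points to the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))

best = total_squared_deviation(slope, intercept)
worse = total_squared_deviation(slope + 0.3, intercept)  # any other line does worse

print(best < worse)  # True: no other straight line fits better
```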

Line of Best Fit

The line of best fit for a perfect correlation will have two characteristics.

1) All the data points will line up perfectly on the line of best fit.

2) When standardized scores are used, the slope of the line will be 45° (negative or positive).

For example, this scatterplot shows a perfect positive correlation.
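The 45° property of a perfect correlation can be checked numerically. In this sketch (made-up values), y is an exact linear function of x, so r = +1; once both variables are converted to standardized (z) scores, the line of best fit has slope 1, i.e. a 45° line through the origin:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 3  # a perfect positive correlation: every point sits on one line

def z(v):
    """Convert raw scores to standardized (z) scores."""
    return (v - v.mean()) / v.std()

# Fit a straight line to the standardized scores.
slope, intercept = np.polyfit(z(x), z(y), 1)
print(round(slope, 6), round(intercept, 6))  # 1.0 0.0 -> a 45-degree line
```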

When two variables are not related, the points may appear random or circular, or may form a straight vertical or horizontal line. For example, the scatterplot below shows a correlation of +.06 (very low).

Zero correlations may also appear as straight vertical or horizontal lines. On the graph below, we see that no matter how many classes a person attends, the best prediction we can make is 65, which is also the class mean. Knowing how many classes a person attends does not give me any information about their final grade. My best prediction in all cases would be the class mean. The line of best fit for a random or circular scatterplot is likewise a flat line at the mean.

The correlation statistic is based on the slope of this line.

While there are some great uses for scatterplots, which we will discuss, they are not great ways of presenting your results. Instead, we report the correlation statistic.

Significance of a correlation

The magnitude of correlation you can obtain just by chance is strongly affected by the size of the sample. The more subjects you have in your sample, the less likely it is that you will obtain a high correlation by chance alone. If you think about it, if we draw just a few pairs of numbers from two buckets, it is more likely they will form a pattern just due to chance than if we drew many pairs of numbers. The same magnitude of correlation may be significant in a large-N study but non-significant in a small-N study. For example, in a study with 10 participants, a correlation of .67 is needed in order for the result to be significant at the p = .05 level. In a study of 100 participants, the correlation need only be .20 to be significant at the same level. This leads to an interesting, but fairly common, phenomenon in correlational studies. Often, researchers are excited about their data and will conduct preliminary analyses after having measured only a few participants. They find fairly high correlations. As they run more subjects, they find the correlations tend to drop. While correlations are not meant to be used on small sample sizes, they do tend to give fairly stable results when sample sizes are larger. This is why I have suggested that those of you who are running correlational studies aim for sample sizes of at least 60.
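You can see the "small samples produce big chance correlations" effect in a quick simulation. Here two variables are generated completely at random, so their true correlation is zero; with n = 10 the average chance correlation is still sizeable, while with n = 100 it shrinks toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_abs_r(n, reps=2000):
    """Average |r| between two UNRELATED random variables, sample size n."""
    rs = []
    for _ in range(reps):
        a = rng.normal(size=n)
        b = rng.normal(size=n)
        rs.append(abs(np.corrcoef(a, b)[0, 1]))
    return float(np.mean(rs))

print(f"n = 10:  average chance |r| = {mean_abs_r(10):.2f}")   # sizeable
print(f"n = 100: average chance |r| = {mean_abs_r(100):.2f}")  # near zero
```

This is exactly why a preliminary analysis on a handful of participants often shows an exciting correlation that evaporates as more subjects are run.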

The Coefficient of Determination is a measure that can be easily obtained from the correlation statistic that SPSS produces for you. It is simply r squared. It is often more useful to talk about the relationship between your variables in terms of the Coefficient of Determination. For example, if the correlation between intelligence test scores and GPA were r = .60, squaring it would give you .36. The squared correlation coefficient (in our example) indicates that 36% of the variability in Y is “accounted for” by the variability in X. Another way of saying this is that if the correlation between two variables is .60, then they have 36% of their variability “in common”. You may want to use this statistic when discussing the results of your study.
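The arithmetic of the IQ/GPA example above is a one-liner:

```python
r = 0.60            # the correlation from the IQ/GPA example in the text
r_squared = r ** 2  # Coefficient of Determination

# 36% of the variability in one variable is "accounted for" by the other.
print(round(r_squared, 2))  # 0.36
```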

Linear and Nonlinear Relationships

One of the assumptions the correlation statistic depends on is that the relationship between the two variables is linear. This means that a straight line best describes the relationship between the variables. Not all relationships are linear. For example, arousal level and performance are not linearly related. Let’s say you have a very low arousal level (you can barely stay awake) and you write an exam. How well do you expect to do? On the other hand, let’s say you are really highly aroused (15 cups of coffee); again, how well do you expect to do? The relationship between arousal and performance is well known: at very high and very low levels of arousal performance is poor, but at middle levels performance is good. The line that best describes this relationship is curved.

Looking at the scatterplot of your data should tell you if the relationship is curvilinear or has some other strange shape. If it is, do not despair. There are other ways of analyzing the data, or alternatively there are ways of transforming the data to make the relationship linear. If you find you have a non-linear relationship, let me know and we will discuss the best way to handle it. If you have a non-linear relationship between your variables, it is unlikely you will find a significant correlation. Looking at the distribution below, think about where the straight line of best fit would be. That line is not a good fit for the data. In fact, looking at the scatterplot we can see that one variable is highly predictable if we know the value of the other; the relationship is just not linear, and therefore the correlation statistic is not appropriate for this analysis.
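A sketch of the arousal/performance idea makes the point sharply. The numbers below are invented: performance is a perfect inverted-U function of arousal, so one variable is completely predictable from the other, yet the (linear) correlation comes out at essentially zero:

```python
import numpy as np
from scipy import stats

# Hypothetical inverted-U: performance peaks at moderate arousal.
arousal = np.linspace(-3, 3, 61)             # low to high arousal (centered)
performance = 100 - 10 * arousal ** 2        # poor at both extremes, good in the middle

r, _ = stats.pearsonr(arousal, performance)
print(f"r = {r:.2f}")  # essentially zero, despite a perfectly predictable relationship
```

This is why you should always inspect the scatterplot: a near-zero r can mean "no relationship" or "a strong relationship that is simply not linear".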

Correlations with More than Two Variables

When you measure two or more continuous variables you are able to look at the relationship(s) among them. In your study you may include more than two variables, and this may allow you to ask more interesting questions.

Statistically controlling for a variable. GPA is related to aptitude. It is also related to other factors, such as IQ. If you conducted a study that looked at the relationship between GPA and aptitude test scores, the relationship between these two variables would be complicated by other predictive factors (such as IQ). We can use some special techniques to statistically control for other variables we have measured. If, for example, I have SPSS calculate the correlation between GPA and aptitude with IQ statistically controlled for, what I am doing is removing the influence of IQ. The resulting correlation estimates the relationship you would have found between GPA and aptitude if you had controlled for IQ by selecting only subjects of average IQ. The statistical technique that allows you to look at the relationship between two variables with one or more other variables statistically controlled for is called a partial correlation.
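One way to see what a partial correlation does (this is a hand-rolled sketch, not SPSS output, and all the data are simulated) is the residual method: remove the straight-line influence of the control variable from each of the other two variables, then correlate what is left over. Here IQ drives both aptitude and GPA, so the raw correlation is substantial but the partial correlation falls to near zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Simulated data: IQ (the control variable) drives both aptitude and GPA.
iq = rng.normal(100, 15, n)
aptitude = 0.5 * iq + rng.normal(0, 5, n)
gpa = 0.02 * iq + rng.normal(0, 0.3, n)

def residuals(y, x):
    """What is left of y after removing the straight-line influence of x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_raw = np.corrcoef(aptitude, gpa)[0, 1]
r_partial = np.corrcoef(residuals(aptitude, iq), residuals(gpa, iq))[0, 1]

print(f"raw correlation:     {r_raw:.2f}")      # substantial
print(f"partial correlation: {r_partial:.2f}")  # near zero with IQ controlled for
```

In this simulated case the entire aptitude-GPA relationship was carried by IQ, so controlling for IQ removes it; in real data the partial correlation would tell you how much relationship remains.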