Elementary Statistics Chapter 9 Dr. GhamsaryPage 1
Elementary Statistics
M. Ghamsary, Ph.D.
Chapter 9
Correlation and Regression
Regression and Correlation
- Correlation is a measure of association (usually linear) between two variables. To see whether two variables, X and Y, are linearly related, we use either of the following:
- Scatter Diagram
- Correlation Coefficient
Scatter Plots
A scatter plot reveals the relationship or association between two variables. The relationship between two variables is called correlation. A scatter plot usually consists of a large body of data. The closer the plotted data points come to forming a straight line, the higher the correlation between the two variables, or the stronger the relationship. If the data points form a line rising from the origin toward high x- and y-values, the variables are said to have a positive correlation. If the line falls from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.
- Vertical axis: variable Y, usually the response, outcome, or dependent variable
- Horizontal axis: variable X, the predictor, explanatory, or independent variable
Questions: Scatter plots can provide answers to the following questions:
- Are variables X and Y related?
- Are variables X and Y linearly related?
- Are variables X and Y non-linearly related?
- Does the variation in Y change depending on X?
- Are there outliers?
- No relationship: If there is absolutely no linear correlation present, the correlation coefficient is 0.
- Perfect linear correlation: A perfect positive correlation is given the value 1. A perfect negative correlation is given the value -1.
- Strong linear correlation: The closer the coefficient is to 1 or -1, the stronger the correlation, or the stronger the relationship between the variables.
- Weak linear correlation: The closer the coefficient is to 0, the weaker the correlation.
Notice that for a perfect correlation, the points fall exactly on a line. They do not deviate from that line. For moderate values of r (the correlation coefficient, defined below), the points have some scatter, but there still tends to be an association between the x and the y variables. When there is no association between the variables, the scattering is so great that there is no discernible pattern. Correlations can be said to vary in magnitude and direction. Magnitude refers to the strength of association: values of r farther from 0 represent a stronger relationship between the two variables. Direction refers to whether the relationship is positive or negative, and hence whether the value of r is positive or negative.
Let's take an example. Did you ever wonder whether the person who took the longest on the test did very well or very poorly? It might be that the students who take the longest on the exam are the most careful, and they score the highest. This would be an example of a positive correlation, because high values of one variable (e.g., time spent on the test) are associated with high values of the other variable (e.g., better performance on the test). Or it might be the other way around: longer time on the test is associated with poorer scores. The latter is an example of a negative correlation, because high values of one variable are associated with low values of the other. In that case, a person who scores highly usually finishes quickly.
Example 1: To examine whether there is a positive or negative association between grades on an exam and time spent on the exam, one has to look to see if individuals who did well on the exam also spent longer on it. The following data give the score and the length of time for 20 students. Are the two variables related? Use a scatter plot and then compute the correlation coefficient.
Score / Time
88 / 60
96 / 53
72 / 22
78 / 44
65 / 34
80 / 47
77 / 38
83 / 50
79 / 51
68 / 35
84 / 46
76 / 36
92 / 48
80 / 43
57 / 40
78 / 32
74 / 27
73 / 41
88 / 39
90 / 43
Each point represents one student with a certain score for time on the exam, x, and grade, y. The scatter plot reveals that, in general, longer times on the exam tend to be associated with higher grades. Notice that there is a kind of stream of points moving from the bottom left hand corner of the graph to the upper right hand corner. That indicates a positive association or correlation between the two variables.
Example 2A: Many studies have shown that height is a good predictor variable for weight among people of the same age and gender. The following are the weights and heights of 15 males between 20 and 25 years of age.
Weight (kg) / 85 95 75 65 80 70 68 57 66 89 60 65 79 85 90
Height (cm) / 180 195 172 160 175 165 170 168 158 182 172 173 175 168 195
Each dot represents a subject with their HEIGHT and corresponding WEIGHT. This plot shows that as height increases, weight also tends to increase. So there is a positive association between the variables height and weight. The variable height, X, is called the predictor, explanatory, or independent variable, and the variable weight, Y, is the response, outcome, or dependent variable.
Example 2B: Many studies have shown that the number of hours you go out and have fun over the weekend is a good predictor variable for your grade on the Monday quiz in math classes. The following are the data for 12 students: X (number of hours) and Y (score on the quiz).
X / Y
7 / 6
4 / 4
4 / 5
20 / 2
10 / 3
12 / 1
0 / 10
11 / 4
10 / 5
9 / 2
2 / 8
1 / 9
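The notes do not work out Example 2B numerically, so here is a minimal sketch of how the direction of the association could be checked with the correlation coefficient r defined in the next section. The function name `pearson_r` and the variable names are illustrative choices, not part of the original notes.

```python
import math

# Example 2B data: hours out over the weekend (x) and quiz score (y)
hours = [7, 4, 4, 20, 10, 12, 0, 11, 10, 9, 2, 1]
score = [6, 4, 5, 2, 3, 1, 10, 4, 5, 2, 8, 9]

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_2b = pearson_r(hours, score)
print(round(r_2b, 2))  # negative: more hours out goes with lower quiz scores
```

A negative value here matches what the scatter plot of these data would suggest: the points drift from the upper left to the lower right.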
Correlation Coefficient
Association is one of the fundamental tools of scientists. Francis Bacon, for instance, discovered that heat is a form of motion by compiling lists of items that were hot and cold. Ivan Pavlov, who was originally studying the digestive system, discovered an important rule of learning, classical conditioning, by observing that dogs salivated when he rang their dinner bell. In both instances, an association was noted between two variables. As one variable increases, so does the other.
The statistical index of the degree to which two variables are associated is the correlation coefficient. Developed by Karl Pearson, it is sometimes called the "Pearson correlation coefficient". The correlation coefficient summarizes the relationship between two variables.
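We denote the correlation coefficient of the population by $\rho$ and that of the sample by $r$, which is defined by:
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$,
or
$r = \dfrac{\sum z_x z_y}{n-1}$, where $z_x = \dfrac{x - \bar{x}}{s_x}$, $z_y = \dfrac{y - \bar{y}}{s_y}$.
The inequality $-1 \le r \le 1$ indicates that r cannot fall outside the range from -1 to 1.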
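As a rough illustration, the sketch below computes r two ways: once from the raw sums (n, Σx, Σy, Σxy, Σx², Σy²) and once as the sum of products of z-scores divided by n - 1. The small data set and the function names are made up for the illustration; the point is only that the two routes give the same number.

```python
import math

def r_from_sums(x, y):
    """Computational formula: uses n and the five column sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

def r_from_z(x, y):
    """z-score formula: sum of z_x * z_y divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    z_products = ((a - mean_x) / s_x * (b - mean_y) / s_y for a, b in zip(x, y))
    return sum(z_products) / (n - 1)

# Hypothetical toy data, not taken from the notes
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

r1 = r_from_sums(x_demo, y_demo)
r2 = r_from_z(x_demo, y_demo)
print(round(r1, 4), round(r2, 4))  # the two formulas agree
```

The z-score version only works with the sample standard deviation (divisor n - 1); using the population divisor n would require dividing the sum of products by n instead.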
About r
As always, we have a letter that stands for our statistic. In the case of correlation, it is r. The Pearson r can be positive or negative, ranging from -1.0 to 1.0. A correlation of 1.0 indicates a perfect positive association between the two variables. If the correlation is 1.0, the longer the amount of time spent on the exam, the higher the grade will be, without any exceptions. An r value of -1.0 indicates a perfect negative correlation: without exception, the longer one spends on the exam, the poorer the grade. If r=0, there is no linear relationship between the two variables: on average, longer time spent on the exam does not result in any higher or lower grade. Most often r is somewhere in between -1.0 and +1.0.
- A positive sign indicates that x and y move in the same direction. That is, they both increase or they both decrease.
- A negative sign indicates that x and y move in opposite directions. That is, when x increases, y decreases, and vice versa.
- A value close to 0 implies little or no linear relationship between x and y, while a value close to 1 or -1 indicates a strong linear relationship between x and y.
About r²
One can think of a correlation as measuring the degree of overlap, or how much two variables tend to vary together. Go back to the scatter plot for Example 1, and put your hand over the y-axis (the vertical one). How much the points vary from left to right is how much variation there is in the time variable. Now, put your hand over the x-axis. Look at how much the points vary from top to bottom. That amount of scatter represents the variation in grades. Now, looking at the bivariate plot as a whole, you can see how the points tend to scatter or vary together. Their "shared variance" is the amount that the variations of the two variables tend to overlap.
The percentage of shared variance is represented by the square of the correlation coefficient, r². Another way to visualize this is with a Venn diagram that represents the amount of shared variance, or overlap of variation, of two variables. Because r² is interpreted as the percentage of shared variance, it is best to compare two r² values rather than two r values. For instance, a correlation of .8 seems to be twice as large as a correlation of .4, but the larger coefficient actually indicates 4 times as much shared variance: 0.64 vs. 0.16. Occasionally, shared variance is called the variance accounted for in one variable by another variable. An r² of .64 suggests that x accounts for 64% of the variance in y.
The Regression Equation
The regression equation is simply a mathematical equation for a line. It is the equation that describes the regression line. In algebra, we represent the equation for a line with something like this:
$y = mx + b$, or $y = b_0 + b_1 x$
Here b (or $b_0$) is the intercept, the point at which the line crosses the y-axis (sometimes called the y-intercept), and m (or $b_1$) is the slope of the line. One can think of the y-intercept as the value of y when x is equal to 0. With a grid, we could find the slope of the line by counting how many units we have to go up to meet the line again after we have gone one unit to the right (remember "rise over run"). So the slope is the increase in y for every one-unit increase in x.
With regression analysis, we need to find the equation of the best-fitting line. What are the slope and intercept of the regression line? If the slope is zero, there is no linear relationship between x and y. If the slope is greater than 0 (or less than 0, if the relationship is negative), there is a linear relationship.
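To figure out the equation for the regression line, we first need to figure out the slope and intercept. Here are the formulas for them:
$b_1 = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = r\,\dfrac{s_y}{s_x}$, $b_0 = \bar{y} - b_1\bar{x}$
Pretty simple; we've done similar formulas before.
Here $b_1$ is the slope, $b_0$ is the intercept, and $\bar{x}$ and $\bar{y}$ are the means of x and y respectively. In regression analysis, we are attempting to predict y based on x scores, so we represent the regression equation with a symbol to indicate a predicted score:
$\hat{y} = b_0 + b_1 x$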
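The slope-and-intercept recipe described here can be sketched in a few lines. The function names `fit_line` and `predict`, and the toy data, are hypothetical; the sketch assumes the usual least-squares formulas (slope from the column sums, intercept as mean of y minus slope times mean of x).

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n  # mean(y) - slope * mean(x)
    return intercept, slope

def predict(intercept, slope, x_new):
    """Predicted y for a new x value, read off the fitted line."""
    return intercept + slope * x_new

# Hypothetical toy data, not taken from the notes
x_demo = [1, 2, 3, 4, 5]
y_demo = [2, 4, 5, 4, 5]

b0_demo, b1_demo = fit_line(x_demo, y_demo)
print(round(b0_demo, 2), round(b1_demo, 2))        # intercept 2.2, slope 0.6
print(round(predict(b0_demo, b1_demo, 6), 2))      # 5.8
```

Because the intercept is defined through the means, the fitted line always passes through the point (mean of x, mean of y).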
Example 3: Compute the correlation coefficient in Example 1:
Score (y) / Time (x) / xy / y² / x²
88 / 60 / 5280 / 7744 / 3600
96 / 53 / 5088 / 9216 / 2809
72 / 22 / 1584 / 5184 / 484
78 / 44 / 3432 / 6084 / 1936
65 / 34 / 2210 / 4225 / 1156
80 / 47 / 3760 / 6400 / 2209
77 / 38 / 2926 / 5929 / 1444
83 / 50 / 4150 / 6889 / 2500
79 / 51 / 4029 / 6241 / 2601
68 / 35 / 2380 / 4624 / 1225
84 / 46 / 3864 / 7056 / 2116
76 / 36 / 2736 / 5776 / 1296
92 / 48 / 4416 / 8464 / 2304
80 / 43 / 3440 / 6400 / 1849
57 / 40 / 2280 / 3249 / 1600
78 / 32 / 2496 / 6084 / 1024
74 / 27 / 1998 / 5476 / 729
73 / 41 / 2993 / 5329 / 1681
88 / 39 / 3432 / 7744 / 1521
90 / 43 / 3870 / 8100 / 1849
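Σy=1578 / Σx=829 / Σxy=66364 / Σy²=126214 / Σx²=35933
$r = \dfrac{20(66364) - (829)(1578)}{\sqrt{[20(35933) - (829)^2][20(126214) - (1578)^2]}} = \dfrac{19118}{\sqrt{(31419)(34196)}} \approx 0.583$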
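The arithmetic for r from these column totals is easy to get wrong by hand, so a short check may help. This sketch simply plugs the five totals from the table (with n = 20) into the computational formula; the variable names are illustrative.

```python
import math

# Column totals from the Example 3 table (n = 20 students)
n = 20
sum_y, sum_x = 1578, 829
sum_xy, sum_y2, sum_x2 = 66364, 126214, 35933

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_exam = numerator / denominator

print(numerator)          # 19118
print(round(r_exam, 3))   # 0.583
```

The value 0.583 is a moderate positive correlation: longer times tend to go with higher scores, but with a good deal of scatter.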
Example 4: Repeat Example 3 by using $r = \dfrac{\sum z_x z_y}{n-1}$.
Variable N Mean StDev
y 20 78.90 9.49
x 20 41.45 9.09
y / x / z_y / z_x / z_x z_y
88 / 60 / 0.958904 / 2.040704 / 1.95684
96 / 53 / 1.801897 / 1.270627 / 2.289539
72 / 22 / -0.72708 / -2.13971 / 1.555746
78 / 44 / -0.09484 / 0.280528 / -0.0266
65 / 34 / -1.4647 / -0.81958 / 1.200441
80 / 47 / 0.115911 / 0.610561 / 0.070771
77 / 38 / -0.20021 / -0.37954 / 0.075988
83 / 50 / 0.432034 / 0.940594 / 0.406368
79 / 51 / 0.010537 / 1.050605 / 0.011071
68 / 35 / -1.14858 / -0.70957 / 0.814997
84 / 46 / 0.537408 / 0.50055 / 0.269
76 / 36 / -0.30558 / -0.59956 / 0.183216
92 / 48 / 1.3804 / 0.720572 / 0.994678
80 / 43 / 0.115911 / 0.170517 / 0.019765
57 / 40 / -2.30769 / -0.15952 / 0.368114
78 / 32 / -0.09484 / -1.0396 / 0.098593
74 / 27 / -0.51633 / -1.58966 / 0.820793
73 / 41 / -0.62171 / -0.0495 / 0.030778
88 / 39 / 0.958904 / -0.26953 / -0.25845
90 / 43 / 1.169652 / 0.170517 / 0.199446
Σz_y = -1.2E-14 / Σz_x = -6.1E-15 / Σz_x z_y = 11.08109
$r = \dfrac{\sum z_x z_y}{n-1} = \dfrac{11.08109}{19} \approx 0.583$.
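Example 5: Use the data of Example 3 to find the equation of regression.
Solution: From the table in Example 3 we have:
Σy=1578 / Σx=829 / Σxy=66364 / Σy²=126214 / Σx²=35933
Variable N Mean StDev
y 20 78.90 9.49
x 20 41.45 9.09
$b_1 = \dfrac{20(66364) - (829)(1578)}{20(35933) - (829)^2} = \dfrac{19118}{31419} \approx 0.61$
$b_0 = \bar{y} - b_1\bar{x} = 78.90 - 0.6085(41.45) \approx 53.7$
So the equation of regression is given by $\hat{y} = 53.7 + 0.61x$.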
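As a check on the slope 0.61 and intercept 53.7, the sketch below recomputes both from the column totals of the Example 3 table (Σy = 1578, Σx = 829, Σxy = 66364, Σx² = 35933, n = 20). The 45-minute prediction at the end is a hypothetical illustration, not a value from the notes.

```python
# Column totals from the Example 3 table (n = 20 students)
n = 20
sum_y, sum_x = 1578, 829
sum_xy, sum_x2 = 66364, 35933

mean_y = sum_y / n   # 78.90
mean_x = sum_x / n   # 41.45

slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 1))  # 0.61 53.7

# Hypothetical use: predicted score for a student who spends 45 units of time
predicted_45 = intercept + slope * 45
print(round(predicted_45, 1))
```

Carrying the unrounded slope into the intercept calculation avoids the small drift that appears when 0.61 is used instead.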
Example 6: Compute the correlation coefficient in Example 2A.
y / x / xy / x² / y²
85 / 180 / 15300 / 32400 / 7225
95 / 195 / 18525 / 38025 / 9025
75 / 172 / 12900 / 29584 / 5625
65 / 160 / 10400 / 25600 / 4225
80 / 175 / 14000 / 30625 / 6400
70 / 165 / 11550 / 27225 / 4900
68 / 170 / 11560 / 28900 / 4624
57 / 168 / 9576 / 28224 / 3249
66 / 158 / 10428 / 24964 / 4356
89 / 182 / 16198 / 33124 / 7921
60 / 172 / 10320 / 29584 / 3600
65 / 173 / 11245 / 29929 / 4225
79 / 175 / 13825 / 30625 / 6241
85 / 168 / 14280 / 28224 / 7225
90 / 195 / 17550 / 38025 / 8100
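Sum / Σy=1129 / Σx=2608 / Σxy=197657 / Σx²=455058 / Σy²=86941
$r = \dfrac{15(197657) - (2608)(1129)}{\sqrt{[15(455058) - (2608)^2][15(86941) - (1129)^2]}} = \dfrac{20423}{\sqrt{(24206)(29474)}} \approx 0.765$
As we observe, the value of the correlation coefficient is positive but not very high. The scatter plot above confirms this value.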
Example 7: Use the data of Example 6 to calculate the correlation coefficient by using $r = \dfrac{\sum z_x z_y}{n-1}$.
Variable N Mean StDev
y 15 75.27 11.85
x 15 173.87 10.74
y / x / z_y / z_x / z_x z_y
85 / 180 / 0.821 / 0.571 / 0.469
95 / 195 / 1.665 / 1.967 / 3.276
75 / 172 / -0.023 / -0.174 / 0.004
65 / 160 / -0.867 / -1.291 / 1.119
80 / 175 / 0.399 / 0.105 / 0.042
70 / 165 / -0.445 / -0.826 / 0.367
68 / 170 / -0.614 / -0.360 / 0.221
57 / 168 / -1.542 / -0.547 / 0.843
66 / 158 / -0.782 / -1.478 / 1.156
89 / 182 / 1.159 / 0.757 / 0.877
60 / 172 / -1.289 / -0.174 / 0.224
65 / 173 / -0.867 / -0.081 / 0.070
79 / 175 / 0.315 / 0.105 / 0.033
85 / 168 / 0.821 / -0.547 / -0.449
90 / 195 / 1.243 / 1.967 / 2.446
Sum / Σy=1129 / Σx=2608 / Σz_y = -0.004 / Σz_x = -0.005 / Σz_x z_y = 10.698
$r = \dfrac{10.698}{14} \approx 0.764$. Up to rounding in the z-scores, this is the same value of the correlation coefficient as in Example 6 (0.765).
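Example 8: Use the data of Example 6 to find the equation of regression.
Solution: From the table in Example 6 we have:
Σy=1129 / Σx=2608 / Σxy=197657 / Σx²=455058 / Σy²=86941
$b_1 = \dfrac{15(197657) - (2608)(1129)}{15(455058) - (2608)^2} = \dfrac{20423}{24206} \approx 0.844$, $b_0 = \bar{y} - b_1\bar{x} = 75.27 - 0.8437(173.87) \approx -71.4$
So the equation of regression is given by: $\hat{y} = -71.4 + 0.844x$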
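For the weight and height data, the slope can be reached by two routes: directly from the column totals, or as r times s_y / s_x using the rounded summary values quoted in Example 7 (r = 0.765, StDev 11.85 and 10.74, means 75.27 and 173.87). The sketch below, with illustrative variable names, compares the two; small differences are expected because the summary values are rounded.

```python
# Column totals from the Example 6 table (n = 15 subjects)
n = 15
sum_y, sum_x = 1129, 2608
sum_xy, sum_x2 = 197657, 455058

# Route 1: slope and intercept directly from the totals
b1_sums = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0_sums = sum_y / n - b1_sums * sum_x / n

# Route 2: slope as r * s_y / s_x, using the rounded summary values
r, s_y, s_x = 0.765, 11.85, 10.74
mean_y, mean_x = 75.27, 173.87
b1_summary = r * s_y / s_x
b0_summary = mean_y - b1_summary * mean_x

print(round(b1_sums, 3), round(b0_sums, 1))        # slope and intercept, route 1
print(round(b1_summary, 3), round(b0_summary, 1))  # slope and intercept, route 2
```

The negative intercept has no physical meaning here (a height of 0 cm is far outside the data); it only positions the line over the observed range of heights, roughly 158 to 195 cm.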
Example 9: The following are the ages and the corresponding blood pressures of 10 subjects randomly selected from a large city.
- Draw the scatter diagram of this data and comment.
- Compute the correlation coefficient and compare with part a.
- Is the correlation significant? Use the 0.05 level of significance.
- Write the equation of regression and estimate the coefficients.
- Test to see if the slope is 0.
- Estimate the blood pressure of someone who is 40 years of age.
Age / 38 41 42 45 50 52 55 60 62 65
Blood pressure / 120 115 130 120 132 135 140 145 140 149
a. Scatter plot
As we observe from this plot, there is a strong positive association between the variable age and blood pressure.
B. Correlation coefficient.
Age (x) / Blood pressure (y) / xy / x² / y²
38 / 120 / 4560 / 1444 / 14400
41 / 115 / 4715 / 1681 / 13225
42 / 130 / 5460 / 1764 / 16900
45 / 120 / 5400 / 2025 / 14400
50 / 132 / 6600 / 2500 / 17424
52 / 135 / 7020 / 2704 / 18225
55 / 140 / 7700 / 3025 / 19600
60 / 145 / 8700 / 3600 / 21025
62 / 140 / 8680 / 3844 / 19600
65 / 149 / 9685 / 4225 / 22201
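Sum / Σx=510 / Σy=1326 / Σxy=68520 / Σx²=26812 / Σy²=177000
$r = \dfrac{10(68520) - (510)(1326)}{\sqrt{[10(26812) - (510)^2][10(177000) - (1326)^2]}} = \dfrac{8940}{\sqrt{(8020)(11724)}} \approx 0.922$
As we observe, the value of the correlation coefficient is positive, and this time it is high. The scatter plot above confirms this value.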
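C- With df = n - 2 = 8 and the 0.05 level of significance, Table V at the end of these notes gives CV = 0.632. Since r = 0.922 > CV, r is significant. That means r is high enough to conclude that age and blood pressure are linearly related.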
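The comparison of r with the critical value can also be sketched in code. The critical values in Table V are related to Student's t by r_crit = t / sqrt(t² + df), so the sketch below rebuilds the table entry for df = 8 from the two-tailed 0.05 critical t. The value 2.306 is an assumption taken from a standard t table, since these notes do not include one; the variable names are illustrative.

```python
import math

# Column totals from the age / blood-pressure table (n = 10 subjects)
n = 10
sum_x, sum_y = 510, 1326
sum_xy, sum_x2, sum_y2 = 68520, 26812, 177000

r_bp = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

df = n - 2
t_crit = 2.306  # assumed: two-tailed 0.05 critical t for df = 8 (standard t table)

# Critical r implied by the critical t; should match Table V for df = 8
r_crit = t_crit / math.sqrt(t_crit ** 2 + df)

# Equivalent t statistic for testing whether the correlation is zero
t_stat = r_bp * math.sqrt(df) / math.sqrt(1 - r_bp ** 2)

significant = abs(r_bp) > r_crit
print(round(r_bp, 3), round(r_crit, 3), significant)  # 0.922 0.632 True
```

In simple linear regression this t statistic is the same one used to test whether the slope is zero, so the conclusion here carries over to part E.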
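D- Equation of regression:
Σx=510 / Σy=1326 / Σxy=68520 / Σx²=26812 / Σy²=177000
Variable N Mean StDev
y 10 132.6 11.41
x 10 51.0 9.44
$b_1 = \dfrac{10(68520) - (510)(1326)}{10(26812) - (510)^2} = \dfrac{8940}{8020} \approx 1.11$
$b_0 = \bar{y} - b_1\bar{x} = 132.6 - 1.1147(51.0) \approx 75.75$
So the estimated equation of regression is given by: $\hat{y} = 75.75 + 1.11x$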
E- From the regression output, we reject $H_0$: slope = 0, since the p-value is less than 0.05. The slope is significantly different from 0.
F- $\hat{y} = 1.11(40) + 75.75 \approx 120$
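Parts D and F can be reproduced from the raw data in one pass. This is a minimal sketch with illustrative names; it fits the line by the usual least-squares formulas and then evaluates it at age 40, which lies inside the observed age range of 38 to 65.

```python
# Example 9 data: age (x) and blood pressure (y) for 10 subjects
age = [38, 41, 42, 45, 50, 52, 55, 60, 62, 65]
bp = [120, 115, 130, 120, 132, 135, 140, 145, 140, 149]

n = len(age)
sum_x, sum_y = sum(age), sum(bp)
sum_xy = sum(a * b for a, b in zip(age, bp))
sum_x2 = sum(a * a for a in age)

slope_bp = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept_bp = sum_y / n - slope_bp * sum_x / n

# Predicted blood pressure for a 40-year-old
bp_at_40 = intercept_bp + slope_bp * 40

print(round(slope_bp, 2), round(intercept_bp, 2))  # 1.11 75.75
print(round(bp_at_40))                             # 120
```

Predictions are most trustworthy for ages within the range of the data; extending the line to, say, age 20 or 90 would be extrapolation.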
Example 10: The following are the ages and the corresponding LDL values of 12 subjects randomly selected from Loma Linda.
- Draw the scatter diagram of this data and comment.
- Compute the correlation coefficient and compare with part a.
- Is the correlation significant? Use the 0.05 level of significance.
- Write the equation of regression and estimate the coefficients.
- Test to see if the slope is 0.
- Estimate the LDL of someone who is 40 years of age.
Age / 40 / 50 / 21 / 42 / 47 / 63 / 46 / 32 / 34 / 40 / 51 / 20
LDL / 42 / 40 / 33 / 40 / 44 / 52 / 44 / 37 / 35 / 38 / 44 / 30
Table V: Critical Values of the Correlation Coefficient
df=n-2 / 0.10 / 0.05 / 0.02 / 0.01
1 / 0.988 / 0.997 / 1.000 / 1.000
2 / 0.900 / 0.950 / 0.980 / 0.990
3 / 0.805 / 0.878 / 0.934 / 0.959
4 / 0.729 / 0.811 / 0.882 / 0.917
5 / 0.669 / 0.754 / 0.833 / 0.874
6 / 0.622 / 0.707 / 0.789 / 0.834
7 / 0.582 / 0.666 / 0.750 / 0.798
8 / 0.549 / 0.632 / 0.716 / 0.765
9 / 0.521 / 0.602 / 0.685 / 0.735
10 / 0.497 / 0.576 / 0.658 / 0.708
11 / 0.476 / 0.553 / 0.634 / 0.684
12 / 0.458 / 0.532 / 0.612 / 0.661
13 / 0.441 / 0.514 / 0.592 / 0.641
14 / 0.426 / 0.497 / 0.574 / 0.623
15 / 0.412 / 0.482 / 0.558 / 0.606
16 / 0.400 / 0.468 / 0.542 / 0.590
17 / 0.389 / 0.456 / 0.528 / 0.575
18 / 0.378 / 0.444 / 0.516 / 0.561
19 / 0.369 / 0.433 / 0.503 / 0.549
20 / 0.360 / 0.423 / 0.492 / 0.537
21 / 0.352 / 0.413 / 0.482 / 0.526
22 / 0.344 / 0.404 / 0.472 / 0.515
23 / 0.337 / 0.396 / 0.462 / 0.505
24 / 0.330 / 0.388 / 0.453 / 0.496
25 / 0.323 / 0.381 / 0.445 / 0.487
26 / 0.317 / 0.374 / 0.437 / 0.479
27 / 0.311 / 0.367 / 0.430 / 0.471
28 / 0.306 / 0.361 / 0.423 / 0.463
29 / 0.301 / 0.355 / 0.416 / 0.456
30 / 0.296 / 0.349 / 0.409 / 0.449
35 / 0.275 / 0.325 / 0.381 / 0.418
40 / 0.257 / 0.304 / 0.358 / 0.393
45 / 0.243 / 0.288 / 0.338 / 0.372
50 / 0.231 / 0.273 / 0.322 / 0.354
60 / 0.211 / 0.250 / 0.295 / 0.325
70 / 0.195 / 0.232 / 0.274 / 0.303
80 / 0.183 / 0.217 / 0.256 / 0.283
90 / 0.173 / 0.205 / 0.242 / 0.267
100 / 0.164 / 0.195 / 0.230 / 0.254