Inferential methods in regression and correlation
* Correlations
Correlation Coefficient, r
The quantity r, called the linear correlation coefficient, measures the strength and
the direction of a linear relationship between two variables. The linear correlation
coefficient is sometimes referred to as the Pearson product moment correlation coefficient in
honor of its developer Karl Pearson.
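The mathematical formula for computing r is:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where n is the number of pairs of data.
The value of r is such that $-1 \le r \le +1$. The + and - signs are used for positive
linear correlations and negative linear correlations, respectively.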
Positive correlation: If x and y have a strong positive linear correlation, r is close
to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values
indicate a relationship between x and y such that as values for x increase,
values for y also increase.
Negative correlation: If x and y have a strong negative linear correlation, r is close
to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values
indicate a relationship between x and y such that as values for x increase, values
for y decrease.
No correlation: If there is no linear correlation or a weak linear correlation, r is
close to 0. A value near zero means that there is little or no linear relationship
between the two variables; they may be unrelated or related in a nonlinear way.
Note that r is a dimensionless quantity; that is, it does not depend on the units
employed.
A perfect correlation of ±1 occurs only when the data points all lie exactly on a
straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this
line is negative.
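The properties above can be checked numerically. The following is a minimal sketch, assuming the usual computational formula for r; the function name `pearson_r` and the small data sets are illustrative and not taken from the notes.

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient from the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1, 2, 3, 4, 5]

# Points lying exactly on a straight line give a perfect correlation.
r_up = pearson_r(x, [2 * v + 1 for v in x])     # rising line
r_down = pearson_r(x, [10 - 3 * v for v in x])  # falling line

# r is dimensionless: rescaling x (a change of units) leaves it unchanged.
y = [2, 1, 4, 3, 6]
r_original = pearson_r(x, y)
r_rescaled = pearson_r([100 * v for v in x], y)

print(r_up, r_down, r_original, r_rescaled)
```

The rising line gives r = +1, the falling line gives r = -1, and the rescaled data give the same r as the original data up to floating-point rounding.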
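Pearson correlation test
The test statistic for a correlation test follows a t-distribution:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
with df = n - 2
The null hypothesis $H_0: \rho = 0$ is tested versus one of the alternatives
$H_a: \rho \ne 0$ or $H_a: \rho < 0$ or $H_a: \rho > 0$, where $\rho$ is the population linear correlation coefficient.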
EXAMPLE:
The table below shows the age and price data for a sample of 11 Orions. At the 5% significance level, do the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated?
car / Age / Price / car / Age / Price
1 / 5 / 85 / 7 / 6 / 66
2 / 4 / 103 / 8 / 6 / 95
3 / 6 / 70 / 9 / 2 / 169
4 / 5 / 82 / 10 / 7 / 70
5 / 5 / 89 / 11 / 7 / 48
6 / 5 / 98
Solution
First: by hand
1. State the null and alternative hypotheses: $H_0: \rho = 0$ versus $H_a: \rho < 0$ (left-tailed test)
2. Calculate the data as in the table below
age (x) / price (y) / x-square / y-square / x*y
5 / 85 / 25 / 7225 / 425
4 / 103 / 16 / 10609 / 412
6 / 70 / 36 / 4900 / 420
5 / 82 / 25 / 6724 / 410
5 / 89 / 25 / 7921 / 445
5 / 98 / 25 / 9604 / 490
6 / 66 / 36 / 4356 / 396
6 / 95 / 36 / 9025 / 570
2 / 169 / 4 / 28561 / 338
7 / 70 / 49 / 4900 / 490
7 / 48 / 49 / 2304 / 336
Tot=58 / Tot=975 / Tot=326 / Tot=96129 / Tot=4732
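3. Substitute in the formula:
$$r = \frac{11(4732) - (58)(975)}{\sqrt{\left[11(326) - 58^2\right]\left[11(96129) - 975^2\right]}} = \frac{-4498}{\sqrt{(222)(106794)}} \approx -0.924$$
4. Calculate the test statistic:
$$t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{-0.924}{\sqrt{(1 - 0.924^2)/9}} \approx -7.249$$
5. Find the critical value from the t-distribution table at $\alpha = 0.05$ and degrees of freedom = 11 - 2 = 9 in one tail (left tail):
t critical = -1.833
6. Decision: the test statistic lies in the rejection region (t = -7.249 < t critical = -1.833), and the p-value = P(t < -7.249) = 0.00002 < 0.05, so we reject the null hypothesis.
7. Interpret the results:
At the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated.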
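The hand calculation can be reproduced with a short script. This is a sketch using only the age and price data from the example; the variable names are illustrative, and the critical value is the tabled value for a left-tailed test with 9 degrees of freedom rather than something the code computes.

```python
import math

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(age)

# Column totals of the working table.
sum_x = sum(age)
sum_y = sum(price)
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in price)
sum_xy = sum(x * y for x, y in zip(age, price))

# Linear correlation coefficient from the sums.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# t statistic for H0: rho = 0 with df = n - 2.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

T_CRITICAL = -1.833  # from the t table: alpha = 0.05, df = 9, left tail
reject_h0 = t_stat < T_CRITICAL

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 4), round(t_stat, 3), reject_h0)
```

With unrounded r the statistic comes out near -7.24; the value -7.249 in the hand calculation results from rounding r to -0.924 before substituting. The decision is the same either way.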
Second: by using the SPSS procedure
1. Enter the data and plot a scatter plot.
2. Find the Pearson correlation by using SPSS.
SPSS Outputs
Critical value of t = -1.833
The value of the test statistic falls in the rejection region, and the p-value = 0.0000244 < 0.05, so we reject H0.
Interpret results
At the 5% significance level, the data provide sufficient evidence to conclude that the age and price of Orions are negatively linearly correlated; prices for Orions tend to decrease linearly with increasing age.
Spearman correlation
Spearman's Rank Order Correlation using SPSS
Objectives
The Spearman Rank Order Correlation coefficient, rs, is a non-parametric measure of the strength and direction of association that exists between two variables measured on at least an ordinal scale. It is denoted by the symbol rs (or the Greek letter ρ, pronounced rho). The test is used for either ordinal variables or for interval data that have failed the assumptions necessary for conducting the Pearson's product-moment correlation.
Assumptions
- Variables are measured on an ordinal, interval or ratio scale.
- Variables need NOT be normally distributed.
- This type of correlation is NOT very sensitive to outliers.
Example
A teacher is interested in whether those who do best at English also do best in Maths (assessed by exam). She records the scores of her 10 students as they performed in end-of-year examinations for both English and Maths.
English / 56 / 75 / 45 / 71 / 61 / 64 / 58 / 80 / 76 / 61
Maths / 66 / 70 / 40 / 60 / 65 / 56 / 59 / 77 / 67 / 63
Hypothesis :
First, create a table with six columns and label them as below:
English (mark) / Maths (mark) / Rank (English) / Rank (maths) / d / d²
56 / 66 / 9 / 4 / 5 / 25
75 / 70 / 3 / 2 / 1 / 1
45 / 40 / 10 / 10 / 0 / 0
71 / 60 / 4 / 7 / 3 / 9
61 / 65 / 6.5 / 5 / 1.5 / 2.25
64 / 56 / 5 / 9 / 4 / 16
58 / 59 / 8 / 8 / 0 / 0
80 / 77 / 1 / 1 / 0 / 0
76 / 67 / 2 / 3 / 1 / 1
61 / 63 / 6.5 / 6 / 0.5 / 0.25
Where d = absolute difference between ranks and d² = difference squared. The two English marks of 61 are tied, so each receives the average of the ranks they would occupy, (6 + 7)/2 = 6.5.
We then calculate the following:
$$\sum d^2 = 25 + 1 + 0 + 9 + 2.25 + 16 + 0 + 0 + 1 + 0.25 = 54.5$$
We then substitute this into the main equation with the other information as follows:
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(54.5)}{10(10^2 - 1)} = 1 - \frac{327}{990} \approx 0.67$$
as n = 10. Hence, we have rs of 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
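The ranking and substitution steps can be sketched in code. This is a minimal illustration, assuming the highest mark gets rank 1 and tied marks share the average of the ranks they occupy; the helper `average_ranks` is a hypothetical name, not part of any library.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1           # best rank within the tied group
        count = ordered.count(v)               # number of values in the tied group
        ranks.append(first + (count - 1) / 2)  # mean of first .. first + count - 1
    return ranks

english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

rank_eng = average_ranks(english)
rank_math = average_ranks(maths)

# Squared rank differences and the Spearman formula.
d_squared = [(a - b) ** 2 for a, b in zip(rank_eng, rank_math)]
n = len(english)
sum_d2 = sum(d_squared)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(rank_eng)
print(rank_math)
print(sum_d2, round(r_s, 2))
```

The two English marks of 61 each receive rank 6.5, and the coefficient rounds to 0.67. When ties are present the simple formula is only approximate; statistical software typically computes the Pearson correlation of the ranks instead, which can differ slightly in the third decimal place.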
How do you report a Spearman's correlation?
How you report a Spearman's correlation coefficient depends on whether or not you have determined the statistical significance of the coefficient. If you have simply run the Spearman correlation without any statistical significance tests, then you can simply state the value of the coefficient as shown below:
rs = 0.67
However, if you have also run statistical significance tests then you need to include some more information as shown below:
The critical value is taken from the Spearman rank table at significance level α, where N = number of pairwise cases.
Decision
rs calculated (= 0.67) > critical value
So we reject H0
Conclusion:
There is a positive relationship between the ranks individuals obtained in the maths and English exam. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
Note: when the number of pairwise cases is large (> 30), we can use the z distribution, and the test statistic is:
$$z = r_s\sqrt{n - 1}$$
Applying this formula to our example for illustration:
z = 0.67 × sqrt(10 - 1) = 2.01
z critical = 1.96 from the z-table
z calculated = 2.01 > z critical = 1.96, so we reject H0.
There is a positive relationship between the ranks individuals obtained in the maths and English exams. That is, the higher you ranked in maths, the higher you ranked in English also, and vice versa.
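The z approximation is a one-line calculation. The sketch below assumes the statistic z = rs × sqrt(n - 1) used above; the function name `spearman_z` is illustrative, and the critical value 1.96 is the tabled value quoted in the notes.

```python
import math

def spearman_z(r_s, n):
    """Large-sample z statistic for a Spearman coefficient."""
    return r_s * math.sqrt(n - 1)

z = spearman_z(0.67, 10)
Z_CRITICAL = 1.96  # from the z table, as quoted in the notes
reject_h0 = z > Z_CRITICAL

print(round(z, 2), reject_h0)
```

Because the statistic grows with sqrt(n - 1), the same rs gives stronger evidence against H0 as the number of pairs increases.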
Test Procedure in SPSS
- Click Analyze > Correlate > Bivariate... on the menu system.
Published with written permission from SPSS Inc, an IBM Company.
- Transfer the variables "English_Mark" and "Maths_Mark" into the "Variables" box by dragging-and-dropping or by clicking the button. You will end up with a screen similar to the one below:
- Make sure that you uncheck the Pearson tickbox (it is selected by default in SPSS) and check the Spearman tickbox under the "Correlation Coefficients" group.
- Click the button.
Output SPSS
You will be presented with 3 tables in output viewer under the title "Correlations" as below:
The results are presented in a matrix such that, as can be seen, the correlations are replicated. Nevertheless, the table presents Spearman's Rank Order Correlation, its significance value and the sample size that the calculation was based on. In this example, we can see that Spearman's correlation coefficient, rs, is 0.669 and that this is statistically significant (P = 0.033).
Reporting the Output
In our example you might present the results as follows: A Spearman's Rank Order correlation was run to determine the relationship between 10 students' English and maths exam marks. There was a strong, positive correlation between English and maths marks, which was statistically significant (rs = 0.669, P = 0.033).
* Regression inference:
Assumptions for regression inferences
1- Population regression line: there are constants β0 and β1 such that, for each value x of the predictor variable, the conditional mean of the response variable is β0 + β1x.
2- Equal standard deviations (homoscedasticity): the conditional standard deviations of the response variable are the same for all values of the predictor variable.
3- Normal distributions: for each value of the predictor variable, the conditional distribution of the response variable is a normal distribution.
4- Independent observations: the observations of the response variable are independent of one another.
Hypothesis test for the slope of the population regression line
Example: for the data in the table below, at the 5% significance level, do the data provide sufficient evidence to conclude that age is useful as a linear predictor of price for Orions?
car / Age / Price / car / Age / Price
1 / 5 / 85 / 7 / 6 / 66
2 / 4 / 103 / 8 / 6 / 95
3 / 6 / 70 / 9 / 2 / 169
4 / 5 / 82 / 10 / 7 / 70
5 / 5 / 89 / 11 / 7 / 48
6 / 5 / 98
Solution:
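$H_0: \beta_1 = 0$ (age is not useful as a linear predictor of price for Orions)
$H_a: \beta_1 \ne 0$ (age is useful as a linear predictor of price for Orions)
Age: independent (explanatory) variable
Price: dependent (response) variable
Test statistic
$$t = \frac{b_1}{s_e/\sqrt{S_{xx}}}, \qquad df = n - 2$$
(where $s_e$ is the Std. Error of the Estimate) $s_e = \sqrt{SSE/(n - 2)}$, where $SSE = \sum (y - \hat{y})^2$ (sum of squares of errors) and $S_{xx} = \sum (x - \bar{x})^2$
The critical values
$\pm t_{\alpha/2} = \pm t_{0.025} = \pm 2.262$ with df = 11 - 2 = 9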
The value of the test statistic falls in the rejection region, and the two-tailed p-value = 0.0000488 < 0.05, so we reject H0.
Interpret the result of the hypothesis test:
At the 5% significance level, the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and hence that age is useful as a linear predictor of price for Orions.
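The slope test can be sketched as follows. This is an illustration computed from the age and price data of the example, assuming the standard least-squares estimates; the variable names are my own, and the critical value 2.262 is the tabled two-tailed value for 9 degrees of freedom rather than something the code computes.

```python
import math

age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(age)

mean_x = sum(age) / n
mean_y = sum(price) / n
s_xx = sum((x - mean_x) ** 2 for x in age)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, price))

# Least-squares slope and intercept of the sample regression line.
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Standard error of the estimate from the residuals.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, price))
s_e = math.sqrt(sse / (n - 2))

# t statistic for H0: beta1 = 0 with df = n - 2.
t_stat = b1 / (s_e / math.sqrt(s_xx))

T_CRITICAL = 2.262  # from the t table: alpha/2 = 0.025, df = 9
reject_h0 = abs(t_stat) > T_CRITICAL

print(round(b1, 3), round(s_e, 2), round(t_stat, 2), reject_h0)
```

The statistic is about -7.24, well beyond the critical values of ±2.262. In simple linear regression this t value equals the one from the correlation test on the same data, which is why the two tests lead to the same decision.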
SPSS procedure:
Regression
Determine coefficients
REGRESSION LINE: PRICE = 195.468 - 20.261 × AGE
COEFFICIENT OF DETERMINATION = 0.853
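The reported coefficients can be cross-checked without SPSS. The sketch below assumes the usual least-squares formulas and the definition of the coefficient of determination as 1 - SSE/SST; it is not the SPSS procedure itself, and the variable names are illustrative.

```python
age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(age)

mean_x = sum(age) / n
mean_y = sum(price) / n

# Least-squares coefficients of PRICE = b0 + b1 * AGE.
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, price)) / sum(
    (x - mean_x) ** 2 for x in age
)
b0 = mean_y - b1 * mean_x

# Coefficient of determination: share of the variation in price explained by age.
sst = sum((y - mean_y) ** 2 for y in price)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, price))
r_squared = 1 - sse / sst

print(round(b0, 3), round(b1, 3), round(r_squared, 3))
```

The results agree with the reported regression line and coefficient of determination to three decimal places.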