Correlation and regression

1.  Product moment correlation (Pearson Correlation Coefficient or simply correlation coefficient) between two metric (interval or ratio) variables

·  Open file File3.sav (in SPSS)

·  Analyze → Correlate → Bivariate → Variables: Attitude vs. Duration (a Python equivalent is sketched below)

·  Pearson correlation coefficient
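
For readers who want to replicate this outside SPSS, here is a minimal Python sketch of the same Pearson correlation. The column names ("attitude", "duration") are assumptions about how the File3.sav variables would be named after import; pandas.read_spss requires the pyreadstat package.

    # Pearson correlation between two metric variables (Python equivalent of the SPSS output)
    import pandas as pd
    from scipy import stats

    df = pd.read_spss("File3.sav")        # requires pyreadstat; column names below are assumed
    r, p = stats.pearsonr(df["attitude"], df["duration"])
    print(f"Pearson r = {r:.3f}, two-tailed p = {p:.4f}")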

2.  Partial correlation coefficient

·  Analyze → Correlate → Partial → Variables: Attitude vs. Duration, Controlling for: Importance of the Weather (see the sketch below)
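
A minimal Python sketch of the same first-order partial correlation, computed from the three pairwise Pearson correlations; the column names ("attitude", "duration", "weather") are assumptions.

    # Partial correlation of attitude and duration, controlling for importance of the weather
    import pandas as pd
    from scipy import stats

    df = pd.read_spss("File3.sav")                      # requires pyreadstat
    r_xy = stats.pearsonr(df["attitude"], df["duration"])[0]
    r_xz = stats.pearsonr(df["attitude"], df["weather"])[0]
    r_yz = stats.pearsonr(df["duration"], df["weather"])[0]

    # first-order partial correlation: r(attitude, duration | weather)
    r_partial = (r_xy - r_xz * r_yz) / (((1 - r_xz**2) * (1 - r_yz**2)) ** 0.5)
    print(f"Partial correlation = {r_partial:.3f}")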

3.  Nonmetric correlation (ordinal scale): use the same procedure as above

·  Spearman’s ρ (rho): use when there are not too many ties in the rankings (reported in SPSS as the Spearman correlation coefficient)

·  Kendall’s τ (tau): use when there are many ties in the rankings (both are sketched in Python below)
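
A Python sketch of both rank correlations, under the same assumed column names:

    # Spearman's rho and Kendall's tau (nonmetric correlation)
    import pandas as pd
    from scipy import stats

    df = pd.read_spss("File3.sav")                      # requires pyreadstat
    rho, p_rho = stats.spearmanr(df["attitude"], df["duration"])
    tau, p_tau = stats.kendalltau(df["attitude"], df["duration"])
    print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
    print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.4f})")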

4.  Bivariate Regression Analysis (One dependent, one independent variable)

·  Analyze → Regression → Linear (Dependent: Attitude; Independent: Duration); a Python sketch of this regression appears at the end of this section

·  Y(Attitude) = 1.079 + 0.59*X(Duration), where Duration is significant (Sig. = 0.000) with a positive sign (+0.59)

·  Interpretation: if Duration increases by 1 year, Attitude increases by 0.59 points on the scale from 1 to 11.

·  Example: predict the Attitude of a person who has lived 10 years in the city: Attitude = 1.079 + 0.59*10 = 6.979. Note: the above interpretation is valid only when Duration is significant. In order to determine this:

·  Let’s test Ho: Regression Coefficient = 0; H1: Regression Coefficient ≠ 0 (the t test in the SPSS output rejects Ho: Sig. = 0.000)

·  R2 (Coefficient of Determination) = 0.876

·  Beta Coefficients allow us to determine which independent variable is relatively more important (has a greater impact) in predicting the dependent variable. The (non-standardized) Regression Coefficients cannot be used for this purpose! Beta Coefficients eliminate the problem of dealing with different units of measurement, e.g. one independent variable expressed in thousands of dollars (say, parents’ income) and another expressed in dollars (say, a child’s monthly allowance) when one wants to predict the child’s monthly expenditures on CDs. Solution: first standardize the independent variables (if they are measured in different units), then run the regression analysis on them → you will get Beta Coefficients that can be compared against each other.

·  Adjusted R2 = 0.864 (always lower than unadjusted R2)

·  Use Adj-R2 when the number of observations per independent variable falls below 10 to 15 (with 4 being an absolute minimum!!)

·  Adj-R2 is useful when one has to compare several regression equations involving the same dependent variable but different numbers of independent variables or different sample sizes.

·  To test the significance of the linear relationship, one may also use the F test for the significance of the R2: Ho: R2 = 0; H1: R2 > 0.

·  F = 105.95/(14.96/10) = 70.803 and is statistically significant at alpha = 0.05 (this confirms the t-test results presented above)
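
The bivariate regression can be reproduced in Python with statsmodels; this is a sketch under the same assumed column names, and the coefficients, R2 and F statistic it prints should match the SPSS values quoted above (up to rounding).

    # Bivariate regression: Attitude ~ Duration
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_spss("File3.sav")                 # requires pyreadstat
    X = sm.add_constant(df[["duration"]])          # adds the intercept term
    model = sm.OLS(df["attitude"], X).fit()

    print(model.params)                            # intercept and regression coefficient
    print(model.pvalues)                           # t-test significance of each coefficient
    print(model.rsquared, model.rsquared_adj)      # R2 and adjusted R2
    print(model.fvalue, model.f_pvalue)            # F test of the overall fit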

5.  Multiple Regression Analysis (One dependent, several independent variables)

·  Analyze → Regression → Linear → Dependent: Attitude; Independent: Duration, Importance of the Weather; Method: Enter. [Try also: Method: Stepwise] (A Python sketch follows at the end of this section.)

·  Y = 0.337 + 0.481*Duration + 0.289*Importance of the Weather

·  Beta Coefficients are: 0.764 for Duration and 0.314 for Importance of the Weather

·  Interpretation of Partial Regression and Beta Coefficients

·  Interpretation of Adj-R2 = 0.933

·  Significance Testing

·  Ho: a given Regression Coefficient = 0; H1: that Regression Coefficient ≠ 0 → t test (run separately for each coefficient) → Reject Ho at the 5% significance level

·  Ho: R2 = 0; H1: R2 > 0 → F test → Reject Ho at the 5% significance level
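
A Python sketch of the multiple regression, including the Beta coefficients obtained by rerunning the regression on standardized variables (column names are assumptions):

    # Multiple regression: Attitude ~ Duration + Importance of the Weather
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_spss("File3.sav")                        # requires pyreadstat
    cols = ["attitude", "duration", "weather"]
    model = sm.OLS(df["attitude"], sm.add_constant(df[["duration", "weather"]])).fit()
    print(model.params)                                   # partial regression coefficients
    print(model.rsquared_adj)                             # adjusted R2

    # Beta coefficients: standardize all variables first, then regress again
    z = (df[cols] - df[cols].mean()) / df[cols].std()
    betas = sm.OLS(z["attitude"], sm.add_constant(z[["duration", "weather"]])).fit().params
    print(betas[["duration", "weather"]])                 # comparable across predictors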

6.  Summary of steps needed to perform Regression (Bivariate or Multiple) Analysis

a.  Determine what to do with Missing Data

·  Options in SPSS:

·  Exclude cases listwise (good when the % of respondents with missing data is small)

·  Exclude cases pairwise (use when the % of respondents with missing data is substantial)

·  Replace with mean (only as a last resort; it has many disadvantages, e.g. all missing values receive a single constant value, which depresses the observed correlations). A pandas sketch of these three options follows below.
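
As a rough illustration, the three SPSS options map onto the following pandas operations (column names are placeholders):

    # Handling missing data before regression
    import pandas as pd

    df = pd.read_spss("File3.sav")                     # requires pyreadstat
    cols = ["attitude", "duration", "weather"]

    listwise = df[cols].dropna()                       # exclude cases listwise
    pairwise_corr = df[cols].corr()                    # pandas correlations use pairwise deletion
    mean_filled = df[cols].fillna(df[cols].mean())     # replace with mean (last resort)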

b.  Detect outliers

·  Use the following SPSS procedures (Exercise: Open file World90.sav)

·  For Univariate Detection of Outliers:

·  Analyze → Descriptive Statistics → Descriptives → Paste all the variables (except Respondent Number) into the Variables box → Check the box: Save standardized values as variables → OK (a Python equivalent is sketched below)

·  For n < 80: if a Z-variable (standardized value) is greater than 2.5 or less than –2.5 → you have an outlier

·  For n >80, use the threshold value of 3 (or even 4 for very large n, say several hundred)
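
A Python sketch of the univariate screening, assuming World90.sav has been imported and the respondent-number column is named "respondent_no" (an assumption):

    # Univariate outlier detection via standardized values (z-scores)
    import pandas as pd

    df = pd.read_spss("World90.sav")                   # requires pyreadstat
    numeric = df.select_dtypes("number").drop(columns=["respondent_no"], errors="ignore")
    z = (numeric - numeric.mean()) / numeric.std()

    threshold = 2.5                                    # n < 80; use 3 (or 4) for larger samples
    outliers = (z.abs() > threshold).any(axis=1)
    print(df[outliers])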

·  For Multivariate Detection of Outliers:

·  Analyze → Regression → Linear → Paste Attitude as the Dependent Variable, and Duration and Importance of the Weather as Independent Variables → Method: Stepwise → In the Save dialog check the box for Studentized residuals (to calculate SRE: the most common form of residual used to flag outliers) and for the Mahalanobis Distance (another way of identifying outliers: the Mahalanobis Distance is a measure of the distance in multidimensional space of each observation from the mean centre of the observations for all the independent variables). A Python sketch of both checks appears at the end of this subsection.

·  If SRE > +1.96 or SRE < -1.96, the observation is a statistically significant (at 5%) outlier (Remember, however, that this works well only when n > 30-50)

·  If an observation has a substantially higher (2 or 3 times higher) Mahalanobis Distance than the remaining observations → it is an outlier.

·  Example:

   Observation No.    Mahalanobis Distance
   1                  1.2
   2                  1.4
   3                  1.9
   4                  2.1
   5                  4.3
   6                  4.5

Observations No. 5 and 6 are outliers.

·  A researcher should use as many of the methods for detecting outliers as possible, looking for a consistent pattern of outliers across all these methods.

·  There are many other methods for detecting both outliers and so-called influential or leverage observations; they are beyond the scope of this course.

·  All outliers should be eliminated from further analysis because our objective is to estimate the regression equation on a representative sample to obtain generalizable results.
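
A Python sketch of the two multivariate checks described above (studentized residuals and Mahalanobis distances); the column names are assumptions, and the distances printed are squared distances from the centroid of the independent variables:

    # Multivariate outlier detection: studentized residuals and Mahalanobis distance
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_spss("File3.sav")                           # requires pyreadstat
    X = df[["duration", "weather"]]
    model = sm.OLS(df["attitude"], sm.add_constant(X)).fit()

    sre = model.get_influence().resid_studentized_internal   # SRE
    print(df[np.abs(sre) > 1.96])                            # statistically significant (5%) outliers

    # squared Mahalanobis distance of each case from the centroid of the X variables
    diff = (X - X.mean()).values
    inv_cov = np.linalg.inv(np.cov(X.values, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    print(pd.Series(d2, index=df.index).sort_values(ascending=False).head())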

c.  Test the assumptions in Multiple Regression Analysis:

·  Normality of the error term distribution with the mean 0

·  Linearity: the relationship between dependent and independent variables is linear

·  Constant variance of the error terms: also called homoscedasticity (if this condition is not satisfied, the data are said to be heteroscedastic)

·  Independence of the error terms (lack of autocorrelation)

A researcher has to verify each of the above assumptions. How?

1.  Assessing Normality and Linearity

·  Analyze the plot of the Predicted Value of Y (horizontal axis) vs. the Studentized Residuals (vertical axis) and all the standardized partial regression plots.

·  How to obtain these plots?

·  Analyze → Regression → Linear → Dependent and Independent Variables (as before) → Method: Stepwise → Plots: Y-Axis: *SRESID; X-Axis: DEPENDNT → Check the boxes: Produce all partial plots and Normal probability plot → Continue → OK

·  Normality

·  Interpretation: the Normal P-P Plot of Regression Standardized Residual shows that the residuals fall along the diagonal with no substantial or systematic departures; thus the residuals are considered to represent a normal distribution.

·  In the case of non-normality proceed as follows:

·  If the distribution is skewed to the left (i.e. skewness < 0) → employ a square root transformation, i.e. SQRT(Y) and/or SQRT(X)

·  If the distribution is skewed to the right (i.e. skewness > 0) → employ the logarithm, i.e. log(Y) and/or log(X)

·  If the distribution is flat (Kurtosis < 0) → employ the inverse, 1/Y or 1/X or both.

·  In general, you should apply all possible transformations and select the best result, i.e. the one that meets all the assumptions in the best possible way (a Python sketch of the normality check and these transformations follows below).
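
A Python sketch of this step: check the skewness and kurtosis of the variable (or of the saved residuals), then compare the candidate transformations suggested above. Column names are assumptions, and the log and inverse transformations require strictly positive values:

    # Assessing normality and trying the suggested transformations
    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.read_spss("File3.sav")                 # requires pyreadstat
    y = df["attitude"]
    print("skewness:", stats.skew(y), "kurtosis:", stats.kurtosis(y))
    # stats.probplot(y, plot=plt) draws a normal probability plot if matplotlib.pyplot is imported as plt

    candidates = {"sqrt": np.sqrt(y), "log": np.log(y), "inverse": 1.0 / y}
    for name, transformed in candidates.items():
        print(name, "-> skewness", round(stats.skew(transformed), 3))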

·  Linearity

·  The scatterplot of Attitude vs. the Studentized Residuals is very similar to the Null Plot, so no systematic pattern is visible. In general, a Null Plot indicates that ALL assumptions are met! However, the partial regression plots must also be examined.

·  Partial Regression Plot for Duration shows a nice, linear pattern

·  Partial Regression Plot for Importance shows a slightly non-linear shape

·  Possible remedy: for this type of non-linear pattern, transform the variable X2 = Importance into a new variable that is the square of X2; then transform Y into log(Y), –1/Y or SQRT(Y). Perform these operations in SPSS via Transform → Compute → etc.

·  General Guidelines for Transformations:

·  For a noticeable effect from transformations, the ratio of a variable’s mean to its standard deviation (i.e. inverse of the CV – Coefficient of Variation) should be less than 4.0

·  When the transformations can be performed on either of two variables, select the variable with the smallest 1/CV, i.e. the largest CV

·  Apply transformations only to the INDEPENDENT variables (X), except in the case of heteroscedasticity, as explained below:

·  In general, heteroscedasticity can be remedied only by transforming Y. However, if on top of heteroscedasticity there is also a non-linearity, you may have to transform both Y and X.

·  Remember that transformations will change the interpretation of the variables!
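
A quick check of the mean-to-standard-deviation guideline above (a minimal sketch; column names are assumptions):

    # Is a transformation likely to have a noticeable effect? (mean/std < 4 rule of thumb)
    import pandas as pd

    df = pd.read_spss("File3.sav")                 # requires pyreadstat
    for col in ["duration", "weather"]:
        ratio = df[col].mean() / df[col].std()     # inverse of the coefficient of variation
        verdict = "transformation may help" if ratio < 4.0 else "little effect expected"
        print(f"{col}: mean/std = {ratio:.2f} -> {verdict}")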

2.  Assessing Homoscedasticity

·  Again, analyze the plot of Predicted Value of Y (horizontal axis) vs. Studentized Residuals (vertical axis). This time, however, look for a pattern of increasing or decreasing residuals. If there is no such pattern, this indicates homoscedasticity.

·  In the case of heteroscedasticity (the residuals form a cone):

·  In particular, if the cone opens to the right, take the inverse of Y; if the cone opens to the left, take the square root of Y.
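
A Python sketch of the residual plot used to judge homoscedasticity (column names are assumptions):

    # Plot studentized residuals against predicted values and look for a "cone"
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    df = pd.read_spss("File3.sav")                        # requires pyreadstat
    model = sm.OLS(df["attitude"], sm.add_constant(df[["duration", "weather"]])).fit()

    plt.scatter(model.fittedvalues, model.get_influence().resid_studentized_internal)
    plt.axhline(0, linewidth=1)
    plt.xlabel("Predicted value of Y")
    plt.ylabel("Studentized residual")
    plt.show()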

3.  Assessing Independence of the Error Terms

·  Plot the residuals against any possible sequencing variable, e.g. time or even the respondent’s number (as in our case)

·  How? Use the following SPSS steps:

·  Analyze → Regression → Linear → Paste all the variables as before → In Save, save the Studentized Residuals

·  Then, Graphs → Scatter → Simple → Define → Y Axis: Studentized Residual, X Axis: Respondent Number → OK and analyze the pattern. Ideally, the pattern should look like a Null Plot, which indicates independence of the error terms. (A Python sketch of this check appears at the end of this subsection.)

·  If the error terms are NOT independent:

·  In this case, the most common remedy is to add to the model one or more independent variables that represent apparently omitted causal factors explaining Y.
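
A Python sketch of this check: the residuals plotted against the case sequence, plus the Durbin-Watson statistic as a numerical indicator of autocorrelation (the statistic is not part of the SPSS steps above; values near 2 suggest independent errors). Column names are assumptions:

    # Assessing independence of the error terms
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from statsmodels.stats.stattools import durbin_watson

    df = pd.read_spss("File3.sav")                        # requires pyreadstat
    model = sm.OLS(df["attitude"], sm.add_constant(df[["duration", "weather"]])).fit()

    plt.scatter(range(len(df)), model.get_influence().resid_studentized_internal)
    plt.xlabel("Respondent number (sequence)")
    plt.ylabel("Studentized residual")
    plt.show()

    print("Durbin-Watson:", durbin_watson(model.resid))   # ~2 indicates no autocorrelation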

Points 1-3 above show how to check the assumptions of the regression model and offer possible remedies in case these assumptions are not met.

There is one more thing to verify: the MULTICOLLINEARITY of the independent variables X. High collinearity among the independent variables can make the results unstable and thus not generalizable. High multicollinearity also limits the incremental R2 gained by adding further (collinear) independent variables and makes it difficult to determine the contribution of each independent variable, because the effects of the independent variables are “mixed” (confounded). Multicollinearity can also have a devastating effect on the estimation of the regression coefficients (even yielding wrong signs) and on their statistical significance tests.

·  Analyze → Regression → Linear → Paste all the variables as before → In Statistics: check Collinearity diagnostics → Continue → OK (a Python VIF/Tolerance computation is sketched at the end of this section)

·  Interpretation:

·  Condition Index = 4.561 for Duration and 5.141 for Importance. The threshold value for the Condition Index is between 15 and 30, with 30 being the most commonly used value. Since none of the above CIs exceeds 30, there is no problematic collinearity between X1 (Duration) and X2 (Importance).

·  If CI > 30, look at the Variance Proportions. If, in a row with CI > 30, at least TWO Variance Proportions are greater than 0.90, the corresponding variables are significantly collinear.

·  Additionally, consider the Tolerance and VIF (Variance Inflation Factor) statistics:

·  For example, Tolerance for Duration is 0.698 or VIF (= 1/Tolerance) is 1.433

·  Interpretation: if VIF > 10 (or Tolerance < 0.10), there is collinearity. Again, in our case there is no collinearity. Tolerance is the amount of variability of the selected independent variable not explained by the other independent variables; hence the higher the tolerance, the lower the collinearity.

·  What if we find collinearity among the independent variables?

·  Remedy:

·  Omit one or more highly correlated variables (if possible, replace them with other independent variables)

·  If this is impossible, use the results only for forecasting. Do not interpret the regression coefficients.

·  A more sophisticated approach is also possible: use regression on principal components.
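
A Python sketch of the Tolerance/VIF diagnostics discussed above (column names are assumptions):

    # Collinearity diagnostics: VIF and Tolerance for each independent variable
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_spss("File3.sav")                        # requires pyreadstat
    X = sm.add_constant(df[["duration", "weather"]])

    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X.values, i)
        print(f"{name}: VIF = {vif:.3f}, Tolerance = {1 / vif:.3f}")   # VIF > 10 signals collinearity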