Correlations

The Bivariate Correlations procedure computes the pairwise associations for a set of variables and displays the results in a matrix. It is useful for determining the strength and direction of the association between two scale or ordinal variables.

Example:

In order to increase sales, motor vehicle design engineers want to focus their attention on aspects of the vehicle that are important to customers--for example, how important is fuel efficiency with respect to sales? One way to measure this is to compute the correlation between past sales and fuel efficiency.

Information concerning various makes of motor vehicles is collected in car_sales.sav. Use Bivariate Correlations to measure the importance of fuel efficiency to the salability of a motor vehicle.

To run a correlations analysis, from the menus choose:

Analyze
Correlate
Bivariate...

Select Sales in thousands and Fuel efficiency as analysis variables.

Click OK.

These selections produce a correlation matrix for Sales in thousands and Fuel efficiency.
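You can run the same analysis from a syntax window. The following is a minimal sketch; the variable names sales and mpg are assumptions based on common naming in car_sales.sav and may differ in your copy of the file:

* Adjust the path to car_sales.sav as needed.
GET FILE='car_sales.sav'.
* Pearson correlations between sales and fuel efficiency.
CORRELATIONS
  /VARIABLES=sales mpg
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.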

The Pearson correlation coefficient measures the linear association between two scale variables. The correlation reported in the table is, surprisingly, negative, although not significantly different from 0. This suggests that designers should not focus their efforts on making cars more fuel efficient, because there isn't an appreciable effect on sales.

However, the Pearson correlation coefficient works best when the variables are approximately normally distributed and have no outliers. A scatterplot can reveal these possible problems.

To produce a scatterplot of Sales in thousands by Fuel efficiency, from the menus choose:

Graphs
Scatter...

Click Define.

Select Sales in thousands as the y variable and Fuel efficiency as the x variable.

Click OK.
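The corresponding command syntax is a single GRAPH command (again assuming the variables are named sales and mpg):

GRAPH
  /SCATTERPLOT(BIVAR)=mpg WITH sales.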

The resulting scatterplot shows a point that is far to the right of the others.

To identify the point, activate the graph by double-clicking on it.

Click the Point Identification tool.

Select the point. It is identified as case 27.

The scatterplot shows that Sales in thousands has a skewed distribution.

Moreover, there is an outlier in Fuel efficiency. By fixing these problems, you can improve the estimates of the correlation.

Since Sales in thousands is heavily right-skewed, try replacing it with Log-transformed sales in further analyses.

The Data Editor shows that case 27, the Metro, is not representative of the vehicles that your design team is working on, so you can safely remove it from further analyses.

To remove the Metro from the correlation computations, from the menus choose:

Data
Select Cases...

Select If condition is satisfied and click If.

Type model ~= 'Metro' in the text box.

Click Continue.

Click OK in the Select Cases dialog box.

A new filter variable has been created; all cases except the Metro will be used in further computations.
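The Select Cases dialog box pastes syntax along these lines (a sketch, assuming the string variable holding the model name is called model):

USE ALL.
COMPUTE filter_$=(model ~= 'Metro').
FILTER BY filter_$.
EXECUTE.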

To analyze the filtered data, recall the Bivariate Correlations dialog box.

Deselect Sales in thousands as an analysis variable.

Select Log-transformed sales as an analysis variable.

Click OK.
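In syntax, this rerun is the same CORRELATIONS command with the log-transformed variable (lnsales is an assumed name):

CORRELATIONS
  /VARIABLES=lnsales mpg
  /PRINT=TWOTAIL NOSIG.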

With the outlier removed and sales log-transformed, the correlation is now positive but still not significantly different from 0.

However, the customer demographics for trucks and automobiles are different, and the reasons for buying a truck or a car may not be the same. It's worthwhile to look at another scatterplot, this time marking trucks and autos separately.

To produce a scatterplot of Log-transformed sales by Fuel efficiency, controlling for Vehicle type, recall the Simple Scatterplot dialog box.

Deselect Sales in thousands and select Log-transformed sales as the y variable.

Select Vehicle type as the variable to set markers by.

Click OK.
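In syntax, the marker variable is added with the BY keyword (type is an assumed name for Vehicle type):

GRAPH
  /SCATTERPLOT(BIVAR)=mpg WITH lnsales BY type.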

The scatterplot shows that trucks and automobiles form distinctly different groups. By splitting the data file according to Vehicle type, you might get a more accurate view of the association.

To split the data file according to Vehicle type, from the menus choose:

Data
Split File...

Select Compare groups.

Select Vehicle type as the variable on which groups should be based.

Click OK.
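The equivalent syntax sorts the file and then splits it; Compare groups corresponds to the LAYERED keyword:

SORT CASES BY type.
SPLIT FILE LAYERED BY type.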

To analyze the split file, recall the Bivariate Correlations dialog box.

Click OK.

Splitting the file on Vehicle type has made the relationship between sales and fuel efficiency much clearer. There is a significant and fairly strong positive correlation between sales and fuel efficiency for automobiles. For trucks, the correlation is positive but not significantly different from 0.

Reaching these conclusions has required some work and shown that correlation analysis using the Pearson correlation coefficient is not always straightforward. For comparison, see how you can avoid the difficulty of transforming variables by using nonparametric correlation measures.

Spearman's rho and Kendall's tau-b measure the rank-order association between two scale or ordinal variables. Because they are based on ranks rather than on the values themselves, they work regardless of the distributions of the variables.

To obtain an analysis using Spearman's rho, recall the Bivariate Correlations dialog box.

Select Sales in thousands as an analysis variable.

Deselect Pearson and select Spearman.

Click OK.
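The pasted syntax uses the NONPAR CORR command (again, sales, lnsales, and mpg are assumed variable names):

NONPAR CORR
  /VARIABLES=sales lnsales mpg
  /PRINT=SPEARMAN TWOTAIL NOSIG.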

Spearman's rho is reported separately for automobiles and trucks. As with Pearson's correlation coefficient, the association between Log-transformed sales and Fuel efficiency is fairly strong. However, Spearman's rho reports the same correlation for the untransformed sales! This is because rho is based on rank orders, which are unchanged by log transformation. Moreover, outliers have less of an effect on Spearman's rho, so it's possible to save some time and effort by using it as a measure of association.

Remarks:

Using Bivariate Correlations, you produced a correlation matrix for Sales in thousands by Fuel efficiency and, surprisingly, found a negative correlation. Upon removing an outlier and using Log-transformed sales, the correlation became positive, although not significantly different from 0. However, you found that by computing the correlations separately for trucks and autos, there is a positive and statistically significant correlation between sales and fuel efficiency for automobiles.

Furthermore, you found similar results without the transformation using Spearman's rho, and perhaps are wondering why you should go through the effort of transforming variables when Spearman's rho is so convenient. The measures of rank order are handy for discovering whether there is any kind of association between two variables, but when they find an association it's a good idea to find a transformation that makes the relationship linear. This is because there are more predictive models available for linear relationships, and the linear models are generally easier to implement and interpret.

The Bivariate Correlations procedure is useful for studying the pairwise associations for a set of scale or ordinal variables.

·  If you have nominal variables, use the Crosstabs procedure to obtain measures of association.

·  If you want to model the value of a scale variable based on its linear relationship to other variables, try the Linear Regression procedure.

·  If you want to decompose the variation in your data to look for underlying patterns, try the Factor Analysis procedure.

Linear Regression

Linear regression is used to model the value of a dependent scale variable based on its linear relationship to one or more predictors.

The linear regression model assumes that there is a linear, or "straight line," relationship between the dependent variable and each predictor. This relationship is described in the following formula.
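y = b0 + b1x1 + b2x2 + ... + bpxp + e

Here e is the error term.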

The model is linear because increasing the value of the jth predictor by 1 unit increases the value of the dependent by bj units. Note that b0 is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.

For the purpose of testing hypotheses about the values of model parameters, the linear regression model also assumes the following:

·  The error term has a normal distribution with a mean of 0.

·  The variance of the error term is constant across cases and independent of the variables in the model. An error term with non-constant variance is said to be heteroscedastic.

·  The value of the error term for a given case is independent of the values of the variables in the model and of the values of the error term for other cases.

Example:

The Nambe Mills company has a line of metal tableware products that require a polishing step in the manufacturing process. To help plan the production schedule, the polishing times for 59 products were recorded, along with the product type and the relative sizes of these products, measured in terms of their diameters.

This information can be found in polishing.sav. Use linear regression to determine whether the polishing time can be predicted by product size.

Before running the regression, you should examine a scatterplot of polishing time by product size to determine whether a linear model is reasonable for these variables.

To produce a scatterplot of time by diam, from the menus choose:

Graphs
Scatter...

Click Define.

Select time as the y variable and diam as the x variable.

Click OK.

These selections produce the scatterplot.
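In syntax (time and diam are the variable names used in polishing.sav):

GRAPH
  /SCATTERPLOT(BIVAR)=time WITH diam.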

To see a best-fit line overlaid on the points in the scatterplot, activate the graph by double-clicking on it.

From the Chart Editor menus choose:

Chart
Options...

Select Total in the Fit Line group.

Click OK.

The resulting scatterplot appears to be suitable for linear regression, with two possible causes for concern:

·  The variability of polishing time appears to increase with increasing diameter.

·  The point on the far right of the graph may exert an undue amount of influence on the lay of the regression line.

You will investigate these concerns further during diagnostic checking of the regression model.

To run a linear regression analysis, from the menus choose:

Analyze
Regression
Linear...

Select time as the dependent variable.

Select diam as the independent variable.

Select type as the case labeling variable.

Click Plots.

Select *SDRESID as the y variable and *ZPRED as the x variable.

Select Histogram and Normal probability plot.

Click Continue.

Click Save in the Linear Regression dialog box.

Select Standardized in the Predicted Values group.

Select Cook's and Leverage values in the Distances group.

Click Continue.

Click OK in the Linear Regression dialog box.

These selections produce a linear regression model for polishing time based on diameter. Diagnostic plots of the Studentized residuals by the model-predicted values are requested, and various values are saved for further diagnostic testing.
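Pasted as syntax, these selections look roughly like the following sketch. The /DESCRIPTIVES subcommand is added here because the descriptive statistics table is discussed below; treat the exact set of subcommands as an approximation:

REGRESSION
  /DESCRIPTIVES MEAN STDDEV
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT time
  /METHOD=ENTER diam
  /SCATTERPLOT=(*SDRESID,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID) ID(type)
  /SAVE ZPRED COOK LEVER.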

This table shows the coefficients of the regression line.

It states that the expected polishing time is equal to 3.457 * DIAM - 1.955. If Nambe Mills plans to manufacture a 15-inch casserole, the predicted polishing time would be 3.457 * 15 - 1.955 = 49.9, or about 50 minutes.

The ANOVA table tests the acceptability of the model from a statistical perspective.

The Regression row displays information about the variation accounted for by your model.

The Residual row displays information about the variation that is not accounted for by your model.

The regression and residual sums of squares are approximately equal, which indicates that about half of the variation in polishing time is explained by the model.

The significance value of the F statistic is less than 0.05, which means that the variation explained by the model is not due to chance.

While the ANOVA table is a useful test of the model's ability to explain any variation in the dependent variable, it does not directly address the strength of that relationship.

The model summary table reports the strength of the relationship between the model and the dependent variable. R, the multiple correlation coefficient, is the linear correlation between the observed and model-predicted values of the dependent variable. Its large value indicates a strong relationship. R Square, the coefficient of determination, is the squared value of the multiple correlation coefficient. It shows that about half the variation in time is explained by the model.

As a further measure of the strength of the model fit, compare the standard error of the estimate in the model summary table to the standard deviation of time reported in the descriptive statistics table.

Without prior knowledge of the diameter of a new product, your best guess for the polishing time would be about 35.8 minutes, with a standard deviation of 19.0. With the linear regression model, the error of your estimate is considerably lower, about 13.7.

A residual is the difference between the observed and model-predicted values of the dependent variable. The residual for a given product is the observed value of the error term for that product. A histogram or P-P plot of the residuals will help you to check the assumption of normality of the error term.

The shape of the histogram should approximately follow the shape of the normal curve. This histogram is acceptably close to the normal curve.

The P-P plotted residuals should follow the 45-degree line. Neither the histogram nor the P-P plot indicates that the normality assumption is violated.

The plot of residuals by the predicted values shows that the variance of the errors increases with increasing predicted polishing time. Otherwise, the residuals are well scattered.

To check the residuals by the diameter, recall the Simple Scatterplot dialog box.

Deselect time as the y variable and select Standardized Residual as the y variable.

Click OK.

The plot of residuals by diameter shows the same results. To correct the heteroscedasticity in the residuals in further analyses, you should define a weighting variable based on the inverse of the diameter of the product. Using the weighting variable will decrease the influence of products with large diameters and highly variable polishing times, resulting in more precise regression estimates.
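As a sketch of how that might be set up in syntax (the exact form of the weight, here the reciprocal of diameter, and the variable name wgt are assumptions), the /REGWGT subcommand requests weighted least-squares estimates:

* Define a weight that shrinks with product diameter.
COMPUTE wgt = 1/diam.
EXECUTE.
REGRESSION
  /DEPENDENT time
  /METHOD=ENTER diam
  /REGWGT=wgt.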