Transformations in Simple Linear Regression
Example - PCBtrout.JMP in the Biometry JMP folder
In this experiment we are studying the relationship between age of trout and the PCB concentration found in their tissues. The rainbow trout were all sampled from Lake Cayuga in New York.

Sample Correlation (r)

/ PCB /
Age / r = 0.7364

Scatterplot Matrix


We begin by examining the correlation between age and PCB concentration as well as a scatterplot matrix.

The correlation is r = .7364, a moderate positive correlation; however, examination of the scatterplot matrix suggests that the relationship is not linear. The Bulging Rule discussed in class suggests that lowering the power on Y and/or increasing the power on X should straighten the scatterplot.

Bulging Rule:

We begin by taking the log base 10 of the PCB concentration. To do this, create a new column and double-click at the top of the column. Select Formula from the New Property pull-down menu and click Edit Formula. Then select Transcendental from the right-hand menu (because the logarithm is a transcendental function) and click on Log 10. Finally, select the variable you wish to take the log of, which in this case is PCB concentration. When you are finished, the expression in the calculator window should read

Log10(PCB)

Now examine the relationship between age and log10(PCB). The results are shown below.

Sample Correlation (r)

/ log10(PCB) /
Age / r = 0.8552


Scatterplot Matrix

We can see that the correlation has increased after transformation.
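The same before-and-after comparison can be sketched outside JMP. The Python snippet below is only an illustration: the data are made up to grow roughly exponentially (they are not the Lake Cayuga trout measurements), and the helper name pearson_r is an assumption introduced here. It shows why r tends to rise when a curved, exponential-like relationship is straightened by taking log10 of Y.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (illustration only, NOT the trout data):
# y grows roughly exponentially with x.
age = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pcb = [0.7, 1.1, 1.4, 2.6, 3.1, 5.9, 7.4, 14.0, 17.5, 30.2]

# Same transformation as the JMP formula Log10(PCB).
log10_pcb = [math.log10(v) for v in pcb]

r_raw = pearson_r(age, pcb)
r_log = pearson_r(age, log10_pcb)
print(f"r(age, pcb)        = {r_raw:.4f}")
print(f"r(age, log10(pcb)) = {r_log:.4f}")
```

A higher r after transformation is only a rough guide; the scatterplot should still be inspected for remaining curvature.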

There still appears to be some curvature present, however. The bend is such that the Bulging Rule suggests either raising the power on Y or lowering the power on X. We could consider raising the power on Y, which implies the log transformation of Y may have been too strong. However, in this case we will consider lowering the power on X. Let’s consider using the square root of age. Again we need to add a column that will contain the results of a formula. To take the square root of a variable, simply select the square root button and put the variable Age under the radical. The result should look like this:

Again examination of the correlation and scatter plot matrix shows improvement in terms of linearity.


Sample Correlation (r)

/ sqrt(Age) /
log10(PCB) / r = 0.8866


Scatterplot Matrix

At this point we could feel comfortable building a regression model for these data. Select Analyze > Fit Y by X and put log10(PCB) in the Y box and sqrt(Age) in the X box. To fit the regression line, select Fit Line from the Bivariate Fit pull-down menu located above the scatterplot. The results are shown below.

The regression equation is:

log10(PCB) = -.519 + .521*sqrt(Age)

To use this equation to predict the PCB concentration for a fish that is, for example, 5 years old, we would take the square root of 5 and plug it into the regression equation. The predicted log10 PCB concentration would be:
-.519 + .521*2.236 = .646 log10(ppm)
which corresponds to a PCB concentration of 10^.646, or about 4.4 ppm.
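This arithmetic can be checked with a short script. The following is a minimal Python sketch, not part of the JMP workflow; it uses the rounded coefficients quoted above (-.519 and .521), and the function names are assumptions introduced for illustration. Because the coefficients are rounded, the last digit may differ slightly from what JMP reports.

```python
import math

# Rounded coefficients from the fitted line quoted in the text.
INTERCEPT = -0.519
SLOPE = 0.521

def predict_log10_pcb(age_years):
    """Predicted log10(PCB) for a trout of the given age (years)."""
    return INTERCEPT + SLOPE * math.sqrt(age_years)

def predict_pcb_ppm(age_years):
    """Back-transform the prediction from log10(ppm) to ppm via 10^x."""
    return 10.0 ** predict_log10_pcb(age_years)

log10_pred = predict_log10_pcb(5)
ppm_pred = predict_pcb_ppm(5)
print(f"predicted log10(PCB) at age 5: {log10_pred:.3f}")
print(f"predicted PCB at age 5:        {ppm_pred:.2f} ppm")
```

Note that the back-transformation is applied to the prediction, not to the age: the square root is taken first, the line is evaluated, and only then is 10 raised to the result.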
R^2 = .786, which implies that 78.6% of the variation in the log10 PCB concentration is explained by the regression on the square root of the age of the trout.
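In simple linear regression R^2 is the square of the sample correlation, which agrees with the values reported here (.8866^2 is about .786). The sketch below shows how a least-squares line and its R^2 could be computed by hand in Python. The data are hypothetical (not the trout data), and the function names fit_line and r_squared are assumptions made for this illustration; JMP's Fit Line performs the equivalent calculation internally.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Proportion of variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_resid / ss_total

# Hypothetical data (illustration only, NOT the trout data).
age = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pcb = [0.7, 1.1, 1.4, 2.6, 3.1, 5.9, 7.4, 14.0, 17.5, 30.2]

# Transform both variables as in the text: sqrt on X, log10 on Y.
x = [math.sqrt(a) for a in age]
y = [math.log10(p) for p in pcb]

b0, b1 = fit_line(x, y)
r2 = r_squared(x, y, b0, b1)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, R^2 = {r2:.3f}")
```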

To examine a residual plot, select Plot Residuals from the Linear Fit pull-down menu located beneath the scatterplot. The resulting plot is shown below.

No assumption violations are evident from this plot. To assess normality of the residuals, first save the residuals from the fit by selecting Save Residuals from the Linear Fit pull-down menu. This will save the residual values to the original spreadsheet. Then select Analyze > Distribution to examine the distribution of the residuals and obtain a normal quantile plot. The results are shown below.
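As a rough numerical companion to the normal quantile plot, the residuals and their excess kurtosis could be computed as in the Python sketch below. Everything here is an assumption for illustration: the data points are hypothetical, the line uses the rounded coefficients quoted earlier, and the kurtosis formula is the simple moment-based estimator, which may differ from the bias-adjusted value a statistics package reports.

```python
def residuals(x, y, intercept, slope):
    """Observed minus fitted values for the line intercept + slope * x."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def excess_kurtosis(values):
    """Moment-based excess kurtosis (about 0 for normally distributed data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Hypothetical (sqrt(Age), log10(PCB)) pairs, illustration only.
sqrt_age = [1.00, 1.41, 1.73, 2.00, 2.24, 2.45, 2.65, 2.83]
log10_pcb = [0.05, 0.18, 0.45, 0.47, 0.70, 0.71, 0.92, 0.93]

# Rounded coefficients from the fitted line quoted in the text.
resid = residuals(sqrt_age, log10_pcb, -0.519, 0.521)
print("residuals:", [round(r, 3) for r in resid])
print(f"excess kurtosis: {excess_kurtosis(resid):.3f}")
```

With small samples this statistic is very noisy, so it supplements the quantile plot rather than replacing it.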

The residuals appear to be slightly kurtotic, but not too bad.

To obtain prediction and confidence intervals we need to fit the regression model using the Fit Model option from the Analyze menu. Put log10(PCB) in the Y box and sqrt(Age) in the Model Effects box. From the Fit Model results select Save Columns > Prediction Formula, Mean Confidence Interval (CI for the mean of Y when X = x), and Indiv Confidence Interval (CI for the Y value of an individual with X = x). To obtain these CIs in the original scale (ppm), add two columns to the spreadsheet that will take the results of a formula. Then use the JMP calculator and the function 10^x to create the following formulas:

This will transform the endpoints of the confidence interval for the mean back to the original ppm scale. Similar formulas could be used to convert the endpoints of the confidence interval for prediction of a single individual to the original scale.
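The back-transformation of the interval endpoints amounts to raising 10 to each endpoint. The Python sketch below illustrates this under stated assumptions: the function name is introduced here for illustration, and the endpoint values are hypothetical, not JMP output for the trout data.

```python
def back_transform_interval(lower_log10, upper_log10):
    """Convert interval endpoints from the log10(ppm) scale to ppm via 10^x."""
    return 10.0 ** lower_log10, 10.0 ** upper_log10

# Hypothetical endpoints on the log10(ppm) scale (illustration only).
mean_lower_log10 = 0.55
mean_upper_log10 = 0.74

mean_lower_ppm, mean_upper_ppm = back_transform_interval(
    mean_lower_log10, mean_upper_log10
)
print(f"CI on ppm scale: ({mean_lower_ppm:.2f}, {mean_upper_ppm:.2f})")
```

Because 10^x is an increasing function, the lower endpoint stays the lower endpoint after back-transformation. The resulting interval is generally not symmetric about the back-transformed prediction.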

A portion of the data spreadsheet containing these additional columns is shown below.

Plot of PCB vs. Age with Estimated Mean and CIs Added


You can use Graph > Overlay Plot to graph the original data points (PCB), the predicted values (Pred Orig), the lower confidence limit (Mean Lower), and the upper confidence limit (Mean Upper) in the same plot by placing these quantities in the Y box and either sqrt(Age) or Age in the X box. The plot above shows the results using Age for the X-axis. The Connect Points option has been selected and the Show Points option has been unselected for the predicted values and the confidence bands (right-click on the legend name for each of these quantities to obtain a menu from which these options can be specified).
