732A20 Data Mining and Statistical Learning

Department of Computer and Information Science

Computer lab 3: Principal component analysis and partial least squares regression

Learning objectives

The main objective of this computer lab is to make the student familiar with the concept of principal components and the major pros and cons of principal components regression (PCR) and partial least squares regression (PLS).

After completing the lab the student shall be able to:

(i)  Explain how principal components can be derived from a data matrix by computing eigenvectors of a covariance or correlation matrix.

(ii)  Interpret a score plot.

(iii)  Undertake a PCR and PLS regression in SAS Enterprise Miner and interpret the estimated parameters and performance measures.

(iv)  Use a cross-validation technique for model selection.

Recommended reading

Chapters 3.1 – 3.4.3 in Hastie et al.

Assignment 1: Dimension reduction and principal components

The Excel document ‘fluorescein.xls’ contains data regarding light reflection from a total of 30 surfaces treated with zinc, rhodamine, manganese etc. For each surface, measurements are undertaken for light representing 146 different wavelengths (channels), i.e., the data have 146 dimensions. However, the measured values for adjacent channels are strongly correlated, implying that the effective dimension of the measured values is much smaller than 146.

Your task is to use SAS Enterprise Miner to create a diagram for principal components analysis of the given data and to interpret the results of that analysis. For the sake of simplicity the channels for which there was no variation in the light reflection have been omitted.

Create a new diagram

Define a new diagram and draw a data flow with the nodes Input Data Source and Princomp/Dmneural.

Assign a data set to the Input Data Source

Inspect the light reflection data to be analysed by making a line plot (in Excel) of the data matrix in ‘fluorescein.xls’. Are the data representing adjacent channels (wavelengths) strongly correlated?
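Besides the visual check in Excel, the strength of the relationship between neighbouring channels can be quantified. The following Python/NumPy sketch is only an illustration: the array `spectra` is synthetic (smooth curves plus noise, an assumption made here) and stands in for the 30 x 146 data matrix, which in practice would be read from ‘fluorescein.xls’.

```python
import numpy as np

# Synthetic stand-in for the data matrix: 30 surfaces x 146 channels.
# Each row is a smooth bump plus a little noise (an assumption for
# illustration only; the real values come from 'fluorescein.xls').
rng = np.random.default_rng(0)
n_surfaces, n_channels = 30, 146
wavelength = np.linspace(0.0, 1.0, n_channels)
amplitude = rng.uniform(0.5, 2.0, size=(n_surfaces, 1))
peak = rng.uniform(0.3, 0.7, size=(n_surfaces, 1))
spectra = amplitude * np.exp(-((wavelength - peak) ** 2) / 0.1)
spectra += rng.normal(scale=0.01, size=spectra.shape)

# Correlation matrix between channels (columns are the variables).
corr = np.corrcoef(spectra, rowvar=False)

# Correlation of each channel with its right-hand neighbour:
# these values sit on the first off-diagonal of the matrix.
adjacent_corr = np.diag(corr, k=1)
print("median adjacent-channel correlation:", np.median(adjacent_corr))
```

Values close to 1 along the first off-diagonal indicate that neighbouring channels carry largely the same information.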

Import the Excel file ‘fluorescein.xls’ into SAS and assign the resulting SAS data set to the Input Data Source node. Define appropriate model roles.

Extract principal components

Open the Princomp/Dmneural node and select principal components analysis based on the covariance matrix of the given data. Run the node and examine the effective dimension of the given data, i.e., how many principal components are needed to explain most of the variation in the data.
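The calculation behind this step can be sketched outside SAS as follows. This is a minimal Python/NumPy illustration, not the Enterprise Miner implementation: the function name `pca_covariance`, the synthetic matrix `X` (driven by three latent factors) and the 99 % threshold are all assumptions chosen for the example.

```python
import numpy as np

def pca_covariance(X):
    """PCA by eigendecomposition of the sample covariance matrix.

    Rows of X are observations, columns are variables.
    Returns eigenvalues (descending), eigenvectors (as columns) and
    the proportion of total variance explained by each component.
    """
    Xc = X - X.mean(axis=0)                    # centre each variable
    cov = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negative round-off
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained

# Synthetic data (assumption): 30 observations, 146 variables,
# generated from 3 latent factors plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 3)) * np.array([5.0, 2.0, 1.0])
loadings = rng.normal(size=(3, 146))
X = latent @ loadings + rng.normal(scale=0.05, size=(30, 146))

eigvals, eigvecs, explained = pca_covariance(X)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 99 %.
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print("components needed for 99 % of the variance:", n_needed)
```

The cumulative proportions play the same role as the explained-variance output of the node: the effective dimension is the point at which adding further components contributes almost nothing.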

Draw a score plot

Add a Distribution Explorer node to the current workflow diagram. Make a plot of the observed data in the coordinate system defined by the first two principal components. (Use the Set Axis menu to assign the X and Y items to PRIN1 and PRIN2, respectively.) Can you identify distinct groups of objects in this new coordinate system?
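The coordinates shown in a score plot are the projections of the centred observations onto the first two eigenvectors. A minimal Python/NumPy sketch of that projection is given below; the two groups of surfaces are synthetic and hypothetical, and the column names PRIN1 and PRIN2 are used only to mirror the SAS output.

```python
import numpy as np

# Synthetic data (assumption): two hypothetical groups of 15 surfaces
# whose mean spectra differ in overall level.
rng = np.random.default_rng(2)
n_per_group, n_channels = 15, 146
base = np.sin(np.linspace(0.0, np.pi, n_channels))
group_a = base + rng.normal(scale=0.05, size=(n_per_group, n_channels))
group_b = 1.5 * base + rng.normal(scale=0.05, size=(n_per_group, n_channels))
X = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

# Eigenvectors of the covariance matrix, largest eigenvalue first.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top2 = eigvecs[:, ::-1][:, :2]

# Scores: column 0 corresponds to PRIN1, column 1 to PRIN2.
scores = Xc @ top2

for name, g in (("group A", 0), ("group B", 1)):
    mean_prin1, mean_prin2 = scores[labels == g].mean(axis=0)
    print(name, "mean PRIN1:", mean_prin1, "mean PRIN2:", mean_prin2)
```

Plotting the first column of `scores` against the second gives the score plot; in this synthetic example the two groups separate along PRIN1.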

Assign a new data set to the Input Data Source

Assign the SAS data set derived from the file ‘lakesurvey.xls’ to the Input Data Source node.

Extract and interpret the principal components

Extract principal components using the covariance and correlation matrices, respectively. Why are the principal components so different in the two cases? Is it possible to assign a physico-chemical meaning to the first two principal components derived from the correlation matrix?
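As a hint for the first question, the following Python/NumPy sketch compares the two analyses for two variables measured on very different scales. The variables `conductivity` and `ph` and their distributions are hypothetical and are not taken from ‘lakesurvey.xls’.

```python
import numpy as np

# Two hypothetical, correlated variables on very different scales.
rng = np.random.default_rng(3)
n = 200
conductivity = rng.normal(loc=300.0, scale=100.0, size=n)
ph = 7.0 + 0.002 * (conductivity - 300.0) + rng.normal(scale=0.2, size=n)
X = np.column_stack([conductivity, ph])

def principal_axes(matrix):
    """Eigenvalues (descending) and matching eigenvectors (columns)."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

cov_vals, cov_vecs = principal_axes(np.cov(X, rowvar=False))
corr_vals, corr_vecs = principal_axes(np.corrcoef(X, rowvar=False))

print("covariance:  first eigenvector", cov_vecs[:, 0],
      "share", cov_vals[0] / cov_vals.sum())
print("correlation: first eigenvector", corr_vecs[:, 0],
      "share", corr_vals[0] / corr_vals.sum())
```

With the covariance matrix the first component is almost entirely the high-variance variable, whereas the correlation matrix standardises both variables first and therefore weights them equally.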

Assignment 2: Principal components regression and partial least squares regression

The Excel file “tecator.xls” contains the results of a study investigating whether a near-infrared absorbance spectrum can be used to predict the fat content of meat samples. For each meat sample the data consist of a 100-channel spectrum of absorbance records and the levels of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The moisture, fat and protein levels are determined by analytical chemistry.
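The relation between absorbance and transmittance can be checked with a few lines of Python. The function name `absorbance` and the example transmittance values are illustrative assumptions; the data file already contains absorbance values.

```python
import math

def absorbance(transmittance):
    """Absorbance as -log10 of the transmittance (a positive fraction)."""
    if transmittance <= 0.0:
        raise ValueError("transmittance must be positive")
    return -math.log10(transmittance)

# If 10 % of the light is transmitted the absorbance is about 1,
# and if 1 % is transmitted it is about 2.
for t in (0.5, 0.1, 0.01):
    print(t, absorbance(t))
```

Lower transmittance therefore corresponds to higher absorbance, on a logarithmic scale.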

The worksheet “data” contains data from 215 samples of finely chopped meat. Your task is to establish PCR and PLS models in which the fat content is regarded as target and the absorbance levels recorded in the 100 channels are regarded as explanatory variables.
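To make the difference between the two methods concrete, the sketch below fits both models in Python/NumPy. It is a simplified illustration and not what proc PLS does internally: `fit_pcr` regresses the target on the leading principal component scores, `fit_pls1` uses a textbook NIPALS-style algorithm for a single target, and both function names and the synthetic data are assumptions introduced here.

```python
import numpy as np

def fit_pcr(X, y, n_comp):
    """Principal components regression with n_comp components.

    Returns (beta, intercept) expressed in the original variables.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Right singular vectors of centred X = eigenvectors of its covariance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_comp].T                       # p x n_comp loadings
    T = Xc @ V                              # scores
    gamma, *_ = np.linalg.lstsq(T, yc, rcond=None)
    beta = V @ gamma
    return beta, y_mean - x_mean @ beta

def fit_pls1(X, y, n_comp):
    """PLS regression for one target (NIPALS-style sketch).

    Each weight vector is chosen from the covariance between the
    (deflated) predictors and the target, unlike PCR, which ignores y
    when forming components.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    W = np.zeros((p, n_comp))               # weights
    P = np.zeros((p, n_comp))               # X loadings
    q = np.zeros(n_comp)                    # y loadings
    for k in range(n_comp):
        w = Xk.T @ yc
        w /= np.linalg.norm(w)
        t = Xk @ w                          # score vector
        tt = t @ t
        P[:, k] = Xk.T @ t / tt
        q[k] = yc @ t / tt
        W[:, k] = w
        Xk = Xk - np.outer(t, P[:, k])      # deflate X
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

# Synthetic example (assumption): 60 samples, 20 correlated channels.
rng = np.random.default_rng(5)
latent = rng.normal(size=(60, 2))
X = latent @ rng.normal(size=(2, 20)) + rng.normal(scale=0.05, size=(60, 20))
y = 15.0 + 3.0 * latent[:, 0] - latent[:, 1] + rng.normal(scale=0.3, size=60)

for name, fit in (("PCR", fit_pcr), ("PLS", fit_pls1)):
    beta, intercept = fit(X, y, 2)
    residual = y - (intercept + X @ beta)
    print(name, "training root ASE:", np.sqrt(np.mean(residual ** 2)))
```

When all components are retained, both methods reproduce the ordinary least squares fit; the methods differ only in how they order and truncate the components.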

Run proc PLS for a PLS regression and a PCR analysis

Import the worksheet ‘data’ in ‘tecator.xls’ into SAS. Open the log window of SAS and check that the file has been successfully imported.

Open the Editor window and write SAS code in which proc PLS first performs a PLS regression and then a PCR analysis. Use cross-validation for model selection, splitting the entire data set into two parts (a training set and a test set). Compute the Root ASE (root average squared error) for the test set.
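The logic of split-sample validation and the Root ASE can be sketched in Python/NumPy as follows. This is not SAS code and makes several assumptions: the data are synthetic (215 samples, 100 channels, three latent factors), the split is a random half-and-half division, only PCR is shown, and `root_ase` and `pcr_predict` are helper names introduced here.

```python
import numpy as np

def root_ase(y_true, y_pred):
    """Root average squared error between observed and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def pcr_predict(X_train, y_train, X_new, n_comp):
    """Fit PCR on the training part and predict for new observations."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_comp].T
    gamma, *_ = np.linalg.lstsq(Xc @ V, y_train - y_mean, rcond=None)
    return y_mean + (X_new - x_mean) @ V @ gamma

# Synthetic stand-in for the tecator data (assumption).
rng = np.random.default_rng(4)
n, p = 215, 100
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + rng.normal(scale=0.05, size=(n, p))
y = 20.0 + latent @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=n)

# Split the data set into two parts.
perm = rng.permutation(n)
train, test = perm[: n // 2], perm[n // 2:]

# Root ASE on the held-out part for 1 to 10 components.
results = {
    k: root_ase(y[test], pcr_predict(X[train], y[train], X[test], k))
    for k in range(1, 11)
}
best_k = min(results, key=results.get)
for k, value in results.items():
    print(k, "components: root ASE =", round(value, 3))
print("number of components with the lowest held-out root ASE:", best_k)
```

The same comparison can be made for PLS by replacing the fitting function; the number of components with the lowest held-out error is the natural candidate for the final model.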

Run Enterprise Miner for an ordinary least squares regression

Run the Regression node in Enterprise Miner to undertake an ordinary least squares regression using the same training and test sets as in the previous tasks. Use forward, backward and stepwise regression for model selection and note the Root ASE for the test set.
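The idea of forward selection can be illustrated with the simplified Python/NumPy sketch below. It differs from the Enterprise Miner Regression node in ways that are assumptions of this example: variables are added greedily according to the training sum of squared errors, and the number of steps `max_vars` is fixed in advance instead of being governed by an entry criterion. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares coefficients; element 0 is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]

def forward_selection(X, y, max_vars):
    """Greedily add the variable that most reduces the training SSE."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_vars):
        best_j, best_sse = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef = ols_fit(X[:, cols], y)
            sse = np.sum((y - ols_predict(coef, X[:, cols])) ** 2)
            if sse < best_sse:
                best_j, best_sse = j, sse
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data (assumption): only columns 2 and 7 affect the target.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 1.0 + 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=100)
train, test = np.arange(0, 70), np.arange(70, 100)

chosen = forward_selection(X[train], y[train], max_vars=2)
coef = ols_fit(X[train][:, chosen], y[train])
pred = ols_predict(coef, X[test][:, chosen])
test_root_ase = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
print("selected columns:", chosen, "test root ASE:", test_root_ase)
```

Backward elimination works in the opposite direction, starting from all variables and removing one at a time, and stepwise selection allows both additions and removals at each step.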

Compare and comment on the results obtained for the different regression models. Which model would you prefer?

To hand in

Assignment 1: The score plot and your interpretation of that plot. Your explanation of why the covariance matrix and the correlation matrix produce different eigenvectors and eigenvalues.

Assignment 2: A table of Root ASE values of your models and your interpretation of the obtained values.