Principal Component Analysis

Constrained Tucker3 models

MATLAB exercise

1. Introduction

Sometimes, the PARAFAC model does not converge to a good solution. One reason for this is that the PARAFAC model is only appropriate when the ranks of each mode of the multiway array are the same. However, this may not always be the case, e.g. when two compounds have nearly identical spectra or when the variation between different samples is low. In this case, the Tucker model is better.

Unlike the PARAFAC model, the Tucker model has rotational freedom which means that, like in PCA, the loadings do not necessary represent the real profiles (e.g. concentrations and spectra can be negative). The use of constraints can help make the loadings give useful information. In this exercise we will learn how to

recognize when a PARAFAC model is not appropriate
build a Tucker3 model
apply appropriate constraints in order to improve model interpretability

In order to make this exercise clear, a simulated data set representing some chromatographic data is used. After doing this exercise, you can try the Tucker model and applying constraints to your own real data!

2. Load and view the data

First of all, go the ‘multiway’ directory and then add The N-way Toolbox and the data files to your MATLAB path:

addpath nway210

addpath data

Then you can load and view the data:

load simul

Use the whos command to see what variables are currently in memory. You should find an array, X (253040). In terms of chromatography, this data set has dimensions sample number  time  wavelength. Plot some of samples in order to get an idea of how the data looks, .e.g.

figure

surf(squeeze(X(3,:,:)))

3. Build a PARAFAC model

Before we use the Tucker3 model, let’s first try and build a 3-component PARAFAC model on the data:

[Factors,it,err,corcondia]=parafac(X,3);

What percentage of the data is explained by the model? You may notice that the routine output gives a warning that some factors are highly correlated. You may also notice that the core consistency diagnostic is very low. Is something wrong with this model?

Plot the loadings:

figure

plotfac(Factors);

Do the loadings give nice information about the chemical components? Do you notice that some components are correlated? This is bad news, as it means that the model parameters cannot be determined properly.

Comment: Correlated loadings

High correlation (positive or negative) between loadings is a sign of rank deficiency in the data. This means that you are trying to take more latent variables than are actually present in the data. As a consequence, the model is poorly defined.

Why doesn’t PCA suffer from this problem? It is because in PCA, the loadings are orthogonal and so always have a correlation of zero.

4. Rank analysis

It seems that the PARAFAC model is not appropriate for this data. In order to find out which model is appropriate for this data, we will unfold along each of the three modes and estimate the psuedorank.

First, unfold the data to XIJK as follows:

X1=reshape(X,25,1200);

Now calculate the eigenvalues of this matrix:

eig(X1*X1’)

How many eigenvalues seem to be significant? This gives an idea of the pseudorank of the first (sample number) mode.

Did you know?

Pseudorank

The rank of a two-way matrix is given by the number of non-zero eigenvalues. When a matrix contains noise, the rank is equal to the number of rows (or columns, whichever is smaller). This means that matrices which contain noise are usually full rank. For example, if you type rank(X1) for the data above, the answer will be 25 – the number of rows in X1.

In chemometrics, we are not usually interested in modelling noise, and so those eigenvalues which are very small are usually ignored. The term pseudorank or chemical rank means the number of eigenvalues which are significant (i.e. do not describe noise). Sometimes the choice of how many eigenvalues are significant is subjective, but for the data used in this exercise it should be clear.

Now we will unfold the data to XJKI by permuting the three-way array first and then unfolding:

X2=permute(X,[2 3 1]);

X2=reshape(X2,30,1000);

eig(X2*X2’)

What is the pseudorank of the second (time) mode?

Finally, the third (wavelength) mode:

X3=permute(X,[3 1 2]);

X3=reshape(X3,40,750);

eig(X3*X3’)

Now consider the pseudoranks of each mode. What type of Tucker3 model is appropriate here? Tucker3 (3,3,3)? Tucker3 (4,2,3)? Tucker3 (1,3,3)?

5. Building a Tucker3 model

Type help tucker to get an idea of how to use the Tucker3 routine. You will see that there are many options which can be used. To start with, we will build the most simple Tucker3 (1,3,3) model:

[Factors,G,ExplX,Xm]=tucker(X,[1 3 3]);

Now plot the loadings:

figure

plotfac(Factors)

There is now no problem with highly correlated loadings. A high percentage of the variation in the data is explained. However, the loadings are still not easy to interpret, because Tucker models, like PCA, have rotational freedom.

6. Using constraints

What do we know about chromatographic data?

Each compound gives one peak as it elute from the separation column. The time mode loadings should be unimodal and non-negative.
Each compound had a characteristic spectrum which cannot by negative. The wavelength mode loadings should be non-negative.
The Lambert-Beer law should apply. Spectral intensity is linearly related to compound concentration and each compound is independent of the others. The core array, G (1 3 3), when unfolded, should look like this: G = [g111 0 0 0 g122 0 0 0 g133].

Now we can build a new, constrained Tucker3 model which uses this information:

[Factors,G,ExplX,Xm]=tucker(X,[1 3 3],[],[2 4 1],3,[],[1 0 0 0 1 0 0 0 1]);

Now plot the loadings again:

figure

plotfac(Factors)

Do they now make more sense?

Of course, this is only a set of simulated data, but the use of constraints can also be very useful for many different types of real data.

7. Conclusions

In this exercise, we have seen that the combination of the Tucker3 model and appropriate constraints can give models which provide insight into chemical data. You should have learnt how to

perform a rank analysis on each mode of a multiway data set
build a Tucker3 model
apply appropriate constraints in order to improve model interpretability