A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations
Supplementary material
by
Harri Kiiveri
CSIRO Mathematical and Information Sciences
The Leeuwin Centre, Floreat, Western Australia
Abstract
Background
With the advent of high throughput biotechnology data acquisition platforms such as micro arrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem
akin to finding needles in a haystack.
Results
The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to), multiclass logistic regression (classification), generalised linear models (regression) and survival models. A rapid algorithm for finding sparse well fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions.
Conclusion
The method described in this paper enables existing work on response models when there are less variables than observations to be leveraged to the situation when there are many more variables than observations. The method is a powerful approach to finding parsimonious models for datasets with many more variables than observations. The method is capable of handling problems with millions of variables and makes it possible to fit almost any statistical model with a linear predictor in it to data with more variables than observations. The method compares favourably to existing methods such as support vector machines and random Forrests, but has the advantage of not requiring separate variable selection steps. It is also capable of handling data types which these methods were not designed to handle
Note: In the following, section titles are the same as those of the paper which refer to supplementary information.
BACKGROUND
Some insights from a simple linear regression example
Following Fan and Li (2001), to develop some insight into the algorithm we consider a linear regression model
1)
where X is N by p, full rank, with p N and. A Taylor expansion about the point shows that
2)
where denotes the variance of. For orthogonal (22) becomes
3)
where is the ith diagonal element of. From the form of (3) we see that P can be minimised by independently minimising each term in the sum. When is derived from a hierarchical prior as in (1), we can write
4)
where denotes conditional expectation. The derivation of the second term in (4) is given in Appendix 3. Note that this term could be regarded as a pseudo T-statistic.
Figure 1 gives a plot of a single component of (3) with the normal gamma prior,
and the usual regression estimate is =4.
Figure 1: Plot of penalised likelihood function for single parameter and=4
From (4) we get the MAP (penalised likelihood) estimate satisfying the equation
. 5)
Since the numerator is greater than one this implies that the penalised likelihood estimate is shrunken towards zero when compared to the usual regression estimate.
Threshholding
Equation (4) can be written as
6)
where. Note that h is an odd function. A plot of this function for several values of k and is given in Figure 2 below.
Figure 2: Plot of function h for various values of and k
In general, if the function h has a minimum at with the property that , then from (26) whenever the minimum of P occurs at since P is increasing for and decreasing for . Hence there is a threshold at
and absolute values of the regression estimate below this value give a penalised likelihood estimate of zero. This is the case for the NG prior. Figure 3 below plots the thresholding function for several values of k and.
Figure 3: Plot of thresholding functions for the normal gamma prior
The threshold values for the three cases in the above plot are 2, approximately 3.32 and 2 respectively. Hence for the case k=0 and, thresholding occurs whenever the regression estimate is less than 2 in absolute value.
Concerning initialisation of the algorithm for this linear regression case, from Figure 1 we can see that starting from the global maximum of the likelihood function (the regression estimate) an optimisation algorithm with a sensible line search will find the relevant minimum of minus the penalised likelihood function when the regression estimate is larger than the threshold value. A similar figure for the case that the regression estimate is less than the thresholding value shows that the minimiser of the penalised likelihood function will converge to zero.
Note that whilst the above discussion gives some insight into what the general algorithm is doing, the case of p>n is more complex as there are potentially many full rank subsets of the columns of and the columns of will not be orthogonal.
CONTEXT AND METHODS
Alternative sparsity priors or penalties
Fan and Li (2001) discuss three desirable properties of thresholding functions, namely unbiasedness, sparsity and continuity, and suggest a penalty function, the smoothly clipped absolute deviation (SCAD) penalty which has these three properties. The normal gamma prior (penalty) has thresholding functions which typically satisfy their first two properties but not the third. However, experience with numerous examples has shown that this set of penalty functions works well for problems with p>n. When there are strong effects present in the data, parameter estimates tend not to be close to thresholding values where the discontinuity lies.
Using the results in Appendix 3, the SCAD penalty function can be converted to a nominal conditional expectation “equivalent” using
7)
where denotes minus log of the SCAD penalty function. The algorithm described in the paper can then be used to fit models with the SCAD penalty. For numerical reasons this “expectation” needs to be truncated below some small value.
Fan and Li also describe an algorithm for computing estimates from penalised likelihood functions. When the penalty function is derived from a hierarchical prior, their algorithm is an EM algorithm.
An alternative to the NG prior and the SCAD penalty is discussed in Griffin and Brown (2005).
RESULTS
Algorithm initialisation
Let h be the link function in a generalised linear model. If p £ N compute initial values using
(8)
where the ridge parameter l satisfies 0 < l £ 1 (say) and z is small and chosen so that the link function h is well defined at y+z. If p > N, to keep matrix sizes manageable, Equation (8) can be rewritten using a familiar matrix identity in Rao (1973). In this case compute initial values b(0) by
. (9)
For example, for logistic regression, could have the value -0.001 when y=1 and +0.001 when y=0. In this case the function g is the logit function. The rationale for the perturbation is that the global maximum of the likelihood function would have probabilities estimated as one when y is one and zero when y=0. Unfortunately the logit function is undefined at these values so we perturb them slightly and then do a ridge regression to find a value of beta which gives probabilities close to these values.
Writing for the ridge regression predictor of y and using the singular value decomposition we can show that
where are the singular values of X and . If X is full rank, lambda can be chosen in the above expression to make the norm small. For example, given “small”, writing , and then implies that .If the singular values are large then a value of 1 for lambda will often be sufficient.
Note that the initial value calculations can be performed by inverting matrices of the size min(n,p). The matrix identity used in deriving (9) is also useful during the M step to avoid the calculation and inversion of large matrices. Considering that we will apply this algorithm to data sets with millions of variables, this is an important point.
Implementing multiclass logistic regression
To implement the algorithm for a particular model simply requires expressions for the first two derivatives of the likelihood function. In this section we present some details which enable the implementation of the algorithm for the multiclass classification problem.
Consider a set of N individuals classified into G2 classes with associated explanatory variables measured in the N by p matrix X with ith row. Here the response y is a vector of class labels for each individual. Define
and
where denotes the probability that individual i is in class g.
For the multiclass logistic regression model we write
(10)
From (10) the log likelihood function is
(11)
Writing , from (11) it is easy to show that
(12)
Setting, the above expressions can be used to implement the algorithm in Section 2. In practice we use a block diagonal approximation to the second derivative matrix as this allows the ascent directions in the M step to be calculated sequentially over the classes without storing matrices larger than N by p. Our experience has been that this approximation has worked very well over a range of problem sizes.
To implement the general algorithm for other likelihood functions simply requires the calculation of the two derivatives of the likelihood function in a similar manner as above.
Smoking data example
Predicted classTrue class / Smoker / non smoker
smoker / 33 / 1
non smoker / 3 / 20
Table s.1: Cross validated misclassification table (k=0, δ=0)
b - scale parameter in prior ()k / 0.01 / 0.1 / 1 / 10 / 100 / 1000 / 1e+04 / 1e+05 / 1e+06 / 1e+07
0 / 0.088 / 0.053 / 0.070 / 0.070 / 0.070 / 0.053 / 0.053 / 0.088 / 0.070 / 0.070
0.1 / 0.053 / 0.070 / 0.053 / 0.088 / 0.070 / 0.088 / 0.088 / 0.070 / 0.088 / 0.088
0.2 / 0.070 / 0.088 / 0.070 / 0.070 / 0.088 / 0.088 / 0.088 / 0.123 / 0.053 / 0.105
0.3 / 0.053 / 0.070 / 0.088 / 0.070 / 0.053 / 0.035 / 0.088 / 0.070 / 0.070 / 0.070
0.4 / 0.035 / 0.070 / 0.053 / 0.070 / 0.088 / 0.053 / 0.053 / 0.070 / 0.070 / 0.053
0.5 / 0.035 / 0.053 / 0.070 / 0.035 / 0.070 / 0.053 / 0.070 / 0.035 / 0.053 / 0.053
0.6 / 0.088 / 0.035 / 0.053 / 0.053 / 0.070 / 0.053 / 0.070 / 0.035 / 0.035 / 0.035
0.7 / 0.035 / 0.053 / 0.035 / 0.035 / 0.035 / 0.053 / 0.053 / 0.035 / 0.035 / 0.035
0.8 / 0.035 / 0.035 / 0.053 / 0.053 / 0.053 / 0.088 / 0.053 / 0.018 / 0.053 / 0.035
0.9 / 0.035 / 0.035 / 0.035 / 0.053 / 0.035 / 0.035 / 0.018 / 0.035 / 0.018 / 0.035
1 / 0.035 / 0.035 / 0.035 / 0.035 / 0.035 / 0.035 / 0.018 / 0.018 / 0.018 / 0.035
Tables s.2: Cross validated error rates for a grid of b and k values
Leukaemia example
PREDICTED CLASSTRUE CLASS / T-ALL / E2A-PBX1 / MLL / BCR-ABL / TEL-AML / HYPERDIP
T-ALL / 12 / 0 / 2 / 0 / 0 / 1
E2A-PBX1 / 0 / 18 / 0 / 0 / 0 / 0
MLL / 0 / 0 / 17 / 0 / 0 / 0
BCR-ABL / 1 / 0 / 0 / 19 / 0 / 0
TEL-AML / 0 / 0 / 0 / 0 / 20 / 0
HYPERDIP / 1 / 0 / 0 / 0 / 0 / 13
Table s.3: Cross validated misclassification matrix for leukaemia data (k=0, δ=0)
Data
Links to original data and associated R workspaces:
1. Smoking data
The raw data can be found at http://pulm.bumc.bu.edu/aged/download.html
2. Prostate cancer data
The raw data can be found at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6099
by clicking on the Series Matrix File(s) link.
3. Leukaemia data
Primary data is available at http://www.stjuderesearch.org/data/ALL3
4. SNP data
The raw data can be found at http://genome.perlegen.com/browser/download.html
5. Follicular lymphoma survival data
The raw data is available for download from http://llmpp.nih.gov/FL/
Appendices
Appendix 1: Derivation of equation 7
Using the same notation and equation numbering as in the paper, from equation (6), namely
we have
Noting that is a normal distribution and ignoring terms which do not involve or this last expression can be written as
and Equation (7) follows on noting that
.
Appendix 2: Derivation of the conditional expectation in the E step
Suppose has a gamma distribution with shape parameter k and scale parameter b i.e.
.
From the definition of the conditional expectation, writing, we get
(A.1)
Rearranging, simplifying and making the substitution gives
(A.2)
Then using the result
(A.3)
where K denotes a modified Bessel function, see Watson (1966), we obtain