European Journal of Human Genetics -

Supplementary Materials for Pathway-based identification of SNPs predictive of survival

Herbert Pang, Michael Hauser, Stéphane Minvielle

Contents: Page

A) Survival Support Vector Machine2

B) Conditional inference survival forests2

C) Random Survival Forest - Conserve and Random split criteria3

D) Cox Boosting4

E) Unadjusted P-values5

F) Table 5 - SNPs in LD with SNPs in Table 45

G) Figure 3 - Pairwise r^2 of top genes in Stress Induction of HSP Regulation Pathway6

H) Table 6 - Prediction results for Stress Induction of HSP Reg. with r^2 > 0.8 restriction7

I) Table 7 - Prediction results for Cytokine Network with r^2 > 0.8 restriction7

J) Table 8 - Top genes for Stress Induction of HSP Regulation and their importance ranking8

K) Table 9 - Top genes for Cytokines Network and their importance ranking8

L) Notes associated with Tables 8 and 99-10

M) Table 10 - Univariate Cox Regression Results for HSP Regulation Pathway11

N) Table 11 - Univariate Cox Regression Results for Cytokine Pathway12

O) Notes associated with Tables 10 and 1113

Survival Support Vector Machine

The goal of the support vector machine for survival is to find a non-linear hyperplane that separates the individuals who had events before time t and those without an event at that point. The non-linear relationship is obtained from a reproducing kernel Hilbert space with Gaussian kernel of width 1. This procedure is repeated at every event time. As in soft-margin support vector machines, the survival counterpart allows for cases when the hyperplane do not exist by penalizing observations that lie on the different side of the margin. The minimization problem also involves a sum that is based on the concordance index (Evers and Messow, 2008). The penalty term lambda is set to 1. The optimization problem is a positive-definite quadratic problem and is solved by using sequential minimal optimization algorithm.

Conditional inference survival forests

The conditional inference forests, cforest implemented in R differs from original random forests in two ways. The base learners used are conditional inference trees (Hothorn et al., 2006). Also, the aggregation scheme works by averaging observation weights extracted from each of the trees built instead of averaging predictions as in the original version.

CIFs consists of the construction of many conditional inference trees, which are built using a regression framework by binary recursive partitioning algorithm. This algorithm has three main steps. First, variable selection by testing the global null hypothesis of independence between any of the predictors and the survival outcome. Second, if the hypothesis cannot be rejected, then terminate the algorithm, otherwise select the predictor with the strongest association to the survival outcome. Third, Implement a binary split based on split criteria using the selected predictors. Finally, recursively repeat the steps.

For the case of censored data, the association is measured bya P-value corresponding to the following linear test statisticof a single predictor and the survival outcome:

where 1(in node) = 1 if the individual isin node, 0 otherwise. Xji is the j-th covariate for the i-thsubject. ai is the log-rank score (Hothorn and Lausen, 2003) of subject i defined as follows:

Let X(1)X(2)···X(n) be the set of orderedpredictors, Yi be the response for individual i, and for eachsurvival time of observation i,and ai is as defined in the main manuscript. Since the distribution of Tj is unknown in most situations,permutation tests are employed. The conditional expectationµj and covariance j under the null hypothesis were derivedby Strasser and Weber (1999). A standardized form canbe obtained and the default in the cforest algorithm is in quadraticform Q = (t – µ)+ (t – µ)T, where tis the observed linear test statistics and + is the Moore-Penroseinverse of . The algorithm terminates if P-valuefalls below one minus the mincriterion as prespecified.

A split is implemented once the predictor has been selected from Step 1) described above and that it meets its criterion. Let XJ be the predictor J that was chosen, and let Rk be one of all possible subsets of the sample space of the predictor XJ. The linear statistic

,

which measures the difference between the samples

for i=1,...,n and and for i=1,...,n and where

if the individual is in node, 0 otherwise;

if XJi R, 0 otherwise;

and ai is the log-rank scores of subject i as defined above.

The best split BK is chosen from B such that

for all possible subsets BK. A split is established when the sum of the weights in both child nodes is larger than a pre-specified minsplit. This helps avoid splits that are borderline and reduces the number of subsets to be evaluated. One advantage as stated by the cforest authors is that the approach taken ensures that the right sized tree is grown without the need of pruning.

Random Survival Forest

Conserve split criterion

Another type of splitting rule is the conservation of events (Naftel et al.,1985). Denote the Nelson-Aalen cumulative hazard estimator for child j as:

where ti,j are the ordered event times for child j.

The conservation of events asserts that the total number of events is conserved in each child, i.e.

where is the censoring indicator.

Let be the ordered time points for child j, and be the corresponding censoring indicator for for k=1,...,nj

where

The Con(X,c) measures whether the two groups are well separated and (1+Con(X,c))-1 is used in the program It finds the best split by finding children closest to the conservation of events principle.

Random split criterion

The last splitting rule is "random" which implements a purely random uniform splitting. For each node, a variable is randomly selected from a random set of m variables. For a chosen split variable X, a random split point is chosen among all possible split points on that variable.

Cox Boosting

Given a set of outcome and predictors, the goal is to approximate the outcome with a function. To optimize a loss function, Cox boosting proceeds as follows. First, initialize the function to offset. Second, compute the residuals defined as the negative partial log-likelihood for Cox models . Third, fit the negative gradient vector by a base procedure, in this case, a component-wise univariate linear model as described inBuhlmann and Hothorn (2007). Fourth, update the function by taking a step of size 0 to 1 of the base learner. The second to fourth steps are repeated until a predetermined number of iterations is completed. The boosting algorithm is based on functional gradient descent, for more details please refer to Friedman (2001).

Unadjusted P-values corresponding to Table 3 in main manuscript.

Methods / p-val <0.05 / p-val <0.025 / p-val <0.01 / p-val <0.001 / p-val <0.0001
Random Survival Forest (log rank score split) / 17 / 10 / 4 / 2 / 1
Random Survival Forest (log rank split) / 23 / 12 / 7 / 1 / 0
Conditional Inference Forest / 39 / 23 / 11 / 1 / 0
Cox Boosting / 26 / 12 / 7 / 0 / 0

Table 5 - SNPs in linkage disequilibrium (r^2 > 0.8) with SNPs in Table 4 within the same pathway

Pathway / Gene / chr / dbSNP ID / In LD with dbSNPIDs / r^2
Stress Induction of HSP Regulation / CASP3 / 4 / rs4647669 / None
Stress Induction of HSP Regulation* / BCL2 / 18 / rs4941195 / rs7240326
rs4941187
rs12457700
rs7228914 / 0.87
0.86
0.90
0.89
Stress Induction of HSP Regulation* / BCL2 / 18 / rs1381548 / None
Stress Induction of HSP Regulation* / BCL2 / 18 / rs10503078 / None
Stress Induction of HSP Regulation* / BCL2 / 18 / rs4987839 / None
Cyctokine Network / IL18 / 11 / rs7106524 / None
Cyctokine Network / IL15 / 4 / rs4956404 / rs360718 / 0.87
Cyctokine Network / IL5/IRF1 / 5 / rs739718 / rs3181224 / 0.89
Cyctokine Network / IL12A / 3 / rs640039 / rs2115176 / 0.95

* Please refer to Supplementary Figure 3

Figure 3

Pairwise r^2 - Top genes from Table 5 in Stress Induction of HSP Regulation Pathway on chromosome 18

Table 6 - Table of prediction results from ten independent 10-fold CV runs for Stress Induction of HSP Regulation with the restriction that SNPs with r^2 > 0.8 are not allowed in the same CV.

10-fold cross validation / p-value
#1 / 2.7e-10
#2 / 1.71e-08
#3 / 1.24e-06
#4 / 2.69e-10
#5 / 1.38e-09
#6 / 6e-08
#7 / 6.28e-11
#8 / 5.35e-10
#9 / 3.45e-10
#10 / 1.64e-07

Table 7 - Table of prediction results from ten independent 10-fold CV runs for Cyctokine Network with the restriction that SNPs with r^2 > 0.8 are not allowed in the same CV.

10-fold cross validation / p-value
#1 / 3.96e-09
#2 / 1.06e-10
#3 / 8.83e-10
#4 / 4.56e-09
#5 / 1.07e-11
#6 / 6.93e-08
#7 / 1.8e-09
#8 / 3.47e-14
#9 / 1.3e-08
#10 / 3.71e-07

Table 8 - Top genes in Table 4 for Stress Induction of HSP Regulation and their importance ranking when restriction that SNPs with r^2 > 0.8 are not allowed in the same forests is imposed.

Forest / rs4647669 / rs4941195 / rs1381548 / rs10503078 / rs4987839
CASP3 / BCL2 / BCL2 / BCL2 / BCL2
Original rank / 1 / 2 / 3 / 4 / 5
#1 / 2 / n/a / 5 / 13 / 4
#2 / 1 / n/a / 4 / 6 / 5
#3 / 2 / n/a / 1 / 14 / 6
#4 / 1 / n/a / 2 / 9 / 4
#5 / 4 / n/a / 3 / 2 / 7
#6 / 3 / 1 / 5 / 4 / 7
#7 / 1 / n/a / 3 / 6 / 7
#8 / 1 / n/a / 2 / 5 / 13
#9 / 9 / n/a / 1 / 3 / 5
#10 / 2 / 1 / 3 / 5 / 7

n/a= when that SNP was not selected in building that forest.

Table 9 - Top genes in Table 4 for Cytokine Networkand their importance ranking when restriction that SNPs with r^2 > 0.8 are not allowed in the same forests is imposed.

Forest / rs7106524 / rs4956404 / rs739718 / rs640039
IL18 / IL15 / IL5/IRF1 / IL12A
Original rank / 1 / 2 / 3 / 4
#1 / 1 / n/a / 9 / n/a
#2 / n/a / 1 / n/a / n/a
#3 / n/a / 1 / 6 / 2
#4 / n/a / 2 / 10 / n/a
#5 / n/a / 1 / n/a / 2
#6 / 1 / n/a / 7 / n/a
#7 / n/a / n/a / n/a / 1
#8 / 1 / n/a / 5 / n/a
#9 / 2 / n/a / n/a / 1
#10 / 1 / n/a / 6 / 2

n/a = when that SNP was not selected in building that forest.

We discuss the biological plausibility of IL5/IRF1 here as this gene has dropped out of the top 4 when we account for LD. The IRF1 gene, localized on chromosome 5q31.1, is mutated in several cancers (Cavalli et al. (2010)). These authors have discovered that low IRF1 expression is strongly correlated with both risk of recurrence and risk of death which suggest a tumor suppressor role for the gene (Amiel et al. (1999)). For stable melphalan-treated multiple myeloma patients, 5q31.1 is a critical region that is consistently affected in deletion of 5q and it contains many genes associated with hematopoiesis including IRF1 (Amiel et al. (1999)). Furthermore, IL5has been hypothesized as one of the factors that can influence terminal differentiation to memory or plasma cells in B cell lymphoid malignancies, such as multiple myeloma (Barker et al. (2000)). In relation to the genomic region 5q31.1 we discovered, researchers have found that hyperdiploidy with 5q31 gain is a distinct entity that drives a more favorable prognosis (Avet-Loiseau et al. (2009)).

Notes associated with Tables 8 and 9 (Cont'd from main manuscript Results - LD section)

Second, we investigated whether this affects the ranking of important SNPs. We ran ten random forests with 3000 trees with the restriction that SNPs with r2 > 0.8 are not allowed in the same run. Tables 8 and 9 show the original rank of the important SNPs and their ranks in the different random forests for Stress Induction of HSP Regulation and Cytokine Network, respectively. For Stress Induction of HSP Regulation, the top ranked SNPs remain within +/- 2 ranks for all the top 5 SNPs at least 70% of the time. For Cytokine Network, SNPs rs7106524, rs4956404 and rs640039 remain the top 2 SNPs across different runs. However, rs739718 is less significant than the original rank of 3, but remains in top 10. Indeed, consistent with what was found in the classification setting, the ranks do differ when SNPs with high LD are removed, but not too severely. Given the variations and depending on how the SNPs within a pathway are correlated, we recommend investigating the ranks for significant pathways just like our presentation here.

Stress Induction of HSP Regulation is an example in which a single gene with many associated independent SNPs is identified as important for predicting survival. Four out of the five top SNPs are associated with the gene BCL2. As seen from Figure 3 in Supplementary materials, the top SNPs on chromosome 18 corresponding to BCL2 identified in Table 4 have r2 of less than .4 among them. Again, we point out that this does not affect the prediction as described in the previous paragraph, but this may affect the ranking of important SNPs. Table 5 in Supplementary materials show the list of SNPs that have high r2 with the important SNPs identified in Table 4.

Table 10 - Univariate Cox Regression on Survival Outcomefor Stress Induction HSP Regulation Pathway

Table 11 - Univariate Cox Regression on Survival Outcome for Cytokine Pathway

Notes associated with Tables 10 and 11

In Tables 10 and 11 of Supplementary Materials, we presented the single SNP survival association based on univariate Cox Proportional Hazard Regression since it does not make sense to look at single associations using random forests which is an ensemble method for non-parametric multivariate modeling. One of the motivations for using a pathway-based approach is similar to the motivation given in the study by Mootha et al.; a study in which the authors have not been able to identify genes predictive of binary outcome using a single gene based approach, but were able to identify a statistically and biologically significant pathway using gene set based methods (Mootha et al. (2003)). As pointed out by several authors, regression has been shown to be more powerful than single-SNP approach in the case control setting (Chapman et al. (2003), Tzeng et al. (2006), Ballard et al. (2009)). The ability to discover high order and non-linear effects is another motivation for using random forests for case-control GWAS study in Goldstein et al (2010).

References

Amiel A, Fridman K, Elis A et al: Deletion 5q31 in patients with stable, melphalan-treated multiple myeloma. Cancer Genet Cytogenet 1999; 113:45-8.

Avet-Loiseau H, Li C, Magrangeas F et al: Prognostic significance of copy-number alterations in multiple myeloma. J Clin Oncol 2009; 27:4585-90.

Ballard DH, Aporntewan C, Lee JY, Lee JS, Wu Z, Zhao H: A pathway analysis applied to Genetic Analysis Workshop 16 genome-wide rheumatoid arthritis data. BMC Proc 2009: 3 Suppl 7:S91.

Barker J, Verfaillie CM: A novel in vitro model of early human adult B lymphopoiesis that allows proliferation of pro-B cells and differentiation to mature B lymphocytes. Leukemia 2000; 14:1614-20.

Buhlmann P, Hothorn T. 2007. Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505.

Carter K., McCaskie P., and Palmer L.. 2006. JLIN: A Java based Linkage Disequilibrium plotter. BMC Bioinformatics 7:60.

Cavalli LR, Riggins RB, Wang A, Clarke R, Haddad BR: Frequent loss of heterozygosity at the interferon regulatory factor-1 gene locus in breast cancer. Breast Cancer Res Treat 2010; 121:227-31.

Chapman J, Cooper J, Todd J, Clayton D: Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and determinants of statistical power. Hum Hered 2003; 18-31.

Evers L, Messow CM. Sparse kernel methods for high-dimensional survival data. Bioinformatics. 2008 Jul 15;24(14):1632-8.

Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statist 29, 1180.

Goldstein B, Hubbard A, Cutler A, Barcellos L: An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genet. 2010; 11:49.

Hothorn T, et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. (2006a) 15:651–674.

Hothorn T, Lausen B. On the exact distribution of maximally selected rank statistics. Comput. Stat. Data Anal. (2003) 43:121–137.

Mootha V, Lindgren C, Eriksson K: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 2003; 34:267-273.

Naftel D, et al. Conservation of events (1985) unpublished.

Strasser H, Weber C. On the asymptotic theory of permutation statistics. Math. Methods Stat. (1999) 8:220–250.

Tzeng J, Wang C, Kao J, Hsiao C: Regression-based association analysis with clustered haplotypes through use of genotypes. Am J Hum Genet 2006; 78:231-241.

1