Supplementary Methods
1. Detailed methodology for the study of transferability of signature genes
1.1 Evaluating the transferability of signature genes using the Nearest Centroid (NC) and K-Nearest Neighbor (KNN) methods
Figure 1. Workflow for evaluating the transferability of signature genes using the Nearest Centroid (NC) and K-Nearest Neighbor (KNN) methods. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The beginning of the testing process is depicted on the left side of the diagram, in which the best classifier is developed using the AFX training set and its performance (the AFX prediction accuracy) is obtained by predicting the AFX test set. The classifier method and signature genes of the best AFX classifier were then transferred to the Agilent platform to develop the AGL classifier based on the AGL training set. The AGL classifier was then used to determine the prediction accuracy on the AGL test set. The degree of cross-platform transferability was assessed by comparing the AFX prediction accuracy with the AGL prediction accuracy. The most important component of the transferability analysis is obtaining the best classifier. In this study, the best AFX classifier was derived from a stratified random sample-splitting (30/70) validation approach. Specifically (please follow the footnotes on the left side of the graph):
- Stratified random sample splitting – We used 30/70 splitting, where the 70% samples were used to develop the classifier, which was then used to predict the remaining 30% of the samples. Because the sample splitting was random in nature, the process (illustrated within the red dashed line) was repeated 500 times, thereby generating 500 best classifiers. The results of these 500 classifiers were used to calculate the T-Index scores reported in Supplementary Table 3 online.
- Filtering – This step was employed to generate an initial pool of probes/probesets for use in the subsequent analyses. The number of probes/probesets was around 300 after invoking a fold-change threshold of 2-fold and a p-value criterion of 0.05. These probes/probesets were rank ordered by fold-change value, as suggested by the MAQC-I findings.
- Feature selection – A sequential forward feature selection method was applied, starting with the probe/probeset with the highest fold-change value, to develop a classifier. The resulting classifier was then used to predict the 30% samples, and the prediction accuracy was recorded (see step 4). The process was repeated by incrementally adding one probe/probeset at a time to generate the next five classifiers.
- Prediction – For classifier i (where i is the number of probes/probesets used in the classifier), if the subsequent five consecutive classifiers had prediction accuracies less than or equal to that of classifier i, the process stopped and classifier i was selected as the best classifier. If the prediction accuracy was greater than that of classifier i, steps 3 and 4 were repeated. (A code sketch of this selection procedure follows the note below.)
(NOTE: this illustration only depicts the testing of transferability from the Affymetrix platform to the Agilent platform; however, the same approach was also applied for the reverse scenario.)
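For concreteness, the filtering and forward selection steps above can be expressed in code. The following is a minimal sketch under stated assumptions: random data stand in for the AFX training set, intensities are assumed to be on the log2 scale, and k = 3 is an illustrative choice for KNN that the text does not specify.

```python
# Minimal sketch of steps 1-4: one stratified 30/70 split, fold-change/p-value
# filtering, and sequential forward selection with the five-classifier
# stopping rule. Data, k, and variable names are illustrative assumptions.
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def filter_and_rank(X, y, fc=2.0, p_cut=0.05):
    """Keep probes/probesets with |fold change| >= fc and t-test p < p_cut,
    ranked by decreasing absolute fold change (X assumed log2 scale)."""
    g0, g1 = X[y == 0], X[y == 1]
    log_fc = g1.mean(axis=0) - g0.mean(axis=0)      # log2 fold change
    _, p = stats.ttest_ind(g0, g1, axis=0)
    keep = np.where((np.abs(log_fc) >= np.log2(fc)) & (p < p_cut))[0]
    return keep[np.argsort(-np.abs(log_fc[keep]))]  # rank by fold change

def forward_select(X_tr, y_tr, X_va, y_va, ranked, patience=5):
    """Add one ranked probe/probeset at a time; stop once five consecutive
    classifiers fail to exceed the accuracy of the current best classifier i."""
    best_acc, best_n, since = -1.0, 0, 0
    for n in range(1, len(ranked) + 1):
        cols = ranked[:n]
        clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, cols], y_tr)
        acc = clf.score(X_va[:, cols], y_va)
        if acc > best_acc:
            best_acc, best_n, since = acc, n, 0
        else:
            since += 1
            if since >= patience:
                break
    return best_n, best_acc

# One of the 500 stratified random 30/70 splits of the AFX training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)
X[y == 1, :30] += 1.5                               # synthetic signal
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
ranked = filter_and_rank(X_tr, y_tr)
best_n, best_acc = forward_select(X_tr, y_tr, X_va, y_va, ranked)
```

In the study, this procedure was repeated over 500 random splits, and the resulting 500 best classifiers were summarized by the T-Index scores reported in the supplementary tables.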
1.2 Evaluating the transferability of signature genes using the Decision Forest (DF) method
Figure 2. Workflow for evaluating the transferability of signature genes using the Decision Forest (DF) method. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The DF classifier was developed based on the training set and used to predict the AFX test set to obtain the AFX prediction accuracy. The probes/probesets used by the classifier (the signature genes) were then transferred to the Agilent platform to develop the DF classifier based on the AGL training set, and the AGL classifier was then used to determine the prediction accuracy on the AGL test set. The degree of cross-platform transferability was assessed by comparing the AFX prediction accuracy with the AGL prediction accuracy. Each step of the workflow is explained below (please follow the footnotes on the left side of the graph):
- Training set – The training set data was predefined (see Materials and Methods).
- Filtering – This step was employed to generate an initial pool of probes/probesets for use in the subsequent analyses. The number of probes/probesets was around 300 after invoking a fold-change threshold of 2-fold and a p-value criterion of 0.05. These probes/probesets were rank ordered by fold-change value, as suggested by the MAQC-I findings.
- DF Classifier – The DF algorithm integrates the feature selection process within the modeling process. Thus, unlike the KNN and NC methods depicted in Supplementary Figure 1 online, multiple repetitions of the sample-splitting procedure were not necessary for DF. Consequently, only one DF model was developed for each permutation tested. (A code sketch follows the note below.)
(NOTE: this illustration only depicts the testing of transferability from the Affymetrix platform to the Agilent platform; however, the same approach was also applied for the reverse scenario.)
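Because DF embeds feature selection in model construction, only a single fit per configuration is needed. DF itself is not available in common libraries, so the sketch below substitutes scikit-learn's random forest, which likewise selects features during tree building; the synthetic datasets and the importance-based signature extraction are illustrative assumptions, not the study's implementation.

```python
# Sketch of the single-model DF workflow, with a random forest standing in
# for Decision Forest. All datasets here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Filtered pools (~300 probes/probesets) per platform; in MAQC the same
# samples were profiled on both platforms, hence the shared labels.
X_afx_tr, X_afx_te = rng.normal(size=(60, 300)), rng.normal(size=(40, 300))
X_agl_tr, X_agl_te = rng.normal(size=(60, 300)), rng.normal(size=(40, 300))
y_tr, y_te = rng.integers(0, 2, 60), rng.integers(0, 2, 40)

# One model per configuration: no repeated sample splitting required.
afx = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_afx_tr, y_tr)
afx_acc = afx.score(X_afx_te, y_te)

# Signature genes: the features the ensemble actually used.
signature = np.where(afx.feature_importances_ > 0)[0]

# Transfer the signature (via the common-transcript mapping) and rebuild the
# classifier on the AGL training set, then score the AGL test set.
agl = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    X_agl_tr[:, signature], y_tr)
agl_acc = agl.score(X_agl_te[:, signature], y_te)
transferability_gap = afx_acc - agl_acc
```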
1.3 Pathway-based analysis using KNN to assess transferability of signature genes
Figure 3. Workflow for the pathway-based analysis using KNN to assess transferability of signature genes. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The common transcript set was first mapped to 352 canonical pathways available from GeneGo's MetaCore. A classifier for each pathway was then developed on the AFX training set using the common transcripts (i.e., probesets for Affymetrix) identified by that pathway; the process was repeated for all 352 pathways. The pathways with the highest prediction accuracy on the AFX test set were selected as the signature pathways and were then used to develop classifiers on the Agilent platform using the same common transcripts. The transferability from AFX to AGL was assessed based on the difference in prediction accuracy between the AFX test set and the AGL test set.
(NOTE: this illustration only depicts the testing of transferability from the Affymetrix platform to the Agilent platform; however, the same approach was also applied for the reverse scenario.)
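The per-pathway loop can be expressed compactly. In the sketch below, the mapping from each of the 352 pathways to the column indices of its common transcripts is a synthetic placeholder (the study obtained it from GeneGo's MetaCore), as are the data and the choice of k = 3; the single best pathway stands in for the selected signature pathway(s).

```python
# Sketch of pathway-based KNN classification: one classifier per pathway,
# built from the common transcripts that pathway contains.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_afx_tr, X_afx_te = rng.normal(size=(60, 500)), rng.normal(size=(40, 500))
X_agl_tr, X_agl_te = rng.normal(size=(60, 500)), rng.normal(size=(40, 500))
y_tr, y_te = rng.integers(0, 2, 60), rng.integers(0, 2, 40)

# pathway name -> indices of its common transcripts (placeholder mapping).
pathways = {f"pathway_{i}": rng.choice(500, size=20, replace=False)
            for i in range(352)}

def accuracy(X_tr, X_te, cols):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, cols], y_tr)
    return clf.score(X_te[:, cols], y_te)

# Develop a classifier per pathway on the AFX training set; score on AFX test.
afx_acc = {name: accuracy(X_afx_tr, X_afx_te, cols)
           for name, cols in pathways.items()}

# Signature pathway: highest AFX test-set accuracy. Rebuild on the AGL
# training set with the same common transcripts and compare accuracies.
best = max(afx_acc, key=afx_acc.get)
agl_acc = accuracy(X_agl_tr, X_agl_te, pathways[best])
transferability_gap = afx_acc[best] - agl_acc
```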
1.4 Assessing cross-platform transferability of signature genes based on the RHI score as a continuous endpoint variable
Figure 4. Workflow for assessing cross-platform transferability of signature genes based on the RHI score as a continuous endpoint variable. The analysis was conducted for Analysis Configuration 3 and the SeqMap common transcripts. Both the Affymetrix (AFX) and Agilent (AGL) datasets were first divided into the predefined training and test sets (see Materials and Methods), resulting in an identical training and test set pair for both platforms for Analysis Configuration 3. Forty-three regression models were developed using the General Linear Model (GLM), Partial Least Squares (PLS), and Partition Tree (PT) methods. Specifically, 430 iterations of split-sample (80/20) validation were first carried out for the Affymetrix platform (illustrated within the red dashed line), and the final AFX models were developed based on the internal validation results. The AFX models were used to predict the AFX test set to obtain the AFX root mean square error (RMSE) of prediction. The regression method and signature genes of the AFX models were then transferred to the Agilent platform to develop the AGL regression models based on the AGL training set. The AGL models were then used to determine the AGL RMSE of prediction on the AGL test set. The degree of cross-platform transferability was assessed by comparing the AFX RMSE with the AGL RMSE.
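For the continuous RHI endpoint, each regression family reduces to fitting a model on the training set and reporting the RMSE of prediction on the test set. The sketch below uses scikit-learn analogues on placeholder data (ordinary least squares for the GLM, PLSRegression for PLS, and a decision-tree regressor for the partition tree); the 430-iteration internal 80/20 validation is indicated only in a comment.

```python
# Sketch of the GLM / PLS / partition-tree regression step with RMSE of
# prediction; data and model settings are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression      # GLM (identity link)
from sklearn.cross_decomposition import PLSRegression  # PLS
from sklearn.tree import DecisionTreeRegressor         # partition tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 50)), rng.normal(size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "GLM": LinearRegression(),
    "PLS": PLSRegression(n_components=3),
    "PT":  DecisionTreeRegressor(max_depth=4, random_state=0),
}
for name, model in models.items():
    # In the study, each family was first tuned with 430 iterations of
    # internal 80/20 split-sample validation on the training set.
    model.fit(X_tr, y_tr)
    pred = np.ravel(model.predict(X_te))   # PLS returns a 2-D array
    rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
    print(f"{name}: RMSE of prediction = {rmse:.3f}")
```

The same regression method and signature genes would then be refit on the AGL training set, and the AFX and AGL RMSE values compared, as described in the legend above.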
2. Detailed methodology for the study of transferability of classifiers
2.1 Cross-platform transferability of classifiers using the K-Nearest Neighbor (KNN) method
Figure 5. Workflow for assessing cross-platform transferability of classifiers using the K-Nearest Neighbor (KNN) method. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The beginning of the testing process is depicted on the left side of the diagram, in which the best classifier is developed using the AFX training set. Subsequently, the best AFX classifier was used to classify the samples of the AFX test set, the AGL training set, and the AGL test set. For KNN, the best AFX classifier was derived from a stratified random sample-splitting (30/70) validation approach on the AFX training set. Specifically (please follow the footnotes on the left side of the graph):
- Stratified random sample splitting – We used 30/70 splitting, where the 70% samples were used to develop the classifier, which was then used to predict the remaining 30% of the samples. Because the sample splitting was random in nature, the process (illustrated within the red dashed line) was repeated 500 times, thereby generating 500 best classifiers. The results of these 500 classifiers were used to calculate the T-Index scores reported in Supplementary Tables 5 and 6 online.
- Filtering – This step was employed to generate an initial pool of probes/probesets for use in the subsequent analyses. The number of probes/probesets was ~300 after invoking a fold-change threshold of 2-fold and a p-value criterion of 0.05. These probes/probesets were rank ordered by fold-change value, as suggested by the MAQC-I findings.
- Feature selection – A sequential forward feature selection method was applied, starting with the probe/probeset with the highest fold-change value, to develop a classifier. The resulting classifier was then used to predict the 30% samples, and the prediction accuracy was recorded (see step 4). The process was repeated by incrementally adding one probe/probeset at a time to generate the next five classifiers.
- Prediction – For classifier i (where i is the number of probes/probesets used in the classifier), if the subsequent five consecutive classifiers had prediction accuracies less than or equal to that of classifier i, the process stopped and classifier i was selected as the best classifier. If the prediction accuracy was greater than that of classifier i, steps 3 and 4 were repeated. (A code sketch of the direct classifier transfer follows the note below.)
(NOTE: this illustration only depicts the testing of transferability from the Affymetrix platform to the Agilent platform; however, the same approach was also applied for the reverse scenario.)
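The essential difference from Section 1.1 is that nothing is retrained on the Agilent side: the fitted AFX classifier is applied as-is to the AGL data via the common-transcript mapping. A minimal sketch, assuming a placeholder signature, synthetic data, and k = 3:

```python
# Sketch of direct classifier transfer: train once on AFX, then apply the
# same fitted classifier to the AFX test, AGL training, and AGL test sets.
# Signature indices, data, and k are illustrative placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
sig = np.arange(15)                    # best AFX signature from steps 1-4
X_afx_tr, X_afx_te = rng.normal(size=(60, 300)), rng.normal(size=(40, 300))
X_agl_tr, X_agl_te = rng.normal(size=(60, 300)), rng.normal(size=(40, 300))
y_tr, y_te = rng.integers(0, 2, 60), rng.integers(0, 2, 40)  # shared labels

best_afx = KNeighborsClassifier(n_neighbors=3).fit(X_afx_tr[:, sig], y_tr)

# One fitted classifier, three prediction targets; the AGL columns are
# assumed already aligned to the AFX signature via the common transcripts.
acc_afx_test  = best_afx.score(X_afx_te[:, sig], y_te)
acc_agl_train = best_afx.score(X_agl_tr[:, sig], y_tr)
acc_agl_test  = best_afx.score(X_agl_te[:, sig], y_te)
```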
2.2 Cross-platform transferability of classifiers using the Decision Forest (DF) method
Figure 6. Workflow for the cross-platform transferability of classifiers using the Decision Forest (DF) method. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The DF classifier was developed on the AFX training set and subsequently used to classify the samples of the AFX test set, the AGL training set, and the AGL test set. The detailed steps involved in the classifier development are explained below (please follow the footnotes on the left side of the graph):
- Training set – The training set data was predefined (see Materials and Methods).
- Filtering – This step was employed to generate an initial pool of probes/probesets for use in the subsequent analyses. The number of probes/probesets was limited to approximately 300 by invoking a fold-change threshold of 2-fold and a log-ratio p-value criterion of 0.05. These probes/probesets were rank ordered by fold-change value, as suggested by the MAQC-I findings.
- DF Classifier – The DF algorithm integrates the feature selection process within the modeling process. Thus, unlike the KNN and NC methods depicted in Supplementary Figure 1 online, multiple repetitions of the sample-splitting procedure were not necessary for DF. Consequently, only one DF model was developed for each permutation tested.
2.3 Cross-platform transferability of classifiers using the Support Vector Machine (SVM)
Figure 7. Workflow for assessing cross-platform transferability of classifiers using the SVM. Both the Affymetrix (AFX) and Agilent (AGL) datasets were divided into the predefined training and test sets (see Materials and Methods), resulting in identical training and test set pairs for both platforms for Analysis Configurations 2 and 3, but not for Analysis Configuration 1. The beginning of the testing process is depicted on the left side of the diagram, in which the best classifier is developed using the AFX training set. Subsequently, the best AFX classifier was used to classify the samples of the AFX test set, the AGL training set, and the AGL test set. For SVM, the best AFX classifier was derived from a stratified random sample-splitting (30/70) validation approach on the AFX training set. Specifically (please follow the footnotes on the left side of the graph):
- Stratified random sample splitting – We used 30/70 splitting, where the 70% samples were used to develop the classifier, which was then used to predict the remaining 30% of the samples. Because the sample splitting was random in nature, the process (illustrated within the red dashed line) was repeated 100 times for every combination of feature number and cost violation parameter in the linear SVM. The results are reported in Supplementary Tables 5 and 6 online.
- Feature selection – A two-sample t-test was applied to select the top N features (N = 2, 4, 6, …, 50) based on the 70% samples only.
- SVM – Linear SVMs with different values of the cost violation parameter (C = 10⁻³, 10⁻², …, 10³) were used as classifiers.
- Predictions within random splits – The SVMs defined in step 3 were used as classifiers to make predictions based on the selected features (the output of step 2). For each combination of feature number and C, 100 classification accuracies were generated.
- Predictions on the AFX test set, AGL training set, and AGL test set – Based on the 100 classification accuracies for each combination of feature number and C, the optimal combination of C and feature number was selected as the one with the maximum average classification accuracy across the 100 splits. The best classifier was then obtained by:
- Performing feature selection using all the samples in the AFX training set, based on the optimal feature number determined above.
- Training a linear SVM on the selected features with the optimal C, using all the samples from the AFX training set.
The predictions for the AFX test set, AGL training set, and AGL test set were obtained by applying this best classifier. (A code sketch of the grid search and final refit follows the note below.)
(NOTE: this illustration only depicts the testing of transferability from the Affymetrix platform to the Agilent platform; however, the same approach was also applied for the reverse scenario.)
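A sketch of the (feature number, C) grid search over 100 stratified 30/70 splits is given below. The data are synthetic placeholders, and scikit-learn's SVC with a linear kernel stands in for whatever SVM implementation the study used; the brute-force triple loop mirrors the procedure rather than an optimized implementation.

```python
# Sketch of steps 1-5: t-test feature ranking on the 70% samples only,
# a grid over N and C, 100 stratified random splits, and a final refit
# on the full AFX training set. Data are synthetic placeholders.
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 300)), rng.integers(0, 2, size=100)
X[y == 1, :10] += 1.0                          # synthetic signal

feature_grid = list(range(2, 51, 2))           # N = 2, 4, ..., 50
C_grid = [10.0 ** k for k in range(-3, 4)]     # C = 1e-3, ..., 1e3
splits = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=0)

def top_n_by_ttest(X_tr, y_tr, n):
    """Rank features by two-sample t statistic on the 70% samples only."""
    t, _ = stats.ttest_ind(X_tr[y_tr == 0], X_tr[y_tr == 1], axis=0)
    return np.argsort(-np.abs(t))[:n]

acc = np.zeros((len(feature_grid), len(C_grid)))
for tr, va in splits.split(X, y):
    for i, n in enumerate(feature_grid):
        feats = top_n_by_ttest(X[tr], y[tr], n)
        for j, C in enumerate(C_grid):
            clf = SVC(kernel="linear", C=C).fit(X[tr][:, feats], y[tr])
            acc[i, j] += clf.score(X[va][:, feats], y[va])
acc /= splits.get_n_splits()                   # mean accuracy per (N, C)

# Optimal (N, C) maximizes average accuracy; refit on the whole training
# set, reselecting features from all of its samples.
i, j = np.unravel_index(np.argmax(acc), acc.shape)
best_feats = top_n_by_ttest(X, y, feature_grid[i])
best_clf = SVC(kernel="linear", C=C_grid[j]).fit(X[:, best_feats], y)
# best_clf is then applied to the AFX test set, AGL training set, and
# AGL test set, as in the classifier-transfer workflow above.
```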