SUPPLEMENTARY MATERIAL

Appendix 1: Machine learning methodology

1.1 Pre-processing

Data preprocessing is applied to improve the quality of the data used for feature selection, prediction and optimization. It includes two main sub-steps that can be applied or omitted depending on their impact on the prediction:

  1. Filtering: all features that were sampled below a given sampling frequency are removed. The filtering cut-offs used were 30, 40 and 50%.
  2. Imputation: this technique interpolates all missing values using a Nearest-Neighbor algorithm [1]. Given a partially informed sample (one with missing values), the algorithm finds the closest sample within the set of fully informed samples and copies the values of the missing variables from that neighbor into the imputed sample. Similarity is measured using the standard Euclidean distance in the n-dimensional vector space, with n the number of informed variables. This way of interpolating has the advantage of not introducing outliers that were not present in the dataset before imputation. Although the success of the different imputation algorithms might be data-driven, imputing the data improved the accuracy of the predictions and did not alter the prognostic variables involved, providing a shorter list with higher discriminatory power. A minimal sketch of this imputation step is given after this list.
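The sketch below illustrates the imputation step under the assumptions stated above, with the data held in a NumPy array and missing entries encoded as NaN; the function name and layout are illustrative, not the implementation used in the paper.

```python
import numpy as np

def nn_impute(X):
    """Nearest-Neighbor imputation sketch.

    X: 2D array (samples x features) with np.nan marking missing values.
    Each partially informed sample is completed with the values of its
    closest fully informed sample, measured with the Euclidean distance
    restricted to the variables that are informed in the partial sample.
    """
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]         # fully informed samples
    X_imp = X.copy()
    for row in X_imp:
        missing = np.isnan(row)
        if not missing.any():
            continue
        informed = ~missing
        # Euclidean distance over the informed variables only
        d = np.linalg.norm(complete[:, informed] - row[informed], axis=1)
        nearest = complete[np.argmin(d)]
        row[missing] = nearest[missing]            # copy the neighbor's values
    return X_imp
```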
1.2 Feature selection methods

Feature selection (FS) methods are used to find the minimum-size list of the most discriminatory variables (also called the reduced base) for the different classification problems addressed in this paper. The inferred prognostic variables are then used to optimally predict treatment response, together with the corresponding uncertainty assessment, since, due to the ill-posed character of these classification problems, there might exist different lists of prognostic variables with similar predictive accuracy.

To explore the existence of these lists we have used two different techniques:

Fisher’s Ratio (FR) [2, 3]: Feature selection by the Fisher’s ratio (FR) obtains the set of prognostic variables (attributes) that have both: 1) maximum separation of the centers of the distributions between classes, and 2) minimum variance within classes. The Fisher’s ratio of an attribute j in this two-class problem is defined as follows:

$$\mathrm{FR}_j = \frac{(\mu_{j1} - \mu_{j2})^2}{\sigma_{j1}^2 + \sigma_{j2}^2},$$

where $\mu_{j1}, \mu_{j2}$ are the centers of the distribution (means) of attribute j in classes 1 and 2, and $\sigma_{j1}^2, \sigma_{j2}^2$ are measures of the dispersion (variance) within these classes. Attributes with the highest FR therefore have the greatest power to discriminate between the two classes.
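The following sketch shows how the Fisher’s ratio of every attribute could be computed for a two-class problem; the function name and signature are illustrative only.

```python
import numpy as np

def fisher_ratio(X, y):
    """Fisher's ratio of each attribute for a two-class problem (sketch).

    X: 2D array (samples x attributes); y: array of two class labels.
    Returns one ratio per attribute: the squared distance between the
    class means divided by the sum of the within-class variances.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    c1, c2 = np.unique(y)[:2]
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return num / den

# Rank attributes by decreasing discriminatory power:
# ranking = np.argsort(-fisher_ratio(X, y))
```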

Maximum Percentile Distance (MPD) [4]: The percentile distance FS method selects the attributes with the highest distances between the corresponding cumulative probability functions (percentile arrays) within each class, defined for attribute j as follows:

$$\mathrm{PD}_j = \lVert \mathbf{p}_{j1} - \mathbf{p}_{j2} \rVert,$$

where $\mathbf{p}_{j1}, \mathbf{p}_{j2}$ stand for the percentile vectors of attribute j in classes 1 and 2. Percentiles vary from 5 to 95% to avoid the possible effect of outliers. Variables with higher percentile distances between the corresponding discrete cumulative probability functions (percentile arrays) in each class are selected. The percentile array takes into account the distance between the centers of the distributions (medians, or percentile 50%), the interquartile range (percentiles 25% and 75%), and also the distance between the tails of the distributions (percentiles 5% and 95%).
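A possible implementation of the percentile distance is sketched below, assuming percentiles from 5% to 95% in steps of 5% and the Euclidean norm between the two percentile vectors; the specific norm is an assumption.

```python
import numpy as np

def percentile_distance(X, y, percentiles=range(5, 100, 5)):
    """Percentile distance of each attribute for a two-class problem (sketch).

    Builds the percentile vector (5% to 95%) of every attribute in each
    class and returns the norm of their difference, attribute by attribute.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    c1, c2 = np.unique(y)[:2]
    q = list(percentiles)
    p1 = np.percentile(X[y == c1], q, axis=0)   # percentiles x attributes
    p2 = np.percentile(X[y == c2], q, axis=0)
    return np.linalg.norm(p1 - p2, axis=0)
```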

These two feature selection methods were selected due to their efficiency and low computational cost. As a result of this analysis, variables are ranked according to their discriminatory power (Fisher’s ratio and/or percentile distance values).

Once the most discriminatory variables are determined and ranked in decreasing order of their discriminatory power, the aim is to determine the shortest list of prognostic variables (the one with the smallest number of variables) that attains the highest predictive accuracy.

The algorithm used to find the minimum-size list of features is similar to Recursive Feature Elimination (RFE) [5]. Feature elimination tries to unravel the existence of redundant or irrelevant features in order to yield the smallest set of prognostic variables that provides the greatest possible classification accuracy. Redundant features provide no additional information beyond the currently selected features, while irrelevant features provide no useful information in any context.

The backward feature elimination algorithm works as follows: beginning at the tail of the ranked list of prognostic variables, the algorithm iteratively generates increasingly shorter lists by eliminating one prognostic variable at a time and calculating their classification accuracy. The list with the optimum accuracy and minimum size is finally selected. This way of proceeding is based on the following idea: prognostic variables with higher discriminatory ratios (FR or MPD) span the low-frequency features of the classification, whilst variables with the lowest discriminatory ratios account for the details in the discrimination (high-frequency features). The method determines the minimum amount of high-frequency detail needed to optimally discriminate between the classes.
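The sketch below illustrates this elimination loop under the stated idea; `accuracy_fn` stands for any routine (such as the LOOCV estimator described next) that returns the accuracy of a candidate feature list, and the tie-breaking rule shown is only one reasonable choice.

```python
def backward_elimination(ranked_features, accuracy_fn):
    """Backward elimination over a ranked feature list (sketch).

    ranked_features: features sorted by decreasing discriminatory power.
    accuracy_fn: callable returning the (e.g. LOOCV) accuracy of a list.
    Starting from the full ranked list, features are dropped from the tail
    one at a time; the shortest list reaching the best accuracy is kept.
    """
    best_list = list(ranked_features)
    best_acc = accuracy_fn(best_list)
    for k in range(len(ranked_features) - 1, 0, -1):
        candidate = list(ranked_features[:k])
        acc = accuracy_fn(candidate)
        # prefer shorter lists when accuracy is at least as good
        if acc >= best_acc:
            best_list, best_acc = candidate, acc
    return best_list, best_acc
```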

The predictive accuracy estimation is based on the Leave-One-Out cross-validation method (LOOCV), using the average distance of the reduced set of features to each training class set. The goal of cross-validation is to estimate how accurately a predictive model (classifier) will perform in practice. LOOCV uses a single sample from the original dataset as the validation data (test sample), and the remaining samples as training data. The class assignment is based on a nearest-neighbor classifier in the reduced base, that is, the class with the minimum distance in the reduced base to the test sample is assigned to the test sample. The average LOOCV predictive accuracy is calculated by iterating over all the samples, using the Heterogeneous Value Difference Metric (HVDM) [6]. For continuous variables this metric coincides with the Euclidean distance between the corresponding normalized variables. The weights used to normalize the variables are the inverse of two times the prior variability (standard deviation) of the prognostic variables. These weights scale the different kinds of measurements into approximately the same range, so that each variable has a similar influence on the overall distance measurement. The scaling value is based on the fact that 95% of the values in a normal distribution fall within two standard deviations of the mean. Using this metric, attributes with higher standard deviations are assigned lower weights. In this procedure the feature selection method is executed only once, using all training samples, before estimating the accuracy by means of the leave-one-out procedure. Our goal is not to study the effectiveness of the feature selection methods --although one of them may provide better sets of attributes for a particular problem-- but to analyze the goodness of such groups of prognostic variables.
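A minimal LOOCV sketch under the assumptions above is shown next; it assigns each held-out sample to the class with the minimum average weighted distance, using 1/(2*std) weights as in the continuous part of the HVDM metric. The function name and data layout are illustrative.

```python
import numpy as np

def loocv_accuracy(X, y, features):
    """LOOCV accuracy over a reduced base of features (sketch)."""
    X, y = np.asarray(X, float), np.asarray(y)
    Xr = X[:, list(features)]
    n, correct = len(y), 0
    classes = np.unique(y)
    for i in range(n):
        train = np.arange(n) != i
        w = 1.0 / (2.0 * Xr[train].std(axis=0))      # HVDM-style weights
        avg_dists = []
        for c in classes:
            Xc = Xr[train & (y == c)]
            d = np.linalg.norm((Xc - Xr[i]) * w, axis=1)
            avg_dists.append(d.mean())               # average distance to class
        if classes[int(np.argmin(avg_dists))] == y[i]:
            correct += 1
    return correct / n
```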

1.3 k-NN design and risk analysis

In the previous step, maximizing the predictive accuracy according to the LOOCV criterion allowed us to determine the best reduced base of prognostic variables. However, it is also important to analyze the structure of the confusion matrix, obtained from the set of predictions of the training set using the LOOCV method. The confusion matrix is composed of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). These concepts depend on how the classification problem is set up. Physicians can use this knowledge in their decision-making process. From the confusion matrix we can calculate different rates that are very useful to understand the risk in the prediction (a sketch of their computation is given after this list):

1. True Positive Rate or Sensitivity (TPR): measures the proportion of actual positives that are correctly predicted as such.

2. True Negative Rate or Specificity (SPC): measures the proportion of negatives that are correctly predicted as such.

3. False Positive Rate (FPR): fraction of false positives out of the total actual negatives.

4. False Negative Rate (FNR): fraction of false negatives out of the total actual positives.

5. False Discovery Rate (FDR): fraction of false positives out of the total predicted positives.
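As an illustration, these rates can be computed directly from the four confusion-matrix counts; the function name below is illustrative.

```python
def risk_rates(tp, tn, fp, fn):
    """Rates derived from the confusion matrix counts (sketch)."""
    return {
        "TPR (sensitivity)": tp / (tp + fn),  # actual positives correctly predicted
        "SPC (specificity)": tn / (tn + fp),  # actual negatives correctly predicted
        "FPR": fp / (fp + tn),                # false positives over actual negatives
        "FNR": fn / (fn + tp),                # false negatives over actual positives
        "FDR": fp / (fp + tp),                # false positives over predicted positives
    }
```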

It is possible to optimize the true positive and/or true negative rates (improving at the same time the overall predictive accuracy) by optimizing the parameters of the k-NN classifier. The idea is to balance/improve the confusion matrix by optimizing the prior weights assigned by the HVDM metric to the best reduced base found by applying the LOOCV approach. This optimization is performed via a powerful family of Particle Swarm Optimizers (PSO) [7, 8].

1.4 Particle Swarm Optimization (PSO)

Particle swarm optimization is a stochastic evolutionary computation technique used in optimization, initially inspired by the social behavior of individuals (called particles) in nature, such as bird flocking and fish schooling.

The algorithm consists of the following:

  1. A prismatic space of admissible solutions, M, is defined:

$$M = \{ \mathbf{x} \in \mathbb{R}^n : l_j \le x_j \le u_j,\; j = 1, \dots, n \},$$

where $l_j, u_j$ are the lower and upper limits for the j-th coordinate of each optimization model. In PSO terminology, each model is called a particle and is represented by a vector whose length is the number of model parameters of the optimization problem. Each particle has its own position in the search space. The particle velocity represents the parameter perturbations needed for these particles to move around the search space and explore solutions of the inverse problem.

  2. PSO updates the positions, $\mathbf{x}_i(k)$, and velocities, $\mathbf{v}_i(k)$, of each particle in the swarm in each iteration, according to 3 main components:

a) The inertia term, which consists of the old velocity of the particle, $\mathbf{v}_i(k)$, weighted by a real constant, $\omega$, called inertia.

b) The so-called social term, which is the difference between the global best position found so far in the entire swarm (called $\mathbf{g}(k)$) and the particle's current position ($\mathbf{x}_i(k)$).

c) The so-called cognitive term, which is the difference between the particle's best position found so far (called $\mathbf{l}_i(k)$, the local best) and the particle’s current position ($\mathbf{x}_i(k)$):

$$\mathbf{v}_i(k+1) = \omega\,\mathbf{v}_i(k) + a_g\,\mathbf{r}_1 \circ \big(\mathbf{g}(k) - \mathbf{x}_i(k)\big) + a_l\,\mathbf{r}_2 \circ \big(\mathbf{l}_i(k) - \mathbf{x}_i(k)\big),$$
$$\mathbf{x}_i(k+1) = \mathbf{x}_i(k) + \mathbf{v}_i(k+1),$$

where $\mathbf{r}_1, \mathbf{r}_2$ are vectors of random numbers uniformly distributed in (0, 1), which weight the global and local acceleration constants, $a_g, a_l$.

$\omega$, $a_g$ and $a_l$ are the PSO parameters to be tuned in order to achieve convergence. PSO has been chosen for this purpose because its convergence has been analyzed using stochastic stability analysis. Consequently, the tuning of the PSO parameters can be done automatically, based on these stability results. In particular, the RR-PSO [9] and CP-PSO [10] versions are used due to their higher exploratory properties, which allowed us to escape from local minima.
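A self-contained sketch of a plain PSO following the update rules above is given below; the parameter values, bound handling and stopping rule are illustrative choices, not those of the RR-PSO or CP-PSO variants used in the paper.

```python
import numpy as np

def pso_minimize(f, lower, upper, n_particles=30, iters=200,
                 omega=0.7, a_g=1.5, a_l=1.5, seed=0):
    """Minimal particle swarm optimization sketch.

    f: objective to minimize; lower, upper: bounds of the prismatic
    search space M. omega, a_g and a_l are the inertia and the
    global/local acceleration constants (values here are illustrative).
    """
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x = rng.uniform(lower, upper, size=(n_particles, lower.size))
    v = np.zeros_like(x)
    local_best = x.copy()
    local_val = np.array([f(p) for p in x])
    g = local_best[np.argmin(local_val)]
    for _ in range(iters):
        r1 = rng.random(x.shape)
        r2 = rng.random(x.shape)
        v = omega * v + a_g * r1 * (g - x) + a_l * r2 * (local_best - x)
        x = np.clip(x + v, lower, upper)        # keep particles inside M
        val = np.array([f(p) for p in x])
        improved = val < local_val
        local_best[improved] = x[improved]
        local_val[improved] = val[improved]
        g = local_best[np.argmin(local_val)]    # global best of the swarm
    return g, local_val.min()
```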

Appendix 2: Example of an implementation of the methodology in an Excel spreadsheet

The final classifier has been implemented in an Excel spreadsheet that is provided as supplementary material (HL_TreatmentResponse_Predictor.xls). As an example of its use, the prediction of 3 different samples, corresponding to the different classes (PD, PR, and CR), is given below:

New Sample 1:

This sample has a Serum Ferritin value SF = 2500 ng/mL. This value is input into the data box (orange) so we can first differentiate between CR or PR and PD. Since the value is closer (using the weighted distance) to the mean value of PD, the system classifies the new sample as belonging to the PD class with a probability of 88% (P2 = 0.88). In conclusion, the patient would not respond adequately to the treatment, with a probability of 88%.

New Sample 2:

In this case SF = 450 ng/mL. As in the previous example, this value is input into the data box (orange). The weighted distance is in this case closer to the CR or PR group, with a probability of 93% (P1 = 0.93). To differentiate further between CR and PR, the values of ALT and ALP are needed; in this case, ALT = 20 U/L and ALP = 119 U/L. These values are introduced in the pop-up data box. The weighted distance is closer to CR, with a probability of 91% (P1 = 0.91). In conclusion, the patient will be in the complete remission class with a probability of 91%.

New Sample 3:

In this case the values are: SF = 1350 ng/mL, ALT = 21 U/L and ALP = 380 U/L. Following the same procedure, the new sample belongs to the PR class with a probability of 53% (P2 = 0.53). The system would predict that the patient would remit partially, with some uncertainty, that is, complete remission is still possible with a probability of 48%.

It may seem unusual to make decisions with a certain probability, but it is the right way to proceed due to the intrinsic uncertainty of the corresponding classifier. The probabilities P1 and P2 provide an idea of the distance between the new sample and the mean signatures in each class, as follows:

where $\mathbf{s}_{new}$ is the new sample, $d_1$ the average distance to class 1, and $d_2$ the average distance to class 2.
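Although the exact expression is not reproduced here, one natural way to turn the two average distances into probabilities, consistent with the worked examples above (P1 and P2 summing to roughly one), is the relative-distance normalization sketched below. This form is an assumption, not necessarily the one implemented in the spreadsheet.

```python
import numpy as np

def class_probabilities(s_new, class1_samples, class2_samples, weights):
    """Assumed relative-distance probabilities for a new sample (sketch).

    d1, d2 are the average weighted distances of the new sample to the
    samples of classes 1 and 2; the closer class receives the higher
    probability, with P1 + P2 = 1 (an assumed normalization).
    """
    d1 = np.mean(np.linalg.norm((class1_samples - s_new) * weights, axis=1))
    d2 = np.mean(np.linalg.norm((class2_samples - s_new) * weights, axis=1))
    p1 = d2 / (d1 + d2)
    return p1, 1.0 - p1
```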

The procedure to decide the class assignment is as follows:

In particular, if the SF value is higher than 1540 ng/mL, the classifier assigns the new sample to the PD class. Obviously, the uncertainty of this classification when the SF value is close to 1540 ng/mL is very high, as indicated by the P1 and P2 probabilities. Besides, as has been highlighted in the main text, the difference between partial and complete remission is very hard to tell.

Finally, the cut-off values corresponding to the decision variables SF, ALT and ALP could be tailored to a particular population.

Appendix 3: Other comparisons

3.1 Multi-class comparison: CR vs. PR vs. PD

For this purpose we used a generalization of the feature selection methods to the multiclass problem. The multiclass comparison provided worse results in general terms. The best subset of variables was obtained with MPD (SF, ALT, SUR and ALP), but only reached a predictive accuracy of 84%. However, it is remarkable that SF is at the top of the list and that ALT and ALP are also part of this reduced base. Although the accuracy was not as high as in the binary comparisons, this result highlights the important role that SF, ALT and ALP play in predicting Hodgkin’s Lymphoma treatment response.

3.2 Binary Comparison 1: CR (+) vs. PR+PD (-)

In this binary comparison, PR and PD patients were considered as a single group. Biologically, this makes sense since both types of patients followed the second line of treatment. CR was considered the positive class.

The MPD provided the shortest list with the highest accuracy. The variables of the reduced base are Serum Ferritin (SF) and ALT, providing a LOOCV accuracy of 90.11%. The SF mean in the TN group (2480 ng/mL) is about 10 times higher than in the TP group with complete remission (247 ng/mL). The true positive rate of this comparison is very high (97.89%), that is, patients with complete remission are easily identified. The true negative rate is much lower (19.23%), meaning that some members of this group (PR and PD) are wrongly identified. ALT shows a behavior similar to that of SF: the mean ALT value in the TN group (52 U/L) is more than double the mean value in the TP group (24 U/L), and is slightly higher than the maximum value considered to be normal (around 43 U/L).

The FP (21 samples) and FN (5 samples) groups provoke ambiguity in the classification. In particular, the members of the false positive group have a mean Serum Ferritin signature of 392 ng/mL and are mainly composed of patients in partial remission (15 of the 21 samples).

The FN group is composed in this case of five samples with anomalously high values of SF and ALT, typical of the TN group. These samples might be misclassified, or might define a new class of patients that does not correspond to the typical TP group, that is, patients in complete remission with a very low inflammatory response (SF values) and normal ALT values (no liver injury). This comparison also suggests that the optimum way to differentiate the 3 classes is first to differentiate CR+PR vs. PD, and in a second step to differentiate CR vs. PR.

Finally, the Fisher’s ratio provided a longer list composed of 9 variables with a lower accuracy (82.5%). However, the variable with the highest FR is again SF. In second place of the list we found the number of affected lymph areas (ALA). ALT and ALP appeared in this list in the seventh and eighth places, respectively.

Appendix References

[1] Troyanskaya OG, Cantor M, Sherlock G, et al. Missing value estimation methods for DNA microarrays. Bioinformatics 2001; 17(6): 520–5.

[2] Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics 1936; 7(2): 179–88.

[3] Yang F, Mao K. Robust feature selection for microarray data based on multicriterion fusion. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8(4): 1080–92.

[4] Fernández-Martínez JL, deAndrés-Galiana EJ. Numerical analysis and comparison of different filter methods for feature selection in high dimensional spaces. Research report. Mathematics Department, Oviedo University, 2014.

[5] Guyon I, Weston J, Barnhill S, et al. Gene selection for cancer classification using support vector machines. Mach Learn 2002; 46(1-3): 389–422.

[6] Wilson DR, Martinez TR. Improved heterogeneous distance functions. J Artif Intell Res 1997; 6: 1–34.

[7] Kennedy J, Eberhart RC. Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks (ICNN'95); 1995; vol. 4: 1942–8.

[8] Fernández-Martínez JL, García Gonzalo E. The generalized PSO: a new door to PSO evolution. Journal of Artificial Evolution and Applications 2008.

[9] Fernández-Martínez JL, García Gonzalo E. Stochastic stability and numerical analysis of two novel algorithms of the PSO family: PP-GPSO and RR-GPSO. Int J Artif Intell Tools 2012; 21(3): 1240011.

[10] Fernández-Martínez JL, García Gonzalo E. The PSO family: deduction, stochastic analysis and comparison. Swarm Intell 2009; 3(4): 245–73.