Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards

Matthew M Churpek, MD, MPH, PhD1,*; Trevor C Yuen, MS1; Christopher Winslow, MD2; David O Meltzer, MD, PhD1; Michael W Kattan, MBA, PhD3; Dana P Edelson, MD, MS1

SUPPLEMENTARY APPENDIX

eMETHODS

Overview

This Appendix Methods section provides more rigorous definitions and details of the methods used in the study. As noted in the main text, model tuning was completed in the training (derivation) data only, and predicted probabilities from the final models were used in the testing (validation) data to compare the accuracy of the different methods. Many variations of each of these methods have been described in the literature; for the purposes of this study, the default settings for binary classification problems were used to fit each model unless otherwise stated.

Logistic regression

Logistic regression is a linear statistical model that uses the logit (log odds) link to predict a binary outcome from one or more predictors using maximum likelihood estimation.1 Logistic regression has been extended for use with survival analysis data in the discrete-time framework, as described elsewhere.2,3 Importantly, predictor variables entered into the model are assumed to have a linear relationship with the outcome on the log odds scale unless coded otherwise. In this study, two different logistic regression models were compared. The first model was fit using linear terms for each predictor. The second model relaxed the linearity assumption by using restricted cubic splines with three knots for all continuous variables, with knot placement made a priori as recommended by Harrell.1 This allowed predictor variables to confer increased risk at both the low and high ends of their range (e.g., low and high respiratory rates). A post-hoc sensitivity analysis was performed by fitting the spline regression model with four knots to determine whether this improved model accuracy. The logistic regression models were fit using the "glm" function in R.
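For illustration, a minimal sketch of both approaches in R, using the rms package's rcs() for the restricted cubic splines; the data frame, variable names (event, resp_rate, heart_rate), and simulated values are hypothetical stand-ins, not the study's actual predictor set.

    library(rms)

    # Hypothetical training data: a binary outcome and two vital-sign predictors
    set.seed(1)
    train <- data.frame(
      event      = rbinom(1000, 1, 0.1),
      resp_rate  = rnorm(1000, 18, 4),
      heart_rate = rnorm(1000, 85, 15)
    )

    # Model 1: linear terms on the log odds scale (base R glm())
    fit_linear <- glm(event ~ resp_rate + heart_rate,
                      family = binomial, data = train)

    # Model 2: rcs(x, 3) expands each continuous predictor with a restricted
    # cubic spline with 3 knots placed at Harrell's default quantiles
    fit_spline <- lrm(event ~ rcs(resp_rate, 3) + rcs(heart_rate, 3),
                      data = train)

    # Predicted probabilities from each model
    p_linear <- predict(fit_linear, type = "response")
    p_spline <- predict(fit_spline, type = "fitted")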

Tree-based models

Tree-based models partition the feature space into a set of rectangles, and are often presented as a series of "if-then" statements in the form of a tree.4,5 For this study, the Gini index was used to determine the optimal variable and location of the split at each node in the tree.4 In order to optimize the area under the curve (AUC) of the resulting tree, a cost complexity parameter, which penalizes larger trees, was used to control the size of the final tree. The tree with the highest ten-fold cross-validation AUC in the training set was used as the final tree.
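As a sketch of this tuning step, the cost complexity parameter (cp) of an rpart tree can be selected by ten-fold cross-validated AUC through caret; the cp grid below is hypothetical, and the code reuses the simulated train data frame from the sketch above.

    library(caret)

    # caret needs a factor outcome with valid level names to compute
    # class probabilities; derive one from the 0/1 outcome above
    train$event_f <- factor(ifelse(train$event == 1, "yes", "no"),
                            levels = c("no", "yes"))

    # Ten-fold CV with AUC ("ROC") as the model-selection metric
    ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = twoClassSummary)

    # Tune the cost complexity parameter over a (hypothetical) grid;
    # the cp value with the highest CV AUC defines the final tree
    tree_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                      method = "rpart", metric = "ROC",
                      tuneGrid = data.frame(cp = c(0.0005, 0.001, 0.005, 0.01)),
                      trControl = ctrl)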

Decision trees are created one variable at a time, so splits earlier in the tree affect which splits are optimal later in the tree. Thus, there is no guarantee that the resulting splits will lead to an "optimal tree" overall, only that the split at each node is optimal at that point for the data at hand. In addition, small changes in the data can lead to different variables or cut-offs being used in the tree and can thus have a large effect on the tree itself. These issues, among others, often cause decision trees to have suboptimal accuracy compared with other methods.

To improve the accuracy and stability of the decision tree model, a procedure called bagging was used to fit a bagged tree model. This involved taking random bootstrap samples of patient data, with replacement, and fitting an unpruned tree model to each sample.4 The number of bagged trees in the final model was determined in the training dataset using ten-fold cross-validation to maximize the training set AUC, with 500, 1000, and 2000 trees considered.
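A minimal sketch of this procedure, exploiting the fact that a random forest with mtry equal to the full predictor count reduces to bagged trees (reusing the hypothetical train and event_f objects from the sketches above; the candidate tree counts follow the text):

    library(randomForest)

    # With mtry = (number of predictors), every variable is eligible at
    # every split, so each bootstrap tree is an ordinary unpruned tree
    predictors <- c("resp_rate", "heart_rate")
    bag_fits <- lapply(c(500, 1000, 2000), function(B) {
      randomForest(x = train[, predictors], y = train$event_f,
                   mtry = length(predictors), ntree = B)
    })
    # Each candidate fit would then be scored by ten-fold CV AUC in the
    # training data to choose the final number of trees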

In addition to a bagged tree model, a random forest model was also fit. Random forests modify the bagged tree procedure by allowing only a random subset of the predictor variables to be considered at each split of each tree.4,6 This results in trees that are less correlated with one another than bagged trees, which can increase accuracy. The optimal number of trees and the number of predictor variables considered at each split were determined using ten-fold cross-validation, and the combination with the highest training set AUC was denoted the final model.
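A sketch of the corresponding tuning step via caret's "rf" method, reusing ctrl from above; the grid is hypothetical (with only two illustrative predictors, mtry can only range over 1-2), and ntree is passed through to randomForest().

    # Tune mtry, the number of predictors considered at each split,
    # by ten-fold CV AUC
    rf_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                    method = "rf", metric = "ROC",
                    tuneGrid = data.frame(mtry = 1:2),
                    ntree = 500, trControl = ctrl)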

The final tree-based model fit was a gradient boosted machine (i.e., boosted tree). This algorithm fits one tree at a time, first to all the outcomes in the training data and then to the residuals of the previous models, thus creating a combination of trees that increasingly weights the "difficult to predict" events to a greater degree.4,7 The optimal number of splits for each individual tree, the total number of trees, and an additional shrinkage factor, which weights the contribution from each individual tree, were determined using ten-fold cross-validation.
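A sketch of the joint tuning of these three parameters with caret's "gbm" method (hypothetical grid values; n.minobsinnode, gbm's minimum leaf size, is held at its default):

    # expand.grid() enumerates every combination of the tuning parameters
    gbm_grid <- expand.grid(interaction.depth = c(1, 2, 3),  # splits per tree
                            n.trees = c(500, 1000),          # total trees
                            shrinkage = c(0.01, 0.1),        # learning rate
                            n.minobsinnode = 10)

    gbm_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                     method = "gbm", metric = "ROC",
                     tuneGrid = gbm_grid, trControl = ctrl,
                     verbose = FALSE)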

All the tree-based models were fit in R using the "caret" package with the "rpart" and "rf" methods, along with the "randomForest" package.

K-nearest neighbors

K-nearest neighbors (KNN) models use local geographic information in the predictor space to predict the outcome of a new sample.4,5 For example, a KNN model utilizing five neighbors uses the classes of the five closest observations in multidimensional space, based on a distance measure, to predict the outcome of a new observation. In this study, Euclidean distance was used to determine the nearest neighbors for each observation, and ten-fold cross-validation was used to determine the number of neighbors for the final model that maximized the training AUC. Of note, because distance is scale dependent, all predictors were rescaled to have a mean of zero and a standard deviation of one when fitting this model. The R package "caret" with the "knn" method was used to fit the K-nearest neighbors model.
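A sketch of this fit, with centering and scaling requested through caret's preProcess argument and a hypothetical grid of neighbor counts (again reusing the objects defined above):

    # preProcess rescales each predictor to mean 0 / SD 1 within each CV
    # fold, since Euclidean distance is scale dependent
    knn_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                     method = "knn", metric = "ROC",
                     preProcess = c("center", "scale"),
                     tuneGrid = data.frame(k = seq(5, 101, by = 8)),
                     trControl = ctrl)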

Support vector machines

Support vector machines are a generalization of the maximal margin classifier and the support vector classifier, which fit a separating hyperplane in high-dimensional space to the data in order to separate the two classes.5 Specifically, support vector machines generalize the idea of creating a hyperplane that separates the two classes of interest (e.g., event vs. no event) with the farthest minimum distance to the training observations to the situation where the two classes are not completely separable and the hyperplane is allowed to be non-linear. This class of methods is unique in that it ignores data points from each outcome class that are "easy" to predict when determining the structure of the boundary and instead focuses on the points that are hardest to predict (i.e., closest to the boundary) and those points that are misclassified.4,5 To allow the boundary separating the classes to be non-linear, the radial basis function, a commonly used flexible kernel function, was used for the support vector machine model in this study. The main tuning parameter for support vector machines is the cost penalty, with higher values penalizing misclassified observations to a greater degree. The optimal value was determined using ten-fold cross-validation, and the scale parameter was determined computationally using the method of Caputo et al.4 Similar to K-nearest neighbors, this method is scale dependent, so all predictors were rescaled to have a mean of zero and a standard deviation of one. The R package "caret" with the "svmRadial" method was used to fit the support vector machine model.
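A sketch of this fit via caret's "svmRadial" method (which wraps kernlab): with tuneLength, caret cross-validates the cost penalty while estimating the radial kernel's scale parameter from the training predictors with kernlab's sigest(), the approach of Caputo et al. The grid length is hypothetical.

    # tuneLength = 5 evaluates five cost values; sigma is estimated once
    # from the data rather than cross-validated
    svm_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                     method = "svmRadial", metric = "ROC",
                     preProcess = c("center", "scale"),
                     tuneLength = 5, trControl = ctrl)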

Neural networks

Neural networks are non-linear models that involve creating a set of linear combinations of the original predictor variables and then using those as inputs into a hidden layer (or layers) of units, which then create new combinations of these inputs to finally output the probability of the event of interest (after a suitable transformation).4,5 A feed-forward multi-layer perceptron neural network was fit for this study. A penalty term, known as weight decay, and the number of hidden units in the model were determined using ten-fold cross-validation in order to maximize the training set AUC. The "caret" package in R with the "nnet" method was used to fit the neural network model.
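A sketch of the single-hidden-layer network via caret's "nnet" method, tuning the number of hidden units and the weight decay penalty over a hypothetical grid:

    nnet_grid <- expand.grid(size  = c(1, 3, 5, 10),         # hidden units
                             decay = c(0, 0.001, 0.01, 0.1)) # weight decay

    nnet_fit <- train(event_f ~ resp_rate + heart_rate, data = train,
                      method = "nnet", metric = "ROC",
                      preProcess = c("center", "scale"),
                      tuneGrid = nnet_grid, trControl = ctrl,
                      trace = FALSE, maxit = 500)  # passed through to nnet()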

REFERENCES

1. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer; 2001.

2. Singer JD, Willett JB. It's about time: using discrete-time survival analysis to study duration and the timing of events. Journal of Educational Statistics. 1993;18(2):155-195.

3. Efron B. Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the American Statistical Association. 1988;83(402):414-425.

4. Kuhn M, Johnson K. Applied Predictive Modeling. New York, NY: Springer; 2013.

5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.

6. Boulesteix AL, Janitza S, Kruppa J, König IR. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(6):493-507.

7. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189-1232.

eTables

eTable 1. Hospital and patient characteristics

Characteristic | Hosp A (n=83,948) | Hosp B (n=85,787) | Hosp C (n=52,610) | Hosp D (n=33,066) | Hosp E (n=14,588) | Total (n=269,999)
Dates included | 11/2008-1/2013 | 1/2006-11/2011 | 1/2006-11/2011 | 1/2006-11/2011 | 11/2009-11/2011 | -
Number of beds | 500 | 350 | 150 | 120 | 135 | -
Teaching status | Major teaching | Major teaching | Major teaching | Minor teaching | Non-teaching | -
Age, mean (SD), years | 54 (18) | 56 (21) | 71 (17) | 67 (18) | 71 (16) | 60 (20)
Black race, n (%) | 39,461 (47) | 7,744 (9) | 803 (2) | 1,098 (3) | 564 (4) | 49,670 (18)
White race, n (%) | 31,208 (37) | 42,076 (49) | 35,359 (67) | 21,840 (66) | 9,947 (68) | 140,430 (52)
Other/unknown race, n (%) | 13,279 (16) | 35,967 (42) | 16,448 (31) | 10,128 (31) | 4,077 (28) | 79,899 (30)
Female sex, n (%) | 48,021 (57) | 57,640 (67) | 30,218 (57) | 18,200 (55) | 8,202 (56) | 162,281 (60)

eTable 2. Model calibration in the validation dataset*

Model | HL chi-squared | HL p-value | Intercept | Slope
Random forest | 1503.8 | <0.001 | 2.2 | 1.42
GBM | 7.44 | 0.68 | -0.1 | 0.98
Bagged trees | 82.9 | <0.001 | -0.35 | 0.92
SVM | 30.4 | <0.001 | -0.32 | 0.94
Neural network | 120.3 | <0.001 | -0.55 | 0.89
Logistic regression (splines) | 987.5 | <0.001 | -1.8 | 0.65
KNN | 3489.4 | <0.001 | 2.3 | 1.32
Logistic regression (linear) | 25.0 | 0.005 | -0.56 | 0.89
Decision tree | 114.3 | <0.001 | -0.17 | 0.97

*A well-calibrated model should have a non-significant HL test, a slope near 1, and an intercept near 0.

Abbreviations: HL = Hosmer-Lemeshow goodness-of-fit test; GBM = gradient boosted machine; SVM = support vector machine; KNN = K-nearest neighbors
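For reference, a minimal sketch of how these calibration quantities can be computed in R from validation outcomes y and predicted probabilities p (simulated here purely for illustration), using the ResourceSelection package for the HL test and a logistic recalibration model for the intercept and slope:

    library(ResourceSelection)

    # Simulated stand-ins for validation outcomes and model predictions
    set.seed(2)
    p <- runif(5000, 0.01, 0.5)
    y <- rbinom(5000, 1, p)

    # Hosmer-Lemeshow test over g = 10 groups of predicted risk
    hoslem.test(y, p, g = 10)

    # Calibration intercept and slope: regress the outcome on the logit
    # of the predictions; ideal values are 0 and 1, respectively
    calib <- glm(y ~ qlogis(p), family = binomial)
    coef(calib)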

eTable 3. Association between the number of variables allowed to be considered at each split in the random forest model with model calibration*

Number of variables allowed | Intercept | Slope
3 | 0.037 | 1.27
5 | 0.023 | 1.20
9 | 0.011 | 1.12
15 | 0.009 | 1.05
20 | 0.006 | 1.03
25 | 0.003 | 1.01

*A well-calibrated model will have an intercept near 0 and a slope near 1.

eFigures

eFigure 1. Discrete-time data format for model development.*

*As shown, time is separated into discrete intervals (i.e., every eight hours for this study) and the predictor values closest to the beginning of each time block are used to predict whether the outcome occurs during that interval.

Abbreviations: SBP = systolic blood pressure, RR = respiratory rate, HR = heart rate
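A minimal illustration of this layout (all values hypothetical): each patient contributes one row per eight-hour block, and each row is treated as an independent binary observation by the models described in the eMethods.

    discrete_time <- data.frame(
      patient_id = c(1, 1, 1, 2, 2),
      block      = c(1, 2, 3, 1, 2),          # consecutive 8-hour intervals
      sbp        = c(118, 102, 94, 135, 130), # vitals closest to block start
      rr         = c(16, 22, 28, 14, 15),
      hr         = c(88, 104, 118, 72, 70),
      event      = c(0, 0, 1, 0, 0)           # outcome occurs in interval?
    )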

eFigure 2. Area under the receiver operating characteristic curves of the compared methods for cardiac arrest in the validation cohort.*

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS = Modified Early Warning Score; AUC = area under the receiver operating characteristic curve

eFigure 3. Area under the receiver operating characteristic curves of the compared methods for intensive care unit transfer in the validation cohort.*

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS = Modified Early Warning Score; AUC = area under the receiver operating characteristic curve

eFigure 4. Area under the receiver operating characteristic curves of the compared methods for death on the wards in the validation cohort.*

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS = Modified Early Warning Score; AUC = area under the receiver operating characteristic curve

eFigure 5. Partial plot of the interaction between heart rate, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP = systolic blood pressure (mm Hg)

eFigure 6. Partial plot of the interaction between heart rate, respiratory rate, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.

eFigure 7. Partial plot of the interaction between heart rate, age, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.

eFigure 8. Partial plot of the interaction between respiratory rate, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP = systolic blood pressure (mm Hg)

eFigure 9. Partial plot of the interaction between respiratory rate, age, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.

eFigure 10. Partial plot of the interaction between age, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP = systolic blood pressure (mm Hg)

eFigure 11. Calibration plot across deciles of risk for selected models in the validation dataset