Grinspan ZM et al. Predicting Frequent ED Use By People with Epilepsy with Health Information Exchange Data. Supplemental Material.

Appendix e-1.

Table of Contents

A. Description of Computational Techniques.

B. Definition of Calibration

C. Statistical Packages.

D. Additional Details on Comorbidities.

E. Details of the Lasso Model

F. ROC curves for 1-variable model

A. Description of Computational Techniques.

Logistic Regression, Best Subsets. We created models with 1, 2, and 3 variables, using logistic regression and the best subsets algorithm. Best subsets works by systematically examining all possible combinations of variables. This is computationally feasible when the number of variables is fewer than 30 or when the number of variables allowed in the model is restricted to a small number (as in this case).1
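As an illustration, the sketch below enumerates every combination of up to three predictors and fits a logistic regression to each, keeping the best fit. The data frame dat and the outcome column frequent_ed are hypothetical names, and selection by AIC is one reasonable criterion, not necessarily the one used in the study.

    ## Best-subsets logistic regression by exhaustive enumeration (sketch).
    ## Assumes a data frame `dat` whose column `frequent_ed` is the binary
    ## outcome and whose remaining columns are candidate predictors.
    best_subset_logistic <- function(dat, outcome = "frequent_ed", max_size = 3) {
      predictors <- setdiff(names(dat), outcome)
      best <- list(aic = Inf, model = NULL)
      for (k in seq_len(max_size)) {
        # systematically examine all k-variable combinations
        for (vars in combn(predictors, k, simplify = FALSE)) {
          fit <- glm(reformulate(vars, response = outcome),
                     data = dat, family = binomial)
          if (AIC(fit) < best$aic) best <- list(aic = AIC(fit), model = fit)
        }
      }
      best$model
    }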

Regularized Logistic Regression. The lasso1, 2 (Least Absolute Shrinkage and Selection Operator) and elastic net1, 3 are regularized regression techniques, in which the coefficients of the regression equation are penalized for growing large. This provides a useful trade-off: in exchange for introducing a small amount of bias into the predictions, (a) the variance of the predictions is reduced and (b) many of the coefficients become zero, limiting the number of variables in the final model. Both techniques require a regularization parameter λ, which we determined via cross-validation within the training set. The elastic net also requires a second parameter α, which we set to 0.4.1
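For illustration, a minimal glmnet sketch, assuming a numeric predictor matrix x and a binary outcome vector y from the training set, plus a hypothetical matrix x_new of novel cases:

    ## Lasso and elastic net fits with cross-validated lambda (sketch).
    library(glmnet)
    cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)    # lasso
    cv_enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.4)  # elastic net, alpha = 0.4
    # predicted probabilities for novel cases at the cross-validated lambda
    p_hat <- predict(cv_lasso, newx = x_new, s = "lambda.min", type = "response")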

Decision Trees and Variants. We used three decision tree algorithms. (1) CART (Classification and Regression Trees) is an automatic algorithm for constructing and pruning a decision tree.1, 4 (2) Random forests1, 5 builds a series of large trees from random subsets of the training data. Given a new example, each tree “votes” on the outcome, and the final prediction is the outcome with the majority of votes. (3) AdaBoost1, 6 builds a series of small trees, each designed to correct errors made by the earlier trees. As with random forests, each tree “votes” on the outcome for a novel example; the votes are weighted, based on characteristics of each tree, and the outcome with the largest weighted vote is the final prediction.
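A minimal sketch of the three tree-based fits, assuming hypothetical data frames train and test with a factor outcome frequent_ed:

    ## CART, random forest, and AdaBoost fits (sketch).
    library(rpart); library(randomForest); library(ada)
    fit_cart <- rpart(frequent_ed ~ ., data = train, method = "class")  # CART
    fit_rf   <- randomForest(frequent_ed ~ ., data = train)  # majority vote over large trees
    fit_ada  <- ada(frequent_ed ~ ., data = train)           # weighted vote over small trees
    pred_rf  <- predict(fit_rf, newdata = test)              # predicted class for novel cases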

Geometric. The support vector machine7 is a popular machine learning technique based on geometric principles. The algorithm represents every case as a point in a high-dimensional space, then constructs the hyperplane that best separates the outcomes. For a novel case, the predicted outcome is determined by the side of the hyperplane on which the case falls.
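A minimal sketch with the e1071 package, under the same hypothetical train/test data frames as above:

    ## Support vector machine fit (sketch).
    library(e1071)
    fit_svm  <- svm(frequent_ed ~ ., data = train, probability = TRUE)
    # the predicted class reflects the side of the hyperplane on which each case falls
    pred_svm <- predict(fit_svm, newdata = test)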

B. Definition of Calibration.

We defined “calibration” as a measurement of the error between predicted probability and observed proportion, calculated as follows (a minimal code sketch follows the list):

a) Split the individuals into 5 groups, based on evenly spaced bins of predicted probability (i.e., 0–20%, 20–40%, etc.).

b) Calculate the mean absolute difference between the predicted probability and the observed proportion of frequent ED use, across these 5 groups.
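A minimal sketch of this calculation, assuming vectors p_hat (predicted probabilities) and y (observed 0/1 outcomes); comparing each bin's mean predicted probability to its observed proportion is one natural reading of step (b):

    ## Calibration as mean absolute error across 5 probability bins (sketch).
    calibration_error <- function(p_hat, y) {
      # a) assign individuals to evenly spaced bins: 0-20%, 20-40%, ...
      bin <- cut(p_hat, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
      # b) mean absolute difference between predicted probability and
      #    observed proportion, across the bins
      mean(abs(tapply(p_hat, bin, mean) - tapply(y, bin, mean)), na.rm = TRUE)
    }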

C. Statistical Packages.

We used several additional packages to supplement the base R installation. These included the “data.table”,8 “car”,9 “ROCR”,10 “randomForest”,5 “ada”,6 “rpart”,4 “e1071”,7 “leaps”,11 “ggplot2”,12 and “glmnet”1, 2 packages.

D. Additional Details on Comorbidities.

Table e-1 contains detailed information on the prevalence of all 33 Jetté comorbidities, stratified by frequency of hospital use in year two.

E. Details of the Lasso Model.

Methods -- Lasso Model. We present the baseline probability and odds ratios for the parameters selected by the lasso technique. We intentionally did not present standard errors: the lasso introduces bias into the parameter estimates in order to reduce the variance of the predictions, so the parameters are better interpreted as a prediction recipe and should not be used for inference.
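As an illustration, odds ratios like those in Table e-2 can be read off a fitted glmnet object; the sketch below assumes the hypothetical cv_lasso fit from Section A:

    ## Nonzero lasso coefficients, expressed as odds ratios (sketch).
    library(glmnet)
    beta <- coef(cv_lasso, s = "lambda.min")   # sparse coefficient vector
    keep <- which(as.vector(beta) != 0)        # variables the lasso retained
    odds_ratios <- setNames(exp(as.vector(beta)[keep]), rownames(beta)[keep])
    # exp() of the intercept gives the baseline odds; the remaining entries
    # correspond to the odds ratios reported in Table e-2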

Results -- Lasso Model. The nine variables selected by the lasso technique included three health care utilization variables (ED visits, inpatient admissions, and outpatient visits), one measure of health care fragmentation (number of sites visited for ED care), and five specific comorbidities (depression, drug abuse, pulmonary disease, peptic ulcer disease, and fracture; Table e-2). In addition to demonstrating the lowest classification error (5.17%), the model tied for the highest AUC (0.88). Like the other models, it had a good positive predictive value (79%) but poor sensitivity (20%). Calibration was also good, with a mean absolute error of 10% (Table e-2).

Table e-2. Parameters of Lasso Model
Base Probability, ≥4 ED visits in Year two / 2.1%
Factor / Odds Ratio
Utilization and Fragmentation, Year one
ED Visits (per visit) / 1.17
Inpatient Admissions (per admission) / 1.08
Outpatient Visits (per visit) / 1.002
Sites visited for ED Care (per site) / 1.92
Comorbidities in Year one
Depression / 1.31
Drug Abuse / 1.24
Pulmonary Disease / 1.28
Peptic Ulcer Disease / 1.23
Fracture / 1.16
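To illustrate the “prediction recipe” reading of Table e-2, consider a hypothetical patient with 3 ED visits, 1 inpatient admission, 10 outpatient visits, 2 ED sites, and depression in year one. The sketch multiplies the baseline odds by each applicable odds ratio and converts back to a probability; it is a worked example, not a validated calculator.

    ## Applying Table e-2 as a prediction recipe (hypothetical example).
    base_odds <- 0.021 / (1 - 0.021)  # baseline probability of 2.1%, as odds
    odds <- base_odds * 1.17^3 * 1.08^1 * 1.002^10 * 1.92^2 * 1.31
    odds / (1 + odds)                 # predicted probability, about 0.15 here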

F. ROC curves for 1-variable model.

G. Additional References

1.Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction, 2nd ed. New York, NY: Springer, 2009.

2.Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 2010;33:1-22.

3.Yuan GX, Ho CH, Lin CJ. An Improved GLMNET for L1-regularized Logistic Regression. J Mach Learn Res 2012;13:1999-2030.

4.rpart: Recursive Partitioning [computer program]. R package version 4.1-4, 2013.

5.Liaw A, Wiener M. Classification and Regression by randomForest. R News 2002;2:18-22.

6.ada: an R package for stochastic boosting [computer program]. R package version 2.0-3, 2012.

7.e1071: Misc Functions of the Department of Statistics (e1071), TU Wien [computer program]. R package version 1.6-2, 2014.

8.data.table: Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping and list columns [computer program]. R package version 1.8.6, 2012.

9.Fox J, Weisberg S. An R Companion to Applied Regression, Second Edition. Thousand Oaks, CA: Sage, 2011.

10.Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics 2005;21:3940-3941.

11.Lumley T, Miller A. leaps: regression subset selection [computer program]. R package version 2.9, 2009.

12.Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer, 2009.