Figure A1. Study sample, inclusion, and exclusions.
Assessment of Model Performance
We used several criteria to compare the overall predictive value of alternative models.
Goodness-of-fit: How effectively a model describes the outcome variable is referred to as its goodness-of-fit. We used the following measures of goodness-of-fit. Deviance (D statistic) compares the fit of the fitted model with that of the saturated model; a small value indicates a good fit. To assess the significance of the non-linear terms, the values of D with and without the non-linear terms were compared by computing the deviance difference (G statistic). The Akaike information criterion (AIC) was used to account for model complexity. A difference in AIC >10 was considered significant [1].
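As a rough illustration of the deviance-difference test, the sketch below reuses the rounded values reported in Table A1. The degrees of freedom are not stated in the text; they are inferred here under the assumption that AIC = deviance + 2k, where k is the number of estimated parameters. The helper name `chi2_sf_even_df` is hypothetical and relies on a closed form that holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Rounded values taken from Table A1.
aic_linear, aic_spline = 5687, 5673
dev_linear, dev_spline = 5675, 5653

# Assumption (not stated in the text): AIC = deviance + 2k.
k_linear = (aic_linear - dev_linear) // 2
k_spline = (aic_spline - dev_spline) // 2
df = k_spline - k_linear          # extra parameters from the non-linear terms

g_reported = 21.7                 # G statistic as reported in Table A1
p_value = chi2_sf_even_df(g_reported, df)
delta_aic = aic_linear - aic_spline

print(f"df = {df}, P = {p_value:.4f}, AIC difference = {delta_aic}")
```

Under these assumptions the inferred degrees of freedom are 4 and the P value is close to the reported 0.0002. The rounded deviances give a difference of 22 rather than 21.7, which is consistent with rounding in the table.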
Discrimination: In survival analysis, discrimination capacity is quantified by Harrell’s C statistic. It measures the probability that a randomly selected person who developed an event at a specific time has a higher risk score than a randomly selected person who did not develop an event during the same follow-up interval [2]. The 95% confidence intervals for the Harrell’s C statistics of the different models were estimated with bootstrap resampling.
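The pairwise definition of Harrell’s C can be sketched as follows. This is a minimal illustration on made-up data, not the software used in the study; the function name `harrell_c` is hypothetical, and tied event times are simply ignored for brevity.

```python
def harrell_c(times, events, risk):
    """Harrell's C for right-censored data (simplified: tied times are skipped).

    A pair (i, j) is usable when subject i had an event and subject j was
    still under follow-up at a later time. The pair is concordant when the
    subject with the earlier event has the higher risk score.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # censored subjects cannot anchor a pair
        for j in range(n):
            if times[j] > times[i]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5     # tied risk scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

# Toy data (illustrative only): follow-up time, event indicator, risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
c_index = harrell_c(times, events, risk)
print(c_index)
```

A bootstrap confidence interval would repeat this calculation on resampled subjects and take percentiles of the resulting C values.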
Calibration: Calibration describes how closely predicted probabilities agree numerically with actual outcomes. We examined calibration using the test proposed by Nam and D’Agostino, which is very similar to the Hosmer-Lemeshow test [3].
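For orientation, the sketch below shows the commonly described form of the Nam-D’Agostino statistic, in which subjects are grouped by predicted risk and the Kaplan-Meier observed event probability in each group is compared with the mean predicted probability. The grouping scheme, the function name `nam_dagostino_chi2`, and the input numbers are assumptions made for illustration; the text does not describe the exact implementation.

```python
def nam_dagostino_chi2(groups):
    """Chi-square-type calibration statistic from per-group summaries.

    Each group is a tuple (n, km_observed, mean_predicted), where
    km_observed is the Kaplan-Meier event probability in the group and
    mean_predicted is the average model-predicted probability.
    """
    chi2 = 0.0
    for n, km_observed, mean_predicted in groups:
        variance = mean_predicted * (1.0 - mean_predicted)
        chi2 += n * (km_observed - mean_predicted) ** 2 / variance
    return chi2

# Hypothetical group summaries (illustrative numbers only).
example_groups = [
    (100, 0.12, 0.10),
    (100, 0.20, 0.20),
]
chi2_value = nam_dagostino_chi2(example_groups)
print(round(chi2_value, 4))
```

The statistic is usually referred to a chi-square distribution with degrees of freedom equal to the number of groups minus one; larger values indicate poorer agreement between predicted and observed risk.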
Index of determination or explained variation: The degree to which a model explains the variation in outcome in survival analyses is quantified by Royston’s R2. We used bootstrap resampling to obtain 95% CIs. R2 values were compared as suggested by Royston [4].
Table A1. Linear-terms-only vs. cubic spline model.
Measure / Linear-terms-only model / Cubic spline model*
Harrell’s C (95% CI) / 0.828 (0.808-0.849) / 0.833 (0.812-0.854)
Nam-D’Agostino χ2 (P value) / 12.1 (0.206) / 8.5 (0.476)
Royston’s R2 (95% CI) / 0.74 (0.69-0.79) / 0.75 (0.70-0.81)
AIC / 5687 / 5673
Deviance / 5675 / 5653
G statistic (P value for χ2 test) / 21.7 (0.0002)
* The model including non-linear terms was developed using restricted cubic splines.
AIC, Akaike information criterion.
Harrell’s C statistic is a measure of discrimination.
Nam-D’Agostino χ2 measures how closely predicted probabilities agree numerically with actual outcomes; χ2 > 20 suggests a lack of adequate calibration.
The degree to which a model explains variations in an outcome in survival analyses is quantified by Royston’s R2.
Akaike information criterion (AIC) was used as a measure of model fit. The lower the AIC, the better the model fit. A difference in AIC >10 was considered significant.
Deviance (D statistic) measures how well the model fits the data. To assess the significance of the non-linear terms, the values of D for the models with and without the non-linear terms were compared; this difference is referred to as the deviance difference (G statistic).
Table A2. Predictability of IGT and FPG levels ≥5.05 mmol·l⁻¹ for diabetes.
Measure / IGT / FPG ≥5.05 mmol·l⁻¹
Sensitivity (%) / 55.6 (50.3-60.7) / 82.1 (77.8-85.9)
Specificity (%) / 89.1 (88.3-89.9) / 58.3 (57.0-59.6)
Area under the ROC curve / 0.723 (0.698-0.749) / 0.702 (0.681-0.723)
Likelihood ratio positive / 5.1 (4.54-5.74) / 1.97 (1.86-2.08)
Likelihood ratio negative / 0.499 (0.445-0.559) / 0.307 (0.246-0.382)
Positive predictive value (%) / 25.2 (22.2-28.3) / 11.5 (10.3-12.8)
Negative predictive value (%) / 96.8 (96.3-97.3) / 98.0 (97.5-98.5)
Brier score (P value*) / 0.056 (0.758) / 0.052 (0.756)
Values in parentheses are 95% confidence intervals unless stated otherwise.
FPG, fasting plasma glucose; IGT, impaired glucose tolerance; ROC, receiver-operating characteristics.
* Derived from Spiegelhalter test
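As a consistency check, the likelihood ratios in Table A2 can be recovered from the reported sensitivity and specificity using the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The sketch below uses the rounded point estimates from the table; the function name `likelihood_ratios` is hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative

# Point estimates from Table A2, converted from percentages to proportions.
igt_lr_pos, igt_lr_neg = likelihood_ratios(0.556, 0.891)
fpg_lr_pos, fpg_lr_neg = likelihood_ratios(0.821, 0.583)

print(f"IGT: LR+ = {igt_lr_pos:.2f}, LR- = {igt_lr_neg:.3f}")
print(f"FPG: LR+ = {fpg_lr_pos:.2f}, LR- = {fpg_lr_neg:.3f}")
```

The results agree with the tabulated likelihood ratios up to rounding of the inputs.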
In the diagnostic setting, where the outcome is already known (though unknown to the investigator), discrimination, assessed by either the C statistic or the area under the receiver operating characteristic (ROC) curve (AROC), is of most interest for classification into groups with or without prevalent disease. In prognostication, however, the outcome has not yet developed at the time the predictors are assessed [5]. The performance of risk prediction thus needs to be examined using more global measures of fit, such as the Brier score, that combine calibration and discrimination [6]. Using the Brier score, we compared the predictive performance of the suggested FPG cut-off point and IGT. The Brier score is an aggregate measure of disagreement between the observed outcome and a prediction. A perfect prediction rule would have a Brier score of zero. Spiegelhalter’s z statistic was calculated to test whether an individual Brier score is low enough [6]. Sensitivity, specificity, likelihood ratios, AROC, and predictive values were also reported for the two criteria.
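The Brier score and Spiegelhalter’s z statistic can be sketched as below. This is a minimal illustration on made-up predictions, using the usual textbook forms of both quantities; the function names and the two-sided P value convention are assumptions, and the study’s own software and settings are not described in the text.

```python
import math

def brier_score(predicted, observed):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(predicted)

def spiegelhalter_z(predicted, observed):
    """Spiegelhalter's z statistic; approximately standard normal under good calibration."""
    numerator = sum((y - p) * (1.0 - 2.0 * p) for p, y in zip(predicted, observed))
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in predicted)
    return numerator / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided P value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy predictions and outcomes (illustrative only).
predicted = [0.1, 0.2, 0.8, 0.9]
observed = [0, 0, 1, 1]

bs = brier_score(predicted, observed)
z = spiegelhalter_z(predicted, observed)
p = two_sided_p(z)
print(f"Brier = {bs:.3f}, z = {z:.3f}, P = {p:.3f}")
```

A non-significant P value, as reported for both criteria in Table A2, gives no evidence that the predictions are miscalibrated.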
REFERENCES
[1] Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr AC-19: 716-723
[2] Harrell FE, Jr., Lee KL, Mark DB (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 15: 361-387
[3] D'Agostino RB, Nam BH (2004) Evaluation of the performance of survival analysis models: Discrimination and Calibration measures. In: Balakrishnan N, Rao CR (eds) Handbook of Statistics, Survival Methods. Elsevier B.V., Amsterdam, The Netherlands, pp 1-25
[4] Royston P (2006) Explained variation for survival models. Stata Journal 6: 83-96
[5] Cook NR (2007) Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 115: 928-935
[6] Spiegelhalter DJ (1986) Probabilistic prediction in patient management and clinical trials. Stat Med 5: 421-433