Appendix 1. Detailed description of statistical analyses

Standard descriptive statistics were used to describe the study population and to compare the distribution of routine clinicopathological factors between 70-gene signature high and low risk patients. Analyses of time to LRR were performed using competing risk analyses so as not to overestimate the absolute LRR risk (18). For this, follow-up time started at diagnosis and ended at the first manifestation of LRR (event) or death (competing event), or at the end of follow-up without LRR or death (censored). Follow-up beyond 10 years was limited and was therefore truncated at 10 years. Occurrences of distant metastasis, contralateral breast cancer, or second primary tumors were considered neither censoring events nor competing risks. The univariable 5- and 10-year absolute risks of LRR for the 70-gene signature high and low risk groups were estimated using the cumulative incidence function (19) and compared using Gray’s test (20). This was done for the entire cohort and in subgroups according to primary locoregional treatment (BCT, or mastectomy with or without radiotherapy). Multivariable analyses were performed using Fine and Gray competing risk regression (21). Following univariable regression analyses, a multivariable model was constructed consisting solely of traditional clinicopathological factors and treatment. To this model the 70-gene signature (high vs low risk) was added to evaluate its additional and independent prognostic value. The model based on clinicopathological factors consisted of age (continuous), grade (2 vs 1 and 3 vs 1), tumor size (continuous), estrogen receptor (positive vs negative), number of tumor-involved lymph nodes (continuous), surgery (mastectomy vs BCT), adjuvant chemotherapy, endocrine therapy, and radiotherapy (each yes vs no), and radiotherapy boost (yes vs no).
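To illustrate why a competing risk approach avoids overestimating the absolute LRR risk, the following is a minimal Python sketch of a nonparametric cumulative incidence estimate (the Aalen-Johansen form). It is not the code used for the analyses, which were run in R. The function name, the status coding (0 = censored, 1 = LRR, 2 = death without LRR), and the five-patient toy data are assumptions made purely for illustration.

```python
def cumulative_incidence(times, statuses, cause=1):
    """Nonparametric cumulative incidence of `cause` with competing risks.

    Hypothetical status coding: 0 = censored, 1 = LRR, 2 = death without LRR.
    Returns a list of (time, cumulative incidence) at each distinct event time.
    """
    data = sorted(zip(times, statuses))
    n_at_risk = len(data)
    event_free = 1.0  # probability of being free of ANY event just before t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_here = d_any = d_cause = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            status = data[i][1]
            n_here += 1
            d_any += status != 0
            d_cause += status == cause
            i += 1
        if d_any > 0:
            # Increment: P(event-free before t) * cause-specific hazard at t.
            cif += event_free * d_cause / n_at_risk
            event_free *= 1 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= n_here
    return curve


# Toy data (made up): follow-up in years and status per patient.
toy_times = [1, 2, 3, 4, 5]
toy_status = [1, 2, 0, 1, 0]

competing = cumulative_incidence(toy_times, toy_status)
# Naive alternative: treat deaths as censored, which reduces to 1 - Kaplan-Meier.
naive = cumulative_incidence(toy_times, [0 if s == 2 else s for s in toy_status])
print(round(competing[-1][1], 2), round(naive[-1][1], 2))
```

In this toy example the competing risk estimate at the last event time is about 0.50, whereas treating deaths as censored gives about 0.60, which is the kind of overestimation the competing risk analysis is meant to avoid.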

At least one routine clinicopathological factor was missing for 24% of patients. Because analyses that leave out patients with missing data are less efficient and may lead to biased results, multiple imputation by chained equations was used to account for missing data (10 imputed datasets, 25 iterations, with satisfactory convergence). The imputation model included all available clinicopathological factors as well as patient outcome data (25). All regression analyses were performed separately in each imputed dataset and then combined using Rubin’s rules (29).
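The pooling step can be sketched as follows. This is a minimal Python illustration of Rubin’s rules for a single coefficient, not the code used in the study. The function name and the example numbers are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed datasets with Rubin's rules.

    estimates: per-dataset point estimates (e.g. log hazard ratios)
    variances: per-dataset squared standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(variances) / m                                  # mean within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Made-up log hazard ratios and variances from three imputed datasets.
example_estimate, example_se = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.04, 0.04])
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ between imputed datasets, which reflects the extra uncertainty due to the missing data.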

Evaluation of the linearity assumption with restricted cubic splines showed that a spline for age significantly improved the model based on clinicopathological factors (pseudo-likelihood ratio (pLR) test P=0.049), and this spline was retained in all analyses. Tumor size and number of involved lymph nodes showed no departure from linearity. As previous analyses of the 70-gene signature in relation to distant recurrence risk and survival suggested that its prognostic value may be highest in the first five years after diagnosis (10), a time-covariate interaction (t ≥ 5 years × 70-gene signature high risk) was added to the primary model extended with the 70-gene signature. This significantly improved the model (pLR test P=0.019), and the interaction was retained. We specifically chose the time-split at 5 years because of its clinical relevance and refrained from searching for the best-fitting time-covariate interaction because of sparse data. The above regression analyses were performed in the entire cohort. There was no indication that the effect of the 70-gene signature on LRR risk differed between subgroups according to primary locoregional treatment or between patients aged ≥50 and <50 years (multiplicative interaction terms were not statistically significant). These interaction terms were therefore not included in the final models.
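Written out, a step-function time-covariate interaction of this kind corresponds to a subdistribution hazard model of roughly the following form. The symbols are notation introduced here for illustration only: $x$ collects the clinicopathological and treatment covariates (including the spline terms for age), $G$ equals 1 for a 70-gene signature high risk result and 0 otherwise, $t$ is time since diagnosis in years, and $\mathbb{1}\{\cdot\}$ is an indicator function.

```latex
\lambda(t \mid x, G) = \lambda_0(t)\,
  \exp\!\bigl(\beta^{\top} x + \gamma_1 G + \gamma_2\, G\, \mathbb{1}\{t \ge 5\}\bigr)
```

Under this form, the adjusted subdistribution hazard ratio for high versus low risk is $\exp(\gamma_1)$ during the first 5 years and $\exp(\gamma_1 + \gamma_2)$ thereafter.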

Besides assessing the independent prognostic value of the 70-gene signature for LRR (i.e. the evaluation of the adjusted hazard ratio (HR)), the combined prognostic performance of the multivariable models was evaluated with regard to discrimination (Harrell’s C-index adapted to competing risk analyses (18)) and calibration. A C-index of 1 indicates perfect discrimination (i.e. all patients with LRR have a higher predicted recurrence risk than those without), whereas a C-index of 0.5 indicates discrimination no better than that of predictions based on the average LRR risk alone. Calibration was assessed by plotting model-predicted versus observed risk. Improvement in model fit upon addition of the 70-gene signature was tested with the pLR test, improvement in discrimination by the change in C-index, and reclassification by net reclassification improvement (NRI) measures adapted to time-to-event data with censoring (22). NRI measures show the net percentage of correct movement (in predicted probability or risk category) for those with events (up) and for those without events (down).
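The idea of net correct movement can be made concrete with a simplified Python sketch. It computes a category-free NRI for a fully observed binary outcome and deliberately ignores censoring, so it is a simplification of the censoring-adapted measures used in the analyses (22). The function name and the toy risks are hypothetical.

```python
def net_reclassification(old_risk, new_risk, event):
    """Category-free NRI components for a binary outcome without censoring.

    old_risk / new_risk: predicted risks from the model without / with the marker
    event: 1 if the patient had the event, 0 otherwise
    Returns (NRI among events, NRI among non-events).
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, had_event in zip(old_risk, new_risk, event):
        if had_event:
            n_e += 1
            up_e += new > old      # correct movement for events is upward
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old    # correct movement for non-events is downward
    return (up_e - down_e) / n_e, (down_n - up_n) / n_n


# Made-up predicted risks for six patients.
toy_old = [0.10, 0.20, 0.30, 0.40, 0.25, 0.15]
toy_new = [0.20, 0.30, 0.20, 0.30, 0.35, 0.10]
toy_event = [1, 1, 0, 0, 0, 1]
nri_events, nri_nonevents = net_reclassification(toy_old, toy_new, toy_event)
```

In the toy data, two of three patients with events move up and one moves down, and two of three patients without events move down and one moves up, so both components equal one third.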

Discrimination, calibration and NRI measures were evaluated at 5 and 10 years following diagnosis, for the entire cohort, as well as in primary locoregional treatment subgroups (using predicted probabilities derived from the models fitted in the entire cohort). Analyses were performed using R version 3.0.1. All statistical tests were two-sided with a cut-off for statistical significance of 5%, without accounting for multiple testing. Estimates are reported together with 95% confidence intervals. For the C-index and NRI measures, 2000-fold bootstrapping was used for statistical testing and standard error estimation.
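The bootstrap step can be sketched generically as follows. This is a minimal Python illustration of a nonparametric bootstrap standard error for an arbitrary statistic, not the procedure as implemented in the study. The function name, the fixed seed, and the use of the mean as a stand-in for the C-index or NRI are assumptions for illustration.

```python
import random
import statistics


def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Nonparametric bootstrap standard error of `statistic`.

    data: list of observations (in practice, one record per patient)
    statistic: function mapping a resampled list to a single number
    n_boot: number of bootstrap resamples (2000 mirrors the text above)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Resample patients with replacement, keeping the original sample size.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


# Made-up values; the mean stands in for a performance measure such as the C-index.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8]
toy_se = bootstrap_se(toy_values, statistics.mean)
```

For a real performance measure, `statistic` would refit or re-evaluate the measure on each resampled set of patients, and percentiles of the replicates could be used for confidence intervals.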