Evaluating the risk of ovarian cancer before surgery using the ADNEX model: a prospective multicenter external validation study
Sayasneh A*,1,2, Ferrara L*,2,3, De Cock B4, Saso S1,2, Al-Memar M2, Johnson S5, Kaijser K4, Carvalho J2, Husicka R2, Smith A6, Stalder C2, Ettore G3, Van Calster B4, Timmerman D4, Bourne T1,2,4
*: The authors consider that the first two authors should be regarded as joint First Authors.
1: Department of Surgery and Cancer, Hammersmith Campus, Imperial College London, Du Cane Road, London W12 0HS, UK.
2: Early Pregnancy and Acute Gynecology Unit, Queen Charlotte's and Chelsea Hospital, Imperial College London, Du Cane Road, London W12 0HS, UK.
3: Department of Obstetrics and Gynecology - Garibaldi Nesima Hospital, Catania, Italy.
4: Department of Development and Regeneration, KU Leuven, Leuven, Belgium.
5: Southampton University Hospitals, Princess Anne Hospital, Southampton, UK, SO16 5YA.
6: Ultrasound Scan Department, Queen Charlotte's and Chelsea Hospital, Imperial College London, Du Cane Road, London W12 0HS, UK.
Corresponding author: Mr. Ahmad Sayasneh
Locum Consultant Gynecological Oncologist, Guy's and St Thomas' Hospital, and Honorary Senior Clinical Lecturer, Imperial College London.
Department of Surgery and Cancer
Hammersmith Campus
Imperial College London
Du Cane Road
London
W12 0HS
Email:
Tel: 00442083835131
Fax: 00442083835115
Running title: characterizing ovarian masses using ADNEX multiclass model
Key words: Diagnostic imaging, ovarian neoplasm, statistical models, ultrasonography
Abstract (250 words max limit)
PURPOSE: To externally validate the International Ovarian Tumor Analysis (IOTA) ADNEX (Assessment of Different NEoplasias in the adneXa) model for the multiclass characterization of ovarian masses. The secondary aim was to assess the performance of the ADNEX model when used by level II ultrasound examiners with varied training and experience.
EXPERIMENTAL DESIGN: This was a cross-sectional multicenter cohort study for diagnostic accuracy. Patients were recruited from three cancer centers (two in the UK and one in Italy). Patients with an ovarian mass underwent transvaginal ultrasonography. Only patients who had a histological diagnosis of surgically removed tissue were included. The diagnostic performance of the ADNEX model with and without CA125 was calculated.
RESULTS: 610 women were included in the final analysis. The prevalence of malignancy was 30% (182), with 7% borderline tumors, 8% stage I primary ovarian cancers, 11% stage II-IV primary ovarian cancers and 4% secondary metastatic cancers. The area under the curve (AUC) of the ADNEX model for differentiating between benign and malignant masses was 0.937 (95% CI: 0.915-0.954) when CA125 was included, and 0.925 (95% CI: 0.902-0.943) when CA125 was excluded. The ADNEX model showed good discrimination between the different subtypes (benign, borderline, stage I primary cancer, stage II-IV primary cancer and secondary metastatic cancer).
CONCLUSION: The ADNEX model retains its performance on external validation. Furthermore, the model performs well in the hands of ultrasound examiners with varied training and experience.
Introduction
According to the latest statistics from the National Cancer Institute in the USA, there were 12.1 new cases of ovarian cancer per 100,000 women per year between 2008 and 2012, with a mortality of 7.7 per 100,000 women (1). The overall five-year survival is estimated to be around 45.6% for all stages of the disease (1). However, for early localized ovarian cancers the five-year survival exceeds 90% (1). A combination of early diagnosis and centralized management is thought to be key to optimizing survival (1-3). For early diagnosis, trials to evaluate ovarian cancer screening have not been successful (4, 5). Recently, the United Kingdom Collaborative Trial of Ovarian Cancer Screening showed that screening using the risk of ovarian cancer algorithm (ROCA) doubled the number of detected primary invasive epithelial ovarian or tubal cancers (iEOCs) compared with a fixed cutoff of CA 125 (6). However, until the follow-up of these patients is complete, the impact of screening on ovarian cancer mortality will not be known (6).
A further important aspect of clinical management is that an accurate diagnosis is made when a woman presents with an ovarian mass. The International Ovarian Tumor Analysis (IOTA) group has developed and validated models and rules to characterize ovarian masses as benign or malignant (7-11). These models and rules have also been validated in the hands of less experienced (level II) ultrasound examiners (12).
The IOTA group has developed the multiclass ADNEX model, which can differentiate between benign tumors, borderline tumors, early stage primary cancers, late stage primary cancers (stage II to IV) and metastatic cancers (validation area under the receiver operating characteristic curves (AUCs) between 0.85 and 0.99). This model should make the management of ovarian masses more efficient, as it allows patients to be triaged to the correct management pathway, whether for conservative follow-up, surgery at a general gynecology unit, or management at high volume specialized cancer centers. Correctly classifying the subtype of malignancy is also of critical importance, as borderline ovarian tumors and early stage ovarian cancers can be treated less aggressively, leading to the possibility of fertility preservation in younger women (13, 14). On the other hand, metastatic ovarian cancers should be managed according to the origin of the primary cancer (14).
ADNEX is based on three clinical and six ultrasound parameters (15). The model was developed and temporally validated using parameters collected by experienced (or level III) ultrasound examiners (15, 16). The primary aim of this project was to externally validate the ADNEX model. The secondary aim was to assess the performance of the model by level II examiners with varied training (MDs and sonographers) (15, 16).
Methods
Settings and design
This was a cross-sectional multicenter cohort study for diagnostic accuracy. Data were collected prospectively, including the ultrasound variables required for the ADNEX model, from transvaginal ultrasound examinations performed by level II ultrasound examiners (ref for level II). Results using the ADNEX model were calculated by a single investigator (AS) using a dedicated Excel spreadsheet. The final histological outcome was added to the same spreadsheet at a later date, when results became available. Accordingly, the ultrasound examiners and the investigator calculating the result of the ADNEX model were blind to the results of the reference test. Patients were recruited from three cancer centers (Queen Charlotte's and Chelsea Hospital (QCCH), London, UK; Princess Anne Hospital (PAH), Southampton, UK; Garibaldi Nesima Hospital (GNH), Catania, Italy). The study was approved as a service evaluation audit at the UK centers and as a validation study by the hospital authority at the Italian center. The guidelines of the STARD (Standards for Reporting of Diagnostic Accuracy) initiative were used (17). Patients were recruited consecutively from September 2010 to November 2014 at QCCH, May 2012 to May 2014 at PAH and September 2012 to February 2015 at GNH. All patients from GNH and 12 patients from QCCH were recruited into the IOTA 5 study ( Patients at QCCH and PAH were also recruited to the IOTA 4 study (12). All ultrasound examiners received a half-day theoretical training session on IOTA terminology and the ultrasound variables included in IOTA models. Transvaginal ultrasonography was performed using the standardized approach previously published by the IOTA group (11, 18). Transabdominal ultrasonography was undertaken when a large mass could not be fully evaluated transvaginally (11).
Participants and data collection
The inclusion criteria were patients presenting with at least one adnexal mass who underwent transvaginal ultrasonography at one of the participating centers. For bilateral adnexal masses, the mass with the most complex ultrasound features was included (11, 18). If both masses had similar ultrasound morphology, the largest mass or the one most easily accessible by ultrasonography was included (11).
The exclusion criteria were (i) pregnancy, (ii) examination by a consultant, (iii) refusal of transvaginal ultrasonography, (iv) cytology rather than histology as an outcome, and (v) failure to undergo surgery within 120 days of the ultrasound examination.
The NHS Caldicott report guidelines were followed in all steps of data handling (19). At QCCH and GNH, a secure electronic data-collection system was used (Astraia Software, Munich, Germany). A unique identifier was generated automatically for each patient's record. Dedicated data collection forms and Excel sheets were used at PAH. Serum CA125 was measured at the clinician's discretion or according to clinical practice in each center, using the Abbott Architect CA125 II immunoassay kit (Abbott Park, IL, USA) at QCCH and GNH, and the UniCel DxI Immunoassay System (Beckman Coulter Inc., Brea, CA, USA) at PAH.
The ADNEX model
The Assessment of Different NEoplasias in the adneXa (ADNEX) model contains three clinical and six ultrasound predictors: age (in years), serum CA-125 level (U/mL), type of center (oncology centers vs. other hospitals), maximum diameter of the lesion (in millimeters), proportion of solid tissue, more than 10 cyst locules (yes or no), number of papillary projections (0, 1, 2, 3 or more than 3), acoustic shadows (yes or no), and ascites (yes or no) (15). The ADNEX model is available online and in mobile applications ( (15). The ADNEX model can still be calculated without including the serum CA125 value. In this study we calculated the performance of the model with and without CA125.
Reference tests
The reference standard was the histopathological diagnosis of the mass after surgical removal. The excised tissues underwent histological examination at the local center. Tumors were classified according to the WHO (World Health Organization) classification of tumors, and malignant tumors were staged according to the FIGO (International Federation of Gynecology and Obstetrics) criteria (20, 21). Histological classification was performed without knowledge of the ADNEX results. The final diagnosis was categorized into five types: benign, borderline, stage I invasive, stage II-IV invasive, and secondary metastatic cancer.
Statistical Analysis
There were missing values for serum CA-125 and for whether there were more than 10 cyst locules (loc10). Missing values were handled differently for these two variables. The number of missing values for loc10 was small (2%), so these were dealt with using single stochastic imputation based on logistic regression. Missing loc10 values were predicted by a logistic regression model with Firth correction with the following predictors: age, maximum diameter of the lesion, proportion of solid tissue, number of papillations, presence of acoustic shadows, ascites, type of ovarian tumor and type of operator.
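The stochastic step of this imputation can be sketched as follows. This is a minimal illustration, assuming the predicted probabilities have already been obtained from the imputation model; the Firth-corrected logistic regression itself is not reproduced here, and the function name, seed and example probabilities are hypothetical.

```python
import random


def impute_binary(predicted_probs, rng):
    """Single stochastic imputation of a yes/no variable.

    Each missing value is drawn from a Bernoulli distribution with the
    model-predicted probability, instead of being rounded to 0 or 1.
    """
    return [1 if rng.random() < p else 0 for p in predicted_probs]


# Hypothetical predicted probabilities of "more than 10 locules" for
# three patients with a missing loc10 value.
rng = random.Random(2015)  # fixed seed so the draw is reproducible
imputed = impute_binary([0.05, 0.50, 0.90], rng)
print(imputed)
```

Drawing the value at random, rather than assigning the most likely category, keeps some of the uncertainty about the missing observation in the completed data.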
The missing serum CA-125 values were handled with multiple stochastic imputation using predictive mean matching regression. Since the distribution of serum CA-125 was heavily skewed, the log-log transformation of CA-125 was used (i.e. log(log(CA-125))). In this imputation model, age, maximum diameter of the lesion, proportion of solid tissue, loc10, number of papillations, presence of acoustic shadows, ascites, type of ovarian tumor, hospital and operator type were used as predictors. Using this approach, the missing values were replaced by 100 plausible values, leading to 100 completed data sets. Imputed values were back transformed to the original scale. For the ADNEX model with CA-125, each of the 100 completed data sets was analyzed separately and the results were combined using Rubin's Rules (22). Supplementary table 1 illustrates the numbers of missing values for each of the study centers.
External validation of the ADNEX model with and without CA-125 was performed by evaluating the model's performance for discrimination and by evaluating its calibration performance. The area under the receiver operating characteristic curve (AUC) was calculated for the basic discrimination between benign and malignant tumors, as well as for each pair of tumor types using the conditional risk method (23). In addition, the polytomous discrimination index was calculated (24), which estimates the average proportion of correctly classified patients by the model when presented with five patients, one with each tumor type. Sensitivity and specificity were calculated using 1%, 5%, 10%, 15%, 20% and 30% cutoffs for the total risk of malignancy (i.e. the sum of the estimated risks of the four malignant subtypes). Calibration of the predicted probabilities was assessed through use of calibration plots. These plots show the relation between the observed and predicted probabilities for malignant tumors.
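The transformation and pooling steps can be sketched as follows. This is a minimal illustration of Rubin's Rules for a single scalar estimate, not the code used in the study. The function names and the example numbers are hypothetical, and the log-log transformation assumes CA-125 values above 1 U/mL so that the inner logarithm is positive.

```python
import math
from statistics import mean, variance


def loglog(ca125):
    """Forward transform log(log(CA-125)); requires CA-125 > 1 U/mL."""
    return math.log(math.log(ca125))


def inv_loglog(z):
    """Back transform an imputed value to the original U/mL scale."""
    return math.exp(math.exp(z))


def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and their squared standard errors.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed data sets")
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates and variances from three imputed data sets.
est, var = pool_rubin([2.70, 2.72, 2.68], [0.010, 0.011, 0.009])
print(round(est, 3), round(var, 5))
```

The between-imputation term is what widens the confidence interval to reflect uncertainty about the missing CA-125 values.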
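The cutoff-based calculation can be sketched as follows. This is a minimal illustration under stated assumptions: a mass is classified as malignant when its total risk is greater than or equal to the cutoff (the source does not say whether the comparison is strict), and the six example patients are hypothetical.

```python
def total_malignancy_risk(subtype_risks):
    """Sum the predicted risks of the four malignant subtypes."""
    return sum(subtype_risks)


def sensitivity_specificity(risks, is_malignant, cutoff):
    """Classify as malignant when total risk >= cutoff, then compare with histology."""
    tp = fn = tn = fp = 0
    for risk, malignant in zip(risks, is_malignant):
        predicted = risk >= cutoff
        if malignant:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical patients: (borderline, stage I, stage II-IV, metastatic) risks
# and whether histology showed a malignant tumor.
patients = [
    ((0.01, 0.01, 0.00, 0.00), False),
    ((0.05, 0.04, 0.02, 0.01), False),
    ((0.02, 0.01, 0.01, 0.00), False),
    ((0.10, 0.20, 0.40, 0.05), True),
    ((0.08, 0.05, 0.03, 0.02), True),
    ((0.30, 0.03, 0.01, 0.01), True),
]
risks = [total_malignancy_risk(r) for r, _ in patients]
labels = [m for _, m in patients]

for cutoff in (0.01, 0.05, 0.10, 0.15, 0.20, 0.30):
    sens, spec = sensitivity_specificity(risks, labels, cutoff)
    print(f"cutoff {cutoff:.0%}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy data, raising the cutoff trades sensitivity for specificity, which is the pattern the cutoff table is designed to show.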
Results
During the study period, 751 women underwent ultrasonography for a pelvic mass and went through the surgical management pathway. Of these, 141 women were excluded from the final analysis for the following reasons: 65 women were examined by a consultant, 26 women had no histology result (14 only cytology, 12 no cytology or histology), 24 women had surgery >120 days after the characterizing ultrasound scan, 15 women were pregnant, 5 women only had a transabdominal scan, and 5 women had no surgery performed (declined or were not medically fit). Finally, 1 woman who had a recurrence of cervical cancer in the pelvis a few years after radical hysterectomy and underwent a bilateral salpingo-oophorectomy was excluded, as the tumor was not considered adnexal. In the final analysis 610 women were included (figure 1). Supplementary table 2 illustrates the detailed numbers of excluded and included cases for each center. The prevalence of malignancy was 30% (182), with 7% (42) borderline tumors, 8% (47) stage I primary ovarian cancers, 11% (69) stage II-IV primary ovarian cancers and 4% (24) secondary metastatic cancers. Supplementary table 3 illustrates the prevalence of all tumor subtypes for each center. The median age was 47 (IQR: 34-61), with 352 (58%) premenopausal and 258 (42%) postmenopausal women. Table 1 illustrates the distribution of the ADNEX descriptive parameters among the tumor subtypes for all patients. Supplementary tables 4, 5 and 6 illustrate the distribution of the ADNEX descriptive parameters among the tumor subtypes in each study center.
The calibration plots for the ADNEX model with and without CA-125 are presented in figures 2 and 3. The results indicate near perfect agreement between observed and predicted probabilities for the model with CA-125. Hence, the predicted probabilities of a malignant tumor almost perfectly correspond to the observed probabilities. In comparison, the model without CA-125 is less well calibrated. As can be seen in figure 2, small risks are underestimated and high risks are overestimated. In addition, both models show a small overall overestimation of the risk of malignancy.
A high AUC for the diagnostic performance of the ADNEX model to differentiate between benign and malignant masses was observed whether CA125 was included (0.937, 95% CI: 0.915-0.954) or excluded from the model calculation (0.925, 95% CI: 0.902-0.943) (see figure 4 and table 2). The model with CA-125 showed slightly better performance (a difference of 0.012 (95% CI: 0.011-0.013)). Subgroup analyses showing the AUC results for each center, for pre- vs. postmenopausal women, and for doctors compared with sonographers are shown in table 2.
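The counts reported in this paragraph can be checked with a short calculation. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
screened = 751
exclusions = {
    "examined by a consultant": 65,
    "no histology result": 26,
    "surgery more than 120 days after the scan": 24,
    "pregnant": 15,
    "transabdominal scan only": 5,
    "no surgery performed": 5,
    "tumor not considered adnexal": 1,
}
malignant_subtypes = {
    "borderline": 42,
    "stage I": 47,
    "stage II-IV": 69,
    "secondary metastatic": 24,
}

excluded = sum(exclusions.values())
included = screened - excluded
malignant = sum(malignant_subtypes.values())
prevalence = malignant / included

print(f"excluded {excluded}, included {included}")
print(f"malignant {malignant} ({prevalence:.1%})")
for subtype, n in malignant_subtypes.items():
    print(f"{subtype}: {n} ({n / included:.1%})")
```

The exclusions sum to 141, leaving 610 women, and the four malignant subtypes sum to 182, which rounds to the reported 30% prevalence.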
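For readers less familiar with the AUC, it can be read as the probability that a randomly chosen malignant mass receives a higher predicted risk than a randomly chosen benign mass. The sketch below computes it by direct pairwise comparison, counting ties as one half; the function name and the example risks are hypothetical, and this is not the code used in the study.

```python
def auc(malignant_risks, benign_risks):
    """Proportion of malignant-benign pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for m in malignant_risks:
        for b in benign_risks:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malignant_risks) * len(benign_risks))


# Hypothetical total malignancy risks.
example_auc = auc([0.75, 0.18, 0.35], [0.02, 0.12, 0.04, 0.18])
print(round(example_auc, 3))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of benign and malignant masses.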
Table 3 presents the specificity and sensitivity of both models when different cut-offs were used. When a cutoff of 1% was used, the models both with and without CA125 correctly classified all patients with malignant tumors, although at the cost of the specificity being extremely low. When higher cutoff values were used, sensitivity became slightly lower and there was a sharp increase in specificity. When a cutoff of 30% was used and CA125 was included, a relatively high sensitivity was still achieved (86.3%) with a specificity of 83.9% (table 3).
When tumors were classified into benign, borderline, stage I invasive, stage II-IV invasive, and secondary metastatic, the model showed good discrimination between the different subtypes, although this varied depending on how they were paired (table 4). For example, discrimination between benign and stage II-IV tumors was near perfect for the model with CA-125. In comparison, the model had more difficulty discriminating between borderline and stage I tumors, though its performance was still good. The polytomous discrimination index (PDI) showed that the model, when presented with five patients, one with each tumor type, correctly identified a fair proportion of patients. The performance of the model with CA-125 was approximately 3 times better than random performance (random PDI = 1/k = 0.2, with k = 5 being the number of categories) (table 4).
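The idea behind the PDI can be sketched with a brute-force calculation. This is an illustration under stated assumptions, not the study's implementation: every set containing one patient from each category is enumerated, a category counts as correctly identified when its own patient has the highest predicted probability for that category within the set, and ties share the credit equally. The function name and the example data are hypothetical, and three categories are used to keep the example short.

```python
from itertools import product


def pdi(cases_by_class):
    """cases_by_class[i] lists probability vectors of patients whose true class is i."""
    k = len(cases_by_class)
    credit = 0.0
    n_sets = 0
    for combo in product(*cases_by_class):  # one patient from each class
        n_sets += 1
        for i in range(k):
            column = [patient[i] for patient in combo]  # predicted risk of class i
            top = max(column)
            if combo[i][i] == top:
                credit += 1 / column.count(top)  # split credit on ties
    return credit / (k * n_sets)


# Hypothetical three-class example.
example = [
    [(0.8, 0.1, 0.1), (0.5, 0.3, 0.2)],  # true class 0
    [(0.2, 0.6, 0.2)],                   # true class 1
    [(0.1, 0.2, 0.7), (0.4, 0.4, 0.2)],  # true class 2
]
print(round(pdi(example), 3))

# An uninformative model that gives every patient equal risks scores 1/k.
flat = (0.2, 0.2, 0.2, 0.2, 0.2)
print(round(pdi([[flat]] * 5), 3))
```

Enumerating every set is only practical for small examples; with hundreds of patients per category the number of sets becomes very large, so a real analysis would need a more efficient counting scheme.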
Discussion
In this study, we have shown that in the hands of level II ultrasound examiners, the ADNEX model was able to discriminate between benign and malignant masses with a very similar level of performance to that achieved by experienced ultrasound examiners in the original ADNEX temporal validation study published by the IOTA group (15). In our external validation study using a 10% cut-off to define malignancy, the ADNEX model achieved a sensitivity of 97.3% and a specificity of 67.7%, compared to 96.5% and 71.3% in the original study (15). We also found in the current study that the ADNEX model discriminated well between benign tumors and each of the four subtypes of malignancy, and test performance was very similar to the original publication (15) (table 4).
To the best of our knowledge, this is the first external validation study of the IOTA ADNEX model. Furthermore, the validation was carried out by level II ultrasound examiners, whereas in the previous IOTA development and temporal validation study (15), the ultrasound scan parameters were collected by experienced level III examiners. A strength of our study is that it is multicenter, and as it includes level II examiners with varied training and experience (sonographers and medical doctors), we think the performance of the ADNEX model in this study is likely to be generalizable. Another strength of our study is the robust selection of the reference test, as only cases with a histological outcome were included. A potential weakness of the study is that all three participating hospitals are referral centers for gynecological malignancies, resulting in a relatively high prevalence of disease in the study population. However, in the original ADNEX study the prevalence of malignancy ranged from 0 to 66% in the twenty-four participating centers. Whilst the use of histological examination of surgically removed tissue as the reference test in all cases may be seen as a strength of the study, it may also be seen as a weakness in relation to the potential performance of the ADNEX model for masses that are selected for conservative management, as these were not included in the study. The use of different assay kits for serum CA-125 measurements is a further study limitation; however, the inconsistency in CA 125 levels resulting from this is thought to be minimal (26). Furthermore, the variation in CA 125 assay kits used in the study reflects clinical reality and again means the results are more likely to be reproducible (15). Finally, having no centralized histopathology review in our study may have led to bias.
For example, distinguishing borderline tumors from benign tumors or even stage I cancer may be challenging for pathologists; disagreement can occur, and this may give inaccurate diagnostic performance results for the ADNEX model in these cases (15). However, as all the histopathology departments involved in this study were tertiary referral centers for gynecological cancers, in the event of a discrepancy (including discrepancies in the referring units) a local review at the tertiary center would have been held to resolve the disagreement. Furthermore, centralized review of pathology was discontinued in IOTA studies as it was shown in initial studies that there were no significant differences between local and central reports (27).