Supplementary Methods and Materials.
Developing the pan-cancer EMT signature
We adopted an approach similar to that used in Byers et al(1) to derive the pan-cancer EMT signature. We used four established EMT markers, namely CDH1 (epithelial marker, E type), CDH2 (mesenchymal marker, M type), VIM (M type) and FN1 (M type) as seeds to derive the pan-cancer EMT signature on the basis of TCGA pan-cancer RNAseq data. In particular, we computed correlations (Pearson’s correlation, r) between all mRNAs in the RNAseq data and each of the established EMT markers for each individual tumor type. The bimodality index had been used previously to select genes with switch-like changes in expression in the cell line data(2, 3). In this analysis, however, we found only a few genes to have bimodal expression in the tumor samples, thus we did not use the bimodality index. Since kidney renal clear cell carcinoma (KIRC) was primarily mesenchymal and rectum adenocarcinoma (READ) was primarily epithelial, we excluded these two tumor types in the signature development with two considerations: (1) correlation with EMT markers could be compromised in these tumors and thus affect signature development; and (2) these tumors could serve as internal controls. We included both tumor types in the signature evaluation and other analyses. We used the top 25 genes correlating with each of the four established markers that showed consistent correlations (correlations with the same sign) among all the tumors as the pan-cancer EMT signature. Since some mRNAs strongly correlated with multiple established markers, the final signature consisted of 77 unique genes. Twenty-five of these genes were epithelial (showing positive correlation with the CDH1 gene) and 52 were mesenchymal (showing positive correlation with either the CDH2, VIM or FN1 genes). Among these genes, 18 were identified by multiple seeds and 14 overlapped with the lung cancer EMT signature. To obtain a quantitative measure of the EMT potential, we calculated an EMT score for each sample on the basis of the derived pan-cancer EMT signature.
Letting the standardized expression for gene g in sample i be egi, the set of M markers in the signature be GM (a total of GM genes), and the set of E markers be GE (a total of GE genes), the EMT score for sample i (Si) can be calculated as
Si=1GMg∈GMegi-1GEg∈GEegi.
Since the EMT score is calculated as the mean expression of the M markers subtracted by the mean expression of the E markers, a higher EMT score means a more mesenchymal phenotype. Of note, we selected the top 25 genes that correlated with each established marker so that we could obtain a manageable number of signature genes as well as focus on only genes that showed strong correlation. A different cutoff might lead to a different set of signature genes. However, the EMT score remained robust to a reasonable cutoff choice since it only relied on the estimate of the mean expression, and the actual gene entity did not matter much.
Identification of EMT-associated genomic features
We computed the Spearman correlation between the EMT score and individual genomic features. The genomic features per gene included the mRNA expression, RPPA expression (either total protein or phosphorylated protein) and miRNA expression levels (5p and 3p mature strands). We used GISTIC 2 or -2 to define copy number alterations (CNAs). We used a t-test to identify EMT-associated mutations among genes with a mutation rate of at least 2% overall and a Q value < 0.1 in at least one of the cancer types by MutSig analysis. We corrected for multiple testing using the beta-uniform mixture model to identify significantly associated features(4).
To summarize the correlation across different tumors, we defined the correlation gap as follows: (1) If all correlations are positive, the correlation gap equals the minimum correlation across all tumors. (2) If all correlations are negative, the correlation gap equals the maximum correlation across all tumors. (3) If there are both positive and negative correlations, the correlation gap is 0. Of note, the correlation gap metric is a conservative summary for correlations across different tumors and hence ensures that the correlation selected is biologically meaningful.
To evaluate whether the miRNA targets were enriched with genes strongly correlated with EMT, we defined the rank percent for each mRNA as the rank of the average correlation across different tumors divided by the total number of genes. If the targets were not enriched with genes correlating with EMT, the distribution of the rank percent from all targets would be random and hence distributed as a uniform distribution. Thus, the Kolmogorov–Smirnov test can be used to test whether the distribution of the rank percent is the same as a uniform distribution.
Testing the correlation between miRNA targets and EMT score
To test whether miRNA targets with strong correlation to EMT are enriched, we first computed rank percent values for each mRNA calculated as the standardized rank of average correlation among all tumors (a value between 0 and 1 where 0 means a strong correlation). We then compared the rank percent values of miRNA targets to those from all mRNAs by using a one-sided Kolmogorov-Smirnov (KS) test. The KS test is a nonparametric test used to compare a distribution with a reference distribution (a uniform distribution in our case mimicking a random sampling of mRNAs from the whole population). The analysis is illustrated in the following plot. If the miRNA targets are not enriched, the rank percent values should be distributed uniformly (the blue line in the plot). On other hand, if the miRNA targets are indeed enriched, then the distribution will be significantly different from uniform distribution (the green line).
Pathway analysis
We performed a functional annotation of the pan-cancer EMT signature as well as pathway enrichment analysis using QIAGEN’s Ingenuity Pathway Analysis® (5). To evaluate the gene expression associated with EMT across all tumor types, we selected 254 genes on the basis of an absolute cutoff r 0.3 in all 11 datasets and subjected the genes to pathway analysis.
Association with drug sensitivity data
For each drug sensitivity database, we calculated the EMT scores for the cell lines and associated them with IC50 values using the Spearman rank correlation. In the GDSC data, there were multiple drugs targeting the same gene. To achieve greater statistical power, we merged the drugs that targeted the same gene and computed an overall Spearman rank correlation.
References
1. Byers LA, Diao L, Wang J, Saintigny P, Girard L, Peyton M, et al. An epithelial-mesenchymal transition gene signature predicts resistance to EGFR and PI3K inhibitors and identifies Axl as a therapeutic target for overcoming EGFR inhibitor resistance. Clin Cancer Res. 2013;19:279-90.
2. Wang J, Wen S, Symmans WF, Pusztai L, Coombes KR. The bimodality index: a criterion for discovering and ranking bimodal signatures from cancer gene expression profiling data. Cancer Inform. 2009;7:199-216.
3. Tong P, Chen Y, Su X, Coombes KR. SIBER: systematic identification of bimodally expressed genes using RNAseq data. Bioinformatics. 2013;29:605-13.
4. Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003;19:1236-42.
5. http://www.ingenuity.com/.