Supporting Information
Building models of difficult sets of mutagenicity data based on quantum mechanical description of reactivity
Patrick McCarren^a, Clayton Springer^a, and Lewis Whitehead^a,*
^a Novartis Institutes for Biomedical Research, 100 Technology Square, Cambridge, MA 02139, USA.
Table S1. A selection of reported Ames test classification models.
Reference / Substructures / Model type / Training set accuracy / N / N_Ames+ / N_test / N_test,Ames+ / Accuracy / AUC test
Benigni 2007[1] / Aryl amines TA98 / linear regression / 0.89 / 111 / 86 / 0.69
Benigni 2008[2] / Aryl amines TA100 / linear regression / 0.87 / 111 / 64 / 0.81
Hansen 2009[3] / All / SVM / 5525-5528 / 3503 / 984-987 / 570-584 / 0.86
Ferrari and Gini 2010[4] / All / SVM / 0.90 / 3367 / 1883 / 837 / 451 / 0.81
Matthews, Kruhlak, Cimino, Benz, Contrera[5] / All / MC4PC / 1403 / 633 / 0.81
Zhang and Aires-de-Sousa[6] / All / random forest / 0.84 / 4083 / 2308 / 472 / 305 / 0.85
Langham and Jain[7] / All / ensemble learning / 0.79 / 4337 / 2401 / 400 / 174 / 0.75 / 0.839
Helma, Cramer, Kramer, De Raedt[8] / All / PART / 0.94 / 684 / 341 / leave-10%-out / 0.76
Saiakhov[9] / All / MC4PC / 0.79 / 984-987 / 570-584 / 0.73
Figure S1. Substructure counts for the full Sets C, D, and F.
Table S2. Comparison of Kazius 2005 substructure counts with counts derived from the present substructure queries.
Toxicophore / Kazius count / This work
aromatic nitro / 644 / 649
aromatic amine / 508 / 526
aromatic nitroso / 32 / 29
alkyl nitrite / 6 / 6
nitrosamine / 80 / 81
epoxide / 196 / 196
aziridine / 33 / 33
azide / 14 / 14
diazo / 7 / 7
triazene / 20 / 20
aromatic azo / 88 / 95
unsubstituted heteroatom-bonded heteroatom / 128 / 129
aromatic hydroxylamine / 53 / 53
aliphatic halide / 416 / 416
carboxylic acid halide / 26 / 26
nitrogen or sulfur mustard / 67 / 67
bay-region in polycyclic aromatic hydrocarbons / 125 / 125
K-region in polycyclic aromatic hydrocarbons / 128 / 128
polycyclic aromatic system / 660 / 707
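The agreement summarized in Table S2 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the counts are transcribed from the table, and the names `counts` and `count_differences` are illustrative assumptions.

```python
# Counts transcribed from Table S2: (Kazius 2005 count, count from this work).
counts = {
    "aromatic nitro": (644, 649),
    "aromatic amine": (508, 526),
    "aromatic nitroso": (32, 29),
    "alkyl nitrite": (6, 6),
    "nitrosamine": (80, 81),
    "epoxide": (196, 196),
    "aziridine": (33, 33),
    "azide": (14, 14),
    "diazo": (7, 7),
    "triazene": (20, 20),
    "aromatic azo": (88, 95),
    "unsubstituted heteroatom-bonded heteroatom": (128, 129),
    "aromatic hydroxylamine": (53, 53),
    "aliphatic halide": (416, 416),
    "carboxylic acid halide": (26, 26),
    "nitrogen or sulfur mustard": (67, 67),
    "bay-region in polycyclic aromatic hydrocarbons": (125, 125),
    "K-region in polycyclic aromatic hydrocarbons": (128, 128),
    "polycyclic aromatic system": (660, 707),
}


def count_differences(table):
    """Return {name: (absolute difference, percent difference relative to
    the Kazius count)} for every row where the two counts disagree."""
    diffs = {}
    for name, (kazius, present) in table.items():
        if kazius != present:
            delta = present - kazius
            diffs[name] = (delta, 100.0 * delta / kazius)
    return diffs


differences = count_differences(counts)

# List the disagreeing rows, largest relative difference first.
for name, (delta, pct) in sorted(
    differences.items(), key=lambda item: -abs(item[1][1])
):
    print(f"{name}: {delta:+d} ({pct:+.1f}%)")
```

With the values as transcribed, 12 of the 19 toxicophores match exactly and 7 differ, the largest absolute difference being the polycyclic aromatic system row (660 versus 707).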
Figure S2. Summary of the query structures used to construct Figure 2, Figure S1, and Table S2.
Figure S3. Comparison of performance of all QM descriptors shown in Table 3 and the performance of two PLS models.
References
1. Benigni R, Bossa C, Netzeva T, Rodomonte A, Tsakovska I: Mechanistic QSAR of aromatic amines: New models for discriminating between homocyclic mutagens and nonmutagens, and validation of models for carcinogens. Environ Mol Mutagen 2007, 48:754-771.
2. Benigni R, Bossa C: Predictivity and Reliability of QSAR Models: The Case of Mutagens and Carcinogens. Toxicology Mechanisms and Methods 2008, 18:137-147.
3. Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, Steger-Hartmann T, Heinrich N, Müller K-R: Benchmark Data Set for in Silico Prediction of Ames Mutagenicity. Journal of Chemical Information and Modeling 2009, 49:2077-2081.
4. Ferrari T, Gini G: An open source multistep model to predict mutagenicity from statistical analysis and relevant structural alerts. Chem Cent J 2010, 4:S2.
5. Matthews EJ, Kruhlak NL, Cimino MC, Benz RD, Contrera JF: An analysis of genetic toxicity, reproductive and developmental toxicity, and carcinogenicity data: II. Identification of genotoxicants, reprotoxicants, and carcinogens using in silico methods. Regulatory Toxicology and Pharmacology 2006, 44:97-110.
6. Zhang Q-Y, Aires-de-Sousa J: Random Forest Prediction of Mutagenicity from Empirical Physicochemical Descriptors. Journal of Chemical Information and Modeling 2006, 47:1-8.
7. Langham JJ, Jain AN: Accurate and Interpretable Computational Modeling of Chemical Mutagenicity. Journal of Chemical Information and Modeling 2008, 48:1833-1839. http://dx.doi.org/10.1021/ci800094a
8. Helma C, Cramer T, Kramer S, De Raedt L: Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructures and Structure Activity Relationships of Noncongeneric Compounds. J Chem Inf Comput Sci 2004, 44:1402-1411. http://dx.doi.org/10.1021/ci034254q
9. Saiakhov RD, Klopman G: Benchmark Performance of MultiCASE Inc. Software in Ames Mutagenicity Set. Journal of Chemical Information and Modeling 2010, 50:1521.