Exercises on QSAR/QSPR
The goal is to present state-of-the-art QSAR/QSPR methodology. Most commercial software packages do not provide sufficiently rigorous methods; this is the case for MOE in particular. For these exercises, CODESSA PRO or ISIDA are safer choices.
The first case study concerns alkanes. Two datasets are provided: ALKAN.SDF and ALKAN_15.SDF; the second is a subset of the first. The database contains several fields:
- boiling point (bp) in °C
- melting point (mp) in °C
- molar volume (MV) in cm3/mol at 20 °C
- molar refractivity (MR) in cm3/mol at 20 °C
- heat capacity (HC) in J/(K·mol) at 300 K
- critical temperature (Tc) in K
- critical pressure (Pc) in atm
- surface tension (ST) in dyn/cm at 20 °C
The exercises focus on modeling the boiling point, then the melting point, of the alkanes. Particular attention should be paid to:
- differences between the training set and the test set
- statistical parameters for assessing model utility
- Cross-validation
- Ensemble modeling
The boiling point is an “easy” property to model. The melting point is more difficult.
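As a minimal sketch of the statistical parameters and cross-validation mentioned above, the following Python code fits a one-descriptor linear model and compares the fitted R2 with a leave-one-out cross-validated Q2. The descriptor (number of carbon atoms) and the boiling points are approximate literature values for linear alkanes, typed in by hand for illustration; they are not read from ALKAN.SDF, and the function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single descriptor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def r2_and_rmse(y_true, y_pred):
    """Determination coefficient and root mean squared error."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(y_true))

def leave_one_out(xs, ys):
    """Predict each compound with a model fitted on all the others."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a * xs[i] + b)
    return preds

# Illustrative data (assumption): butane to decane, boiling point in °C.
n_carbons = [4, 5, 6, 7, 8, 9, 10]
bp = [-0.5, 36.1, 68.7, 98.4, 125.7, 150.8, 174.1]

a, b = fit_line(n_carbons, bp)
fitted = [a * x + b for x in n_carbons]
r2, rmse = r2_and_rmse(bp, fitted)                            # fit quality
q2, rmse_cv = r2_and_rmse(bp, leave_one_out(n_carbons, bp))   # predictive quality

print(f"R2 = {r2:.3f}, RMSE = {rmse:.1f} °C")
print(f"Q2 = {q2:.3f}, RMSE(cv) = {rmse_cv:.1f} °C")
```

The cross-validated figures are never better than the fitted ones for this kind of model, which is why they are the more honest estimate of how the model behaves on compounds it has not seen.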
The next exercises present model building for thrombin inhibition. The corresponding dataset is thr_pKiStd.sdf. The file contains the structure of each compound and its pKi inhibition value, in the field pKi. The name of each compound has also been replaced by this value, which can be confusing. This is a real-life case: the dataset is characteristic of QSAR/QSPR problems in its size, its diversity and the difficulty of finding “good” models. Because the calculations are fairly lengthy, the modeling has been performed in advance and the results will be displayed and discussed.
The last QSAR example is the TUB.SDF dataset, a small set of compounds that are active or inactive against tuberculosis. It contains a field named activity whose value is 1 for active and 0 for inactive compounds. The exercise is to build a linear model that fits the binary activity of the compounds, as an introduction to classification problems.
Note: linear regression on binary values, as used in this exercise, is an improper method for classification. In this particular case, it gives an impression of “success”. But the models are flawed by construction: for instance, they do not learn any statistical feature of the active or the inactive class, since all compounds within a class are treated as equivalent. The proper way to perform classification with a linear model is logistic regression.
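The difference can be illustrated with a small sketch. The code below fits both a least-squares line and a logistic regression (by plain gradient descent) to the same binary labels. The descriptor values and labels are invented for illustration and do not come from TUB.SDF; the learning rate and iteration count are arbitrary choices.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b, here applied to 0/1 labels."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Logistic regression fitted by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x
            grad_b += p - y
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Hypothetical descriptor values and activity labels (1 = active).
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
activity = [0, 0, 0, 1, 0, 1, 1, 1]

a, b_lin = fit_line(x, activity)
w, b_log = fit_logistic(x, activity)

x_new = 8.0  # a compound far outside the training range
linear_output = a * x_new + b_lin            # not bounded: can exceed 1
probability = sigmoid(w * x_new + b_log)     # always between 0 and 1

print(f"linear output: {linear_output:.2f}")
print(f"logistic probability of being active: {probability:.4f}")
```

The linear model returns a number with no probabilistic meaning for the new compound, whereas the logistic model returns a class probability that can be thresholded and assessed with classification statistics.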
The second part of the exercises focuses on Data Mining. Most QSAR/QSPR methods are in fact borrowed from Data Mining, from the types of mathematical models to the validation procedures and the assessment of success. The exercises use Weka, a Java-based Data Mining package from the University of Waikato in New Zealand. The software is free and open source. Since it is not a chemoinformatics application, it cannot work directly with chemical structures, nor can its results be analyzed in terms of chemical structures. Nonetheless, it is very useful for illustrating typical situations and for building models; all the chemoinformatics work, that is, the generation of molecular descriptors and the chemical analysis, has to be performed separately.
The Data Mining exercises use the following files:
- NBdist0.csv
- NBdist6.csv
- NBdist3.csv
- thrombinBig.arff
- thrombin.arff
- thrombin_all.sdf
- thrombin.hdr
The first three files are artificial 2D datasets. They are designed to illustrate the intuitive concept of classification, through clustering and supervised classification. They represent two sets of points drawn from distributions with different centers. The dataset NBdist6.csv illustrates a difficult case, in which the distributions largely overlap.
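To make the effect of overlap concrete, here is a minimal sketch that generates two 2D Gaussian point clouds and classifies them with a nearest-centroid rule. It does not reproduce the NBdist files: the distances between centers, the number of points, the spread and the random seed are all assumptions chosen for illustration.

```python
import random

def make_clouds(distance, n_per_class=200, sigma=1.0, seed=0):
    """Two 2D Gaussian clouds whose centers are `distance` apart on the x axis."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center_x in ((0, 0.0), (1, distance)):
        for _ in range(n_per_class):
            points.append((rng.gauss(center_x, sigma), rng.gauss(0.0, sigma)))
            labels.append(label)
    return points, labels

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def nearest_centroid_accuracy(points, labels):
    """Assign each point to the closest class centroid and count the hits.

    The centroids are computed from the same points, so this is a
    resubstitution estimate, not a validated one.
    """
    c0 = centroid([p for p, lab in zip(points, labels) if lab == 0])
    c1 = centroid([p for p, lab in zip(points, labels) if lab == 1])
    correct = 0
    for p, lab in zip(points, labels):
        d0 = (p[0] - c0[0]) ** 2 + (p[1] - c0[1]) ** 2
        d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
        predicted = 0 if d0 <= d1 else 1
        correct += predicted == lab
    return correct / len(points)

acc_separated = nearest_centroid_accuracy(*make_clouds(distance=6.0))
acc_overlapping = nearest_centroid_accuracy(*make_clouds(distance=1.0))

print(f"well separated clouds: accuracy = {acc_separated:.2f}")
print(f"overlapping clouds:    accuracy = {acc_overlapping:.2f}")
```

When the clouds overlap, no classifier can separate them perfectly; the error that remains is a property of the data, not of the method.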
The other files deal with the Thrombin dataset. Here, all known active compounds are labeled and a set of inactive compounds has been added. The SDF of the dataset is named thrombin_all.sdf. ISIDA fragment descriptors have been computed for these compounds. The nature of each fragment is given in the file thrombin.hdr, while the other two files, thrombinBig.arff and thrombin.arff, contain the molecular descriptor values in the native file format of the Weka software. The second file consists of a selected subset of the descriptors of the first.
The first part of the exercises will consist of getting familiar with the Weka software while illustrating clustering and classification on the artificial datasets.
The second part will use the Naïve Bayes algorithm to build and validate powerful predictive models of the anti-thrombin activity.
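The principle of Naïve Bayes can be sketched in a few lines. The code below is a Bernoulli variant working on the presence or absence of fragments, with Laplace smoothing. It is not Weka's implementation, and the tiny descriptor table is invented for illustration; it is not taken from thrombin.arff.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate, per class, a log prior and the probability of each fragment.

    alpha is the Laplace smoothing constant; it prevents zero probabilities.
    """
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for present, p in zip(x, probs):
            # "Naive" step: fragments are treated as independent given the class.
            score += math.log(p) if present else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical data: rows are compounds, columns are three fragments
# (1 = fragment present), labels are 1 = active, 0 = inactive.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 0, 0]))  # compound with only the first fragment
print(predict(model, [0, 0, 1]))  # compound with only the third fragment
```

In this toy table the first fragment is found mostly in actives and the third mostly in inactives, so the two query compounds are assigned to the active and inactive class respectively.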