Table S2. Number of pairs with similarity values of one, zero and greater than 0.9

Representation / Sim = 1 (identical), number of pairs / Sim = 1, % / Sim > 0.9, number of pairs / Sim > 0.9, % / Sim = 0 (completely different), number of pairs / Sim = 0, %
Property similarity / 1 / 0.02 / 117 / 2.36 / 1 / 0.02
MACCS keys / 0 / 0 / 8 / 0.16 / 4 / 0.08
GpiDAPH3 / 2 / 0.04 / 3 / 0.06 / 2500 / 50.5
TGT / 2 / 0.04 / 21 / 0.42 / 387 / 7.8

According to Property similarity, only one molecular pair (methanol-cyclosporine) had a similarity value of zero. This is not surprising since their properties differ considerably. In contrast, the only pair with a similarity value of one was estradiol-testosterone, whose members have very similar values of the four physicochemical properties (see Table S1). We did not identify pairs of compounds with identical MACCS keys representations. However, there were eight pairs with similarity values higher than 0.9, with the pair dexamethasone-β-D-glucoside and dexamethasone-β-D-glucuronide having the maximum MACCS keys similarity value. On the other hand, only four pairs have a MACCS keys/Tanimoto similarity of zero: clonidine-methanol, methanol-phencyclidine, methanol-PNU200603 and methanol-guanabenz. These results suggest that MACCS keys may capture only the discriminant features of a molecule and neglect other relevant chemistry. This behavior is nevertheless consistent with the design of MACCS keys, which were developed for diversity purposes.
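The MACCS keys/Tanimoto values discussed above compare the bits that are set in two fingerprints. The following is a minimal sketch, assuming each fingerprint is represented as a Python set of on-bit indices. The function name `tanimoto` and the toy bit sets are illustrative assumptions and are not taken from the study.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    # Shared on-bits divided by all distinct on-bits.
    return len(a & b) / len(union)


# Toy fingerprints (hypothetical bit indices, not real MACCS keys).
small = {160}
large = {12, 45, 78}
similar = {12, 45, 78, 90}

print(tanimoto(small, large))    # no shared bits
print(tanimoto(large, similar))  # 3 shared bits out of 4 distinct bits
```

Under this definition, a similarity of zero only requires that two molecules share no set bits. This may be why all four zero-similarity pairs involve methanol, a very small molecule that is likely to set few keys.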

Similarity values computed with GpiDAPH3 showed the lowest median value. This can be seen in the CDF plot (Figure 2), where most of the data points are shifted to low similarity values and a large number of pairs have similarity values of zero. Despite this, several pairs are shifted to high similarity values. Among them, two pairs have GpiDAPH3 similarity values of one: aminopyrine-antipyrine and meloxicam-piroxicam.

TGT is a topological three-point pharmacophore fingerprint that monitors all triplets of predefined pharmacophore features with a given bond distance in a molecule and consists of 1704 bits. As mentioned above, TGT is less discriminatory than GpiDAPH3, which is reflected in the broader distribution of data points. With a median value of 0.599, this representation identified 21 pairs with similarity values greater than 0.9 and two pairs of compounds with identical TGT representations (alprenolol ester-propranolol ester and chlorpromazine-imipramine). According to this measure, 387 pairs of compounds (7.82%) have a similarity value of zero.
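The counts and percentages reported in Table S2 can be tallied from a list of pairwise similarity values, as sketched below. Several points are assumptions rather than statements from the source: the helper names `n_pairs` and `tally` are hypothetical, the "Sim > 0.9" category is assumed to include identical pairs, and the total of 4,950 pairs (all unordered pairs of 100 compounds) is inferred from the reported percentages.

```python
def n_pairs(n_compounds):
    """Number of unordered pairs among n_compounds: n(n-1)/2."""
    return n_compounds * (n_compounds - 1) // 2


def tally(similarities, high=0.9):
    """Return (count, percent) for identical, high and zero similarity pairs."""
    total = len(similarities)
    if total == 0:
        raise ValueError("no similarity values supplied")
    counts = {
        "identical": sum(1 for s in similarities if s == 1.0),
        # Assumption: values equal to 1 are also counted as greater than `high`.
        "high": sum(1 for s in similarities if s > high),
        "zero": sum(1 for s in similarities if s == 0.0),
    }
    return {k: (c, round(100.0 * c / total, 2)) for k, c in counts.items()}


# Toy similarity values (illustrative only, not data from the study).
print(tally([1.0, 0.95, 0.0, 0.5]))

# Consistency check for the TGT row, assuming 100 compounds (inferred).
total = n_pairs(100)
print(total, round(100.0 * 387 / total, 2))
```

With the assumed total, 387 zero-similarity pairs correspond to 7.82%, in line with the value quoted for TGT.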