Supplementary Material

Integrating Multiple Evidence Sources to Predict Adverse Drug Reactions Based on Systems Pharmacology

Dong-Sheng Cao1, Nan Xiao2, Yuan-Jian Li1, Wen-Bin Zeng1, Yi-Zeng Liang3, Ai-Ping Lu4, Qing-Song Xu2, Alex F. Chen1

1School of Pharmaceutical Sciences, Central South University, Changsha, 410013, P. R. China

2School of Mathematics and Statistics, Central South University, Changsha 410083, P. R. China

3Research center of modernization of traditional Chinese medicines, Central South University, Changsha, 410083, P. R. China

4Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China

Section 1: The detailed description for Materials and methods section

Materials and methods

Data sources

Drugs and their associated ADRs were obtained from SIDER (http://sideeffects.embl.de/, as of October 2009), an online database containing drug-ADR associations extracted from package inserts using text mining methods (1). This data set consists of 880 drugs, 1382 ADRs, and 61102 drug-ADR associations. The ADRs in the database were mapped to the Medical Dictionary for Regulatory Activities (MedDRA) taxonomy, version 16.0. For a very small number of ADR names (less than 1%), we were not able to find a mapping at the MedDRA preferred term (PT) level, and we excluded those ADR names from our analysis. Moreover, drugs and ADRs vary greatly in their number of associations: some ADRs are present in almost all drugs, while others are associated with very few drugs, and the same holds for drugs. Thus, we removed from the association data the drugs and ADRs in the top 5% by number of associations, as well as ADRs and drugs having fewer than two associations. The resulting drug-ADR network contained 746 drugs, 817 ADRs and 24803 associations.
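The filtering step can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether "top 5%" is ranked by number of associations, in which order the two filters were applied, or whether they were iterated, so the sketch ranks by association count and applies each filter once. The function name and parameters are hypothetical.

```python
from collections import Counter


def filter_associations(pairs, top_fraction=0.05, min_degree=2):
    """Filter (drug, adr) pairs; a sketch, not the authors' actual script."""
    pairs = set(pairs)
    drug_deg = Counter(d for d, _ in pairs)
    adr_deg = Counter(a for _, a in pairs)

    def top_nodes(degree):
        # Assumption: "top 5%" means the most highly connected nodes.
        n_drop = int(len(degree) * top_fraction)
        return {node for node, _ in degree.most_common(n_drop)}

    drop_drugs = top_nodes(drug_deg)
    drop_adrs = top_nodes(adr_deg)
    pairs = {(d, a) for d, a in pairs
             if d not in drop_drugs and a not in drop_adrs}

    # Assumption: degrees are recomputed once after removing the hubs,
    # then nodes with fewer than `min_degree` associations are dropped.
    drug_deg = Counter(d for d, _ in pairs)
    adr_deg = Counter(a for _, a in pairs)
    return {(d, a) for d, a in pairs
            if drug_deg[d] >= min_degree and adr_deg[a] >= min_degree}
```

Because removing sparse nodes can push other nodes below the threshold, a stricter variant would repeat the last step until nothing changes; the text does not indicate which variant was used.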

Generic names were used to uniquely represent drugs and perform data integration. Canonical simplified molecular input line entry specification (SMILES) strings of the drug molecules were obtained from DrugBank (2) and STITCH (3). All ATC codes of drugs were obtained from DrugBank and STITCH. Drug targets were extracted from the DrugBank, Matador (4) and KEGG DRUG (5) databases. Protein sequences and GO annotations were downloaded from UniProt (6). Human protein-protein interactions (PPIs) were obtained from the BioGRID database (7). Pathway and disease knowledge of drugs was extracted from the Comparative Toxicogenomics Database (CTD) (8) and the OMIM database (9).

Similarity measures

We used both node attribute-based and network structure-based similarity measures. For node attribute-based similarity, we defined and computed 8 drug-drug similarity measures and 5 ADR-ADR similarity measures. For network structure-based similarity, we defined and computed 3 drug-drug and 3 ADR-ADR similarity measures. For drugs associated with more than one term (e.g., target, ATC level, disease), we found that averaging all similarities between the associated terms was most predictive.

Node attribute-based similarity

Node attribute-based similarity is defined using the essential attributes of nodes, and the choice of attributes depends greatly on the research object under consideration. Two nodes (i.e., drugs or ADRs) are considered similar if they have several attributes in common.

We constructed the following 8 drug-drug similarity measures:

(1) Chemical-based (ECFP): Canonical SMILES of drugs were downloaded from DrugBank. Extended connectivity fingerprints (ECFP) were calculated using RDKit with the diameter of the atom environments equal to 4 (i.e., ECFP4) (10). The similarity score between two drug molecules is calculated from their fingerprints using the two-dimensional Tanimoto similarity measure, which is equivalent to the Jaccard similarity of their fingerprints, that is, the size of the intersection divided by the size of the union.

(2) ATC-based (ATC): The World Health Organization (WHO) ATC classification system is employed to represent drugs. This hierarchical classification system categorizes drugs according to the organ or system on which they act, their therapeutic effects and their chemical characteristics. All ATC codes were extracted from DrugBank and STITCH. To define the similarity between ATC terms, we used the semantic similarity algorithm proposed by Resnik (11). This algorithm assigns probabilities p(x) to all the nodes x (i.e., ATC levels) in the ATC hierarchy by calculating the number of levels below x (i.e., the number of drugs appearing in the given node). The similarity between two drugs is then calculated as the maximum, over all their common ancestor ATC levels c, of -log(p(c)).

(3) Sequence-based (ProSeq): Following the method suggested by Bleakley and Yamanishi (12), we calculated the Smith-Waterman sequence alignment score between the corresponding drug targets, and then normalized this score by dividing it by the geometric mean of the scores obtained from aligning each sequence against itself.

(4) Closeness in a PPI network (PPI): The distances between each pair of targets were computed by an all-pairs shortest paths algorithm on the human PPI network. Following the suggestions of Perlman et al. (13), the distances were transformed to similarity values using the following formula:

$$S(p_1, p_2) = A e^{-D(p_1, p_2)}$$

where S(p1, p2) is the computed similarity value between two proteins p1 and p2, and D(p1, p2) is the length of the shortest path between these proteins in the PPI network. A was chosen to be 0.9×e, so that directly interacting proteins (D = 1) receive a similarity of 0.9. Self-similarity was assigned to be 1.

(5) GO-based (ProGO): Gene ontologies of drug targets including biological process, cellular component and molecular function were used. Semantic similarity scores between drug targets were calculated according to Yu (14), using the GOSemSim package selecting the option to use all three ontologies.

(6) Pathway-based (Pathway): The pathways associated with the modes of action of drugs were used to represent drug molecules. Pathway knowledge of drugs was extracted from the Comparative Toxicogenomics Database (CTD). To define the similarity of pathway terms, we obtained a list of relevant pathway families for each drug. The similarity between a pair of drugs was then computed as the Jaccard score between the corresponding sets of pathways.

(7) Disease-based (Disease): The diseases associated with drugs were used to represent drug molecules. Disease knowledge of drugs was extracted from the Comparative Toxicogenomics Database (CTD), and the corresponding code for each disease was obtained from the OMIM database. To define the similarity between disease terms, we used the hierarchical structure of the Human Phenotype Ontology (HPO), together with the mapping provided by HPO between ontology nodes and OMIM diseases, to construct a semantic similarity score based on Resnik’s algorithm. Applying this semantic similarity to the HPO data was previously shown to provide consistent clustering of diseases (15).

(8) CMap-based (CMap): Gene expression responses to drugs were retrieved from the Connectivity Map project. We calculated drug similarity from Connectivity Map ranked gene expression profiles as the Jaccard score between the sets of the 500 most differentially expressed genes of each drug (the 250 most up-regulated and 250 most down-regulated genes). The DvD package proposed by Pacini (16) was used to obtain the gene expression profiles of the drugs in this study.
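Several of the drug-drug measures above reduce to two small building blocks: a Jaccard (Tanimoto) score between two sets, used for fingerprints, pathways and CMap gene sets, and the exponential transform of shortest-path distance used for the PPI measure. The following minimal sketch illustrates both. The function names, the adjacency-dictionary graph representation, and the choice to return 0 for unconnected proteins are assumptions made for illustration and are not taken from the original study.

```python
import math
from collections import deque


def jaccard(a, b):
    """Jaccard (Tanimoto) score between two sets, e.g. fingerprint on-bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shortest_path_length(graph, source, target):
    """Breadth-first search on an unweighted graph {node: set(neighbors)}."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in graph.get(node, ()):
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # target not reachable from source


def ppi_similarity(graph, p1, p2, a=0.9 * math.e):
    """S(p1, p2) = A * exp(-D(p1, p2)), with self-similarity fixed at 1."""
    if p1 == p2:
        return 1.0
    d = shortest_path_length(graph, p1, p2)
    # Assumption: proteins in different components get similarity 0.
    return 0.0 if d is None else a * math.exp(-d)
```

For drugs with several targets, the pairwise target scores would then be averaged, as described in the similarity measures overview.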

We constructed the following 5 ADR-ADR similarity measures:

(1-2) UMLS-based (UMLSLin and UMLSJCN): We used the UMLS-Similarity package in Perl to measure the semantic similarity of ADRs. Three sub-databases related to ADRs in the UMLS database were used: Coding Symbols for Thesaurus of Adverse Reaction Terms (COSTART), 5th edition, 1995; WHO Adverse Drug Reaction Terminology (WHOART), Uppsala (Sweden), 1997; and Medical Dictionary for Regulatory Activities Terminology (MedDRA), Version 15.1, September 2012. Two semantic similarity algorithms, proposed by Lin and by Jiang and Conrath, respectively, were used to calculate the similarity measures based on a parent/child hierarchy relationship.

(3) ADR coexist-based (Coexist): Some ADRs tend to coexist during drug treatment. For instance, 90% of drugs that cause nausea also lead to vomiting. We therefore constructed an ADR coexist-based similarity to reflect this situation. Each ADR is represented as a vector in which each element is the proportion of drugs with this ADR in which another ADR simultaneously appears. The similarity is then calculated as the cosine of the angle between two such vectors.

(4) MedDRA-based (MedDRA): We used the Medical Dictionary for Regulatory Activities (MedDRA, version 16.0), developed by ICH. MedDRA is a clinically validated international medical terminology used as a standard for ADR reporting throughout the entire regulatory process, from premarketing to post-marketing activities, for data entry, retrieval, evaluation, and presentation. To define the similarity between MedDRA terms, we used the semantic similarity algorithm proposed by Resnik. This algorithm assigns probabilities p(x) to all the nodes x (i.e., MedDRA levels) in the MedDRA hierarchy by calculating the number of levels below x (i.e., the number of drugs appearing in the given node). The similarity between two ADRs is then calculated as the maximum, over all their common ancestor MedDRA levels c, of -log(p(c)).

(5) ADR-related-protein-based (APro): We used ADR-causing proteins to characterize ADRs. Recently, Kuhn reported a large-scale analysis to systematically predict and characterize proteins that cause ADRs. To define the similarity of ADR terms, we obtained a list of relevant proteins that are likely to elicit each ADR. The similarity between a pair of ADRs was then computed as the Jaccard score between the corresponding sets of proteins.
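The coexist-based measure in item (3) can be illustrated with a small sketch. It assumes that element k of the vector for ADR i is the fraction of drugs showing ADR i that also show ADR k, and that the ADR's own entry is kept (and therefore equals 1); the text does not specify how the self entry was handled. The function names and input format are hypothetical.

```python
import math


def coexist_vectors(drug_to_adrs):
    """Build one co-occurrence vector per ADR from {drug: set(adrs)}."""
    adrs = sorted({a for s in drug_to_adrs.values() for a in s})
    drugs_with = {a: {d for d, s in drug_to_adrs.items() if a in s}
                  for a in adrs}
    vectors = {}
    for a in adrs:
        n = len(drugs_with[a])  # at least 1, since a appears in some drug
        # Fraction of drugs with ADR `a` that also have ADR `b`.
        vectors[a] = [len(drugs_with[a] & drugs_with[b]) / n for b in adrs]
    return adrs, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

With this representation, two ADRs that always appear together (such as nausea and vomiting in a toy data set) receive identical vectors and a cosine close to 1.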

Network structure-based similarity

The similarity between drugs or ADRs can also be computed from various topological characteristics of the underlying network graph (i.e., the drug-ADR interaction network). Several studies have reported that features derived from network topology are informative for link prediction (17,18). For both drugs and ADRs, we computed 3 popular network structure-based similarities.

(1) Network neighbor-based (DNN and ANN): We used the concept that two drugs are similar if they have common neighbors (i.e., ADRs) in the drug-ADR network, and likewise for two ADRs. The similarity between a pair of drug or ADR nodes is determined as a function of the intersection of their adjacency lists, which takes into account all two-edge paths connecting these nodes. Specifically, the similarity between ci and cj with respect to the drug-ADR network graph G is computed as a Jaccard similarity score:

$$S(c_i, c_j) = \frac{|N(c_i) \cap N(c_j)|}{|N(c_i) \cup N(c_j)|}$$

where N(ci) and N(cj) are the adjacency lists of ci and cj in graph G, respectively. The similarity between a drug and an ADR is 0 because, in a bipartite graph, they have no common neighbors. We could therefore construct a drug-drug similarity and an ADR-ADR similarity in the drug-ADR network graph.

(2) SimRank-based (DSimRank and ASimRank): We extended the neighbor notion to calculate a more general structural context similarity using the SimRank algorithm proposed by Jeh et al. (19). For the bipartite graph composed of drugs and ADRs, we define the drug-drug and ADR-ADR similarity as follows:

$$s(a, b) = \frac{C}{|N(a)|\,|N(b)|} \sum_{i \in N(a)} \sum_{j \in N(b)} s(i, j), \qquad s(a, a) = 1$$

where N(a) and N(b) denote the neighbors of nodes a and b (drug or ADR), respectively, and C is a constant between 0 and 1. Following the suggestion of Jeh, it was set to 0.8.

(3) Path-based (DKatz and AKatz): The path distance between two nodes in the drug-ADR network can influence the formation of a link (i.e., a drug-ADR interaction) between them. The shorter the distance, the higher the chance that a link forms. We used the Katz index to reflect path-based information (20). The Katz index directly sums over all paths that exist between a pair of nodes x and y:

$$S(x, y) = \sum_{l=1}^{\infty} \beta^{l} \, \left| \mathrm{paths}_{x,y}^{\langle l \rangle} \right|$$

where $\mathrm{paths}_{x,y}^{\langle l \rangle}$ is the set of all paths of length $l$ from x to y, and $\beta$ is a regularization parameter controlling the path weights. A very small value of $\beta$ yields a measurement close to the network neighbor-based score. To reflect the long path information, it was set in our study to be 0.0001. The Katz index generally works much better than the shortest path since it is based on the ensemble of all paths between nodes x and y.

(4) Preferential attachment score (PAS): The preferential attachment concept is akin to the well-known "rich get richer" model. In short, the probability that a new link is connected to node x is proportional to the degree of node x. The probability that this new link will connect node x and node y is proportional to kx × ky (kx and ky are the node degrees of node x and y, respectively). We therefore used the multiplication of the node degree of drug and ADR as the preferential attachment score (21). This score could be directly used as the classification feature.
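The network structure-based scores can be sketched on a small adjacency dictionary. This is an illustration only, assuming the standard forms of SimRank (iterated to a fixed number of rounds) and of the Katz index (counting walks, as in the adjacency-matrix-power formulation, and truncated at a maximum length rather than summed to infinity). The function names, `iterations` and `max_length` are hypothetical parameters, not values reported in the text.

```python
def simrank(neighbors, c=0.8, iterations=10):
    """Iterative SimRank on {node: set(neighbors)}; returns {(a, b): score}."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(i, j)] for i in na for j in nb)
                new[(a, b)] = c * total / (len(na) * len(nb))
        sim = new
    return sim


def katz_index(neighbors, x, y, beta=0.0001, max_length=4):
    """Truncated Katz index: sum over l of beta**l * (#walks of length l)."""
    counts = {x: 1}  # number of walks of the current length from x to each node
    score = 0.0
    for length in range(1, max_length + 1):
        nxt = {}
        for node, n_walks in counts.items():
            for nb in neighbors[node]:
                nxt[nb] = nxt.get(nb, 0) + n_walks
        counts = nxt
        score += beta ** length * counts.get(y, 0)
    return score


def preferential_attachment(neighbors, x, y):
    """Product of the degrees of x and y."""
    return len(neighbors[x]) * len(neighbors[y])
```

In a bipartite drug-ADR graph, walks between two drugs (or two ADRs) always have even length, while walks between a drug and an ADR have odd length, so the truncation length should be chosen with that parity in mind.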

Assessing correlation from multiple evidence sources

In order to test whether the similarity measures from different evidence sources are similar to each other, we aligned these similarity measures to calculate their correlation. The alignment of two similarity matrices S1 and S2 is calculated as follows:

$$\mathrm{Align}(S_1, S_2) = \frac{\langle S_1, S_2 \rangle_F}{\sqrt{\langle S_1, S_1 \rangle_F \, \langle S_2, S_2 \rangle_F}}$$

where $\langle S_1, S_2 \rangle_F$ is the Frobenius inner product between S1 and S2, which is mathematically equal to the trace of $S_1^{T} S_2$ (22). Based on this alignment score, we compared any two similarity matrices from multiple evidence sources to assess their information overlap.
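As a minimal sketch, the alignment score can be computed on matrices stored as nested lists, assuming it takes the usual normalized form: the Frobenius inner product of the two matrices divided by the square root of the product of their self inner products. The function names are illustrative, and the zero-matrix fallback is an assumption.

```python
import math


def frobenius_inner(s1, s2):
    """Sum of element-wise products, equal to trace(S1^T S2)."""
    return sum(x * y for r1, r2 in zip(s1, s2) for x, y in zip(r1, r2))


def alignment(s1, s2):
    """Normalized alignment between two equally sized similarity matrices."""
    denom = math.sqrt(frobenius_inner(s1, s1) * frobenius_inner(s2, s2))
    # Assumption: an all-zero matrix is given alignment 0 with anything.
    return frobenius_inner(s1, s2) / denom if denom else 0.0
```

A matrix has alignment 1 with itself (up to rounding), and for matrices with non-negative entries, as similarity matrices typically are, the score lies between 0 and 1.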

Generating classification features

The classification features that we used were constructed from drug-drug and ADR-ADR similarity measures, resulting in 12 features when using node attribute-based similarities and 7 features when using network structure-based similarities. Herein, we extended neighborhood-based collaborative filtering recommendation methods to generate drug/ADR-based recommendation scores as classification features. Neighborhood-based collaborative filtering methods are widely used in recommender systems for various electronic commerce applications. Generally, recommender methods can be divided into user-based, item-based, content-based and knowledge-based approaches, which use different information from users, items and their associations to construct recommender models. In our method, we integrated these approaches into a mixed algorithmic framework by collecting different levels of information from drugs, ADRs and the network. We made recommendations from three aspects (i.e., drug, ADR and network) to determine whether a query drug-ADR pair interacts or not. These three aspects of recommendation are similar, to some extent, to the user-based, item-based, content-based and knowledge-based recommendations. For a given similarity measure, the score of a given drug-ADR association (di-aj) is calculated from the k known drugs or ADRs most similar to those in the pair. Our algorithm follows this basic idea: if a drug interacts with an ADR, other drugs similar to that drug will be recommended to the ADR, and vice versa. For a drug-ADR pair di-aj, a linkage between di and aj is determined by the following two predicted scores: