The expanded human disease network combining the protein-protein interaction information

Xuehong Zhang1,†, Ruijie Zhang1,†,*, Yongshuai Jiang1,†, Peng Sun1,†, Guoping Tang1, Xing Wang1, Hongchao Lv1, Xia Li1, 2,**

Supplementary Information

1. Disease-gene association and protein-protein interaction data

The Genetic Association Database (GAD)1 collected, standardized and archived almost all of the genetic association study data stored in many published literatures. Each entry of the GAD is composed of fifteen fields, disease phenotype (broad phenotype, narrow phenotype and molecular phenotype), MeSH disease terms, disease expert, disease class, official gene symbols, association tag ("Y" or "N"), chromosomal location (chromosome and chromosome band), genomic DNA position, p value, reference and its corresponding PubMed id and OMIM id. We downloaded the disorder-gene association file as of June 6, 2009. Out of 39,930 GAD entries, we selected 11,571 entries with "Y" tag, for which there is positive association with the disorder. Then, we removed the records whose gene symbol is not mapped onto an Entrez ID. Finally, we obtained 6,350 entries by merging multiple records which reported the same disease gene associated with same disease published in different literatures into one disease-gene association. We then parsed these 6,350 disorder terms into 1,336 distinct disorders by merging disease subtypes of a single disease, based on their given disorder names as Goh et al in 20062. Each disease was assigned a unique disease ID.

To obtain comprehensive and highly reliable human protein-protein interaction (PPI) data, we selected the Human Protein Reference Database (HPRD)3, in which PPIs are obtained from literature by manual curation. HPRD_Release_7_09012007 file was downloaded on July 14, 2009, which contains 37,107 interactions. After eliminating the self-interacting, and the interactions whose corresponding Entrez ID are tagged as "None", 35,000 interactions between 9,303 genes used in our study are available.

2. Construction of the eHDN

We classified each disorder into 19 disorder classes, following the classification scheme shown in Fig. 1. The classification is based on the "disease class" tag provided in GAD. Each gene symbol was mapped onto an Entrez ID, generating the list of disease-gene associations, which is defined as GAD diseasome. Starting from the GAD diseasome bipartite graph, we generated the original human disease network (oHDN) by projecting onto disease spaces2. In the oHDN, nodes represent disorders, and two disorders are connected to each other if they shared at least one gene.

We constructed the expanded diseasome map by combining the disease-gene information in GAD with the protein-protein interaction information in HPRD. First, we obtained 559 genes (227 genes belong to the GAD diseasome map) whose corresponding protein products interact with at least two proteins encoded by genes associated with a disorder. Second, the remaining 332 genes were added to GAD diseasome, generating the expanded list of disease-gene associations which is defined GAD-HPRD2 diseasome available as Supplementary Table S1. Similar to the oHDN, we also got the eHDN projection (Fig. 1).

3. Gene expression microarray data

The Gene Expression Omnibus (GEO) repository4 at the National Center for Biotechnology Information (NCBI) archived and freely disseminated microarray and other forms of high-throughput data generated by the scientific community. To calculate the coexpression correlation between wild-type human gene transcripts, we used microarray data available for normal human tissues. GSE7307 and GSE3526 were used in follow-up analysis.

GSE7307: Normal and diseased human tissues were profiled for gene expression using the Affymetrix U133 plus 2.0 arrays. It is composed of 677 samples, representing over 90 distinct tissue types. We selected 504 samples involved in 83 healthy tissues to carry out subsequent analysis. First of all, we used ArrayTools (3.7.1 versions)5 to match Entrez gene ID, and 41,558 microarray probes (76% of total probes) were assigned to 20,080 Entrez gene ID. In the GAD and GAD-HPRD2 diseasome, 1,594 (97% of 1,639 human disease genes) and 1,925 (98% of 1,971 genes) genes have corresponding microarray probes, respectively. Secondly, we followed the way in the previous works6 to determine the tissue-selective genes using the Significance Analysis of Microarrays (SAM) algorithm7.

GSE3526: Normal human tissue samples from ten post-mortem donors were processed to generate total RNA, which was subsequently analyzed for gene expression using Affymetrix U133 plus 2.0 arrays. Donor information: Donor 1-25 year old male; donor 2 - 38 year old male; donor 3-39 year old female; donor 4-30 year old male; donor 5-35 year old male; donor 6-52 year old male; donor 7-50 year old female; donor 8-48 year old female; donor 9-53 year old female; donor 10-23 year old female. It was composed of 353 samples, representing 65 distinct tissue types. We carried out subsequent analysis using 63 healthy tissues which contain at least two samples. Similarly, we used ArrayTools to match Entrez gene ID and determined the tissue-selective genes using SAM algorithm, and 34,000 microarray probes (76% of total probes) were assigned to 17,906 Entrez gene ID. In the GAD and GAD-HPRD2 diseasome, 1,471 (89% of 1,639 human disease genes) and 1,781 (90% of 1,971 genes) genes have corresponding microarray probes, respectively.

4. Computation of dyadicity and heterophilicity

Similar to Jiang, X., et al.8, we defined the value of a phenotype/disease based on whether it belongs (1) or does not belong (0) to a disease class. Thus three types of links between phenotype/disease exist: 1-1, 1-0, and 0-0; the number of these links are termed, and , respectively.

The two parameters dyadicity and heterophilicity are defined as:

and ,

where , represent the expected values of , respectively. These parameters can successfully characterize the modular structure of protein-protein interaction networks9.

The expected value of and is computed next. If we take cancer as an example, we can call the number of phenotypes/diseases belonging to cancer and the number of other phenotypes/diseases. is the total number of phenotypes/diseases and is the total number of edges in the network. Let represent the competence that indicates the average probability that two phenotypes/diseases are connected in the network. The value of a phenotype/disease depends on whether it belongs to a cancer class (1), or does not (0). The three varieties of link styles between phenotypes are 1-1, 1-0, and 0-0, and the number of these links can be labeled as , and respectively.

If any phenotype/disease in the network has an equal chance of being cancer, the expected values of and are and respectively 9.

,

.

Statistically significant deviations of and from their expected values of and imply that cancer are not distributed randomly in the HDN network.

Dyadicity () indicates that phenotypes/diseases in the disease class tend to connect more (less) densely among themselves than expected for a random configuration. Similarly, heterophilicity () means that phenotypes/diseases in the disease class have more (fewer) connections to phenotypes in other classes than expected randomly.

5. Gene ontology homogeneity analysis

The HDN based on the Online Mendelian Inheritance in Man (OMIM)10 has been reported previously2 that it showed modular organization then a group of genes associated with the same common disorder share similar cellular and functional characteristics, as annotated in Gene Ontology11. Here, we also investigated the GO homogeneity (GH) for the original and expanded HDN. The GH of a disorder i is defined as the maximum fraction of genes in the same disorder that have the same GO terms,

GHi = maxj [nji/ni],

where ni denotes the number of genes in the disorder i that have any GO annotations, and nji the number of genes that have the specific GO term j. We calculated GHi not only considering the GO annotations as a whole (GW), but also calculated GHi separately for each branch of GO, biological process (BP), molecular function (MF), and cellular component (CC).

To obtain the random control of the GO homogeneity distribution for each disorder, we picked the same number of genes randomly in the GO annotation data and calculated their GO homogeneity. We generated 104 random instances to reach statistical significance.

6. KEGG (Kyoto Encyclopedia of Genes and Genomes) homogeneity analysis

Functionally related genes generally exhibited genes which belong to the same cellular pathways8,12, such as annotated in Kyoto Encyclopedia of Genes and Genomes (KEGG)13. To investigate whether a group of genes associated with the same disorder have a tendency to share the same cellular pathways, we measured the KEGG homogeneity (KH) of each disorder here the maximum fraction of genes in the same disorder that have the same KEGG terms. Similar to GH, it is defined as

KHi = maxj [nji/ni],

where in this case ni denotes the number of genes in the disorder i that have any KEGG annotations, and nji the number of genes that have the specific KEGG term j. To reduce the bias, we removed the corresponding disease pathways from KEGG when the KEGG homogeneity is calculated for a certain disease.

To obtain the random control of the KEGG homogeneity distribution for each disorder, we picked the same number of genes randomly in the KEGG annotation data and calculated their KEGG homogeneity. We generated 104 random instances to reach statistical significance.

7. Subcellular location analysis

The function of a protein and its role in a cell are closely correlated with its subcellular locations or environments14. For example, the target proteins and non-target proteins are different in subcellular locations15. Here, we further analyzed whether the corresponding protein products of the genes in the common disease module have a tendency to gather in the same subcellular location based on another annotation. Swiss-Prot16 is a manually annotated protein sequence and knowledge database that is valued for its high quality annotation, the usage of standardized nomenclature, integration with other databases and minimal redundancy. We extracted the subcellular location information from the Swiss-Prot database on September 10, 2009. However, only 45% of the proteins have subcellular location annotations derived from the experimental observation for the all human proteins stored in the Swiss-Prot database. We used the Hum-mPLoc Classifier14 to predict their subcellular location for those proteins which have not annotations or annotated with uncertain labels such as probable’’, ‘‘potential’’, ‘‘perhaps’’, and ‘‘by similarity’’.

Then, we measured the subcellular location homogeneity (SLH) of each disorder here the maximum fraction of genes in the same disorder that have the same subcellular location. Similarly, it is defined as

SLHi = maxj [nji/ni],

where in this case ni denotes the number of genes in the disorder i that have any subcellular location annotations, and nji the number of genes that have the specific subcellular location j.

To obtain the random control of the subcellular location homogeneity distribution for each disorder, we picked the same number of genes randomly in the subcellular location annotation data and calculated their subcellular location homogeneity. We generated 104 random instances to reach statistical significance.

8. Tissue homogeneity

The tissue homogeneity (TH) coefficient proposed by Goh, et al.2 quantified whether genes that are implicated in the same disorders tend to be expressed in similar human tissues. The TH of a disorder i, is defined as

THi = maxj [nji/ni],

where ni denotes the number of genes in the disorder i that are expressed in at least one tissue, nji the number of genes that are expressed in the tissue j among them, and maxj [×] denotes the function returning the maximum-value argument across j. TH has the maximal value 1 if all of the genes are expressed together in at least one tissue, and takes the minimum value 1/n when all are expressed in different tissues. To obtain the random control of the tissue homogeneity distribution, we picked the same number of genes randomly in the microarray data for each disorder and calculated their tissue homogeneity. We generated 104 random instances to reach statistical significance.

9. Random controls for gene expression analysis

For the eHDN, to obtain the random control of the Pearson correlation coefficient (PCC) distributions for the gene expression in Fig. 2F and G, we calculated the distribution of all gene pairs in the microarray data (Fig. 2F) and the average PCC between the same number of genes chosen randomly from the microarray data (Fig. 2G). Similarly, we also obtained the random controls of the cosin correlation distance and the corresponding average for the eHDN. We performed 104 independent runs to obtain significant statistics.

10. Comparative analysis

To illustrate the credibility of our eHDN, we carried out a comparative analysis of eHDN and oHDN. We found that the eHDN is highly consistent with the oHDN over all topological and functional characteristics. To further demonstrate the reliability of eHDN, we eliminated the real disease genes from the GAD-HPRD2 diseasome and obtained the HPRD2 diseasome and its HDN projection. We then analyzed the topological and functional properties of the HDN projection and found it to be consistent with the oHDN. Firstly, the distribution of k and C(k) are also both significantly different from the random controls (p value <5.7e-6, p value <2.2e-16). For the GO and KEGG homogeneity, we also found a significant elevation with respect to random expectations (p value <2.0e-11, p value <1.0e-5). In the HDN projection of HPRD2 diseasome, subcellular location homogeneity and tissue homogeneity all significantly differ from the random controls (p value <1.0e-5, p value <2.7e-5). In addition, the coexpression and synchronized expression features are significantly different from the random controls (p value <1.0e-5). This result, to some extent, conforms that the properties of HDN projection obtained by eliminating the true disease genes from GAD-HPRD2 diseasome are similar to oHDN. Therefore, we feel confident that our eHDN constructed by combining disease gene information with PPI information is reliable.

SUPPLEMENTAL REFERENCES