Supplemental Information
Computational modeling of the Plasmodium falciparum interactome reveals protein function on a genome-wide scale
Shailesh V. Date and Christian J. Stoeckert, Jr
Center for Bioinformatics, Department of Genetics
University of PennsylvaniaSchool of Medicine,
Philadelphia, PA19104
Filtering of in silico data sets
Phylogenetic profiles (Pellegrini et al. 1999) for 5,334 P. falciparum proteins were constructed by comparing their amino acid sequences with 163 genomes. These data, including data from one additional genome (Aeropyrum pernix K1; 164 genomes in all), were also examined for presence of Rosetta stone (RS) fusion proteins (Marcotte et al. 1999). Phylogenetic profiles of proteins present in P. falciparum and at least one other organism were retained. This filtering reduced our data set to 2,813 protein profiles.
Linkages for which evidence of a fusion protein was available only in P. falciparum were excluded from our analysis, leaving us with 5,176 RS linkages between 993 proteins.
Creating gold standards using KEGG and GO annotations
Gold standards were created using pathway annotations available from the KEGG database (Kanehisa et al. 2004) and gene ontology process (GO Process) (The Gene Ontology Consortium, 2004) annotations as supplied by the sequencing centers (the Wellcome Trust Sanger Institute, The Institute for Genomic Research (TIGR) and the Stanford Genome Technology Center), downloaded from PlasmoDB (). KEGG annotations have proven reliable benchmarks, as have GO annotations (Lehner and Fraser, 2004), as well as combinations of both KEGG and GO (Lee et al. 2004; Ramani et al. 2005).
Given the non-redundant nature of KEGG and GO annotations (Bork et al. 2004; Lee et al. 2004), we used GO annotations to filter KEGG negatives, which lead to improvement in their quality. When using GO, references to 3 broad GO categories (GO:0000004 – biological_process unknown, 0008151 – metabolism, 0008152 – cell growth and/or maintenance) were excluded, as were 4 evidence lines (NR – not recorded, ND – no biological data available, NAS – non-traceable author statement, RCA – reviewed computational analysis). The most prominent evidence type observed after removing the above-mentioned categories was the ISS type (Inferred from Sequence or Structural Similarity). A minimum GO term depth of 5 was maintained, and the term at the lowest level was retained for proteins with 2 or more associated terms.
We confirmed the utility of including annotation up to 7 levels by measuring the gain in the number of true positives in our predicted set. These levels were chosen dynamically, i.e., for each term, we ascended local branches of the hierarchy, rather than using a fixed level based on the overall GO hierarchy, thereby accounting for non-uniformities and differences in GO hierarchies. In this way, we could search the neighborhood of each protein for other proteins with similar assignments, while still maintaining specificity. In all, 1,146,970 pairs with GO information satisfied our criteria. Negative pairs that overlapped with this set of GO pairs were excluded. Finally, our gold standard data sets contained 10,267 positive pairs (GP), and 44,812 negative pairs (GN).
The use of GO annotations for filtering the reference standards resulted in a boost in model performance over all likelihood score thresholds (Figure S1), and allowed more number of links to be included in the result set at lower thresholds, as evidenced by a comparison of values obtained by 7-fold cross-validation. In addition, we independently tested the effect of including annotations from different GO levels, to ascertain the most useful level of the hierarchy for filtering negatives. As seen in Figure S1, successive increases in GO levels improve model performance till a level of 7 is reached, after which gains are minimal. We also assured the homogeneity of the gold standards by randomly withholding a subset during model training and using this subset for testing the predicted linkages (Figure S2). This reveals that a careful selection of annotation data for creating reference standards is feasible, and the resulting integration is of a sufficiently high quality to permit reconstruction of accurate functional interaction maps.
Choice of reference priors based on an assumed number of linkages.
Based on Bayesian inferences (Perl 1998; Ewens and Grant, 2005), calculation of likelihood ratios is independent of any requirement for prior knowledge. However, choosing a likelihood score threshold for accepting functional linkages generated by the model is necessary to isolate a result set, where the prior odds of finding a pair of linked genes are greater than 1. Given the absence of functional interaction data for the Plasmodium falciparum genome, we adopted a strategy that lets us utilize arbitrary sets of links as our reference or non-informative priors.
For each step, links for each uniform reference prior set R were chosen such that |R| > 0 and RN, N being the total number possible links given all proteins in the P. falciparum genome. For each change in our number of probable linkages, we measured the global impact of addition of new evidence by calculating the ratio of true positives to false positives at the corresponding likelihood score threshold. For instance, a belief that only 40% of all proteins are functionally linked with each other, leads to a likelihood score threshold of ~6, yielding posterior odds of finding a true pair of functionally linked proteins that are greater than 1. Similarly, a belief that involves linkages between 50% of all proteins leads to a corresponding score threshold of ~4, and so on.
As evidenced by the quality of the network produced (Figure 2), this strategy proves useful when dealing with genomes with little or no useful data on protein-protein interactions. For the P. falciparum genome, assumptions regarding the number of possible functional links also corresponded closely with the actual number of links found in the result sets above the derived thresholds. This correspondence, however, is not maintained when dealing with beliefs that assume very large or very small numbers of proteins as likely interactors. As seen in Figure S3, genome coverage does not exceed 80% for a set of links derived at any likelihood threshold. Conversely, the prior beliefs suggest an increase in genome coverage in sets generated at a much higher likelihood threshold. Thus, for both very high and very low number of possible functional interactions, this strategy overestimates the degree of coverage afforded by the model, given a set of reference priors. This inability to effectively reflect true coverage is a result of the limitations of the individual functional genomics methods, our filtering methods and our gold standards.
Testing linkage quality and network robustness
Initial quality of the predicted linkages was tested using sets of shuffled linkages. More rigorous tests involved cross-validation and comparison with shuffled sets of input data. For cross-validation, gold standards were divided in to seven subsets, each of which was withheld once for testing, while the remaining 6 were used for training the data. Results from the 7 individual runs were then summed for each likelihood score threshold (see Figure 1).
For the test using shuffled input data, values in phylogenetic and expression profiles were maintained, while shuffling the identifying labels attached to each profile. Mutual information and Pearson correlation was used to measure profile similarity for the phylogenetic and expression profile sets, respectively. Rosetta stone protein linkages present input data in the form of ‘A-B-RS’, where ‘A’ & ‘B’ are individual proteins, while ‘RS’ represents a Rosetta stone fusion protein. For this set, we maintained the ‘B-RS’ relationship in the triplet, while randomizing all ‘A’ proteins. For all sets, the randomization procedure was repeated 1000 times. Functional linkages within each shuffled set were also subject to the same filters that were applied to the normal sets. Table S1 shows a comparison of likelihood scores for normal and shuffled sets of functional genomics data.
Robustness of the network to changing benchmarks was tested using functional genomics data not used in training the network. This included expression measurements of genes in all life-cycle stages (Le Roch et al. 2003), and high-throughput mass spectrometric measurements of the proteome (Florens et al. 2002, Lasonder et al. 2002; referred to as sets M1 and M2 respectively).
For each data set, proteins in the same life-cycle stage are more likely to be functionally linked, as they are more likely to take part in stage-specific pathways or be a part of protein complexes, than are proteins in different life-cycle stages. We therefore paired proteins from the same life cycle stage to create positive sets, and different life-cycle stages to create negative sets. While trends in each data set are accurately captured by the model, we observed some differences when dealing with the two mass spectrometry data sets- for data set M1, we found that the ratio of positives to negatives remains below 1 for all sampled thresholds as opposed to M2, where the ratio of positives to negatives is greater than 1 for all likelihood score thresholds. While these differences cannot be readily explained in our analysis, we do introduce a moderate degree of error in these tests, as we use only simple combinations to generate protein pairs that were detected in the same stage, without regard to spatial or temporal constraints. Further, it is likely that at least some of our predicted interactions are missed when compared to such proteomic data, given the inclusion of gene expression data in our model and the known delay between mRNA expression and protein accumulation (Le Roch et al. 2004).
Visualizing spatial separation of predicted linkages based on life-cycle stages
In addition to functional properties, we were also interested in observing spatial clustering based on nodal properties shared by individual pair-wise linkages. We expected reasonable spatial separation of linkages based on participation in a particular life-cycle stage, since candidates sharing a life-cycle stage are more likely to be functionally linked than candidates from different life-cycle stages. To test this hypothesis, we measured the distribution of candidates and their shared linkages using assembled protein pairs from different life-cycle stages, by superimposing them on the P. falciparum functional linkage map.
For ascertaining nodal properties, we chose the largest life-cycle specific positive reference set from these three individual data sets. Pairs within this set that appeared as a positive pair in any other life-cycle stage were discarded. Data from studies of Florens et al. (2002) for the sporozoite stage, Lasonder et al. (2002) for the gamete/gametocyte stages and Le Roch et al. (2003) for the erythrocyte-associated stages were found to be the largest sets for each stage.
Individual proteins modeled as nodes, and functional linkages modeled as edges were graphed in two-dimensional (2-D) space using the minimal spanning tree approach as implemented in the large graph layout (LGL) algorithm (Adai et al. 2004, see ‘Using the LGL package for visualizing networks’). When visualized with LGLView, functional linkages from gamete/gametocytic and erythrocyte-associated life-cycle stages are seen to possess distinct spatial profiles, and mostly occupy different areas of the map, albeit with a degree of overlap (Figure S4). Linkages between proteins restricted to the sporozoite stage are fewer in number as compared to linkages between the other two stages, and appear scattered throughout the network. An examination of overlapping linkages from the gamete/gametocytic stages and erythrocyte-associated stages revealed the presence of mostly hypothetical proteins, majority of which are functionally linked with proteins that interact with DNA, such as Helicases and proteins containing the zinc-finger motif. While the observed overlap is primarily due to the limitations of our assembled query set and 2-D graphing, it can be speculated that at least some of these components may be shared between different life-cycle stages of the parasite, and therefore appear to share nodal properties.
Exploring the interactome for functional associations
The interactome map provides several interesting functional associations. The hypothetical protein PF10_0214 is seen linked with members of the mRNA-processing machinery, while PF14_0729 is likely involved in protein transport. The hypothetical protein PFI1180w is seen linked with members of the ADP-ribosylation factor (Arf)-family of small GTP-binding proteins, suggesting a role for this protein in membrane trafficking and actin remodeling (Nie et al. 2003).
Functional information from a subset of high confidence linkages
Phase transitions in our data were prominent at two points- the first was seen when the threshold was decreased from 14 to 13, representing a five-fold increase in interaction number from ~12,000 to ~60,000, and a two-fold increase in the number of proteins, from ~1,400 to ~2,900; a second less abrupt transition was noted between threshold change from 3 to 2, where the number of links increased from ~126,000 to include the entire set of 388,969 links, and the number of proteins involved increased from ~3,200 to 3,667.
We have taken advantage of these transitions to delineate a set of high confidence links, above a likelihood score threshold of 14 or greater. This subset also provides biological insights into the functions of several proteins. A distinct sub-network composed entirely of hypothetical proteins- PF11_0122, MAL8P1.151, PF13_0285, PF07_0024 is worth noting. All of these proteins are either similar to proteins that contain the SacI domain, or contain the Sac1 domain themselves, based on information available from PlasmoDB. Sac1 containing proteins are thought to play a variety of roles in yeast, including exocytosis and endocytosis and, are also known to interact with the actin cytoskeleton (Hughes et al. 2000). Additionally, PF07_0024 has been recently annotated as the enzyme Inositol-polyphosphate 5-phosphatase. Along with experimental evidence from yeast, these hypothetical proteins are therefore likely to play a role in protein trafficking and secretion (Schorr et al. 2001).
Retention of linkages in other apicomplexa
We also superimposed the P. falciparum interaction map on the Plasmodium yoelii, Toxoplasma gondii and Cryptosporidium parvum genomes, to examine retention of linkages across the species. Such analyses reveal interactions that are a part of the apicomplexan signature, and are the first step towards understanding interactions and identifying components unique to the P. falciparum genome. Given the current quality of some genomes, for instance missing genes in C. parvum, or low coverage, it is possible that these comparisons are incomplete, or contain some error. This includes 409 C. parvum genes that are not a part of comparative analyses between P. falciparum and C. parvum. These genes are not assigned any sequence in the currently available version of the genome (Abrahamsen, et al. 2004), due to undetermined start/stop or intron positions. However, it should be possible to refine these observations as sequence quality improves.
Observed links that are common among the genomes, or exclusive to P. falciparum are outlined in Table 2. One group of retained linkages includes the Phosphatidylinositol signaling system, with members of which include GPI-anchored moieties. Plasmodium and Toxoplasma GPIs are known to induce proinflammatory responses (Debierre-Grockiego et al. 2003; Zhu et al. 2005), and antibodies against these have been suggested as a possible anti-glycolipid vaccine against malaria (Schofield and Hackett, 1993); it may be possible to further extend versions of this strategy as a generalized way of combating apicomplexan pathogenesis, specifically by using knowledge of shared steps obtained from this analysis.
Examination of the network for linkages exclusive to P. falciparum, based on comparison with the apicomplexan genomes, reveals 87,798 linkages represented by 3,307 proteins (see Table 2). We identified 593 interactions between members of the rif/stevor families and other proteins, which may prove pharmaceutically useful on further experimental study. Additionally, links between hypothetical proteins such as PF10_0128 and MAL8P1.145, which are orthologous to G-protein subunit proteins in other organisms, and PFL0080c, a putative serine/threonine-protein kinase, point to interesting and unique signaling cascades.
Accuracy and coverage of individual and combined functional genomics data
We also measured ratio of true positives to false positives in terms of the percentage of genome covered by individual functional genomics data sets, compared to the integrated data. The results of this test are illustrated in Figure S5. As expected, based on the properties of our integration schema, the integrated data set shows higher accuracy in terms of the ratio of true positives to false positives over all portions of the genome covered, as opposed to the accuracy of individual functional genomics data sets. Besides using Bayesian logic, simpler methods for integrating data based on unions or voting can also be imagined. However, such methods suffer from several drawbacks, such as requiring a user-defined cutoff to be chosen for each set. Further, given their dependence on the size of each set, such methods are extremely restricted in their coverage, and may be biased towards a set. The ability to weigh evidences, and combine them to yield probabilistic scenarios makes a Bayesian model more useful, its utility only increased given the fact that the included data represents different aspects of the life-cycle of the parasite.
Using the LGL package for visualizing networks
The networks were laid out and visualized using the LGL package (Adai et al. 2004). LGL uses a mass-spring algorithm, where edges are treated as springs, which pull together nodes acting as masses. Coordinates are first generated for the vertices for each independent set of interconnected links, using a guide tree generated by the minimum spanning tree of the network (the layout phase). Each independent set is then connected to reconstruct the complete network (see also the ‘Links of interest’ section for website information and links to other graphing software).
For visualizing P. falciparum networks, we supplied LGL with a set of unique node identifiers without any weights. LGL was allowed to arbitrarily designate a root vertex, and was run with default parameters. The edge and coordinate files generated were visualized with LGLView, a companion program that plots the coordinates in 2-dimensional space.
For visualizing areas of the network specific to a life-cycle stage, we chose to color vertices that belonged to a particular stage. Edges between two vertices that belonged to the same stage are also assigned the same color.