Mining Viral Protease Data to Extract Cleavage Knowledge

TITLES, AUTHORS, ABSTRACTS FOR ISMB02

Paper01, Presentation02

Title:

Mining viral protease data to extract cleavage knowledge

Authors:

Ajit Narayanan, Xikun Wu and Z. Rong Yang

Abstract:

Motivation: The motivation is to identify, through machine learning techniques, specific patterns in HIV and HCV viral polyprotein amino acid residues where viral protease cleaves the polyprotein as it leaves the ribosome. An understanding of viral protease specificity may help the development of future anti-viral drugs involving protease inhibitors by identifying specific features of protease activity for further experimental investigation. While viral sequence information is growing at a fast rate, there is still comparatively little understanding of how viral polyproteins are cut into their functional unit lengths. The aim of the work reported here is to investigate whether it is possible to generalise from known cleavage sites to unknown cleavage sites for two specific viruses - HIV and HCV. An understanding of proteolytic activity for specific viruses will contribute to our understanding of viral protease function in general, thereby leading to a greater understanding of protease families and their substrate characteristics.

Results: Our results show that artificial neural networks and symbolic learning techniques (See5) capture some fundamental and new substrate attributes, but neural networks outperform their symbolic counterpart.

Availability: Publicly available software was used (Stuttgart Neural Network Simulator and See5). The datasets used (HIV, HCV) for See5 are available at:

Contact: ,

Paper02, Presentation03

Title:

The metric space of proteins - comparative study of clustering algorithms

Authors:

Ori Sasson, Nati Linial, Michal Linial

Abstract:

Motivation: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge also to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation.

Results: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro.

Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.
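As an illustration of the bottom-up process described above, the following minimal sketch merges clusters under two alternative rules and reports which clusters appear under both. It is not the authors' implementation: the function names, the toy similarity scores, and the choice of `mean` and `max` as merging rules are all assumptions made for the example, and real input would come from the all-against-all BLAST comparison.

```python
from itertools import combinations
from statistics import mean


def agglomerate(items, sim, rule):
    """Bottom-up clustering: repeatedly merge the two highest-scoring clusters.

    items: list of hashable labels
    sim:   dict mapping frozenset({a, b}) -> similarity (higher means closer);
           missing pairs are treated as similarity 0.0
    rule:  function that reduces a list of pairwise scores to one merge score
    Returns the merges in order as (cluster_a, cluster_b, score).
    """
    clusters = [frozenset([x]) for x in items]
    merges = []
    while len(clusters) > 1:
        def score(pair):
            a, b = pair
            return rule([sim.get(frozenset((x, y)), 0.0) for x in a for y in b])

        a, b = max(combinations(clusters, 2), key=score)
        merges.append((a, b, score((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def clusters_formed(merges):
    """Set of all clusters created at any level of the hierarchy."""
    return {a | b for a, b, _ in merges}


# Hypothetical similarity scores for four proteins (illustrative values only).
items = ["p1", "p2", "p3", "p4"]
pairs = {("p1", "p2"): 0.9, ("p2", "p3"): 0.7, ("p3", "p4"): 0.6}
sim = {frozenset(k): v for k, v in pairs.items()}

mean_clusters = clusters_formed(agglomerate(items, sim, mean))
max_clusters = clusters_formed(agglomerate(items, sim, max))
shared = mean_clusters & max_clusters  # clusters consistent across both rules
```

In this toy case the two rules agree on the cluster {p1, p2} but disagree on where p3 belongs, which is the kind of disagreement the comparison of merging rules is meant to expose.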

Availability: The outcome of these investigations can be viewed in an interactive Web site at

Supplementary information: Biological examples for comparing the performance of the different algorithms used for classification are presented in

Contact:

Paper03, Presentation04

Title:

Authors:

Nicholas Steffen, Scott Murphy, Lorenzo Tolleri, Wesley Hatfield, Richard Lathrop

Abstract:

Motivation: Direct recognition, or direct readout, of DNA bases by a DNA-binding protein involves amino acids that interact directly with features specific to each base. Experimental evidence also shows that in many cases the protein achieves partial sequence specificity by indirect recognition, i.e., by recognizing structural properties of the DNA. (1) Could threading a DNA sequence onto a crystal structure of bound DNA help explain the indirect recognition component of sequence specificity? (2) Might the resulting pure-structure computational motif manifest itself in familiar sequence-based computational motifs?

Results: The starting structure motif was a crystal structure of DNA bound to the integration host factor protein (IHF) of E. coli. IHF is known to exhibit both direct and indirect recognition of its binding sites. (1) Threading DNA sequences onto the crystal structure showed statistically significant partial separation of 60 IHF binding sites from random and intragenic sequences and was positively correlated with binding affinity. (2) The crystal structure was shown to be equivalent to a linear Markov network, and so, to a joint probability distribution over sequences, computable in linear time. It was transformed algorithmically into several common pure-sequence representations, including (a) small sets of short exact strings, (b) weight matrices, (c) consensus regular patterns, (d) multiple sequence alignments, and (e) phylogenetic trees. In all cases the pure-sequence motifs retained statistically significant partial separation of the IHF binding sites from random and intragenic sequences. Most exhibited positive correlation with binding affinity. The multiple alignment showed some conserved columns, and the phylogenetic tree partially mixed low-energy sequences with IHF binding sites but separated high-energy sequences. The conclusion is that deformation energy explains part of indirect recognition, which explains part of IHF sequence-specific binding.
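The linear-time claim in (2) can be illustrated with a small sketch. It assumes, for illustration only, that the energy of a threaded sequence is a sum of terms for adjacent base steps, expressed in units of kT, and that sequence probability is proportional to the Boltzmann weight of that energy. The energy values below are arbitrary placeholders, not measured deformation energies, and the function names are hypothetical.

```python
from itertools import product
from math import exp, isclose

BASES = "ACGT"


def sequence_energy(seq, step_energy):
    """Sum the energy of each adjacent base step along the chain."""
    return sum(step_energy[i][seq[i:i + 2]] for i in range(len(seq) - 1))


def partition_function(step_energy):
    """Total Boltzmann weight over all sequences, by a forward recursion.

    The cost is linear in sequence length (16 terms per step) even though
    the number of sequences grows as 4**length.
    """
    weights = {b: 1.0 for b in BASES}  # weight of all prefixes ending in b
    for table in step_energy:
        weights = {
            nxt: sum(weights[prev] * exp(-table[prev + nxt]) for prev in BASES)
            for nxt in BASES
        }
    return sum(weights.values())


def probability(seq, step_energy):
    """Probability of one sequence under the chain model."""
    return exp(-sequence_energy(seq, step_energy)) / partition_function(step_energy)


# Placeholder energies for a 4-base site (3 steps); values are made up.
step_energy = [
    {p + n: 0.1 * ((4 * BASES.index(p) + BASES.index(n) + i) % 5)
     for p in BASES for n in BASES}
    for i in range(3)
]

# Sanity check by brute force: probabilities over all 4**4 sequences sum to 1.
total = sum(probability("".join(s), step_energy) for s in product(BASES, repeat=4))
```

The brute-force sum is only feasible for short sites; the forward recursion is what makes the joint distribution usable at realistic lengths.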

Availability: Code and data on request.

Contact: Nick Steffen for code and Lorenzo Tolleri for data. ,

Paper04, Presentation05

Title:

Beyond tandem repeats: complex pattern structures and distant regions of similarity

Authors:

Amy M. Hauth, Deborah A. Joseph

Abstract:

Motivation: Tandem repeats (TRs) are associated with human disease, play a role in evolution and are important in regulatory processes. Despite their importance, locating and characterizing these patterns within anonymous DNA sequences remains a challenge. In part, the difficulty is due to imperfect conservation of patterns and complex pattern structures. We study recognition algorithms for two complex pattern structures: variable length tandem repeats (VLTRs) and multi-period tandem repeats (MPTRs).

Results: We extend previous algorithmic research to a class of regular tandem repeats (RegTRs). We formally define RegTRs, as well as two important subclasses: VLTRs and MPTRs. We present algorithms for identification of TRs in these classes. Furthermore, our algorithms identify degenerate VLTRs and MPTRs: repeats containing substitutions, insertions and deletions. To illustrate our work, we present results of our analysis for two difficult regions in cattle and human which reflect practical occurrences of these subclasses in GenBank sequence data. In addition, we show the applicability of our algorithmic techniques for identifying Alu sequences, gene clusters and other distant regions of similarity. We illustrate this with an example from yeast chromosome I.
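For orientation, the simplest case that these algorithms generalize, an exact fixed-period tandem repeat, can be found with a naive scan such as the one below. This is a baseline sketch only, not the authors' method: it does not handle variable-length or multi-period repeats, nor substitutions, insertions or deletions. The function name and default thresholds are assumptions.

```python
def find_tandem_repeats(seq, min_period=1, max_period=6, min_copies=3):
    """Return (start, period, copies) for runs of exact tandem repeats.

    For each period, scan left to right and count how many consecutive
    copies of the unit starting at position i follow one another. Note that
    the same region can be reported at several periods (for example a run
    of one base also repeats with period 2).
    """
    hits = []
    n = len(seq)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period <= n:
            unit = seq[i:i + period]
            copies = 1
            # A short slice at the end of seq never equals unit, so this stops.
            while seq[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, period, copies))
                i += copies * period  # skip past the run just reported
            else:
                i += 1
    return hits


# A (CA)4 repeat embedded in flanking sequence.
hits = find_tandem_repeats("GGCACACACATT", min_period=2, max_period=2)
```

Here `hits` contains a single entry reporting four copies of a period-2 unit starting at index 2.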

Availability: Algorithms can be accessed at

Contact: Amy M. Hauth (608-831-2164) or Deborah A. Joseph (608-262-8022), FAX: 608-262-9777.

Paper05, Presentation06

Title:

The POPPs: Clustering and Searching Using Peptide Probability Profiles

Author:

Michael J. Wise

Abstract:

The POPPs is a suite of inter-related software tools which allow the user to discover what is statistically "unusual" in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition. Finally, the user can search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPP suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.
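One way to picture a composition-based view of a protein is to compare observed dipeptide frequencies with those expected from single-residue frequencies. The sketch below does this with a log-odds score. It is an assumption-laden illustration, not the statistic used by the POPPs: the function name, the choice of dipeptides, and the independence-based expectation are all choices made for the example.

```python
from collections import Counter
from math import log2


def dipeptide_log_odds(seq):
    """Log2 ratio of observed dipeptide frequency to the frequency expected
    if adjacent residues were independent draws from the residue composition.

    Positive scores mark dipeptides that are over-represented in seq.
    """
    residues = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_res, n_pairs = len(seq), len(seq) - 1
    scores = {}
    for pair, count in pairs.items():
        expected = (residues[pair[0]] / n_res) * (residues[pair[1]] / n_res)
        scores[pair] = log2((count / n_pairs) / expected)
    return scores


# Toy sequence with an obvious AK periodicity.
scores = dipeptide_log_odds("AKAKAKGG")
```

Profiles of this kind ignore residue order beyond the peptide length, which is why they can group sequences that alignment-based comparison would treat as unrelated.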

Availability: Contact the author.

Contact:

Paper06, Presentation08

Title:

A sequence profile based HMM for predicting and discriminating beta barrel membrane proteins

Authors:

Pier Luigi Martelli, Piero Fariselli, Anders Krogh, Rita Casadio

Abstract:

Motivation: Membrane proteins are an abundant and functionally relevant subset of proteins that putatively include from about 15 up to 30% of the proteome of organisms fully sequenced. These estimates are mainly computed on the basis of sequence comparison and membrane protein prediction. It is therefore urgent to develop methods capable of selecting membrane proteins especially in the case of outer membrane proteins, barely taken into consideration when proteome wide analysis is performed. This will also help protein annotation when no homologous sequence is found in the database. Outer membrane proteins solved so far at atomic resolution interact with the external membrane of bacteria with a characteristic β barrel structure comprising different even numbers of β strands (β barrel membrane proteins). In this they differ from the membrane proteins of the cytoplasmic membrane endowed with alpha helix bundles (all alpha membrane proteins) and need specialised predictors.

Results: We develop an HMM model, which can predict the topology of β barrel membrane proteins, using as input evolutionary information. The model is cyclic with 6 types of states: two for the β strand transmembrane core, one for the β strand cap on either side of the membrane, one for the inner loop, one for the outer loop and one for the globular domain state in the middle of each loop. The development of a specific input for HMM based on multiple sequence alignment is novel. The accuracy per residue of the model is 82% when a jack knife procedure is adopted. With a model optimisation method using a dynamic programming algorithm seven topological models out of the twelve proteins included in the testing set are also correctly predicted. When used as a discriminator, the model is rather selective. At a fixed probability value, it retains 84% of a non-redundant set comprising 145 sequences of well-annotated outer membrane proteins. Concomitantly, it correctly rejects 90% of a set of globular proteins including about 1200 chains with low sequence identity (< 30%) and 90% of a set of all alpha membrane proteins, including 188 chains.

Availability: The program will be available on request from the authors.

Contact: ,

Paper07, Presentation09

Title:

Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA

Authors:

Christopher Bystroff, Yu Shao

Abstract:

Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others, has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in the efficiency, in particular for the ROSETTA algorithm.

Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6Å. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction.
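The criterion for a topologically correct prediction stated above can be written down directly. The sketch below assumes the predicted and reference alpha-carbon coordinates are already paired residue by residue and already optimally superimposed; it does not perform the superposition itself. The function names are hypothetical.

```python
from math import sqrt


def rmsd(pred, ref):
    """Root-mean-square deviation between paired 3D coordinates.

    Assumes the two coordinate sets are already superimposed.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum((p - q) ** 2 for u, v in zip(pred, ref) for p, q in zip(u, v))
    return sqrt(total / len(pred))


def is_topologically_correct(pred, ref, min_len=30, max_rmsd=6.0):
    """Fragment of at least min_len residues with RMSD below max_rmsd (in Å)."""
    return len(pred) >= min_len and rmsd(pred, ref) < max_rmsd


# Toy fragment: 30 alpha carbons on a line, prediction shifted by 3 Å in y.
ref = [(3.8 * i, 0.0, 0.0) for i in range(30)]
pred = [(x, y + 3.0, z) for x, y, z in ref]
fragment_rmsd = rmsd(pred, ref)
fragment_ok = is_topologically_correct(pred, ref)
```

Every atom in the toy prediction is displaced by the same 3 Å, so the RMSD is 3 Å and the fragment passes the 6 Å threshold.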

Availability: The server is accessible through the web. Programs are available upon request for academics. Licensing agreements are available for commercial interests.

Supplementary information:

Contacts: ,

Paper08, Presentation10

Title:

Prediction of Contact Maps by GIOHMMs and Recurrent Neural Networks Using Lateral Propagation From All Four Cardinal Corners

Authors:

Gianluca Pollastri, Pierre Baldi

Abstract:

Motivation: Accurate prediction of protein contact maps is an important step in computational structural proteomics. Because contact maps provide a translation and rotation invariant topological representation of a protein, they can be used as a fundamental intermediary step in protein structure prediction.

Results: We develop a new set of flexible machine learning architectures for the prediction of contact maps, as well as other information processing and pattern recognition tasks. The architectures can be viewed as recurrent neural network parameterizations of a class of Bayesian networks we call generalized input-output HMMs. For the specific case of contact maps, contextual information is propagated laterally through four hidden planes, one for each cardinal corner. We show that these architectures can be trained from examples and yield contact map predictors that outperform previously reported methods. While several extensions and improvements are in progress, the current version can accurately predict 60.5% of contacts at a distance cutoff of 8Å and 45% of distant contacts at 10Å, for proteins of length up to 300.
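For readers unfamiliar with the prediction target, the sketch below computes a contact map from known coordinates and checks that it is unchanged by translation and rotation. It assumes one representative atom per residue and a strict less-than comparison against the cutoff; both are illustrative choices, and the function name is hypothetical. It shows the target representation only, not the predictor.

```python
from math import dist


def contact_map(coords, cutoff=8.0):
    """Binary matrix with 1 where two residues lie closer than cutoff (in Å)."""
    n = len(coords)
    return [
        [1 if dist(coords[i], coords[j]) < cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy chain: four residues spaced 3.8 Å apart along the x axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)

# The same chain translated, then rotated 90 degrees about the z axis.
moved = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in coords]
rotated = [(-y, x, z) for x, y, z in moved]
same_after_motion = contact_map(rotated) == cmap
```

Because the map depends only on inter-residue distances, the rigid motion leaves it unchanged, which is the invariance property the abstract refers to.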

Availability and Contact: The contact map predictor will be made available as part of an existing suite of proteomics predictors.

Email: {gpollast,pfbaldi}@ics.uci.edu

Paper09, Presentation11

Title:

Rate4Site: An Algorithmic Tool for the Identification of Functional Regions on Proteins by Surface Mapping of Evolutionary Determinants within Their Homologues

Authors:

Tal Pupko, Rachel Bell, Itay Mayrose, Fabian Glaser, Nir Ben-Tal

Abstract:

Motivation: A number of proteins of known three-dimensional (3D) structure exist, with yet unknown function. In light of the recent progress in structure determination methodology, this number is likely to increase rapidly. A novel method is presented here: "Rate4Site", which maps the rate of evolution among homologous proteins onto the molecular surface of one of the homologues whose 3D-structure is known. Functionally important regions correspond to surface patches of slowly evolving residues.

Results: Rate4Site estimates the rate of evolution of amino acid sites using the maximum likelihood (ML) principle. The ML estimate of the rates considers the topology and branch lengths of the phylogenetic tree, as well as the underlying stochastic process. To demonstrate its potency, we study the Src SH2 domain. Like previously established methods, Rate4Site detected the SH2 peptide-binding groove. Interestingly, it also detected inter-domain interactions between the SH2 domain and the rest of the Src protein that other methods failed to detect.

Availability: Rate4Site can be downloaded at:

Contact: ; ; ;

Supplementary Information: multiple sequence alignment of homologous domains from the SH2 protein family, the corresponding phylogenetic tree and additional examples are available at

Paper10, Presentation12

Title:

Inferring sub-cellular localization through automated lexical analysis

Authors:

Rajesh Nair, Burkhard Rost

Abstract:

Motivation: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is only available for few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available.

Results: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for less than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

Availability: Annotations of localization for eukaryotes at:

Contact:

Paper11, Presentation14

Title:

Support vector regression applied to the determination of the developmental age of a Drosophila embryo from its segmentation gene expression patterns

Authors:

Ekaterina Myasnikova, Anastassia Samsonova, John Reinitz, Maria Samsonova

Abstract:

Motivation: In this paper we address the problem of the determination of developmental age of an embryo from its segmentation gene expression patterns in Drosophila.

Results: By applying support vector regression we have developed a fast method for automated staging of an embryo on the basis of its gene expression pattern. Support vector regression is a statistical method for creating regression functions of arbitrary type from a set of training data. The training set is composed of embryos for which the precise developmental age was determined by measuring the degree of membrane invagination. Testing the quality of regression on the training set showed good prediction accuracy. The optimal regression function was then used for the prediction of the gene expression based age of embryos in which the precise age has not been measured by membrane morphology. Moreover, we show that the same accuracy of prediction can be achieved when the dimensionality of the feature vector was reduced by applying factor analysis. The data reduction allowed us to avoid over-fitting and to increase the efficiency of the algorithm.

Availability: This software may be obtained from the authors.

Contact:

Paper12, Presentation15

Title:

Variance stabilization applied to microarray data calibration and to the quantification of differential expression

Authors:

Wolfgang Huber, Anja von Heydebreck, Holger Sueltmann, Annemarie Poustka, Martin Vingron

Abstract:

We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation $h$ for intensity measurements, and a difference statistic $\Delta h$ whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation $h$, the parametric form $h(x) = \operatorname{arsinh}(a + bx)$ is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, $h$ coincides with the logarithmic transformation, and $\Delta h$ with the log-ratio. The parameters of $h$ together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-color cDNA arrays and a series of Affymetrix oligonucleotide arrays.
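The behaviour of the transformation at the two ends of the intensity range can be checked numerically. The sketch below uses placeholder parameter values (a = 0, b = 1) purely for illustration; in the abstract these parameters are estimated from the data, and the function names here are hypothetical.

```python
from math import asinh, log


def h(x, a=0.0, b=1.0):
    """Variance-stabilizing transform h(x) = arsinh(a + b*x)."""
    return asinh(a + b * x)


def delta_h(x1, x2, a=0.0, b=1.0):
    """Difference statistic between two intensities on the transformed scale."""
    return h(x1, a, b) - h(x2, a, b)


# Large intensities: arsinh(y) approaches log(2y), so the difference
# statistic approaches the log-ratio log(x1 / x2).
high = delta_h(20000.0, 10000.0)
log_ratio = log(20000.0 / 10000.0)

# Near zero the transform stays finite, whereas log(x) diverges.
at_zero = h(0.0)
```

With these values `high` agrees with `log_ratio` to within about 1e-8, and `at_zero` is exactly 0.0.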