Our group works broadly in computational molecular biology and bioinformatics. We are interested in questions relating to predicting protein function, interactions and specificity, as well as in analyzing cellular interaction networks in order to uncover cellular organization, functioning, and pathways. An appreciation of protein structure guides much of our research.
Organization of cellular networks. Large-scale networks of protein-protein interactions provide a view into the workings of the cell. However, these interaction maps do not come with a key for interpreting them, so it is necessary to develop methods that shed light on their functioning and organization. We have introduced the language of network schemas to elucidate shared mechanisms within interactomes. A network schema consists of descriptions of proteins (e.g., their putative structural domains or molecular functions) along with the desired topology and types of interactions among them. Schemas can thus describe domain-domain interactions, signaling and regulatory pathways, or more complex network patterns. We have developed a computational framework for identifying network schemas that are recurrent and over-represented in the network, even given the distributions of their constituent components. We have applied this methodology to the physical interaction network in the model organism baker's yeast and have begun to build a hierarchy of schemas starting with the four simplest topologies. We have identified hundreds of recurring and over-represented network schemas of various complexity, and demonstrated via graph-theoretic representations how more complex schemas are organized in terms of their lower-order constituents. The uncovered schemas span a wide-range of cellular activities, with many signaling and transport related higher-order schemas. We have established the functional importance of the schemas by showing that they correspond to functionally cohesive sets of proteins, are enriched in the frequency with which they have instances in the human interactome, and are useful for predicting protein function. We have also developed a system NetGrep for searching protein interaction networks for matches to user-supplied network schemas. Netgrep has an advanced graphical interface for specifying schemas and fast algorithms for extracting their matches.
Predicting protein function via analysis of protein interaction maps. Our group has also begun development of computational methods for analyzing protein interaction networks in order to uncover protein functions. We have developed a novel algorithm based on network flow that is highly effective in predicting the biological processes of proteins. Our method FunctionalFlow outperforms previously described approaches in predicting the function of proteins with few (or no) known physical interactions with annotated proteins. Our method exploits the topological structure of interaction networks in order to make predictions, and can be applied to either experimentally or computationally determined protein interaction maps. The key insight of our method is its integration of both network topology and locality considerations. A paper describing this work (by Elena Nabieva et al. ) was awarded a Best Student Paper award in ISMB 2005.
Predicting protein physical interactions. Many protein interactions are mediated by well-characterized structural domains that exhibit wide-ranging specificity. We have been developing a general structural bioinformatics approach for predicting protein interactions that is designed to be applied to specific structural domains. We have introduced an optimization framework, exploiting support vector machines, for predicting protein interactions that can exploit both genomic sequence data and quantitative biophysical data. In collaboration with Amy Keating at MIT, we have demonstrated the effectiveness of our method in predicting coiled-coil protein interactions; it is the first demonstration of an interaction interface for which such large-scale, high-confidence computational predictions of direct physical interactions can be made. Our approach has been tested on a dataset characterizing nearly all possible bZIP coiled-coil pairings in the human genome. bZIPs are a large class of eukaryotic transcription factors that can ``mix and match'' in a way that provides combinatorial regulation. We have found that it is possible to identify a significant fraction (70%) of bZIP coiled-coil interactions, while maintaining that >90% of the predictions are correct. Similarly, it is possible to eliminate, with high confidence, the vast majority of pairings that do not interact. We have also utilized this basic framework to predict protein-DNA interactions mediated by C2H2 zinc finger domains, which comprise the largest class of eukaryotic transcription factors. We have developed the first approach that utilizes information about relative binding affinity.
Side-chain positioning. Side-chain positioning is a key component in characterizing protein interaction interfaces, as well as in predicting protein structures and interactions. Here, we are given a fixed backbone template and a protein sequence, and the goal is to predict the best conformation of the sequence's amino acids on this backbone. Since side chains tend to occupy one of a small number of conformations, this is formulated as a discrete problem where the total energy of the molecule is expressed as a sum of pairwise energies. With Bernard Chazelle, we have shown that while it is NP-complete to obtain even an approximate solution to the side-chain positioning problem, mathematical programming approaches are stunningly effective in practice, solving to optimality large problem sizes on a standard desktop. Besides its apparent speed, our method's advantage is that it exploits highly-optimized algorithmic machinery while remaining simple and flexible---allowing us, for example, to incorporate constraints that obtain successive, near-optimal solutions that may be further required to differ in at least a certain fraction of restricted (e.g., core residue) positions, or that disallow certain rotamer pairs.
Amino acid frequencies in ancestral proteins. With Jacques Fresco, my group has been looking at the problem of how protein sequences and structures may have evolved over deep time. We have developed two alternative methods for estimating the amino acid composition of ancestral genomes, and have applied them to infer the amino acid composition of a large protein set in the last universal ancestor (LUA) of all extant species. The first uses the composition of conserved residues in modern sequences as well as an empirical knowledge of each amino acid's tendency to remain conserved. The second exploits probabilisitic models of amino acid substitution more fully, and develops an expectation-maximization approach for inferring ancestral composition. Relative to the modern protein set, both methods predict that LUA proteins are generally richer in those amino acids that are believed to have been most abundant in the prebiotic environment and poorer in those amino acids that are believed to have been unavailable or scarce. Our findings provide clues as to the order in which amino acids were introduced into the genetic code and thus into primitive proteins, with amino acids of declining frequencies being the first to be incorporated into the genetic code and those of increasing frequencies being late recruits.