Research

Our group works broadly in computational molecular biology and bioinformatics. We are particularly interested in questions relating to protein structure, function, and interactions.

Protein structural motif recognition. Because of the bewildering complexity of the general protein structure prediction problem, one approach we have taken in our work on protein structure is "bottom up." That is, we have focused on specific local 3D structures, or structural motifs, and have developed fast, sequence-based methods for recognizing them within protein sequences. Much of our work has focused on recognizing the coiled-coil motif, an important structural motif found in proteins that participate in transcription, oncogenesis, and cell structure. Over a series of papers, we made novel predictions of coiled-coil and coiled-coil-like structures, many of which have subsequently been verified. In one line of work (with Bonnie Berger and Peter Kim), we identified trimer-of-hairpins motifs consisting of coiled-coil-like regions in the membrane fusion proteins of many diverse viruses, including retroviruses (e.g., HIV and HTLV), paramyxoviruses (e.g., human respiratory syncytial virus), and filoviruses (e.g., Ebola). In follow-up experimental work, the predicted coiled-coil regions of the human respiratory syncytial virus and visna virus membrane fusion proteins were crystallized, and their X-ray structures provided spectacular confirmation of our predictions.

Side-chain positioning. An alternative way to simplify the general protein structure prediction problem is to consider side-chain positioning: given a fixed backbone template and a protein sequence, predict the best conformation of the sequence's amino acids on this backbone. Since side chains tend to occupy one of a small number of conformations (rotamers), this is formulated as a discrete problem in which the total energy of the molecule is expressed as a sum of pairwise energies. With Bernard Chazelle, we have shown that while it is NP-complete to obtain even an approximate solution to the side-chain positioning problem, mathematical programming approaches are stunningly effective in practice, solving large problem instances to optimality on a standard desktop. Besides its apparent speed, our method's advantage is that it exploits highly optimized algorithmic machinery while remaining simple and flexible. For example, we can add constraints that yield successive near-optimal solutions, optionally requiring each new solution to differ from earlier ones in at least a certain fraction of restricted (e.g., core residue) positions, or constraints that disallow certain rotamer pairs.
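To make the discrete formulation concrete, here is a minimal sketch of the objective on a toy instance. It enumerates every rotamer assignment exhaustively, which is only feasible at tiny sizes and is not the mathematical programming approach described above. The function names, the energy tables, and the `forbidden` format are all hypothetical and chosen for illustration.

```python
from itertools import product

def total_energy(assignment, self_energy, pair_energy):
    """Energy of one rotamer assignment: self terms plus pairwise terms."""
    n = len(assignment)
    energy = sum(self_energy[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            table = pair_energy.get((i, j))  # missing key means no interaction
            if table is not None:
                energy += table[assignment[i]][assignment[j]]
    return energy

def best_assignment(self_energy, pair_energy, forbidden=frozenset()):
    """Exhaustive search over all assignments (toy sizes only).

    forbidden holds tuples (i, r, j, s): rotamer r at position i may not
    be combined with rotamer s at position j.
    """
    choices = [range(len(e)) for e in self_energy]
    best, best_e = None, float("inf")
    for assignment in product(*choices):
        if any(assignment[i] == r and assignment[j] == s
               for (i, r, j, s) in forbidden):
            continue
        e = total_energy(assignment, self_energy, pair_energy)
        if e < best_e:
            best, best_e = assignment, e
    return best, best_e

# Made-up energies: three positions with two rotamers each.
self_energy = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.1]]
pair_energy = {
    (0, 1): [[2.0, 0.0], [0.0, 1.0]],
    (1, 2): [[0.0, 0.5], [1.5, 0.0]],
}

unconstrained = best_assignment(self_energy, pair_energy)
# Disallow rotamer 0 at position 0 together with rotamer 1 at position 1.
constrained = best_assignment(self_energy, pair_energy,
                              forbidden={(0, 0, 1, 1)})
```

The search space grows exponentially with the number of positions, which is why exact solvers rather than enumeration are needed for realistic instances. In an integer programming formulation, a disallowed rotamer pair would presumably become one extra linear constraint instead of a filter inside a loop.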

Predicting protein physical interactions. Many protein interactions are mediated by well-characterized structural domains that exhibit wide-ranging specificity. We have been developing a general structural bioinformatics approach for predicting protein interactions that is designed to be applied to specific structural domains. We have introduced an optimization framework for predicting protein interactions that can exploit both genomic sequence data and quantitative biophysical data. In collaboration with Amy Keating at MIT, we have demonstrated the effectiveness of our method in predicting coiled-coil protein interactions; it is the first demonstration of an interaction interface for which such large-scale, high-confidence computational predictions of direct physical interactions can be made. Our approach has been tested on a dataset characterizing nearly all possible bZIP coiled-coil pairings in the human genome. bZIPs are a large class of eukaryotic transcription factors that can "mix and match" in a way that provides combinatorial regulation. We have found that it is possible to identify a significant fraction (70%) of bZIP coiled-coil interactions while keeping more than 90% of the predictions correct. Similarly, it is possible to eliminate, with high confidence, the vast majority of pairings that do not interact. We are currently extending the methodology to predict protein-protein and protein-DNA interactions mediated by other structural domains.
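The two figures quoted above correspond to what is usually called recall (the fraction of true interactions identified) and precision (the fraction of predictions that are correct). The sketch below shows how the two trade off as a score threshold moves. The scores and labels are invented for illustration and are not drawn from the bZIP dataset, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every pair scoring >= threshold is predicted to interact."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    # Conventions for the empty cases: no predictions means no wrong ones.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented scores for eight candidate pairings; True marks a real interaction.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]

loose = precision_recall(scores, labels, threshold=0.5)
strict = precision_recall(scores, labels, threshold=0.65)
```

In this toy data, raising the threshold removes the one false positive without losing any true interaction, so precision rises while recall stays the same. On real data, a stricter threshold usually costs some recall.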

Predicting protein function via analysis of protein interaction maps. Our group has also begun developing computational methods for analyzing protein interaction networks in order to uncover protein function and pathways. We have developed a novel algorithm based on network flow that is highly effective in predicting the biological processes of proteins. Our method, FunctionalFlow, outperforms previously described approaches in predicting the function of proteins with few (or no) known physical interactions with annotated proteins. It exploits the topological structure of interaction networks in order to make predictions, and can be applied to either experimentally or computationally determined protein interaction maps. The key insight of our method is its integration of both network topology and locality considerations. A paper describing this work (by Elena Nabieva et al.) received a Best Student Paper award at ISMB 2005.
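As a rough illustration of how a flow-based scheme can combine topology and locality, the sketch below treats each annotated protein as an unlimited source and pushes flow "downhill" along weighted edges for a fixed number of steps; a protein's score is the total flow it receives. The text above says only that the algorithm is based on network flow, so the specific update rule, the function name `flow_scores`, and the toy network here are assumptions made for illustration, not a description of FunctionalFlow itself.

```python
def flow_scores(edges, sources, steps):
    """Score nodes by the flow they receive from annotated source nodes.

    edges:   dict mapping (u, v) to a positive weight (undirected edge capacity)
    sources: set of nodes annotated with the function of interest
    steps:   number of synchronous propagation rounds (limits locality)
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    inf = float("inf")
    reservoir = {n: (inf if n in sources else 0.0) for n in adj}
    score = {n: 0.0 for n in adj}

    for _ in range(steps):
        # Compute all flows from the previous round's reservoirs.
        flow = {}
        for u, nbrs in adj.items():
            total_w = sum(nbrs.values())
            for v, w in nbrs.items():
                if reservoir[u] > reservoir[v]:  # flow only goes downhill
                    share = reservoir[u] * w / total_w
                    flow[(u, v)] = min(w, share)  # capped by edge capacity
        # Then apply them all at once.
        for (u, v), f in flow.items():
            reservoir[u] -= f
            reservoir[v] += f
            score[v] += f
    return score

# Toy path network A - B - C with unit weights; only A is annotated.
edges = {("A", "B"): 1.0, ("B", "C"): 1.0}
score = flow_scores(edges, sources={"A"}, steps=2)
```

With two steps, B receives flow in both rounds while C receives only what B can pass on in the second round, so proteins closer to an annotated protein score higher. The number of steps bounds how far an annotation can spread.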

Amino acid frequencies in ancestral proteins. With Jacques Fresco, my group has been looking at the problem of how protein sequences and structures may have evolved over deep time. We have developed two alternative methods for estimating the amino acid composition of ancestral genomes, and have applied them to infer the amino acid composition of a large protein set in the last universal ancestor (LUA) of all extant species. The first uses the composition of conserved residues in modern sequences together with empirical knowledge of each amino acid's tendency to remain conserved. The second exploits probabilistic models of amino acid substitution more fully, and develops an expectation-maximization approach for inferring ancestral composition. Relative to the modern protein set, both methods predict that LUA proteins are generally richer in those amino acids that are believed to have been most abundant in the prebiotic environment and poorer in those amino acids that are believed to have been unavailable or scarce. Our findings provide clues as to the order in which amino acids were introduced into the genetic code and thus into primitive proteins, with amino acids of declining frequencies being the first to be incorporated into the genetic code and those of increasing frequencies being late recruits.
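One way to read the first method is as a reweighting: if an amino acid tends to stay conserved less often than others, its share among conserved positions understates its ancestral share, so each count is divided by that tendency before renormalizing. The sketch below implements only this reading. The correction formula, the name `ancestral_composition`, and all numbers are assumptions for illustration and are not taken from the work described above.

```python
from collections import Counter

def ancestral_composition(conserved_residues, conservation_propensity):
    """Reweight the composition of conserved positions.

    conserved_residues:      string of one-letter codes at conserved positions
    conservation_propensity: assumed probability that each amino acid,
                             if ancestral, is still conserved today
    """
    counts = Counter(conserved_residues)
    corrected = {aa: counts[aa] / conservation_propensity[aa] for aa in counts}
    total = sum(corrected.values())
    return {aa: value / total for aa, value in corrected.items()}

# Invented example: tryptophan (W) is assumed to be retained more reliably
# than glycine (G) or alanine (A), so its raw share overstates its past share.
conserved = "GGGGAAWW"
propensity = {"G": 0.5, "A": 0.5, "W": 1.0}

estimate = ancestral_composition(conserved, propensity)
```

Here W makes up a quarter of the conserved positions but only a seventh of the corrected estimate, while G rises from a half to four sevenths.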