COALESCE: An integrative framework to uncover metazoan transcriptional networks

Hilary A. Coller
Computer Science, Princeton University

While the genome sequence of an organism describes its complement of potential proteins, it is the controlled expression, translation, and modification of these proteins that allows cells to survive and grow. At the level of transcription and mRNA stability, a complex regulatory network of transcription factors, RNA binding proteins, and microRNAs is an important determinant of a cell’s response to intracellular and extracellular signals. Understanding the elements of this regulatory network and the stimuli to which it responds is central to understanding the structure of biological systems in higher organisms, and to understanding how misregulation of this network causes human disease.

Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction (COALESCE) is an algorithm for discovering regulatory networks from large collections of genomic data. COALESCE uses large-scale Bayesian integration of multiple data types to predict coregulated gene modules, the conditions under which they are coregulated, and the consensus binding motifs responsible for their regulation. Through a novel synthesis of gene expression biclustering and motif prediction, COALESCE can successfully find coregulated modules in organisms ranging from E. coli to humans and in data collections as large as 15,000 experimental conditions.

We have applied COALESCE to data from a wide range of organisms, including H. sapiens, M. musculus, C. elegans, S. cerevisiae, H. pylori, and E. coli. Using ~2,200 yeast expression conditions, we recapitulate many known regulatory interactions (e.g. AFT2 in iron transport, STE12 activating mating genes) and highlight the importance of PUF family 3' UTR binding in a wide variety of targets. In an analysis of ~15,000 human gene expression conditions, we extract a wide variety of putative upstream binding sites and potential 3' miRNA sites. On synthetic data comprising 5,000 genes and 100 conditions with 10 "activators" and "repressors" generated from a randomized model, COALESCE successfully recovered 60-90% of the affected genes, conditions, and binding motifs. In five sets of synthetic data containing no such regulators, COALESCE generated zero false positives. We are currently testing several novel transcriptional regulators of quiescence in human fibroblasts predicted by COALESCE. Its ability to probabilistically leverage large collections of heterogeneous data makes it particularly suited to unraveling complex metazoan regulatory networks.
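
The recovery and false-positive figures for the synthetic data can be read as set comparisons between what was planted and what was predicted. The following is a minimal Python sketch of that bookkeeping only; it is not the COALESCE implementation, and the function names (`recovery_fraction`, `false_positive_count`) and the example gene indices are hypothetical, chosen purely for illustration.

```python
def recovery_fraction(predicted, planted):
    """Fraction of planted items (genes, conditions, or motifs) that were predicted.

    Assumption: a planted item counts as recovered if it appears anywhere
    in the predicted set; the abstract does not state the exact criterion.
    """
    predicted, planted = set(predicted), set(planted)
    if not planted:
        return 0.0  # nothing was planted, so nothing can be recovered
    return len(predicted & planted) / len(planted)


def false_positive_count(predicted, planted):
    """Number of predicted items that were never planted."""
    return len(set(predicted) - set(planted))


# Hypothetical example: five genes affected by a planted regulator,
# four genes reported by a module-finding method.
planted_genes = [1, 2, 3, 4, 5]
predicted_genes = [1, 2, 3, 7]

gene_recovery = recovery_fraction(predicted_genes, planted_genes)
gene_false_positives = false_positive_count(predicted_genes, planted_genes)

# Null data set with no planted regulators: any prediction is a false positive.
null_false_positives = false_positive_count([], [])
```

Under these assumptions, a run on regulator-free data yields zero false positives exactly when the method predicts no modules at all.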