Finding Transcription Modules from Large Gene-Expression Data Sets

Professor Ned Wingreen

Molecular Biology, Princeton University

DNA microarrays ("gene chips") have led to a rapid accumulation of gene-expression data. One of the major challenges in analyzing these data is the diversity in both size and signal strength of the various transcriptional modules, i.e. sets of coregulated genes along with the sets of conditions for which the genes are strongly coregulated. One method that has proven successful at identifying large and/or strong modules is the Iterative Signature Algorithm (ISA) [1]. I'll discuss a modified version of the ISA that sequentially eliminates transcriptional modules as they are identified. The resulting algorithm, the Progressive Iterative Signature Algorithm (PISA), has two main advantages over ISA. First, PISA is able to separate a weak module from an overlapping strong module: PISA identifies and removes the strong module, after which the weak module is readily recovered. Second, successively removing large modules reduces the total background noise, so that even unrelated small modules become easier to identify.
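The "identify, then remove" idea can be illustrated with a minimal sketch. It is not the published ISA or PISA: the function names, the single gene threshold `t_gene`, the z-score normalization, the synthetic data, and the choice to remove a module by projecting its condition-score vector out of every gene are all assumptions made for illustration. The published algorithms involve further normalization and thresholding steps that are not reproduced here.

```python
import numpy as np

def find_module(E, seed, t_gene=2.0, max_iter=50):
    """Simplified signature iteration on a genes-by-conditions matrix E.

    Alternates between a condition-score vector and a gene set until the
    gene set stops changing. Illustrative only; not the published ISA.
    """
    genes = np.asarray(seed, dtype=bool)
    cond = np.zeros(E.shape[1])
    for _ in range(max_iter):
        cond = E[genes].sum(axis=0)          # condition scores from current genes
        norm = np.linalg.norm(cond)
        if norm == 0:                        # empty gene set or no signal
            break
        cond = cond / norm                   # unit-length condition-score vector
        score = E @ cond                     # gene scores along that vector
        z = (score - score.mean()) / score.std()
        new_genes = z > t_gene               # keep genes that stand out
        converged = np.array_equal(new_genes, genes)
        genes = new_genes
        if converged or not genes.any():
            break
    return genes, cond

def remove_module(E, cond):
    """Assumed removal step: project the condition-score vector out of every gene."""
    return E - np.outer(E @ cond, cond)

# Synthetic example (hypothetical data): one strong module planted in noise.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 50))
E[:20, :10] += 5.0                           # genes 0-19 coregulated in conditions 0-9

seed = np.zeros(200, dtype=bool)
seed[:5] = True                              # start from a few module genes
genes, cond = find_module(E, seed)
E_reduced = remove_module(E, cond)           # data with the found module eliminated
```

Under these assumptions, a second search would be run on `E_reduced` rather than on `E`, so that the strong module found first can no longer mask weaker modules that overlap it.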

We tested PISA on a large gene-expression data set for the yeast Saccharomyces cerevisiae, comprising 1012 experimental conditions for 6206 genes. PISA identified a large number of modules, most of which could be readily assigned to specific biological functions. These included many small modules (with as few as five genes) that could not be easily found by ISA. We compared the set of modules we found to the Gene Ontology annotation database and found many significant overlaps. The modules identified by PISA also compare favorably to experimentally and theoretically determined sets of genes regulated by individual transcription factors.

[1] Bergmann, S., Ihmels, J., and Barkai, N., Phys. Rev. E 67, 031902 (2003).
