Query-Driven Search Methods for Large Microarray Databases

The large and growing collection of microarray datasets from diverse experimental contexts offers the prospect of broadly characterizing gene expression and regulation across a variety of conditions. However, such vast amounts of data can be unwieldy to analyze, and biases in functional coverage can lead to misleading conclusions. Searches and analyses of these data must be targeted and interactive so that expert biologists can leverage their own knowledge to quickly formulate and test new hypotheses.

We have built a database of S. cerevisiae microarray datasets from over 80 publications, totaling roughly 2400 microarray conditions. Traditional analysis methods, including many forms of clustering, can produce misleading results when applied to such a large and diverse compendium. We propose a technique that combines query-based analysis with pre-processing of the compendium to allow fast, targeted searches through a large amount of data. Using this approach, we can quickly recapitulate known pathways and functions from a small seed set of related genes, and predict novel players in many specific contexts.
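
As an illustration only, the idea of a query-driven search over a compendium might be sketched as follows. The abstract does not specify the scoring scheme, so everything here is an assumption rather than the authors' method: each dataset is weighted by how strongly the seed (query) genes co-express within it, and every other gene is then ranked by its weighted average Pearson correlation to the seed set. The function names `zscore_rows` and `query_search`, the weighting rule, and the toy data are all hypothetical.

```python
import numpy as np


def zscore_rows(x):
    """Standardize each gene (row) to zero mean, unit variance across conditions."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows become all zeros instead of dividing by zero
    return centered / std


def query_search(datasets, genes, query):
    """Rank non-query genes by dataset-weighted correlation to the query set.

    datasets: list of 2-D arrays (genes x conditions) sharing the row order in `genes`.
    Assumed scheme (not from the source): a dataset's weight is the mean pairwise
    correlation among the query genes in that dataset, clipped at zero.
    """
    if len(query) < 2:
        raise ValueError("need at least two query genes to weight datasets")
    q_idx = [genes.index(g) for g in query]
    k = len(q_idx)
    scores = np.zeros(len(genes))
    total_weight = 0.0
    for data in datasets:
        z = zscore_rows(data)  # this step could be precomputed once per dataset
        # Pearson correlation of every gene to every query gene (genes x k).
        corr_to_query = z @ z[q_idx].T / z.shape[1]
        qq = corr_to_query[q_idx]  # k x k block of query-to-query correlations
        weight = max((qq.sum() - np.trace(qq)) / (k * (k - 1)), 0.0)
        scores += weight * corr_to_query.mean(axis=1)
        total_weight += weight
    if total_weight > 0:
        scores /= total_weight
    order = np.argsort(-scores)
    return [(genes[i], float(scores[i])) for i in order if i not in q_idx]


# Toy compendium (made-up numbers): genes A, B, C, D measured in two datasets.
genes = ["A", "B", "C", "D"]
coherent = [[1, 2, 3, 4], [2, 4, 6, 8], [1.1, 2.0, 3.2, 3.9], [4, 3, 2, 1]]
incoherent = [[1, -1, 1, -1], [1, 1, -1, -1], [-1, 1, -1, 1], [1, -1, 1, -1]]
ranked = query_search([coherent, incoherent], genes, query=["A", "B"])
```

In this toy example the query genes A and B are uncorrelated in the second dataset, so under the assumed weighting it contributes nothing, and the ranking is driven by the dataset in which the seed genes actually co-express. Standardizing each dataset ahead of time is one plausible form of the pre-processing that makes such searches fast.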