A Combinatorial Optimization Approach to Motif Finding

Discovering approximately repeated patterns, or motifs, in genomic sequences is an important problem in computational molecular biology. Most frequently, motif finding applications arise when identifying shared regulatory signals within DNA sequences or shared functional and structural elements within protein sequences. Due to the diversity of contexts in which motif finding is applied, several variations of the problem are commonly studied.

We introduce a versatile combinatorial optimization framework for the motif finding problem, which couples graph pruning techniques with a novel integer linear programming formulation. Our method is flexible and robust enough to accommodate several variants of the motif finding problem, and we extend it to discover multiple motifs, as well as motifs found in evolutionarily related sequences of varying phylogenetic distance. In contrast to commonly-used stochastic search methods for the problem, our combinatorial approach typically yields optimal solutions. We apply our method to numerous DNA and protein datasets, as well as to synthetic data, and in all cases it performs very well, identifying either known motifs or motifs of high conservation. Finally, we assess statistical significance of the discovered motifs, and find that in the vast majority of cases such a motif is unlikely to have arisen by chance alone.