Computational Methods for Predicting Transcription Factor Binding Sites (thesis) | Computer Science Department at Princeton University

Report ID:

TR-765-06

Authors:

Osada, Robert

Date:

August 2006

Abstract:

A major challenge in computational biology is to understand the mechanisms that control gene expression. Transcription factor proteins mediate this process by interacting with a cell's DNA. This thesis studies the problem of identifying sequence-specific DNA binding sites of transcription factors, taking two complementary approaches: one based primarily on identifying sequence features, and the other exploiting a transcription factor's structure.

The first approach considers the problem of developing a representation for the DNA sites known to be bound by a particular transcription factor, in order to recognize its other binding sites. The effectiveness of several commonly used approaches is compared, including position-specific scoring matrices, consensus sequences, and match-mismatch-based methods, showing that there are statistically significant differences in their performance. Furthermore, the use of per-position information content improves all of the basic approaches, and including local pairwise nucleotide dependencies within binding site models yields statistically significant improvements for approaches based on nucleotide matches. Based on this analysis, the best results when searching for DNA binding sites of a transcription factor are obtained by methods that use both information content and local pairwise correlations.
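As an illustration of the kind of sequence-based model compared here, the following is a minimal sketch of a position-specific scoring matrix with per-position information content. It is not the thesis's implementation: the function names, the toy binding sites, the uniform background, the pseudocount, and the choice to use information content as a per-position weight are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies for equal-length sites (with a pseudocount)."""
    length = len(sites[0])
    counts = [{b: pseudocount for b in BASES} for _ in range(length)]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def information_content(freqs):
    """Bits per position against a uniform background: 2 + sum(p * log2(p))."""
    return [2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
            for col in freqs]

def pssm_score(window, freqs, background=0.25, weights=None):
    """Sum of per-position log-odds scores, optionally weighted per position."""
    if weights is None:
        weights = [1.0] * len(freqs)
    return sum(w * math.log2(col[base] / background)
               for w, col, base in zip(weights, freqs, window))

def best_hit(sequence, freqs, weights=None):
    """Return (score, start) of the highest-scoring window in the sequence."""
    k = len(freqs)
    return max((pssm_score(sequence[i:i + k], freqs, weights=weights), i)
               for i in range(len(sequence) - k + 1))

# Hypothetical toy sites, not data from the thesis.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
freqs = position_frequencies(sites)
ic = information_content(freqs)

plain = best_hit("CCGTTGACTCAGGA", freqs)
weighted = best_hit("CCGTTGACTCAGGA", freqs, weights=ic)
print(plain, weighted)
```

Weighting by information content makes mismatches at highly conserved positions cost more than mismatches at variable ones. Modeling local pairwise dependencies would additionally require frequencies over adjacent dinucleotides, which this sketch omits.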

The second approach focuses on a particular structural class of transcription factors, the CCHH zinc fingers, which comprise the largest family of eukaryotic transcription factors. Zinc finger protein-DNA interactions are modeled by the pairwise residue-base interactions that make up their structural interface, and a modified support vector machine framework is used to find the favorability of each residue-base interaction. Unlike previous approaches, this framework includes not only examples of known interactions but also quantitative information about the relative binding affinities between different protein-DNA configurations. The resulting classifier performs well in a variety of cross-validation tests.
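One way to picture how relative affinities can enter a margin-based learner is sketched below. This is an assumption-laden toy, not the thesis's formulation: it uses a linear model over one-hot residue-base features, turns both binding examples and affinity comparisons into constraints of the form w·d ≥ 1, and minimizes a regularized hinge loss by plain subgradient descent. The residue subset, the example contacts, and all function names are hypothetical.

```python
from itertools import product

RESIDUES = "RHN"  # toy subset of amino acids, chosen only for illustration
BASES = "ACGT"
INDEX = {pair: i for i, pair in enumerate(product(RESIDUES, BASES))}

def featurize(contacts):
    """Count each (residue, base) contact in a protein-DNA configuration."""
    x = [0.0] * len(INDEX)
    for residue, base in contacts:
        x[INDEX[(residue, base)]] += 1.0
    return x

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train(constraints, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on a hinge loss; each constraint d wants w.d >= 1."""
    w = [0.0] * len(INDEX)
    for _ in range(epochs):
        for d in constraints:
            violated = dot(w, d) < 1.0
            for i in range(len(w)):
                grad = reg * w[i] - (d[i] if violated else 0.0)
                w[i] -= lr * grad
    return w

# Hypothetical examples: a binder, a non-binder, and one affinity comparison.
binder = featurize([("R", "G"), ("N", "A")])
non_binder = featurize([("R", "T"), ("N", "C")])
strong = featurize([("R", "G"), ("H", "G")])
weak = featurize([("R", "G"), ("H", "T")])

constraints = [
    binder,                                # known interaction: w.x >= 1
    [-v for v in non_binder],              # non-interaction: w.x <= -1
    [s - t for s, t in zip(strong, weak)], # strong should outscore weak by 1
]
w = train(constraints)
print(dot(w, strong) > dot(w, weak))
```

The learned weight for each residue-base pair can be read as a favorability score for that contact. The comparison constraint only ever sees the difference between two configurations, so it carries affinity ordering without needing absolute labels.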