HAPLOFREQ - Estimating Haplotype Frequencies Efficiently
Abstract:
We present a new method HAPLOFREQ to estimate haplotype
frequencies over a short genomic region given the genotypes or
haplotypes with missing data. Our approach incorporates a maximum
likelihood model based on a simple random generative model which assumes
that the genotypes are independently sampled from the population.
We first show that if the phased haplotypes are given, possibly with missing data,
we can estimate the frequency of the haplotypes in the population by finding
the global optimum of the likelihood function in polynomial time.
If the haplotypes are not phased, finding the maximum value of the likelihood
function is NP-hard. In this case we define an alternative
likelihood function which can be thought of as a relaxed likelihood function. We show
 that the maximum relaxed likelihood can be found in polynomial time, and that
 the optimal solution of the relaxed likelihood approaches asymptotically to the
haplotype frequencies in the data.In contrast to previous approaches, our algorithms are guaranteed to
converge in polynomial time to a global maximum of the different likelihood functions.
Preliminary experiments on biological data show that our estimates are
about 10% more accurate than the popular program PHASE and about three to ten times faster.Our techniques involve new algorithms in convex optimization. These
algorithms may be of independent interest. Furthermore, the hardness
proof involves a generalization of Turan's theorem, which may also be
of independent interest.