Published on *Computer Science Department at Princeton University* (https://www.cs.princeton.edu)

We present a new method HAPLOFREQ to estimate haplotype

frequencies over a short genomic region given the genotypes or

haplotypes with missing data. Our approach incorporates a maximum

likelihood model based on a simple random generative model which assumes

that the genotypes are independently sampled from the population.

We first show that if the phased haplotypes are given, possibly with missing data,

we can estimate the frequency of the haplotypes in the population by finding

theglobaloptimum of the likelihood function in polynomial time.

If the haplotypes are not phased, finding the maximum value of the likelihood

function is NP-hard. In this case we define an alternative

likelihood function which can be thought of as a relaxed likelihood function. We show

Â that the maximum relaxed likelihood can be found in polynomial time, and that

Â the optimal solution of the relaxed likelihood approaches asymptotically to the

haplotype frequencies in the data.In contrast to previous approaches, our algorithms are guaranteed to

converge in polynomial time to a global maximum of the different likelihood functions.

Preliminary experiments on biological data show that our estimates are

about 10% more accurate than the popular program PHASE and about three to ten times faster.Our techniques involve new algorithms in convex optimization. These

algorithms may be of independent interest. Furthermore, the hardness

proof involves a generalization of Turan's theorem, which may also be

of independent interest.

**Links**

[1] https://www.cs.princeton.edu/research/techreps/author/303

[2] https://www.cs.princeton.edu/research/techreps/author/419

[3] ftp://ftp.cs.princeton.edu/techreports/2004/706.pdf