Quick links

Perfect Phylogeny and Haplotype Assignment

Report ID:
TR-678-03
Date:
September 2003
Pages:
11
Download Formats:

Abstract:

This paper is concerned with the reconstruction of perfect phylogenies
from binary character data with missing values, and related problems
of inferring complete haplotypes from haplotypes or genotypes with
missing data. In cases where the problems considered are NP-hard we
assume a rich data hypothesis under which they become
tractable. Natural probabilistic models are introduced for the
generation of character vectors, haplotypes or genotypes with missing
data, and it is shown that these models support the rich data
hypothesis. The principal results include:

- A near-linear time algorithm for inferring a perfect phylogeny from
binary character data (or haplotype data) with missing values, under
the rich data hypothesis;

- A quadratic-time algorithm for inferring a perfect phylogeny from
genotype data with missing values with high probability, under certain
distributional assumptions;

- Demonstration that the problems of maximum-likelihood inference of
complete haplotypes from partial haplotypes or partial genotypes can
be cast as minimum-entropy disjoint set cover problems;

- In the case where the haplotypes come from a perfect phylogeny, a
representation of the set cover problem as minimum-entropy covering of
subtrees of a tree by nodes;

- An exact algorithm for minimum-entropy subtree covering, and
demonstration that it runs in polynomial time when the subtrees have
small diameter;

- Demonstration that a simple greedy approximation algorithm solves
the minimum-entropy subtree covering problem with relative error
tending to zero when the number of partial haplotypes per complete
haplotype is large;

- An asymptotically consistent method of estimating the frequencies of
the complete haplotypes in a perfect phylogeny, under an iid model for
the distribution of missing data;

- Computational results on real data demonstrating the effectiveness
of a the greedy algorithm for inferring haplotypes from genotypes with
missing data, even in the absence of a perfect phylogeny.

Follow us: Facebook Twitter Linkedin