DNA Hash Pooling and its Applications

Dennis Shasha

Computer Science, New York University

In this paper we describe a new technique for characterising populations of DNA strands. Such tools are vital to the study of ecological systems at both the micro scale (e.g., individual humans) and the macro scale (e.g., lakes). Existing methods make extensive use of DNA sequencing and cloning, which can prove costly and time-consuming. Our overall objective is to address questions such as: (i) (Genome detection) Is a known genome sequence present, at least in part, in an environmental sample? (ii) (Sequence query) Is a specific fragment sequence present in a sample? (iii) (Similarity discovery) How similar, in terms of sequence content, are two unsequenced samples?

We propose a method in which multiple filtering criteria yield "pools" of DNA of high or very high purity. Because the method is similar in spirit to hashing in computer science, we call it "DNA hash pooling". To illustrate it, we describe examples using pairs of restriction enzymes. The in silico empirical results we present indicate that the method is robust to experimental error. The method requires minimal DNA sequencing and, when sequencing is required, little or no cloning.
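To make the hashing analogy concrete, the following is a minimal in silico sketch and not the protocol described in this paper. It assumes that a pool is identified by an ordered pair of restriction enzymes together with a fragment-length bin, and that two samples can be compared by the overlap of their occupied pools. The function names (`cut_positions`, `hash_pools`, `pool_similarity`), the `bin_size` parameter, and the choice of EcoRI and BamHI are illustrative assumptions; only the recognition sites (GAATTC and GGATCC, each cut after the first base) are standard.

```python
import re

# Recognition site and cut offset (cut falls after `offset` bases of the site).
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def cut_positions(seq, enzyme):
    """Return the positions at which `enzyme` cuts the top strand of `seq`."""
    site, offset = ENZYMES[enzyme]
    # A lookahead is used so that overlapping sites are all reported.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def hash_pools(seq, enz_a, enz_b, bin_size=50):
    """Group double-digest fragments into pools (hypothetical pooling rule).

    Assumption: a fragment is kept only if its two ends were produced by
    different enzymes, and its pool key is (left enzyme, right enzyme,
    length bin).
    """
    cuts = sorted(
        [(p, enz_a) for p in cut_positions(seq, enz_a)]
        + [(p, enz_b) for p in cut_positions(seq, enz_b)]
    )
    pools = {}
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left == right:
            continue  # both ends cut by the same enzyme: filtered out
        key = (left, right, (end - start) // bin_size)
        pools.setdefault(key, []).append(seq[start:end])
    return pools


def pool_similarity(pools_x, pools_y):
    """Jaccard overlap of occupied pool keys (an assumed similarity measure)."""
    keys_x, keys_y = set(pools_x), set(pools_y)
    union = keys_x | keys_y
    return len(keys_x & keys_y) / len(union) if union else 1.0


if __name__ == "__main__":
    sample = "AAAA" + "GAATTC" + "CCCCCCCC" + "GGATCC" + "TTTT"
    print(hash_pools(sample, "EcoRI", "BamHI"))
```

In this sketch, a sequence query reduces to checking whether the pool that a fragment would hash to is occupied, and a similarity estimate between two unsequenced samples reduces to comparing which pools are occupied, without reading the fragments themselves.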