COS 341, November 10, 1997Handout Number 14
DNA Sequencing
The material is from a paper by P. Pevzner, 1-tuple DNA sequencing: computer analysis,'' Journal of Biomoleculare Structure 7 (1989), pp. 63-73.
1 A Sufficient Condition for Unique Eulerian Path
Let G be a general digraph such that there exist
vertices x,y satisfying
,
, and
for all
other vertices v.
Construct a graph G' as follows.
Add an edge (y,x) into the edge set of G.
(Now the indegree of every
vertex is the same as its outdegree.)
Partition the edges of G into simple cycles
(such that all cycles are disjoint in their edges). Define
G'=(V',E'), where
and
between two vertices
, there are k edges in E' if the
two cycles have exactly k points in common in G. Note that
depending on the choice of simple cycles, there may be
several different G'.
For example, the following general digraph G has G' as shown.
The following fact was stated in Pevzner's paper.
Lemma 1 Let G be connected, and there exist
vertices x,y such that
,
, and
for all
other vertices v. If
for all v, and if G' is a tree, then
G has a unique Eulerian path.
2 Hybridization Method for DNA Sequencing
Let
. Define
,
,
, and
.
A (single-stranded) DNA fragment is a string
; n is the length of
the fragment. Define
if
.
Given a DNA fragment
of length n, the hybridization method to
determine
works as follows. Let
be a
parameter. Construct a chip with
cells, each
containing copies of one distinct string
.
If we wash a bottle of solution containing many copies of
over the chip, then those cells containing
as
substrings of
get some copies of
attached
to the cells. The spectrum of
is the set of
those activated
s. In other words, the spectrum is
the set of all possible length-
substrings of
.
The algorithmic question is: Given the spectrum, can we
reconstruct
and hence
?
By this, we mean firstly, how to find a
that
can generate exactly this spectrum, and secondly, is this
a unique solution?
We shall be only concerned with the case when all the
length-
substrings of
(hence also
) are distinct. This is a reasonable
assumption when
is
quite a bit larger than
(such as
).
Under this assumption,
.
3 Hybridization and Eulerian Path
Let S be the spectrum of
. Construct a general
digraph
as follows. For any string
, call
and
the prefix and suffix
of
. Let V be the set of all
length-
strings that are either a prefix or a suffix
of some element in the spectrum.
For each element
in the spectrum, create an edge
from its prefix to its suffix.
For any path in
, there is a natural associated
string. We illustrate it with the following example.
Let
, and
S consists of ATG, TGT, TGC, GTG, GCA, GCC,
CGC, CCG. Then
has a Eulerian path
AT-TG-GT-TG-GC-CC-CG-GC-CA. The string associated
with this Eulerian path is ATGTGCCGCA.
It is clear that two different paths give two different associated
strings. Also, the string associated with any Eulerian path
is a string
such that S is the spectrum of
. If
has a unique Eulerian path, then
we have found a
and at the same time know that this is the
unique solution.
Pevzner reported for the case
, a statistical
experiment shows that for 94% of the strings
,
the general digraph
obtained from the spectrum S
satisfies the conditions in Lemma 1, and hence
can
be reconstructed from the spectrum by this method.