next up previous
Next: About this document

 COS 341, November 10, 1997

Handout Number 14

DNA Sequencing

The material is from a paper by P. Pevzner, 1-tuple DNA sequencing: computer analysis,'' Journal of Biomoleculare Structure 7 (1989), pp. 63-73.

1 A Sufficient Condition for Unique Eulerian Path

Let G be a general digraph such that there exist vertices x,y satisfying tex2html_wrap_inline103 , tex2html_wrap_inline105 , and tex2html_wrap_inline107 for all other vertices v.

Construct a graph G' as follows. Add an edge (y,x) into the edge set of G. (Now the indegree of every vertex is the same as its outdegree.) Partition the edges of G into simple cycles tex2html_wrap_inline119 (such that all cycles are disjoint in their edges). Define G'=(V',E'), where tex2html_wrap_inline123 and between two vertices tex2html_wrap_inline125 , there are k edges in E' if the two cycles have exactly k points in common in G. Note that depending on the choice of simple cycles, there may be several different G'.

For example, the following general digraph G has G' as shown.

The following fact was stated in Pevzner's paper.

Lemma 1 Let G be connected, and there exist vertices x,y such that tex2html_wrap_inline103 , tex2html_wrap_inline105 , and tex2html_wrap_inline107 for all other vertices v. If tex2html_wrap_inline153 for all v, and if G' is a tree, then G has a unique Eulerian path.

2 Hybridization Method for DNA Sequencing

Let tex2html_wrap_inline161 . Define tex2html_wrap_inline163 , tex2html_wrap_inline165 , tex2html_wrap_inline167 , and tex2html_wrap_inline169 . A (single-stranded) DNA fragment is a string tex2html_wrap_inline171 ; n is the length of the fragment. Define tex2html_wrap_inline175 if tex2html_wrap_inline177 .

Given a DNA fragment tex2html_wrap_inline179 of length n, the hybridization method to determine tex2html_wrap_inline179 works as follows. Let tex2html_wrap_inline185 be a parameter. Construct a chip with tex2html_wrap_inline187 cells, each containing copies of one distinct string tex2html_wrap_inline189 . If we wash a bottle of solution containing many copies of tex2html_wrap_inline179 over the chip, then those cells containing tex2html_wrap_inline193 as substrings of tex2html_wrap_inline195 get some copies of tex2html_wrap_inline179 attached to the cells. The spectrum of tex2html_wrap_inline179 is the set of those activated tex2html_wrap_inline201 s. In other words, the spectrum is the set of all possible length- tex2html_wrap_inline203 substrings of tex2html_wrap_inline195 .

The algorithmic question is: Given the spectrum, can we reconstruct tex2html_wrap_inline195 and hence tex2html_wrap_inline179 ? By this, we mean firstly, how to find a tex2html_wrap_inline195 that can generate exactly this spectrum, and secondly, is this a unique solution?

We shall be only concerned with the case when all the length- tex2html_wrap_inline203 substrings of tex2html_wrap_inline179 (hence also tex2html_wrap_inline195 ) are distinct. This is a reasonable assumption when tex2html_wrap_inline203 is quite a bit larger than tex2html_wrap_inline221 (such as tex2html_wrap_inline223 ). Under this assumption, tex2html_wrap_inline225 .

3 Hybridization and Eulerian Path

Let S be the spectrum of tex2html_wrap_inline179 . Construct a general digraph tex2html_wrap_inline231 as follows. For any string tex2html_wrap_inline233 , call tex2html_wrap_inline235 and tex2html_wrap_inline237 the prefix and suffix of tex2html_wrap_inline179 . Let V be the set of all length- tex2html_wrap_inline243 strings that are either a prefix or a suffix of some element in the spectrum. For each element tex2html_wrap_inline193 in the spectrum, create an edge from its prefix to its suffix.

For any path in tex2html_wrap_inline247 , there is a natural associated string. We illustrate it with the following example. Let tex2html_wrap_inline249 , and S consists of ATG, TGT, TGC, GTG, GCA, GCC, CGC, CCG. Then tex2html_wrap_inline247 has a Eulerian path AT-TG-GT-TG-GC-CC-CG-GC-CA. The string associated with this Eulerian path is ATGTGCCGCA.

It is clear that two different paths give two different associated strings. Also, the string associated with any Eulerian path is a string tex2html_wrap_inline195 such that S is the spectrum of tex2html_wrap_inline179 . If tex2html_wrap_inline247 has a unique Eulerian path, then we have found a tex2html_wrap_inline179 and at the same time know that this is the unique solution.

Pevzner reported for the case tex2html_wrap_inline275 , a statistical experiment shows that for 94% of the strings tex2html_wrap_inline179 , the general digraph tex2html_wrap_inline247 obtained from the spectrum S satisfies the conditions in Lemma 1, and hence tex2html_wrap_inline179 can be reconstructed from the spectrum by this method.




next up previous
Next: About this document

Andrew Yao
Mon Nov 10 18:22:44 EST 1997