Published on *Computer Science Department at Princeton University* (http://www.cs.princeton.edu)

This thesis introduces a new representation for genetic sequences in

the form of geometric points or vectors. It is based on the relative

frequencies of the various (small) fixed length substrings of

$k-tuples$ of the DNA or protein sequences. This effectively

transforms the sequence from a variable sized string to fixed size

vector. We show that this transformation preserves, under certain

circumstances, the $edit^distince$ between sequences, a widely used

measure for comparing genetic sequences. This fact is used to develop

a linear time heuristic for sequence comparison, an improvement over

the quadratic time dynamic programming based algorithms currently in

widespread use. The transformation from a variable sized sequence to a

fixed size vector representation allows computational geometric

techniques to be applied to the study of genetic sequences. In

particular, we develop a method of comparing several sequences

simulatneously without having to compare each pair of sequences

separately. This results in a substantial reduction in the complexity

of the problem of multiple sequence comparison and clustering. This

can be applied as a filter to extract interesting sets of sequences

from a large database for more thorough study as well as an indexing

method, to locate the database sequences most likely to be related to a

query sequence.

**Links**

[1] http://www.cs.princeton.edu/research/techreps/author/393

[2] ftp://ftp.cs.princeton.edu/techreports/1991/332.pdf