Quick links

An Euclidean Metric for Genetic Sequence Comparison (Thesis)

Report ID:
September 1991
Download Formats:


This thesis introduces a new representation for genetic sequences in
the form of geometric points or vectors. It is based on the relative
frequencies of the various (small) fixed length substrings of
$k-tuples$ of the DNA or protein sequences. This effectively
transforms the sequence from a variable sized string to fixed size
vector. We show that this transformation preserves, under certain
circumstances, the $edit^distince$ between sequences, a widely used
measure for comparing genetic sequences. This fact is used to develop
a linear time heuristic for sequence comparison, an improvement over
the quadratic time dynamic programming based algorithms currently in
widespread use. The transformation from a variable sized sequence to a
fixed size vector representation allows computational geometric
techniques to be applied to the study of genetic sequences. In
particular, we develop a method of comparing several sequences
simulatneously without having to compare each pair of sequences
separately. This results in a substantial reduction in the complexity
of the problem of multiple sequence comparison and clustering. This
can be applied as a filter to extract interesting sets of sequences
from a large database for more thorough study as well as an indexing
method, to locate the database sequences most likely to be related to a
query sequence.

Follow us: Facebook Twitter Linkedin