COS 226 Programming Assignment: Redundancy Detector

COS 226 Programming Assignment

Redundancy Detector

Implement a redundancy detector. Design and implement an algorithm for finding the longest repeated sequence in a long string of characters. For example, abra is the longest repeated sequence in the string abracadabra.

Perspective. Although the problem was originally motivated by English text, it now has dramatic applications to molecular biology. DNA has a remarkable amount of repetitive structure, and this repetitive structure plays a crucial role in various biological functions and their evolutionary role. The ongoing discovery of efficient algorithms for finding repetitive structures (e.g., the redundancy detection problem) enables scientists to unravel some of the great mysteries of molecular biology.

Getting started. You will first want to implement a brute-force method to solve the problem, to get some idea of how much computation is required and to have a reliable check for short strings when debugging. The brute force algorithm checks the length of the match at every possible pair of starting positions, so it uses a quadratic number of character comparisons, with the constant of proportionality depending on the match lengths. After getting the brute-force method working, you may wish to proceed by improving that method (remove obvious inefficiencies) or by devising some completely different method. This problem is amenable to attack with a variety of the algorithmic tools that we have covered in class, though it's not at all clear which approach will yield the fastest solution. You may use a reasonable amount of extra space to help speed up your algorithm, but, as usual, don't throw space away.

Input and output. Read the input string from standard input and print the results to standard output. Filter out any non-printable characters and convert any newlines to spaces with the following code snippet:

int ch, N = 0;
while ((ch = getchar()) != EOF) {
    if (isspace(ch))
        a[N++] = ' ';
    else if (isprint(ch))
        a[N++] = ch;
}

Your program should output the length of the repeated substring on the first line and the substring itself on the second line, followed by a newline (and nothing else). Do not base your algorithm on the string being purely random, but, on the other hand, do not worry about pathological cases (for example, strings consisting of all blanks). We provide a variety of real-world inputs for your testing.

Analysis. Run your algorithm on some of the sample inputs we provide, and give the running time in seconds. Also, analyze the asymptotic running time and memory consumption as a function of the string length N and the length of the longest match L.

Advice. We have studied some very complex algorithms that might do well on this problem, but there are also some relatively simple ideas that are powerful.

Challenge for the bored. Write a program that takes as input two strings and finds the longest common substring that appears in both. Your program should still be efficient even if there is a huge common substring, e.g., "a.out mobydick.txt mobydick.txt" should identify a repeated substring of over a million characters.