Programming Assignment Checklist: DNA Sequence Alignment

Frequently Asked Questions

What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.

How do I tell Java to use more of my computer's memory? By default, Java will only use 64MB of memory (not very much for this application). You must explicitly ask for more by executing with

java -Xmx300m EditDistance < input.txt
The 300m meand 300MB, and you should adjust this number depending on the amount of memory your computer has.

How do I compute the length of a string? Use x.length().

How do I access the ith character of a Java string x? Use x.charAt(i). As with arrays, indices start at 0.

What's a StringIndexOutOfBoundsException? It's just like an ArrayOutOfBoundsException. You might be invoking charAt(i) with an illegal value of i.

How could I get a NullPointerException? Did you forget to intialize the data member x?

It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[M]. Is this a typo? Yes, it's a little strange, but no, it's not a typo. This corresponds exactly with x.substring(i, M) in Java. However, you probably won't need to use the substring method on this assignment.

How do I declare an initialze a two dimensional array in Java? You can declare an (M+1)-by-(N+1) integer array as a data member with int[][] opt; and you can initialize it in the constructor with opt = new int[M+1][N+1]; Warning: do not make the mistake of re-declaring it in the constructor.

What does the keyword final mean? When you declare a variable to be final, you are agreeing to never change its value once it's been initialized. It is often used with fixed constants.

Input, Output, and Testing

Input.   There are many sample data files (with extension .txt) available in the sequence directory. To help you check your work, the edit distances of gene57.txt, stx1230.txt, and ftsa1272.txt are 8, 521, and 758, respectively.

Output.   When you invoke showAlignment, it should print out the alignment either horizontally or vertically as below.

% java EditDistance < example10.txt     % java EditDistance < example10.txt
Edit distance = 7                       Edit distance = 7
A T  1                                  A A C A G T T A C C
A A  0                                  T A   A G G T   C A
C    2                                  1 0 2 0 0 1 0 2 0 1 
A A  0
G G  0
T G  1
T T  0
A    2
C C  0
C A  1

Execution.   We will use the following command to execute your program:

java EditDistance < input.txt
so be sure the main is in a program named EditDistance.java.

Submission

Submission. Submit readme.txt and EditDistance.java. There is no need to submit StdIn.java.

readme.txt. Use the following readme file template. You will lose points if you don't address these questions.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the directory sequence to your system. It contains sample data files.

  2. Write the main method in EditDistance.java to read in the two strings from standard input, using the method StdIn.readString. For debugging, print them to standard output.

  3. Create a data type EditDistance that has the following data members:
    public class EditDistance {
        final int COST_PER_MISMATCH =  1;
        final int COST_PER_GAP      =  2;
    
        final char ALIGNED   = '\\';
        final char GAP_IN_X  =  '-';
        final char GAP_IN_Y  =  '|';
    
        private String x;          // the first string
        private String y;          // the second string
        private int M;             // length of first string
        private int N;             // length of second string
        private int[][] opt;       // edit distances
        private char[][] sol;      // optimal solution choices
    }
    
    Write a constructor that takes two strings as arguments, and initializes the corresponding data members. Make the two dimensional matrices of size (M+1)-by-(N+1) so that opt[M][N] is not out-of-bounds. Modify main to call the constructor. Do not think of continuing until you are convinced that everything works.

  4. For debugging, you will need a method to print out the contents of the optimal solution matrix and the optimal solution matrix. Write helper methods printOpt and printSol to print out the contents of opt[][] and sol[][] to standard output. Modify main to call the helper function.

  5. Now, it's time to write distance and implement the crucial dynamic programming part. First read Sedgewick 5.3 and think careful about what is going on. Think carefully about the order in which you will perform the computation since this is critical. Hint: first fill in the base case of opt[i][j], e.g., when i = M or j = N. Now, fill in the remaining values using a nested loop. Feel free to use the method Math.min to take the minimum of two integers (or use it twice to take the minimum of three). The distance method should return opt[0][0] since this is the edit distance of x and y. Test that it works by using your printOpt method.

  6. Modify the method distance to fill in the solution matrix sol[i][j]. Test that it works by using your printSol method.

  7. Now, write the method showAlignment to print the optimal alignment. We recommend a single while loop.

  8. To analyze the running time of your program, use the supplied client program EditDistanceTimer.java. It uses the method System.currentTimeMillis to calculate the elapsed time between the beginning and ending of the computation. If your computer is running other processor intensive jobs, the results will be meaningless. Note also that it does not print out the alignment itself to the screen, since this might skew the results as well.

Enrichment