Programming Assignment Checklist: DNA Sequence Alignment

Frequently Asked Questions

What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.

How do I tell Java to use more of my computer's memory? By default, Java will only use 64MB of memory (not very much for this application). You must explicitly ask for more by executing with

java -Xmx300m EditDistance < input.txt
The 300m means 300MB, and you should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you running. (300m should get you through ecoli7000.txt)

How do I access the length of a string? Use x.length().

How do I access the ith character of a Java string x? Use x.charAt(i). As with arrays, indices start at 0.

What's a StringIndexOutOfBoundsException? It's just like an ArrayOutOfBoundsException. You might be invoking charAt(i) with an illegal value of i.

How could I get a NullPointerException? Did you forget to allocate memory for opt[][]?

How do I declare and initialze a two dimensional array in Java? Use int[][] opt = new int[M+1][N+1]; for an (M+1)-by-(N+1) array. All entries are initialized to 0.

It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[i] through x[M]. Is this a typo? It's a little strange, but no, it's not a typo. This corresponds exactly with x.substring(i, M) in Java. However, you shouldn't need to use the substring method on this assignment.

Which alignment should I print out if there are two or more optimal ones? Output any one you like.

How do I determine how much memory I have? On Mac, select About this Mac from the Apple menubar. On Windows, type Ctrl-Alt-Delete, select the Taskbar, and look in the physical memory entry. It's also instructive to use the Activity Monitor (Mac) or taskbar (Windows) to observe the CPU and memory usage of your computer, as your program is running.

The asymptotic running time of my program is much better than my mathematical analysis predicts. What could I be doing wrong? If you are running your program and accessing the data files from the H: drive, especially if via a wireless network, the bottleneck for medium N might be the network latency instead of the dynamic programming algorithm! Copy all the files to your local hard drive and run from there.

The asymptotic running time of my program is a bit worse than my mathematical analysis predicts. What could I be doing wrong? When you run out of physical memory, your operating system may start using your hard drive as another form of storage. Accessing information from the hard drive is substantially slower than main memory, and you may be observing this effect.

Who discovered the dynamic programming algorithm we are using? The idea of dynamic programming was first advanced by Bellman (1957). Levenshtein (1966) formalized the notion of edit distance. Needleman-Wunsch (1970) were the first to apply edit distance and dynamic programming for aligning biological sequences, and our algorithm is essentially the one proposed in their seminal paper. The widely-used Smith-Waterman (1981) algorithm is quite similar, but solves a slightly different problem (local sequence alignment instead of global sequence alignment).

Input, Output, and Testing

Input.   There are many sample data files (with extension .txt) available in the sequence directory. To help you check your work, the edit distances of gene57.txt, stx1230.txt, and ftsa1272.txt are 8, 521, and 758, respectively.

Output.   When you execute your program, it should print out the alignment either horizontally or vertically as below, along with the individual penalties.

% java EditDistance < example10.txt     % java EditDistance < example10.txt
Edit distance = 7                       Edit distance = 7
A T  1                                  A A C A G T T A C C
A A  0                                  T A   A G G T   C A
C    2                                  1 0 2 0 0 1 0 2 0 1 
A A  0
G G  0
T G  1
T T  0
A    2
C C  0
C A  1

Execution.   We will use the following command to execute your program:

java EditDistance < input.txt
so be sure the main is in a program named EditDistance.java.

Submission

Submission. Submit readme.txt and EditDistance.java.

readme.txt. Use the following readme file template. You will lose points if you don't address these questions.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the directory sequence to your system. It contains sample data files.

  2. Write the main method in EditDistance.java to read in the two strings from standard input, using the method StdIn.readString. For debugging, print them to standard output.

  3. Declare and initialize the (M+1)-by-(N+1) array opt[][]. Include the base cases. Print out the 2D array to check your work.

  4. Now, it's time to implement the crucial dynamic programming part. First read the dynamic programming portion of Section 9.6 and make sure you understand how dynamic programming works. Think carefully about the order in which you will perform the computation since this is critical. Hint: first fill in the base case of opt[i][j], e.g., when i = M or j = N. Now, fill in the remaining values using a nested loop. Feel free to use the method Math.min to take the minimum of two integers (or use it twice to take the minimum of three). Test that it works by printing out the contents of opt.

  5. Now, figure out how to recover the optimal alignment by backtracing.

  6. To analyze the running time of your program, execute with the following command
    java -Xmx300m EditDistance < input.txt > output.txt
    and use a stopwatch. We redirect the output to a file to prevent printing text from becoming a computational bottleneck. Alternatively, use the command
    java -Xprof -Xmx300m EditDistance < input.txt > output.txt
    The -Xprof flag tells the compiler to report timing information. Note that the timing information will appear at the end of the file output.txt.

Enrichment