COS 126 Programming Assignment: Sequence Alignment

COS 126

Global Sequence Alignment

Programming Assignment

Due: Wednesday, 11:59pm

Write a program to compute the optimal sequence alignment of two DNA strings. This program will introduce you to the emerging field of computational biology in which computers are used to do research on biological systems. Further, you will be introduced to a powerful algorithmic design paradigm known as dynamic programming.

Biology review. A genetic sequence is a string formed from a four-letter alphabet {Adenine (A), Thymine (T), Guanine (G), Cytosine (C)} of biological molecules referred to together as the DNA bases. A gene is a genetic sequence that contains the information needed to construct a protein. All of your genes taken together are referred to as the human genome, a blueprint for the parts needed to construct the proteins that form your cells and, by extension, your body. Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes. Among the most common changes are the substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations. As a result of these point mutations, the same gene sequenced from closely related organisms will have slight differences.

The problem. Through your research you have found the following sequence of a gene in a previously unstudied organism.

A A C A G T T A C C

What is the function of the protein that this gene encodes? You could immediately begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism. Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage. We'll compare the above genetic sequence with one which has already been sequenced and whose function is well understood.

T A A G G T C A

If the two genetic sequences are similar enough, we might expect them to have similar functions. We would like a way to quantify "similar enough".

Edit-distance. We measure the similarity of two genetic sequences by using a very popular method known as the edit distance, a concept which is also widely used in spell checking, speech recognition, plagiarism detection, file revisioning, and computational linguistics. We align the two sequences, but we are permitted to insert gaps in either sequence (e.g., to make them have the same length). We pay a penalty for each gap that we insert and also for each pair of characters that mismatch in the final alignment. Intuitively, these penalties model the relative likeliness of point mutations arising from deletion/insertion and substitution. We produce a numerical score according to the following simple rule, which is widely used in biological applications:

Penalty per gap       2
Penalty per mismatch  1
Penalty per match     0

As an example, two possible alignments of aacagttacc and taaggtca are:

Sequence 1  A A C A G T T A C C
Sequence 2  T A A G G T C A - -
Penalty     1 0 1 1 0 0 1 0 2 2

Sequence 1  A A C A G T T A C C
Sequence 2  T A - A G G T - C A
Penalty     1 0 2 0 0 1 0 2 0 1

The first alignment has a score of 8, while the second one has a score of 7. The edit-distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit-distance between the two strings is 7. Computing the edit-distance is a nontrivial computational problem because we must find the best alignment among exponentially many possibilities. For example, if both strings are 100 characters long, then there are more than 10^75 possible alignments.
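The scoring rule can be checked mechanically. The following is a minimal sketch, assuming the two aligned sequences are given as equal-length strings with '-' marking a gap; the class name AlignmentScore and its method are hypothetical helpers, not part of the required interface.

```java
// Hypothetical helper for checking the score of a given alignment.
final class AlignmentScore {
    static final int GAP_PENALTY = 2;
    static final int MISMATCH_PENALTY = 1;

    // Scores two already-aligned strings of equal length; '-' marks a gap.
    static int score(String a, String b) {
        if (a.length() != b.length())
            throw new IllegalArgumentException("aligned strings must have equal length");
        int total = 0;
        for (int k = 0; k < a.length(); k++) {
            char p = a.charAt(k);
            char q = b.charAt(k);
            if (p == '-' || q == '-') total += GAP_PENALTY;       // gap in either sequence
            else if (p != q)          total += MISMATCH_PENALTY;  // substitution
            // a match costs 0
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("AACAGTTACC", "TAAGGTCA--")); // first alignment: 8
        System.out.println(score("AACAGTTACC", "TA-AGGT-CA")); // second alignment: 7
    }
}
```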

A solution. Your job is to write a program to compute the edit-distance and the optimal alignment of two genetic sequences. We will calculate the edit-distance by breaking up the sequence alignment problem on the two original strings x and y into many alignment problems on the suffixes of the two strings. We use the notation x[i] to refer to character i of the string. We also use the notation x[i..M] to refer to the suffix of x consisting of the characters x[i], x[i+1], ..., x[M-1]. Finally, we use the notation opt[i][j] to denote the edit distance of x[i..M] and y[j..N]. For example, consider the two strings x = "AACAGTTACC" and y = "TAAGGTCA" of length M = 10 and N = 8, respectively. Then, x[2] is 'C', x[2..M] is "CAGTTACC", and y[8..N] is the empty string. The edit distance of x and y is opt[0][0].
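In Java, the suffix x[i..M] corresponds to x.substring(i). A small sketch of the notation on the example strings (the class name SuffixNotation is a hypothetical name used only for illustration):

```java
// Illustrates the x[i] and x[i..M] notation with standard String methods.
final class SuffixNotation {
    // Returns x[i..M], the suffix of x starting at index i.
    // The result is the empty string when i equals x.length().
    static String suffix(String x, int i) {
        return x.substring(i);
    }

    public static void main(String[] args) {
        String x = "AACAGTTACC";   // M = 10
        String y = "TAAGGTCA";     // N = 8
        System.out.println(x.charAt(2));             // x[2]    is C
        System.out.println(suffix(x, 2));            // x[2..M] is CAGTTACC
        System.out.println(suffix(y, 8).isEmpty());  // y[8..N] is empty: true
    }
}
```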

Consider the first pair of characters in an optimal alignment of x[i..M] with y[j..N]. There are three possibilities:

The optimal alignment matches x[i] up with y[j]. In this case, we pay a penalty of either 0 or 1, depending on whether x[i] equals y[j], plus we still need to align x[i+1..M] with y[j+1..N]. What is the best way to do this? This subproblem is exactly the same as the original sequence alignment problem, except that the two inputs are each suffixes of the original inputs. Using our notation, this quantity is opt[i+1][j+1].
The optimal alignment matches x[i] up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align x[i+1..M] with y[j..N]. This subproblem is identical to the original sequence alignment problem, except that the first input is a proper suffix of the original input. Using our notation, this quantity is opt[i+1][j].
The optimal alignment matches y[j] up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align x[i..M] with y[j+1..N]. This subproblem is identical to the original sequence alignment problem, except that the second input is a proper suffix of the original input. Using our notation, this quantity is opt[i][j+1].

The key observation is that all of the resulting subproblems are sequence alignment problems on suffixes of the original inputs. To summarize, we can compute opt[i][j] by taking the minimum of three quantities:

opt[i][j] = min { opt[i+1][j+1] + 0/1, opt[i+1][j] + 2, opt[i][j+1] + 2 }

This equation works assuming i < M and j < N. Aligning an empty string with another string of length k requires inserting k gaps, for a total cost of 2k. Thus, in general we should set opt[M][j] = 2(N-j) and opt[i][N] = 2(M-i). For our example, the final matrix is:

       |  0  1  2  3  4  5  6  7  8
   x\y |  T  A  A  G  G  T  C  A  -
-----------------------------------
 0  A  |  7  8 10 12 13 15 16 18 20
 1  A  |  6  6  8 10 11 13 14 16 18
 2  C  |  6  5  6  8  9 11 12 14 16
 3  A  |  7  5  4  6  7  9 11 12 14
 4  G  |  9  7  5  4  5  7  9 10 12
 5  T  |  8  8  6  4  4  5  7  8 10
 6  T  |  9  8  7  5  3  3  5  6  8
 7  A  | 11  9  7  6  4  2  3  4  6
 8  C  | 13 11  9  7  5  3  1  3  4
 9  C  | 14 12 10  8  6  4  2  1  2
10  -  | 16 14 12 10  8  6  4  2  0

By examining opt[0][0], we conclude that the edit distance of x and y is 7.
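The recurrence and its base cases translate almost line for line into a recursive method. The sketch below is one possible transcription, not the required solution; the class and method names are hypothetical. Because it recomputes the same subproblems many times, it is only practical for short strings such as the example.

```java
// Direct (exponential-time) transcription of the recurrence for opt[i][j].
final class RecursiveEditDistance {
    // Returns the edit distance of x[i..M] and y[j..N].
    static int opt(String x, String y, int i, int j) {
        int M = x.length();
        int N = y.length();
        if (i == M) return 2 * (N - j);  // x exhausted: remaining y aligned with gaps
        if (j == N) return 2 * (M - i);  // y exhausted: remaining x aligned with gaps

        int mismatch = (x.charAt(i) == y.charAt(j)) ? 0 : 1;
        int matchBoth = opt(x, y, i + 1, j + 1) + mismatch; // x[i] aligned with y[j]
        int gapInY    = opt(x, y, i + 1, j) + 2;            // x[i] aligned with a gap
        int gapInX    = opt(x, y, i, j + 1) + 2;            // y[j] aligned with a gap
        return Math.min(matchBoth, Math.min(gapInY, gapInX));
    }

    public static void main(String[] args) {
        System.out.println(opt("AACAGTTACC", "TAAGGTCA", 0, 0)); // 7
    }
}
```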

A dynamic programming approach. A direct implementation of the above recursive scheme will work, but it is spectacularly inefficient. If both input strings have N characters, then the number of recursive calls will exceed 2^N. To overcome this performance bug, we use dynamic programming. (Read Sedgewick 5.3 for an introduction to this technique.) Dynamic programming is a powerful algorithmic paradigm that forms the core computational engine of many programs, including BLAST (the sequence alignment program almost universally used by molecular biologists in their experimental work). The key idea of dynamic programming is to break a large computational problem up into smaller subproblems, store the answers to those smaller subproblems, and, eventually, use the stored answers to solve the original problem. This avoids recomputing the same quantity over and over again. Instead of using recursion, use a nested loop that calculates opt[i][j] in the right order so that opt[i+1][j+1], opt[i+1][j], and opt[i][j+1] are all computed before we try to compute opt[i][j].
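One loop order that satisfies this requirement is to fill the table from the bottom row up and, within each row, from right to left. The following is a minimal sketch under that assumption; the class name EditDistanceTable and the method fill are hypothetical and are not the interface the assignment asks for.

```java
// Bottom-up table fill: every entry is computed once, after its three neighbors.
final class EditDistanceTable {
    static int[][] fill(String x, String y) {
        int M = x.length();
        int N = y.length();
        int[][] opt = new int[M + 1][N + 1];

        // Base cases: one string is empty, so the rest is aligned with gaps.
        for (int j = 0; j <= N; j++) opt[M][j] = 2 * (N - j);
        for (int i = 0; i <= M; i++) opt[i][N] = 2 * (M - i);

        // Rows from bottom to top, columns from right to left.
        for (int i = M - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int mismatch = (x.charAt(i) == y.charAt(j)) ? 0 : 1;
                int matchBoth = opt[i + 1][j + 1] + mismatch;
                int gapInY    = opt[i + 1][j] + 2;
                int gapInX    = opt[i][j + 1] + 2;
                opt[i][j] = Math.min(matchBoth, Math.min(gapInY, gapInX));
            }
        }
        return opt;
    }

    public static void main(String[] args) {
        int[][] opt = fill("AACAGTTACC", "TAAGGTCA");
        System.out.println(opt[0][0]); // 7
    }
}
```

The table has (M+1)(N+1) entries and each one takes constant time to fill.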

Finding the alignment itself. The procedure above indicates how to compute the value of the optimal alignment. We now describe how to find the optimal alignment itself. In order to reconstruct the optimal alignment, maintain a character matrix, say sol[i][j], to keep track of where the minimum value for aligning x[i..M] with y[j..N] came from. For example, if the minimum came from aligning x[i] with y[j], then we can record this fact by drawing an arrow from (i, j) to (i+1, j+1). We can obtain a crude ASCII picture of such an arrow by storing one of the three characters '\', '-', or '|' into sol[i][j]. We interpret the three symbols as arrows emanating from (i, j) and terminating at (i+1, j+1), (i, j+1), and (i+1, j), respectively. For the example above, we get the following solution matrix:

       |  0  1  2  3  4  5  6  7  8
   x\y |  T  A  A  G  G  T  C  A  -
-----------------------------------
  0  A |  \  \  \  \  |  \  |  \  | 
  1  A |  \  \  \  \  |  \  |  \  | 
  2  C |  \  \  |  \  |  |  \  |  | 
  3  A |  -  \  \  \  |  |  \  \  | 
  4  G |  \  \  \  \  \  |  \  |  | 
  5  T |  \  \  \  \  \  \  \  |  | 
  6  T |  \  \  \  \  \  \  \  |  | 
  7  A |  -  \  \  \  \  \  |  \  | 
  8  C |  \  \  \  \  \  \  \  \  | 
  9  C |  -  -  -  -  -  -  \  \  | 
 10  - |  -  -  -  -  -  -  -  -  .

Note that all arrows point from top to bottom and left to right. In order to reconstruct the alignment, we follow the arrows from the upper left corner opt[0][0] until we arrive at the lower right corner opt[M][N]. When we encounter '-', we insert a gap in x; when we encounter '|', we insert a gap in y; and, when we encounter '\', we align the two characters. The optimal alignment is the second candidate alignment in the edit-distance section.
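The arrow-following procedure can be sketched as follows. This is an illustrative sketch with hypothetical names, not the required showAlignment() method. It assumes '\' moves to (i+1, j+1), '|' moves to (i+1, j) and puts a gap in y, and '-' moves to (i, j+1) and puts a gap in x, which matches the picture. The tie-breaking order (diagonal first, then '|', then '-') is also an assumption; other orders may record different arrows in cells off the optimal path.

```java
// Fills opt and sol together, then follows the arrows from (0, 0) to (M, N).
final class AlignmentTraceback {
    // Returns { aligned x, aligned y }, with '-' marking gaps.
    static String[] align(String x, String y) {
        int M = x.length();
        int N = y.length();
        int[][] opt = new int[M + 1][N + 1];
        char[][] sol = new char[M + 1][N + 1];

        sol[M][N] = '.';  // end of the path
        for (int j = N - 1; j >= 0; j--) { opt[M][j] = 2 * (N - j); sol[M][j] = '-'; }
        for (int i = M - 1; i >= 0; i--) { opt[i][N] = 2 * (M - i); sol[i][N] = '|'; }

        for (int i = M - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                int diag  = opt[i + 1][j + 1] + (x.charAt(i) == y.charAt(j) ? 0 : 1);
                int down  = opt[i + 1][j] + 2;   // x[i] aligned with a gap
                int right = opt[i][j + 1] + 2;   // y[j] aligned with a gap
                if (diag <= down && diag <= right) { opt[i][j] = diag;  sol[i][j] = '\\'; }
                else if (down <= right)            { opt[i][j] = down;  sol[i][j] = '|'; }
                else                               { opt[i][j] = right; sol[i][j] = '-'; }
            }
        }

        StringBuilder ax = new StringBuilder();
        StringBuilder ay = new StringBuilder();
        int i = 0, j = 0;
        while (i < M || j < N) {
            char arrow = sol[i][j];
            if (arrow == '\\')     { ax.append(x.charAt(i++)); ay.append(y.charAt(j++)); }
            else if (arrow == '|') { ax.append(x.charAt(i++)); ay.append('-'); }  // gap in y
            else                   { ax.append('-'); ay.append(y.charAt(j++)); }  // gap in x
        }
        return new String[] { ax.toString(), ay.toString() };
    }

    public static void main(String[] args) {
        String[] a = align("AACAGTTACC", "TAAGGTCA");
        System.out.println(a[0]); // AACAGTTACC
        System.out.println(a[1]); // TA-AGGT-CA
    }
}
```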

Program organization. Create an EditDistance ADT with the following interface:

public EditDistance(String x, String y)  // constructor takes the two strings as input
public int distance()                    // return the edit distance of x and y
public void showAlignment()              // print the optimal alignment of x and y

The following client program should compute the edit distance of x and y, where the two strings come from standard input.

public static void main(String[] args) {
    String x = StdIn.readString();
    String y = StdIn.readString();
    EditDistance ed = new EditDistance(x, y);
    System.out.println("Edit distance = " + ed.distance());
    ed.showAlignment();
}

Analysis. Test your program using the example provided above, as well as the genomic data sets referred to in the checklist. Estimate the running time and memory usage of your program as a function of the lengths of the two input strings M and N. For simplicity, assume M = N.
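One way to gather data for the running-time estimate is a doubling experiment: time the computation on random inputs of length N, 2N, 4N, and so on, and look at the ratio of successive times. The sketch below is one possible harness, not a prescribed method; the class name, the random-input generator, the fixed seed, and the compact distance method are all hypothetical stand-ins for your own EditDistance class.

```java
import java.util.Random;

// Hypothetical timing harness for a doubling experiment with M = N.
final class DoublingExperiment {
    // Stand-in for your own implementation: fills the full (n+1)-by-(n+1) table.
    static int distance(String x, String y) {
        int M = x.length(), N = y.length();
        int[][] opt = new int[M + 1][N + 1];
        for (int j = 0; j <= N; j++) opt[M][j] = 2 * (N - j);
        for (int i = 0; i <= M; i++) opt[i][N] = 2 * (M - i);
        for (int i = M - 1; i >= 0; i--)
            for (int j = N - 1; j >= 0; j--) {
                int diag = opt[i + 1][j + 1] + (x.charAt(i) == y.charAt(j) ? 0 : 1);
                opt[i][j] = Math.min(diag, Math.min(opt[i + 1][j], opt[i][j + 1]) + 2);
            }
        return opt[0][0];
    }

    // Random string of length n over the alphabet {A, C, G, T}.
    static String randomDna(int n, Random rng) {
        StringBuilder sb = new StringBuilder(n);
        for (int k = 0; k < n; k++) sb.append("ACGT".charAt(rng.nextInt(4)));
        return sb.toString();
    }

    // Elapsed wall-clock seconds for one distance computation on inputs of length n.
    static double timeSeconds(int n, long seed) {
        Random rng = new Random(seed);
        String x = randomDna(n, rng);
        String y = randomDna(n, rng);
        long start = System.nanoTime();
        distance(x, y);
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        double previous = timeSeconds(250, 126L);
        for (int n = 500; n <= 2000; n *= 2) {
            double current = timeSeconds(n, 126L);
            System.out.printf("N = %5d  time = %.4f s  ratio = %.2f%n", n, current, current / previous);
            previous = current;
        }
    }
}
```

Timings for small N are noisy because of JIT warm-up, so the ratios for the larger sizes are the more informative ones.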

This assignment was created by Thomas Clarke, Robert Sedgewick, Scott Vafai and Kevin Wayne.
