COS 126 Programming Assignment: Sequence Alignment

COS 126

Global Sequence Alignment

Programming Assignment

Due: Wednesday, 11:59pm

Write a program to compute the optimal sequence alignment of two DNA strings. This program will introduce you to the emerging field of computational biology in which computers are used to do research on biological systems. Further, you will be introduced to a powerful algorithmic design paradigm known as dynamic programming.

Biology review. A genetic sequence is a string formed from a four-letter alphabet {Adenosine (A), Thymidine (T), Guanosine (G), Cytidine (C)} of biological macromolecules referred to together as the DNA bases. A gene is a genetic sequence that contains the information needed to construct a protein. All of your genes taken together are referred to as the human genome, a blueprint for the parts needed to construct the proteins that form your cells and, by extension, your body. Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes. Among the most common changes are the substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations. As a result of these point mutations, the same gene sequenced from closely related organisms will have slight differences.

The problem. Through your research you have found the following sequence of a gene in a previously unstudied organism.

A A C A G T T A C C

What is the function of the protein that this gene encodes? You could immediately begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism. Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage. We'll compare the above genetic sequence with one which has already been sequenced and whose function is well understood.

T A A G G T C A

If the two genetic sequences are similar enough, we might expect them to have similar functions. We would like a way to quantify "similar enough".

Edit-distance. We measure the similarity of two genetic sequences by using a very popular method known as the edit distance, a concept which is also widely used in spell checking, speech recognition, plagiarism detection, file revision, and computational linguistics. We align the two sequences, but we are permitted to insert gaps in either sequence (e.g., to make them have the same length). We pay a penalty for each gap that we insert and also for each pair of characters that mismatch in the final alignment. Intuitively, these penalties model the relative likeliness of point mutations arising from deletion/insertion and substitution. We produce a numerical score according to the following simple rule, which is widely used in biological applications:

Penalty per gap 2
Penalty per mismatch 1
Penalty per match 0

As an example, two possible alignments for the genetic sequences discussed above are:

Sequence 1 A A C A G T T A C C
Sequence 2 T A A G G T C A - -
Penalty 1 0 1 1 0 0 1 0 2 2

Sequence 1 A A C A G T T A C C
Sequence 2 T A - A G G T - C A
Penalty 1 0 2 0 0 1 0 2 0 1

The first alignment has a score of 8, while the second one has a score of 7. The edit-distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit-distance between the two strings is 7. Computing the edit-distance between two strings is a nontrivial computational problem because we must find the best alignment among exponentially many possibilities. For example, if both strings are 100 characters long, then there are more than 10^75 possible alignments.

A recursive solution. Your job is to write a program to compute the edit-distance and the optimal alignment of two genetic sequences. We can compute the edit-distance recursively by breaking up the sequence alignment problem on the two original strings x and y into many alignment problems on the suffixes of the two strings. We use the notation x[i..M] to refer to the suffix of x consisting of the characters x[i], x[i+1], ..., x[M]. Note that x[M] is the string termination character '\0'. For example, consider the two strings x = "AACAGTTACC" and y = "TAAGGTCA" of length M = 10 and N = 8, respectively. Then, x[2..M] is "CAGTTACC" and y[8..N] is the empty string.

Now, we return to our recursive solution technique. Consider the zeroth column in an optimal alignment of x[0..M] with y[0..N]. There are three possibilities:

The optimal alignment matches x[0] = 'A' up with y[0] = 'T'. In this case, we pay a penalty of 1 for a mismatch and still need to align the string x[1..M] with y[1..N]. What is the best way to do this? This subproblem is exactly the same as the original sequence alignment problem, except that the two inputs are each suffixes of the original inputs. We can solve this subproblem recursively.
The optimal alignment matches the x[0] = 'A' up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align the string x[1..M] with y[0..N]. This subproblem is identical to the original sequence alignment problem, except that the first input is a suffix of the original input.
The optimal alignment matches the y[0] = 'T' up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align the string x[0..M] with y[1..N]. This subproblem is identical to the original sequence alignment problem, except that the second input is a suffix of the original input.

All of the resulting subproblems are sequence alignment problem on suffixes of the original inputs. Thus, we need a function, say opt(i, j) that returns the optimum value of aligning the input string x[i..M] with y[j..N]. Using this notation, the optimal value of our original sequence alignment problem is opt(0, 0). In general, to compute opt(i, j), we need to compute the minimum of the following three quantities:

opt(i+1, j+1) + 0/1 by aligning x[i] with y[j], add 0 if a match, 1 if a mismatch

opt(i+1, j) + 2 by aligning x[i] with a gap

opt(i, j+1) + 2 by aligning y[j] with a gap

For our recursive scheme to bottom-out, we need a base case. Aligning an empty string with another string of length j requires inserting j gaps, for a total cost of 2j. Thus, in general we should set opt(M, j) = 2(N-j) and opt(i, N) = 2(M-i). For our example, the final matrix is:

       |  0  1  2  3  4  5  6  7  8
   x\y |  T  A  A  G  G  T  C  A  -
-----------------------------------
 0  A  |  7  8 10 12 13 15 16 18 20
 1  A  |  6  6  8 10 11 13 14 16 18
 2  C  |  6  5  6  8  9 11 12 14 16
 3  A  |  7  5  4  6  7  9 11 12 14
 4  G  |  9  7  5  4  5  7  9 10 12
 5  T  |  8  8  6  4  4  5  7  8 10
 6  T  |  9  8  7  5  3  3  5  6  8
 7  A  | 11  9  7  6  4  2  3  4  6
 8  C  | 13 11  9  7  5  3  1  3  4
 9  C  | 14 12 10  8  6  4  2  1  2
10  -  | 16 14 12 10  8  6  4  2  0

The score of the optimal alignment is opt(0, 0) = 7.

A dynamic programming approach. A direct implementation of the above recursive scheme will work, but it is spectacularly inefficient. If both input strings have N characters, then the number of recursive calls will exceed 2^N.

To overcome this performance bug, we will use dynamic programming. (Read Sedgewick 5.3 for an introduction to this technique.) Dynamic programming is a powerful algorithmic paradigm that forms the core computational engine of many programs, including BLAST (the sequence alignment program almost universally used by molecular biologist in their experimental work). The key idea of dynamic programming is to break a large computational problem up into smaller subproblems, store the answers to those smaller subproblems, and, eventually, using the stored answers to solve the original problem. This avoids recomputing the same quantity over and over again. Specifically, we maintain an array, say knownOpt[i][j], that contains a table (or cache) of the solutions to all of the subproblems that the recursive function has already solved. The recursive function computes and fills in the appropriate table entry when it is presented with a new subproblem. However, when the function is presented with a previously solved subproblem, it simply looks up the appropriate value in the table and returns it.

Finding the alignment itself. The above procedure above indicates how to compute the value of the optimal alignment. We now describe how to find the optimal alignment itself. In order to reconstruct the optimal alignment, we maintain a character matrix, say sol[i][j], to keep track of where the minimum value for aligning x[i..M] with y[j..N] came from. For example, if the minimum came from aligning x[i] with y[j] (in which case we solved the subproblem opt(i+1, j+1), then we can record this fact by drawing an arrow from (i, j) to (i+1, j+1). We can obtain a crude ASCII picture of such an arrow by storing one of the three characters '\', '-', or '|' into sol[i][j]. We interpret the three symbols as arrows emanating from (i, j) and terminating at (i+1, j+1), (i+1, j), and (i, j+1), respectively. For the example above, we get the the following solution matrix:

       |  0  1  2  3  4  5  6  7  8
   x\y |  T  A  A  G  G  T  C  A  -
-----------------------------------
  0  A |  \  \  \  \  |  \  |  \  | 
  1  A |  \  \  \  \  |  \  |  \  | 
  2  C |  \  \  |  \  |  |  \  |  | 
  3  A |  -  \  \  \  |  |  \  \  | 
  4  G |  \  \  \  \  \  |  \  |  | 
  5  T |  \  \  \  \  \  \  \  |  | 
  6  T |  \  \  \  \  \  \  \  |  | 
  7  A |  -  \  \  \  \  \  |  \  | 
  8  C |  \  \  \  \  \  \  \  \  | 
  9  C |  -  -  -  -  -  -  \  \  | 
 10  - |  -  -  -  -  -  -  -  -  .

Note that all arrows point from top to bottom and left to right. In order to reconstruct the alignment, we follow the arrows from the upper left corner opt[0][0] until we arrive at the lower right corner opt[M][N]. When we encounter '-', we insert a gap in x; when we encounter '|', we insert a gap in y; and, when we encounter '\', we align the two characters. The optimal alignment is the second candidate alignment in the edit-distance section.

Input and output. Read the two input strings from standard input, one per line. Print the edit-distance to standard output. If both strings have length at most 40, also print out a table of optimal values, optimal choices, and an optimal alignment. Test your program using the example provided above, as well as the genomic data sets from GenBank.

Analysis. Estimate the running time and memory usage of your program as a function of the lengths of the two input strings M and N.

This assignment was created by Thomas Clarke, Robert Sedgewick, Scott Vafai and Kevin Wayne.

Sequence 1	`A`	`A`	`C`	`A`	`G`	`T`	`T`	`A`	`C`	`C`
Sequence 2	`T`	`A`	`A`	`G`	`G`	`T`	`C`	`A`	`-`	`-`
Penalty	1	0	1	1	0	0	1	0	2	2