COS 126: Deciphering the Genetic Code

COS 126

Deciphering the Genetic Code

Programming Assignment 8

Due: Wednesday, 11:59pm

Write a program to find the genetic encoding of amino acids, given a protein and a genetic sequence known to contain that protein, using a one-pass trial and error method. This program is an example of computational biology, an emerging science that uses computers to do biological research. Also, you'll learn about hashing, a widely used technique for making programs more efficient.

For the purposes of this assignment, we define a genetic sequence to be a string made up of one of the four characters 'a', 'c', 'g', or 't' and an amino acid encoding to be a genetic sequence of length 3. There are 25 amino acids, which we will refer to with the capital letters 'A' through 'Y', and we define a protein to be a sequence of amino acids. There are 4 × 4 × 4 = 64 different amino acid encodings, each of which corresponds to at most one amino acid, but an amino acid might have more than one encoding.

Pattern matching. Now, given a protein and a long genetic sequence known to contain that protein, we want to establish the genetic encodings of the amino acids in the protein. An example will illustrate the process that we use to do this. Suppose that we know that the protein M S I Q H M R is found somewhere in the genetic sequence attgctagcaatgctagcaattgctagcaattcat but we don't know where, and we don't know any encodings for the amino acids. The problem seems hopeless at first, but it turns out that a direct approach works if the protein sequence is long enough. First, try aligning the first characters:

          attgctagcaatgctagcaattgctagcaattcat
          M  S  I  Q  H  M  R

This approach would work if a code for M were att, a code for S were gct, etc. But this cannot be case, because gct cannot be an encoding for both S and H. Having two different genetic encodings for an amino acid (M in this case) is allowed, but each genetic code must represent only one amino acid.

The sequences don't align at the beginning, so we next try aligning the protein sequence with the second character of the genetic sequence:

          attgctagcaatgctagcaattgctagcaattcat
           M  S  I  Q  H  M  R

Again, cta can't encode two different amino acids, so this can't be the alignment. The same thing happens again for aligning the protein with the third character of the genetic sequence; how about the fourth character?

          attgctagcaatgctagcaattgctagcaattcat
             M  S  I  Q  H  M  R

Not only is there a conflict with agc encoding both S and H, but also gct encodes both M and Q. Continuing to slide the protein sequence along the genetic sequence in this way, we finally get an alignment that could work:

          attgctagcaatgctagcaattgctagcaattcat
                    M  S  I  Q  H  M  R

This alignment is evidence that atg and agc are genetic codes for the amino acid M, cta is a code for S, gca for I, etc.

This is an artificial example to illustrate how the method works. In real data, the sequences are much longer, and there's no mistaking a match. Furthermore, the same codes always come out any time the method is run on a genetic sequence known to contain a protein: these may be found in any biochemistry textbook. We have simplified the biochemistry here considerably, and have the benefit of hindsight over decades of experimental work, but one wonders if computational tricks like this could have sped up the process of discovering the relationships between genes and amino acids. This string-processing abstraction for working with the building blocks of life is coming into widespread use, and it seems clear that computational methods will play an important role in new discoveries.

The program. You may start with the following code, which takes two filenames as command line arguments, and prepares the two needed sequences to be read in. Put it in a file named gene.c.

      #include <stdio.h>
      #include <stdlib.h>

      #define MAX_GENE_LEN 100000
      #define MAX_PROT_LEN   5000
      #define CODE_LEN         64

      int hash(char p[]) {
        /* YOUR CODE HERE */
      }

      void unhash(int n) {
        /* YOUR CODE HERE */
      }
      
      int main(int argc, char *argv[]) {
        char geneseq[MAX_GENE_LEN], protseq[MAX_PROT_LEN], genecode[CODE_LEN]; 
        int i;
        FILE *prot, *gene;

        /* YOUR CODE BELOW TO READ IN GENE AND PROTEIN SEQUENCES */
        prot = fopen(argv[1], "r");
        gene = fopen(argv[2], "r");

     
        /* YOUR CODE BELOW TO FIND MATCH */


        /* print out results */
        for (i = 0; i < CODE_LEN; i++) {
           unhash(i);
           printf(" %c\n", genecode[i]); 
        }
      }

You will need to implement the alignment procedure in main(), and two auxiliary procedures: hash(), which produces a different six-bit number (0-63) for each three-character genetic encoding, and unhash(), which prints out the three characters corresponding to each six-bit number. These procedures will allow you to use the 64-entry table genecode[] in the alignment process to keep track of the encoding assignments that have been made. For example, if hash("aat") is 3 and genecode[3] is 'R', then this means that aat is an encoding for R.

Testing and output. If your program detects a match, you should print out the offset at which the match was found (e.g., at offset 10 in the example above) and the 64-entry table genecode[]. If no match is found, print out "NOT FOUND", but do not print out the table. Use the sample data above ( prot.1 and gene.1 ) for debugging. The files prot.2 and gene.2 contain real genetic data: the first is an amino acid sequence for a beta-lactamase protein and the genetic sequence for a human virus. The files prot.3, gene.3, prot.4, and gene.4 contain more real genetic data.

Submission. Submit the files readme and gene.c.

This assignment was developed by R. Sedgewick and Gun Sirer.