COS 126
Cracking the Genetic Code
Due: Wed. 4/23, 11:59pm

For this assignment, you'll write a program, genes.c, to find the genetic encoding of amino acids, given a protein and a genetic sequence known to contain that protein. This program is an example of computational biology, a new science that uses computers to do biological research. Also, you'll learn about hashing, a widely used technique for making programs more efficient.

Background

For the purposes of this assignment, a genetic sequence is a string comprised of one of the four characters 'a', 'c', 'g', or 't', and an amino acid encoding is a genetic sequence of length 3. There are 25 amino acids, which we will refer to with the capital letters 'A' through 'Y'. A protein is a sequence of amino acids. There are 4×4×4 = 64 different amino acid encodings, each of which corresponds to at most one amino acid, but some of the 25 amino acids might have more than one encoding.

Given a protein and a long genetic sequence known to contain that protein, we want to establish the genetic encodings of the amino acids in the protein. An example illustrates this process. Suppose that we know that the protein M S I Q H M R is somewhere in the genetic sequence attgctagcaatgctagcaattgctagcaattcat, but we don't know where, and we don't know any of the encodings for the amino acids. The problem seems hopeless at first, but it turns out that a straightforward approachthe one-pass trial-and-error methodworks if the protein sequence is long enough. First, try aligning the first characters:

attgctagcaatgctagcaattgctagcaattcat
M  S  I  Q  H  M  R

This would work if a code for M were att, a code for S were gct, and so on. But this can't be correct, because gct cannot be an encoding for both S and H, as illustrated by the bold characters in the display above. Note that having two different genetic encodings for an amino acid (M in this case as illustrated by the slanted characters) is allowedwhat is required that each genetic code represent only one amino acid.

The sequences don't align at the beginning, so we next slide the protein one character to the right and try aligning the protein sequence with the second character of the genetic sequence:

attgctagcaatgctagcaattgctagcaattcat
 M  S  I  Q  H  M  R

Again, cta can't encode two different amino acids, so this can't be the alignment. The same thing happens again for aligning the protein with the third character of the genetic sequence; how about the fourth character?

attgctagcaatgctagcaattgctagcaattcat
   M  S  I  Q  H  M  R

Not only is there a conflict with agc encoding both S and H, but also with gct encoding both M and Q. Continuing to slide the protein sequence along the genetic sequence in this way, we finally get an alignment that could work:

attgctagcaatgctagcaattgctagcaattcat
          M  S  I  Q  H  M  R

This alignment is evidence that atg and agc are genetic codes for the amino acid M, cta is a code for S, gca for I, etc.

This example is an artificial one designed to illustrate how the method works. In real data, the sequences are much longer, and there's no mistaking a match. Furthermore, errors never happen: The same codes always emerge every time the method is run on a genetic sequence known to contain a protein. (These may be found in any biochemistry textbook.)

The Assignment

Your program will accept two arguments: The name of the file holding the protein sequence and the name of the file holding the gene sequence, and it prints the position in the gene sequence at which it found a match and the amino acid encodings. For example:

% /u/cs126/bin/lcc genes.c
% a.out /u/cs126/data/prot.1 /u/cs126/data/gene.1
Match found at gene position 10
aaa .    aac .    aat R    aag .    
aca .    acc .    act .    acg .    
ata .    atc .    att Q    atg M    
aga .    agc M    agt .    agg .    
caa .    cac .    cat .    cag .    
cca .    ccc .    cct .    ccg .    
cta S    ctc .    ctt .    ctg .    
cga .    cgc .    cgt .    cgg .    
taa .    tac .    tat .    tag .    
tca .    tcc .    tct .    tcg .    
tta .    ttc .    ttt .    ttg .    
tga .    tgc .    tgt .    tgg .    
gaa .    gac .    gat .    gag .    
gca I    gcc .    gct H    gcg .    
gta .    gtc .    gtt .    gtg .    
gga .    ggc .    ggt .    ggg .

The files /u/cs126/data/prot.1 and /u/cs126/data/gene.1 contain the sample data above for debugging. The output shows all 64 possible amino acid encodings and the encodings for each protein in /u/cs126/data/prot.1; periods identify unused encodings. (Your program might print the 64 possibilities in a different order, but the encodings should be the same.)

Here's an outline of genes.c; copy this code to use as a starting point.

#include <stdio.h>
#include "misc.h"

int hash(char *p) {
	/* return a 6-bit number for the 3-character encoding p */ }

void unhash(int x) {
	/* print the 3-character encoding corresponding
	   to the 6-bit number x */ }

int main(int argc, char *argv[]) {
	char gene[10000], protein[1000], genecode[64]; 
	int j;

	if (argc < 3)
		error("usage: %s protein gene\n", argv[0]);

	/* read the contents of argv[1] into protein[0...] */
	/* read the contents of argv[2] into gene[0...] */

	/* find an encoding for protein[0...] in gene[0...]
	   and record the encoding in gencode[0..63] */

	for (j = 0; j < 64; j++) {
		unhash(j);
		printf(" %c    ", genecode[j]); 
		if ((j + 1)%4 == 0)
			printf("\n");
	}
	return 0;
}

As the outline above suggests, you'll need to implement the alignment procedure in main, and write two auxiliary functions. hash returns a different 6-bit number for each 3-character genetic encoding, and unhash prints the three characters corresponding to each 6-bit number. hash allows you to use the 64-element array genecode in the alignment process to keep track of the encoding assignments. For example, for the example shown above, if hash("aat") is 3, then gencode[3] is 'R'; that is, aat is the encoding for R.

To read a file, you'll need to use fopen, fgetc, and fclose (Deitel and Deitel, Sec. 11.3-5) to open a file, read its contents, and close the file. You should use efopen, declared in misc.h, because it checks the value returned from fopen. error is also declared in misc.h.

Protein sequences are given as uppercase letters and gene sequences are given as lowercase letters; your code can thus read these sequences a character at a time with fgetc and ignore all other characters. Use the functions declared in ctype.h to identify characters (Deitel and Deitel, Sec. 8.3).

The files /u/cs126/data/prot.2 and /u/cs126/data/gene.2 contain real genetic data: the amino acid sequence for a "beta-lactamase'' protein and the genetic sequence for a human virus. Here's what you should get:

% a.out /u/cs126/data/{prot,gene}.2
Match found at gene position 1687
aaa K    aac N    aat N    aag K    
aca T    acc T    act T    acg T    
ata I    atc I    att I    atg M    
aga R    agc S    agt S    agg .    
caa Q    cac H    cat H    cag Q    
cca P    ccc P    cct P    ccg P    
cta L    ctc L    ctt L    ctg L    
cga R    cgc R    cgt R    cgg R    
taa .    tac Y    tat Y    tag .    
tca S    tcc S    tct S    tcg S    
tta L    ttc F    ttt F    ttg L    
tga .    tgc C    tgt C    tgg W    
gaa E    gac D    gat D    gag E    
gca A    gcc A    gct A    gcg A    
gta V    gtc V    gtt V    gtg V    
gga G    ggc G    ggt G    ggg G

Submitting Your Program

Turn in your code and your documentation with the command

/u/cs126/bin/submit 10 readme genes.c

References
R. Sedgewick, Algorithms in C, 2nd ed., Addison-Wesley, 1990, pp. 231-234.


Copyright © 1996 David R. Hanson / drh@cs.princeton.edu
$Revision: 1.3 $ $Date: 1996/12/10 15:43:37 $