Goals

  • Solve a pattern matching problem that arises in computational biology.

  • Learn to use strings.

  • More practice with arrays.

Part 0

  • Copy the following files from /u/cs126/files/gene into an empty directory.
    gene.c   prot.1   gene.1   prot.2   gene.2
    prot.3   gene.3   prot.4   gene.4
    
    You can copy all the files to the current directory with the command:
    cp /u/cs126/files/gene/* .
    

Part 1 (code and decode)

  • You can use gene.c as a starting point; it handles the input and output. After filling in the missing details, you will be able to run the program with the following command:
    a.out prot.1 gene.1
    
    The protein file comes first. Don't accidentally swap the arguments:
    a.out gene.1 prot.1
    

  • First, get the code function working and debugged. It should take a character array and return an integer corresponding to the first 3 characters in the array. Think of the 3 characters as an integer represented in base 4, with the mapping a=0, c=1, g=2, and t=3, and with the first character as the most significant digit. Your job is to convert this to base 10.

  • One approach for debugging the code function is to first comment out the portion of code that prints out the results, and replace it with printf statements like the following:
    printf("%d\n", code(geneseq));
    printf("%d\n", code(geneseq + 3));
    
    The first line should convert the first three characters of the input file gene.1 to the appropriate integer. Similarly, the second line converts the next three characters to an integer. Note that geneseq + 3 is the array of gene sequence data starting at index 3 (the fourth character); it is equivalent to &geneseq[3]. Since the first 6 characters of gene.1 are attgct, you should get the following output:
    15
    39
    

  • Now, get the decode function working and debugged. The input is an integer between 0 and 63. The function should print out the 3 characters corresponding to this integer. Think of the integer as a three-digit number in base 4, and print each digit using the mapping a=0, c=1, g=2, t=3.

  • To debug, you can replace the printf statements above with:
    decode(15);
    decode(39);
    
    This should produce the following output:
    att gct
    

  • To convert from base 10 to base 4, you can use integer division and remainder, or, since the base is a power of 2, right shift and bitwise AND.
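
    The two conversions in Part 1 can be sketched as follows. This is only a minimal illustration, not the intended solution: the helpers digit, triplet_divmod, and triplet_shift are hypothetical names, the signatures may differ from those in gene.c, returning -1 on an unexpected character is an assumption, and the trailing space printed by decode is inferred from the sample output. As a check on the arithmetic, att is 0*16 + 3*4 + 3 = 15 and gct is 2*16 + 1*4 + 3 = 39.

```c
#include <stdio.h>

static const char LETTERS[] = "acgt";   /* index = base-4 digit */

/* Hypothetical helper: base-4 digit for one nucleotide, or -1 if invalid. */
static int digit(char ch)
{
    switch (ch) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;
    }
}

/* Value of the first 3 characters of seq, read as base-4 digits with the
   first character most significant. Returns -1 on any other character
   (an assumption; the assignment does not say how to handle bad input). */
int code(const char seq[])
{
    int value = 0;
    for (int i = 0; i < 3; i++) {
        int d = digit(seq[i]);
        if (d < 0)
            return -1;
        value = 4 * value + d;
    }
    return value;
}

/* Hypothetical helper: triplet for n via integer division and remainder.
   The caller must pass n in the range 0..63. */
void triplet_divmod(int n, char out[4])
{
    out[0] = LETTERS[n / 16];
    out[1] = LETTERS[(n / 4) % 4];
    out[2] = LETTERS[n % 4];
    out[3] = '\0';
}

/* Hypothetical helper: same result via right shift and bitwise AND,
   which works because 4 is a power of 2 (two bits per digit). */
void triplet_shift(int n, char out[4])
{
    out[0] = LETTERS[(n >> 4) & 3];
    out[1] = LETTERS[(n >> 2) & 3];
    out[2] = LETTERS[n & 3];
    out[3] = '\0';
}

/* Print the triplet for n (0..63), followed by a space. */
void decode(int n)
{
    char out[4];
    triplet_divmod(n, out);
    printf("%s ", out);
}
```

    Writing the triplet into a small buffer first keeps the conversion separate from the printing, so either variant can be checked against code for all 64 values before it is wired into decode.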

Part 2 (matching)

  • To aid in debugging, initialize each element of genecode[64] to '-' instead of ' '.

  • You may wish to use the strlen library function.

  • The solution for the example data in gene.1 and prot.1 is gene.1.ans.
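
    The two hints above about genecode and strlen might look like the following in code. This is a sketch under assumptions: genecode is taken to hold one character per triplet code (0 to 63), and init_genecode, triplet_count, and NCODES are hypothetical names that do not come from gene.c.

```c
#include <stddef.h>
#include <string.h>

#define NCODES 64   /* one entry per nucleotide triplet (hypothetical name) */

/* Fill the table with '-' so that entries never assigned during matching
   stand out when the table is printed (a blank would be invisible). */
void init_genecode(char genecode[NCODES])
{
    memset(genecode, '-', NCODES);
}

/* Number of complete triplets in a null-terminated sequence.
   Any 1 or 2 leftover characters at the end are ignored. */
size_t triplet_count(const char seq[])
{
    return strlen(seq) / 3;
}
```

    Note that strlen only works if the sequence array is null-terminated, so it is worth confirming that the input code in gene.c stores a terminating '\0' before relying on it.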

Submission
  • Use the following submit command:
    /u/cs126/bin/submit 7 readme gene.c
    

  • The readme file should contain:

  • Name, precept number, high-level description of code, any problems encountered, and whatever help (if any) you received.

  • Describe how you implemented the functions code and decode.

  • Include the output from the prot.2/gene.2 and prot.3/gene.3 data sets, i.e., the position of the match and the mapping of nucleotide triplets to amino acids.

Enrichment Links

  • The genetic data is actually cDNA (the coding region of DNA), not DNA. If you wish to compare your results with your biology textbook, or with the following amino acid table borrowed from EBB 320, note that the mapping matches the RNA one once each t is replaced by u.

  • The genetic data is taken from the National Center for Biotechnology Information.



  • Kevin Wayne