### COS 226 Programming Assignment

Write a program that uses a symbol table to break a widely-used password encoding scheme.

For it to be effective, this scheme requires an encryption method with two properties: encrypting a password should be easy (since it has to be done each time a user logs on) and recovering the original password from the encrypted version should be hard.

Subset-sum encryption. A simple method for managing passwords works as follows: The length of all passwords is set to a specific number of bits, say N. The system maintains a table T of N integers that are each N bits long. To encrypt a password, the system uses the password to select a subset of the numbers in T and add them together: the sum (modulo 2^N) is the encrypted password.

The following tiny example for 5-bit keys illustrates the process. Suppose that the table has the following five 5-bit numbers:

```0. 10110      0
1. 01101      0
2. 00101      1
3. 10001      0
4. 01011      1
```

For this table, the password 00101 would be encrypted as 10000, since it says to use the sum of rows two and four in the table, and 00101 + 01011 = 10000. Of course, in practice, the table would be much bigger, as discussed below.

Now, suppose you get access to the system table T and you also capture the user names and the corresponding encrypted passwords. This information doesn't get you into the system: to crack a password, you need to know a subset of T that sums to a given encrypted password. The security of the system depends on the difficulty of this problem (which is known as the subset sum problem). Obviously, the password length has to be set long enough to stop you from trying all possibilities, but short enough so that users can remember and type the passwords. In this assignment, you will see that passwords need to be longer than you might think. With N bits there are 2^N different subsets, so it would seem that 40 or 50 bits should be enough, but that is not the case.

Details.   Rather than using numbers, it is typical to use some convenient translation from what users type to numbers. For this assignment, we use an alphabet of 32 characters (lowercase letters plus first six digits) in passwords, and encode them as arrays of 5-bit integers (chars) with 0 encoding 'a', 1 encoding 'b', and so forth. Thus, N = 5*C where C is the number of characters in the password.

To get started, you can use the code encrypt.c, which is the code that the system administrator would use to encrypt a user's password. It reads in the table T, then uses the bits of the password to select the words from the table to add together, and prints out the encrypted version. You can also use the tables easy5.txt, easy8.txt, rand5.txt, and rand8.txt. The first and third are for 5-char (25-bit) keys; the second and fourth are for 8-char (40-bit) keys. The first two are highly structured and are useful for debugging; the other two are randomly generated.

For example, with the constant C defined to 8 in the code, you should get the following result for the password password. The program prints out a proof that the encryption works, for your use in understanding the process:

```% gcc encrypt.c -o encrypt
password  15  0 18 18 22 14 17  3   0111100000100101001010110011101000100011

1 gobxmqkt   6 14  1 23 12 16 10 19   0011001110000011011101100100000101010011
2 qdrvjxwz  16  3 17 21  9 23 22 25   1000000011100011010101001101111011011001
3 joobqxtz   9 14 14  1 16 23 19 25   0100101110011100000110000101111001111001
4 xnoixmnk  23 13 14  8 23 12 13 10   1011101101011100100010111011000110101010
10 tcixtvem  19  2  8 23 19 21  4 12   1001100010010001011110011101010010001100
13 lqtsdtca  11 16 19 18  3 19  2  0   0101110000100111001000011100110001000000
15 zlptzlfp  25 11 15 19 25 11  5 15   1100101011011111001111001010110010101111
18 gmjuvyqw   6 12  9 20 21 24 16 22   0011001100010011010010101110001000010110
20 uoqrdhwp  20 14 16 17  3  7 22 15   1010001110100001000100011001111011001111
22 ltdkzndz  11 19  3 10 25 13  3 25   0101110011000110101011001011010001111001
23 btezrznq   1 19  4 25 17 25 13 16   0000110011001001100110001110010110110000
26 bujilqno   1 20  9  8 11 16 13 14   0000110100010010100001011100000110101110
27 qgaicljl  16  6  0  8  2 11  9 11   1000000110000000100000010010110100101011
28 yyefwcld  24 24  4  5 22  2 11  3   1100011000001000010110110000100101100011
30 gnvowyjk   6 13 21 14 22 24  9 10   0011001101101010111010110110000100101010
34 aynzobxh   0 24 13 25 14  1 23  7   0000011000011011100101110000011011100111
38 lxwewfhh  11 23 22  4 22  5  7  7   0101110111101100010010110001010011100111
39 aenipbjd   0  4 13  8 15  1  9  3   0000000100011010100001111000010100100011
--------  -----------------------   ----------------------------------------
vbskbezp  21  1 18 10  1  4 25 15   1010100001100100101000001001001100101111
```

If you wanted, you could use bc or another calculator to check that the columns sum to the digits corresponding to the letters in the decrypted password. The table easy8.txt makes this check easier, for debugging.

```% encrypt password < easy8.txt
password  15  0 18 18 22 14 17  3   0111100000100101001010110011101000100011

1 aaaaaaac   0  0  0  0  0  0  0  2   0000000000000000000000000000000000000010
2 aaaaaaae   0  0  0  0  0  0  0  4   0000000000000000000000000000000000000100
3 aaaaaaai   0  0  0  0  0  0  0  8   0000000000000000000000000000000000001000
4 aaaaaaaz   0  0  0  0  0  0  0 25   0000000000000000000000000000000000011001
10 aaaaabaa   0  0  0  0  0  1  0  0   0000000000000000000000000000010000000000
13 aaaaaiaa   0  0  0  0  0  8  0  0   0000000000000000000000000010000000000000
15 aaaabaaa   0  0  0  0  1  0  0  0   0000000000000000000000001000000000000000
18 aaaaiaaa   0  0  0  0  8  0  0  0   0000000000000000000001000000000000000000
20 aaabaaaa   0  0  0  1  0  0  0  0   0000000000000000000100000000000000000000
22 aaaeaaaa   0  0  0  4  0  0  0  0   0000000000000000010000000000000000000000
23 aaaiaaaa   0  0  0  8  0  0  0  0   0000000000000000100000000000000000000000
26 aacaaaaa   0  0  2  0  0  0  0  0   0000000000000100000000000000000000000000
27 aaeaaaaa   0  0  4  0  0  0  0  0   0000000000001000000000000000000000000000
28 aaiaaaaa   0  0  8  0  0  0  0  0   0000000000010000000000000000000000000000
30 abaaaaaa   0  1  0  0  0  0  0  0   0000000001000000000000000000000000000000
34 azaaaaaa   0 25  0  0  0  0  0  0   0000011001000000000000000000000000000000
38 iaaaaaaa   8  0  0  0  0  0  0  0   0100000000000000000000000000000000000000
39 zaaaaaaa  25  0  0  0  0  0  0  0   1100100000000000000000000000000000000000
--------  -----------------------   ----------------------------------------
b0onjjbh   1 26 14 13  9  9  1  7   0000111010011100110101001010010000100111
```

Be sure to familiarize yourself with encrypt.c before proceeding. Note that it uses key.h and key.c to perform base-32 addition over C digit integers.

Brute force solution.  Your first task is to write a brute force decrypt program brute.c that does the opposite of encrypt.c. Given an encrypted password and the table, it should find the password: the C characters that, when converted to an N-bit binary number, specifies the subset that sums to the N-bit binary number that converts to the C characters given as command-line argument. You should be able to easily break 5-char passwords, since there are only 32*32*32*32*32 (about 33 million) different possible subsets.

With the constant C defined to be 5, you should get the following result for the password passw:
```% gcc encrypt.c key.c -o encrypt        % gcc brute.c key.c -o brute
% encrypt passw < rand5.txt             % brute exvx5 < rand5.txt
exvx5                                   i0ocs
passw
```
Note that there many not be a unique solution so your program should print out all of them. Note also that this approach won't work for longer passwords. For example, for 10-char passwords, there are 32^10 (over 1000 trillion) different possible subsets.

Symbol-table solution.  Your next task is to write a faster decryption program (name your program decrypt.c) that is functionally equivalent to brute.c but fast enough to decrypt 10-char passwords in a reasonable amount of time (say, under an hour). To do so, use a symbol table. The basic idea is to take a subset S of the table T, compute all possible subset sums that can be made with S, put those values into a symbol table, then use that symbol table to check all those possibilities with just one lookup.

When you consider the scheme just sketched, several questions immediately come to mind: How big should S be? Precisely how is the one-lookup check going to work? Which symbol-table ADT is appropriate? Which algorithm would be best? Coping with these details is the substance of your work for this assignment.

You should debug your program for 6-char, then move up to 8 and 10-char passwords. Your goal, of course, is to be able to decrypt any password that was encrypted with encrypt.c, as in, for example:

```% gcc decrypt.c -o decrypt
% decrypt vbskbezp < rand8.txt
koaofbmx
xvbyofnz
1p1ngsgg
```

This approach also breaks down as the password length increases, but it does sufficiently cut the work to crack passwords to render vulnerable systems that use this scheme.

Deliverables.   Submit the two programs described above and a readme.txt that documents the approach that you used. When feasible, use your decrypt.c implementation to decrypt the following passwords that we encrypted with encrypt.c (using rand8.txt, rand10.txt, and rand12.txt).

```xwtyjjin          h554tkdzti          uz1nuyric5u3
rpb4dnye          oykcetketn          xnsriqenxw5p
kdidqv4i          bkzlquxfnt          4l4dxa3sqwjx
m5wrkdge          wixxliygk1          wuupk1ol3lbq
```

Include in your readme.txt estimates of the amount of time that it would take your brute.c program to break an 8-character password and the amount of time and space that it would take your decrypt.c program to break a 12-character password.

Extra credit. It is annoying that several passwords may map to the same encrypted password. Create tables for which there is no such ambiguity when decrypting.