COS 226 HW#4: Program Report

COS 226 Programming Assignment #4

Program Report

Your program report should answer the following questions. This should be submitted electronically using whiteboard along with the rest of your program in a file called either report.txt (if in plain ascii text) or report.pdf (if in pdf). Please do not submit using any other formats.

Explain your overall approach.
Estimate using big-Oh notation the maximum possible number of unique k-letter substrings in a sequence of length n over an alphabet of size S. Give the best bound you can think of in terms only of k, n and S. You can regard k as a constant.
Estimate using big-Oh notation how much time and space your program requires to construct a Markov model from a data file of total size N (possibly containing multiple text blocks), where k is the order parameter and S is the alphabet size. Also estimate the time and space requirements to compute the log likelihood of a new test string of length n under such a model. You can regard k as a constant, and you can assume that hash tables require constant time for each operation, and only take up space proportional to the number of keys they contain.
Train your program using the first two debates. Then, using data from the third debate, create a table showing how often your program misclassified a Bush quote (i.e., incorrectly attributed the quote to Kerry), and how often it misclassified a Kerry quote, giving both as a function of the order parameter k. Use a range of value for k, such as: 0, 1, 2, 3, 5, 7, 10, 20, 50.
Write a brief paragraph exploring the effectiveness of this technique for this problem. You can do this however you wish, but try to be creative, thoughtful, perceptive, critical and objective. Your observations can be quantitative and systematic, or they can be more anecdotal. You are welcome (but not required) to do more experiments besides the one in the last question. Feel free to submit graphical plots, if appropriate. You might think about questions like the following: Which values of k seem to be most effective? What about the behavior of this technique seemed to be what you would have expected, and what aspects seemed surprising? At an intuitive level, how does the method seem to be working (or not working)? What features of the text seem to have the greatest influence on its effectiveness? How does this approach to this problem compare to how a human might solve it? What ideas do you have for improving it?
Any known bugs / limitations?
List whatever help (if any) that you received.
Describe any serious problems you encountered.
Any other comments or feedback?
If you created and explored any other datasets for extra credit, please tell us about them and describe what you found.