COS 226 Programming Assignment Checklist: Autocomplete


Please report any bugs or ambiguities to Josh.
Frequently Asked Questions

Why do I get a java.util.InputMismatchException from readDouble() when I run the sample client? You probably have an old version of stdlib.jar in your classpath. On Mac OS X, be sure to check the /Library/Java/Extensions and ~/Library/Java/Extensions folders.

Can I assume that the weights are positive integers? You may assume that the weights are positive, but do not assume that they are integers. In some of the data files, the weights represent populations or frequencies and are stored as integers. But, in general, the weights can be arbitrary real numbers, so you should use the type double. Note that integers between 0 and 2^53 are represented exactly using Java's double type, i.e., with no loss of precision.
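For instance, here is a quick standalone check of that claim (illustrative only, not part of the assignment):

    public class DoubleExactness {
        public static void main(String[] args) {
            long p53 = 1L << 53;  // 2^53 = 9007199254740992
            // below 2^53, consecutive integers remain distinct as doubles
            System.out.println((double) (p53 - 1) == (double) p53);  // false
            // at 2^53, precision runs out: 2^53 + 1 rounds to 2^53
            System.out.println((double) p53 == (double) (p53 + 1)); // true
        }
    }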

Can I assume that the weights are all distinct? No. However, if there are two terms with equal weights, the result of an autocomplete query may not be unique. In such cases, you are free to return any correct result.

Can I assume that the terms are in descending order of weight? No. While this is true for many of the sample input files, do not make this assumption.

Can a term or query be the empty string? Yes and yes. For example, the empty-string query should return the top-k overall terms in the input. It might be harder to think of an application for an empty-string term, but the API does not exclude this possibility.

Can I assume that the alphabet consists of only the 26 uppercase letters A through Z? No. As stated in the assignment, the characters can be arbitrary Unicode characters (other than tabs and newlines). You may assume that all characters are in the basic multilingual plane (U+0000 to U+FFFF), which includes nearly all commonly used writing systems and symbols. This implies that any call to charAt() will return a full Unicode character (rather than one half of a supplementary character). If you like, you may also assume that the characters U+0000 and U+FFFF are not used; for example, you may use either character as a sentinel that is smaller than or larger than every other character.

What do I need to do to handle Unicode characters and UTF-8 encoding? In short, nothing special: the Java String data type handles Unicode naturally, and the supplied test files are encoded using UTF-8. When reading input or writing output, StdIn, In, and StdOut assume UTF-8 encoding.

How can I view the Unicode characters that my program prints out?

Note that even if you can't view Unicode characters properly, it's likely that your program is still handling them correctly. You can use the program UnicodeTest.java to test whether your system is configured to display Unicode.

Should the terms and queries be case sensitive? Yes and yes. They should also be diacritic sensitive: a diacritic is a mark added to a letter, such as an accent. That is, you shouldn't do anything special to deal with case or diacritics. (In real-world applications, you would typically normalize the strings to a canonical representation by removing diacritics using java.text.Normalizer and converting to upper case.)
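If you are curious, here is a sketch of the real-world normalization mentioned above (not needed for this assignment; the class name and sample string are just examples):

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String s = "Crème Brûlée";
            // decompose accented characters into base letter + combining mark
            String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
            // strip the combining marks, then fold case
            String normalized = decomposed.replaceAll("\\p{M}", "").toUpperCase();
            System.out.println(normalized);  // prints CREME BRULEE
        }
    }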

Can I just implement the solution suggested on this page? Sure. You should be able to explain why this algorithm meets the constraints on the assignments page.

I had a really cool idea but it seems slow. Do I have to start over? Contact Josh, but probably not.

The Dropbox script is taking longer than usual. Should I worry? As long as you see all the tests, you're fine. The testing procedure for this assignment is tricky because we create a new Autocomplete object for every test, and this can take a non-trivial amount of time for large input files.
TST-Specific FAQ
These questions were inspired by Piazza and Office Hours questions. I'm not quite sure how frequent they are, but I've erred on the side of caution and included all of the significant ones.

How do I use the sentinel characters U+0000 and U+FFFF? Example: String s = "\uFFFF"; declares a string consisting of a single character equal to U+FFFF.
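Here is a tiny demonstration of the sentinel idea (class name illustrative):

    public class SentinelDemo {
        public static void main(String[] args) {
            String sentinel = "\uFFFF";             // one character, U+FFFF
            System.out.println(sentinel.length());  // 1
            System.out.println('\uFFFF' > 'z');     // true: larger than every other character
            // appending the sentinel guarantees no key is a prefix of another
            System.out.println("spite".startsWith("spit" + sentinel));  // false
        }
    }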

How do I recover the string corresponding to a leaf node? There are two easy ways. One is to store in each node a reference to its parent; e.g., the parent of the blue "p" node in the checklist figure would be "s", so that if you reach the "y" node, you can go from "y" -> "p" -> "s". The easier way is simply to store the string in each leaf node.

How do I deal with the case where one prefix match is a prefix of another, for example, "spit" and "spite"? The most intuitive way is to use a MinPQ (note that this requires creating a container class for weighted terms, as sketched below). The easiest way is to ensure that no string is a prefix of any other (which you can do with sentinel characters!).
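Here is a minimal sketch of such a container class (the name WeightedTerm and its fields are illustrative, not prescribed by the assignment):

    public class WeightedTerm implements Comparable<WeightedTerm> {
        private final String term;
        private final double weight;

        public WeightedTerm(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }

        public String term()   { return term; }
        public double weight() { return weight; }

        // order by weight so that MinPQ/MaxPQ rank terms correctly
        public int compareTo(WeightedTerm that) {
            return Double.compare(this.weight, that.weight);
        }
    }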

I created a helper method to search the subtrie rooted at a prefix, but it's behaving weirdly. For example, if I search for "s" in the TST shown in the checklist below, it also searches the left child, so I get "buck" back as a match for "s". How can I best handle this? One approach is to pass s's mid child (in this case, the node containing "m") to your helper method. Another is to enqueue only the .mid child before the first delMax(), and enqueue .left, .mid, and .right after all subsequent delMax() operations.

I'm doing something strange with my TST, e.g. creating temporary copies of nodes, calling a collect operation every time I delete the max, exchanging nodes, etc. Is this a bad thing? Not necessarily, but an implementation of the idea in this checklist should be pretty straightforward. There's no need to modify the TST once it has been created.

What else should I look out for? The most common bugs seem to be referencing the prefix root instead of the most recently dequeued item from the MaxPQ, and using a node's value instead of its max-child value, or vice versa, when performing comparisons (this latter problem is especially common in solutions that use two PQs).
Input, Output, and Testing

Input files. Below are some input files for testing. Each input file consists of an integer N followed by N pairs of terms and integer weights, one pair per line, with the term and weight separated by a tab. The terms can be arbitrary strings consisting of Unicode characters, including spaces (but neither tabs nor newlines).

file                number of terms   term                 weight           source
pu-buildings.txt    166               Princeton buildings  age (in Y10K)
fortune1000.txt     1,000             company              revenue          Fortune 1000
pu-courses.txt      6,771             course               enrollment       Princeton 2006-2014
wiktionary.txt      10,000            word                 frequency        Wiktionary
baby-names.txt      31,109            first name           frequency        U.S. Social Security
cities.txt          93,827            city                 population       geoba.se
2grams.txt          277,718           2-gram               frequency        Google Web Trillion Word Corpus
words-3333333.txt   333,333           word                 frequency        Google Web Trillion Word Corpus
alexa.txt           1,000,000         web site             frequency        Alexa
movies.txt          229,447           movies               revenue (USD)    Internet Movie Database
actors.txt          2,875,183         actors               revenue (USD)    Internet Movie Database
artists.txt         43,849            music artist         familiarity      Million Song Dataset
songs.txt           922,230           song                 hotttnesss       Million Song Dataset
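One way to read a file in this format is sketched below, using In from stdlib/algs4. It assumes, as the description above suggests, that the term comes first on each line; swap the two fields if your data files put the weight first.

    import edu.princeton.cs.algs4.In;

    public class ReadTerms {
        public static void main(String[] args) {
            In in = new In(args[0]);
            int n = in.readInt();
            in.readLine();                     // consume the rest of the first line
            String[] terms = new String[n];
            double[] weights = new double[n];
            for (int i = 0; i < n; i++) {
                String line = in.readLine();
                int tab = line.indexOf('\t');  // terms may contain spaces, but not tabs
                terms[i] = line.substring(0, tab);
                weights[i] = Double.parseDouble(line.substring(tab + 1));
            }
            System.out.println("read " + n + " terms");
        }
    }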

Possible progress steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

A TST-Based Solution

Here, we suggest one effective strategy that combines a ternary search trie with a priority queue. The TST contains all of the terms, along with their weights. In addition, each node stores the maximum weight in its subtrie—we will exploit this auxiliary information when we implement the search algorithm. The following figure illustrates the data structure for the 6-word input file autocomplete6.txt.

Trie structure
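One possible node representation for this data structure is sketched below (field names are illustrative, not prescribed; the term field also implements the store-the-string idea from the TST FAQ above):

    private static class Node {
        char c;                 // character stored in this node
        Node left, mid, right;  // standard TST links
        String term;            // non-null if and only if this node ends a term
        double weight;          // weight of that term (meaningful only if term != null)
        double maxWeight;       // maximum weight of any term in this subtrie
    }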

Now, we want to find the top k terms among those that start with a given prefix. Clearly, all of the matching words are in the subtrie corresponding to the prefix, so the first step is to search for this node, say x. Now, to identify the top k matches, we could examine every node in the subtrie rooted at x, but this takes too long if the subtrie is large. Instead, we will use the auxiliary information to avoid exploring useless parts of the subtrie.

Specifically, we create a maximum-oriented priority queue of nodes and initialize the priority queue with node x. Then, we repeatedly remove the maximum-weight node from the priority queue; check whether the node corresponds to a term (i.e., it has a value associated with it) and, if so, add that term to a list of potential terms; and add to the priority queue any non-null nodes referenced by it (left, mid, and right). We can terminate the search as soon as the weight of the heaviest node in the priority queue is less than (or equal to) the weight of the kth heaviest term in the list, because no remaining term in the subtrie has weight larger than the weight of the heaviest node. Warning: it is not safe to terminate the search as soon as the list of potential terms has k terms. Why?
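Here is a sketch of that loop, assuming the Node class above, the WeightedTerm container from the TST FAQ, and the MaxPQ/MinPQ implementations from algs4.jar (package edu.princeton.cs.algs4; older stdlib distributions put them in the default package):

    import edu.princeton.cs.algs4.MaxPQ;
    import edu.princeton.cs.algs4.MinPQ;

    // x is the root of the subtrie matching the prefix, typically the mid
    // child of the node where the prefix search ends; if that node itself
    // stores a term equal to the prefix, handle it separately (see the
    // TST FAQ above).
    private Iterable<WeightedTerm> topMatches(Node x, int k) {
        MaxPQ<Node> pq = new MaxPQ<>(
            (Node a, Node b) -> Double.compare(a.maxWeight, b.maxWeight));
        MinPQ<WeightedTerm> best = new MinPQ<>();  // the k heaviest terms so far
        if (x != null) pq.insert(x);
        while (!pq.isEmpty()) {
            // safe to stop: no remaining node can beat the kth best term found
            if (best.size() == k && pq.max().maxWeight <= best.min().weight())
                break;
            Node node = pq.delMax();
            if (node.term != null) {                 // this node ends a term
                best.insert(new WeightedTerm(node.term, node.weight));
                if (best.size() > k) best.delMin(); // keep only the k heaviest
            }
            if (node.left  != null) pq.insert(node.left);
            if (node.mid   != null) pq.insert(node.mid);
            if (node.right != null) pq.insert(node.right);
        }
        return best;  // sort by descending weight before returning to clients
    }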

Below is a snapshot of the algorithm when searching for the top-3 terms that start with the prefix s. The next node to be examined is the one containing the letter p. The first matching term to be discovered will be spit, followed by spite and then sad. The subtries rooted at y and o will never be explored.

Search Algorithm
Diagram errata: the node containing the character 'y' should be the right child of the node containing the character 'i' (instead of 'p').

Improvements.

Other Approaches

Range maximum queries. The problem of finding the top term matching a given prefix can be formulated as a range maximum query problem or equivalently as a lowest common ancestor problem. Here are some lecture notes by Erik Demaine on the equivalence.

Priority search trees.

Enrichment

Is there any way to find the range of words that start with a given prefix in time proportional to the length of the prefix (in the worst case)? Yes: use a suffix tree (or a suffix array together with the LCP array), where the terms are separated by a word-separation symbol and concatenated together. To use space proportional to the number of terms, consider a suffix array on words. Or consider a compact DAWG.