COS 226 Programming Assignment Checklist: Autocomplete


Please report any bugs or ambiguities to Josh.
Frequently Asked Questions

Why do I get a java.util.InputMismatchException from readDouble() when I run the sample client? You probably have an old version of stdlib.jar in your classpath. On Mac OS X, be sure to check the /Library/Java/Extensions and ~/Library/Java/Extensions folders.

Can I assume that the weights are positive integers? You may assume that the weights are positive, but do not assume that they are integers. In some of the data files, the weights represent populations or frequencies and are stored as integers. But, in general, the weights can be arbitrary real numbers, so you should use the type double. Note that integers between 0 and 2^53 are represented exactly using Java's double type, i.e., with no loss of precision.
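For instance, here is a quick standalone check of that claim (illustrative only, not part of the assignment):

    public class DoubleExactness {
        public static void main(String[] args) {
            long p53 = 1L << 53;  // 2^53 = 9007199254740992
            // below 2^53, consecutive integers remain distinct as doubles
            System.out.println((double) (p53 - 1) == (double) p53);  // false
            // at 2^53, precision runs out: 2^53 + 1 rounds to 2^53
            System.out.println((double) p53 == (double) (p53 + 1)); // true
        }
    }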

Can I assume that the weights are all distinct? No. However, if there are two terms with equal weights, the result of an autocomplete query may not be unique. In such cases, you are free to return any correct result.

Can I assume that the terms are in descending order of weight? No. While this is true for many of the sample input files, do not make this assumption.

Can a term or query be the empty string? Yes and yes. For example, the empty-string query should return the top-k overall terms in the input. It might be harder to think of an application for an empty-string term, but the API does not exclude this possibility.

Can I assume that the alphabet consists of only the 26 uppercase letters A through Z? No. As stated in the assignment, the characters can be arbitrary Unicode characters (other than tabs and newlines). You may assume that all characters are in the basic multilingual plane (U+0000 to U+FFFF), which includes nearly all commonly used writing systems and symbols. This implies that any call to charAt() will return a full Unicode character (rather than one half of a supplementary character). If you like, you may also assume that the characters U+0000 and U+FFFF are not used; for example, you may use either character as a sentinel that is smaller than or larger than every other character.

What do I need to do to handle Unicode characters and UTF-8 encoding? In short, nothing special: the Java String data type handles Unicode naturally, and the supplied test files are encoded using UTF-8. When reading input or writing output, StdIn, In, and StdOut assume UTF-8 encoding.

How can I view the Unicode characters that my program prints out?

Note that even if you can't view Unicode characters properly, it's likely that your program is still handling them correctly. You can use the program UnicodeTest.java to test whether your system is configured to display Unicode.

Should the terms and queries be case sensitive? Yes and yes. They should also be diacritic sensitive: a diacritic is a mark added to a letter, such as an accent. That is, you shouldn't do anything special to deal with case or diacritics. (In real-world applications, you would typically normalize the strings to a canonical representation by removing diacritics using java.text.Normalizer and converting to upper case.)
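If you are curious, here is a sketch of the real-world normalization mentioned above (not needed for this assignment; the class name and sample string are just examples):

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String s = "Crème Brûlée";
            // decompose accented characters into base letter + combining mark
            String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
            // strip the combining marks, then fold case
            String normalized = decomposed.replaceAll("\\p{M}", "").toUpperCase();
            System.out.println(normalized);  // prints CREME BRULEE
        }
    }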

Can I just implement the solution suggested on this page? Sure. You should be able to explain why this algorithm meets the constraints on the assignments page.

I had a really cool idea but it seems slow. Do I have to start over? Contact Josh, but probably not.

The Dropbox script is taking longer than usual. Should I worry? As long as you see all the tests, you're fine. The testing procedure for this assignment is tricky because we create a new Autocomplete object for every test, and this can take a non-trivial amount of time for large input files.
TST-Specific FAQ
These questions were inspired by Piazza and Office Hours questions. I'm not quite sure how frequent they are, but I've erred on the side of caution and included all of the significant ones.

How do I use the sentinel characters U+0000 and U+FFFF? Example: String s = "\uFFFF"; declares a string consisting of a single character equal to U+FFFF.
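Here is a tiny demonstration of the sentinel idea (class name illustrative):

    public class SentinelDemo {
        public static void main(String[] args) {
            String sentinel = "\uFFFF";             // one character, U+FFFF
            System.out.println(sentinel.length());  // 1
            System.out.println('\uFFFF' > 'z');     // true: larger than every other character
            // appending the sentinel guarantees no key is a prefix of another
            System.out.println("spite".startsWith("spit" + sentinel));  // false
        }
    }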

How do I recover the string corresponding to a leaf node? There are two easy ways. One is to store in each node a reference to its parent; e.g., the parent of the blue "p" node in the checklist figure would be "s", so that if you reach the "y" node, you can go from "y" -> "p" -> "s". The easier way is simply to store the string in each leaf node.

How do I deal with the case where one prefix match is a prefix of another, for example, "spit" and "spite"? The most intuitive way is to use a MinPQ (note that this requires creating a container class for weighted terms, as sketched below). The easiest way is to ensure that no string is a prefix of any other (which you can do with sentinel characters!).
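Here is a minimal sketch of such a container class (the name WeightedTerm and its fields are illustrative, not prescribed by the assignment):

    public class WeightedTerm implements Comparable<WeightedTerm> {
        private final String term;
        private final double weight;

        public WeightedTerm(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }

        public String term()   { return term; }
        public double weight() { return weight; }

        // order by weight so that MinPQ/MaxPQ rank terms correctly
        public int compareTo(WeightedTerm that) {
            return Double.compare(this.weight, that.weight);
        }
    }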

I created a helper method to search the subtrie rooted at a prefix, but it's behaving weirdly. For example, if I search for "s" in the TST shown in the checklist below, it also searches the left child, so I get "buck" back as a match for "s". How can I best handle this? One approach is to pass s's mid child (in this case, the node containing "m") to your helper method. Another is to enqueue only the .mid child before the first delMax(), and enqueue .left, .mid, and .right after all subsequent delMax() operations.

I'm doing something strange with my TST, e.g. creating temporary copies of nodes, calling a collect operation every time I delete the max, exchanging nodes, etc. Is this a bad thing? Not necessarily, but an implementation of the idea in this checklist should be pretty straightforward. There's no need to modify the TST once it has been created.

What else should I look out for? The most common bugs seem to be referencing the prefix root instead of the most recently dequeued item from the MaxPQ, and using a node's value instead of its max-child value, or vice versa, when performing comparisons (this latter problem is especially common in solutions that use two PQs).
Input, Output, and Testing

Input files. Below are some input files for testing. Each input file consists of an integer N followed by N pairs of terms and integer weights, one pair per line, with the term and weight separated by a tab. The terms can be arbitrary strings consisting of Unicode characters, including spaces (but neither tabs nor newlines).

file                number of terms   term                 weight           source
pu-buildings.txt    166               Princeton buildings  age (in Y10K)
fortune1000.txt     1,000             company              revenue          Fortune 1000
pu-courses.txt      6,771             course               enrollment       Princeton 2006-2014
wiktionary.txt      10,000            word                 frequency        Wiktionary
baby-names.txt      31,109            first name           frequency        U.S. Social Security
cities.txt          93,827            city                 population       geoba.se
2grams.txt          277,718           2-gram               frequency        Google Web Trillion Word Corpus
words-3333333.txt   333,333           word                 frequency        Google Web Trillion Word Corpus
alexa.txt           1,000,000         web site             frequency        Alexa
movies.txt          229,447           movies               revenue (USD)    Internet Movie Database
actors.txt          2,875,183         actors               revenue (USD)    Internet Movie Database
artists.txt         43,849            music artist         familiarity      Million Song Dataset
songs.txt           922,230           song                 hotttnesss       Million Song Dataset
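One way to read a file in this format is sketched below, using In from stdlib/algs4. It assumes, as the description above suggests, that the term comes first on each line; swap the two fields if your data files put the weight first.

    import edu.princeton.cs.algs4.In;

    public class ReadTerms {
        public static void main(String[] args) {
            In in = new In(args[0]);
            int n = in.readInt();
            in.readLine();                     // consume the rest of the first line
            String[] terms = new String[n];
            double[] weights = new double[n];
            for (int i = 0; i < n; i++) {
                String line = in.readLine();
                int tab = line.indexOf('\t');  // terms may contain spaces, but not tabs
                terms[i] = line.substring(0, tab);
                weights[i] = Double.parseDouble(line.substring(tab + 1));
            }
            System.out.println("read " + n + " terms");
        }
    }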

Possible progress steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

A TST-Based Solution

Here, we suggest one effective strategy that combines a ternary search trie with a priority queue. The TST contains all of the terms, along with their weights. In addition, each node stores the maximum weight in its subtrie—we will exploit this auxiliary information when we implement the search algorithm. The following figure illustrates the data structure for the 6-word input file autocomplete6.txt.

Trie structure
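One possible node representation for this data structure is sketched below (field names are illustrative, not prescribed; the term field also implements the store-the-string idea from the TST FAQ above):

    private static class Node {
        char c;                 // character stored in this node
        Node left, mid, right;  // standard TST links
        String term;            // non-null if and only if this node ends a term
        double weight;          // weight of that term (meaningful only if term != null)
        double maxWeight;       // maximum weight of any term in this subtrie
    }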

Now, we want to find the top k terms among those that start with a given prefix. Clearly, all of the matching words are in the subtrie corresponding to the prefix, so the first step is to search for this node, say x. Now, to identify the top k matches, we could examine every node in the subtrie rooted at x, but this takes too long if the subtrie is large. Instead, we will use the auxiliary information to avoid exploring useless parts of the subtrie.

Specifically, we create a maximum-oriented priority queue of nodes and initialize the priority queue with node x. Then, we repeatedly remove the maximum-weight node from the priority queue; check whether the node corresponds to a term (i.e., it has a value associated with it) and, if so, add that term to a list of potential terms; and add to the priority queue any non-null nodes referenced by it (left, mid, and right). We can terminate the search as soon as the weight of the heaviest node in the priority queue is less than (or equal to) the weight of the kth heaviest term in the list, because no remaining term in the subtrie has weight larger than the weight of the heaviest node. Warning: it is not safe to terminate the search as soon as the list of potential terms has k terms. Why?
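Here is a sketch of that loop, assuming the Node class above, the WeightedTerm container from the TST FAQ, and the MaxPQ/MinPQ implementations from algs4.jar (package edu.princeton.cs.algs4; older stdlib distributions put them in the default package):

    import edu.princeton.cs.algs4.MaxPQ;
    import edu.princeton.cs.algs4.MinPQ;

    // x is the root of the subtrie matching the prefix, typically the mid
    // child of the node where the prefix search ends; if that node itself
    // stores a term equal to the prefix, handle it separately (see the
    // TST FAQ above).
    private Iterable<WeightedTerm> topMatches(Node x, int k) {
        MaxPQ<Node> pq = new MaxPQ<>(
            (Node a, Node b) -> Double.compare(a.maxWeight, b.maxWeight));
        MinPQ<WeightedTerm> best = new MinPQ<>();  // the k heaviest terms so far
        if (x != null) pq.insert(x);
        while (!pq.isEmpty()) {
            // safe to stop: no remaining node can beat the kth best term found
            if (best.size() == k && pq.max().maxWeight <= best.min().weight())
                break;
            Node node = pq.delMax();
            if (node.term != null) {                 // this node ends a term
                best.insert(new WeightedTerm(node.term, node.weight));
                if (best.size() > k) best.delMin(); // keep only the k heaviest
            }
            if (node.left  != null) pq.insert(node.left);
            if (node.mid   != null) pq.insert(node.mid);
            if (node.right != null) pq.insert(node.right);
        }
        return best;  // sort by descending weight before returning to clients
    }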

Below is a snapshot of the algorithm when searching for the top-3 terms that start with the prefix s. The next node to be examined is the one containing the letter p. The first matching term to be discovered will be spit, followed by spite and then sad. The subtries rooted at y and o will never be explored.

Search Algorithm
Diagram errata: the node containing the character 'y' should be the right child of the node containing the character 'i' (instead of 'p').

Improvements.

Other Approaches

Range maximum queries. The problem of finding the top term matching a given prefix can be formulated as a range maximum query problem or equivalently as a lowest common ancestor problem. Here are some lecture notes by Erik Demaine on the equivalence.

Priority search trees.

Enrichment

Is there any way to find the range of words that start with a given prefix in time proportional to the length of the prefix (in the worst case)? Yes: use a suffix tree (or a suffix array together with the LCP array), where the terms are separated by a word-separation symbol and concatenated together. To use space proportional to the number of terms, consider a suffix array on words. Or consider a compact DAWG.