COS 226 Programming Assignment Checklist: Autocomplete


UNDER CONSTRUCTION


Frequently Asked Questions

What is meant by an immutable data type? An immutable data type is a data type whose value never changes once it is constructed. For example, Autocomplete make a defensive copy of the terms[] array (in case the client mutates it).

Can my methods have side effects that are not described in the API? No. For example, the Autocomplete data type should not mutate the terms[] array (or it could have a serious side effect on the client). Similarly, the firstIndexOf() and lastIndexOf() should not mutate their argument arrays. Occasionally, we allow methods to have undocumented (but benign) side effects, such as changing the state of a random number generator.

When I compile BinarySearchDeluxe.java I get the compiler warning "unchecked call to compare(T, T) as a member of the raw type java.util.Comparator." Is that OK? Yes. We note that it is possible to eliminate this warning by using the following cumbersome syntax:

public class BinarySearchDeluxe { public static<Key> int firstIndexOf(Key[] a, Key key, Comparator<Key> comparator) public static<Key> int lastIndexOf(Key[] a, Key key, Comparator<Key> comparator) }
However, on this assignment, use the signatures given in the API (which are consistent with the signatures that Java uses in Arrays.binarySearch().

When I compile BinarySearchDeluxe.java I get the compiler warning "Warning: Type safety: The method compare(java.lang.Object, java.lang.Object) belongs to the raw type java.util.Comparator. References to generic type java.util.Comparator should be parameterized." Is that OK? See above. This is the same error message that the Eclipse Compiler provides in the situation above.

Can I use Arrays.sort()? Yes.

What exactly does it mean to compare two string by their first r characters? You should compare the two strings lexicographically, as in the natural order for Java strings, but you should never examine more than r characters of either string. For example, if r = 3, then cat is less than dog, dog is equal to dogcatcher, and do is less than dogcatcher.

Can I assume that the weights are all distinct? No. However, if there are two terms with equal weights, the allMatches() method can return them in arbitrary order.

Can I assume that the Autocomplete constructor receives the terms in order? No, while many of the sample input files are in descending order by weight, do not make any assumptions about the order.

What should allMatches() return if there are no matching terms? An array of length 0.

Can a prefix or query string be the empty string? Yes and yes. For example, the empty-string prefix should return all queries, in descending order by weight. It might be harder to think of an application for an empty-string query, but the API does not exclude this possibility.

What do I need to do to handle Unicode characters and UTF-8 encoding? In summary, you shouldn't need to do anything special: The Java String data type handles Unicode naturally; The supplied test files are encoded using UTF-8. When reading input or writing output, StdIn, In, and StdOut assume UTF-8 encoding.

Why do I get a java.util.InputMismatchException from readDouble() when I run the sample client? You probably have an old version of stdlib.jar in your classpath that doesn't assume UTF-8 encoding. On Mac OS X, be sure to check /Library/Extensions/Java and ~/Library/Extensions/Java folders.

Why do I get a java.util.NoSuchElementException from readInt() when I run the sample client? Did you cut-and-paste the file instead of downloading it? Your text editor may have corrupted the file by changing the encoding from UTF-8 to something else.

How can I view the Unicode characters that my program prints out?

Note that even if you can't view Unicode properly, it's likely that your program is still handling Unicode properly. You can use the program UnicodeTest.java to test whether your system is configured to display Unicode.


Input, Output, and Testing

Input files. Below are some input files for testing. Each input file consists of an integer N followed by N pairs of terms and integer weights, one pair per line, with the term and weight separated by a tab. The terms can be arbitrary strings consisting of Unicode characters, including spaces (but neither tabs nor newlines).

file number term weight source
pu-buildings.txt 166 Princeton buildings age (in Y10K)
pokemon.txt 729 Pokemon popularity Allen Qin
fortune1000.txt 1,000 company revenue Fortune 1000
nasdaq.txt 2,658 NASDAQ market cap nasdaq.com
metal-albums.txt 3,000 metal albums votes Metal Storm
pu-courses.txt 6,771 course enrollment Princeton Registrar
wiktionary.txt 10,000 word frequency Wiktionary
redditors.txt 10,000 reddittor subscribers redditlist.com
baby-names.txt 31,109 first name frequency U.S. Social Security
cities.txt 93,827 city population geoba.se
mandarin.txt 94,339 Mandarin word frequency Invoke IT
2grams.txt 277,718 2-gram frequency Google Web Trillion Word Corpus
3grams.txt 1,020,009 3-gram frequency Corpus of Contemporary American English
4grams.txt 1,034,308 4-gram frequency Corpus of Contemporary American English
5grams.txt 1,044,269 5-gram frequency Corpus of Contemporary American English
words-3333333.txt 333,333 word frequency Google Web Trillion Word Corpus
artists.txt 43,849 music artist familiarity Million Song Dataset
songs.txt 922,230 song hotttnesss Million Song Dataset
alexa.txt 1,000,000 web site frequency Alexa
movies.txt 229,447 movies revenue (USD) Internet Movie Database
actors.txt 2,875,183 actors revenue (USD) Internet Movie Database


Possible progress steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

UNDER CONSTRUCTION

Enrichment

Suppose that I want to find only the top-k matching terms (instead of all matching terms). Are there better algorithms? Yes. The problem of finding the top term matching a given prefix can be formulated as a range maximum query problem or equivalently as a lowest common ancestor problem. Here are some lecture notes by Erik Demaine on the equivalence.

Any way to find the range of words that start with a given prefix in time proportional to the length of the prefix (in the worst case)? Yes, use a suffix tree (or a suffix array with the lcp array), where the terms are separated by a word-separation symbol and concatenated together. To use space proportional to the number of terms consider a suffix array on words. Or consider a compact DAWG.