Programming Assignment: Autocomplete Me

Write a program to implement autocomplete for a given set of N strings and positive weights. That is, given a prefix and an integer k, find the top k strings in the set among those that start with the prefix.

Autocomplete is an important feature of many modern applications. As the user types, the program predicts the complete term (typically a word or phrase) that the user intends to type. Autocomplete is most effective when there are a limited number of likely terms. For example, the Internet Movie Database uses it to display the names of movies as the user types; search engines use it to display suggestions as the user enters web search queries; cell phones use it to speed up text input.

In these examples, the application predicts how likely it is that the user is typing each term and presents to the user a list of the top k matching terms, in decreasing order of weight. These weights are determined by historical data, such as box office revenue for movies, frequencies of search queries from other Google users, or the typing history of a cell phone user. For the purposes of this assignment, you will have access to a list of all possible terms and associated weights (and the terms and weights will not change).

The performance of autocomplete functionality is critical in many systems. For example, consider a search engine which runs an autocomplete application on a server farm. According to one study, the application has only about 50ms to return a list of suggestions for it to be useful to the user. Moreover, in principle, it must perform this computation for every keystroke typed into the search bar and for every user!

In this assignment, you will implement autocomplete in two different ways. The first, BruteAutocomplete.java, is a naïve, brute-force solution. The second, Autocomplete.java, will use a fancier data structure (of your own design) to improve performance substantially.

Brute force. Write an immutable data type BruteAutocomplete.java that provides autocomplete functionality for a given set of terms and weights by implementing the API below. Your brute-force implementation should maintain the terms and weights in two parallel arrays. To find the top k matching terms, scan through all of the terms, identifying those that start with the given prefix, and return the top k such terms.

public class BruteAutocomplete
{
    // Initializes an autocomplete data structure from the given parallel arrays of terms and weights.
    public BruteAutocomplete(String[] terms, double[] weights)

    // Returns the weight of the term, or 0.0 if no such term.
    public double weightOf(String term)

    // Returns a top matching term, or null if no matching term.
    public String topMatch(String prefix)

    // Returns the top k matching terms (in descending order of weight), as an iterable.
    // If fewer than k matches, return all matching terms (in descending order of weight).
    public Iterable<String> topMatches(String prefix, int k)
}

The constructor should throw an IllegalArgumentException unless the arguments are valid: the two arrays are of the same length; the weights are strictly positive; and no term is included more than once. The topMatches() method should throw a IllegalArgumentException if k is negative. Each method and constructor should throw a NullPointerException if any argument is null.

This brute-force approach is easy to implement but, when N is large, the number of autocomplete queries that can be processed per second will be too small to be useful. You can use it as a reference solution and benchmark when developing a faster solution.

Your constructor for BruteAutocomplete should be subquadratic in the number of terms, and your topMatch() and topMatches() methods should be subquadratic in the number of terms (and thus you should not use your weightOf method if it is linear time).

A faster solution. In this part, you will implement an immutable data type with the same API, but using better algorithms and data structures. The number of queries that can be processed per second should be several orders of magnitude better than with the brute-force implementation. Any clever solution that meets this requirement and is well explained in the readme will receive full credit for performance.

To meet this requirement, your algorithm should have a best case runtime that is much less than the number of matching prefixes. Let N_p be the number of terms that match a query prefix p. For a best case input, your program should consider fewer than N_p weights when collecting the top-k matches. What constitutes a best case input depends on your implementation. Put another way, your algorithm should not have the property that it always examines all N_p prefix matches.

As with BruteAutocomplete, your constructor should be subquadratic in the number of terms.

public class Autocomplete
{ 
    // Initializes an autocomplete data structure from the given parallel arrays of terms and weights.
    public Autocomplete(String[] terms, double[] weights)

    // Returns the weight of the term, or 0.0 if no such term.
    public double weightOf(String term)

    // Returns a top matching term, or null if no matching term.
    public String topMatch(String prefix)

    // Returns the top k matching terms (in descending order of weight), as an iterable.
    // If fewer than k matches, return all matching terms (in descending order of weight).
    public Iterable<String> topMatches(String prefix, int k)
}

The training wheels are off on this part of the assignment! It is up to you to decide which data structures and algorithms to design and use. If you need help getting started, the checklist describes one effective strategy. Do not start coding until you've made sure your algorithm can theoretically obey the performance guidelines. If you do end up coding yourself into a dead-end, contact Josh to see if its worth starting over from scratch.

Input format. We provide a number of sample input files for testing. Each file consists of an integer N followed by N pairs of terms and positive weights. There is one pair per line, with the weight and term separated by a tab. The terms can be arbitrary strings consisting of Unicode characters, including spaces (but not newlines).

The file wikitionary.txt contains the 10,000 most common words in Project Gutenberg along with weights equal to their frequencies. The file cities.txt contains nearly 100,000 cities along with weights equal to their populations.

% more wiktionary.txt     
10000
   56271872.00	the
   33950064.00	of
   29944184.00	and
   25956096.00	to
   17420636.00	in
   11764797.00	i
   11073318.00	that
   10078245.00	was
    8799755.00	his
                ...
       3923.23	calves

% more cities.txt
93827
      14608512	Shanghai, China
      13076300	Buenos Aires, Argentina
      12691836	Mumbai, India
      12294193	Mexico City, Distrito Federal, Mexico
      11624219	Karachi, Pakistan
      11174257	İstanbul, Turkey
      10927986	Delhi, India
      10444527	Manila, Philippines
      10381222	Moscow, Russia
                ...
             2	Al Khāniq, Yemen

Below is a sample client that takes the name of an input file and an integer k as command-line arguments. It reads the data from the file, then repeatedly reads autocomplete queries from standard input and prints out the top k matching terms.

public static void main(String[] args) {

    // read in the parallel arrays of terms and weights from a file
    In in = new In(args[0]);
    int N = in.readInt();
    String[] terms = new String[N];
    double[] weights = new double[N];
    for (int i = 0; i < N; i++) {
        weights[i] = in.readDouble();   // read the next weight
        in.readChar();                  // scan past the tab
        terms[i] = in.readLine();       // read the next term
    }

    // read in queries from standard input and print out the top k matching terms
    int k = Integer.parseInt(args[1]);
    Autocomplete autocomplete = new Autocomplete(terms, weights);
    while (StdIn.hasNextLine()) {
        String prefix = StdIn.readLine();
        for (String term : autocomplete.topMatches(prefix, k))
            StdOut.printf("%14.1f  %s\n", autocomplete.weightOf(term), term);
    }
}

Here are a few sample executions:

% java Autocomplete wiktionary.txt 5
auto
        6197.0  automobile
        4250.0  automatic
comp
      133159.0  company
       78039.8  complete
       60384.9  companion
       52050.3  completely
       44817.7  comply
the
    56271872.0  the
     3340398.0  they
     2820265.0  their
     2509917.0  them
     1961200.0  there

% java Autocomplete cities.txt 7
M
    12691836.0  Mumbai, India
    12294193.0  Mexico City, Distrito Federal, Mexico
    10444527.0  Manila, Philippines
    10381222.0  Moscow, Russia
     3730206.0  Melbourne, Victoria, Australia
     3268513.0  Montréal, Quebec, Canada
     3255944.0  Madrid, Spain
Al M
      431052.0  Al Maḩallah al Kubrá, Egypt
      420195.0  Al Manşūrah, Egypt
      290802.0  Al Mubarraz, Saudi Arabia
      258132.0  Al Mukallā, Yemen
      227150.0  Al Minyā, Egypt
      128297.0  Al Manāqil, Sudan
       99357.0  Al Maţarīyah, Egypt

Interactive GUI (optional, but fun and no extra work). Download and compile AutocompleteGUI.java. The program takes the name of a file and an integer k as command-line arguments and provides a GUI for the user to enter queries. It presents the top k matching terms in real time. When the user selects a term, the GUI opens up the results from a google search for that term in a browser.

% java AutocompleteGUI cities.txt 10

Extra credit 1. This is an opportunity to earn extra credit and contribute to future offerings of this assignment. Create a real-world data (preferably large or huge) for which autocomplete would be appropriate and document it in your readme file (including a reference to the original data source). Below are a few possibilities. Note that some of the datasets are massive and you will need to filter them down to appropriate sizes and put them into our file format.

Wikipedia: term = Wikipedia page, weight = number of hits per year.
Google books Ngram Viewer: term = n-gram, weight = frequency of occurrence in corpus of books.
Corpus of Contemporary American English: term = n-gram, weight = frequency of occurrence in corpus.
Wiktionary: term = word, weight = frequency of occurrence in corpus. Pick a language with a non-Latin alphabet such as Hebrew, Arabic, Russian, Greek, or Japanese.
Echo Nest: term = artist, weight = "artist familarity"; term = song, weight = "hottnesss".
The Internet Movie Database: term = movie, weight = number of reviews or average rating.
NASDAQ: term = company, weight = market capitalization.

Be sure that your file is in the prescribed format (tab-separated and UTF-8 encoded). If your file is less than 10MB, submit it as usual; if it is larger, please contact your preceptor for submission instructions.

Extra credit 2. Improve AutcompleteGUI.java in the following (or other) ways:

Make the suggestion box autoupdate when the user enters characters without typing, e.g., by selecting a character from Mac OS X Character Viewer.
Make the search case insensitive and diacritic insensitive, but have it still display the original terms.
Improve the graphical layout, e.g., add more vertical space between terms in the suggestion box.
Display the weights, aligned and in light gray.
Make the window size expand (or scroll) when given a large values of k.

Autograder competition. Optionally, you may submit your code to the Autocomplete Competition. Your program will be timed and the results displayed on a public leaderboard. You should submit Autocomplete.java as well as any necessary files, as well as nickname.txt. Whatever you put in nickname.txt will be used as your name in the leaderboard. This feature will be available late Sunday. You'll have until the 16th to make more submissions. There is no official reward for doing well in the competition.

You can view the current leaderboard at this link.

Deliverables. Submit BruteAutocomplete.java, Autocomplete.java, and any other accompanying files (excluding those in stdlib.jar and algs4.jar). Your may not call any library functions other than those in java.lang, java.util, stdlib.jar, and algs4.jar. Finally, submit a readme.txt file and answer the questions.