Goals

  • Solve a text database searching problem.

  • Learn to consider the performance characteristics of different algorithms that solve the same problem.

  • Learn to use binary search trees.

  • Part 0

  • Copy the following files from /u/cs126/files/text into an empty directory.
    corpus0  query0  corpus1  query1  corpus2  query2
    
    You can copy all the files to the current directory with the command:
    cp /u/cs126/files/text/* .
    

  • Part 1    (reading in input)

  • Design your program so that it executes with the command:
    a.out corpus0 query0
    

  • You should be able to recycle the code from Assignment 7 to read in the two files using the getc library function. Alternatively, the fgets library function may be useful for reading in the query file.

  • Don't forget that when you write
    while((c = getc(file)) != EOF)
    
    that c must be an int and not a char.

  • The largest files corpus2 and query2 contain less than 1,200,000 and 400,000 characters respectively. It's OK to hardwire these constants into your program, so that you don't have to dynamically allocate memory.

  • Convert newline characters to spaces when reading in the corpus file. Ootherwise you won't be able to match phrases that span multiple lines. The query strings in the query files are delimited by newline characters.

  • Make sure you get this part debugged before continuing - as sanity checks you can use printf and strlen to make sure the strings have been read in correctly.

  • Part 2    (brute force)

  • If you choose to only turn in the brute force method, do not use any string library functions (e.g., strchr), although you may still want to use them for debugging purposes.

  • Try your brute force method on corpus2 and query2, but after seeing how slow it is, use Ctrl-c to stop. Otherwise, you will waste valuable computing resources.

  • To ensure that you understand what your program should return, here is answer for query0. One advantage of writing the brute force algorithm is that now you have a way to check your answer when you run a more sophisticated algorithm.

  • To execute your program use:
    a.out corpus0 query0
    
    For larger files, redirect the output to a file:
    a.out corpus2 query2 > out2
    
    Otherwise, the output will be a blur. Also, printing the output directly to the screen will take more time than the actual computation!

  • Part 3    (faster method)

    Both the BST method and the qsort method suggested in the assignment are capable of solving the text-indexing problem for the supplied data in a reasonable amount of time. We also encourage you to think of your own method for solving the problem. If it's faster, great. If not, that's OK too, since you will have learned more by experimenting.

  • BST approach

  • It is recommended that you start with the brute force approach if you are not totally confident with strings.

  • You can start with the Sedgewick code. Some people find it easier to start from scratch.

  • The string library function strcmp, strncmp, and strlen may be useful. Remember that strlen consumes computer resources so do not overuse or this will slow down your code.

  • Use lcc -n to check if your program accidentally dereferences any null pointers.
  • quicksort approach

  • One obstacle is understanding the library functions bsearch and qsort. This will require advanced knowledge of pointers. If you use this method, you must document and explain how pointers are being used, i.e., it is not sufficient to use trial and error to figure out where to put the &'s and *'s - you must understand why.

  • Alternatively, you can write your own sorting and binary search routines.

  • Be sure to find the first occurrence of each query string. To get the first occurrence, you will probably have to write your own binary search code.
  • hashing

  • This is an approach similar to the code and decode method from Assignment 6. Now the keys are much longer, so you will not be able to find a one-to-one mapping between keys and integers (e.g., aaa = 0, aac = 1, ..., ttt = 63).

  • Write a hash function that converts the first k characters into an integer between 0 and T. You will have to determine good values for k and the table size T.

  • Now, several strings may "hash" to the same integer. So, for each hash value you will need to store a list of corpus strings that match. When you search for a particular query string, you determine its hash value. It now suffices to search the list corresponding to that hash value only. This list will hopefully be very small, and brute force search of this list will be fast.

  • Be careful about handling query phrases less than k letters long - your code may need some minor modificiations.
  • other approaches

  • Too numerous to list. You are encouraged to experiment.

  • Some may be much faster on certain inputs, some will be much slower, some may be more complicated, some may be simpler. None is perfect. Learn to make tradeoffs in computing and life.

  • Submission

  • Use the submit command:
    /u/cs126/bin/submit 8 readme index.c
    
    Also submit any other auxilliary files that are needed to compile your program. It's not necessary to submit a brute force solution, if you submit a different method.

  • The readme file should contain:

  • Name, precept number, any problems encountered, and whatever help your received.

  • Give a high level description of the basic approach you took to solve the problem. Indicate advantages and disadvantages of your method, and why you chose it.

  • Include instructions for compiling your code. Be sure to submit all necessary files.

  • Output of query0/corpus0 and query1/corpus1. First and last 15 lines of output from query2/corpus2.

  • Indicate how long it takes for query2/corpus2 to complete - use the Unix /usr/bin/time command.

  • Extra Credit

  • Find all occurrences of a given query string. Depending on your method, this may require only 5 or so extra lines of code, so give it a shot. No extra credit awarded here if you do a brute force approach.

  • Here's an easy but "malicious" input with only 1000 words in the corpus and 7000 query phrases: query3, corpus3. Design an improved algorithm that does not grind to halt, even on "pathological" inputs.



  • Kevin Wayne