Problem Set 6: Moogle
You may do this assignment in pairs. See the bottom of this document for the restrictions that apply to pair programming, which may diverge from the default COS126 collaboration policies.
The goal of this assignment is to build a small web search engine called Moogle. This is really your first (somewhat) large-scale development project in OCaml, and in completing it you will demonstrate your understanding of OCaml's module system.
Your job is to implement some key components of Moogle, including efficient abstractions for sets and dictionaries and the basic crawler that constructs the web index.
The implementations of sets and dictionaries we provide won't scale very well, so we're asking you to produce better implementations based on balanced trees.
Testing for your Dict and Set functors will not be required. However, we will not be providing tests either in the code or through TigerFile, so you are strongly encouraged to write your own tests for all the functions you will be implementing. The functors you write will have a run_tests : unit -> unit function, in which you can write tests. We will also provide helper functions for using qcheck, an OCaml package for randomized testing. These functions take a property as an argument and test whether that property holds on a large number of random examples; see dict.ml for more details. Run your tests by compiling with make test, then running the file testing.native. Tests for the crawler function are provided, and can be found in testing.ml underneath the driver for the dict tests.
In addition to writing random tests, we have provided four sample dumps of webpages to test your crawler.
- simple-html: a directory containing a small list (7 pages) of pages to test your crawler.
- html: a directory containing the OCaml manual (20 html pages).
- crawlertests: a directory containing basic tests for testing.ml.
- wiki: a larger set of pages you can use to test the performance of your final implementation. These pages have been scraped from Wikipedia, using http://en.wikipedia.org/wiki/Teenage_Mutant_Ninja_Turtles as our start point. Note: Even with optimized dictionaries and sets, indexing all the pages in the wiki directory may take several minutes. (But search is near instantaneous, which is the way Google would want it! We'll talk about how to parallelize web indexing in future classes.)
You must follow proper style as described in precept, in previous assignment comments, and in the online style guides here.
Words of Warning
This project is significantly more difficult than previous problem sets because you need to read and understand all of the code that we have given you, as well as write new code. We HIGHLY recommend you start as early as possible. This is not the kind of problem set you can expect to complete last minute.
As a warning, you shouldn't just point Moogle at an arbitrary web site and start crawling it unless you understand the robots.txt protocol. This protocol lets web sites tell crawlers (like Google's or Yahoo's or Bing's) which subdirectories they are allowed to index, at what rate they can connect to the server, how frequently they can index the material, etc. If you don't follow this protocol, then you are likely to have some very angry people on your doorsteps. So to be on the safe side, you should restrict your crawling to your local disk (which is why we have set this to be the default).
Another word of warning: Setting up and configuring a real web server demands a lot of knowledge about security. In particular, we do not recommend that you poke holes in your machine's firewall so that you can talk to Moogle from a remote destination. Rather, you should restrict your interaction to a local machine.
Download the code here. Unzip and untar it:
$ tar xfz a6.tgz
Within the moogle directory, you will find a Makefile and a set of .ml files that make up the project sources. You will also find three directories of web pages that you can use for testing: simple-html, html, and wiki, as described above. Below is a brief description of the contents of each file:
- .merlin: to configure merlin for you, as in previous assignments
- Makefile: used to build the project -- type "make all" at the command line to build the project.
- order.ml: definitions for an order datatype used to compare values.
- myset.ml: an interface and simple implementation of a set abstract datatype.
- dict.ml: an interface and simple implementation of a dictionary abstract datatype.
- query.ml: a datatype for Moogle queries and a function for evaluating a query given a web index.
- util.ml: includes an interface and the implementation of crawler services needed to build the web index. This includes definitions of link and page datatypes, a function for fetching a page given a link, and the values of the command line arguments (e.g., the initial link, the number of pages to search, and the server port.)
- moogle.ml: the main code for the Moogle server.
- crawl.ml: Includes a stub for the crawler that you will need to complete.
- graph.ml: definitions for a graph abstract data type, including a graph signature, a node signature, and a functor for building graphs from a node module.
- nodescore.ml: definitions for node scores maps, which is part of the page-rank algorithm.
- pagerank.ml: Code for the page-rank algorithm including an indegree algorithm for computing page ranks.
- testing.ml: The crawler tests and a driver for dictionary and set tests.
This subsection explains how to build Moogle. Initially, Moogle won't do much for you, but it will compile.
Note: if you are using an OCaml version before 4.02, Moogle may complain about the use of some functions from the Bytes library, which was not standard before that version. You may be able to revert to the functions these replaced, such as the now-deprecated String.create, by changing the few instances of them to their String counterparts (and ignoring the ensuing warnings), but we recommend that you instead procure a more modern OCaml installation.
Once you implement the necessary routines in crawl.ml, you will be able to index small web sites (such as simple-html). To index and query larger sites, you will need to implement more efficient data structures to manage sets and dictionaries.
Compile Moogle via command line by typing make all. To start Moogle up from a terminal or shell, type:
$ ./moogle.byte 8080 42 simple-html/index.html
The first command line argument (8080) represents the port that your Moogle server listens to. Unless you know what you are doing, you should generally leave it as 8080, though you may need to try a different port (e.g., 8081, 8082, etc.) to find a free one.
The second command line argument (42) represents the number of pages to index. Moogle will index at most that many pages.
The last command line argument (simple-html/index.html) indicates the page from which your crawler should start. Moogle will only index pages that are on your local file system (inside the simple-html, html, or wiki directories.)
You should see that the server starts and then prints some debugging information, ending with the lines:
Starting Moogle on port 8080. Press Ctrl-c to terminate Moogle.
Now try to connect to Moogle with your web browser -- Chrome (here) seems rather reliable (we have experienced glitches with some versions of Firefox and Safari). Connect to the following URL:
Once you connect to the server, you should see a web page that has a PUgle logo. You can try to type in a query and if you do, you'll probably see an empty response (you'll see an empty response until you implement crawl.ml).
The Moogle home page lets you enter a search query. The query is either a word (e.g., "moo"), a conjunctive query (e.g., "moo AND cow"), or a disjunctive query (e.g., "moo OR cow"). By default, queries are treated as conjunctive so if you write "moo cow", you should get back all of the pages that have both the word "moo" and the word "cow".
The query is parsed and sent back to Moogle as an http request. At this point, Moogle computes the set of URLs (drawn from those pages we have indexed) whose web pages satisfy the query. For instance, if we have a query "Greg OR cow", then Moogle would compute the set of URLs of all pages that contain either "Greg" or "cow".
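For intuition, here is a toy sketch of that evaluation over an index mapping words to URL lists. This is hypothetical illustration only: the real evaluator lives in query.ml and works over your set abstraction, and none of the names below are the assignment's.

```ocaml
module SMap = Map.Make (String)

type query = Word of string | And of query * query | Or of query * query

(* URLs whose pages contain word w (empty if the word is unindexed). *)
let urls index w = Option.value ~default:[] (SMap.find_opt w index)

(* AND intersects the two result lists; OR unions them. *)
let rec eval index = function
  | Word w -> urls index w
  | And (a, b) ->
    let rb = eval index b in
    List.filter (fun u -> List.mem u rb) (eval index a)
  | Or (a, b) ->
    let ra = eval index a in
    ra @ List.filter (fun u -> not (List.mem u ra)) (eval index b)
```

With an index where "moo" maps to two pages and "cow" to one of them, And (Word "moo", Word "cow") yields the shared page, while Or yields both.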
After computing the set of URLs that satisfy the user's query, Moogle builds an html page response and sends the page back to the user to display in their web-browser.
Because crawling the actual web demands careful attention to protocols (in particular robots.txt), we will be crawling web pages on your local hard-disk instead.
1. Implement the Crawler
Your first major task is to implement the web crawler and build the search index. In particular, you need to replace the dummy crawl function in the file crawl.ml with a function which builds a WordDict.dict (dictionary from words to sets of links.)
You will find the definitions in the CrawlerServices module (in the file util.ml) useful. For instance, you will notice that the initial URL provided on the command line can be found in the variable CrawlerServices.initial_link. This will be the only link in the crawler's frontier when crawl is initially called. You should use the function CrawlerServices.get_page to fetch a page given a link. A page contains the URL for the page, a list of links that occur on that page, and a list of words that occur on that page. You need to update your WordDict.dict so that it maps each word on the page to a set that includes the page's URL. Then you need to continue crawling the other links on the page recursively. Of course, you need to figure out how to avoid an infinite loop when one page links to another and vice versa, and for efficiency, you should only visit a page at most once.
The variable CrawlerServices.num_pages_to_search (which is called n in the crawl function) contains the command-line argument specifying how many unique pages you should crawl. So you'll want to stop crawling after you've seen that number of pages, or when you run out of links to process.
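As a heavily simplified illustration of this loop, here is a free-standing sketch that crawls a stubbed in-memory "web" instead of calling CrawlerServices.get_page, and uses a plain Map and lists in place of WordDict and LinkSet. All names below are illustrative, not the assignment's API.

```ocaml
module SMap = Map.Make (String)

(* Stub "web": url -> (outgoing links, words on the page). *)
let toy_web : (string list * string list) SMap.t =
  SMap.of_seq (List.to_seq [
    ("a.html", (["b.html"; "a.html"], ["moo"; "cow"]));
    ("b.html", (["a.html"], ["moo"]));
  ])

(* Build a word -> urls index, visiting each page at most once and
   stopping after n pages or when the frontier is empty. *)
let crawl n start =
  let rec go n frontier visited index =
    match frontier with
    | [] -> index
    | _ when n <= 0 -> index
    | link :: rest when List.mem link visited -> go n rest visited index
    | link :: rest ->
      (match SMap.find_opt link toy_web with
       | None -> go n rest visited index
       | Some (links, words) ->
         (* Map each word on this page to a "set" containing its url. *)
         let add idx w =
           let us = Option.value ~default:[] (SMap.find_opt w idx) in
           SMap.add w (if List.mem link us then us else link :: us) idx
         in
         go (n - 1) (links @ rest) (link :: visited)
           (List.fold_left add index words))
  in
  go n [start] [] SMap.empty
```

The visited list is what prevents the a.html/b.html cycle from looping forever; in your real crawler a LinkSet plays that role.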
The module WordDict provides operations for building and manipulating dictionaries mapping words (strings) to sets of links. Note that it is defined in crawl.ml by calling a functor and passing it an argument where keys are defined to be strings, and values are defined to be LinkSet.set. The interface for WordDict can be found in dict.ml.
The module LinkSet is defined in pagerank.ml and like WordDict, it is built using a set functor, where the element type is specified to be a link. The interface for LinkSet can be found in the myset.ml file.
Running the crawler in the top-level interpreter won't really be possible, so to test and debug your crawler, you will want to compile via command line using make, and add thorough testing code. One starting point is debugging code that prints out the status of your program's execution: see the OCaml documentation for the Printf.printf functions for how to do this, and note that all of our abstract types (e.g., sets, dictionaries, etc.) provide operations for converting values to strings for easy printing. We have provided you with three sample sets of web pages to test your crawler. Once you are confident your crawler works, run it on the simple-html directory:
./moogle.d.byte 8080 7 simple-html/index.html
simple-html contains 7 very small html files that you can inspect yourself, and you should compare them against the output of your crawler. If you attempt to run your crawler on the larger sets of pages, you may notice that your crawler takes a very long time to build up your index. The dummy list implementations of dictionaries and sets do not scale very well.
Note: Unless you do a tiny bit of extra work, your index will be case sensitive. This is fine, but may be confusing when testing, so keep it in mind.
2. Sets as Dictionaries
A set is a data structure that stores keys where keys cannot be duplicated (unlike a list, which may store duplicates). For our purposes, inserting a key already in the set should simply replace the old copy of the key. Moogle uses lots of sets. For example, our web index maps words to sets of URLs, and our query evaluator manipulates sets of URLs. We also want sets to be faster than the naïve list-based implementation we have provided. Here, we build sets of ordered keys: keys on which we can provide a total ordering.
Instead of implementing sets from scratch, we want you to use what you already have for dictionaries to build sets. To do so, you must write a functor which, when given a COMPARABLE module, produces a SET module by building and adapting an appropriate DICT module. That way, when we later implement dictionaries more efficiently, we get more efficient sets for free, without rewriting the set infrastructure.
For this part, you will need to uncomment the DictSet functor in the file myset.ml. The first step is to build a suitable dictionary D by calling Dict.Make. The key question is: what can you pass in for the type definitions of keys and values? Then you need to build the set definitions in terms of the operations provided by D : DICT.
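To make the adaptation concrete, here is a free-standing miniature of the same idea, with the stdlib's Map standing in for the dictionary that Dict.Make would build (so the names and signatures below are not the assignment's). The trick to notice is what fills the role of the value type: a set is just a dictionary whose values carry no information.

```ocaml
(* A set built by adapting a dictionary whose values are unit. *)
module DictSet (C : sig type t val compare : t -> t -> int end) = struct
  module D = Map.Make (C)

  type elt = C.t
  type set = unit D.t          (* keys are the elements; values are () *)

  let empty = D.empty
  let insert x s = D.add x () s
  let member x s = D.mem x s
  let remove x s = D.remove x s
  let fold f acc s = D.fold (fun k () acc -> f k acc) s acc
end

module IntSet = DictSet (struct type t = int let compare = compare end)

let () =
  let s = IntSet.(empty |> insert 3 |> insert 1 |> insert 3) in
  (* Inserting 3 twice stores it once, so the set has two elements. *)
  Printf.printf "%d\n" (IntSet.fold (fun _ n -> n + 1) 0 s)
```

Your DictSet must answer the analogous question for Dict.Make: what do you pass in for the key and value type definitions?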
You should make sure to write a set of tests that exercise your implementation, following the "Testing" section above. Finally, you should change the Make functor at the bottom of myset.ml so that it uses the DictSet functor instead of ListSet. Try your crawler on the simple-html and html directories.
3: Dictionaries as Trees
The critical data structure in Moogle is the dictionary. A dictionary is a data structure that maps keys to values. Each key in the dictionary has a unique value, although multiple keys can have the same value. For example, WordDict is a dictionary that maps words on a page to the set of links that contain that word. At a high level, you can think of a dictionary as a set of (key,value) pairs.
As you may have observed while testing your crawler with your new set functor, building dictionaries on top of association lists (as the provided implementation does) does not scale particularly well. The implementation we have provided in dict.ml is inefficient: because it is based on lists, operations such as looking up the value associated with a key can take time linear in the size of the dictionary.
For this part of the problem set, we are going to build a different implementation of dictionaries using a kind of balanced tree called an AVL tree. You may have studied some of the properties of AVL trees in COS 226; in this course, you will learn how to implement them in a functional language.
In the file dict.ml, you will find a commented-out functor AVLDict, which is intended to build a DICT module from a DICT_ARG module. Here is the type definition for a dictionary implemented as an AVL tree:

type pair = key * value

type dict =
  | Leaf
  | Node of dict * int * pair * dict

Notice the similarities between this type definition of an AVL tree and the type definition of a binary search tree we have covered in lecture. Here, in addition to nodes that contain two subtrees, we have an integer representing the height. For a node

Node (left, height, (k1, v1), right)

the invariants are as follows:
- Every key k appearing in subtree left must be k < k1.
- Every key k appearing in subtree right must be k > k1.
- The height of a node is the length of the longest path from that node to a leaf.
- The heights of the left and right subtrees cannot differ by more than one.
Note that for our dictionary, only a single copy of a key can be stored.
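To make the invariants concrete, here is a standalone sketch of the type together with a checker for the balance invariant. String keys and int values stand in for the functor's abstract key and value, and the convention that a leaf has height 0 is one reasonable choice, not necessarily the handout's.

```ocaml
type pair = string * int
type dict = Leaf | Node of dict * int * pair * dict

(* Height convention here: a leaf has height 0. *)
let height = function Leaf -> 0 | Node (_, h, _, _) -> h

(* Smart constructor: rebuild a node, computing its height from its
   children so the stored int stays correct. *)
let node l p r = Node (l, 1 + max (height l) (height r), p, r)

(* The AVL balance invariant: at every node, the heights of the two
   subtrees differ by at most one. *)
let rec balanced = function
  | Leaf -> true
  | Node (l, _, _, r) ->
    abs (height l - height r) <= 1 && balanced l && balanced r
```

A checker like balanced is also a handy property to feed to qcheck: insert random keys and assert the result is still balanced.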
Open up dict.ml and locate the commented out AVLDict module.
3a: lookup, member
Implement these functions according to their specifications provided in DICT:

val lookup : dict -> key -> value option
val member : dict -> key -> bool

You may change the order in which the functions appear, but you may not change the signatures of these functions (name, order of arguments, types). You may add a "rec" to make it "let rec" if you feel that you need it.
Next, implement insert:

val insert : dict -> key -> value -> (bool * dict)
The insert function should first check if the current node d has a key equal to the key being inserted. If it does, it simply changes the value of the current node. If not, it must use the key to determine which of its subtrees to recursively call insert on. It then compares the heights of the tree to determine if rebalancing is necessary. The rules for rebalancing are explained in the AVL trees handout.
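As a hint at what rebalancing involves, here are the two single rotations, sketched over a standalone copy of the type (string keys and int values stand in for the abstract ones). Double rotations compose these two; this is an illustration, not a complete rebalancing routine.

```ocaml
type pair = string * int
type dict = Leaf | Node of dict * int * pair * dict

let height = function Leaf -> 0 | Node (_, h, _, _) -> h

(* Rebuild a node, recomputing its height from its children. *)
let node l p r = Node (l, 1 + max (height l) (height r), p, r)

(* Left-heavy: rotate right, promoting the left child to the root. *)
let rotate_right = function
  | Node (Node (ll, _, lp, lr), _, p, r) -> node ll lp (node lr p r)
  | t -> t

(* Right-heavy: rotate left, promoting the right child to the root. *)
let rotate_left = function
  | Node (l, _, p, Node (rl, _, rp, rr)) -> node (node l p rl) rp rr
  | t -> t
```

Note that both rotations preserve the ordering invariant: everything in ll is still left of lp, and everything in lr sits between lp and p.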
Removing a node from the tree follows a similar process, and uses a similar function:
val remove : dict -> key -> (bool * dict)
Removal is handled by the recursive function remove, which takes as arguments a dictionary and a key. remove finds the node to remove using the same traversal as insert, and then has a few different cases depending on the shape of that node:
- If both children of a node are leaves, that node is replaced with a leaf.
- If one child is a leaf, the node is replaced with the non-leaf child.
- If both children are nodes, the smallest node on the right subtree replaces the current node. Then remove is used to delete that node from the right subtree.
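The three cases above can be sketched as follows over a standalone copy of the type. This sketch deliberately omits the rebalancing step and the bool in the real (bool * dict) return type, so it shows the case analysis only, not a full solution.

```ocaml
type pair = string * int
type dict = Leaf | Node of dict * int * pair * dict

let height = function Leaf -> 0 | Node (_, h, _, _) -> h
let node l p r = Node (l, 1 + max (height l) (height r), p, r)

(* Smallest pair in a non-empty tree: walk left. *)
let rec min_pair = function
  | Leaf -> failwith "empty"
  | Node (Leaf, _, p, _) -> p
  | Node (l, _, _, _) -> min_pair l

let rec remove d k =
  match d with
  | Leaf -> Leaf
  | Node (l, _, ((k', _) as p), r) ->
    if k < k' then node (remove l k) p r
    else if k > k' then node l p (remove r k)
    else match l, r with
      | Leaf, _ -> r                 (* at most one child: promote it *)
      | _, Leaf -> l
      | _, _ ->                      (* two children: pull up the      *)
        let p' = min_pair r in       (* smallest pair of the right     *)
        node l p' (remove r (fst p')) (* subtree, then delete it there *)
```

In your real implementation, each case must additionally restore the balance invariant on the way back up.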
Once you have finished writing the insert and remove functions, it would be a good idea to test them with randomized tests.
Implement the choose function according to its specification in DICT:
choose : dict -> (key * value * dict) option
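One simple way to satisfy this specification is to peel off the smallest key. The sketch below works over a standalone copy of the type and, like the remove sketch, skips the rebalancing a full solution would need.

```ocaml
type pair = string * int
type dict = Leaf | Node of dict * int * pair * dict

let height = function Leaf -> 0 | Node (_, h, _, _) -> h
let node l p r = Node (l, 1 + max (height l) (height r), p, r)

(* Return the smallest (key, value) along with the rest of the dict,
   or None if the dict is empty. *)
let rec choose = function
  | Leaf -> None
  | Node (Leaf, _, (k, v), r) -> Some (k, v, r)
  | Node (l, _, p, r) ->
    (match choose l with
     | None -> None                       (* unreachable: l is a Node *)
     | Some (k, v, l') -> Some (k, v, node l' p r))
```

Repeatedly calling choose until it returns None is a handy way to fold over every binding, and a good target for your tests.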
Write tests to test your choose function.
4. Try your crawler!
Finally, because we have properly factored out an interface, we can easily swap in your tree-based dictionaries in place of the list-based dictionaries by modifying the Make functor at the end of dict.ml. So make the change and try out your new crawler on the three sets of webpages. Our startup code prints out the crawling time, so you can see which implementation is faster. Try it on the larger test set in the wiki directory, in addition to the html and simple-html directories, with:
./moogle.d.byte 8080 45 wiki/Teenage_Mutant_Ninja_Turtles
If you are confident everything is working, try changing 45 to 200. (This may take several minutes to crawl and index.) You're done!
If macOS complains about a Unix.EMFILE error (the limit on open files) when running the crawler, you could try this shell command:
ulimit -S -n 2560
We've provided a debug flag at the top of moogle.ml that can be set to either true or false. If debug is true, Moogle prints out debugging information using the string conversion functions defined in your dictionary and set implementations. It is set to true by default.
5: Partnership and Submission Guidelines
If you are completing this assignment with a partner, the overarching principle is that each partner is responsible for equally sharing in all elements of the assignment. This precludes "divide and conquer" approaches to the assignment. It also requires introspection to prevent a situation in which you are "carrying" your partner or allowing yourself to be "carried".
We believe that the COS126 partner collaboration policy is a good guideline with lots of good advice. Bear in mind that in the year of COVID we discourage in-person collaboration. Instead, work together via the internet (e.g. via Zoom, FaceTime, Hangouts, etc., with screen-sharing).
NOTE: both students in a pair will receive the same grade on the assignment. Following from this, both students in a pair must use the same number of automatically waived late days. This means that a pair has available the minimum number of free late days remaining between the two of them. (A partner with more free late days available would retain the excess for future individual assignments or assignments paired with another student with free late days remaining.)
To submit, use the TigerFile link posted next to the assignment link. All files are required, including the readme -- even if you aren't partnered or don't have feedback to give.
6: Extra Fun: Favicon
Your web browser may request favicon.ico from the server (moogle.ml), but we don't have a favicon. Just for fun, design a favicon for Moogle and submit it through TigerFile. You may also want to modify moogle.ml so that it knows how to serve the favicon.ico file. Look in process_request to compare how it serves PU-gle.jpg versus how it serves moogle.html.
If you have any additional comments or suggestions concerning this assignment and how to make it better in the future, also include those in your readme.