
Problem Set 5: Moogle

You may do this assignment in pairs. See the bottom of this document for the restrictions that apply to pair programming, which may diverge from the default COS126 collaboration policies.

Objectives

The goal of this assignment is to build a small web search engine called Moogle. This is really your first (somewhat) large-scale development project in OCaml. In completing this assignment, you will demonstrate your understanding of OCaml's module system. You are encouraged to work in pairs for this problem set.

Your job is to implement some key components of Moogle, including efficient abstractions for sets and dictionaries and the basic crawler that constructs the web index. You will also implement a page-rank algorithm that can be used to sort the links returned from a query so that more "important" links are returned first.

The naïve implementations of sets and dictionaries we provide won't scale very well, so we're asking you to produce better implementations based on balanced trees.

Testing

Testing your functors is required. Every functor you write will have a run_tests : unit -> unit function. See the example tests in dict.ml (the tests for remove), and see how to run them by scrolling to the very bottom of dict.ml.
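
For orientation, run_tests is simply a function that calls each of your individual test functions in turn. A minimal sketch follows; the test names here are hypothetical placeholders for tests you would write yourself.

  (* Sketch of a run_tests function; test_insert and test_lookup are
     hypothetical names for tests you would write. *)
  let run_tests () =
    test_insert () ;
    test_lookup () ;
    ()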

In addition to writing unit tests, we have provided three sample dumps of webpages to test your crawler.

  • simple-html: a directory containing a small collection of pages (7 pages) to test your crawler.
  • html: a directory containing the OCaml manual (20 HTML pages).
  • wiki: a larger set of pages you can use to test the performance of your final implementation. These pages have been scraped from Wikipedia, using http://en.wikipedia.org/wiki/Teenage_Mutant_Ninja_Turtles as our start point. Note: Even with optimized dictionaries and sets, indexing all the pages in the wiki directory may take several minutes. (But search is near instantaneous, which is the way Google would want it! We'll talk about how to parallelize web indexing in future classes.)

Style

You must follow the style guidelines discussed in precept, in comments on previous assignments, and in the online style guides here.

Words of Warning

This project is significantly more difficult than previous problem sets because you need to read and understand all of the code that we have given you, as well as write new code. We HIGHLY recommend you start as early as possible. This is not the kind of problem set you can expect to complete at the last minute.

As a warning, you shouldn't just point Moogle at an arbitrary web site and start crawling it unless you understand the robots.txt protocol. This protocol lets web sites tell crawlers (like Google's or Yahoo's or Bing's) which subdirectories they are allowed to index, at what rate they can connect to the server, how frequently they can index the material, etc. If you don't follow this protocol, then you are likely to have some very angry people on your doorstep. So to be on the safe side, you should restrict your crawling to your local disk (which is why we have set this to be the default).

Another word of warning: Setting up and configuring a real web server demands a lot of knowledge about security. In particular, we do not recommend that you poke holes in your machine's firewall so that you can talk to Moogle from a remote destination. Rather, you should restrict your interaction to a local machine.

Getting Started

Download the code here. Unzip and untar it:

$ tar xfz a5.tgz

Within the moogle directory, you will find a Makefile and a set of .ml files that make up the project sources. You will also find three directories of web pages that you can use for testing: simple-html, html, and wiki, as described above. Below is a brief description of the contents of each file:

  • .merlin: to configure merlin for you, as in previous assignments
  • Makefile: used to build the project; type "make all" at the command line to build it.
  • order.ml: definitions for an order datatype used to compare values.
  • myset.ml: an interface and simple implementation of a set abstract datatype.
  • dict.ml: an interface and simple implementation of a dictionary abstract datatype.
  • query.ml: a datatype for Moogle queries and a function for evaluating a query given a web index.
  • util.ml: includes an interface and the implementation of crawler services needed to build the web index. This includes definitions of link and page datatypes, a function for fetching a page given a link, and the values of the command line arguments (e.g., the initial link, the number of pages to search, and the server port).
  • moogle.ml: the main code for the Moogle server.
  • crawl.ml: Includes a stub for the crawler that you will need to complete.
  • graph.ml: definitions for a graph abstract data type, including a graph signature, a node signature, and a functor for building graphs from a node module.
  • nodescore.ml: definitions for node score maps, which are used as part of the page-rank algorithm.
  • pagerank.ml: skeleton code for the page-rank algorithm including a dummy indegree algorithm for computing page ranks. You will be editing this.

Build Moogle

This subsection explains how to build Moogle. Initially, Moogle won't do much for you, but it will compile.


Note: if you are using an OCaml version before 4.02, Moogle may complain about the use of some functions from the Bytes library, which was not standard before that version. You may be able to fall back on the functions these replaced, such as the now-deprecated String.create, by changing the few instances of them to their String counterparts (and ignoring the ensuing warnings), but we recommend that you instead procure a more modern OCaml installation.


Once you implement the necessary routines in crawl.ml, you will be able to index small web sites (such as simple-html). To index and query larger sites, you will need to implement more efficient data structures to manage sets and dictionaries.

Compile Moogle from the command line by typing make all. To start Moogle from a terminal or shell, type:


  $ ./moogle.d.byte 8080 42 simple-html/index.html

The first command line argument (8080) is the port that your Moogle server listens to. Unless you know what you are doing, you should generally leave it as 8080, though you may need to try a different port (e.g., 8081, 8082, etc.) to find a free one.

The second command line argument (42) is the maximum number of pages to index; Moogle will index at most that many pages.

The last command line argument (simple-html/index.html) indicates the page from which your crawler should start. Moogle will only index pages that are on your local file system (inside the simple-html, html, or wiki directories).

You should see that the server starts and then prints some debugging information ending with the lines:

  Starting Moogle on port 8080.
  Press Ctrl-c to terminate Moogle.

Now try to connect to Moogle with your web-browser -- Chrome (here) seems rather reliable, but we have experienced glitches with some versions of Firefox and Safari. Connect to the following URL:

  http://localhost:8080

Once you connect to the server, you should see a web page with the PUgle logo. You can try typing in a query, but you'll see an empty response until you implement crawl.ml.

The Moogle home page lets you enter a search query. The query is either a word (e.g., "moo"), a conjunctive query (e.g., "moo AND cow"), or a disjunctive query (e.g., "moo OR cow"). By default, queries are treated as conjunctive so if you write "moo cow", you should get back all of the pages that have both the word "moo" and the word "cow".

The query is parsed and sent back to Moogle as an http request. At this point, Moogle computes the set of URLs (drawn from those pages we have indexed) whose web pages satisfy the query. For instance, if we have a query "Greg OR cow", then Moogle would compute the set of URLs of all pages that contain either "Greg" or "cow".

After computing the set of URLs that satisfy the user's query, Moogle builds an html page response and sends the page back to the user to display in their web-browser.

Because crawling the actual web demands careful attention to protocols (in particular robots.txt), we will be crawling web pages on your local hard-disk instead.

1. Implement the Crawler

Your first major task is to implement the web crawler and build the search index. In particular, you need to replace the dummy crawl function in the file crawl.ml with a function that builds a WordDict.dict (a dictionary from words to sets of links).

You will find the definitions in the CrawlerServices module (in the file util.ml) useful. For instance, you will notice that the initial URL provided on the command line can be found in the variable CrawlerServices.initial_link. You should use the function CrawlerServices.get_page to fetch a page given a link. A page contains the URL for the page, a list of links that occur on that page, and a list of words that occur on that page. You need to update your WordDict.dict so that it maps each word on the page to a set that includes the page's URL. Then you need to continue crawling the other links on the page recursively. Of course, you need to figure out how to avoid an infinite loop when one page links to another and vice versa, and for efficiency, you should only visit a page at most once.

The variable CrawlerServices.num_pages_to_search contains the command-line argument specifying how many unique pages you should crawl. So you'll want to stop crawling after you've seen that number of pages, or when you run out of links to process.

The module WordDict provides operations for building and manipulating dictionaries mapping words (strings) to sets of links. Note that it is defined in crawl.ml by calling a functor and passing it an argument where keys are defined to be strings, and values are defined to be LinkSet.set. The interface for WordDict can be found in dict.ml.

The module LinkSet is defined in pagerank.ml and, like WordDict, is built using a set functor, where the element type is specified to be a link. The interface for LinkSet can be found in the myset.ml file.
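
To give a concrete (and hedged) sense of how these pieces fit together, here is a sketch of indexing the words of a single page. The page field names (url, words) and the argument order of LinkSet.insert are assumptions based on the descriptions above; check util.ml and myset.ml for the real definitions. The crawl loop itself (tracking visited links and a frontier, and stopping after CrawlerServices.num_pages_to_search pages) is the part you need to design.

  (* Sketch only: add one page's words to the index. The page field names
     (url, words) and LinkSet's argument order are assumptions. *)
  let add_page_to_index (index : WordDict.dict) page : WordDict.dict =
    List.fold_left
      (fun d word ->
        let links =
          match WordDict.lookup d word with
          | None -> LinkSet.insert page.url LinkSet.empty
          | Some set -> LinkSet.insert page.url set
        in
        WordDict.insert d word links)
      index page.words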

Running the crawler in the top-level interpreter won't really be possible, so to test and debug your crawler, you will want to compile from the command line using make and add thorough testing code. One starting point is debugging code that prints out the status of your program's execution: see the OCaml documentation for the Printf.printf functions, and note that all of our abstract types (e.g., sets, dictionaries, etc.) provide operations for converting values to strings for easy printing. We have provided you with three sample sets of web pages to test your crawler. Once you are confident your crawler works, run it on the small simple-html directory:


./moogle.d.byte 8080 7 simple-html/index.html

simple-html contains 7 very small html files that you can inspect yourself; compare their contents against the output of your crawler. If you attempt to run your crawler on the larger sets of pages, you may notice that your crawler takes a very long time to build up your index. The dummy list implementations of dictionaries and sets do not scale very well.

Note: Unless you do a tiny bit of extra work, your index will be case sensitive. This is fine, but may be confusing when testing, so keep it in mind.

2. Sets as Dictionaries

A set is a data structure that stores keys, where keys cannot be duplicated (unlike a list, which may store duplicates). For our purposes, inserting a key that is already in the set should replace the old key with the new one. Moogle uses lots of sets. For example, our web index maps words to sets of URLs, and our query evaluator manipulates sets of URLs. We also want sets to be faster than the naïve list-based implementation we have provided. Here, we build sets of ordered keys: keys on which we can provide a total ordering.

Instead of implementing sets from scratch, we want you to use what you already have for dictionaries to build sets. To do so, you must write a functor which, when given a COMPARABLE module, produces a SET module by building and adapting an appropriate DICT module. That way, when we later implement dictionaries more efficiently, we get faster sets for free, without rewriting the set infrastructure or the underlying algorithms.

For this part, you will need to uncomment the DictSet functor in the file myset.ml. The first step is to build a suitable dictionary D by calling Dict.Make. The key question is: what can you pass in for the type definitions of keys and values? Then you need to build the set definitions in terms of the operations provided by D : DICT.
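
The overall shape of the functor might look something like the sketch below, with ??? marking exactly the decisions the previous paragraph asks you to make. This is a structural sketch, not working code; the precise signatures of COMPARABLE, SET, DICT, and DICT_ARG are in myset.ml and dict.ml.

  module DictSet(C : COMPARABLE) : (SET with type elt = C.t) =
  struct
    module D = Dict.Make(struct
      (* ??? supply the key and value types (and the functions DICT_ARG
         requires) in terms of the COMPARABLE module C *)
      ???
    end)

    type elt = C.t
    type set = D.dict
    let empty = D.empty
    (* ... the remaining SET operations, each implemented using D ... *)
  end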

You should make sure to write a set of unit tests that exercise your implementation, following the "Testing" section above. You must test ALL your functions (except for the string functions). Finally, you should change the Make functor at the bottom of myset.ml so that it uses the DictSet functor instead of ListSet. Try your crawler on the simple-html and html directories.

3. Dictionaries as Trees

The critical data structure in Moogle is the dictionary. A dictionary is a data structure that maps keys to values. Each key in the dictionary is associated with exactly one value, although multiple keys can map to the same value. For example, WordDict is a dictionary that maps each word to the set of links to pages containing that word. At a high level, you can think of a dictionary as a set of (key,value) pairs.

As you may have observed while testing your crawler with your new set functor, building dictionaries on top of association lists (which is what we have provided) does not scale particularly well. The implementation in dict.ml is based on lists, so operations such as looking up the value associated with a key take time linear in the size of the dictionary.

For this part of the problem set, we are going to build a different implementation of dictionaries using a kind of balanced tree called a 2-3 tree. You studied many of the properties of 2-3 trees and their relationship to red-black trees in COS 226; in this course, you will learn how to implement them in a functional language. Because our trees are guaranteed to be balanced, our insert, remove, and lookup functions will run in time logarithmic in the number of keys in the dictionary.

Before you start coding, please read this writeup, which explains the invariants in a 2-3 tree and how to implement insert and remove.

Read every word of it. Seriously. Probably at least twice.

In the file dict.ml, you will find a commented out functor BTDict which is intended to build a DICT module from a DICT_ARG module. Here is the type definition for a dictionary implemented as a 2-3 tree:


type pair = key * value

type dict = 
    | Leaf
    | Two of dict * pair * dict
    | Three of dict * pair * dict * pair * dict

Notice the similarities between this type definition of a 2-3 tree and the type definition of binary search trees we have covered in lecture. Here, in addition to nodes that contain two subtrees, we have a Three node which contains two (key,value) pairs (k1,v1) and (k2,v2), and three subtrees left, middle, and right. Below are the invariants, as stated in the handout:

2-node: Two(left,(k1,v1),right)
  1. Every key k appearing in subtree left must satisfy k < k1.
  2. Every key k appearing in subtree right must satisfy k > k1.
  3. The length of the path from the 2-node to every leaf in its two subtrees must be the same.
3-node: Three(left,(k1,v1),middle,(k2,v2),right)
  1. k1 < k2.
  2. Every key k appearing in subtree left must satisfy k < k1.
  3. Every key k appearing in subtree right must satisfy k > k2.
  4. Every key k appearing in subtree middle must satisfy k1 < k < k2.
  5. The length of the path from the 3-node to every leaf in its three subtrees must be the same.

Note that all of these bounds are strict (< and > rather than <= and >=), unlike in the handout. The handout permitted storing multiple keys that are equal to each other; in our dictionary, however, a key can be stored only once. The last invariant of each kind of node implies that our tree is balanced, that is, the length of the path from the root node to any Leaf node is the same.
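
For instance, taking keys to be ints and values to be strings (as in IntStringBTDict), the following value satisfies every invariant above: keys increase from left to right, and every Leaf sits at the same depth.

  (* A balanced 2-3 tree holding keys 1 through 4; every Leaf is at the
     same depth. *)
  let example : dict =
    Two (Three (Leaf, (1, "a"), Leaf, (2, "b"), Leaf),
         (3, "c"),
         Two (Leaf, (4, "d"), Leaf))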

Open up dict.ml and locate the commented out BTDict module.

3a: balanced

Locate the balanced function at the bottom of the BTDict implementation (right before the tests):

balanced : dict -> bool

You must write this function, which takes a dictionary (our 2-3 tree) and returns true if and only if the tree is balanced. To do so, you only need to check the path-length invariants, not the order of the keys. In addition, your solution must be efficient --- our solution runs in O(n), where n is the size of the dictionary. In a comment above the function, please explain carefully in English how you are testing that the tree is balanced.
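
One possible strategy (a sketch, and certainly not the only acceptable solution) is to compute the depth of every leaf in a single pass and check that all the depths agree:

  (* Sketch: return Some d if every Leaf in the tree is at depth d,
     and None otherwise; the tree is balanced iff the result is Some _. *)
  let rec depth_if_balanced (d : dict) : int option =
    match d with
    | Leaf -> Some 0
    | Two (l, _, r) ->
      (match depth_if_balanced l, depth_if_balanced r with
       | Some dl, Some dr when dl = dr -> Some (dl + 1)
       | _ -> None)
    | Three (l, _, m, _, r) ->
      (match depth_if_balanced l, depth_if_balanced m, depth_if_balanced r with
       | Some dl, Some dm, Some dr when dl = dm && dm = dr -> Some (dl + 1)
       | _ -> None)

  let balanced (d : dict) : bool = depth_if_balanced d <> None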

Once you have written this function, scroll down to run_tests and uncomment test_balanced ();. Next, scroll down to IntStringBTDict and uncomment those two lines. All the tests should pass. Once you are confident in your balanced function, you are required to use it in all your tests involving insert.

3b: strings, fold, lookup, member

Implement these functions according to their specifications in DICT:


val fold : (key -> value -> 'a -> 'a) -> 'a -> dict -> 'a
val lookup : dict -> key -> value option
val member : dict -> key -> bool
val string_of_key : key -> string                                       
val string_of_value : value -> string 
val string_of_dict : dict -> string 

You may change the order in which the functions appear, but you may not change the signature of any of these functions (name, order of arguments, types). You may add a "rec" to make it "let rec" if you feel that you need it.
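
As a quick illustration of fold's intended behavior (a hypothetical helper, not something you must submit), counting the bindings in a dictionary is a one-liner once fold works:

  (* Hypothetical helper built on fold: count the (key,value) bindings. *)
  let size (d : dict) : int =
    fold (fun _ _ count -> count + 1) 0 d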

3c: insert (upward phase)

Read the handout if you have not already done so. To do insertion, there are two "phases": the downward phase to find the right place to insert, and the upward phase to rebalance the tree, if necessary, as we go back up. To distinguish when we need to rebalance and when we don't, we have created a new type to represent a "kicked-up" configuration:


type kicked = 
   | Up of dict * pair * dict (* only two nodes require rebalancing *)
   | Done of dict             (* we are done rebalancing *)

The only kind of node that could potentially require rebalancing is a 2-node (we can see this pictorially: the only subtrees that require rebalancing are represented by an up arrow in the handout; only 2-nodes have up arrows), hence we make the Up constructor take the same arguments as the Two constructor. We have provided stubs for functions to perform the upward phase on a 2-node and a 3-node:

let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked = ...

let insert_upward_three (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (y: pair) (other_left: dict) (other_right: dict) : kicked = ...

Please read the specification in the source code. These functions return a "kicked" configuration containing the new balanced tree (see page 5 of the handout), and this new tree will be wrapped with "Up" or "Done", depending on whether it requires further upstream rebalancing (indicated by an upward arrow). Implement these functions according to the specification in the handout.

3d: insert (downward phase)

Now that we have the upward phase (the hard part) taken care of, we can worry about the downward phase. We have provided three mutually recursive functions to simplify the main insert function. Here are the downward phase functions:


  let rec insert_downward (d: dict) (k: key) (v: value) : kicked =
    match d with
      | Leaf -> ??? (* base case! see handout *)
      | Two(left,n,right) -> ??? (* mutual recursion! *) 
      | Three(left,n1,middle,n2,right) -> ??? (* mutual recursion! *)

  and insert_downward_two ((k,v): pair) ((k1,v1): pair) 
      (left: dict) (right: dict) : kicked = ...

  and insert_downward_three ((k,v): pair) ((k1,v1): pair) ((k2,v2): pair) 
      (left: dict) (middle: dict) (right: dict) : kicked = ...

  (* the main insert function *)
  let insert (d: dict) (k: key) (v: value) : dict =
    match insert_downward d k v with
      | Up(l,(k1,v1),r) -> Two(l,(k1,v1),r)
      | Done x -> x

Note that, like the upward phase, the downward phase also returns a "kicked-up" configuration. The main insert function simply extracts the tree from the kicked-up configuration returned by the downward phase and returns that tree.

Implement these three downward-phase functions. You will need to use the upward-phase functions you wrote in part c. Once you have written your code, you must write tests for your insert, lookup, and member functions. We have provided generator functions to generate keys, values, and pairs to help you test (see DICT_ARG). Use the test functions that we have provided for remove as an example. Remember to make run_tests () call your newly created test functions.

3e: remove (upward phase)

Like insert, remove has an upward and downward phase. However, instead of passing "kicked-up" configurations, we pass a "hole". We have provided the following type definition for a "hole":


type hole = 
   | Hole of pair option * dict
   | Absorbed of pair option * dict

Here, a "hole" type can either contain a Hole (represented by an empty circle in the handout), or the hole could have been absorbed, which we represent with Absorbed. The Hole and Absorbed constructors correspond directly to the Hole and Absorbed terminology used in 2-3 tree specification document. The constructors take the tree containing the Hole/Absorbed hole, and a pair option containing the (key,value) pair we removed. This is useful when we need to remove a (key,value) pair that is in the middle of the tree (and therefore need to replace it with the minimum (key,value) pair in the current node's right subtree). In addition to a "hole" type, we have also provided a "direction2/3" type, to indicate, for the current node, which subtree is currently the "hole":


type direction2 = 
   | Left2
   | Right2

type direction3 =
   | Left3
   | Mid3
   | Right3

We have provided the downward phase for you, so you will only need to implement the upward phase (see pages 9 and 10 of the handout):

  (* cases 1-2 *)
  let rec remove_upward_two (n: pair) (rem: pair option) 
      (left: dict) (right: dict) (dir: direction2) : hole =
    match dir,n,left,right with
      | Left2,x,l,Two(m,y,r) -> Hole(rem,Three(l,x,m,y,r))
      | Right2,y,Two(l,x,m),r -> ???
      | Left2,x,a,Three(b,y,c,z,d) -> ???
      | Right2,z,Three(a,x,b,y,c),d -> ???
      | Left2,_,_,_ | Right2,_,_,_ -> Absorbed(rem,Two(Leaf,n,Leaf))

  (* cases 3-4 *)
  let rec remove_upward_three (n1: pair) (n2: pair) (rem: pair option)
      (left: dict) (middle: dict) (right: dict) (dir: direction3) : hole =
    match dir,n1,n2,left,middle,right with
      | Left3,x,z,a,Two(b,y,c),d -> Absorbed(rem,Two(Three(a,x,b,y,c),z,d))
      | Mid3,y,z,Two(a,x,b),c,d -> ???
      | Mid3,x,y,a,b,Two(c,z,d) -> ???
      | Right3,x,z,a,Two(b,y,c),d -> ??? 
      | Left3,w,z,a,Three(b,x,c,y,d),e -> ??? 
      | Mid3,y,z,Three(a,w,b,x,c),d,e -> ???
      | Mid3,w,x,a,b,Three(c,y,d,z,e) -> ???
      | Right3,w,z,a,Three(b,x,c,y,d),e -> ???
      | Left3,_,_,_,_,_ | Mid3,_,_,_,_,_ | Right3,_,_,_,_,_ ->
        Absorbed(rem,Three(Leaf,n1,Leaf,n2,Leaf))

Once you have finished writing the upward-phase functions, uncomment the test_remove_* functions in run_tests (). All the tests should pass.

3f: choose

Implement the function according to the specification in DICT:

choose : dict -> (key * value * dict) option

Write tests to test your choose function.
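
For example (a hedged sketch, assuming choose returns None exactly on the empty dictionary, per its specification in DICT), one handy test pattern is to drain a dictionary by choosing repeatedly:

  (* Sketch: collect every binding by repeatedly choosing; if choose is
     correct, the result contains each key exactly once. *)
  let rec bindings (d : dict) : (key * value) list =
    match choose d with
    | None -> []
    | Some (k, v, rest) -> (k, v) :: bindings rest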

4. Try your crawler!

Finally, because we have properly factored out an interface, we can easily swap in your tree-based dictionaries in place of the list-based dictionaries by modifying the Make functor at the end of dict.ml. So make the change and try out your new crawler on the three sets of webpages. Our startup code prints out the crawling time, so you can see which implementation is faster. Try it on the larger test set in the wiki directory in addition to the html and simple-html directories with:


./moogle.d.byte 8080 45 wiki/Teenage_Mutant_Ninja_Turtles

If you are confident everything is working, try changing 45 to 200. (This may take several minutes to crawl and index). You're done!

We've provided a debug flag at the top of moogle.ml that can be set to either true or false. If debug is set to true, Moogle prints out debugging information using the string conversion functions defined in your dictionary and set implementations. The flag is set to true by default.

5. Pagerank

Only do this part when you know for sure the rest of your problem set is working. Making too many changes all at once is a recipe for disaster. Always keep your code compiling and working correctly. Make changes one module at a time.

We will now apply our knowledge of ADTs and graphs to explore solutions to a compelling problem: finding "important" nodes in graphs like the Internet, or the set of pages that you're crawling with Moogle.

The concept of assigning a measure of importance to nodes is very useful in designing search algorithms, such as those that many popular search engines rely on. Early search engines often ranked the relevance of pages based on the number of times that search terms appeared in the pages. However, it was easy for spammers to game this system by including popular search terms many times, propelling their results to the top of the list.

When you enter a search query, you really want the important pages: the ones with valuable information, a property often reflected in the quantity and quality of other pages linking to them. Better algorithms were eventually developed that took into account the relationships between web pages, as determined by links. (For more about the history of search engines, you can check out this page.) These relationships can be represented nicely by a graph structure, which is what we'll be using here.

NodeScore ADT

Throughout the assignment, we'll want to maintain associations of graph nodes to their importance, or NodeScore: a value between 0 (completely unimportant) and 1 (the only important node in the graph).

In order to assign NodeScores to the nodes in a graph, we've provided a module with an implementation of an ADT, NODE_SCORE, to hold such associations. The module makes it easy to create, modify, normalize (to sum to 1), and display NodeScores. You can find the module signature and implementation in nodescore.ml.

NodeScore Algorithms

In this section, you'll implement a series of NodeScore algorithms in different modules: that is, functions rank that take a graph and return a node_score_map on it. As an example, we've implemented a trivial NodeScore algorithm in the IndegreeRanker module that gives all nodes a score equal to the number of incoming edges.

5a: Random Walk Ranker, or Sisyphus walks the web

In this section, we will walk you through creating a robust ranking algorithm based on random walks. The writeup goes through several steps, and we encourage you to do your implementation in that order. However, you will only submit the final version.

Part 1

You may realize that we need a better way of saying that nodes are popular or unpopular. In particular, we need a method that considers global properties of the graph and not just edges adjacent to or near the nodes being ranked. For example, there could be an extremely relevant webpage that is several nodes removed from the node we start at. That node might normally fare pretty low in our ranking system, but perhaps it should rank higher, given that there is a high probability it could be reached while browsing around the internet.

So consider Sisyphus, doomed to crawl the Web for eternity: or more specifically, doomed to start at some arbitrary page, and follow links randomly. (We assume for the moment that the Web is strongly connected and that every page has at least one outgoing (that is, non-self) link, unrealistic assumptions that we will return to address soon.)

Let's say Sisyphus can take k steps after starting from a random node. We design a system that determines NodeScores based on how likely Sisyphus is to reach a certain page. In other words, we ask: where will Sisyphus spend most of his time?

Step 1

Implement rank in the RandomWalkRanker functor in pagerank.ml, which takes a graph, and returns a normalized nodescore on the graph where the score for each node is proportional to the number of times Sisyphus visits the node in num_steps steps, starting from a random node. If the graph is empty, you can return the empty node score map. For now, your function may raise an exception if some node has no outgoing edges. Note that num_steps is passed in as part of the functor parameter P.

You may find the library function Random.int useful, and may want to write some helper functions to pick random elements from lists (don't forget about the empty list case).
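
For instance, a helper along these lines (the name and exact shape are ours, not part of the provided code) would let you pick a random neighbor or a random node:

  (* Sketch: pick a uniformly random element of a list, or None if empty. *)
  let random_element (lst : 'a list) : 'a option =
    match lst with
    | [] -> None
    | _ -> Some (List.nth lst (Random.int (List.length lst)))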

Step 2

Our Sisyphean ranking algorithm does better at identifying important nodes according to their global popularity, rather than being fooled by local properties of the graph. But what about the error condition we mentioned above: that a node might not have any outgoing edges? In this case, Sisyphus has reached the end of the Internet. What should we do? A good solution is to jump to some random page and surf on from there.

Modify your rank function so that instead of raising an error at the end of the Internet, it has Sisyphus jump to a random node. Note that these forced jumps should count as steps, because you want the algorithm to terminate even if there are no edges.

Step 3

This should work better. But what if there are very few pages that have no outgoing edges -- or worse, what if there is a set of vertices with no outgoing edges to the rest of the graph, but there are still edges between vertices in the set? Sisyphus would be stuck there forever, and that set would win the node popularity game hands down, just by having no outside links. What we'd really like to model is the fact that Sisyphus doesn't have an unlimited attention span, and starts to get jumpy every now and then...

Modify your rank function so that if the module parameter P.do_random_jumps is Some alpha, then with probability alpha at every step, Sisyphus will jump to a random node in the graph, regardless of whether the current node has outgoing edges. (If the current node has no outgoing edges, Sisyphus should still jump to a random node in the graph, as before.) Again, these random jumps should count as steps, because we don't want the algorithm to run forever as do_random_jumps approaches Some 1.
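
One way to structure that decision (a sketch, assuming P.do_random_jumps is a float option as the description above suggests) is a small helper consulted at each step:

  (* Sketch: does this step begin with a forced random jump? Assumes
     P.do_random_jumps : float option. *)
  let should_jump () : bool =
    match P.do_random_jumps with
    | Some alpha -> Random.float 1.0 < alpha
    | None -> false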

Don't forget to add some testing code. As mentioned in the "Testing" section above, possible approaches include running the algorithm by hand a number of times and verifying that the results seem reasonable, or writing an approx-equal function to compare the result of a many-step run with what you'd expect to find.
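
If you take the approx-equal route, something as simple as the following sketch (the tolerance is an arbitrary choice) is usually enough for comparing NodeScores:

  (* Sketch: compare two floating-point scores up to a small tolerance. *)
  let approx_equal ?(epsilon = 0.001) (x : float) (y : float) : bool =
    abs_float (x -. y) < epsilon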

Submit your final version of RandomWalkRanker once you have tested it out.

5b: Using your new page rank algorithm

At the top of crawl.ml is a definition of a MoogleRanker module that's used to order the search results. Replace the call to the IndegreeRanker functor with a call to the RandomWalkRanker you wrote, and experiment with how it changes the results.

6. Partnership and Submission Guidelines

If you are completing this assignment with a partner, the overarching principle is that each partner is responsible for equally sharing in all elements of the assignment. This precludes "divide and conquer" approaches to the assignment. It also requires introspection to prevent a situation in which you are "carrying" your partner or allowing yourself to be "carried".

We believe that the COS126 partner collaboration policy is a good guideline with lots of good advice. We will not be quite so draconian with respect to an absolute prohibition on any attempt to make progress without being physically colocated, however. Two examples that we will permit that would not be allowed under the COS126 policy are: working on the assignment together remotely via the internet (e.g. via FaceTime, Hangouts, or any of various screen-sharing applications); and working on the assignment in office hours that only one partner can attend (though we do encourage you to find a mutually-agreeable office hour to attend!).

NOTE: both students in a pair will receive the same grade on the assignment. Following from this, both students in a pair must use the same number of automatically waived late days. This means that a pair has available the minimum number of free late days remaining between the two of them. (A partner with more free late days available would retain the excess for future individual assignments or assignments paired with another student with free late days remaining.)

To submit, use the TigerFile link posted next to the assignment link. All files are required, including the readme -- even if you aren't partnered or giving feedback.

If you have ideas for a new Princeton-based set of web pages that could be downloaded and used next year for testing and experimentation, please let us know in your readme. The set of pages cannot be too large.

If you have any additional comments or suggestions concerning this assignment and how to make it better in the future, also include those in your readme.

Acknowledgements

Thanks to Greg Morrisett at Cornell for the core components (and the theme!) of this assignment!