1.6   Case Study:   PageRank


This section is under construction.


Communicating across the web has become an integral part of everyday life. This communication is enabled in part by scientific studies of the structure of the web. We consider a simple model, known as the random surfer model. We consider the web to be a fixed set of pages, with each page containing a fixed set of hyperlinks, and each link a reference to some other page. We study what happens to a person (the random surfer) who randomly moves from page to page, either by typing a page name into the address bar or by clicking a link on the current page.

Tiny web graph

The model.

The first step in studying the random surfer model is to formulate it more precisely. The crux of the matter is to specify what it means to randomly move from page to page. The following intuitive 90-10 rule captures both methods of moving to a new page: Assume that 90 per cent of the time the random surfer clicks a random link on the current page (each link chosen with equal probability) and that 10 percent of the time the random surfer goes directly to a random page (all pages on the web chosen with equal probability).

You can immediately see that this model has flaws, because you know from your own experience that the behavior of a real web surfer is not quite so simple:

Despite these flaws, the model is sufficiently rich that computer scientists have learned a great deal about properties of the web by studying it. To appreciate the model, consider the small example above. Which page do you think the random surfer is most likely to visit?

Input format.

We assume that there are N web pages, numbered from 0 to N-1, and we represent links with ordered pairs of such numbers, the first specifying the page containing the link and the second specifying the page to which it refers. Given these conventions, a straightforward input for- mat for the random surfer problem is an input stream consisting of an integer (the value of N) followed by a sequence of pairs of integers (the representations of all the links).

Input format

Transition matrix.

We use a two-dimensional matrix, that we refer to as the transition matrix, to completely specify the behavior of the random surfer. With N web pages, we define an N-by-N matrix such that the entry in row i and column j is the probability that the random surfer moves to page j when on page i. Transition.java is a filter that converts the list-of-links representation of a web model into a transition-matrix representation.

Transition matrix computation

Simulation.

RandomSurfer.java simulates the behavior of the random surfer. It reads a transition matrix and surfs according to the rules, starting at page 0 and taking the number of moves as a command-line argument. It counts the number of times that the surfer visits each page. Dividing that count by the number of moves yields an estimate of the probability that a random surfer winds up on the page. This probability is known as the page's rank.

One random move. The key to the computation is the random move, which is specified by the transition matrix: each row represents a discrete probability distribution - the entries fully specify the behavior of the random surfer's next move, giving the probability of surfing to each page.

Generating a random integer from a discrete distribution

Program RandomSurfer.java is an improved version that uses two library methods - StdRandom.discrete() and StdArrayIO.readDouble2D() that we will introduce in Section 2.2.

Markov chains. The random process that describes the surfer's behavior is known as a Markov chain. Markov chains are widely applicable, well-studied, and have many remarkable and useful properties. For example, a basic limit theorem for Markov chains says that our surfer could start anywhere, because the probability that a random surfer eventually winds up on any particular page is the same for all starting pages! No matter where the surfer starts, the process eventually stabilizes to a point where further surfing provides no further information. This phenomenon is known as mixing.

Page ranks. The RandomSurfer simulation is straightforward: it loops for the indicated number of moves, randomly surfing through the graph. Because of the mixing phenomenon, increasing the number of iterations gives increasingly accurate estimates of the probability that the surfer lands on each page (the page ranks). Accurate page rank estimates for the web are valuable in practice for many reasons. First, using them to put in order the pages that match the search criteria for web searches proved to be vastly more in line with people's expectations than previous methods.

Visualizing the histogram. With StdDraw, it is also easy to create a visual representation that can give you a feeling for how the random surfer visit frequencies converge to the page ranks. RandomSurferHistogram.java draws a frequency histogram that eventually stabilizes to the page ranks.

Page ranks and histogram

Mixing a Markov chain.

Directly simulating the behavior of a random surfer to understand the structure of the web is appealing, but it has limitations. Think about the following question: Could you use it to compute page ranks for a web model with millions (or billions!) of web pages and links? The quick answer to this question is no, because you can- not even afford to store the transition matrix for such a large number of pages. A matrix for millions of pages would have trillions of entries. Do you have that much space on your computer? Could you use RandomSurfer to find page ranks for a smaller model with, say, thousands of pages? To answer this question, you might run multiple simulations, record the results for a large number of trials, and then interpret those experimental results. This can be very time-consuming, as a huge number of trials may be necessary to get the desired accuracy. Even for our tiny example, we saw that it takes millions of iterations to get the page ranks accurate to three or four decimal places.

It is important to remember that the page ranks are a property of the web model, not any particular approach for computing it. That is, RandomSurfer is just one way to compute page ranks. Fortunately, we can compute the same quantity more efficiently by using linear algebra.

Program Markov.java mixes an example Markov chain by repeatedly squaring the transition matrix.

Cleve Moler, founder of Matlab, described Google's Pagerank computation as "The World's Largest Matrix Computation."

Exact page ranks: 1/1570055 [428671, 417205, 229519, 388162, 106498]

Page ranks for medium.txt

Q + A

Q. What should row of transition matrix be if some page has no outlinks?

A. To make the matrix stochastic (all rows sum to 1), make that page equally likely to transition to every other page.

Q. How long until convergence of Markov?

A. Brin and Page report that only 50-100 iterations are need before the iterates converge. Convergence depends on the second largest eigenvalue of P λ2. The link structure of the Web is such λ2 is (approximately) equal to α = 0.9. Since .9^50 = .005153775207, we expect to have 2-3 digits of accuracy after 50 iterations.

Q. Any recommended readings on PageRank?

A. Here's a nice article from AMS describing PageRank.

Q. Why add the random page / teleportation component?

A. If not, random surfer can get stuck in part of the graph. More technical reason: makes the Markov chain ergodic.

Exercises

  1. Modify Transition.java to take the leap probability from the command line and use your modified version to examine the effect on page ranks of switching to an 80-20 rule or a 95-5 rule.
  2. Modify Transition.java to ignore the effect of multiple links. That is, if there are multiple links from one page to another, count them as one link. Create a small example that shows how this modification can change the order of page ranks.
  3. Modify Transition.java to handle pages with no outgoing links, by filling rows corresponding to such pages with the value 1/N.
  4. The code fragment in RandomSurfer.java that generates the random move fails if the probabilities in the row p[page] do not add up to 1. Explain what happens in that case, and suggest a way to fix the problem.
  5. Determine, to within a factor of 10, the number of iterations required by RandomSurfer to compute page ranks to four decimal places and to five decimal places for tiny.txt.
  6. Determine the number of iterations required by Markov.java to compute page ranks to three decimal places, to four decimal places, and to ten decimal places for tiny.txt.
  7. Download the file medium.txt from the booksite (which reflects the 50-page example depicted in this section) and add to it links from page 23 to every other page. Observe the effect on the page ranks, and discuss the result.
  8. Add to medium.txt links to page 23 from every other page, observe the effect on the page ranks, and discuss the result.
  9. Suppose that your page is page 23 in medium.txt. Is there a link that you could add from your page to some other page that would raise the rank of your page?
  10. Suppose that your page is page 23 in medium.txt. Is there a link that you could add from your page to some other page that would lower the rank of that page?
  11. Use Transition.java and RandomSurfer.java to determine the transition probabilities for the eight-page example shown below.
  12. Use Transition.java and Markov.java 1.6.12 Use Transition and Markov to determine the transition probabilities for the eight-page example shown below.

    Eight-page web graph

Creative Exercises

  1. Matrix squaring. Write a program like Markov.java that computes page ranks by repeatedly squaring the matrix, thus computing the sequence p, p 2, p 4, p 8, p 16, and so forth. Verify that all of the rows in the matrix converge to the same values.
  2. Random web. Write a generator for Transition.java that takes as input a page count N and a link count M and prints to standard output N followed by M random pairs of integers from 0 to N-1.
  3. Hubs and authorities. Add to your generator from the previous exercise a fixed number of hubs, which have links pointing to them from 10% of the pages, chosen at random, and authorities, which have links pointing from them to 10% of the pages. Compute page ranks. Which rank higher, hubs or authorities?
  4. Page ranks. Design an array of pages and links where the highest-ranking page has fewer links pointing to it than some other page.
  5. Hitting time. The hitting time for a page is the expected number of moves between times the random surfer visits the page. Run experiments to estimate page hitting times for tiny.txt, compare with page ranks, formulate a hypothesis about the relationship, and test your hypothesis on medium.txt. Fact: The hitting time = 1 / PageRank.
  6. Cover time. Write a program that estimates the time required for the random surfer to visit every page at least once, starting from a random page.
  7. Graphical simulation. Create a graphical simulation where the size of the dot representing each page is proportional to its rank. To make your program data-driven, design a file format that includes coordinates specifying where each page should be drawn. Test your program on medium.txt.