COS 226 Programming Assignment Checklist: Small World Phenomenon

Special collaboration policy for Fall '03: for this assignment, you are permitted to work jointly with one classmate. If you choose to do so, you and your partner should submit your code and the readme.txt jointly (under one login name only). You and your partner will receive the same grade.

Frequently Asked Questions

What's the maximum number of movies and actors? The Internet Movie Database contains less than 125,000 movies and 500,000 actors. We won't test your program on anything larger than this.

What if there are multiple chains of the same length. Which one do I print out? Yes, in general there will be many paths of the same length from an actor to Kevin Bacon. Print out any one you like.

Do I always use "Kevin Bacon" as the source? Yes, although you're welcome to experiment with others. Apparently, Christopher Lee is the center of the Hollywood Universe. For your final submission, we recommend something like

#define SOURCE "Bacon, Kevin"

What do I print if an actor is not in the same connected component as Kevin Bacon? Print that their Bacon number is "infinity." Or if the actor does not even appear in the database, you can print out a message to this effect.

When I run my program, I get a segmentation fault or bus error in tst-index.c. Who's to blame? Probably you. It's very likely that you a memory allocation error or an array out-of-bounds access error somewhere else in your program and the error remains hidden for a while. Errors that result from trampling memory can be difficult to debug for exactly this reason.

Input, Output, and Testing

Input and output. Here are a number of sample input files. The numbers below are slightly off because we are now editing the data files to remove movies with no actors and remove '/' characters from movie titles.

File Movies Actors Description
input-bacon.txt 46 1,498 With Kevin Bacon
input-top-grossing.txt 187 8,265 Top grossing
input-hero.txt 193 2,513 Contains string "hero"
input-mpaa-g.txt 967 13,850 Rated G by MPAA
input-year2000.txt 4,757 43,940 Released in 2000
input-mpaa.txt 14,192 170,539 Rated by MPAA
input-all.txt 122,938   418,577   All

Our data is taken from the Internet Movie Database. Feel free to download the full 850MB+ database for yourself. In addition to cast lists, it has movie reviews, audio clips, ratings, etc. If you want to create interesting input files (in our format), send them to us.

Extra credit. Create your own movie database file. If it's sufficiently interesting, we'll award extra credit and post them for other students to use. Feel free to create a database of other social relationships. For example, get the Registrar to release a roster list of all classes taken at Princeton since 1900 and use those to compute your Woodrow Wilson number. :) Or connnect Kazaa users if they've downloaded/uploaded a file.

Reference solutions.   For reference, we provide executable code for our solution in Windows, Solaris, and OS X. As usual it's a challenge to beat the reference solutions, but it has been done several times in the past! Note that your solution to queries on input-all.txt will not agree precisely with The Oracle of Bacon because our input file includes TV movies and excludes a few movies that our perl script doesn't parse.

Submission and readme

We will compile your program with "gcc226 *.c" and execute it with "a.out moviedb.txt < actors.txt"so you are free to organize your program as you see fit. As usual, we encourage and reward sound design principles.

Here is the template readme.txt file. It is required to use it in your submission. It contains the following information:

  • A high-level description for the design of your algorithm, and what influenced your decisions.
  • Unix users may find the shell script timeit.csh useful for computing a table of CPU times.

    Getting started

  • Download the directory bacon. This directory is mirrored on arizona at /u/cs226/pub/bacon. It contains reference solutions and sample input files.

  • Successfully read in and parse the input. The program parse.c reads in the movie database from a command line input. Use this as a starting point and add code to read in the list of actors from standard input.

  • A useful next step is to store all of the actors from the movie database into an associative index symbol table, perhaps as in Sedgewick Program 17.10 (this is a TST, a ternary search tree which was not studied in class, and you are encouraged to read about it in the textbook - it is very easy to understand if you already understand TRIEs). You can also implement a regular TRIE, a hash table or any other reasonable solution (in terms of time and space efficiency). Program 17.10 can be found at the following link: tst-index.c and tst-index.h.
  • Enrichment links

  • The Erdös number is a statistic very similar to the Kevin Bacon number, but it involves academic journal publications and mathematicians instead of movies and actors. Two mathematicians are linked if they have co-authored a journal paper. Paul Erdös was an incredibly prolific 20th century mathematician, and he plays the role of Kevin Bacon in this domain. The Erdös number project is dedicated to this problem.

  • Here's the same game played with AIM buddy lists.

  • The Black Sabbath number is another variant dealing with rock bands. Two bands are linked if a common musician performed for both groups. (This is not quite analogous to the Kevin Bacon number since here we are linking the analogs of movies instead of actors.) As an example, AC/DC has a Black Sabbath number of 2 since Simon Wright used to be the drummer of AC/DC and Wright also played drums for the band Dio; Ronnie James Dio is the singer for Dio and also sang for Black Sabbath. The Black Sabbath Game is dedicated to this problem.

  • COS 226 Home Page
    wayne@cs.princeton.edu