COS 402: Artificial Intelligence

Homework #6

Cat and mouse

Fall 2004

Due: Wednesday, December 8


Special late policy (modified):  In counting late days, we will skip the first three days of break, beginning with Friday, December 10.  In other words, homeworks turned in by 11:59pm on Thursday, December 9 will count as one day late (as usual), but homeworks turned in after this time, but before 11:59pm on Monday, December 13 will count as only two days late.  As usual, no homeworks will be accepted more than five "days" late, which means they will not be accepted after Thursday, December 16.

For this homework only, if you are out of the Princeton area over break, you may mail or email the written part of the assignment to Zafer. If mailed, your homework is considered submitted on the post mark date, and should be sent to this address: Zafer Barutcuoglu, Princeton University, Department of Computer Science, 35 Olden Street, Princeton, NJ 08544. (It would be wise to send him email at the same time you mail your assignment so that he can look out for it; also, save a photocopy of your work.)


Part I: Written Exercises

The written exercises are available here in pdf.

Part II:  Programming

In this assignment, you will use some of the algorithms for computing optimal policies in Markov decision processes (MDP's) to help a mouse escape from a cat while trying to find cheese to eat.

The main components of the assignment are the following:

  1. implement the policy iteration and value iteration algorithms for computing optimal policies, as well as a policy evaluation algorithm; and
  2. run your code and explore its performance on a simulated cat-and-mouse world.

Implementing MDP algorithms

The first part of the assignment is to implement three of the algorithms that were discussed in lecture as well as R&N (sections 17.2 and 17.3): value iteration, policy iteration and policy evaluation.  Your implementations should work for any (reasonable) MDP, not just the ones provided.  See below for a description of where you should write your code.
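To fix ideas, value iteration repeatedly applies the Bellman update U(s) &lt;- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s') until the utilities stop changing. The following is only a rough, self-contained sketch on a made-up two-state MDP, not the required class structure; it uses a dense transition array for simplicity, whereas the provided Mdp class uses a sparse representation (described below).

```java
public class ValueIterationSketch {
    // Bellman update: U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s').
    // Repeats until the largest change falls below tol; returns the utilities.
    static double[] valueIteration(double[] reward, double[][][] prob,
                                   double gamma, double tol) {
        int n = reward.length;
        double[] U = new double[n];
        double delta;
        do {
            delta = 0.0;
            double[] Unew = new double[n];
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < prob[s].length; a++) {
                    double q = 0.0;                       // expected utility of (s, a)
                    for (int s2 = 0; s2 < n; s2++)
                        q += prob[s][a][s2] * U[s2];
                    best = Math.max(best, q);
                }
                Unew[s] = reward[s] + gamma * best;
                delta = Math.max(delta, Math.abs(Unew[s] - U[s]));
            }
            U = Unew;
        } while (delta > tol);
        return U;
    }

    public static void main(String[] args) {
        // Toy MDP: state 0 (reward 0) can move to absorbing state 1 (reward 1).
        // With gamma = 0.5, the optimal utilities are U(1) = 1/(1-0.5) = 2 and
        // U(0) = 0.5 * 2 = 1.
        double[] reward = {0.0, 1.0};
        double[][][] prob = {
            {{0.0, 1.0}, {1.0, 0.0}},  // from 0: action 0 -> state 1, action 1 -> stay
            {{0.0, 1.0}, {0.0, 1.0}},  // state 1 is absorbing under both actions
        };
        double[] U = valueIteration(reward, prob, 0.5, 1e-9);
        System.out.println(U[0] + " " + U[1]);  // approaches 1.0 and 2.0
    }
}
```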

Playing cat and mouse

The second part of the assignment is to use your code to compute optimal policies for a mouse who lives with a cat in the following grid world:

The mouse (M) can occupy any of the 31 blank squares.  The cat (C) also can occupy any square, except for square (6,1) which is the mouse's hole (too small for the cat to squeeze in).  There is cheese (z) that, at any given time, can be found in one of the following four squares: (2,3), (4,1), (5,5) and (7,3).  Thus, this MDP has 31*30*4=3720 states.

The cat and the mouse can each move one square in any direction -- vertically, horizontally or diagonally.  They also can choose not to move at all.  Thus, there are nine possible moves from each square.  If an action is attempted that causes the creature to bump into a wall, then it simply stays where it is.

In this problem, we will always take the point of view of the mouse.  When the mouse is on the cheese, it receives a reward of 1.  When the cat is on the mouse, it (the mouse) receives a reward of -11.  When the cat is on the mouse, and the mouse is on the cheese, the reward is -10.  All other configurations have a reward of 0.  Thus, the mouse is trying to eat cheese while simultaneously avoiding the cat.

We will consider three different cats.  The first cat, poor thing, is blind and oblivious, and simply wanders randomly around its environment, choosing randomly among its nine available actions at every step.  The second cat is hungry, alert and unrelenting.  This cat always heads straight for the mouse, following the shortest path possible.  Thus, after the mouse makes its move, the cat chooses the action that will move it as close as possible to the mouse's new position.  (If there is a tie among the cat's best available options, the cat chooses randomly among these equally good best actions.)  However, when the mouse is in its hole and out of sight, the cat reverts to aimless (i.e., random) wandering.  The third cat is also alert, but has a more sporting disposition, and therefore follows a combination of these two strategies: half the time, it wanders aimlessly, and half the time, it heads straight for the mouse.  Machine-readable (and compressed) descriptions of the three MDP's corresponding to these three cats are provided in the files oblivious.mdp.gz, unrelenting.mdp.gz and sporting.mdp.gz.

There is always cheese available in exactly one of the four locations listed above.  At every time step, the cheese remains where it is with 90% probability.  With 10% probability, the cheese vanishes and reappears at one of the four locations, chosen randomly.

States are encoded as six tuples, the first two numbers indicating the position of the mouse, the second two numbers the position of the cat, and the last two numbers the position of the cheese.  Thus, 4:2:2:4:7:3 indicates, as depicted in the figure above, that the mouse is in (4,2), the cat is in (2,4), and the cheese is in (7,3).  The cat and the mouse alternate moves.  However, in encoding the MDP, we collapse both moves into a single state transition.  In addition, the cheese, when it moves, does so simultaneously with the mouse's move.  For instance, from the configuration above, if the mouse moves to (5,2) and the cat responds by moving to (3,3), while the cheese moves to (5,5), this all would be encoded as a single transition from state 4:2:2:4:7:3 to 5:2:3:3:5:5.  Actions in the MDP refer to actions that the mouse can make; the cat's actions are effectively "hard-wired" into the dynamics of the MDP itself.
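Should you want to inspect states programmatically, a name like 4:2:2:4:7:3 splits cleanly into its six coordinates. This helper is purely illustrative; nothing in the provided code requires you to parse names this way.

```java
public class StateName {
    // Split a cat-and-mouse state name such as "4:2:2:4:7:3" into
    // {mouseX, mouseY, catX, catY, cheeseX, cheeseY}.
    static int[] parse(String name) {
        String[] parts = name.split(":");
        int[] coords = new int[parts.length];
        for (int i = 0; i < parts.length; i++)
            coords[i] = Integer.parseInt(parts[i]);
        return coords;
    }

    public static void main(String[] args) {
        int[] c = parse("4:2:2:4:7:3");
        System.out.println("mouse=(" + c[0] + "," + c[1] + ")"
            + " cat=(" + c[2] + "," + c[3] + ")"
            + " cheese=(" + c[4] + "," + c[5] + ")");
    }
}
```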

For each of the three cats, your job will be to compute the mouse's optimal policy, i.e., the action that should be taken at each state to maximize the mouse's expected discounted reward, where we fix the discount factor (gamma) to be 0.95.  You can then watch the cat and the mouse go at it using a primitive animator that we are providing.

Exploring optimal behavior

Once you have everything working, you should take some time to understand the mouse's behavior in each of the three cases.  Then briefly write up your observations in 2-4 paragraphs (all together, not per case).  As on HW#5, you can make observations that are anecdotal, quantitative, or both.  Think about questions such as the following:  How is the mouse behaving in each case, and why does this behavior make sense?  What sorts of intelligent behavior appear to be emerging?  Is the behavior what you would have expected, or are there aspects of the mouse's behavior that you found surprising?  What happens if you take the mouse's optimal policy for cat A and have the mouse follow that policy when playing against cat B?  Since there are three cats, there are nine possibilities here.  Construct a matrix showing the utility (say, averaged over all possible starting states) for all nine possibilities.  Which policy should the mouse follow if it doesn't know which cat it's up against?  Why?  You may also wish to explore the performance of the algorithms themselves.  For instance, how many iterations does it take to reach convergence?  How is performance affected if we stop iterating one of the algorithms before we reach convergence?  What happens if we vary gamma?

The code we are providing

We are providing a class called Mdp that loads a description of an MDP from a file whose format is described shortly.  The class has a constructor taking a file name as argument that reads in the MDP description from the file.  Each state is represented by an integer in the range 0 (inclusive) to numStates (exclusive), where numStates is the total number of states.  Actions are represented in a similar fashion.  The stateName and actionName arrays convert these integers back to strings in the obvious fashion.  The reward array gives the immediate reward of each state.  For efficiency, state transition probabilities are represented in a sparse fashion to take advantage of the fact that most states have a relatively small number of possible successors.  In particular, the array nextState[s][a] represents a list of possible next states which may follow state s under action a.  The understanding here is that the probability of transitioning to any other state is zero.  There is a corresponding list of probabilities transProb[s][a].  Thus, transProb[s][a][i] is the probability of transitioning from s to nextState[s][a][i] under action a.
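Concretely, the expected utility of taking action a in state s (the quantity maximized in the Bellman update) only needs to loop over the sparse successor list. Here is a hedged sketch using arrays shaped like the Mdp fields just described; the arrays in main are hand-built stand-ins, not the actual class.

```java
public class SparseBackup {
    // Q(s,a) = sum_i transProb[s][a][i] * U[nextState[s][a][i]].
    // Only the listed successors are visited; all other states have
    // transition probability zero.
    static double qValue(int s, int a, int[][][] nextState,
                         double[][][] transProb, double[] U) {
        double q = 0.0;
        for (int i = 0; i < nextState[s][a].length; i++)
            q += transProb[s][a][i] * U[nextState[s][a][i]];
        return q;
    }

    public static void main(String[] args) {
        // Two states, one action: from state 0, go to 1 w.p. 0.9, stay w.p. 0.1.
        int[][][] nextState = {{{1, 0}}, {{1}}};
        double[][][] transProb = {{{0.9, 0.1}}, {{1.0}}};
        double[] U = {1.0, 2.0};
        System.out.println(qValue(0, 0, nextState, transProb, U));  // 0.9*2 + 0.1*1 = 1.9
    }
}
```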

Each line of an MDP description file has one of three possible formats.  The first possibility is that the line has the simple form

<state>

where <state> is any name (no white space).  Such a line indicates that <state> is the start state of the MDP.

The second possibility is that the line has the form

<state>  <reward>

where <state> is any name, and <reward> is a number.  Such a line indicates that the reward in <state> is <reward>.

The last possibility is that the line has the form

<from_state>  <action>  <to_state>  <prob>

where the first three tokens are names, and <prob> is a nonnegative number.  Such a line indicates that the probability of transitioning from <from_state> to <to_state> under action <action> is <prob>.  Multiple lines of this form with the same <from_state> and <action> can be combined.  Thus,

u  a  v 0.3 u  0.4 w 0.1

is equivalent to the three lines

u  a v 0.3
u  a u 0.4
u  a w 0.1

Lines of these forms can appear in the file in any order.  However, if a reward is given more than once for the same state, then the last reward value given is assigned to that state.  On the other hand, if more than one probability is given for the same state-action-state triple, then the sum of these probabilities is assigned to this transition.  If the probabilities of transitioning out of a given state under a given action do not add up to one, then these numbers are renormalized so that they do.  However, if this sum is zero, an exception is thrown.  States that are not given a reward in the description file are assumed to have a reward of zero; likewise, transitions that do not appear in the file are assumed to have zero probability.  An exception is also thrown if no start state is declared (using the first form above); if more than one such line appears, then the last one is taken to define the start state.
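The accumulate-then-renormalize rule for transition probabilities can be sketched as follows. This is only an illustration of the rule stated above, not the code of the provided Mdp loader.

```java
import java.util.*;

public class NormalizeTransitions {
    // For a fixed (from_state, action) pair: sum duplicate probabilities given
    // for the same successor, then renormalize so they add up to one.
    // A zero total is an error, matching the exception described in the text.
    static Map<String, Double> normalize(List<String> toStates, List<Double> probs) {
        Map<String, Double> summed = new LinkedHashMap<>();
        for (int i = 0; i < toStates.size(); i++)
            summed.merge(toStates.get(i), probs.get(i), Double::sum);
        double total = 0.0;
        for (double p : summed.values()) total += p;
        if (total == 0.0)
            throw new IllegalArgumentException("zero total probability");
        for (Map.Entry<String, Double> e : summed.entrySet())
            e.setValue(e.getValue() / total);
        return summed;
    }

    public static void main(String[] args) {
        // Lines "u a v 0.3", "u a v 0.3", "u a w 0.2" yield
        // v: 0.6/0.8 = 0.75 and w: 0.2/0.8 = 0.25.
        Map<String, Double> p = normalize(
            Arrays.asList("v", "v", "w"), Arrays.asList(0.3, 0.3, 0.2));
        System.out.println(p);
    }
}
```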

For instance, the following encodes an MDP with five states, -2, -1, 0, +1 and +2.  There are two actions, L(eft) and R(ight), which cause the MDP to move to the next state to the left or right 90% of the time, or to move in the opposite direction 10% of the time (with movement impeded at the end points -2 and +2).  The start state is 0.  The reward is 1 in 0, -1 in -2, and -2 in +2.

0
0 1
0 L -1 0.9 +1 0.1
0 R +1 0.9 -1 0.1
-1 L -2 0.9 0 0.1
-1 R 0 0.9 -2 0.1
-2 -1
-2 L -2 0.9 -1 0.1
-2 R -1 0.9 -2 0.1
+1 R +2 0.9 0 0.1
+1 L 0 0.9 +2 0.1
+2 -2
+2 R +2 0.9 +1 0.1
+2 L +1 0.9 +2 0.1

If you want to try out or test your program on MDP's of your own invention (which is highly recommended), you can do so by creating files of this form.  Alternatively, you can add a new constructor to the Mdp class that will fill in the public fields with a description of your MDP.

Because the files being provided for the cat and mouse MDP's are quite large, we are providing them in a compressed (gzipped) form.  Note that the MDP class will automatically read a file that has been gzipped, provided that it ends with a ".gz" suffix.  In other words, there is no need to uncompress the gzipped files that we are providing prior to running your code on them.  (Of course, the code will also work on files that have not been compressed.)  In addition, because these MDP's are somewhat large, you may need to increase the memory available to java (the option for doing this is system dependent, but it often has the form -Xmx256m which increases the memory to 256 megabytes).

The class CatMouseAnimator can be used to "animate" a policy on one of the cat-and-mouse MDP's.  When the method animate is called with a policy, the policy is followed on the MDP provided at the time of construction.  Each shot of the animation looks something like the following:

+===============+
| . . . . . . . |
| . C . . .   . |
| . . . . . . z |
| . . . M .     |
| . . . . . _   |
+===============+

The letters M, C and z indicate the position of the mouse, cat and cheese.  The mouse's hole is indicated with an underscore.  The mouse vanishes when the cat is on top of it; likewise, the cheese can be obscured by either the cat or the mouse.

The class CatMouse consists only of a main that loads an MDP from a file, finds and prints the optimal policy and runs the animator (unless the second argument is omitted).  Before starting the animator, four columns are printed following each state: the optimal action under the policy found using value iteration; the optimal action under the policy found using policy iteration; the optimal utility as estimated using value iteration; and the estimated utility of the optimal policy using policy evaluation.  For instance, on the simple five-state MDP given above, the output should look something like this:

Optimal policies:
0    L    L    8.169018052871904    8.169018052871904
-1   R    R    7.5578670677920545   7.5578670677920545
+1   L    L    7.442544315827016    7.442544315827016
-2   R    R    6.035332975545492    6.035332975545492
+2   L    L    4.821409270650342    4.821409270650342

In this case, as expected, the optimal policy says to move to the center (right if in -1 or -2, and left if in +1 or +2), and to move left (away from the larger penalty state +2) if in the center.

You might well decide to modify or replace the CatMouse class.

To debug your code, you will surely want to run it on small MDP's.  There are a number of "sanity checks" that can be done to see if your code is working.  For instance, value iteration and policy iteration should output identical policies (up to possible ties between equally good actions).  In addition, the utility of the optimal policy must satisfy the Bellman equations.
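The Bellman check can be automated: compute the residual max_s |U(s) - (R(s) + gamma * max_a Q(s,a))|, which should be (near) zero for the utilities of an optimal policy. A sketch using the sparse array shapes described earlier, again on a hand-built toy MDP rather than the provided class:

```java
public class BellmanCheck {
    // Largest violation of the Bellman optimality equation over all states.
    static double residual(double[] U, double[] reward, int[][][] nextState,
                           double[][][] transProb, double gamma) {
        double worst = 0.0;
        for (int s = 0; s < U.length; s++) {
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < nextState[s].length; a++) {
                double q = 0.0;  // Q(s,a) over the sparse successor list
                for (int i = 0; i < nextState[s][a].length; i++)
                    q += transProb[s][a][i] * U[nextState[s][a][i]];
                best = Math.max(best, q);
            }
            worst = Math.max(worst, Math.abs(U[s] - (reward[s] + gamma * best)));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Toy MDP: state 0 (reward 0) may move to absorbing state 1 (reward 1);
        // with gamma = 0.5 the optimal utilities are exactly U = (1, 2).
        int[][][] nextState = {{{1}, {0}}, {{1}, {1}}};
        double[][][] transProb = {{{1.0}, {1.0}}, {{1.0}, {1.0}}};
        double[] reward = {0.0, 1.0};
        System.out.println(residual(new double[]{1.0, 2.0}, reward,
                                    nextState, transProb, 0.5));  // prints 0.0
    }
}
```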

The code that you need to write

Your job is to fill in the constructors in the three classes ValueIter, PolicyIter and PolicyEval.  The constructor for ValueIter takes as input an MDP and a discount factor; the constructed class then includes a field policy containing the optimal policy (an array mapping states to actions), and another field utility (an array mapping states to real numbers).  The other two classes are set up similarly.  You may wish to add other fields that are returned as part of the class, such as the number of iterations until convergence.
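For PolicyEval, one standard approach (see R&N, section 17.3) is simplified value iteration: with the policy held fixed, iterate U(s) &lt;- R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') until convergence (solving the corresponding linear system directly also works); policy iteration then alternates this evaluation with greedy policy improvement. A minimal illustration with dense, hand-built arrays, not the actual class interface:

```java
public class PolicyEvalSketch {
    // Iteratively evaluate a fixed policy:
    // U(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s').
    static double[] evaluate(int[] policy, double[] reward, double[][][] prob,
                             double gamma, double tol) {
        int n = reward.length;
        double[] U = new double[n];
        double delta;
        do {
            delta = 0.0;
            double[] Unew = new double[n];
            for (int s = 0; s < n; s++) {
                double expected = 0.0;  // expected utility under the policy's action
                for (int s2 = 0; s2 < n; s2++)
                    expected += prob[s][policy[s]][s2] * U[s2];
                Unew[s] = reward[s] + gamma * expected;
                delta = Math.max(delta, Math.abs(Unew[s] - U[s]));
            }
            U = Unew;
        } while (delta > tol);
        return U;
    }

    public static void main(String[] args) {
        // Toy MDP: state 0 (reward 0) can move to absorbing state 1 (reward 1).
        // Evaluating the policy that always picks action 0 ("go toward the
        // reward state") with gamma = 0.5 gives U = (1, 2).
        double[] reward = {0.0, 1.0};
        double[][][] prob = {
            {{0.0, 1.0}, {1.0, 0.0}},
            {{0.0, 1.0}, {0.0, 1.0}},
        };
        double[] U = evaluate(new int[]{0, 0}, reward, prob, 0.5, 1e-9);
        System.out.println(U[0] + " " + U[1]);
    }
}
```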

The final code that you turn in should not write anything to standard output or standard error.

All code and data files can be obtained from this directory, or all at once from this tar file.  Documentation is available here.

What to turn in

Using whiteboard, you should turn in the following:

In addition, you should turn in your write-up exploring the cat-and-mouse environments as described above.  This written work should be handed in with the written exercises.

What you will be graded on

We will automatically test your code on MDP's other than those provided with this assignment.  Your grade will depend largely on getting the right answers.  In addition, your code should be efficient enough to run reasonably fast (under a minute or two on an MDP like those provided on a 2.4GHz machine), and should not terminate prematurely with an exception (unless given bad data, of course); code that does otherwise risks getting no credit for the automatic testing portion of the grade.  As usual, you should follow good programming practices, including documenting your code. 

Your write-up should be clear, concise, thoughtful and perceptive.

This programming assignment will be worth about 70 points (about 35 based on automatic testing of your code and on implementing the correct algorithms, 10 for documentation and good programming practice, and 25 for the write-up).

Acknowledgments

Thanks to Dale Schuurmans from whom this assignment was liberally borrowed.