Andy Zeng
Pacman Goes To School

Teach Pacman how to make himself smarter!

Inspired by behaviorist psychology, reinforcement learning is an area of machine learning in computer science, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. The problem has been studied in the theory of optimal control, though most studies there are concerned with existence of optimal solutions and their characterization, and not with the learning or approximation aspects.
In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs. This project uses a visual gridworld MDP to help us implement some reinforcement learning algorithms.

Value Iteration

- This value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run in its initial planning phase.
- Takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns.
- Value iteration computes k-step estimates of the optimal values, Vk.
- Note in code: the method computeActionFromValues(state) computes the best action according to the value function given by self.values. The method computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair given by the value function given by self.values. These quantities are all displayed in the GUI: values are numbers in squares, Q-values are numbers in square quarters, and policies are arrows out from each square.
- This uses the "batch" version of value iteration where each vector Vk is computed from a fixed vector Vk-1, not the "online" version where one single weight vector is updated in place. This means that when a state's value is updated in iteration k based on the values of its successor states, the successor state values used in the value update computation are those from iteration k-1 (even if some of the successor states had already been updated in iteration k).
- A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards (i.e. when πk+1 is returned). Similarly, the Q-values will also reflect one more reward than the values (i.e. when Qk+1 is returned).
- Handles the case when a state has no available actions in an MDP.
- The images show an AI agent's values, Q-values, and policies after fixed numbers of iterations and at convergence.

Bridge Crossing

- The following is a grid world map with the a low-reward terminal state and a high-reward terminal state separated by a narrow "bridge", on either side of which is a chasm of high negative reward. The agent starts near the low-reward state. With the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge.
- By simply changing one of the discount and noise parameters, the optimal policy can cause the agent to attempt to cross the bridge.

Optimal Policies

- This grid has two terminal states with positive payoff (in the middle row), a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square. We distinguish between two types of paths: (1) paths that "risk the cliff" and travel near the bottom row of the grid; these paths are shorter but risk earning a large negative payoff, and are represented by the red arrow in the right-most figure. (2) paths that "avoid the cliff" and travel along the top edge of the grid. These paths are longer but are less likely to incur huge negative payoffs. These paths are represented by the green arrow in the right-most figure.
- The key is choosing the best settings of the discount, noise, and living reward parameters for this MDP to produce optimal policies for certain conditions.
- Tests are performed across the following policy types produced:
    - Prefer the close exit (+1), risking the cliff (-10).
    - Prefer the close exit (+1), but avoiding the cliff (-10).
    - Prefer the distant exit (+10), risking the cliff (-10).
    - Prefer the distant exit (+10), avoiding the cliff (-10).
    - Avoid both exits and the cliff (infinite loop).


- Note that our value iteration agent does not actually learn from experience. Rather, it ponders its MDP model to arrive at a complete policy before ever interacting with a real environment. When it does interact with the environment, it simply follows the precomputed policy (e.g. it becomes a reflex agent). This distinction may be subtle in a simulated environment like a Gridword, but it's very important in the real world, where the real MDP is not available.
- This Q-learning agent, which does very little on construction, learns by trial and error from interactions with the environment through its update(state, action, nextState, reward) method.
- Ties between values are broken randomly for better behavior.
- In a particular state, actions that the agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that the agent has seen before have a negative Q-value, an unseen action may be optimal.
- The test on the right is performed with 10 episodes of learning.

Epsilon Greedy

- Implemented epsilon-greedy action selection, meaning the agent chooses random actions an epsilon fraction of the time, and follows its current best Q-values otherwise.
- Note that choosing a random action may result in choosing the best action - that is, not a random sub-optimal action... but any random legal action.
- The final Q-values resemble those of the value iteration agent, especially along well-traveled paths. However, the average returns are lower than the Q-values predicted because of the random actions and the initial learning phase.
- We can apply this algorithm to a Q-learning crawler robot (right figure). It attempts to learn how to, well, crawl towards the right.

Approximate Q-learning and State Abstraction

- In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, he will enter testing mode. When testing, Pacman's epsilon and alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit his learned policy.
- Because unseen actions have by definition a Q-value of zero, if all of the actions that have been seen have negative Q-values, an unseen action may be optimal.
- The MDP state is the exact board configuration facing Pacman, with the now complex transitions describing an entire ply of change to that state. The intermediate game configurations in which Pacman has moved but the ghosts have not replied are not MDP states, but are bundled in to the transitions.
- The left image depicts a Pacman before Q-learning.
- The right image depicts a Pacman after Q-learning 2000 training episodes.

Approximate Q-learning and Features

- Implementation of an approximate Q-learning agent that learns weights for features of states, where many states might share the same features.
- Approximate Q-learning assumes the existence of a feature function f(s,a) over state and action pairs, which yields a vector f1(s,a) .. fi(s,a) .. fn(s,a) of feature values.
- Applying the feature function to our Q-learning Pacman agent gives us the following results (right figure), after 50 training episodes.