% No 'submit' option for the problems by themselves.
\documentclass{cos324}
% Comment out the line above and uncomment the line below for submitting.
% \documentclass[submit]{cos324}
\usepackage{newtxtext,newtxmath}
\usepackage{bm}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{tikz}
\usepackage{color}
\usepackage{listings}
\usepackage{enumitem}
% Put in your full name and email address.
\name{Your Name}
\email{email@princeton.edu}
% You don't need to change these.
\course{COS324-S19}
\assignment{Assignment \#6}
\duedate{23:55 Friday 10 May 2019}
\dropboxurl{https://dropbox.cs.princeton.edu/COS324_S2019/HW6}
\newcommand{\boldA}{\bm{A}}
\newcommand{\boldm}{\bm{m}}
\newcommand{\boldS}{\bm{S}}
\newcommand{\boldX}{\bm{X}}
\newcommand{\boldx}{\bm{x}}
\newcommand{\boldU}{\bm{U}}
\newcommand{\boldV}{\bm{V}}
\newcommand{\boldw}{\bm{w}}
\newcommand{\boldW}{\bm{W}}
\newcommand{\boldz}{\bm{z}}
\newcommand{\reals}{\mathbb{R}}
\newcommand{\distNorm}{\mathcal{N}}
\newcommand{\identity}{\mathbb{I}}
\newcommand{\given}{\,|\,}
\newcommand{\boldeta}{\bm{\eta}}
\newcommand{\boldmu}{\bm{\mu}}
\newcommand{\boldLambda}{\bm{\Lambda}}
\newcommand{\trans}{\text{T}}
\newcommand{\trace}{\text{trace}}
\begin{document}
% IMPORTANT: Uncomment the \documentclass command above with the [submit] option and put your name where it says ``your name''.
\begin{problem}[1pt]
Use the instructions above to put your name on the assignment when you compile this document.
\end{problem}
\subsection*{Problem 2: Lawn Darts [49 pts]}
Lawn darts was a classic family game that also turned out to be somewhat deadly.
The game was played by throwing pointy weighted darts into the air and trying to have them land in a target area.
It was kind of like a weaponized version of horseshoes.
Like the \emph{Gilbert U-238 Atomic Energy Lab}, it is a toy that has been banned by the Consumer Product Safety Commission.
Fortunately, we're going to play it here in simulation.
This will be a single-player game.
The idea is to throw darts at the grid in Figure~2 until you get \textbf{exactly} 101 points.
You'll start with zero points, and in each turn you aim a dart at one of the sixteen inner numbered points in the grid.
You're not a great aim, so you have a good chance to hit one of the other squares.
Whatever square you hit, you add that number of points to your score.
If you hit one of the blank squares in the ring, you add zero points.
You can't aim for the blank ring.
If you get a score of exactly 101, the game ends and you get one unit of reward.
If you go over 101, the game ends and you get a negative one unit of reward.
\vspace{\baselineskip}
\begin{minipage}[b]{0.45\linewidth}
\centering
\includegraphics[width=0.7\textwidth]{hw6-figures/Lawndart}
\captionof{figure}{Lawn darts turned out to be a bad idea, except in simulation.}
\end{minipage}%
\hfill
\begin{minipage}[b]{0.45\linewidth}
\centering
\resizebox{0.5\textwidth}{!}{%
\begin{tikzpicture}
\foreach \i in {0,...,6} {
\draw [thin,gray] (\i,0) -- (\i,6);
}
\foreach \i in {0,...,6} {
\draw [thin,gray] (0,\i) -- (6,\i);
}
\foreach \i in {1,...,5} {
\draw [thick,black] (\i,1) -- (\i,5);
}
\foreach \i in {1,...,5} {
\draw [thick,black] (1,\i) -- (5,\i);
}
\draw (1.5,1.5) node {$9$};
\draw (2.5,1.5) node {$6$};
\draw (3.5,1.5) node {$15$};
\draw (4.5,1.5) node {$4$};
\draw (1.5,2.5) node {$16$};
\draw (2.5,2.5) node {$3$};
\draw (3.5,2.5) node {$10$};
\draw (4.5,2.5) node {$5$};
\draw (1.5,3.5) node {$2$};
\draw (2.5,3.5) node {$13$};
\draw (3.5,3.5) node {$8$};
\draw (4.5,3.5) node {$11$};
\draw (1.5,4.5) node {$7$};
\draw (2.5,4.5) node {$12$};
\draw (3.5,4.5) node {$1$};
\draw (4.5,4.5) node {$14$};
\end{tikzpicture}
}
\captionof{figure}{You aim for one of the 16 middle squares. You get zero points if you accidentally hit in the ring outside.}
\end{minipage}
\vspace{\baselineskip}
Your objective is to learn a policy to play the game.
In this case, you happen to know exactly what your aiming distribution is: you have a 60\% chance of hitting the square you aim for, and a 10\% chance each of hitting the square to the north, east, south, or west.
So, if you aim for the $15$, you get $15$ points with probability $0.6$, $4$ points with probability $0.1$, $6$ points with probability $0.1$, $10$ points with probability $0.1$, and $0$ points with probability $0.1$.
This is a Markov decision process (MDP) in which you know the current state~${s\in\{0,1,\ldots,117\}}$ (your score), you know the reward function
\begin{align}
R(s) &= \begin{cases}
0 & \text{if $s < 101$}\\
1 & \text{if $s=101$}\\
-1 & \text{if $s > 101$}
\end{cases}\,,
\end{align}
and you know the transition probabilities.
A policy for this game will be a function from the states~${s=0,1,\ldots,100}$ to the 16 possible actions.
Your objective is to find such a policy using value iteration or policy iteration.
Explain how you set up the problem and why you chose the approach you did.
Plot a representation of the value function: make a figure where the x-axis is the current state (score) and the y-axis is the value of that state.
(You don't need to plot a curve for each iteration of value iteration; just plot the final solution.)
Do you notice any interesting structure in this plot?
Which non-terminal state has the highest value?
Why do you think this is?
Turn in the code you used to perform this simulation.
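As a starting point, here is a minimal sketch of how value iteration could be set up for this game. The grid layout is transcribed from Figure~2 (row 0 is the bottom row), and the convergence tolerance and code structure are illustrative choices, not a prescribed solution; your own formulation may differ.

```python
import numpy as np

# 4x4 grid of point values from Figure 2 (row 0 = bottom row of the figure).
GRID = np.array([[ 9,  6, 15,  4],
                 [16,  3, 10,  5],
                 [ 2, 13,  8, 11],
                 [ 7, 12,  1, 14]])

def outcome_dist(r, c):
    """Distribution over points gained when aiming at grid cell (r, c):
    0.6 for the aimed square, 0.1 each for the square to the north, east,
    south, or west (a square in the outer ring is worth 0 points)."""
    dist = {int(GRID[r, c]): 0.6}
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        rr, cc = r + dr, c + dc
        pts = int(GRID[rr, cc]) if 0 <= rr < 4 and 0 <= cc < 4 else 0
        dist[pts] = dist.get(pts, 0.0) + 0.1   # two ring neighbors both add to pts=0
    return dist

# One action per numbered square you may aim at.
ACTIONS = [outcome_dist(r, c) for r in range(4) for c in range(4)]

def value_iteration(tol=1e-10):
    """Undiscounted value iteration over the non-terminal scores s = 0..100.
    Scores of 101 (win, +1) and above 101 (bust, -1) are terminal."""
    V = np.zeros(101)
    while True:
        V_new = np.empty_like(V)
        for s in range(101):
            best = -np.inf
            for dist in ACTIONS:
                q = 0.0
                for pts, p in dist.items():
                    s2 = s + pts
                    if s2 == 101:
                        q += p * 1.0     # exactly 101: reward +1, game ends
                    elif s2 > 101:
                        q += p * -1.0    # over 101: reward -1, game ends
                    else:
                        q += p * V[s2]   # game continues from score s2
                best = max(best, q)
            V_new[s] = best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The greedy policy can then be read off by taking, in each state, the action that attains the maximum in the inner loop.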
%\end{problem}
\subsection*{Problem 3: Swingy Monkey [50 pts]}
In 2014, the mobile game \emph{Flappy Bird} took the world by storm.
After its discontinuation, iPhones with the game installed sold for thousands of dollars on eBay.
In this problem, you'll be developing a reinforcement learning agent to play a similar game, \emph{Swingy Monkey}.
In this game, you control a monkey that is trying to swing on vines and avoid tree trunks.
You can either make him jump to a new vine, or have him swing down on the vine he's currently holding.
You get points for successfully passing tree trunks, and you lose points for hitting them, falling off the bottom of the screen, or jumping off the top.
There are some sources of randomness: the monkey's jumps are sometimes higher than others, the gaps in the trees vary vertically, and the distances between the trees are different.
You can play the game directly by pushing a key on the keyboard to make the monkey jump, but this problem is about building a learning agent to play.
\begin{figure}[t]
\centering%
\includegraphics[width=0.7\textwidth]{hw6-figures/swingy-ann}
\caption{Various pieces of the state dictionary.}
\end{figure}
You will implement two Python functions, \verb|action_callback| and \verb|reward_callback|.
The reward callback will tell you what your reward was in the immediately previous time step:
\begin{itemize}[topsep=2pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]
\item Reward of $+1$ for passing a tree trunk.
\item Reward of $-5$ for hitting a tree trunk.
\item Reward of $-10$ for falling off the bottom of the screen.
\item Reward of $-10$ for jumping off the top of the screen.
\item Reward of $0$ otherwise.
\end{itemize}
The action callback will take in a dictionary that describes the current state of the game, and you will use it to choose the action for the next time step.
This will be a binary action, where 0 means to swing downward and 1 means to jump up.
The dictionary you get for the state looks like this:
\begin{lstlisting}[language=Python]
{ 'score':  ,
  'tree':   { 'dist': ,
              'top':  ,
              'bot':  },
  'monkey': { 'vel': ,
              'top': ,
              'bot': }}
\end{lstlisting}
All of the units here (except score) will be in screen pixels.
Figure~3 shows these graphically.
There are multiple challenges here.
First, the state space is very large -- effectively continuous.
You'll need to figure out how to handle this.
One strategy might be to use some kind of function approximation, such as a neural network, to represent the value function or the $Q$-function.
Another strategy -- one that worked well for me -- is to discretize the position space into bins.
Second, you don't know the dynamics, so you'll need to use a reinforcement learning approach, rather than a standard MDP solving approach.
I got a pretty good monkey policy with Q-Learning.
Your task is to use reinforcement learning to find a policy for the monkey that can navigate the trees.
The implementation of the game is in file \verb|SwingyMonkey.py|, along with a few files in the \verb|res/| directory.
A file called \verb|stub.py| is provided to give you an idea of how you might go about setting up a learner that interacts with the game.
I also posted a YouTube video of my Q-Learner \href{http://youtu.be/l4QjPr1uCac}{here}.
You can see that it figures out a reasonable policy in a few dozen iterations.
(Don't take the specific performance or number of iterations too seriously as a threshold for a good solution. The goal is to get a competent monkey policy; your algorithm might find one faster or slower than mine. The 100 iterations in the starter code is not a constraint.)
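To make the bin-discretization idea concrete, here is a minimal sketch of a tabular Q-learner built around the two callbacks. The bin size, learning rate, discount, and exploration rate are illustrative assumptions, not tuned values, and episode-reset bookkeeping (clearing the stored state when the monkey dies) is omitted for brevity.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning over a discretized state, a sketch assuming the
    state dictionary described above. Bin size and hyperparameters are
    illustrative choices; episode-reset handling is omitted."""

    def __init__(self, bin_size=50, alpha=0.2, gamma=0.9, epsilon=0.05):
        self.bin_size = bin_size
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)   # (state, action) -> estimated value
        self.last_state = None
        self.last_action = None
        self.last_reward = None

    def discretize(self, state):
        """Collapse the (effectively continuous) pixel state into coarse bins."""
        b = self.bin_size
        return (state['tree']['dist'] // b,
                (state['tree']['top'] - state['monkey']['top']) // b,
                state['monkey']['vel'] // b)

    def action_callback(self, state):
        s = self.discretize(state)
        # Q-learning update for the previous (state, action) pair,
        # using the reward observed since then.
        if self.last_state is not None:
            sa = (self.last_state, self.last_action)
            target = self.last_reward + self.gamma * max(self.Q[(s, a)] for a in (0, 1))
            self.Q[sa] += self.alpha * (target - self.Q[sa])
        # Epsilon-greedy action selection: 0 = swing down, 1 = jump.
        if random.random() < self.epsilon:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda act: self.Q[(s, act)])
        self.last_state, self.last_action = s, a
        return a

    def reward_callback(self, reward):
        self.last_reward = reward
```

Hooked up through `stub.py`, `action_callback` and `reward_callback` are invoked once per time step; the Q-table update happens at the start of each action callback, once the reward for the previous step is known.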
You should explain how you decided to solve the problem, what decisions you made, and what issues you encountered along the way.
Provide evidence where necessary to explain your decisions.
Effort and explanation will be large fractions of the grading rubric of this problem.
You should not use off-the-shelf RL tools like PyBrain to solve this.
Turn in your code.
\subsection*{Changelog}
\begin{itemize}
\item 29 April 2019 -- Initial version.
\end{itemize}
\end{document}