Reinforcement Learning Without Rewards
Machine learning can be broadly defined as the study and design of algorithms that improve with experience. Reinforcement learning is a variety of machine learning that makes minimal assumptions about the information available for learning, and, in a sense, defines the problem of learning in the broadest possible terms. Reinforcement learning algorithms are usually applied to ``interactive'' problems, such as learning to drive a car, operate a robotic arm, or play a game. In reinforcement learning, an autonomous agent must learn how to behave in an unknown, uncertain, and possibly hostile environment, using only the sensory feedback that it receives from the environment. As the agent moves from one state of the environment to another, it receives only a reward signal --- there is no human ``in the loop'' to tell the algorithm exactly what to do. The goal in reinforcement learning is to learn an optimal behavior that maximizes the total reward that the agent collects.
Despite its generality, the reinforcement learning framework does make one strong assumption: that the reward signal can always be directly and unambiguously observed. In other words, the feedback a reinforcement learning algorithm receives is assumed to be a part of the environment in which the agent is operating, and is included in the agent's experience of that environment. However, in practice, rewards are usually manually-specified by the practitioner applying the learning algorithm, and specifying a reward function that elicits the desired behavior from the agent can be a subtle and frustrating design problem. Our main focus in this thesis is the design and analysis of reinforcement learning algorithms which do not require complete knowledge of the rewards. The contributions of this thesis can be divided into three main parts:
* In Chapters 2 and 3, we review the theory of two-player zero-sum games, and present a novel analysis of existing no-regret algorithms for solving these games. Our results show that no-regret algorithms can be used to compute strategies in games that satisfy a much stronger definition of optimality than is commonly used.
* In Chapters 4 and 5, we present new algorithms for apprenticeship learning, a generalization of reinforcement learning where the true rewards are unknown. The algorithms described in Chapter 5 will leverage the game-theoretic results from Chapters 2 and 3.
* In Chapter 6, we show how partial knowledge of the rewards can be used to accelerate imitation learning, an alternative to reinforcement learning where the goal is to imitate another agent in the environment.
In summary, we design and analyse several new algorithms for reinforcement learning that do not require access to a fully observable or fully accurate reward signal, and by doing so, add considerable flexibility to the traditional reinforcement learning framework.