<h1>Using Intrinsic Motivation to Solve Robotic Tasks with Sparse Rewards</h1>
<p>Jon Michaux (jmichaux@ttic.edu), May 5, 2019</p>
<h1 id="abstract">Abstract</h1>
<p>Many robotics problems are naturally formulated such that the extrinsic rewards to the agent are either sparse or missing altogether. These problems can be extremely difficult to solve as the environment provides little to no feedback to guide the agent toward accomplishing its goal. Previous works have shown that agents that train using prediction error as an intrinsic reward are able to learn across a wide range of domains, including Atari games and continuous control tasks [<a href="https://arxiv.org/abs/1705.05363">1</a>, <a href="https://arxiv.org/abs/1808.04355">2</a>, <a href="https://arxiv.org/abs/1810.12894">3</a>]. In this project, I use curiosity-driven exploration to solve challenging robotics tasks with sparse rewards. Following these previous works, I formulate the intrinsic reward as the error in the agent’s ability to predict its next state, given its current state and executed action. My results demonstrate that this approach is capable of solving several difficult robotic manipulation tasks in simulation.</p>
<h1 id="introduction">Introduction</h1>
<p>The goal of reinforcement learning is for an agent to learn to solve a given task by maximizing its total expected reward. Instead of relying on external instructions, the agent learns how to choose actions by exploring and interacting directly with the environment. Reinforcement learning problems can roughly be sorted into two categories: 1) where the agent receives dense rewards and 2) where the agent receives sparse (or no) rewards.</p>
<p>In the first case, the environment provides a continuous source of feedback to the agent in the form of dense scalar rewards. These rewards, which are received at every time step, guide the agent toward choosing the best actions to solve the task. This approach has seen many successes, including solving challenging Atari games [<a href="https://arxiv.org/abs/1312.5602">4</a>] and physical control problems with continuous state and action spaces [<a href="https://arxiv.org/abs/1509.02971">5</a>].</p>
<p>In the second case, which is the focus of this project, the environment provides little-to-no feedback to the agent. Robotics problems are great examples of the sparse-reward settings that are so common in the real world. Consider, for example, a robotic agent tasked with clearing a table. Rather than designing a complicated reward function that accounts for the myriad subtasks, a more natural approach is to supply a reward only once every item has been removed and the goal is finally met.</p>
<p>In this setting, the agent must learn the requisite skills in the absence of any feedback from the environment, yet is unlikely to stumble upon a good policy by chance. One way to overcome this challenge is to carefully engineer a reward function that generates extrinsic rewards to guide the agent’s progress. This approach, however, is saddled with the difficult chore of designing a custom reward function for every environment; and a hand-crafted design may inadvertently fail to specify the task well enough to deter undesirable behaviors from the agent. Alternatively, we may opt for methods that encourage the agent to explore and learn new skills in the absence of any external rewards from the environment. In this project, I explore learning a reward function that is intrinsic to the agent in order to solve sparse-reward problems.</p>
<h1 id="intrinsic-motivation-in-reinforcement-learning">Intrinsic Motivation in Reinforcement Learning</h1>
<p>One source of inspiration for solving sparse reward problems has come from the field of developmental psychology, namely <strong><em>motivation</em></strong>. There are two types of motivation: <em>extrinsic motivation</em> and <em>intrinsic motivation</em>.</p>
<p>In reinforcement learning, intrinsic motivation, or curiosity, is often formulated to encourage the agent to perform actions that lead to the discovery of novel states [<a href="https://ieeexplore.ieee.org/abstract/document/170605">6</a>, <a href="https://arxiv.org/abs/1103.5708">7</a>, <a href="http://people.idsia.ch/~juergen/ieeecreative.pdf">8</a>, <a href="https://pdfs.semanticscholar.org/2547/be25e1e07728aa0966a0354e90664816d15e.pdf">9</a>, <a href="https://arxiv.org/abs/1611.07507">10</a>, <a href="https://arxiv.org/abs/1606.01868">11</a>]. For simple discrete environments, this amounts to keeping track of the state-visitation counts <script type="math/tex">N(s)</script>, and modeling novelty as some function <script type="math/tex">h(s)=\frac{1}{[N(s)]^p}f(s)</script> where <script type="math/tex">p>0</script> [<a href="https://arxiv.org/abs/1606.01868">11</a>, <a href="https://papers.nips.cc/paper/4642-exploration-in-model-based-reinforcement-learning-by-empirically-estimating-learning-progress">12</a>]. As <script type="math/tex">N(s)</script> increases, the state <script type="math/tex">s</script> becomes less novel and the agent receives less reward. While this works well for small discrete Markov Decision Processes, this counts-based method will fail in large and continuous state spaces where an agent is unlikely to visit a given state multiple times. Bellemare <em>et al</em> solved this problem by learning a parameterized density model that can be used to derive an approximation to state-visitation counts called <em>pseudo-counts</em> [<a href="https://arxiv.org/abs/1606.01868">11</a>]. Their pseudo-counts measured how often an agent has visited similar states, and were converted into exploration bonuses that significantly improved exploration for a number of challenging Atari games.</p>
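<p>As a concrete illustration of the count-based bonus above, here is a minimal Python sketch (hypothetical code, taking <script type="math/tex">f(s)=1</script> and <script type="math/tex">p=0.5</script>):</p>

```python
from collections import defaultdict

# Hypothetical count-based exploration bonus for a small discrete MDP,
# implementing h(s) = 1 / N(s)**p with f(s) = 1 and p = 0.5.
class CountBonus:
    def __init__(self, p=0.5):
        self.counts = defaultdict(int)  # state-visitation counts N(s)
        self.p = p

    def bonus(self, state):
        self.counts[state] += 1
        return 1.0 / self.counts[state] ** self.p

explorer = CountBonus()
r_first = explorer.bonus("s0")   # first visit: N(s0) = 1, bonus = 1.0
for _ in range(3):
    explorer.bonus("s0")         # repeated visits shrink the bonus toward 0
```

<p>As <script type="math/tex">N(s)</script> grows, the bonus decays toward zero, so frequently visited states stop paying out; pseudo-counts generalize exactly this quantity to large state spaces.</p>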
<p>Another formulation for an intrinsic reward encourages the agent to perform actions that reduce the error in its ability to predict the outcome of its actions [<a href="https://arxiv.org/abs/1705.05363">1</a>, <a href="https://arxiv.org/abs/1808.04355">2</a>, <a href="https://arxiv.org/abs/1810.12894">3</a>, <a href="https://pdfs.semanticscholar.org/fb3c/6456708b0e143f545d77dc8ec804eb947395.pdf">13</a>, <a href="https://arxiv.org/abs/1810.01176">14</a>]. Using deep neural networks to predict the agent’s next state given its current state and action, [<a href="https://arxiv.org/abs/1705.05363">1</a>] showed that intrinsic reward (<em>i.e.</em> prediction error) can be used to learn good exploration policies, even in the absence of extrinsic rewards. Burda <em>et al</em> systematically investigated how the choice of feature representation affects curiosity-driven learning across a diverse range of environments [<a href="https://arxiv.org/abs/1808.04355">2</a>]. They showed that random features work well for evaluating trained tasks, but learned features tend to generalize better to unseen scenarios. Follow-up work by Burda <em>et al</em> showed that distilling features from a randomly initialized network, and combining non-episodic extrinsic and intrinsic rewards with different discount factors, was able to solve the challenging Atari game <em>Montezuma’s Revenge</em> [<a href="https://arxiv.org/abs/1810.12894">3</a>].</p>
<h1 id="proposed-methods">Proposed Methods</h1>
<p>The agent is composed of two submodules: a policy <script type="math/tex">\pi_{\theta_{P}}(s_t)</script> and a dynamics model <script type="math/tex">f_{\theta_{D}}(s_t, a_t)</script>. The policy takes the current state as input and outputs the action to be executed. The dynamics model takes the agent’s current state and action as input and outputs a prediction of the next state.</p>
<p align="center">
<a href="/assets/final_project/gifs/curious.png"><img src="/assets/final_project/gifs/curious.png" width="320" height="240" /></a>
</p>
<p>The prediction error of the dynamics model is used to generate the dense intrinsic rewards <script type="math/tex">r^{i}_{t}</script> at every timestep. We calculate it as follows:</p>
<script type="math/tex; mode=display">\hat s_{t+1} = f(s_t, a_t ; \theta_{D})</script>
<script type="math/tex; mode=display">r^{i}_{t} = \frac{\eta}{2} \lVert \hat s_{t+1} - s_{t+1} \rVert_{2}^{2}</script>
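<p>The two equations above can be sketched in a few lines of numpy. Here a linear map stands in for the neural dynamics model, so the weights and dimensions are purely illustrative:</p>

```python
import numpy as np

# Illustrative intrinsic reward: eta/2 * ||s_hat - s_next||^2, with a
# linear model W standing in for the dynamics network f(s, a; theta_D).
def intrinsic_reward(W, s_t, a_t, s_next, eta=1.0):
    x = np.concatenate([s_t, a_t])   # model input: current state and action
    s_hat = W @ x                    # predicted next state
    return 0.5 * eta * np.sum((s_hat - s_next) ** 2)

W = np.zeros((3, 5))                 # toy 3-D state, 2-D action
s_t, a_t = np.zeros(3), np.zeros(2)
r_i = intrinsic_reward(W, s_t, a_t, np.ones(3))  # poor prediction -> large reward
```

<p>Transitions the model predicts poorly yield large rewards, which is exactly what pushes the agent toward unfamiliar parts of the state space.</p>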
<p>The reward <script type="math/tex">R_t</script> that is supplied to the agent at each timestep is the sum of the extrinsic reward <script type="math/tex">r^{e}_{t}</script> from the environment and the intrinsic reward <script type="math/tex">r^{i}_{t}</script>:</p>
<script type="math/tex; mode=display">R_{t}(s, a, s') = \underbrace{r^{e}_{t}(s, a, s')}_{\text{extrinsic}} + \underbrace{r^{i}_{t}(s, a, s')}_{\text{intrinsic}}</script>
<p>The network parameters <script type="math/tex">\theta_{D}</script> of the dynamics model are optimized by minimizing the loss function:</p>
<script type="math/tex; mode=display">\begin{equation}
L_{D}(\hat s_{t+1}, s_{t+1}) = \frac{1}{2} \lVert \hat s_{t+1} - s_{t+1} \rVert_{2}^{2}.
\end{equation}</script>
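<p>Minimizing this loss is an ordinary regression problem. As a hypothetical sketch, again using a linear stand-in for the dynamics network, one gradient step on <script type="math/tex">\theta_{D}</script> looks like:</p>

```python
import numpy as np

# One gradient-descent step on L_D = 1/2 * ||W x - s_next||^2, where the
# matrix W is an illustrative stand-in for the dynamics-network weights.
def dynamics_step(W, s_t, a_t, s_next, lr=0.1):
    x = np.concatenate([s_t, a_t])
    err = W @ x - s_next             # prediction error (s_hat - s_next)
    grad = np.outer(err, x)          # dL_D/dW
    return W - lr * grad             # updated weights
```

<p>Repeated steps shrink the prediction error, and with it the intrinsic reward for revisiting the same transitions.</p>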
<p>Thus, the optimization problem that we need to solve for the agent is:</p>
<script type="math/tex; mode=display">\begin{equation}
\underset{\theta_{P}, \theta_{D}}{\operatorname{min}}[ -\mathbb{E}_{\pi(s_t ; \theta_{P})}[\sum_{t}R_t] + \beta L_{D}].
\end{equation}</script>
<p>Although it is possible to use any on-policy reinforcement learning algorithm with this formulation, I chose <em>Proximal Policy Optimization</em> (PPO) [<a href="https://arxiv.org/abs/1707.06347">15</a>] to train the agent’s policy due to its ease of implementation and improved stability over other policy gradient algorithms.</p>
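<p>For reference, the heart of PPO is its clipped surrogate objective. The sketch below assumes per-timestep probability ratios and advantage estimates have already been computed; it is not the full algorithm:</p>

```python
import numpy as np

# PPO's clipped surrogate loss (negated, since we minimize):
# L = -E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)]
def ppo_clip_loss(ratio, adv, eps=0.2):
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

<p>Clipping removes the incentive to move the new policy far from the old one in a single update, which is the source of PPO’s improved stability.</p>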
<h1 id="experimental-setup">Experimental Setup</h1>
<p>This section describes the experimental design and environments that I used to evaluate whether adding intrinsic rewards as an exploration bonus aids in solving sparse robotics tasks.</p>
<h2 id="environments">Environments</h2>
<p>For this project, I used the standard Fetch robotics environments [<a href="https://arxiv.org/abs/1802.09464">16</a>] provided in OpenAI Gym [<a href="https://arxiv.org/abs/1606.01540">17</a>]. The Fetch robotics environments are a set of benchmark tasks for continuous control of robotic manipulation: <strong>reaching</strong>, <strong>pushing</strong>, <strong>pick-and-place</strong>, and <strong>sliding</strong>. The Fetch robot is a 7-degree-of-freedom robotic arm with a parallel pinch gripper as its end effector. The arm is controlled using position control of the end effector. The action space has four dimensions: three control the change in end-effector position along the coordinate axes, while the fourth controls the opening and closing of the gripper. The state space of each environment includes the position (<script type="math/tex">x</script>, <script type="math/tex">y</script>, <script type="math/tex">z</script>) and velocity (<script type="math/tex">v_x</script>, <script type="math/tex">v_y</script>, <script type="math/tex">v_z</script>) of the gripper, and the joint positions and velocities of the gripper fingers. If an object is present in the scene, the state also includes the position, linear velocity, and angular velocity of the object, as well as the position and linear velocity of the object relative to the gripper.</p>
<p align="center">Random Agents</p>
<p align="center">
<a href="/assets/final_project/gifs/random_agents.gif"><img src="/assets/final_project/gifs/random_agents.gif" width="200" height="120" /></a>
</p>
<h3 id="reaching">Reaching</h3>
<p>The Reaching task is the simplest task in the Fetch robotics suite. The agent controls the <script type="math/tex">x</script>, <script type="math/tex">y</script>, and <script type="math/tex">z</script> position of the robot’s gripper while the gripper fingers are blocked. The task is considered solved when the gripper is within some threshold distance of a given target. The state space of this environment is 13-dimensional.</p>
<h3 id="pushing">Pushing</h3>
<p>The Pushing task is more complex than the Reaching task, as it includes an additional object in the scene and hence more degrees of freedom. As in the Reaching task, only the 3D position of the gripper is controlled, while the fingers are blocked. The Pushing task is considered solved once the agent has pushed a block to within some threshold distance of a given target location. Note that the target is within the workspace of the robot. The state space of this environment is 28-dimensional.</p>
<h3 id="pick-and-place">Pick and Place</h3>
<p>The Pick and Place task requires the agent to grasp a block in its workspace and lift it to a specified location. The state space of this environment is 28-dimensional.</p>
<h3 id="sliding">Sliding</h3>
<p>Similar to the Pushing task, the Sliding task is solved once the robot slides a puck to a given target location. Unlike the Pushing task, the target location is not within the workspace of the robot. The state space of this environment is 28-dimensional.</p>
<h2 id="environment-modifications">Environment Modifications</h2>
<p>I made a number of modifications to the environments to allow more control over the extrinsic reward, the observation type, and the termination of an episode.</p>
<h3 id="rewards">Rewards</h3>
<p>The environments are classified based on the types of extrinsic rewards provided by the environment:</p>
<ul>
<li><strong>Dense</strong> - at every time step the agent receives a reward that is proportional to the distance from a desired goal</li>
<li><strong>Sparse</strong> - the agent receives a reward of -1 at every time step and a reward of 0 if it solves the task</li>
<li><strong>Very Sparse</strong> - the agent receives a reward of 0 at every time step and a reward of 1 if it solves the task</li>
</ul>
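<p>The three reward conventions above can be summarized in a short sketch; the goal-distance threshold below is illustrative, not the exact value used by the environments:</p>

```python
import numpy as np

# Illustrative extrinsic-reward conventions for a goal-reaching task,
# where "solved" means the achieved position is within a threshold of the goal.
def extrinsic_reward(achieved, goal, mode="sparse", threshold=0.05):
    d = np.linalg.norm(achieved - goal)
    if mode == "dense":
        return -d                        # proportional to distance from goal
    solved = d < threshold
    if mode == "sparse":
        return 0.0 if solved else -1.0   # -1 per step, 0 on success
    return 1.0 if solved else 0.0        # "very sparse": 1 only on success
```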
<h3 id="resets">Resets</h3>
<p>I made versions of each task that:</p>
<ul>
<li><strong>Never resets</strong> the environment before the max number of time steps</li>
<li><strong>Only resets</strong> the environment early <strong>after successful attempts</strong></li>
<li><strong>Only resets</strong> the environment early <strong>when the task becomes impossible</strong></li>
<li><strong>Resets</strong> the environment early <strong>after successes and failures</strong></li>
</ul>
<h3 id="observation-types">Observation types</h3>
<p>The following observations can be used:</p>
<ul>
<li><strong>State space</strong> (<em>i.e.</em> kinematics) of the robot (default)</li>
<li><strong>RGB images</strong> from three different views</li>
<li><strong>Depth maps</strong> corresponding to the RGB images above</li>
<li><strong>Touch sensor data</strong> to measure contacts between the gripper and the environment</li>
</ul>
<h2 id="network-architecture-and-training-details">Network Architecture and Training Details</h2>
<p>All agents in this project are trained using kinematic inputs, <em>i.e.</em> the positions and velocities of the robot’s end effector as well as any objects in the scene. The agent’s policy and value function networks are both parameterized as feedforward neural networks. The policy network consists of two hidden layers: the first with a ReLU activation and the second with a Tanh activation. The output of the second layer is fed to a module that models the robot’s actions as a Gaussian distribution (<em>i.e.</em> predicts the mean and standard deviation). The value function network also contains two hidden layers, both with ReLU activations. Both the policy and value networks take the robot’s state as input.</p>
<p>The agent’s dynamics model is a three-layer neural network with ReLU activations on the two hidden layers and a Tanh activation on the output layer. The dynamics model takes as input the robot’s state (<script type="math/tex">s_t</script>) and current action (<script type="math/tex">a_t</script>) and outputs a prediction of either (1) the robot’s next state (<script type="math/tex">\hat s_{t+1}</script>) or (2) the change in the robot’s state (<script type="math/tex">\Delta \hat s_t</script>).</p>
<p>All networks are trained separately using the Adam optimizer with different learning rates.</p>
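<p>A rough numpy sketch of the policy network’s forward pass follows; the layer widths and random weights are placeholders rather than the trained parameters (biases omitted for brevity):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 13))      # state -> first hidden layer
W2 = rng.normal(size=(64, 64))      # first -> second hidden layer
W_mu = rng.normal(size=(4, 64))     # second hidden layer -> action mean
log_std = np.zeros(4)               # learned log standard deviation

def policy_forward(s):
    h1 = np.maximum(0.0, W1 @ s)    # first hidden layer: ReLU
    h2 = np.tanh(W2 @ h1)           # second hidden layer: Tanh
    mu = W_mu @ h2                  # mean of the action Gaussian
    return mu, np.exp(log_std)      # actions ~ Normal(mu, std)
```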
<h1 id="results">Results</h1>
<h2 id="solving-the-reaching-task">Solving the Reaching Task</h2>
<p>As mentioned above, the reaching task is the simplest task in the Fetch robotics suite. It is therefore a great environment to verify the baseline PPO implementation, as well as investigate whether adding dense intrinsic rewards will improve exploration and learning of sparse tasks. As shown in the plots below, all agents readily solve the Reaching task, converging to 100% accuracy by the end of training.</p>
<p align="center">Learning progress on the Reaching Task</p>
<p align="center">
<img src="/assets/final_project/gifs/reach_10.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/reach_50.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/reach_90.gif" width="200" height="120" />
</p>
<p align="center">
<a href="/assets/final_project/plots/reach_total_reward.png"><img src="/assets/final_project/plots/reach_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/reach_solved.png"><img src="/assets/final_project/plots/reach_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/reach_intrinsic.png"><img src="/assets/final_project/plots/reach_intrinsic.png" width="320" height="240" /></a>
</p>
<p><br /></p>
<h2 id="solving-the-pushing-task">Solving the Pushing Task</h2>
<p>Unlike the Reaching task, the baseline PPO agent was unable to solve the Pushing task (red curve in the plots below). However, adding intrinsic rewards as an exploration bonus led to rapid convergence; by the end of training the agent was able to solve <script type="math/tex">\gt 96\%</script> of the episodes. It is interesting to note that some of the learned policies had a propensity to roll the block while pushing it to the target. I suspect this is because the environment does not provide detailed information about the block, such as its size and moment of inertia. Under these conditions the dynamics model cannot accurately predict the state of the block, so the agent is incentivized to move the block in interesting ways as it pushes it toward the goal.</p>
<p align="center">Learning progress on the Pushing Task</p>
<p align="center">
<img src="/assets/final_project/gifs/push_10.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/push_50.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/push_90.gif" width="200" height="120" />
</p>
<p align="center">
<a href="/assets/final_project/plots/push_total_reward.png"><img src="/assets/final_project/plots/push_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/push_solved.png"><img src="/assets/final_project/plots/push_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/push_intrinsic.png"><img src="/assets/final_project/plots/push_intrinsic.png" width="320" height="240" /></a>
</p>
<p><br /></p>
<h2 id="analysis-of-hyperparameters-for-the-pushing-task">Analysis of Hyperparameters for the Pushing Task</h2>
<p>The Pushing Task is the simplest non-trivial task in the OpenAI Robotics Suite. It therefore provided a good environment to investigate how tuning the various hyperparameters affects overall performance. This section details the analyses used to select the hyperparameters used in most of the experiments.</p>
<h3 id="number-of-hidden-units">Number of Hidden Units</h3>
<p>Tuning the number of neurons in the hidden layers is an extremely important aspect of training neural networks. If the network is too large, it risks over-fitting the data; likewise, if the network is too small, it can be difficult to learn anything at all. In this experiment, I varied the size of the hidden layers from 16 to 1024 neurons. In general, bigger networks were more performant than smaller ones. For most of the experiments shown, I settled on networks with 64 neurons, since 64-unit networks perform similarly to the larger networks but have fewer parameters and require less wall time to train.</p>
<p align="center">
<a href="/assets/final_project/plots/hidden_total_reward.png"><img src="/assets/final_project/plots/hidden_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/hidden_solved.png"><img src="/assets/final_project/plots/hidden_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/hidden_intrinsic.png"><img src="/assets/final_project/plots/hidden_intrinsic.png" width="320" height="240" /></a>
</p>
<h3 id="number-of-parallel-workers">Number of Parallel Workers</h3>
<p>Since PPO is <em>on-policy</em>, training requires data collected with the current policy. This is challenging because the distributions of states, rewards, and actions change every time the model is updated. One way around this issue is to parallelize the environment and use multiple workers to collect a large and varied amount of training data. The number of workers required to solve a given task is likely proportional to the complexity of the task. For example, solving the Reaching task with four or fewer workers is relatively easy, while solving the Pushing task with so few workers is vastly more difficult (data not shown). In this experiment, I varied the number of parallel workers collecting rollout data with the current policy. With only 8 workers, the model learns only up to a certain point before performance begins to decrease. In contrast, increasing the number of workers to 16 and beyond greatly improved convergence. Based on these results and the number of cores available on my CPU, I chose 32 parallel workers for all of the experiments.</p>
<p align="center">
<a href="/assets/final_project/plots/workers_total_reward.png"><img src="/assets/final_project/plots/workers_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/workers_solved.png"><img src="/assets/final_project/plots/workers_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/workers_intrinsic.png"><img src="/assets/final_project/plots/workers_intrinsic.png" width="320" height="240" /></a>
</p>
<h3 id="discount-factor-gamma">Discount Factor <script type="math/tex">\gamma</script></h3>
<p>The discount factor <script type="math/tex">\gamma</script> is a critical hyperparameter that determines how much the agent values future rewards. I performed experiments comparing <script type="math/tex">\gamma = 0.99, 0.95</script>, and <script type="math/tex">0.90</script>. Of the three values, <script type="math/tex">0.95</script> had the best performance.</p>
<p align="center">
<a href="/assets/final_project/plots/gamma_total_reward.png"><img src="/assets/final_project/plots/gamma_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/gamma_solved.png"><img src="/assets/final_project/plots/gamma_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/gamma_intrinsic.png"><img src="/assets/final_project/plots/gamma_intrinsic.png" width="320" height="240" /></a>
</p>
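<p>To make the effect of <script type="math/tex">\gamma</script> concrete, the discounted return over a reward sequence can be computed with a simple backward recursion:</p>

```python
# Discounted return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
# computed via the backward recursion G_t = r_t + gamma * G_{t+1}.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A higher gamma weights a distant reward more heavily: a reward of 1
# arriving 10 steps out contributes gamma**10 to the return.
g_high = discounted_return([0.0] * 10 + [1.0], 0.99)
g_low = discounted_return([0.0] * 10 + [1.0], 0.90)
```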
<h3 id="dynamics-model-learning-rate">Dynamics Model Learning Rate</h3>
<p>The prediction error of the dynamics model is used to generate the intrinsic reward that encourages the agent to explore. As such, the intrinsic rewards should be relatively large at the beginning of training, when the agent lacks understanding of the environment, and smaller once the agent is able to predict the consequences of its actions. If the dynamics model learns too quickly, the intrinsic rewards will prematurely shrink and the agent won’t explore efficiently. If, on the other hand, the dynamics model learns too slowly, large intrinsic rewards will result in excessive exploration. To determine which learning rate would be best for the Fetch robotics tasks, I varied the dynamics model learning rate over three orders of magnitude. I observed that, while the agent was able to learn for every setting of the learning rate, smaller learning rates caused the agent to learn fastest.</p>
<p align="center">
<a href="/assets/final_project/plots/dynamics_lr_total_reward.png"><img src="/assets/final_project/plots/dynamics_lr_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/dynamics_lr_solved.png"><img src="/assets/final_project/plots/dynamics_lr_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/dynamics_lr_intrinsic.png"><img src="/assets/final_project/plots/dynamics_lr_intrinsic.png" width="320" height="240" /></a>
</p>
<h2 id="solving-the-pick-and-place-task">Solving the Pick and Place Task</h2>
<p>The Pick and Place task was readily solved using the same set of hyperparameters that were used to solve the Pushing task. As can be seen in the plots below, agents trained on the Pick and Place task were able to solve <script type="math/tex">\approx 80 \%</script> of the episodes by the end of training.</p>
<p align="center">Learning progress on the Pick and Place Task</p>
<p align="center">
<img src="/assets/final_project/gifs/pick_10.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/pick_50.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/pick_90.gif" width="200" height="120" />
</p>
<p align="center">
<a href="/assets/final_project/plots/pick_total_reward.png"><img src="/assets/final_project/plots/pick_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/pick_solved.png"><img src="/assets/final_project/plots/pick_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/pick_intrinsic.png"><img src="/assets/final_project/plots/pick_intrinsic.png" width="320" height="240" /></a>
</p>
<p><br /></p>
<h2 id="solving-the-sliding-task">Solving the Sliding Task</h2>
<p>The Sliding task seemed to be the most challenging. Its difficulty is likely due to the agent having fewer timesteps to interact with the puck, owing to the low friction between the puck and the surface of the table. Unlike Reaching and Pushing, trained agents were only able to solve <script type="math/tex">\sim 80 \%</script> of episodes near the end of training. Furthermore, networks with hidden layers of size 64 were insufficient for solving the Sliding task (data not shown); sliding required networks with at least 128 neurons. The Sliding task was also sensitive to the dynamics model learning rate. It is also worth mentioning that the Sliding task required 40 million environment steps, roughly double the number of environment interactions required to solve the Pushing task.</p>
<p align="center">Learning progress on the Sliding Task</p>
<p align="center">
<img src="/assets/final_project/gifs/slide_10.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/slide_50.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/slide.gif" width="200" height="120" />
</p>
<p align="center">
<a href="/assets/final_project/plots/slide_total_reward.png"><img src="/assets/final_project/plots/slide_total_reward.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/slide_solved.png"><img src="/assets/final_project/plots/slide_solved.png" width="320" height="240" /></a>
<a href="/assets/final_project/plots/slide_intrinsic.png"><img src="/assets/final_project/plots/slide_intrinsic.png" width="320" height="240" /></a>
</p>
<p><br /></p>
<h1 id="summary">Summary</h1>
<p>Consistent with previous works, my results demonstrate that curiosity-driven exploration can be used to solve challenging tasks with sparse rewards. In particular, I have shown that using the prediction error as the intrinsic reward can encourage an agent to solve a diverse set of robotics tasks. In addition to the intrinsic reward signal, properly tuning various hyperparameters plays a significant role in ensuring that the agent can learn in the presence of sparse rewards. This was especially true for the Sliding task, as it seemed the most sensitive to the selection of hyperparameters. Surprisingly, while one set of hyperparameters was sufficient for solving Reaching, Pushing, and Pick and Place, Sliding required significant tuning and many more environment interactions.</p>
<h1 id="future-directions">Future Directions</h1>
<p>The work presented here provides an excellent starting point for future research. In this section, I will briefly discuss a few ideas that are motivating my next set of experiments.</p>
<h2 id="new-tasks">New Tasks</h2>
<p><br /></p>
<p align="center">More Challenging Manipulation Tasks </p>
<p align="center">
<img src="/assets/final_project/gifs/hook.gif" width="200" height="120" />
<img src="/assets/final_project/gifs/stack.gif" width="200" height="120" />
</p>
<p><br /></p>
<h2 id="learning-from-pixels">Learning from Pixels</h2>
<p>Learning control policies from pixels is currently an active area of research. Building off of my current results, I will perform experiments comparing how learning different feature representations for images may affect intrinsic motivation and exploration. In particular, I will investigate feature representations such as random convolutional neural network features (RF), variational autoencoder (VAE) features, and inverse dynamics features (IDF) [<a href="https://arxiv.org/abs/1808.04355">2</a>].</p>
<h2 id="combining-multiple-modalities">Combining Multiple Modalities</h2>
<p>While previous papers have largely focused on learning a feature space for a single modality such as images for Atari games or joint kinematics for continuous control, it is not clear which modalities are most important for solving robotic manipulation tasks. Robots are often equipped with one or more sensors that measure various aspects of their state and environment: (1) encoders that measure positions and velocities of its joints; (2) cameras that provide visual perception; and (3) tactile sensors that measure contacts with the environment. I therefore intend to address the following question:</p>
<ul>
<li>Does including additional sensor modalities in the feature space result in better exploration policies?</li>
</ul>
<p>Although there are numerous ways to combine the feature representations for the different modalities, I will initially focus on learning vector representations, and concatenating those features as input to the dynamics model to generate intrinsic rewards. In a separate set of experiments I will also investigate whether the learned multimodal features are able to improve policy learning.</p>
<h1 id="references">References</h1>
<ol>
<li><a href="https://arxiv.org/abs/1705.05363">Curiosity-driven Exploration by Self-supervised Prediction</a></li>
<li><a href="https://arxiv.org/abs/1808.04355">Large-Scale Study of Curiosity-Driven Learning</a></li>
<li><a href="https://arxiv.org/abs/1810.12894">Exploration by Random Network Distillation</a></li>
<li><a href="https://arxiv.org/abs/1312.5602">Playing Atari with Deep Reinforcement Learning</a></li>
<li><a href="https://arxiv.org/abs/1509.02971">Continuous control with deep reinforcement learning</a></li>
<li><a href="https://ieeexplore.ieee.org/abstract/document/170605">Curious model-building control systems</a></li>
<li><a href="https://arxiv.org/abs/1103.5708">Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments</a></li>
<li><a href="http://people.idsia.ch/~juergen/ieeecreative.pdf">Formal Theory of Creativity, Fun, and Intrinsic Motivation</a></li>
<li><a href="https://pdfs.semanticscholar.org/2547/be25e1e07728aa0966a0354e90664816d15e.pdf">Reinforcement Driven Information Acquisition In Non-Deterministic Environments</a></li>
<li><a href="https://arxiv.org/abs/1611.07507">Variational Intrinsic Control</a></li>
<li><a href="https://arxiv.org/abs/1606.01868">Unifying Count-Based Exploration and Intrinsic Motivation</a></li>
<li><a href="https://papers.nips.cc/paper/4642-exploration-in-model-based-reinforcement-learning-by-empirically-estimating-learning-progress">Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress</a></li>
<li><a href="https://pdfs.semanticscholar.org/fb3c/6456708b0e143f545d77dc8ec804eb947395.pdf">Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks</a></li>
<li><a href="https://arxiv.org/abs/1810.01176">EMI: Exploration with Mutual Information</a></li>
<li><a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization Algorithms</a></li>
<li><a href="https://arxiv.org/abs/1802.09464">Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research</a></li>
<li><a href="https://arxiv.org/abs/1606.01540">OpenAI Gym</a></li>
</ol>Exploring Exploration2019-03-16T00:00:00+00:002019-03-16T00:00:00+00:00https://jmichaux.github.io/week6<p>This week I learned about Exploration and Intrinsic Motivation.</p>
<h1 id="overview-of-exploration-vs-exploitation">Overview of Exploration vs. Exploitation</h1>
<p>The goal of a reinforcement learning agent is to learn how to solve a task by maximizing its long-term expected return. In order to do that, the agent must learn about the environment by first taking actions and then using the rewards it received to determine whether those actions were successful or not.</p>
<p align="center"><a href="/assets/week6/rl.png"><img src="/assets/week6/rl.png" width="320" height="240" /></a></p>
<p align="center">Graphic credit to <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/simsek-lecture1.pdf">Andrew Barto</a></p>
<p>One of the major challenges faced by a learning agent is determining whether or not it has learned enough about the environment to actually solve the task. This challenge is known as the <em>Exploration-Exploitation Dilemma</em>. <em>Exploration</em> refers to the agent taking an action in order to gather more information about the world. <em>Exploitation</em>, on the other hand, refers to the agent choosing the most rewarding action given what it already knows about the world. The dilemma results from the fact that exploring the world to gather information and exploiting current knowledge are often mutually exclusive. By properly balancing this trade-off between exploration and exploitation, an agent can learn how to optimally perform a given task.</p>
<h1 id="multi-armed-bandits">Multi-armed bandits</h1>
<p>To understand this dilemma better, let’s look at one of the standard problems in reinforcement learning known as the <em>Multi-armed bandit</em> problem. A multi-armed bandit is a simplified Markov Decision Process <script type="math/tex">M = \langle A, R \rangle</script>, where <script type="math/tex">A</script> is the set of actions (<em>i.e.</em> “arms”), <script type="math/tex">R</script> is the reward function, and there is only one state. We can think of a <em>k-armed</em> bandit as a row of slot machines, where each of the <script type="math/tex">k</script> actions corresponds to pulling one of the levers. The goal is to maximize the sum of rewards by learning, through trial and error, which arms yield the highest payout. The agent faces the exploration-exploitation dilemma at the beginning of each step.</p>
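<p>The environment just described is simple enough to write down directly. The sketch below implements a k-armed bandit with Gaussian rewards; the arm means are made up for illustration and, in a real problem, would be hidden from the agent.</p>

```python
import random

class KArmedBandit:
    """A k-armed bandit: one state, k actions, stochastic scalar rewards."""
    def __init__(self, means):
        self.means = means  # true expected reward of each arm (unknown to the agent)

    def pull(self, arm):
        # Reward = true arm mean plus unit Gaussian noise.
        return random.gauss(self.means[arm], 1.0)

random.seed(0)
bandit = KArmedBandit([0.1, 0.5, 0.9])
# Repeatedly pulling one arm lets us estimate its hidden mean.
rewards = [bandit.pull(2) for _ in range(1000)]
avg = sum(rewards) / len(rewards)  # should be close to 0.9
```

An agent, of course, does not know which arm is best in advance; that is exactly what the exploration strategies below try to discover.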
<p align="center"><a href="/assets/week6/octopus-bandit.jpeg"><img src="/assets/week6/octopus-bandit.jpeg" width="320" height="240" /></a></p>
<p align="center">Graphic credit <a href="https://www.google.com/search?q=multi+armed+bandit&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiS-v7vgojhAhUJ9YMKHSuYAiQQ_AUIDigB&biw=1027&bih=455">unknown</a></p>
<h2 id="exploration-finding-the-best-strategy">Exploration: finding the best strategy</h2>
<p>There are many different exploration strategies an agent could use to find the best action. Here are a few:</p>
<ul>
<li><strong><script type="math/tex">\bf \text{Greedy}</script></strong></li>
<li><strong><script type="math/tex">\bf \epsilon \text{-Greedy}</script></strong></li>
<li><strong><script type="math/tex">\bf \text{Decaying } \epsilon \text{-Greedy}</script></strong></li>
<li><strong><script type="math/tex">\bf \text{Optimistic Initialization}</script></strong></li>
</ul>
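<p>The second strategy in the list above can be sketched in a few lines. The agent below keeps incremental sample-average estimates of each arm's value and, with probability <script type="math/tex">\epsilon</script>, explores a random arm instead of exploiting its current best guess. The arm means are hypothetical; a decaying schedule would simply shrink <code>eps</code> over time.</p>

```python
import random

random.seed(1)
MEANS = [0.2, 0.8, 0.5]   # hypothetical true arm values (hidden from the agent)
K = len(MEANS)
Q = [0.0] * K             # sample-average value estimates
N = [0] * K               # pull counts

def pull(arm):
    return random.gauss(MEANS[arm], 1.0)

def select(eps):
    # Explore with probability eps, otherwise exploit the current best estimate.
    if random.random() < eps:
        return random.randrange(K)
    return max(range(K), key=lambda a: Q[a])

for _ in range(5000):
    a = select(eps=0.1)           # constant epsilon; decaying eps is a common variant
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]     # incremental sample-average update

best = max(range(K), key=lambda a: Q[a])  # should identify arm 1 (mean 0.8)
```

Greedy selection is the special case <code>eps=0</code>, which can lock onto a suboptimal arm after an unlucky early draw; that failure mode is what motivates the other strategies in the list.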
<h1 id="challenges-in-the-full-mdp-case">Challenges in the full MDP case</h1>
<p>While the above exploration strategies seem to work well for small MDPs, they are likely insufficient for the most difficult problems. Why is that? First, reinforcement learning tasks specified by researchers often rely on a hand-engineered reward function, which can be problematic if the reward is not actually informative. Second, many problem domains currently being studied have large, continuous state and/or action spaces, and large, high-dimensional state spaces are difficult to explore efficiently. Finally, some problems are naturally formulated as having a <em>sparse reward</em>. That is, the agent only receives a reward when it completes the task. These problems are challenging because there are no environmental signals to guide the agent toward the goal. So how do we solve these problems? One approach is to use an idea from animal psychology called <em>intrinsic motivation</em>.</p>
<h1 id="intrinsic-motivation-in-humans-and-animals">Intrinsic Motivation in Humans and Animals</h1>
<p>According to (some) psychologists, <em>motivation</em> can be characterized as the set of <strong><em>forces</em></strong> that influence an organism to act and direct its behavior toward certain activities. There are two types of motivation: <em>Extrinsic Motivation</em> and <em>Intrinsic Motivation</em>. Extrinsic motivation refers to an organism being driven to do something because of an external reward. Intrinsic motivation, on the other hand, refers to an organism being driven to do something simply because it is enjoyable. Research studies in animals and humans suggest that intrinsic motivation is largely independent of basic biological drives such as satisfying hunger, seeking shelter, etc. Furthermore, it is currently thought that extrinsic motivation needs to be learned, while intrinsic motivation is inherent. As such, intrinsic motivation is likely used to augment an organism's knowledge and/or skill in a way that has biological utility at a later point in life. Indeed, according to a classic <a href="https://psycnet.apa.org/record/1961-04411-001">paper</a> by Robert White:</p>
<script type="math/tex; mode=display">\text{“The motivation needed to obtain competence cannot be} \\
\text{wholly derived from sources of energy currently} \\
\text{conceptualized as drives or instincts.”}</script>
<p>So the natural question is: what is intrinsically motivating (<em>i.e.</em> rewarding) about certain activities, and how can an organism use that motivation to learn? Most experts have settled on organisms using some notion of novelty, surprise, or incongruity to update what they know about the world. In the next section we will look at how an artificial agent can use intrinsic motivation to learn.</p>
<h1 id="intrinsic-motivation-in-reinforcement-learning">Intrinsic Motivation in Reinforcement Learning</h1>
<p>To understand intrinsic motivation and its role in learning autonomous behaviors, let’s first refine our view of reinforcement learning. Here, we can imagine the environment being split into two separate parts: an external environment and an internal environment. As with the traditional view of RL, the external environment provides sensations (<em>e.g.</em> observations) and reward signals that are external to the agent. Those external signals are then passed to a critic within the internal environment, which generates internal reward signals. We can think of the external reward signals as things such as food or money, while the internal rewards correspond to biochemical signals generated within the brain.</p>
<p align="center"><a href="/assets/week6/rl2.png"><img src="/assets/week6/rl2.png" width="320" height="240" /></a></p>
<p align="center">Graphic credit to <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/simsek-lecture1.pdf">Andrew Barto</a></p>
<p>We can then view the reward signal as a combination of the extrinsic and intrinsic rewards. This amounts to adding an <em>exploration bonus</em> to the reward provided by the environment:</p>
<script type="math/tex; mode=display">R_{t}(s, a, s') = \underbrace{r^{e}_{t}(s, a, s')}_{\text{extrinsic}} + \underbrace{r^{i}_{t}(s, a, s')}_{\text{intrinsic}}</script>
<p>As mentioned above, intrinsic motivation usually reflects some notion of novelty or surprise that the agent experiences <em>internally</em> while exploring. The fact that intrinsic motivation does not come from the external environment means that intrinsic motivation is independent of the task. It also means that the intrinsic reward <script type="math/tex">r^{i}_{t}(s, a, s')</script> <em>should</em> change as the agent explores and is no longer surprised by its experiences. Note that this is in contrast to the extrinsic reward <script type="math/tex">r^{e}_{t}(s, a, s')</script> which is always the same given the same <script type="math/tex">(s,a,s')</script> tuple.</p>
<p>So how do we measure intrinsic motivation, novelty, or surprise for autonomously learning agents? We’ll discuss some recent works below.</p>
<h3 id="count-based-exploration-of-novel-states">Count-based exploration of novel states</h3>
<p>One form of novelty involves some measure of how “surprised” the agent is upon entering some state <script type="math/tex">s</script>. For simple discrete environments, this amounts to keeping track of how many times the agent has visited each state <script type="math/tex">s</script> (<script type="math/tex">N(s)</script>), and modeling surprise as a function that decays with the count, such as <script type="math/tex">h(s) = \frac{1}{[N(s)]^p}f(s)</script> for some <script type="math/tex">p \gt 0</script>. As the state-visitation count <script type="math/tex">N(s)</script> increases, the state <script type="math/tex">s</script> becomes less novel and the agent receives less reward.</p>
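<p>For a discrete state space, the count-based bonus described above is a few lines of code. This sketch uses <script type="math/tex">p = 1/2</script> (a common choice) and a constant <script type="math/tex">f(s) = 1</script>, so a first visit is maximally novel and the bonus decays as <script type="math/tex">1/\sqrt{N(s)}</script>.</p>

```python
from collections import defaultdict

visits = defaultdict(int)  # N(s): how many times each state has been seen

def count_bonus(state, p=0.5):
    """Count-based intrinsic reward r_i(s) = 1 / N(s)**p.

    The bonus decays toward zero as the state is revisited."""
    visits[state] += 1
    return 1.0 / visits[state] ** p

first = count_bonus("s0")                      # novel state: bonus = 1.0
later = [count_bonus("s0") for _ in range(99)] # 100th visit: bonus = 1/sqrt(100) = 0.1
```

The pseudo-count approach replaces the exact table <code>visits</code> with counts derived from a learned density model, so that <em>similar</em> (not just identical) states share novelty.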
<p>While this may work well for small discrete Markov Decision Processes, the above count-based method will fail in large continuous state spaces. This is because any given state is rarely visited multiple times in high-dimensional spaces. To get around this, <a href="https://arxiv.org/abs/1606.01868">Bellemare <em>et al</em></a> showed that it is possible to learn a parameterized density model that can be used to derive what the authors call <em>pseudo-counts</em>. The pseudo-counts are a measure of how often the agent has visited <em>similar</em> states.</p>
<h3 id="curiosity-as-a-measure-of-prediction-error">Curiosity as a measure of prediction error</h3>
<p>Another form of novelty is related to <em>curiosity</em>. The curious agent needs to learn how to predict the next state given its current state and current action. Here, curiosity is modeled as the prediction error between the agent’s prediction and the actual next state. In other words, the agent needs to learn a forward dynamics model of the environment. Indeed, according to Schmidhuber,</p>
<script type="math/tex; mode=display">\text{“The direct goal of curiosity and boredom is to improve} \\
\text{the world model. The indirect goal is to ease the learning} \\
\text{of new goal-directed action sequences.”} \\</script>
<p>Curiosity therefore encourages the agent to select actions that will reduce the uncertainty in the agent’s ability to predict future consequences. The uncertainty is highest in unexplored regions of the state space, or in regions that have complex dynamics. So how does the agent actually learn to make these predictions?</p>
<p>Instead of predicting the next state in the raw pixel space, <a href="https://arxiv.org/abs/1705.05363">Pathak <em>et al</em></a> learned what they call an <em>Intrinsic Curiosity Module</em> (ICM). The ICM consists of two neural networks that model the forward and inverse dynamics of the environment. The ICM learns a lower-dimensional feature space of the environment by training the inverse model to predict the action taken, given the feature representations of the current and next states. This module eventually learns to model only the important aspects of the environment, such as the agent itself as well as objects that affect the agent. Finally, the learned features of the current state, along with the current action, are used to predict the features of the next state. As mentioned above, the prediction error of this forward dynamics model serves as an intrinsic reward that encourages the agent’s curiosity.</p>
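<p>The core prediction-error mechanism can be illustrated without any neural networks. In the sketch below, a linear model stands in for the ICM's forward dynamics network: the intrinsic reward is the squared prediction error, and each observed transition also improves the model with one gradient step, so the "surprise" (and hence the reward) decays as the dynamics become familiar. The linear, noiseless environment is of course a major simplification of the real setup.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 2                           # toy state and action dimensions
W_true = rng.normal(size=(S, S + A))  # hidden true linear dynamics (simulation only)
W = np.zeros((S, S + A))              # the agent's learned forward model

def curiosity_reward(s, a, s_next, lr=0.05):
    """Intrinsic reward = forward-model prediction error; also updates the model."""
    global W
    x = np.concatenate([s, a])
    err = s_next - W @ x              # surprise: how wrong was the prediction?
    W += lr * np.outer(err, x)        # one SGD step on the squared error
    return float(err @ err)

rewards = []
for _ in range(2000):
    s, a = rng.normal(size=S), rng.normal(size=A)
    s_next = W_true @ np.concatenate([s, a])
    rewards.append(curiosity_reward(s, a, s_next))
# Early transitions are surprising (large reward); later ones are not.
```

This decaying-reward behavior is exactly the non-stationarity of <script type="math/tex">r^{i}_{t}</script> discussed earlier: the same transition yields less intrinsic reward once the model has learned it.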
<h1 id="references">References</h1>
<ol>
<li><a href="http://science.sciencemag.org/content/153/3731/25">Curiosity and Exploration</a></li>
<li><a href="https://psycnet.apa.org/record/1961-04411-001">Motivation Reconsidered: The Concept of Competence</a></li>
<li><a href="https://www.youtube.com/watch?v=aJI_9SoBDaQ&t=4008s">DeepHack.RL: Andrew Barto - Intrinsically motivated reinforcement learning</a></li>
<li><a href="https://arxiv.org/abs/1606.01868">Unifying Count-Based Exploration and Intrinsic Motivation</a></li>
<li><a href="ftp://ftp.idsia.ch/pub/juergen/curioussingapore.pdf">Curious model-building control systems</a></li>
<li><a href="https://arxiv.org/abs/1705.05363">Curiosity-driven Exploration by Self-supervised Prediction</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLpIxOj-HnDsNfvOwRKLsUobmnF2J1l5oV">CMU Deep RL Lectures 16-18</a></li>
</ol>Off-Policy Actor-Critic Algorithms2019-03-10T00:00:00+00:002019-03-10T00:00:00+00:00https://jmichaux.github.io/week4b<p>This post extends my learning about Actor-Critic algorithms to the <em>off-policy</em> setting.</p>
<h1 id="deep-deterministic-policy-gradients">Deep Deterministic Policy Gradients</h1>
<p><em>Deep Deterministic Policy Gradients</em> (DDPG) is an extension of the DQN algorithm that is able to learn control policies for continuous action spaces. DDPG is an Actor-Critic algorithm, so it learns both a policy and a value function (Q function). Like DQN, DDPG makes use of experience replay buffers and a frozen target network to stabilize training. The critic used in DDPG, however, differs from the critic used in DQN in two key ways. First, the critic used in DDPG takes as input both the states <em>and</em> actions. Second, the critic does not output a Q-value for every action (otherwise, there’d be infinitely many outputs!); instead, the architecture has a single output neuron that estimates the value of the given state-action pair.</p>
<p align="center">
<a href="/assets/week4/dqn_ddpg_cropped.png"><img src="/assets/week4/dqn_ddpg_cropped.png" /></a>
</p>
<p>So how does training work? Training the critic network <script type="math/tex">Q_{\phi}(s,a)</script> in DDPG is very similar to how it is trained in DQN. Training the actor <script type="math/tex">\pi_{\theta}(s)</script>, on the other hand, relies on the <a href="http://proceedings.mlr.press/v32/silver14.pdf"><em>Deterministic Policy Gradient Theorem</em></a> proved by Silver <em>et al</em> in 2014:</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{j=1}^{N} \color{green}{\nabla_{a}Q_{\phi}(s_j, \mu_{\theta}(s_j))} \color{purple} {\nabla_{\theta} \mu_{\theta}(s_j)}.
\end{equation}</script>
<p>Notice that we are taking the gradient of <script type="math/tex">Q_{\phi}(s,a)</script> with respect to the actions <script type="math/tex">a</script>. The intuition for this is as follows. The critic can evaluate the action that the actor proposes from a particular state. By making small changes to that action, the critic will tell us whether the new action is an improvement over the previous action. If the new action does have a higher Q value, the gradient <script type="math/tex">\nabla_{a}Q_{\phi}</script> is used to update the parameters of the actor in the right direction.</p>
<p>It is also important to note that unlike actor-critic algorithms like A2C and PPO, the actor in DDPG maps states <em>directly</em> (<em>i.e. deterministically</em>) to actions rather than outputting a distribution over actions. Since the actor isn’t sampling actions, how, then, do we actually get exploration? One method involves adding Gaussian noise or Ornstein-Uhlenbeck process noise to the deterministic action. Another <a href="https://arxiv.org/abs/1706.01905">method</a> involves adaptively perturbing the parameters of the actor network.</p>
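<p>The Ornstein-Uhlenbeck process mentioned above produces temporally correlated noise, which tends to give smoother exploration than independent Gaussian noise on physical control tasks. A minimal sketch follows; the parameter values are commonly used defaults, not values tied to any particular paper.</p>

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise.

    theta pulls the noise back toward mu; sigma scales the random kicks."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self):
        # Euler-Maruyama discretization: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0,1)
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0))
        self.x += dx
        return self.x

noise = OUNoise()
samples = [noise.sample() for _ in range(10000)]
mean = sum(samples) / len(samples)  # mean-reverting: stays near mu = 0
```

During training, each action becomes <code>mu_theta(s) + noise.sample()</code>; the mean-reversion keeps exploration from drifting arbitrarily far from the policy's action.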
<h2 id="ddpg-algorithm">DDPG Algorithm</h2>
<ol>
<li>For episode = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> For <script type="math/tex">t = 1,T</script></li>
<li> Select action according to policy and noise process:<br />
<script type="math/tex">a_t = \mu_{\theta}(s_t) + \mathcal{N}_{t}</script></li>
<li>  Execute action <script type="math/tex">a_t</script> and observe reward <script type="math/tex">r_t</script> and next state <script type="math/tex">s_{t+1}</script></li>
<li> Store transition <script type="math/tex">(s_t, a_t, r_t, s_{t+1})</script> in replay buffer</li>
<li> Sample minibatch from replay buffer
<script type="math/tex">\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}</script></li>
<li> Calculate targets <script type="math/tex">y_j</script>:
<script type="math/tex">y_j = r_j + \gamma Q_{\bf{\phi'}}(s_{j+1}, \mu_{\theta'}(s_{j+1}))</script></li>
<li> Calculate the loss:
<script type="math/tex">\begin{align}
\nonumber
L = \frac{1}{N} \sum_{j=1}^{N} (r_j + \gamma Q_{\bf{\phi'}}(s_{j+1}, \mu_{\theta'}(s_{j+1})) - Q_{\phi}(s_j, a_j))^2 \\ \nonumber
\end{align}</script></li>
<li> Update the critic network parameters:
<script type="math/tex">\begin{align}
\nonumber
\phi \longleftarrow \phi + \alpha \nabla L(\phi) \\ \nonumber
\end{align}</script></li>
<li> Approximate the policy gradient:
<script type="math/tex">\begin{align}
\nonumber
\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{j=1}^{N} \nabla_{a}Q_{\phi}(s_j, \mu_{\theta}(s_j)) \nabla_{\theta} \mu_{\theta}(s_j) \\ \nonumber
\end{align}</script></li>
<li>
<p> Update the policy parameters</p>
<script type="math/tex; mode=display">\begin{align}
\nonumber
\theta \longleftarrow \theta + \beta \nabla J(\theta)
\end{align}</script>
</li>
<li>
<p> Update the target networks</p>
<script type="math/tex; mode=display">\begin{align}
\nonumber
\theta' \longleftarrow \tau \theta + (1 - \tau)\theta' \\ \nonumber
\phi' \longleftarrow \tau \phi + (1 - \tau)\phi'
\end{align}</script>
</li>
</ol>
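<p>Step 12's soft target update (Polyak averaging) is worth seeing concretely: with a small <script type="math/tex">\tau</script>, the target network trails the online network slowly, which is what keeps the bootstrapped targets stable. In this sketch plain Python lists stand in for network parameter tensors; with PyTorch you would iterate over the two modules' <code>parameters()</code> instead.</p>

```python
def soft_update(target, source, tau=0.005):
    """Polyak averaging: target <- tau*source + (1 - tau)*target, element-wise."""
    for i in range(len(target)):
        target[i] = tau * source[i] + (1.0 - tau) * target[i]
    return target

theta = [1.0, -2.0]          # online network "parameters" (toy values)
theta_target = [0.0, 0.0]    # target network starts elsewhere
for _ in range(1000):
    soft_update(theta_target, theta, tau=0.005)
# After many updates the target has drifted most of the way toward theta.
```

Setting <code>tau=1.0</code> recovers DQN-style hard updates, where the target is copied wholesale every <script type="math/tex">k</script> steps.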
<h2 id="ddpg-results">DDPG Results</h2>
<p align="center">
<a href="/assets/week4/ddpg_rewards.png"><img src="/assets/week4/ddpg_rewards.png" width="320" height="240" /></a>
<a href="/assets/week4/ddpg_biped.gif"><img src="/assets/week4/ddpg_biped.gif" width="320" height="240" /></a>
</p>
<h1 id="twin-delayed-ddpg">Twin Delayed DDPG</h1>
<p>Although DDPG is capable of solving challenging continuous control tasks, training can be very difficult in practice. Twin Delayed DDPG (TD3) uses a few tricks that greatly improve algorithm performance:</p>
<ol>
<li><span style="color:red">Target policy smoothing</span></li>
<li><span style="color:blue">Clipped Double Q learning</span></li>
<li><span style="color:green">Delaying update of policy and target networks</span></li>
</ol>
<h2 id="td3-algorithm">TD3 Algorithm</h2>
<ol>
<li>For episode = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> For <script type="math/tex">t = 1,T</script></li>
<li> Select action according to policy and noise process:<br />
<script type="math/tex">a_t = \mu_{\theta}(s_t) + \mathcal{N}_{t}</script></li>
<li> Execute action <script type="math/tex">a_t</script> and observe reward <script type="math/tex">r_t</script> and next state <script type="math/tex">s_{t+1}</script></li>
<li> Store transition <script type="math/tex">(s_t, a_t, r_t, s_{t+1})</script> in replay buffer</li>
<li> Sample minibatch from replay buffer
<script type="math/tex">\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}</script></li>
<li>  Add clipped noise to the target policy’s action:
<script type="math/tex">\color{red}{\tilde{a} = \pi_{\theta'}(s_{j+1}) + \epsilon, \epsilon \sim \text{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)}</script></li>
<li> Calculate targets:
<script type="math/tex">\color{blue}{y_j = r_j + \gamma \underset{i}{\operatorname{min}}Q_{\phi_{i}'}(s_{j+1}, \tilde{a})}</script></li>
<li> Calculate the loss:
<script type="math/tex">\begin{align}
\nonumber
\color{blue}{L = \frac{1}{N} \sum_{j=1}^{N} (y_j - \underset{i}{\operatorname{min}}Q_{\phi_{i}}(s_j, a_j))^2}\\ \nonumber
\end{align}</script></li>
<li> Update the critic network parameters:
<script type="math/tex">\begin{align}
\nonumber
\color{blue}{\phi_{i} \longleftarrow \phi_{i} + \alpha \nabla L(\phi)} \\ \nonumber
\end{align}</script></li>
<li>  If <script type="math/tex">t \text{ mod } d = 0</script> then:</li>
<li> Approximate the policy gradient:
<script type="math/tex">\begin{align}
\nonumber
\color{green}{\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{j=1}^{N} \nabla_{a}Q_{\phi}(s_j, \mu_{\theta}(s_j)) \nabla_{\theta} \mu_{\theta}(s_j)} \\ \nonumber
\end{align}</script></li>
<li>
<p> Update the policy parameters:</p>
<script type="math/tex; mode=display">\begin{align}
\nonumber
\color{green}{\theta \longleftarrow \theta + \beta \nabla J(\theta)}
\end{align}</script>
</li>
<li>
<p> Update the target networks:</p>
<script type="math/tex; mode=display">\begin{align}
\nonumber
\color{green}{\theta' \longleftarrow \tau \theta + (1 - \tau)\theta'} \\ \nonumber
\color{green}{\phi_{i}' \longleftarrow \tau \phi_{i} + (1 - \tau)\phi_{i}'}
\end{align}</script>
</li>
</ol>
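<p>The two colored modifications in steps 7-9 are easy to isolate. The sketch below shows target policy smoothing (clipped Gaussian noise on the target action) and the clipped double-Q target (bootstrapping from the smaller of the two critics, which counteracts overestimation). Scalars stand in for network outputs; the constants mirror the symbols in the pseudocode above.</p>

```python
import random

def smoothed_action(mu, sigma=0.2, c=0.5, rng=random.Random(0)):
    """Target policy smoothing: add Gaussian noise clipped to [-c, c]."""
    eps = max(-c, min(c, rng.gauss(0.0, sigma)))
    return mu + eps

def td3_target(r, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the more pessimistic critic."""
    return r + gamma * min(q1_next, q2_next)

# Example: critic 1 overestimates (10.0) while critic 2 says 8.0;
# the target uses the smaller value.
y = td3_target(r=1.0, q1_next=10.0, q2_next=8.0)   # 1.0 + 0.99 * 8.0
a = smoothed_action(0.3)                            # within [-0.2, 0.8]
```

The third trick (delayed updates) is just the <code>if t mod d = 0</code> guard in step 11: the actor and target networks are updated once per <code>d</code> critic updates.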
<h2 id="td3-results">TD3 Results</h2>
<p align="center">
<a href="/assets/week4/td3_rewards.png"><img src="/assets/week4/td3_rewards.png" width="320" height="240" /></a>
<a href="/assets/week4/td3_biped.gif"><img src="/assets/week4/td3_biped.gif" width="320" height="240" /></a>
</p>
<h1 id="take-aways">Take aways</h1>
<ul>
<li>TD3 works much better than the original DDPG algorithm</li>
<li>Even using a subset of the modifications like delaying the update of the policy improves learning (not shown)</li>
<li>I didn’t investigate how decaying the exploration rate over time might affect algorithm convergence, so I might want to look into that at some point</li>
</ul>
<h1 id="references">References</h1>
<ol>
<li><a href="https://arxiv.org/abs/1312.5602">Playing Atari with Deep Reinforcement Learning</a></li>
<li><a href="http://proceedings.mlr.press/v32/silver14.pdf">Deterministic Policy Gradient Algorithms</a></li>
<li><a href="https://arxiv.org/abs/1509.02971">Continuous control with deep reinforcement learning</a></li>
<li><a href="https://arxiv.org/abs/1706.01905">Parameter Space Noise for Exploration</a></li>
<li><a href="https://arxiv.org/abs/1802.09477">Addressing Function Approximation Error in Actor-Critic Methods</a></li>
</ol>Deep Q-networks2019-03-03T00:00:00+00:002019-03-03T00:00:00+00:00https://jmichaux.github.io/week4a<p>This post uses Deep Q Networks to introduce off-policy algorithms</p>
<h1 id="overview-of-off-policy-algorithms">Overview of Off-Policy Algorithms</h1>
<p>Until now I have focused on <em>on-policy</em> algorithms - <em>i.e.</em> algorithms that learn from data that were generated with the current policy. <em>Off-policy</em> algorithms, on the other hand, are able to learn from experiences (<em>e.g.</em> transitions of the form <script type="math/tex">(s, a, r, s')</script>) collected from previous policies. Because off-policy methods are able to reuse old data, they tend to be more sample-efficient than on-policy methods.</p>
<h1 id="deep-q-learning">Deep Q Learning</h1>
<p>One recent example of an off-policy method is the venerable <a href="https://arxiv.org/abs/1312.5602"><em>Deep Q Learning</em></a> algorithm that learned to play a number of Atari games with human-level performance. The use of deep neural network function approximators extended classical Q-learning beyond finite and discrete state spaces to problem domains with continuous and high-dimensional state spaces. Quite surprisingly, Deep Q-learning was able to achieve strong performance on dozens of challenging Atari games using the same set of hyperparameters.</p>
<p align="center">
<a href="/assets/week4/dqn.png"><img src="/assets/week4/dqn.png" width="256" height="192" /></a>
</p>
<p>At the core of Deep Q-learning is the Deep Q-Network (DQN). Q-networks take as input some representation of the state of the environment. For Atari games, the input could be RGB or gray-scale pixel values. For a robot manipulator, the input could include a combination of the position, linear velocity, and angular velocity of its links and/or joints. Q-networks output one Q-value per action. Because Q-networks learn the values of state-action pairs, they can be viewed as a parameterized representation of the critic <script type="math/tex">Q_{\phi}(s_t, a_t)</script> introduced in my <a href="https://jmichaux.github.io/week2/">last post</a>. Unlike policy gradient methods that learn a policy <script type="math/tex">\pi_{\theta}</script> directly, Deep Q-networks learn an <em>induced</em> policy. In other words, an action <script type="math/tex">a</script> is selected by taking the <em>argmax</em> over the set of Q-values {<script type="math/tex">Q_{\phi}(s, a_i)</script>} that the network outputs.</p>
<p>So how does Deep Q-learning work? The core of the algorithm involves the computation of the <em>temporal difference</em> (TD) error for transitions <script type="math/tex">(s_i, a_i, r_i, s_{i+1})</script> sampled from taking actions in the environment:</p>
<script type="math/tex; mode=display">\delta_i = y_i - Q_{\phi}(s_i, a_i)</script>
<p>where <script type="math/tex">y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{i+1}, a)</script> is a bootstrapped estimate of the Q function. Similar to supervised learning, we minimize the squared loss between the <em>target values</em> <script type="math/tex">y_i</script> and the outputs of the network <script type="math/tex">Q_{\phi}(s_i, a_i)</script>:</p>
<script type="math/tex; mode=display">L = \frac{1}{N} \sum_{i}^{N} (r_i + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{i+1}, a) - Q_{\phi}(s_i, a_i))
^2.</script>
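<p>The loss above is mechanical to compute once the targets are formed. In this sketch, small dictionaries of per-action Q-values stand in for the online network <script type="math/tex">Q_{\phi}</script> and the frozen target network <script type="math/tex">Q_{\phi'}</script>; the states, actions, and rewards are made-up toy values.</p>

```python
def dqn_loss(batch, q, q_target, gamma=0.99):
    """Mean squared TD error over a minibatch of (s, a, r, s', done) transitions.

    q and q_target map a state to a list of Q-values, one per action."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # Terminal transitions have no bootstrap term.
        y = r if done else r + gamma * max(q_target[s_next])
        total += (y - q[s][a]) ** 2
    return total / len(batch)

q        = {"s0": [0.0, 1.0], "s1": [0.5, 0.0]}   # online network (toy table)
q_frozen = {"s0": [0.0, 1.0], "s1": [0.5, 0.0]}   # frozen target copy
batch = [("s0", 1, 1.0, "s1", False),             # y = 1 + 0.99*0.5 = 1.495
         ("s1", 0, -1.0, "s0", True)]             # terminal: y = -1
loss = dqn_loss(batch, q, q_frozen)               # (0.495^2 + 1.5^2) / 2
```

Note that the gradient of this loss flows only through <code>q</code>, never through the frozen <code>q_target</code>; that asymmetry is the stable-target trick discussed below.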
<h2 id="tricks-of-the-trade">Tricks of the trade</h2>
<p>Although Deep Q-learning is conceptually straightforward, there are a few tricks required to get the algorithm to converge in practice.</p>
<ul>
<li><strong>Stable Target Network</strong>
<ul>
<li>Because DQN is <em>not really</em> a supervised learning algorithm, the target <script type="math/tex">y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\bf{\phi}}(s_{i+1}, a)</script> would change every time the network parameters <script type="math/tex">\phi</script> are updated. This is bad because <script type="math/tex">y_i</script> and <script type="math/tex">Q_{\phi}(s_i, a_i)</script> moving in the same direction can cause the algorithm to diverge. We can avoid this by computing a different target <script type="math/tex">y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\bf{\phi'}}(s_{i+1}, a)</script> using a frozen target network <script type="math/tex">Q_{\phi'}</script> that is updated to match <script type="math/tex">Q_{\phi}</script> every <script type="math/tex">k</script> iterations.</li>
</ul>
</li>
<li><strong>Replay Buffer</strong>
<ul>
<li>When an agent acts in an environment, the set of experiences <script type="math/tex">(s_i, a_i, r_i, s_{i+1})</script> for a single episode are temporally correlated. This violates the <em>i.i.d.</em> assumption required of most learning algorithms. We can de-correlate the experiences by placing them in a replay buffer and randomly sampling them to update the Q-network.</li>
</ul>
</li>
<li><strong>Stacked frames</strong>
<ul>
<li>Single images don’t convey dynamic information, so stacking multiple frames allows the agent to infer movement in the environment.</li>
</ul>
</li>
</ul>
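<p>The replay buffer in particular is short enough to sketch in full. A bounded deque silently evicts the oldest transitions, and uniform random sampling breaks the temporal correlation between consecutive experiences, as described above.</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the left
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement de-correlates the minibatch.
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(250):                      # push more than the capacity
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)                    # only the 100 most recent survive
```

Variants such as prioritized replay replace the uniform <code>sample</code> with sampling proportional to TD error, but the interface stays the same.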
<h2 id="dqn-algorithm">DQN Algorithm</h2>
<ol>
<li>For episode = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> For <script type="math/tex">t = 1,T</script></li>
<li> Perform <script type="math/tex">\epsilon</script>-greedy action selection:<br />
<script type="math/tex">% <![CDATA[
a_t =
\begin{cases}
\text{random action}, & \text{with probability } \epsilon \\
\underset{a}{\operatorname{argmax}}Q_{\bf{\phi}}(s_{t}, a), & \text{otherwise}
\end{cases} %]]></script></li>
<li>  Execute action <script type="math/tex">a_t</script> and observe reward <script type="math/tex">r_t</script> and next state <script type="math/tex">s_{t+1}</script></li>
<li> Store transition <script type="math/tex">(s_t, a_t, r_t, s_{t+1})</script> in replay buffer</li>
<li> Sample minibatch from replay buffer
<script type="math/tex">\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}</script></li>
<li> Calculate targets <script type="math/tex">y_j</script>:
<script type="math/tex">% <![CDATA[
y_j =
\begin{cases}
r_j, & \text{if episode terminates at step } j+1 \\
r_j + \gamma \underset{a'}{\operatorname{max}}Q_{\bf{\phi'}}(s_{j+1}, a'), & \text{otherwise}
\end{cases} %]]></script></li>
<li> Calculate the loss:
<script type="math/tex">L = \frac{1}{N} \sum_{j=1}^{N} (r_j + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{j+1}, a) - Q_{\phi}(s_j, a_j))
^2.</script></li>
<li> Update the network parameters <script type="math/tex">\phi \longleftarrow \phi + \alpha \nabla L(\phi)</script></li>
<li> If <script type="math/tex">t\text{ mod}(k) = 0</script>, update target <script type="math/tex">\phi' \longleftarrow \phi</script></li>
</ol>
<h2 id="results">Results</h2>
<p align="center">
<a href="/assets/week4/dqn_pong_rewards.png"><img src="/assets/week4/dqn_pong_rewards.png" width="320" height="240" /></a>
<a href="/assets/week4/pong.gif"><img src="/assets/week4/pong.gif" width="160" height="120" /></a>
<figcaption>DQN rewards for Pong.</figcaption>
</p>
<!-- <figure class="half">
<img src="/assets/week4/dqn_pong_rewards.png" width="320" height="240">
<img src="/assets/week4/pong.gif" width="160" height="120">
<figcaption>Caption describing these two images.</figcaption>
</figure> -->
<h1 id="take-aways">Take aways</h1>
<p>One major drawback of Deep Q Networks is that they can only handle low-dimensional, discrete action spaces. This makes DQN unsuitable for robotics control problems where the action space is often both high-dimensional and continuous. Consider for a moment a standard 7 degree of freedom robot manipulator. If we discretize the action space so that there are 5 actions for every degree of freedom, we end up with a network that must have <script type="math/tex">5^7 = 78125</script> outputs! The situation would be much worse for a robot like <a href="https://en.wikipedia.org/wiki/Atlas_(robot)">Atlas</a> that has 28 degrees of freedom. The natural question is, of course, can we do better? I’ll try to address this question in my next post.</p>
<h1 id="references">References</h1>
<ol>
<li><a href="https://arxiv.org/abs/1312.5602">Playing Atari with Deep Reinforcement Learning</a></li>
</ol>

<p><em>On-Policy Actor-Critic Algorithms</em> (Jon Michaux, 2019-02-24): <a href="https://jmichaux.github.io/week3">https://jmichaux.github.io/week3</a></p>
<p>This post introduces Actor-Critic Algorithms as an extension of basic policy gradient algorithms such as <em>REINFORCE</em>.</p>
<h1 id="overview-of-actor-critic-style-algorithms">Overview of Actor-Critic Style algorithms</h1>
<p>In my <a href="https://jmichaux.github.io/week2/">last post</a> I focused on deriving and implementing the most basic class of policy gradient algorithms known as <em>REINFORCE</em>. We saw that we were able to improve the algorithm’s performance by subtracting a baseline. Why is this the case? It turns out that subtracting a good baseline can reduce the variance in the policy gradient estimate and empirically leads to faster convergence of the algorithm.</p>
<p>So how do we choose the baseline? One possible baseline is some form of running average of the normalized empirical return. This approach appears to work fine for small toy problems. However, in practice, it is common to learn a state-dependent baseline such as the state value function <script type="math/tex">V^{\pi}</script> or to learn a baseline that is dependent on both states <em>and</em> actions such as the state-action value function <script type="math/tex">Q^{\pi}</script>. Algorithms that learn both a policy (actor) function and value (critic) function are called <em>Actor-Critic</em> methods. The actor network (<em>i.e.</em> policy function) learns how the agent should choose actions, while the critic network (<em>i.e.</em> value function) evaluates how well the actor is doing. In our case, <script type="math/tex">V^{\pi}</script> or <script type="math/tex">Q^{\pi}</script> would be the critic.</p>
<h1 id="advantage-actor-critic-algorithm">Advantage Actor-Critic Algorithm</h1>
<p>The <em>Advantage Actor-Critic</em> (A2C) algorithm is the synchronous version of the famous <a href="https://arxiv.org/abs/1602.01783"><em>A3C</em></a> algorithm published in 2016 by DeepMind. Both A2C and A3C can be viewed as extensions to the classic <em>REINFORCE</em> algorithm. While <em>REINFORCE</em> uses the <em>reward to go</em> to estimate the policy gradient, A2C uses something called an <em>advantage function</em>. There are a few different ways to estimate the advantage function, but one common approach is to consider the difference between the <em>Q-function</em> and <em>state-value function</em>:</p>
<script type="math/tex; mode=display">\begin{equation}
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t).
\end{equation}</script>
<p>The advantage tells us how much better it is, on average, to choose action <script type="math/tex">a_t</script> in state <script type="math/tex">s_t</script> than to act according to the current policy. In other words, if <script type="math/tex">a_t</script> is better than average then we should increase its probability. Likewise, if it is worse than average then we should decrease its probability. As the algorithm learns, the stochastic policy <script type="math/tex">\pi_{\theta}(a_t \vert s_t)</script> becomes more and more deterministic over time.</p>
<p>Here, we are approximating the advantage function <script type="math/tex">A_t</script> by using the critic to bootstrap an estimate of <script type="math/tex">Q^{\pi}</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t).
\end{equation}</script>
<p>Note that this approximation introduces bias, but has been shown to reduce variance in practice.</p>
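This estimate is a one-line computation. Here is a sketch with illustrative names; the `dones` flag drops the bootstrap term at terminal states, a standard detail not written out in the equation:

```python
import numpy as np

def td_advantage(rewards, values, next_values, dones, gamma=0.99):
    """A(s_t, a_t) ~ r(s_t, a_t) + gamma * V(s_{t+1}) - V(s_t).
    `dones` zeroes the bootstrap term at terminal states."""
    return rewards + gamma * (1.0 - dones) * next_values - values
```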
<h2 id="a2c-algorithm">A2C Algorithm</h2>
<ol>
<li>For k = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> Sample <script type="math/tex">N</script> rollouts <script type="math/tex">\{\tau^{(i)}\}</script> of length <script type="math/tex">T</script> from <script type="math/tex">N</script> actors acting in parallel according to <script type="math/tex">\pi_{\theta}(a_t \vert s_t)</script></li>
<li> Bootstrap an Advantage function estimate using temporal difference updates <br />
<script type="math/tex">A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t)</script></li>
<li> Approximate the policy gradient <script type="math/tex">\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t \vert s_t)A^{\pi}(s_t, a_t)</script></li>
<li> Compute the Value function loss <script type="math/tex">L(\phi) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} (r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t))^2</script></li>
<li> Update the policy network parameters <script type="math/tex">\theta \longleftarrow \theta + \alpha \nabla J(\theta)</script></li>
<li> Update the value network parameters <script type="math/tex">\phi \longleftarrow \phi - \beta \nabla L(\phi)</script></li>
</ol>
<p>Note that it is possible to share network parameters between the actor and the critic. In that case, we will only have one loss function and one parameter update step.</p>
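As a sketch of how the pieces fit together, here is a single A2C-style update for a tabular softmax policy in one state. Everything here (the single state, the step sizes, the tabular actor and critic) is a toy simplification of the neural-network version:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setup: a tabular softmax policy over 3 actions in a single state.
theta = np.zeros(3)        # policy logits (actor)
v = 0.0                    # value estimate for the single state (critic)
alpha, beta = 0.1, 0.1     # actor / critic step sizes

def a2c_step(theta, v, action, reward, next_value, gamma=0.99):
    """One A2C update: TD advantage, policy step, value step."""
    advantage = reward + gamma * next_value - v
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0     # grad of log pi(a|s) for a softmax policy
    theta = theta + alpha * advantage * grad_log_pi
    v = v + beta * advantage       # move V toward the TD target
    return theta, v

theta, v = a2c_step(theta, v, action=1, reward=1.0, next_value=0.0)
```

After one rewarded step, the logit of the chosen action rises relative to the others and the value estimate moves toward the TD target, which is exactly the actor/critic division of labor described above.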
<h2 id="results">Results</h2>
<p align="center">
<img src="/assets/week3/a2c_total_reward.png" width="320" height="240" />
<img src="/assets/week3/a2c_pong.gif" width="160" height="120" />
</p>
<p align="center">
<img src="/assets/week3/a2c_total_loss.png" width="320" height="240" />
<img src="/assets/week3/a2c_policy_loss.png" width="320" height="240" />
</p>
<p align="center">
<img src="/assets/week3/a2c_value_loss.png" width="320" height="240" />
<img src="/assets/week3/a2c_entropy_loss.png" width="320" height="240" />
</p>
<h1 id="proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</h1>
<p>Although policy gradient algorithms such as <em>REINFORCE</em> and <em>A2C</em> are well-known, they are often not the best choice when trying to solve difficult problems. One reason is that these standard policy gradient methods suffer from poor sample efficiency and require a large amount of data to train. Another reason is that it is hard to choose a step size that works for the entire course of training. This is because policy gradient methods are <em>on-policy</em>, and the distributions of states, rewards, and actions change as the model is updated. As a result, small changes in the network parameters <script type="math/tex">\theta</script> can have large effects on the performance of the policy <script type="math/tex">\pi_{\theta}</script>.</p>
<p><a href="https://arxiv.org/abs/1502.05477"><em>Trust Region Policy Optimization</em></a> (TRPO) is an on-policy algorithm developed by Schulman <em>et al</em> that mitigates these issues by optimizing a surrogate objective:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathbb{E_t}[\frac{\pi_{\theta}(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)} A_t] \\
\text{s.t.} \quad \mathbb{E_t}[KL[\pi_{\theta_{old}}(\cdot \vert s_t), \pi_{\theta}(\cdot \vert s_t)]] \le \delta
\end{equation}</script>
<p>TRPO works by enforcing a <em>trust region constraint</em>. This constraint on the KL divergence ensures that the new policy <script type="math/tex">\pi_{\theta}</script> is close to the old policy <script type="math/tex">\pi_{\theta_{old}}</script>. The theoretical foundations of TRPO show that the policy can be guaranteed to improve monotonically at every step (more on this in a later post).</p>
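For discrete action spaces the trust-region condition itself is cheap to evaluate. The check below is only a sketch of the constraint (the real TRPO update solves a constrained optimization rather than testing candidates), and the value of <script type="math/tex">\delta</script> is illustrative:

```python
import numpy as np

def categorical_kl(p_old, p_new):
    """KL[pi_old(.|s) || pi_new(.|s)] for discrete action distributions
    (assumes both distributions have full support)."""
    return float(np.sum(p_old * (np.log(p_old) - np.log(p_new))))

# Check whether a proposed policy stays inside the trust region.
delta = 0.01                       # illustrative constraint size
p_old = np.array([0.5, 0.5])
p_new = np.array([0.55, 0.45])
within_trust_region = categorical_kl(p_old, p_new) <= delta
```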
<p>Although TRPO has been shown to have good performance on a set of benchmark tasks, it can be tricky to implement. <a href="https://arxiv.org/abs/1707.06347"><em>Proximal Policy Optimization</em></a> (PPO) is an extension of TRPO that is much easier to implement in practice. PPO optimizes the following surrogate objective:</p>
<script type="math/tex; mode=display">\begin{equation}
J^{CLIP}(\theta) = \mathbb{E_t}[min(r_{t}(\theta)A_t, clip(r_{t}(\theta),1-\epsilon, 1+\epsilon)A_t)]
\end{equation}</script>
<p>where</p>
<script type="math/tex; mode=display">\begin{equation}
r_{t}(\theta) = \frac{\pi_{\theta}(a_t \vert s_t)}{\pi_{\theta_{old}}(a_t \vert s_t)}.
\end{equation}</script>
<p>Like TRPO, PPO makes conservative updates to the policy parameters <script type="math/tex">\theta</script> and has also been shown to work well in practice.</p>
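The clipped objective can be written as a standalone function over a batch of probability ratios and advantage estimates; the function name and batch layout are mine:

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """J^CLIP: the mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))
```

With <script type="math/tex">\epsilon = 0.2</script>, a ratio of 1.5 and a positive advantage of 1 contribute only 1.2 to the objective, so the update cannot profit from moving the policy too far from the old one.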
<h2 id="ppo-algorithm">PPO Algorithm</h2>
<ol>
<li>For k = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> Sample <script type="math/tex">N</script> rollouts <script type="math/tex">\{\tau^{(i)}\}</script> of length <script type="math/tex">T</script> from <script type="math/tex">N</script> actors acting in parallel according to <script type="math/tex">\pi_{\theta}(a_t \vert s_t)</script></li>
<li> Bootstrap an Advantage function estimate using temporal difference updates <br />
<script type="math/tex">A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t)</script></li>
<li> Calculate the gradient of the surrogate objective <script type="math/tex">J^{CLIP}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} min(r_{t}(\theta)A_t(s_t, a_t), clip(r_{t}(\theta),1-\epsilon, 1+\epsilon)A_t(s_t, a_t))</script></li>
<li> Compute the Value function loss <script type="math/tex">L(\phi) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} (r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t))^2</script></li>
<li> Update the policy network parameters <script type="math/tex">\theta \longleftarrow \theta + \alpha \nabla J(\theta)</script></li>
<li> Update the value network parameters <script type="math/tex">\phi \longleftarrow \phi - \beta \nabla L(\phi)</script></li>
</ol>
<h2 id="results-1">Results</h2>
<p align="center">
<img src="/assets/week3/ppo_total_reward.png" width="320" height="240" />
<img src="/assets/week3/ppo_pong.gif" width="160" height="120" />
</p>
<p align="center">
<img src="/assets/week3/ppo_total_loss.png" width="320" height="240" />
<img src="/assets/week3/ppo_policy_loss.png" width="320" height="240" />
</p>
<p align="center">
<img src="/assets/week3/ppo_value_loss.png" width="320" height="240" />
<img src="/assets/week3/ppo_entropy_loss.png" width="320" height="240" />
</p>
<h1 id="take-aways">Takeaways</h1>
<p>Once I had a working implementation of A2C, it wasn’t very difficult to extend the code to also get PPO to work. Here’s a list of problems I ran into.</p>
<ul>
<li>The length of the policy rollouts can have a pretty significant effect on how long the algorithm takes to train</li>
<li>Similarly, the number of parallel actors can also affect algorithm convergence</li>
<li>Normalizing rewards, returns, and/or advantages is <del>probably</del> always worth trying</li>
</ul>
<h1 id="next-steps">Next steps</h1>
<p>For the past two weeks I have focused on learning about and implementing standard <em>on-policy</em> learning algorithms. Next, I will learn about <em>off-policy</em> actor-critic style algorithms like <a href="https://arxiv.org/abs/1509.02971"><em>Deep Deterministic Policy Gradients</em> (DDPG)</a>.</p>
<h1 id="references">References</h1>
<ol>
<li><a href="">CS294 Lecture 6 Actor Critic Algorithms</a></li>
<li><a href="">CS294 Lecture 9 Advanced Policy Gradients</a></li>
<li><a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a></li>
<li><a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization Algorithms</a></li>
</ol>

<p><em>An Introduction to Policy Gradient Methods</em> (Jon Michaux, 2019-02-17): <a href="https://jmichaux.github.io/week2">https://jmichaux.github.io/week2</a></p>
<p>This post begins my deep dive into Policy Gradient methods.</p>
<h1 id="overview-of-reinforcement-learning">Overview of Reinforcement Learning</h1>
<p>The goal of reinforcement learning is for an agent to learn to solve a given task by maximizing some notion of external reward. Instead of receiving explicit instructions, the agent learns how to choose actions by exploring and interacting with the environment. The reward signal serves as a way to encode whether the actions taken by the agent were successful. By maximizing the accumulated reward (<em>i.e.</em> the <em>return</em>) over time, the agent eventually learns how to choose the best action given its current state.</p>
<p align="center"><a href="/assets/week2/rl.png"><img src="/assets/week2/rl.png" width="320" height="240" /></a></p>
<p align="center">Graphic credit to <a href="https://feryal.github.io/">Feryal Behbahani</a></p>
<!-- <figure class="half">
<img src="/assets/week2/rl.png" width="30%" align="center">
<figcaption>Graphic credit to <a href="https://feryal.github.io/">Feryal Behbahani</a></figcaption>
</figure> -->
<p>Basic reinforcement learning problems are often formulated as a <em>Markov Decision Process</em> <script type="math/tex">M = \langle S, A, R, P, \gamma \rangle</script>, where <script type="math/tex">S</script> is the set of states, <script type="math/tex">A</script> is the set of actions the agent can take, <script type="math/tex">R=R(s,a)</script> is the reward function, <script type="math/tex">P</script> describes the state transition dynamics, and <script type="math/tex">\gamma</script> is the discount factor. The goal is to find a policy <script type="math/tex">\pi</script> that maps states <script type="math/tex">s \in S</script> to optimal or near-optimal actions <script type="math/tex">a \in A</script>.</p>
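To make the tuple concrete, here is a toy two-state MDP written out explicitly; the states, rewards, and dynamics are invented for illustration:

```python
# A two-state MDP M = <S, A, R, P, gamma>, written out explicitly.
S = ["s0", "s1"]
A = ["left", "right"]
gamma = 0.9

# R(s, a): immediate reward for taking action a in state s
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 1.0, ("s1", "right"): 0.0}

# P(s' | s, a): state transition dynamics (deterministic here)
P = {("s0", "left"):  {"s0": 1.0},
     ("s0", "right"): {"s1": 1.0},
     ("s1", "left"):  {"s0": 1.0},
     ("s1", "right"): {"s1": 1.0}}

# A policy maps states to actions; this one happens to collect reward 1
# at every step.
pi = {"s0": "right", "s1": "left"}
```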
<h1 id="policy-gradients">Policy Gradients</h1>
<p>Policy Gradient methods are a family of reinforcement learning algorithms that rely on optimizing a parameterized policy directly. As alluded to above, the goal of the policy is to maximize the total expected reward:</p>
<script type="math/tex; mode=display">\begin{equation}
\mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)].
\end{equation}</script>
<p>Policy gradient methods have a number of benefits over other reinforcement learning methods. First, by optimizing the policy directly, we are not required to also learn a value function (although we’ll later see that learning a value function can help). Second, policy gradient methods can handle both discrete and continuous states and actions, making them well suited for high-dimensional problems. This is in contrast to methods such as <em>Deep Q-learning</em>, which struggle in high dimensions because they must assign a score to each possible action.</p>
<p>In addition to their benefits, policy gradient methods also have a few drawbacks. By definition, policy gradient methods are <em>on-policy</em>. This means that they are only able to learn from data that was collected with the current policy. As a result, policy gradient methods are not very sample efficient. Another issue is that policy gradient methods are not guaranteed to converge to a global optimum, and solutions may get stuck in local optima. Lastly, policy gradient methods tend to suffer from high variance. However, even with these drawbacks, policy gradient methods such as TRPO and PPO are still considered to be state-of-the-art reinforcement learning algorithms.</p>
<h1 id="reinforce">REINFORCE</h1>
<h2 id="derivation">Derivation</h2>
<p>In deriving the most basic policy gradient algorithm, <em>REINFORCE</em>, we seek the optimal policy <script type="math/tex">\pi^{\ast}</script> that will maximize the total expected reward:</p>
<script type="math/tex; mode=display">\begin{equation}
\pi^{\ast} = \underset{\pi}{\operatorname{argmax}} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)]
\end{equation}</script>
<p>where</p>
<script type="math/tex; mode=display">\begin{equation}
\tau = (s_1, a_1, ..., s_T, a_T)\\
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
R(\tau) = \sum_{t=1}^{T}r(s_t, a_t)\\
\end{equation}</script>
<script type="math/tex; mode=display">\begin{equation}
p_{\theta}(\tau) = p_{\theta}(s_1, a_1, ..., s_T, a_T).
\end{equation}</script>
<p>The <em>trajectory</em> <script type="math/tex">\tau</script> is a sequence of states and actions experienced by the agent, <script type="math/tex">R(\tau)</script> is the <em>return</em>, and <script type="math/tex">p_{\theta}(\tau)</script> is the probability of observing that particular sequence of states and actions. It is important to note that <script type="math/tex">p_{\theta}(\tau)</script> is a function of both the environment transition dynamics and the policy <script type="math/tex">\pi</script>.</p>
<p>Since the policy <script type="math/tex">\pi</script> is parameterized by <script type="math/tex">\theta</script>, finding the optimal policy <script type="math/tex">\pi^{\ast}</script> is equivalent to finding the optimal parameter vector <script type="math/tex">\theta^{\ast}</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\theta^{\ast} = \underset{\theta}{\operatorname{argmax}} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)]
\end{equation}</script>
<p>Thus, we can define our objective <script type="math/tex">J(\theta)</script> to be the total expected reward:</p>
<script type="math/tex; mode=display">\begin{equation}
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)]
\end{equation}</script>
<p>One way to optimize this objective is to take the derivative and then use gradient ascent. The calculation of the gradient goes as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla J(\theta) &= \nabla \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)] \\
&= \nabla \int p_{\theta}(\tau) R(\tau) \mathrm{d\tau} \\
&= \int \nabla p_{\theta}(\tau) R(\tau) \mathrm{d\tau} \\
&= \int p_{\theta}(\tau) \nabla log p_{\theta}(\tau) R(\tau) \mathrm{d\tau} \\
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\nabla log p_{\theta}(\tau) R(\tau)]
\end{align} %]]></script>
<p>Now that we have an expression for the policy gradient, this quantity can be approximated by sampling trajectories from <script type="math/tex">p_{\theta}(\tau)</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}[\nabla log p_{\theta}(\tau) R(\tau)].
\end{equation}</script>
<p>So how do we get the policy gradient approximation in terms of the parameterized policy <script type="math/tex">\pi_{\theta}</script>? Let’s start by expanding <script type="math/tex">p_{\theta}(\tau)</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p_{\theta}(\tau) &= p_{\theta}(s_1, a_1, ..., s_T, a_T) \\
&= p(s_1) \cdot \pi_{\theta}(a_1|s_1) \cdot p(s_2 | s_1, a_1) \cdots p(s_{T+1} | s_T, a_T) \\
&= p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t|s_t) p(s_{t+1} | s_t, a_t).
\end{align} %]]></script>
<p>Now, taking the logarithm of both sides we get:</p>
<script type="math/tex; mode=display">\begin{equation}
log p_{\theta}(\tau) = log p(s_1) + \sum_{t=1}^{T} log \pi_{\theta}(a_t|s_t) + \sum_{t=1}^{T} log p(s_{t+1} | s_t, a_t).
\end{equation}</script>
<p>And differentiating <em>w.r.t.</em> <script type="math/tex">\theta</script>:</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla log p_{\theta}(\tau) = \displaystyle\sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t|s_t).
\end{equation}</script>
<p>Combining equations <script type="math/tex">(4)</script>, <script type="math/tex">(13)</script>, and <script type="math/tex">(18)</script> we get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla J(\theta) &\approx \frac{1}{N}\sum_{i=1}^{N}[\nabla log p_{\theta}(\tau) R(\tau)] \\
&= \frac{1}{N}\sum_{i=1}^{N} (\sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t|s_t))R(\tau) \\
&= \frac{1}{N}\sum_{i=1}^{N} (\sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t|s_t)) (\sum_{t=1}^{T} r(s_t, a_t)).
\end{align} %]]></script>
<p>Now that we have an approximation of the policy gradient we can write down our first algorithm!</p>
<h2 id="reinforce-algorithm">REINFORCE Algorithm</h2>
<ol>
<li>For k = 1,2,… <script type="math/tex">\textbf{do}</script></li>
<li> Sample a set of trajectories <script type="math/tex">\{\tau^{(i)}\}</script> from <script type="math/tex">\pi_{\theta}(a_t \vert s_t)</script></li>
<li> Approximate the policy gradient <script type="math/tex">\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} (\sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t \vert s_t)) (\sum_{t=1}^{T} r(s_t, a_t))</script></li>
<li> Update the parameters <script type="math/tex">\theta \longleftarrow \theta + \alpha \nabla J(\theta)</script></li>
</ol>
<p>The intuition behind <em>REINFORCE</em> is that the parameters <script type="math/tex">\theta</script> are updated in proportion to the total reward (or <em>reward to go</em>). In other words, the log probabilities of good actions are increased while the log probabilities of bad actions are decreased.</p>
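Steps 2–4 of the algorithm can be sketched with arrays standing in for the sampled data; the shapes and names here are my own:

```python
import numpy as np

def reinforce_gradient(grad_log_probs, returns):
    """Monte Carlo policy-gradient estimate:
    (1/N) * sum_i (sum_t grad log pi(a_t|s_t)) * R(tau_i).

    grad_log_probs: shape (N, T, dim_theta), per-step gradients
    returns:        shape (N,), total reward R(tau) per trajectory
    """
    per_traj = grad_log_probs.sum(axis=1)              # sum over time steps
    return (per_traj * returns[:, None]).mean(axis=0)  # average over rollouts

def reinforce_update(theta, grad_log_probs, returns, alpha=0.01):
    """Gradient ascent step on J(theta)."""
    return theta + alpha * reinforce_gradient(grad_log_probs, returns)
```

In a real implementation the `grad_log_probs` come from backpropagation through the policy network; here they are supplied directly so the weighting-by-return structure is visible.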
<h2 id="cartpole-results">CartPole Results</h2>
<p align="center">
<a href="/assets/week2/CartPole-v0-total_reward.png"><img src="/assets/week2/CartPole-v0-total_reward.png" width="320" height="240" /></a>
<a href="/assets/week2/tr.gif"><img src="/assets/week2/tr.gif" width="320" height="240" /></a>
</p>
<h2 id="reducing-variance">Reducing Variance</h2>
<h3 id="trick-1-using-the-reward-to-go">Trick 1: Using the Reward-to-Go</h3>
<p>Instead of weighting the gradients of the log probabilities by the total reward of the trajectory, we can weight them by the <em>Reward-to-Go</em> at each time step <script type="math/tex">t</script>. The intuition here is that the action taken at time <script type="math/tex">t</script> cannot influence rewards received before time <script type="math/tex">t</script>, so those past rewards should not contribute to its gradient weight.</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t|s_t)
\color{blue}{(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}))}
\end{equation}</script>
<p align="center">
<a href="/assets/week2/CartPole-v0-reward-to-go.png"><img src="/assets/week2/CartPole-v0-reward-to-go.png" width="320" height="240" /></a>
<a href="/assets/week2/r2g.gif"><img src="/assets/week2/r2g.gif" width="320" height="240" /></a>
</p>
<h3 id="trick-2-subtracting-a-baseline">Trick 2: Subtracting a Baseline</h3>
<p>From equation <script type="math/tex">(21)</script> we see that the gradient will increase the likelihood of trajectories with positive returns and decrease the likelihood of trajectories with negative returns. This can cause problems if the agent only receives positive rewards. By subtracting a baseline from the return estimate, the gradient instead increases the likelihood of trajectories in proportion to how much better their returns are than the baseline. Note that subtracting a state-dependent baseline can be shown to leave the gradient estimate unbiased in expectation.</p>
<script type="math/tex; mode=display">\begin{equation}
\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t|s_t) \color{blue}{(\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) - v_{\phi}(s_{t}))}
\end{equation}</script>
<p align="center">
<a href="/assets/week2/CartPole-v0.png"><img src="/assets/week2/CartPole-v0.png" width="320" height="240" /></a>
<a href="/assets/week2/ac.gif"><img src="/assets/week2/ac.gif" width="320" height="240" /></a>
</p>
<h2 id="lunar-lander-results">Lunar Lander Results</h2>
<p align="center">
<a href="/assets/week2/LunarLander-v2.png"><img src="/assets/week2/LunarLander-v2.png" width="320" height="240" /></a>
<a href="/assets/week2/llac.gif"><img src="/assets/week2/llac.gif" width="320" height="240" /></a>
</p>
<h1 id="takeaways">Takeaways</h1>
<p>There are quite a number of things that make deep reinforcement learning difficult. However, in the process of struggling to get the code to work I’ve learned a few things:</p>
<ul>
<li>Normalizing rewards can improve training stability</li>
<li>Random seeds matter a lot</li>
<li>Total reward isn’t always the best indicator for the actual performance of the policy</li>
<li>Although historically important, <em>REINFORCE</em> is not a very good algorithm</li>
</ul>
<h1 id="next-steps">Next steps</h1>
<p>Moving forward, I would like to deepen my understanding of the theoretical foundations of policy gradient methods. In order to do that, I’ve decided to devote a significant amount of time learning about Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).</p>
<h1 id="references">References</h1>
<ol>
<li><a href="https://www.youtube.com/watch?v=XGmd3wcyDg8&t=0s&index=22&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37">CS294 Lecture 5 Policy Gradients</a></li>
<li><a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Simple statistical gradient-following algorithms for connectionist reinforcement learning</a></li>
</ol>