An Introduction to Policy Gradient Methods

This post begins my deep dive into Policy Gradient methods.

Overview of Reinforcement Learning

The goal of reinforcement learning is for an agent to learn to solve a given task by maximizing some notion of external reward. Instead of receiving explicit instructions, the agent learns how to choose actions by exploring and interacting with the environment. The reward signal serves as a way to encode whether the actions taken by the agent were successful. By maximizing the accumulated reward (e.g. return) over time, the agent eventually learns how to choose the best action depending on their current state.

Graphic credit to Feryal Behbahani

Basic reinforcement learning problems are often formulated as a Markov Decision Process $M = \langle S, A, R, P, \gamma \rangle$, where $S$ is the distribution of states, $A$ is the set of actions the agent can take, $R=R(s,a)$ is the reward function, $P$ are the state transition dynamics, and $\gamma$ is the discount factor. The goal is to find a policy $\pi$ that maps states $s \in S$ to optimal or near-optimal actions $a \in A$.

Policy Gradient methods are a family of reinforcement learning algorithms that rely on optimizing a parameterized policy directly. As alluded to above, the goal of the policy is to maximize the total expected reward:

Policy gradient methods have a number of benefits over other reinforcement learning methods. First, by optimizing the policy directly, it is not required that we also learn a value function (although we’ll later see that learning a value function can help). Second, policy gradient methods can handle both discrete and continuous states and actions, making them well suited for high dimensional problems. This is in contrast to methods such as Deep Q-learning, which struggles in high dimensions because it assigns scores for each possible action.

In addition to their benefits, policy gradient methods also have a few drawbacks. By definition, policy gradient methods are on-policy. This means that they are only able to learn from data that was collected with the current policy. As a result, policy gradient methods are not very sample efficient. Another issue is that policy gradient methods are not guaranteed to converge to a global optimum, and solutions may get stuck in local optima. Lastly, policy gradient methods tend to suffer from high variance. However, even with these drawbacks, policy gradient methods such as TRPO and PPO are still considered to be the state-of-the art reinforcement learning algorithms.

REINFORCE

Derivation

In deriving the most basic policy gradiant algorithm, REINFORCE, we seek the optimal policy $\pi^{\ast}$ that will maximize the total expected reward:

where

The trajectory $\tau$ is a sequence of states and actions experienced by the agent, $R(\tau)$ is the return, and $p_{\theta}(\tau)$ is the probability of observing that particular sequence of states and actions. It is important to note that $p_{\theta}(\tau)$ is a function of both the environment transition dynamics and the policy $\pi$.

Since the policy $\pi$ is parameterized by $\theta$, finding the optimal policy $\pi^{\ast}$ is equivalent to finding the optimal parameter vector $\theta^{\ast}$:

Thus, we can define our objective $J(\theta)$ to be the total expected reward:

One way to optimize this objective is to take the derivative and then use gradient ascent. The calculation of the gradient goes as follows:

Now that we have an expression for the policy gradient, it is easy to see that this quantity can approximated it by sampling trajectories from $p_{\theta}(\tau)$:

So how do we get the policy gradient approximation in terms of the parameterized policy $\pi_{\theta}$? Let’s start by expanding $p_{\theta}(\tau)$:

Now, taking the logarithm of both sides we get:

And differentiating w.r.t. $\theta$:

Combining equations $(4)$, $(13)$, and $(18)$ we get:

Now that we have an approximation of the policy gradient we can write down our first algorithm!

REINFORCE Algorithm

1. For k = 1,2,… $\textbf{do}$
2.        Sample a set of trajectories $\{\tau^{(i)}\}$ from $\pi_{\theta}(a_t \vert s_t)$
3.        Approximate the policy gradient $\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} (\sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t \vert s_t)) (\sum_{t=1}^{T} r(s_t, a_t))$
4.        Update the parameters $\theta \longleftarrow \theta + \alpha \nabla J(\theta)$

The intuition behind REINFORCE is that the parameters $\theta$ are updated proportional to the total reward (or reward to go). In other words, the log probabilities of good actions are increased while the log probabilities of bad actions are decreased.

Reducing Variance

Trick 1: Using the Reward-to-Go

Instead of weighting the gradients of log probabilities by the total reward at each time step $t$, we can instead weight them by the Reward-to-Go. The intuition here is that past rewards don’t influence future actions.

Trick 2: Subtracting a Baseline

From equation $(21)$ we see that the gradient will increase the likelihood of trajectories with positive returns, and decrease the likelihood of trajectories with negative returns. This can cause problems if the agent only receives positive rewards. But by subtracting a baseline from the return estimate, the gradient will then increase the likelihood of trajectories proportional to how much better the returns are than the baseline. Note that it can be shown that subtracting a baseline is unbiased in expectation.

Takeaways

There are quite a number of things that make deep reinforcement learning difficult. However, in the process of struggling to get the code to work I’ve learned a few things:

• Normalizing rewards can improve training stability
• Random seeds matter a lot
• Total reward isn’t always the best indicator for the actual performance of the policy
• Although historically important, REINFORCE is not a very good algorithm

Next steps

Moving forward, I would like to deepen my understanding of the theoretical foundations of policy gradient methods. In order to do that, I’ve decided to devote a significant amount of time learning about Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

Updated: