# On-Policy Actor-Critic Algorithms

This post introduces Actor-Critic Algorithms as an extension of basic policy gradient algorithms such as REINFORCE.

# Overview of Actor-Critic Style algorithms

In my last post I focused on deriving and implementing the most basic class of policy gradient algorithms known as REINFORCE. We saw that we were able to improve the algorithm’s performance by subtracting a baseline. Why is this the case? It turns out that subtracting a good baseline can reduce the variance in the policy gradient estimate and empirically leads to faster convergence of the algorithm.

So how do we choose the baseline? One possible baseline is some form of running average of the normalized empirical return. This approach appears to work fine for small toy problems. However, in practice, it is common to learn a state-dependent baseline such as the state value function $V^{\pi}$ or to learn a baseline that is dependent on both states and actions such as the state-action value function $Q^{\pi}$. Algorithms that learn both a policy (actor) function and value (critic) function are called Actor-Critic methods. The actor network (i.e. policy function) learns how the agent should choose actions, while the critic network (i.e. value function) evaluates how well the actor is doing. In our case, $V^{\pi}$ or $Q^{\pi}$ would be the critic.

The Advantage Actor-Critic (A2C) algorithm is the synchronous version of the famous A3C algorithm published in 2016 by DeepMind. Both A2C and A3C can be viewed as extensions to the classic REINFORCE algorithm. While REINFORCE uses the reward to go to estimate the policy gradient, A2C uses something called an advantage function. There are a few different ways to estimate the advantage function, but one common approach is to consider the difference between the Q-function and state-value function:

The advantage tells us, on average, how much better it is to choose action $a_t$. In other words, if $a_t$ is better than average then we should increase its probability. Likewise, if it is worse than average then we should decrease its probability. As the algorithm learns, the stochastic policy $\pi_{\theta}(a_t \vert s_t)$ becomes more and more deterministic over time.

Here, we are approximating the advantage function $A_t$ by using the critic to bootstrap an estimate of $Q^{\pi}$:

Note that this approximation introduces bias, but has been shown to reduce variance in practice.

## A2C Algorithm

1. For k = 1,2,… $\textbf{do}$
2.        Sample $N$ rollouts $\{\tau^{(i)}\}$ of length $T$ from $N$ actors acting in parallel $\pi_{\theta}(a_t \vert s_t)$
$A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t)$
4.        Approximate the policy gradient $\nabla J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} \nabla log\pi_{\theta}(a_t \vert s_t)A^{\pi}(s_t, a_t)$
5.        Compute the Value function loss $L(\phi) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} (r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t))^2$
6.        Update the policy network parameters $\theta \longleftarrow \theta + \alpha \nabla J(\theta)$
7.        Update the value network parameters $\phi \longleftarrow \phi + \beta \nabla L(\phi)$

Note that it is possible to share network parameters between the actor and the critic. In that case, we will only have one loss function and one parameter update step.

## Results      # Proximal Policy Optimization (PPO)

Although policy gradient algorithms such as REINFORCE and A2C are well-known, they are often not the best choice when trying to solve difficult problems. One reason is that these standard policy gradient methods suffer from low sample complexity and require a large amount of data to train. Another reason is that it is hard to choose a step size that works for the entire course of training. This is because policy gradient methods are on-policy, and the distributions of states, rewards, and actions change as the model is updated. As a result, small changes in the network parameters $\theta$ can have large effects on the performance of the policy $\pi_{\theta}$.

Trust Region Policy Optimization (TRPO) is an on-policy algorithm developed by Schulman et al that mitigates these issues by optimizing a surrogate objective:

TRPO works by enforcing a trust region constraint. This constraint on the KL divergence ensures that the new policy $\pi_{\theta}$ is close to the old policy $\pi_{\theta_{old}}$. The theoretical foundations of TRPO show that the policy can be guaranteed to improve monotonically at every step (more on this in a later post).

Although TRPO has been shown to have good performance on a set of benchmarks tasks, it can be tricky to implement. Proximal Policy Optimization (PPO) is an extension of TRPO that is much easier to implement in practice. PPO optimizes the following surrogate objective:

where

Like TRPO, PPO makes conservative updates to the policy parameters $\theta$ and has also been shown to work well in practice.

## PPO Algorithm

1. For k = 1,2,… $\textbf{do}$
2.        Sample $N$ rollouts $\{\tau^{(i)}\}$ of length $T$ from $N$ actors acting in parallel $\pi_{\theta}(a_t \vert s_t)$
$A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t)$
4.        Calculate the gradient of the surrogate objective $J^{CLIP}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} min(r_{t}(\theta)A_t(s_t, a_t), clip(r_{t}(\theta),1-\epsilon, 1+\epsilon)A_t(s_t, a_t))$
5.        Compute the Value function loss $L(\phi) = \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T} (r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t))^2$
6.        Update the policy network parameters $\theta \longleftarrow \theta + \alpha \nabla J(\theta)$
7.        Update the value network parameters $\phi \longleftarrow \phi + \beta \nabla L(\phi)$

## Results      # Take aways

Once I had a working implementation of A2C, it wasn’t very difficult to extend the code to also get PPO to work. Here’s a list of problems I ran into.

• The length of the policy rollouts can have a pretty significant effect on how long the algorithm takes to train
• Similarly, the number of parallel actors can also affect algorithm convergence
• Normalizing rewards, returns, and/or advantages is probably always worth trying

# Next steps

For the past two weeks I have focused on learning about and implementing standard on-policy learning algorithms. Next, I will learn about off-policy actor-critic style algorithms like Deep Deterministic Policy Gradients (DDPG).