# On-Policy Actor-Critic Algorithms

This post introduces Actor-Critic Algorithms as an extension of basic policy gradient algorithms such as *REINFORCE*.

# Overview of Actor-Critic Style Algorithms

In my last post I focused on deriving and implementing the most basic class of policy gradient algorithms known as *REINFORCE*. We saw that we were able to improve the algorithm’s performance by subtracting a baseline. Why is this the case? It turns out that subtracting a good baseline can reduce the variance in the policy gradient estimate and empirically leads to faster convergence of the algorithm.

So how do we choose the baseline? One possible baseline is a running average of the normalized empirical return. This approach works fine for small toy problems. In practice, however, it is common to learn a state-dependent baseline such as the state value function $V(s)$, or a baseline that depends on both states *and* actions, such as the state-action value function $Q(s, a)$. Algorithms that learn both a policy (actor) function and a value (critic) function are called *Actor-Critic* methods. The actor network (*i.e.* the policy function) learns how the agent should choose actions, while the critic network (*i.e.* the value function) evaluates how well the actor is doing. In our case, $V(s)$ or $Q(s, a)$ would be the critic.
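As a concrete illustration of the simplest option, a running-average baseline can be maintained in a few lines. This is only a sketch; the function and parameter names (including `alpha` as the averaging rate) are illustrative, not from any particular library:

```python
import numpy as np

def advantages_with_running_baseline(returns, baseline=0.0, alpha=0.01):
    """Subtract a running-average baseline from empirical returns.

    `baseline` is updated as an exponential moving average of the
    batch-mean return. Names and defaults here are illustrative.
    """
    returns = np.asarray(returns, dtype=np.float64)
    advantages = returns - baseline  # center returns around the baseline
    new_baseline = (1.0 - alpha) * baseline + alpha * returns.mean()
    return advantages, new_baseline
```

The updated baseline would be carried forward and reused on the next batch of rollouts.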

# Advantage Actor-Critic Algorithm

The *Advantage Actor-Critic* (A2C) algorithm is the synchronous version of the famous *A3C* algorithm published in 2016 by DeepMind. Both A2C and A3C can be viewed as extensions of the classic *REINFORCE* algorithm. While *REINFORCE* uses the *reward to go* to estimate the policy gradient, A2C uses an *advantage function*. There are a few different ways to estimate the advantage function, but one common approach is to take the difference between the *Q-function* and the *state-value function*:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

The advantage tells us, on average, how much better it is to choose action $a_t$ in state $s_t$ than to act according to the policy's average behavior. In other words, if $a_t$ is better than average then we should increase its probability; if it is worse than average then we should decrease its probability. As the algorithm learns, the stochastic policy becomes more and more deterministic over time.

Here, we approximate the advantage function by using the critic $V_\phi$ to bootstrap an estimate of $Q(s_t, a_t)$:

$$A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

Note that this approximation introduces bias, but has been shown to reduce variance in practice.
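The bootstrapped advantage estimate above can be sketched in a few lines of NumPy. The function and argument names are illustrative; `values` holds the critic's state-value estimates for each step of a rollout, `last_value` is the estimate used to bootstrap the final transition, and `dones` masks out the bootstrap at episode boundaries:

```python
import numpy as np

def td_advantages(rewards, values, last_value, dones, gamma=0.99):
    """One-step TD advantage: A_t ~ r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    # V(s_{t+1}) for each step; the rollout's final step uses last_value
    next_values = np.append(values[1:], last_value)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * next_values * not_done - values
```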

## A2C Algorithm

- For $k = 1, 2, \ldots$
- Sample rollouts of length $T$ from $N$ actors acting in parallel
- Bootstrap an advantage function estimate using temporal difference updates: $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
- Approximate the policy gradient: $\hat{g} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{A}_{i,t}$
- Compute the value function loss: $L(\phi) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \hat{R}_{i,t} - V_\phi(s_{i,t}) \right)^2$
- Update the policy network parameters: $\theta \leftarrow \theta + \alpha \hat{g}$
- Update the value network parameters: $\phi \leftarrow \phi - \beta \nabla_\phi L(\phi)$

Note that it is possible to share network parameters between the actor and the critic. In that case, we will only have one loss function and one parameter update step.
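Putting the pieces together, the losses for one A2C update can be written as the following sketch. It is plain NumPy with illustrative names; a real implementation would compute these same quantities in an autodiff framework and differentiate them, treating the advantages as constants so no gradient flows through the critic into the policy loss:

```python
import numpy as np

def a2c_losses(log_probs, advantages, values, returns):
    """Scalar losses for one A2C update (illustrative sketch).

    policy_loss: negative of the policy-gradient surrogate, i.e.
        -(1/N) * sum(log pi(a|s) * A), minimized by gradient descent.
    value_loss: mean squared error between the critic's value
        estimates and the empirical returns.
    """
    log_probs = np.asarray(log_probs, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    policy_loss = -(log_probs * advantages).mean()
    value_loss = ((returns - values) ** 2).mean()
    return policy_loss, value_loss
```

When the actor and critic share parameters, these two terms (often plus an entropy bonus) are simply summed into a single scalar loss before the gradient step.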

## Results

# Proximal Policy Optimization (PPO)

Although policy gradient algorithms such as *REINFORCE* and *A2C* are well-known, they are often not the best choice when trying to solve difficult problems. One reason is that these standard policy gradient methods have poor sample efficiency and require a large amount of data to train. Another reason is that it is hard to choose a step size that works for the entire course of training. This is because policy gradient methods are *on-policy*, and the distributions of states, rewards, and actions change as the model is updated. As a result, small changes in the network parameters can have large effects on the performance of the policy.

*Trust Region Policy Optimization* (TRPO) is an on-policy algorithm developed by Schulman *et al.* that mitigates these issues by optimizing a surrogate objective:

$$\max_\theta \; \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right] \quad \text{subject to} \quad \mathbb{E}_t \left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta$$

TRPO works by enforcing a *trust region constraint*. This constraint on the KL divergence ensures that the new policy $\pi_\theta$ stays close to the old policy $\pi_{\theta_{\text{old}}}$. The theoretical foundations of TRPO show that the policy can be guaranteed to improve monotonically at every step (more on this in a later post).
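For intuition, the constrained quantity is just the KL divergence between the old and new action distributions, averaged over the states in a batch. A minimal sketch for discrete action spaces (names illustrative; assumes all probabilities are strictly positive):

```python
import numpy as np

def mean_kl(old_probs, new_probs):
    """Mean KL(pi_old || pi_new) over a batch of states.

    Each row of `old_probs` / `new_probs` is a probability
    distribution over discrete actions for one state.
    """
    old_probs = np.asarray(old_probs, dtype=np.float64)
    new_probs = np.asarray(new_probs, dtype=np.float64)
    kl = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=-1)
    return kl.mean()
```

TRPO would reject or shrink any update that pushes this quantity above the trust-region radius $\delta$.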

Although TRPO has been shown to perform well on a set of benchmark tasks, it can be tricky to implement. *Proximal Policy Optimization* (PPO) is an extension of TRPO that is much easier to implement in practice. PPO optimizes the following surrogate objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\!\left( r_t(\theta) \, \hat{A}_t, \; \mathrm{clip}\!\left( r_t(\theta), \, 1 - \epsilon, \, 1 + \epsilon \right) \hat{A}_t \right) \right]$$

where

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

is the probability ratio between the new and old policies.
Like TRPO, PPO makes conservative updates to the policy parameters and has also been shown to work well in practice.

## PPO Algorithm

- For $k = 1, 2, \ldots$
- Sample rollouts of length $T$ from $N$ actors acting in parallel
- Bootstrap an advantage function estimate using temporal difference updates: $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
- Calculate the gradient of the surrogate objective: $\hat{g} = \nabla_\theta L^{\text{CLIP}}(\theta)$
- Compute the value function loss: $L(\phi) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \hat{R}_{i,t} - V_\phi(s_{i,t}) \right)^2$
- Update the policy network parameters: $\theta \leftarrow \theta + \alpha \hat{g}$
- Update the value network parameters: $\phi \leftarrow \phi - \beta \nabla_\phi L(\phi)$

## Results

# Takeaways

Once I had a working implementation of A2C, it wasn’t very difficult to extend the code to also get PPO to work. Here’s a list of problems I ran into.

- The length of the policy rollouts can have a pretty significant effect on how long the algorithm takes to train
- Similarly, the number of parallel actors can also affect algorithm convergence
- Normalizing rewards, returns, and/or advantages is ~~probably~~ always worth trying

# Next steps

For the past two weeks I have focused on learning about and implementing standard *on-policy* learning algorithms. Next, I will learn about *off-policy* actor-critic style algorithms like *Deep Deterministic Policy Gradients* (DDPG).