# Off-Policy Actor-Critic Algorithms

This post extends my learning about Actor-Critic algorithms to the off-policy setting.

Deep Deterministic Policy Gradients (DDPG) is an extension of the DQN algorithm that is able to learn control policies for continuous action spaces. DDPG is an Actor-Critic algorithm, so it learns both a policy and value function (Q function). Like DQN, DDPG makes use of experience replay buffers and a frozen target network to stabilize training. The critic used in DDPG, however, differs from the critic used in DQN in two key ways. First, the critic used in DDPG takes as input both the states and actions. Second, the critic does not output a Q-value for every action (otherwise, there’d be infintely many outputs!), but instead the architecture has only one neuron that outputs values for each state-action pair.

So how does training work? Training the critic network $Q_{\phi}(s,a)$ in DDPG is very similar to how it is trained in DQN. Training the actor $\pi_{\theta}(s)$, on the other hand, relies on the Deterministic Policy Gradient Theorem proved by David Silver in 2014:

Notice that we are taking the gradient of $Q_{\phi}(s,a)$ with respect to the actions $a$. The intuition for this is as follows. The critic can evaluate the action that the actor proposes from a particular state. By making small changes to that action, the critic will tell us whether the new action is an improvement over the previous action. If the new action does have a higher Q value, the gradient $\nabla{a}Q_{\phi}$ is used to update the parameters of the actor in the right direction.

It is also important to note that unlike actor-critic algorithms like A2C and PPO, the actor in DDPG maps states directly (i.e. deterministically) to actions rather than outputting a distribution over actions. Since the actor isn’t sampling actions, how, then, do we actually get exploration? One method involves adding Gaussian noise or Ornstein-Uhlenbenk process noise to the deterministic action. Another method involves adaptively perturbing the parameters of the actor network.

## DDPG Algorithm

1. For episode = 1,2,… $\textbf{do}$
2.    For $t = 1,T$
3.        Select action according to policy and noise process:
$a_t = \mu_{\theta}(s_t) + \mathcal{N}_{t}$
4.        Execute action $a_t$ and observe reward $r_t$ and next state $s_t$
5.        Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay buffer
6.        Sample minibatch from replay buffer $\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}$
7.        Calculate targets $y_j$: $y_j = r_j + \gamma Q_{\bf{\phi'}}(s_{j+1}, \mu_{\theta'}(s_{j+1}))$
8.        Calculate the loss: \begin{align} \nonumber L = \frac{1}{N} \sum_{j=1}^{N} (r_j + \gamma Q_{\bf{\phi'}}(s_{j+1}, \mu_{\theta'}(s_{j+1})) - Q_{\phi}(s_j, a_j))^2 \\ \nonumber \end{align}
9.        Update the critic network parameters: \begin{align} \nonumber \phi \longleftarrow \phi + \alpha \nabla L(\phi) \\ \nonumber \end{align}
10.        Approximate the policy gradient: \begin{align} \nonumber \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{a}Q_{\phi}(s_j, \mu_{\theta}(s_j)) \nabla_{\theta} \mu_{\theta}(s) \\ \nonumber \end{align}
11.        Update the policy parameters

12.        Update the target networks

# Twin Delayed DDPG

Although DDPG is capable of solving challenging continuous control tasks, training can be very difficult in practice. Twin Delayed DDPG (TD3) uses a few tricks to that greatly improve algorithm performance:

1. Target policy smoothing
2. Clipped Double Q learning
3. Delaying update of policy and target networks

## TD3 Algorithm

1. For episode = 1,2,… $\textbf{do}$
2.    For $t = 1,T$
3.        Select action according to policy and noise process:
$a_t = \mu_{\theta}(s_t) + \mathcal{N}_{t}$
4.        Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
5.        Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay buffer
6.        Sample minibatch from replay buffer $\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}$
7.        Add noise to the action: $\color{red}{\tilde{a} = \pi_{\theta'} + \epsilon, \epsilon \sim \text{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)}$
8.        Calculate targets: $\color{blue}{y_j = r_j + \gamma \underset{i}{\operatorname{min}}Q_{\phi_{i}'}(s_{j+1}, \tilde{a})}$
9.        Calculate the loss: \begin{align} \nonumber \color{blue}{L = \frac{1}{N} \sum_{j=1}^{N} (y_j - \underset{i}{\operatorname{min}}Q_{\phi_{i}}(s_j, a_j))^2}\\ \nonumber \end{align}
10.        Update the critic network parameters: \begin{align} \nonumber \color{blue}{\phi_{i} \longleftarrow \phi_{i} + \alpha \nabla L(\phi)} \\ \nonumber \end{align}
11.        If $t \text{ mod } d$ then:
12.           Approximate the policy gradient: \begin{align} \nonumber \color{green}{\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{j=1}^{N} \nabla_{a}Q_{\phi}(s_j, \mu_{\theta}(s_j)) \nabla_{\theta} \mu_{\theta}(s_j)} \\ \nonumber \end{align}
13.           Update the policy parameters:

14.           Update the target networks:

# Take aways

• TD3 works much better than the original DDPG algorithm
• Even using a subset of the modifications like delaying the update of the policy improves learning (not shown)
• I didn’t investigate how decaying the exploration rate over time might affect algorithm convergence, so I might want to look into that at some point