These are notes I took while learning about policy gradient algorithms. They overlap significantly with a blog post I wrote.

# REINFORCE

## On-policy derivation

In deriving the REINFORCE algorithm, we seek the optimal policy $\pi^{\ast}$ that maximizes the total expected reward:

$$\pi^{\ast} = \arg\max_{\pi} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau)\right]$$

where

$$R(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$$

The trajectory $\tau$ is a sequence of states and actions experienced by the agent, $R(\tau)$ is the return, and $p_{\theta}(\tau)$ is the probability of observing that particular sequence of states and actions. It is important to note that $p_{\theta}(\tau)$ is a function of both the environment transition dynamics and the policy $\pi$.

Since the policy $\pi$ is parameterized by $\theta$, the above is equivalent to finding the optimal set of parameters:

$$\theta^{\ast} = \arg\max_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau)\right]$$

Thus, we can define our objective $J(\theta)$ to be the total expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau)\right]$$

One way to optimize this objective is to compute its gradient and use gradient ascent. Writing the expectation as an integral and applying the identity $\nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)$ (the log-derivative trick), the calculation goes as follows:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} \int p_{\theta}(\tau) R(\tau) \, d\tau \\
&= \int \nabla_{\theta} p_{\theta}(\tau) R(\tau) \, d\tau \\
&= \int p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) R(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau) R(\tau)\right]
\end{aligned}
$$

Now that we have an expression for the policy gradient, this quantity can be approximated by sampling trajectories from $p_{\theta}(\tau)$:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log p_{\theta}(\tau^{(i)}) R(\tau^{(i)})$$
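This sampled score-function estimator can be sanity-checked numerically. The sketch below is my own toy illustration (not from the notes): a one-step "policy" that emits $x = 1$ with probability $\sigma(\theta)$ and reward $R(x) = x$, where the exact gradient of $\mathbb{E}[R]$ is known in closed form, so the Monte Carlo estimate can be compared against it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): sample x ~ Bernoulli(sigma(theta))
# with reward R(x) = x. The exact gradient of E[R] = sigma(theta) w.r.t. theta
# is sigma(theta) * (1 - sigma(theta)).
theta = 0.0
sigma = 1.0 / (1.0 + np.exp(-theta))
true_grad = sigma * (1.0 - sigma)          # = 0.25 at theta = 0

# Score-function estimate: average R(x) * d/dtheta log p_theta(x), using
# d/dtheta log p(x=1) = 1 - sigma and d/dtheta log p(x=0) = -sigma.
N = 100_000
x = rng.random(N) < sigma                  # Bernoulli(sigma) samples
score = np.where(x, 1.0 - sigma, -sigma)   # per-sample score function
estimate = np.mean(x.astype(float) * score)

print(true_grad, estimate)                 # the two should be close
```

With enough samples the estimate converges to the exact gradient, which is the property the trajectory-level approximation above relies on.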

So how do we express this approximation in terms of the parameterized policy $\pi_{\theta}$? Let's start by expanding $p_{\theta}(\tau)$:

$$p_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \vert s_t) \, p(s_{t+1} \vert s_t, a_t)$$

Now, taking the logarithm of both sides we get:

$$\log p_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^{T} \left[\log \pi_{\theta}(a_t \vert s_t) + \log p(s_{t+1} \vert s_t, a_t)\right]$$

And differentiating w.r.t. $\theta$, the initial-state and transition terms vanish since they do not depend on $\theta$:

$$\nabla_{\theta} \log p_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \vert s_t)$$

Substituting this into the sampled policy gradient above, we get:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} \vert s_t^{(i)})\right) \left(\sum_{t=1}^{T} r(s_t^{(i)}, a_t^{(i)})\right)$$

## REINFORCE Algorithm

1. $\textbf{for}$ $k = 1, 2, \ldots$ $\textbf{do}$
2. $\quad$ Sample a set of $N$ trajectories $\{\tau^{(i)}\}$ from $\pi_{\theta}(a_t \vert s_t)$
3. $\quad$ Approximate the policy gradient $\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} \vert s_t^{(i)})\right) \left(\sum_{t=1}^{T} r(s_t^{(i)}, a_t^{(i)})\right)$
4. $\quad$ Update the parameters $\theta \longleftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
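The three steps above can be sketched in a few lines of numpy. This is a minimal illustration on an assumed toy problem (a two-armed bandit with a softmax policy, where each "trajectory" is a single action and arm 0 pays reward 1); the environment, learning rate, and batch size are my own choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy environment: a two-armed bandit. Arm 0 pays 1, arm 1 pays 0.
TRUE_REWARDS = np.array([1.0, 0.0])

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(2)                    # policy parameters (one logit per arm)
alpha = 0.1                            # learning rate
N = 50                                 # trajectories per batch

for k in range(200):
    probs = softmax(theta)
    # 1. Sample a batch of trajectories from pi_theta
    actions = rng.choice(2, size=N, p=probs)
    rewards = TRUE_REWARDS[actions]
    # 2. Approximate the policy gradient; for a softmax policy,
    #    grad_theta log pi(a) = one_hot(a) - probs
    grad = np.zeros(2)
    for a, r in zip(actions, rewards):
        score = -probs.copy()
        score[a] += 1.0
        grad += score * r
    grad /= N
    # 3. Gradient ascent step on theta
    theta += alpha * grad

print(softmax(theta))                  # mass should concentrate on arm 0
```

After training, the policy should place nearly all its probability on the rewarding arm, which is the qualitative behavior REINFORCE is meant to produce.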