# Policy Gradients

These are notes I took while learning about policy gradient algorithms; they overlap significantly with a blog post that I wrote.

# REINFORCE

## On-policy derivation

In deriving the *REINFORCE* algorithm, we seek the optimal policy that will maximize the total expected reward:

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim p(\tau)}\left[R(\tau)\right]$$

where

$$R(\tau) = \sum_{t=0}^{T} r(s_t, a_t), \qquad p(\tau) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

The *trajectory* $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ is a sequence of states and actions experienced by the agent, $R(\tau)$ is the *return*, and $p(\tau)$ is the probability of observing that particular sequence of states and actions. It is important to note that $p(\tau)$ is a function of both the environment transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and the policy $\pi$.
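
To make these objects concrete, here is a tiny Python sketch (the names and numbers are purely illustrative): a trajectory can be stored as a list of `(state, action, reward)` tuples, and the return is just the sum of the rewards along it.

```python
# A trajectory: (state, action, reward) tuples collected by rolling out
# a policy in an environment (values here are illustrative).
trajectory = [(0, 1, 0.0), (1, 0, 1.0), (2, 1, 5.0)]

# The return R(tau) is the sum of rewards along the trajectory.
R = sum(reward for _, _, reward in trajectory)
print(R)  # 6.0
```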

Since the policy is parameterized by $\theta$, the above is equivalent to finding the optimal set of parameters:

$$\theta^* = \arg\max_{\theta} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

Thus, we can define our objective to be the total expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right]$$

One way to optimize this objective is to take the derivative and then use gradient ascent. Using the log-derivative trick, $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$, the calculation of the gradient goes as follows:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int p_\theta(\tau) R(\tau) \, d\tau \\
&= \int \nabla_\theta p_\theta(\tau) R(\tau) \, d\tau \\
&= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right]
\end{aligned}$$

Now that we have an expression for the policy gradient, it is easy to see that this quantity can be approximated by sampling $N$ trajectories $\tau_i$ from $p_\theta(\tau)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i) R(\tau_i)$$
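
The log-derivative trick is easy to sanity-check numerically outside of RL. Here is a minimal NumPy sketch (all names are illustrative) that estimates $\nabla_\mu \, \mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[x^2]$ by sampling and compares it to the analytic answer $2\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5          # parameter of the sampling distribution N(mu, 1)
n = 200_000       # number of Monte Carlo samples

x = rng.normal(mu, 1.0, size=n)

# Score function of N(mu, 1) w.r.t. mu: d/dmu log p(x) = (x - mu).
# The estimator averages score * f(x), here with f(x) = x**2.
grad_estimate = np.mean((x - mu) * x**2)

print(grad_estimate)  # approximately 2 * mu = 3.0, the analytic gradient
```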

So how do we get the policy gradient approximation in terms of the parameterized policy $\pi_\theta$? Let’s start by expanding $p_\theta(\tau)$:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Now, taking the logarithm of both sides we get:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)$$

And differentiating *w.r.t.* $\theta$, the initial state distribution and transition dynamics terms vanish since neither depends on $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Combining the sampled estimator with the expression for $\nabla_\theta \log p_\theta(\tau)$ above, we get:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) R(\tau_i)$$
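
In automatic-differentiation frameworks, this estimator is typically implemented as a surrogate loss whose gradient matches the expression above. A minimal PyTorch sketch (names are illustrative; `log_probs` is assumed to come from the policy network so that gradients flow through it):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, return_: float) -> torch.Tensor:
    """Surrogate loss for one sampled trajectory.

    log_probs: log pi_theta(a_t | s_t) for each step, with grad attached.
    return_:   the scalar return R(tau) of the trajectory.
    """
    # Minimizing the negative surrogate performs gradient ascent on J(theta);
    # its gradient w.r.t. theta is exactly sum_t grad log pi * R(tau).
    return -(log_probs.sum() * return_)
```

Averaging this loss over the $N$ sampled trajectories gives the full estimator.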

## REINFORCE Algorithm

- For $k = 1, 2, \dots$
    - Sample a set of trajectories $\{\tau_i\}$ from $\pi_{\theta_k}$
    - Approximate the policy gradient $\hat{g}_k \approx \nabla_\theta J(\theta_k)$ using the estimator above
    - Update the parameters: $\theta_{k+1} = \theta_k + \alpha \hat{g}_k$ (a runnable sketch of this loop follows below)
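
Putting the pieces together, here is a minimal, self-contained REINFORCE loop on a toy two-armed bandit. This is a sketch under simplifying assumptions, not the canonical implementation; the environment, hyperparameters, and names are all illustrative.

```python
import torch

# Toy "environment": a stateless two-armed bandit with mean rewards
# [0.2, 0.8]. REINFORCE should push the policy toward arm 1.
mean_rewards = torch.tensor([0.2, 0.8])

logits = torch.zeros(2, requires_grad=True)   # policy parameters theta
optimizer = torch.optim.Adam([logits], lr=0.1)

for k in range(200):
    dist = torch.distributions.Categorical(logits=logits)

    # Sample a batch of "trajectories" (single actions, since the
    # bandit is stateless) and observe noisy rewards.
    actions = dist.sample((32,))
    rewards = mean_rewards[actions] + 0.1 * torch.randn(32)

    # Surrogate loss: minus the REINFORCE estimator, averaged over samples.
    loss = -(dist.log_prob(actions) * rewards).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits.detach(), dim=0))  # most mass should be on arm 1
```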

# Questions (Answers coming soon!)

- **At its core, REINFORCE computes an approximate gradient of the reward with respect to the parameters. Why can’t we just use the familiar stochastic gradient descent?**
- **Does the REINFORCE gradient estimator resemble maximum likelihood estimation (MLE)? Why or why not?**
- **In its original formulation, REINFORCE is an on-policy algorithm. Why? Can we make REINFORCE work off-policy as well?**
- **Do policy gradient methods work for discrete and continuous action spaces? If not, why not?**