These are notes I took while learning about policy gradient algorithms; they overlap significantly with a blog post that I wrote.
In deriving the REINFORCE algorithm, we seek the optimal policy $\pi^*$ that maximizes the total expected reward:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p(\tau)} \left[ R(\tau) \right]$$
The trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ is a sequence of states and actions experienced by the agent, $R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$ is the return, and $p(\tau)$ is the probability of observing that particular sequence of states and actions. It is important to note that $p(\tau)$ is a function of both the environment transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and the policy $\pi$.
Since the policy is parameterized by $\theta$, the above is equivalent to finding the optimal set of parameters:

$$\theta^* = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ R(\tau) \right]$$
Thus, we can define our objective $J(\theta)$ to be the total expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ R(\tau) \right] = \int p_\theta(\tau) R(\tau) \, d\tau$$
One way to optimize this objective is to compute its gradient and then use gradient ascent. Applying the log-derivative trick $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$, the calculation of the gradient goes as follows:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int p_\theta(\tau) R(\tau) \, d\tau \\
&= \int \nabla_\theta p_\theta(\tau) R(\tau) \, d\tau \\
&= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) R(\tau) \right]
\end{aligned}$$
Now that we have an expression for the policy gradient as an expectation, it is easy to see that this quantity can be approximated by sampling $N$ trajectories $\tau_i$ from $p_\theta(\tau)$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i) R(\tau_i)$$
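This score-function estimator is easy to sanity-check numerically. As a minimal sketch (the Gaussian "policy" and the reward $R(x) = x$ are illustrative assumptions, not from the derivation above), we can estimate $\nabla_\theta \, \mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x]$, whose true value is $1$:

```python
import numpy as np

# Score-function (likelihood-ratio) estimate of d/dtheta E_{x ~ N(theta,1)}[x].
# The true gradient is 1; the estimator averages grad-log-prob times "reward" x.
rng = np.random.default_rng(0)
theta, sigma = 2.0, 1.0
x = rng.normal(theta, sigma, size=200_000)       # sampled "trajectories"
grad_log_p = (x - theta) / sigma**2              # d/dtheta log N(x; theta, sigma^2)
grad_estimate = np.mean(grad_log_p * x)          # Monte Carlo policy gradient
print(grad_estimate)                             # close to the true value 1.0
```

Note that the estimator never differentiates through the sampling process or the reward; it only needs the gradient of the log-density, which is exactly what makes it applicable to non-differentiable environments.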
So how do we get the policy gradient approximation in terms of the parameterized policy $\pi_\theta$? Let's start by expanding $p_\theta(\tau)$:

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$
Now, taking the logarithm of both sides, the product becomes a sum:

$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t) \right]$$
And differentiating w.r.t. $\theta$, the initial-state and transition terms vanish since they do not depend on $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
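For a concrete policy class, this score has a closed form. As a sketch (the linear-softmax parameterization $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta \, \phi(s))_a$ is an assumption for illustration, not from the notes), $\nabla_\theta \log \pi_\theta(a \mid s) = (\mathbf{1}_a - \pi) \, \phi(s)^\top$, which we can verify against finite differences:

```python
import numpy as np

# Closed-form score of a linear-softmax policy vs. a finite-difference check.
rng = np.random.default_rng(1)
n_actions, n_features = 3, 4
theta = rng.normal(size=(n_actions, n_features))
phi = rng.normal(size=n_features)                # features phi(s) of state s
a = 2                                            # chosen action

def log_pi(th):
    logits = th @ phi
    return logits[a] - np.log(np.sum(np.exp(logits)))

pi = np.exp(theta @ phi)
pi /= pi.sum()
analytic = np.outer(np.eye(n_actions)[a] - pi, phi)   # (1_a - pi) phi^T

eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(n_actions):
    for j in range(n_features):
        d = np.zeros_like(theta)
        d[i, j] = eps
        numeric[i, j] = (log_pi(theta + d) - log_pi(theta - d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))      # True
```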
Combining the sample-based approximation with the expression for $\nabla_\theta \log p_\theta(\tau)$, we get:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) R(\tau_i)$$

This gives the REINFORCE algorithm:
- For $k = 1, 2, \dots$
  - Sample a set of trajectories $\{\tau_i\}_{i=1}^{N}$ from $\pi_{\theta_k}$
  - Approximate the policy gradient: $\hat{g}_k = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_{\theta_k}(a_{i,t} \mid s_{i,t}) \right) R(\tau_i)$
  - Update the parameters: $\theta_{k+1} = \theta_k + \alpha \hat{g}_k$
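The loop above can be sketched end-to-end in NumPy on a toy problem. The two-armed bandit (so trajectories have length one), its reward probabilities, the step size, and the batch size are all illustrative assumptions, not from the notes:

```python
import numpy as np

# Minimal REINFORCE on a 2-armed bandit with a softmax policy over logits theta.
rng = np.random.default_rng(0)
p_reward = np.array([0.9, 0.1])          # success probability of each arm (assumed)
theta = np.zeros(2)                      # policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

alpha, batch = 0.1, 32
for k in range(500):                     # for k = 1, 2, ...
    pi = softmax(theta)
    grad = np.zeros(2)
    for _ in range(batch):               # sample a batch of trajectories
        a = rng.choice(2, p=pi)
        r = float(rng.random() < p_reward[a])
        grad += (np.eye(2)[a] - pi) * r  # grad log pi(a) * R(tau)
    theta += alpha * grad / batch        # gradient ascent step

print(softmax(theta))                    # mass concentrates on the better arm
```

Even on this tiny problem the raw estimator is noisy, which is why practical implementations subtract a baseline; the vanilla update is kept here to match the derivation.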
Questions (Answers coming soon!)
- At its core, REINFORCE computes an approximate gradient of the reward with respect to the parameters. Why can’t we just use the familiar stochastic gradient descent?
- Does the REINFORCE gradient estimator resemble maximum likelihood estimation (MLE)? Why or why not?
- In its original formulation, REINFORCE is an on-policy algorithm. Why? Can we make REINFORCE work off-policy as well?
- Do policy gradient methods work for both discrete and continuous action spaces? If not, why not?