Derivation of Bellman’s Equation

Preliminaries

This note follows Chapter 3 from Reinforcement Learning: An Introduction by Sutton and Barto.

Markov Decision Process

Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ denote a Markov Decision Process (MDP), where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of possible actions, $p(s' \mid s, a)$ the transition dynamics, $r$ the reward function, and $\gamma \in [0, 1]$ the discount factor. If $\mathcal{S}$ and $\mathcal{A}$ are both finite, we say that $\mathcal{M}$ is a finite MDP. Note that $r$ is a map from state-action pairs $(s, a)$ to scalar rewards, i.e. $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
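
To make these objects concrete, here is a minimal sketch in Python (with NumPy) of a hypothetical two-state, two-action finite MDP; the arrays `P` and `R` and the value of `gamma` are invented purely for illustration and are not taken from the book.

```python
import numpy as np

# A hypothetical two-state, two-action finite MDP (numbers invented for illustration).
# P[s, a, s2] = p(s2 | s, a): transition dynamics.
# R[s, a]     = r(s, a):      scalar reward for taking action a in state s.
P = np.array([
    [[0.8, 0.2],    # s = 0, a = 0
     [0.1, 0.9]],   # s = 0, a = 1
    [[0.5, 0.5],    # s = 1, a = 0
     [0.0, 1.0]],   # s = 1, a = 1
])
R = np.array([
    [1.0, 0.0],     # r(0, 0), r(0, 1)
    [0.5, 2.0],     # r(1, 0), r(1, 1)
])
gamma = 0.9         # discount factor

# Each transition distribution p(. | s, a) must sum to one.
assert np.allclose(P.sum(axis=2), 1.0)
```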

Returns

The return $G_t$ is the discounted sum of rewards received after time step $t$:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$

Note that successive returns are related recursively by $G_t = R_{t+1} + \gamma G_{t+1}$; we will use this relationship in the derivation below.
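
As a quick sanity check of this definition, the short sketch below computes a discounted return for a made-up, finite reward sequence (all numbers are arbitrary).

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}, then zero

# G_t = sum_k gamma^k * R_{t+k+1}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```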

Policy

Let $\pi(a \mid s)$ denote our policy, i.e. the probability of selecting action $a$ when in state $s$.

Value functions

State-value function:

$$v_\pi(s) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s\big]$$

Action-value function:

$$q_\pi(s, a) \doteq \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big]$$

We will derive a more useful form for these value function equations below. But first, let’s re-prove the well-known Law of Iterated Expectations using our notation for the expected return $G_t$.

Law of iterated expectations

Theorem:

$$\mathbb{E}\big[G_{t+1} \mid S_t = s\big] = \mathbb{E}\Big[\, \mathbb{E}\big[G_{t+1} \mid S_{t+1}, S_t\big] \;\Big|\; S_t = s \Big]$$

Proof: To keep the notation clean and easy to read, we’ll drop the subscripts and denote the random variables $G_{t+1}$, $S_t$, and $S_{t+1}$ by $G$, $S$, and $S'$, respectively. Following this convention, we can write the expected return as:

$$\mathbb{E}\big[G \mid S' = s', S = s\big] = \sum_g g\, p(g \mid s', s).$$

Conditioning on $S = s$ and taking the expectation of the above expression, we get:

$$
\begin{aligned}
\mathbb{E}\Big[\, \mathbb{E}\big[G \mid S', S\big] \;\Big|\; S = s \Big]
&= \sum_{s'} p(s' \mid s) \sum_g g\, p(g \mid s', s) \\
&= \sum_{s'} \sum_g g\, p(g, s' \mid s) \\
&= \sum_g g \sum_{s'} p(g, s' \mid s) \\
&= \sum_g g\, p(g \mid s) \\
&= \mathbb{E}\big[G \mid S = s\big],
\end{aligned}
$$

which is exactly the statement of the theorem. $\blacksquare$
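
As a numerical illustration of the identity just proved, the sketch below builds an arbitrary, made-up joint distribution $p(g, s' \mid s)$ over a few return values and two successor states, and checks that both sides of the theorem agree.

```python
import numpy as np

# A hypothetical joint distribution p(g, s' | s) for a fixed state s (illustration only).
g_vals = np.array([0.0, 1.0, 5.0])   # possible values of G
p_joint = np.array([                 # rows: s' = 0, 1 ; columns: values of G
    [0.10, 0.25, 0.05],              # p(g, s'=0 | s)
    [0.30, 0.20, 0.10],              # p(g, s'=1 | s)
])
assert np.isclose(p_joint.sum(), 1.0)

# Left-hand side: E[G | S = s], computed directly from the joint.
p_g = p_joint.sum(axis=0)                            # p(g | s)
lhs = (g_vals * p_g).sum()

# Right-hand side: E[ E[G | S', S = s] | S = s ].
p_sprime = p_joint.sum(axis=1)                       # p(s' | s)
inner = (p_joint * g_vals).sum(axis=1) / p_sprime    # E[G | S' = s', S = s]
rhs = (p_sprime * inner).sum()

print(lhs, rhs)                                      # both equal 1.2 here
assert np.isclose(lhs, rhs)
```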

Bellman’s Equations

Using the law of iterated expectations, we can expand the state-value function as follows:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\big[G_t \mid S_t = s\big] \\
&= \mathbb{E}_\pi\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s\big] \\
&= \mathbb{E}_\pi\big[R_{t+1} \mid S_t = s\big] + \gamma\, \mathbb{E}_\pi\Big[\, \mathbb{E}_\pi\big[G_{t+1} \mid S_{t+1}, S_t\big] \;\Big|\; S_t = s \Big] \\
&= \mathbb{E}_\pi\big[R_{t+1} \mid S_t = s\big] + \gamma\, \mathbb{E}_\pi\big[v_\pi(S_{t+1}) \mid S_t = s\big] \\
&= \mathbb{E}_\pi\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\big] \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \big[ r(s, a) + \gamma\, v_\pi(s') \big].
\end{aligned}
$$

The third line uses the law of iterated expectations, and the fourth uses the Markov property: conditioned on $S_{t+1}$, the return $G_{t+1}$ does not depend on $S_t$, so $\mathbb{E}_\pi\big[G_{t+1} \mid S_{t+1}, S_t\big] = \mathbb{E}_\pi\big[G_{t+1} \mid S_{t+1}\big] = v_\pi(S_{t+1})$. The last line is Bellman’s equation for $v_\pi$.

Another way to derive this equation is by looking at the full Bellman backup diagram:

[Figure: Bellman backup diagram for $v_\pi$]
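
For a finite MDP, Bellman’s equation above is a system of $|\mathcal{S}|$ linear equations in the unknowns $v_\pi(s)$, so $v_\pi$ can be computed exactly. Below is a minimal sketch that does this for the hypothetical two-state MDP from earlier (reusing the invented arrays `P`, `R`, and `gamma`), together with an invented stochastic policy `pi`; it solves the linear system directly and cross-checks the result with iterative policy evaluation.

```python
import numpy as np

# Continuing the toy MDP sketch from earlier (P, R, gamma are the hypothetical arrays above).
# pi[s, a] = pi(a | s): an invented stochastic policy, for illustration only.
pi = np.array([
    [0.6, 0.4],
    [0.3, 0.7],
])
assert np.allclose(pi.sum(axis=1), 1.0)

# Bellman equation in matrix form: v = r_pi + gamma * P_pi v, where
#   r_pi[s]     = sum_a pi(a|s) r(s, a)
#   P_pi[s, s2] = sum_a pi(a|s) p(s2 | s, a)
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)

# Exact solution of the linear system (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# The same fixed point, reached by repeatedly applying the Bellman backup.
v_iter = np.zeros_like(r_pi)
for _ in range(500):
    v_iter = r_pi + gamma * P_pi @ v_iter
assert np.allclose(v, v_iter, atol=1e-6)
print(v)   # state values v_pi(0), v_pi(1)
```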

Similarly, we can rewrite the action-value function as follows:

$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] \\
&= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s'),
\end{aligned}
$$

where the third line again uses the law of iterated expectations together with the Markov property.
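
Continuing the numerical sketch, the last line above is a one-liner given the `v` computed earlier (`P`, `R`, and `gamma` are still the invented arrays from the toy MDP):

```python
import numpy as np

# q_pi(s, a) = r(s, a) + gamma * sum_{s2} p(s2 | s, a) * v_pi(s2),
# using P, R, gamma, and v from the sketches above.
q = R + gamma * np.einsum('sat,t->sa', P, v)
print(q)   # action values q_pi(s, a) for the toy MDP
```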

From the expansions of $v_\pi$ and $q_\pi$ above, it is easy to see that:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a).$$
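
And, continuing the sketch, this relationship can be checked directly with the arrays `v`, `q`, and `pi` computed above:

```python
import numpy as np

# v_pi(s) should equal sum_a pi(a | s) q_pi(s, a).
assert np.allclose(v, (pi * q).sum(axis=1))
```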

We can express this relationship as a backup diagram as well:

[Figure: backup diagram relating $v_\pi$ and $q_\pi$]

References

  1. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction (2nd edition, draft, 2018).