This note follows Chapter 3 from *Reinforcement Learning: An Introduction* by Sutton and Barto.
Markov Decision Process
Let $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$ denote a Markov Decision Process (MDP), where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of possible actions, $p(s' \mid s, a)$ the transition dynamics, $r$ the reward function, and $\gamma \in [0, 1]$ the discount factor. If $\mathcal{S}$ and $\mathcal{A}$ are both finite, we say that $\mathcal{M}$ is a finite MDP. Note that $r$ is a map from state-action pairs $(s, a)$ to scalar rewards, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
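A finite MDP can be stored directly as arrays. Below is a minimal Python sketch; the two-state, two-action MDP and all its numbers are made up for illustration, not taken from the book:

```python
import numpy as np

# Hypothetical finite MDP with 2 states and 2 actions.
# P[s, a, s'] holds the transition probability p(s' | s, a);
# R[s, a] holds the scalar reward r(s, a); gamma is the discount factor.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [1.0, 0.0],                 # r(0, 0), r(0, 1)
    [0.0, 2.0],                 # r(1, 0), r(1, 1)
])
gamma = 0.9

# Each row P[s, a] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=-1), 1.0)
```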
Let $\pi(a \mid s)$ denote our policy, the probability of selecting action $a$ in state $s$.
The return $G_t$ is the discounted sum of future rewards, $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, and the state- and action-value functions are its conditional expectations, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. We will derive a more useful form for these value function equations below. But first, let's re-prove the well-known Law of Iterated Expectations using our notation for the expected return $G_t$.
Law of iterated expectations

$$\mathbb{E}[G_{t+1} \mid S_t] = \mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid S_{t+1}] \;\big|\; S_t \,\big]$$
Proof: To keep the notation clean and easy to read we'll drop the subscripts, and denote the random variables $G = G_{t+1}$, $S' = S_{t+1}$, $S = S_t$. Following this convention, we can write the expected return as:

$$\mathbb{E}[G \mid S' = s'] = \sum_{g} g \; p(g \mid s')$$
Conditioning on $S = s$ and taking the expectation of the above expression we get:

$$\begin{aligned}
\mathbb{E}\big[\,\mathbb{E}[G \mid S'] \;\big|\; S = s\,\big]
&= \sum_{s'} p(s' \mid s) \sum_{g} g \; p(g \mid s') \\
&= \sum_{g} g \sum_{s'} p(g \mid s')\, p(s' \mid s) \\
&= \sum_{g} g \; p(g \mid s) \\
&= \mathbb{E}[G \mid S = s],
\end{aligned}$$

where the last step uses the Markov property, $p(g \mid s', s) = p(g \mid s')$. $\blacksquare$
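The tower property itself can also be sanity-checked numerically. Here is a Monte Carlo sketch on a toy joint distribution (the coin-flip/Gaussian setup and all parameters are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution: Y is a fair coin, and X | Y = y is Gaussian with a
# y-dependent mean, so E[X] = 0.5 * 1.0 + 0.5 * 3.0 = 2.0.
n = 200_000
y = rng.integers(0, 2, size=n)
x = np.where(y == 0,
             rng.normal(1.0, 1.0, size=n),
             rng.normal(3.0, 2.0, size=n))

# E[E[X | Y]]: average the per-group conditional means, weighted by p(y).
p_y = np.array([np.mean(y == k) for k in (0, 1)])
e_x_given_y = np.array([x[y == k].mean() for k in (0, 1)])
tower = (p_y * e_x_given_y).sum()

# The p(y)-weighted average of group means equals the overall mean.
assert abs(tower - x.mean()) < 1e-8
```

Note that the equality holds exactly (up to floating point) in the sample, because a weighted average of group means with weights equal to group frequencies is the overall sample mean.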
Using the law of iterated expectations, we can expand the state-value function as follows:

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_{a} \pi(a \mid s) \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big]
\end{aligned}$$
Another way to derive this equation is by looking at the full Bellman backup diagram:
Bellman backup diagram
Similarly, we can rewrite the action-value function as follows:

$$q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$$
From the above equations it is easy to see that:

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$
We can express this relationship as a backup diagram as well:
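The two Bellman recursions, and the identity relating $v_\pi$ and $q_\pi$, can be checked numerically by iterating the Bellman backup to its fixed point. Below is a minimal sketch on a hypothetical two-state, two-action MDP (all transition probabilities, rewards, and the policy are made up):

```python
import numpy as np

# Hypothetical finite MDP: P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])               # pi[s, a] = pi(a | s)
gamma = 0.9

# Iterative policy evaluation: repeatedly apply the Bellman backup
# v(s) <- sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ].
# The backup is a gamma-contraction, so v converges to v_pi.
v = np.zeros(2)
for _ in range(1000):
    v = (pi * (R + gamma * P @ v)).sum(axis=1)

# q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
q = R + gamma * P @ v

# At the fixed point, v(s) = sum_a pi(a|s) q(s,a).
assert np.allclose(v, (pi * q).sum(axis=1))
```

Here `P @ v` uses NumPy's stacked matrix-vector product to compute $\sum_{s'} p(s' \mid s, a)\, v(s')$ for every $(s, a)$ pair at once.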
- Sutton, Richard S., and Andrew G. Barto. *Reinforcement Learning: An Introduction* (2nd ed., draft, 2018).