# Derivation of Bellman’s Equation

## Preliminaries

This note follows Chapter 3 of *Reinforcement Learning: An Introduction* by Sutton and Barto.

#### Markov Decision Process

Let $M = \langle S, A, P, R, \gamma \rangle$ denote a Markov Decision Process (MDP), where $S$ is the set of states, $A$ the set of possible actions, $P$ the transition dynamics, $R$ the reward function, and $\gamma \in [0, 1)$ the discount factor. If $S$ and $A$ are both finite, we say that $M$ is a finite MDP. Note that $R : S \times A \rightarrow \mathbb{R}$ maps state-action pairs to scalar rewards.
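A finite MDP can be written down directly as plain data structures. Here is a minimal sketch (the states, actions, probabilities, and rewards below are invented for illustration, not from the text):

```python
# A tiny finite MDP M = <S, A, P, R, gamma> as Python dicts.
# P maps (s, a) to a distribution over next states; R maps (s, a) to a scalar.
S = ["s0", "s1"]
A = ["left", "right"]

P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s1": 1.0},
}
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 2.0, ("s1", "right"): 0.5,
}
gamma = 0.9

# Sanity check: each transition distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-12 for d in P.values())
```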

#### Returns

The return $G_t$ is the discounted sum of rewards received from time $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
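For a finite reward sequence the sum can be computed directly; a minimal sketch (the function name is illustrative, and the infinite sum is truncated to the rewards actually observed):

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1}
# for a finite sequence of rewards (R_{t+1}, R_{t+2}, ...).
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards (1, 1, 1) with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```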

#### Policy

Let $\pi$ denote our policy, where $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$. (For a deterministic policy this reduces to a map $\pi : S \rightarrow A$.)

#### Value functions

State-value function: $v_{\pi}(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

Action-value function: $q_{\pi}(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

We will derive a more useful form for these value function equations below. But first, let’s re-prove the well-known law of iterated expectations using our notation for the expected return $G_{t+1}$.

## Law of iterated expectations

**Theorem.** $\mathbb{E}[G_{t+1} \mid S_t] = \mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \,\big]$

**Proof.** To keep the notation clean and easy to read, we drop the time subscripts and denote the random variables by $s = S_t$, $g' = G_{t+1}$, and $s' = S_{t+1}$. Following this convention, we can write the expected return as:

$$\mathbb{E}[g' \mid s'] = \sum_{g'} g' \, p(g' \mid s')$$

Conditioning on $S_t = s$ and taking the expectation of the above expression, we get:

$$
\begin{aligned}
\mathbb{E}\big[\, \mathbb{E}[g' \mid s'] \mid s \,\big]
&= \sum_{s'} \Big( \sum_{g'} g' \, p(g' \mid s') \Big) p(s' \mid s) \\
&= \sum_{s'} \sum_{g'} g' \, p(g' \mid s') \, p(s' \mid s) \\
&= \sum_{s'} \sum_{g'} g' \, p(g', s' \mid s) \\
&= \sum_{g'} g' \, p(g' \mid s) \\
&= \mathbb{E}[g' \mid s]
\end{aligned}
$$

where the third equality uses the Markov property, $p(g' \mid s', s) = p(g' \mid s')$, so that $p(g' \mid s') \, p(s' \mid s) = p(g', s' \mid s)$, and the fourth marginalizes over $s'$. $\blacksquare$
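The tower property can also be checked numerically on a small joint distribution that factors through the chain $s \rightarrow s' \rightarrow g'$ (all probabilities and values below are invented for illustration):

```python
# Check E[ E[g'|s'] | s ] = E[g' | s] on made-up distributions.
p_s1_given_s = {"a": 0.4, "b": 0.6}          # p(s' | s)
p_g_given_s1 = {"a": {0: 0.5, 10: 0.5},      # p(g' | s')
                "b": {0: 0.1, 10: 0.9}}

# Inner expectation E[g' | s'] for each s'.
inner = {s1: sum(g * p for g, p in dist.items())
         for s1, dist in p_g_given_s1.items()}

# Left side: E[ E[g'|s'] | s ] = sum_{s'} E[g'|s'] p(s'|s).
lhs = sum(inner[s1] * p for s1, p in p_s1_given_s.items())

# Right side: E[g' | s] = sum_{s', g'} g' p(g'|s') p(s'|s).
rhs = sum(g * pg * ps1
          for s1, ps1 in p_s1_given_s.items()
          for g, pg in p_g_given_s1[s1].items())

assert abs(lhs - rhs) < 1e-12
```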

## Bellman’s Equations

Using the law of iterated expectations, we can expand the state-value function $v_{\pi}(s)$ as follows:

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \Big] \\
&= \sum_a \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, v_\pi(s') \Big]
\end{aligned}
$$
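Because $v_\pi$ is a fixed point of this recursion, it can be found by repeatedly applying the right-hand side (iterative policy evaluation). A minimal sketch on an invented two-state MDP (all numbers below are made up for illustration):

```python
# Iterative policy evaluation on a toy two-state MDP.
S = ["s0", "s1"]
A = ["x", "y"]
P = {("s0", "x"): {"s0": 0.5, "s1": 0.5}, ("s0", "y"): {"s1": 1.0},
     ("s1", "x"): {"s0": 1.0},            ("s1", "y"): {"s1": 1.0}}
R = {("s0", "x"): 1.0, ("s0", "y"): 0.0,
     ("s1", "x"): 2.0, ("s1", "y"): 0.0}
pi = {"s0": {"x": 0.5, "y": 0.5}, "s1": {"x": 1.0, "y": 0.0}}
gamma = 0.9

# v <- sum_a pi(a|s) [ R(s,a) + gamma * sum_{s'} p(s'|s,a) v(s') ]
v = {s: 0.0 for s in S}
for _ in range(1000):
    v = {s: sum(pi[s][a] * (R[s, a] + gamma * sum(p * v[s1]
                 for s1, p in P[s, a].items()))
                for a in A)
         for s in S}

# At convergence, v satisfies the Bellman equation up to float tolerance.
for s in S:
    rhs = sum(pi[s][a] * (R[s, a] + gamma * sum(p * v[s1]
              for s1, p in P[s, a].items()))
              for a in A)
    assert abs(v[s] - rhs) < 1e-6
```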

Another way to derive this equation is by looking at the full Bellman backup diagram:

*(Figure: Bellman backup diagram)*

Similarly, we can rewrite the action-value function $q_{\pi}(s,a)$ as follows:

$$
\begin{aligned}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \\
&= R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, v_\pi(s')
\end{aligned}
$$

From the above equations it is easy to see that:

$$v_\pi(s) = \sum_a \pi(a \mid s) \, q_\pi(s, a)$$
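This consistency condition between the two value functions can be verified numerically. The sketch below uses an invented two-state MDP, computes $v_\pi$ by iterating the Bellman equation, derives $q_\pi$ from it, and checks that $v_\pi$ is the policy-weighted average of $q_\pi$:

```python
# Check v_pi(s) = sum_a pi(a|s) q_pi(s,a) on a toy two-state MDP.
S = ["s0", "s1"]
A = ["x", "y"]
P = {("s0", "x"): {"s0": 0.5, "s1": 0.5}, ("s0", "y"): {"s1": 1.0},
     ("s1", "x"): {"s0": 1.0},            ("s1", "y"): {"s1": 1.0}}
R = {("s0", "x"): 1.0, ("s0", "y"): 0.0,
     ("s1", "x"): 2.0, ("s1", "y"): 0.0}
pi = {"s0": {"x": 0.5, "y": 0.5}, "s1": {"x": 1.0, "y": 0.0}}
gamma = 0.9

# Find v_pi by iterating the Bellman equation.
v = {s: 0.0 for s in S}
for _ in range(1000):
    v = {s: sum(pi[s][a] * (R[s, a] + gamma * sum(p * v[s1]
                 for s1, p in P[s, a].items()))
                for a in A)
         for s in S}

# q_pi(s, a) = R(s, a) + gamma * sum_{s'} p(s'|s,a) v_pi(s')
q = {(s, a): R[s, a] + gamma * sum(p * v[s1] for s1, p in P[s, a].items())
     for s in S for a in A}

# v_pi(s) must equal the policy-weighted average of q_pi(s, .).
for s in S:
    assert abs(v[s] - sum(pi[s][a] * q[s, a] for a in A)) < 1e-6
```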

We can express this relationship as a backup diagram as well:

## References

1. Sutton, Richard S., and Andrew G. Barto. *Reinforcement Learning: An Introduction*. 2nd ed. (draft), 2018.