Deep Q-networks

This post uses Deep Q-Networks to introduce off-policy algorithms.

Overview of Off-Policy Algorithms

Until now I have focused on on-policy algorithms - i.e. algorithms that learn from data that were generated with the current policy. Off-policy algorithms, on the other hand, are able to learn from experiences (e.g. transitions of the form $(s, a, r, s')$) collected from previous policies. Because off-policy methods are able to reuse old data, they tend to be more sample-efficient than on-policy methods.

Deep Q Learning

One recent example of an off-policy method is the venerable Deep Q-Learning algorithm, which learned to play a number of Atari games at human-level performance. The use of deep neural network function approximators extended classical Q-learning beyond finite, discrete state spaces to problem domains with continuous, high-dimensional state spaces. Quite surprisingly, Deep Q-learning was able to solve dozens of challenging Atari games using a single set of hyperparameters.

At the core of Deep Q-learning is the Deep Q-Network (DQN). Q-networks take as input some representation of the state of the environment. For Atari games, the input could be RGB or gray-scale pixel values. For a robot manipulator, the input could include a combination of the position, linear velocity, and angular velocity of its links and/or joints. Q-networks output one Q-value per action. Because Q-networks learn the values of state-action pairs, they can be viewed as a parameterized representation of the critic $Q_{\phi}(s_t, a_t)$ introduced in my last post. Unlike policy gradient methods that learn a policy $\pi_{\theta}$ directly, Deep Q-networks learn an induced policy: an action $a$ is selected by taking the $\operatorname{argmax}$ over the set of Q-values $\{Q_{\phi}(s, a_i)\}$ that the network outputs.
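To make the input/output structure concrete, here is a minimal sketch of a Q-network as a small numpy MLP. The state dimension, action count, and hidden width are illustrative assumptions; a real DQN would use a deep-learning framework (and, for Atari, convolutional layers over pixels).

```python
import numpy as np

# Illustrative sizes only: a 4-dimensional state and 2 discrete actions.
STATE_DIM, N_ACTIONS, HIDDEN = 4, 2, 16

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(s):
    """Forward pass: one Q-value per action for state s."""
    h = np.maximum(0.0, s @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                 # shape: (N_ACTIONS,)

def induced_policy(s):
    """The policy induced by the Q-network: argmax over its outputs."""
    return int(np.argmax(q_values(s)))

s = rng.normal(size=STATE_DIM)
print(q_values(s))        # one Q-value per action
print(induced_policy(s))  # index of the highest-valued action
```

Note that a single forward pass yields the values of *all* actions for a state, which is what makes the argmax over actions cheap when actions are discrete.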

So how does Deep Q-learning work? The core of the algorithm involves the computation of the temporal difference (TD) error for transitions $(s_i, a_i, r_i, s_{i+1})$ sampled from taking actions in the environment:

$\delta_i = y_i - Q_{\phi}(s_i, a_i),$

where $y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{i+1}, a)$ is a bootstrapped estimate of the Q-function computed with the target network $Q_{\phi'}$ (more on this below). Similar to supervised learning, we minimize the squared loss between the target values $y_i$ and the outputs of the network $Q_{\phi}(s_i, a_i)$ over a minibatch of $N$ transitions:

$L(\phi) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - Q_{\phi}(s_i, a_i)\right)^2.$

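As a concrete numeric sketch of this target and loss for a single transition (all numbers here are made up for illustration):

```python
import numpy as np

# Suppose the target network assigns these Q-values to the next state
# s_{i+1}, and the online network predicts q_pred for the chosen action.
gamma = 0.99
r_i = 1.0
q_next = np.array([0.5, 2.0, 1.0])   # Q_{phi'}(s_{i+1}, a), one per action
y_i = r_i + gamma * q_next.max()     # bootstrapped target: 1 + 0.99 * 2 = 2.98
q_pred = 2.5                         # Q_phi(s_i, a_i)
loss = (y_i - q_pred) ** 2           # squared TD error
print(round(y_i, 2), round(loss, 4)) # 2.98 0.2304
```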
Although Deep Q-learning is conceptually straightforward, there are a few tricks required to get the algorithm to converge in practice.

• Stable Target Network
• Because DQN is not really a supervised learning algorithm, the target $y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\phi}(s_{i+1}, a)$ would change every time the network parameters $\phi$ are updated. This is bad: since $y_i$ and $Q_{\phi}(s_i, a_i)$ move in the same direction, the network ends up chasing its own predictions and can diverge. We avoid this by computing the target $y_i = r_i + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{i+1}, a)$ with a frozen target network $Q_{\phi'}$ whose parameters are copied from $Q_{\phi}$ every $k$ iterations.
• Replay Buffer
• When an agent acts in an environment, the experiences $(s_i, a_i, r_i, s_{i+1})$ within a single episode are temporally correlated. This violates the i.i.d. assumption required by most learning algorithms. We can de-correlate the experiences by placing them in a replay buffer and sampling them uniformly at random when updating the Q-network.
• Stacked frames
• Single images don’t convey dynamic information, so stacking multiple consecutive frames allows the agent to infer motion (e.g. the direction of a moving ball) in the environment.
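Of these tricks, the replay buffer is the easiest to capture in code. A minimal sketch (the capacity and batch size are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO of transitions (s, a, r, s_next), sampled uniformly
    at random to break the temporal correlation of consecutive steps."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling de-correlates the minibatch
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.store(t, 0, 1.0, t + 1)
batch = buf.sample(3)
print(len(batch))  # 3
```

Using a `deque` with `maxlen` means old experiences are evicted automatically once the buffer is full, which keeps the data roughly on-distribution with recent policies.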

DQN Algorithm

1. For episode = 1,2,… $\textbf{do}$
2.    For $t = 1, \dots, T$ $\textbf{do}$
3.        Perform $\epsilon$-greedy action selection:
$a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \underset{a}{\operatorname{argmax}}\,Q_{\phi}(s_t, a) & \text{with probability } 1-\epsilon \end{cases}$
4.        Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
5.        Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay buffer
6.        Sample minibatch from replay buffer $\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{N}$
7.        Calculate targets $y_j$:              $y_j = r_j + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{j+1}, a)$
8.        Calculate the loss:              $L = \frac{1}{N} \sum_{j=1}^{N} (r_j + \gamma \underset{a}{\operatorname{max}}Q_{\phi'}(s_{j+1}, a) - Q_{\phi}(s_j, a_j)) ^2.$
9.        Update the network parameters by gradient descent on the loss: $\phi \longleftarrow \phi - \alpha \nabla_{\phi} L(\phi)$
10.        If $t \bmod k = 0$, update the target network $\phi' \longleftarrow \phi$
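The steps above can be sketched end-to-end on a toy problem. Everything here is an illustrative assumption rather than the Atari setup: the environment is a 5-state chain, the "Q-network" is a tabular array (equivalent to a linear network over one-hot states), episodes start from random non-terminal states to ease exploration, and the hyperparameters are arbitrary.

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 5, 2              # actions: 0 = left, 1 = right
GAMMA, ALPHA, EPS, K = 0.9, 0.5, 0.2, 20

rng = random.Random(0)
Q = np.zeros((N_STATES, N_ACTIONS))     # online "network" parameters phi
Q_target = Q.copy()                     # frozen target network phi'
buffer, step = [], 0

def env_step(s, a):
    """Deterministic chain: reward 1 only when reaching the last state."""
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(200):
    s = rng.randrange(N_STATES - 1)     # random non-terminal start state
    for t in range(20):
        # Step 3: epsilon-greedy action selection
        if rng.random() < EPS:
            a = rng.randrange(N_ACTIONS)
        else:
            a = int(np.argmax(Q[s]))
        # Steps 4-5: act, then store the transition in the replay buffer
        s_next, r, done = env_step(s, a)
        buffer.append((s, a, r, s_next, done))
        # Steps 6-9: sample a minibatch and take a TD step per sample
        for sj, aj, rj, sj1, dj in rng.sample(buffer, min(len(buffer), 8)):
            y = rj if dj else rj + GAMMA * Q_target[sj1].max()
            Q[sj, aj] += ALPHA * (y - Q[sj, aj])  # descend the squared TD error
        # Step 10: periodically sync the frozen target network
        step += 1
        if step % K == 0:
            Q_target = Q.copy()
        s = s_next
        if done:
            break

greedy = np.argmax(Q, axis=1)[:N_STATES - 1]
print(greedy)  # greedy action per non-terminal state; right (= 1) is optimal
```

One detail not spelled out in the pseudocode above: for terminal transitions the target is just $r_j$, with no bootstrap term, since there is no next state to evaluate.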

Takeaways

One major drawback of Deep Q-Networks is that they can only handle low-dimensional, discrete action spaces. This makes DQN unsuitable for robotic control problems, where the action space is often both high-dimensional and continuous. Consider for a moment a standard 7-degree-of-freedom robot manipulator. If we discretize each degree of freedom into 5 actions, we end up with a network that must have $5^7 = 78125$ outputs! The situation would be much worse for a robot like Atlas, which has 28 degrees of freedom. The natural question is, of course: can we do better? I’ll try to address this question in my next post.
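The combinatorial explosion is easy to verify: with $b$ bins per degree of freedom, the joint discrete action space has $b^{\text{dof}}$ entries.

```python
# Number of discrete joint actions when each degree of freedom
# is discretized into `bins` actions.
def n_discrete_actions(dof, bins=5):
    return bins ** dof

print(n_discrete_actions(7))   # 78125 -- the 7-DoF manipulator above
print(n_discrete_actions(28))  # ~3.7e19 outputs for a 28-DoF robot
```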
