# Deep Q-networks

This post uses Deep Q-Networks to introduce *off-policy* algorithms.

# Overview of Off-Policy Algorithms

Until now I have focused on *on-policy* algorithms - *i.e.* algorithms that learn from data that were generated with the current policy. *Off-policy* algorithms, on the other hand, are able to learn from experiences (*e.g.* transitions of the form $(s_t, a_t, r_t, s_{t+1})$) collected from previous policies. Because off-policy methods are able to reuse old data, they tend to be more sample-efficient than on-policy methods.

# Deep Q Learning

One recent example of an off-policy method is the venerable *Deep Q-learning* algorithm that learned to play a number of Atari games with human-level performance. The use of deep neural network function approximators extended classical Q-learning beyond finite and discrete state spaces to problem domains with continuous and high-dimensional state spaces. Quite surprisingly, Deep Q-learning was able to solve 57 challenging Atari games using the same set of hyperparameters.

At the core of Deep Q-learning is the Deep Q-Network (DQN). Q-networks take as input some representation of the state of the environment. For Atari games, the input could be RGB or gray-scale pixel values. For a robot manipulator, the input could include a combination of the position, linear velocity, and angular velocity of its links and/or joints. Q-networks output one Q-value per action. Because Q-networks learn the values of state-action pairs, they can be viewed as a parameterized representation of the critic introduced in my last post. Unlike policy gradient methods that learn a policy directly, Deep Q-networks learn an *induced* policy. In other words, an action is selected by finding the *maximum* over the set of Q-values $\{Q_\theta(s, a_1), \dots, Q_\theta(s, a_n)\}$ that the network outputs.
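The induced greedy policy can be sketched in a few lines of plain Python (the Q-values below are made-up numbers for illustration):

```python
def induced_policy(q_values):
    """Greedy policy induced by a Q-network: pick the index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for a state with four discrete actions.
print(induced_policy([0.1, 2.5, -0.3, 1.7]))  # prints 1, the highest-valued action
```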

So how does Deep Q-learning work? The core of the algorithm involves the computation of the *temporal difference* (TD) error for transitions $(s_t, a_t, r_t, s_{t+1})$ sampled from taking actions in the environment:

$$\delta_t = \underbrace{r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')}_{y_t} - Q_\theta(s_t, a_t)$$

where $y_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')$ is a bootstrapped estimate of the Q function. Similar to supervised learning, we minimize the squared loss between the *target values* $y_t$ and the outputs of the network $Q_\theta(s_t, a_t)$:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q_\theta(s_i, a_i) \right)^2$$
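For a single transition, the bootstrapped target and the squared loss are just a few arithmetic operations. A minimal sketch (function names are my own, not from a library):

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Bootstrapped target y_t = r_t + gamma * max_a' Q(s_{t+1}, a')."""
    if terminal:
        return reward  # no bootstrapping past the end of an episode
    return reward + gamma * max(next_q_values)

def squared_td_loss(q_value, target):
    """Squared error between the network output Q(s_t, a_t) and the target y_t."""
    return (target - q_value) ** 2

# Example: reward 1.0, best next-state Q-value 2.0, discount 0.5.
y = td_target(1.0, [0.0, 2.0], gamma=0.5)
print(y)                          # 1.0 + 0.5 * 2.0 = 2.0
print(squared_td_loss(1.0, y))    # (2.0 - 1.0)^2 = 1.0
```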

## Tricks of the trade

Although Deep Q-learning is conceptually straightforward, there are a few tricks required to get the algorithm to converge in practice.

- **Stable target network** - Because DQN is *not really* a supervised learning algorithm, the target $y_t$ would change every time the network parameters $\theta$ are updated. This is bad because changing $Q_\theta(s_t, a_t)$ and $y_t$ in the same direction would cause the algorithm to diverge. We can avoid this by computing the target using a frozen target network $Q_{\theta^-}$ that is updated to match $Q_\theta$ every $C$ iterations.
- **Replay buffer** - When an agent acts in an environment, the set of experiences for a single episode are temporally correlated. This violates the *i.i.d.* assumption required of most learning algorithms. We can de-correlate the experiences by placing them in a replay buffer and randomly sampling them to update the Q-network.
- **Stacked frames** - Single images don't convey dynamic information, so stacking multiple consecutive frames allows the agent to infer movement in the environment.

## DQN Algorithm

- For episode $= 1, 2, \dots$
    - For $t = 1, \dots, T$:
        - Perform $\epsilon$-greedy action selection: with probability $\epsilon$ choose a random action $a_t$, otherwise $a_t = \arg\max_a Q_\theta(s_t, a)$
        - Execute action $a_t$ and observe reward $r_t$ and next state $s_{t+1}$
        - Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay buffer $\mathcal{D}$
        - Sample minibatch $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ from replay buffer $\mathcal{D}$
        - Calculate targets $y_i$: $y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s_{i+1}, a')$, or $y_i = r_i$ if $s_{i+1}$ is terminal
        - Calculate the loss: $L(\theta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - Q_\theta(s_i, a_i))^2$
        - Update the network parameters $\theta$ with a gradient step on $L(\theta)$
        - If $t \bmod C = 0$, update target network: $\theta^- \leftarrow \theta$
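The $\epsilon$-greedy selection in the first step of the loop can be sketched as follows (a minimal illustration, not tuned for any particular environment):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon take a random action; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice $\epsilon$ is usually annealed from 1.0 toward a small value (e.g. 0.1) over the first stretch of training, so the agent explores heavily early on and exploits its learned Q-values later.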

## Results

# Takeaways

One major drawback of Deep Q-Networks is that they can only handle low-dimensional, discrete action spaces. This makes DQN unsuitable for robotics control problems where the action space is often both high-dimensional and continuous. Consider for a moment a standard 7-degree-of-freedom robot manipulator. If we discretize the action space so that there are 5 actions for every degree of freedom, we end up with a network that must have $5^7 = 78{,}125$ outputs! The situation would be much worse for a robot like Atlas that has 28 degrees of freedom. The natural question is, of course, can we do better? I'll try to address this question in my next post.
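The combinatorial explosion is easy to verify directly, since the number of joint actions grows exponentially in the number of degrees of freedom:

```python
def num_discrete_actions(actions_per_dof, num_dofs):
    """Size of the joint action space after per-DOF discretization."""
    return actions_per_dof ** num_dofs

print(num_discrete_actions(5, 7))   # 7-DOF manipulator: 78125 network outputs
print(num_discrete_actions(5, 28))  # 28-DOF robot: ~3.7e19 outputs, clearly infeasible
```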