# Exploring Exploration

This week I learned about Exploration and Intrinsic Motivation.

# Overview of Exploration vs. Exploitation

The goal of a reinforcement learning agent is to learn how to solve a task by maximizing its long-term expected return. In order to do that, the agent must learn about the environment by first taking actions and then using the rewards it received to determine whether those actions were successful or not.

Graphic credit to Andrew Barto

One of the major challenges faced by a learning agent is determining whether or not it has learned enough about the environment to actually solve the task. This challenge is known as the Exploration-Exploitation Dilemma. Exploration refers to the agent taking an action in order to gather more information about the world. Exploitation, on the other hand, refers to the agent choosing the most rewarding action given what it already knows about the world. The dilemma results from the fact that exploring the world to gather information and exploiting what you know are often mutually exclusive. But, by properly balancing this trade-off between exploration and exploitation, an agent can learn how to optimally perform a given task.

# Multi-armed bandits

To understand this dilemma better, let’s look at one of the standard problems in reinforcement learning known as the multi-armed bandit problem. A multi-armed bandit is a simplified Markov Decision Process $M = \langle A, R \rangle$, where $A$ is the set of actions (i.e. “arms”), $R$ is the reward function, and there is only one state. We can think of a $k$-armed bandit as a row of slot machines, where each of the $k$ actions corresponds to pulling one of the levers. The goal here is to maximize the sum of rewards by learning, through trial and error, which arms to pull. The agent faces the exploration-exploitation dilemma at the beginning of each step.

## Exploration: finding the best strategy

There are many different exploration strategies an agent could use to find the best action. Here are a few:

• $\bf \text{Greedy}$
• $\bf \epsilon \text{-Greedy}$
• $\bf \text{Decaying } \epsilon \text{-Greedy}$
• $\bf \text{Optimistic Initialization}$
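The four strategies listed above differ only in how often the agent picks a random arm and in how its value estimates are initialized, so they can be sketched with a single function. This is a minimal illustration, not a reference implementation: the function name `run_bandit`, its parameters, the Gaussian reward noise, and the sample-mean update are all assumptions made for the example.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, decay=1.0,
               initial_value=0.0, noise=1.0, seed=0):
    """Run one agent on a k-armed bandit with Gaussian payouts (an assumption).

    epsilon=0      -> greedy
    epsilon>0      -> epsilon-greedy
    decay<1        -> decaying epsilon-greedy (epsilon shrinks every step)
    initial_value  -> set well above any real payout for optimistic initialization
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [initial_value] * k   # estimated value of each arm
    n = [0] * k               # number of pulls of each arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                                   # explore
        else:
            best = max(q)
            arm = rng.choice([a for a in range(k) if q[a] == best])  # exploit
        reward = rng.gauss(true_means[arm], noise)
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]    # incremental sample mean
        total_reward += reward
        epsilon *= decay

    return q, n, total_reward


# Noise-free demo so the outcome does not depend on luck.
means = [0.1, 0.5, 1.5]
_, greedy_pulls, _ = run_bandit(means, steps=100, epsilon=0.0, noise=0.0)
_, optimistic_pulls, _ = run_bandit(means, steps=100, epsilon=0.0,
                                    initial_value=10.0, noise=0.0)
print(sorted(greedy_pulls))   # [0, 0, 100]: greedy never leaves its first arm
print(optimistic_pulls)       # [1, 1, 98]: every arm is tried once, then the best
```

With the sample-mean update used here, an optimistic initial value only guarantees that each arm is pulled at least once; a constant step size would keep the optimism alive for longer.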

# Challenges in the full MDP case

While the above exploration strategies seem to work well for small MDPs, they are likely insufficient for the most difficult problems. Why is that? Firstly, reinforcement learning tasks specified by researchers often rely on a hand-engineered reward function. This can be problematic, as the reward might not really be informative. Secondly, many problem domains that are currently researched have large and continuous state and/or action spaces. Large and high-dimensional state spaces are difficult to explore efficiently. Finally, some problems are naturally formulated as having a sparse reward. That is, the agent only receives the reward when it completes the task. These problems are challenging because there are no environmental signals that will guide the agent to the goal. So how do we solve these problems? One approach is to use an idea from animal psychology called intrinsic motivation.

# Intrinsic Motivation in Humans and Animals

According to (some) psychologists, motivation can be characterized as the set of forces that influence an organism to act and direct its behavior toward certain activities. There are two types of motivation: extrinsic motivation and intrinsic motivation. Extrinsic motivation refers to an organism being driven to do something because of an external reward. Intrinsic motivation, on the other hand, refers to an organism being driven to do something simply because it is enjoyable. Research studies in animals and humans suggest that intrinsic motivation is largely independent of the necessary biological drives such as satisfying hunger, seeking shelter, etc. Furthermore, it is currently thought that extrinsic motivation needs to be learned, while intrinsic motivation is inherent. As such, intrinsic motivation is likely used to augment an organism's knowledge and/or skill in a way that has biological utility at a later point in life. Indeed, according to a classic paper by Robert White:

So the natural question is: what is intrinsically motivating (i.e. rewarding) about certain activities, and how can an organism use that motivation to learn? Most experts have settled on organisms using some notion of novelty, surprise, or incongruity to update what they know about the world. In the next section we will look at how an artificial agent can use intrinsic motivation to learn.

# Intrinsic Motivation in Reinforcement Learning

To understand intrinsic motivation and its role in learning autonomous behaviors, let’s first refine our view of reinforcement learning. Here, we can imagine the environment being split into two separate parts: an external environment and an internal environment. As with the traditional view of RL, the external environment provides sensations (i.e. observations) and reward signals that are external to the agent. Those external signals are then passed to a critic within the internal environment, which then generates internal reward signals. We can think of the external reward signals as things such as food or money, while we can think of the internal rewards as biochemical signals generated within the brain.

Graphic credit to Andrew Barto

We can then view the reward signal as a combination of the extrinsic and intrinsic rewards. This can be viewed as adding an exploration bonus to the reward provided by the environment.
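One plausible way to write this combination, using $r^{e}_{t}$ for the extrinsic reward and $r^{i}_{t}$ for the intrinsic reward, is the weighted sum below. The coefficient $\beta \geq 0$ is an assumption added here to control the size of the exploration bonus; setting $\beta = 1$ gives a plain sum.

$$r_{t}(s, a, s') = r^{e}_{t}(s, a, s') + \beta \, r^{i}_{t}(s, a, s')$$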

As mentioned above, intrinsic motivation usually reflects some notion of novelty or surprise that the agent experiences internally while exploring. The fact that intrinsic motivation does not come from the external environment means that intrinsic motivation is independent of the task. It also means that the intrinsic reward $r^{i}_{t}(s, a, s')$ should change as the agent explores and is no longer surprised by its experiences. Note that this is in contrast to the extrinsic reward $r^{e}_{t}(s, a, s')$ which is always the same given the same $(s,a,s')$ tuple.

So how do we measure intrinsic motivation, novelty, or surprise for autonomously learning agents? We’ll discuss some recent works below.

### Count-based exploration of novel states

One form of novelty involves some measure of how “surprised” the agent is upon entering some state $s$. For simple discrete environments, this amounts to keeping track of how many times the agent has visited each state $s$, written $N(s)$, and modeling surprise as some function $h(s) = \frac{1}{[N(s)]^p}f(s)$ with $p \gt 0$. As the state-visitation count $N(s)$ increases, the state $s$ becomes less novel and the agent receives less intrinsic reward.
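A count-based bonus of this kind takes only a few lines for a discrete state space. The sketch below is illustrative: the class name `CountBonus` is made up for the example, states are assumed to be hashable, and the factor $f(s)$ is assumed to be 1 so that the bonus reduces to $1 / N(s)^p$.

```python
from collections import defaultdict


class CountBonus:
    """Novelty bonus 1 / N(s)**p for hashable states (f(s) = 1 is an assumption)."""

    def __init__(self, p=0.5):
        if p <= 0:
            raise ValueError("p must be positive")
        self.p = p
        self.counts = defaultdict(int)   # N(s), zero for unseen states

    def visit(self, state):
        """Record a visit to `state` and return the bonus for that visit."""
        self.counts[state] += 1
        return 1.0 / self.counts[state] ** self.p


bonus = CountBonus(p=0.5)
print([round(bonus.visit("s0"), 3) for _ in range(4)])  # [1.0, 0.707, 0.577, 0.5]
print(bonus.visit("s1"))                                 # 1.0, a new state is fully novel
```

With $p = 0.5$ the bonus decays as $1/\sqrt{N(s)}$; larger values of $p$ make the agent lose interest in a state more quickly.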

While this may work well for small discrete Markov Decision Processes, the above count-based method will fail in large continuous state spaces. This is because any given state is rarely visited multiple times in high-dimensional spaces. To get around this, Bellemare et al. showed that it is possible to learn a parameterized density model that can be used to derive what the authors call pseudo-counts. The pseudo-counts are a measure of how often the agent has visited similar states.
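As I understand the pseudo-count construction, it compares the probability $\rho(s)$ that the density model assigns to a state before observing it with the probability $\rho'(s)$ it assigns after one more observation of that state, and defines $\hat{N}(s) = \frac{\rho(s)\,(1 - \rho'(s))}{\rho'(s) - \rho(s)}$. The sketch below uses plain empirical frequencies as a stand-in for the learned density model, which is an assumption made purely for illustration; in that special case the pseudo-count equals the true visit count, which is a useful sanity check.

```python
from fractions import Fraction


def pseudo_count(rho, rho_next):
    """Pseudo-count from the density before (rho) and after (rho_next) one more visit."""
    if rho_next <= rho:
        raise ValueError("the density of a state must increase after observing it")
    return rho * (1 - rho_next) / (rho_next - rho)


# Stand-in density model: empirical frequencies over a toy visit history.
history = ["a", "b", "a", "c", "a", "b"]
n = len(history)
for state in ["a", "b", "c"]:
    visits = history.count(state)
    rho = Fraction(visits, n)                # probability before the new visit
    rho_next = Fraction(visits + 1, n + 1)   # probability after one more visit
    print(state, visits, pseudo_count(rho, rho_next))   # the last two columns agree
```

A learned density model generalizes across similar states, so its pseudo-counts are generally not integers and can be positive even for states that were never visited exactly.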

### Curiosity as a measure of prediction error

Another form of novelty is related to curiosity. The curious agent needs to learn how to predict the next state given its current state and current action. Here, curiosity is modeled as the prediction error between the agent’s prediction and the actual next state. In other words, the agent needs to learn a forward dynamics model of the environment. Indeed, according to Schmidhuber,

Curiosity, therefore, encourages the agent to select actions that will reduce the uncertainty in the agent’s ability to predict future consequences. The uncertainty is highest in unexplored regions of the state space, or in regions that have complex dynamics. So how does the agent actually learn to make these predictions?

Instead of predicting the next state in the raw pixel space, Pathak et al. learned what they call an Intrinsic Curiosity Module (ICM). The ICM consists of two neural networks that model the forward and inverse dynamics of the environment. The inverse model learns a lower-dimensional feature space of the environment by learning to predict the action taken, given the feature representations of the current state and the next state. Because these features only need to explain the agent's actions, the module eventually learns to model only the important aspects of the environment, such as the agent itself and the objects that affect the agent. Finally, the forward model uses the learned features of the current state, along with the current action, to predict the features of the next state. As mentioned above, the prediction error of this forward dynamics model is an intrinsic reward that encourages the agent’s curiosity.
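The forward-model half of this idea can be sketched without any deep learning library. The example below is a heavily simplified stand-in and not the ICM itself: it assumes the features are handed to it directly (the feature encoder and the inverse model are omitted), it uses a linear model in place of a neural network, and the class name `ForwardModelCuriosity` and learning rate are made up for the illustration.

```python
class ForwardModelCuriosity:
    """Linear forward model whose squared prediction error is the intrinsic reward."""

    def __init__(self, feature_dim, num_actions, lr=0.1):
        self.num_actions = num_actions
        self.lr = lr
        in_dim = feature_dim + num_actions
        self.w = [[0.0] * in_dim for _ in range(feature_dim)]   # weight matrix

    def _inputs(self, features, action):
        # Concatenate the state features with a one-hot encoding of the action.
        one_hot = [1.0 if a == action else 0.0 for a in range(self.num_actions)]
        return list(features) + one_hot

    def predict(self, features, action):
        x = self._inputs(features, action)
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in self.w]

    def reward_and_update(self, features, action, next_features):
        """Return 0.5 * ||prediction - next_features||^2, then take one SGD step."""
        x = self._inputs(features, action)
        errors = [p - t for p, t in zip(self.predict(features, action), next_features)]
        reward = 0.5 * sum(e * e for e in errors)
        for row, e in zip(self.w, errors):
            for j, x_j in enumerate(x):
                row[j] -= self.lr * e * x_j   # gradient of the squared error
        return reward


model = ForwardModelCuriosity(feature_dim=2, num_actions=2)
# Revisiting the same transition makes it predictable, so the reward shrinks.
rewards = [model.reward_and_update([1.0, 0.0], 0, [0.0, 1.0]) for _ in range(5)]
print(rewards[0] > rewards[-1])   # True
```

The intrinsic reward is computed before the update, so the agent is rewarded for how surprising the transition was at the moment it was experienced.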