# Abstract

Many robotics problems are naturally formulated such that the extrinsic rewards to the agent are either sparse or missing altogether. These problems can be extremely difficult to solve as the environment provides little to no feedback to guide the agent toward accomplishing its goal. Previous works have shown that agents that train using prediction error as an intrinsic reward are able to learn across a wide range of domains, including Atari games and continuous control tasks [1, 2, 3]. In this project, I use curiosity-driven exploration to solve challenging robotics tasks with sparse rewards. Following these previous works, I formulate the intrinsic reward as the error in the agent’s ability to predict its next state, given its current state and executed action. My results demonstrate that this approach is capable of solving several difficult robotic manipulation tasks in simulation.

# Introduction

The goal of reinforcement learning is for an agent to learn to solve a given task by maximizing its total expected reward. Instead of relying on external instructions, the agent learns to choose actions by exploring and interacting directly with the environment. Reinforcement learning problems can roughly be sorted into two categories: 1) problems where the agent receives dense rewards, and 2) problems where the agent receives sparse (or no) rewards.

In the first case, the environment provides a continuous source of feedback to the agent in the form of dense scalar rewards. These rewards, which are received at every time step, guide the agent toward choosing the best actions to solve the task. This approach has seen many successes, including solving challenging Atari games [4] and physical control problems with continuous state and action spaces [5].

In the second case, which is the focus of this project, the environment provides little to no feedback to the agent. Robotics problems are prime examples of the sparse-reward settings that are so common in the real world. Consider, for example, a robotic agent tasked with clearing a table. Rather than designing a complicated reward function that accounts for the myriad subtasks, a more natural approach is to supply a reward only once every item has been removed and the goal is finally met.

Therefore, the agent must learn the requisite skills in the absence of any feedback from the environment, yet is unlikely to randomly stumble upon a good policy by chance. One way to overcome this challenge is to carefully engineer a reward function that generates extrinsic rewards to guide the agent’s progress. This approach, however, is saddled with the difficult chore of designing a custom reward function for every environment; and a hand-crafted design may inadvertently fail to specify the task well enough to deter undesirable behaviors from the agent. Alternatively, we may opt for methods that encourage the agent to explore and learn new skills in the absence of external rewards from the environment. In this project, I explore learning a reward function that is intrinsic to the agent in order to solve sparse-reward problems.

# Intrinsic Motivation in Reinforcement Learning

One source of inspiration for solving sparse-reward problems comes from the field of developmental psychology, namely the study of motivation. Psychologists distinguish two types of motivation: extrinsic and intrinsic.

In reinforcement learning, intrinsic motivation, or curiosity, is often formulated to encourage the agent to perform actions that lead to the discovery of novel states [6, 7, 8, 9, 10, 11]. For simple discrete environments, this amounts to keeping track of the state-visitation counts $N(s)$ and modeling novelty as some function $h(s)=\frac{1}{[N(s)]^p}f(s)$, where $p>0$ [11, 12]. As $N(s)$ increases, the state $s$ becomes less novel and the agent receives less reward. While this works well for small discrete Markov Decision Processes, this count-based method fails in large and continuous state spaces, where an agent is unlikely to visit a given state multiple times. Bellemare et al. solved this problem by learning a parameterized density model that can be used to derive an approximation to state-visitation counts called pseudo-counts [11]. Their pseudo-counts measure how often an agent has visited similar states, and were converted into exploration bonuses that significantly improved exploration on a number of challenging Atari games.
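The count-based bonus above can be sketched for a small discrete MDP. In this sketch I take $f(s)=1$ for simplicity, and the exponent $p=0.5$ (giving the familiar $1/\sqrt{N(s)}$ bonus) is an illustrative choice, not a value prescribed by the text:

```python
from collections import defaultdict

def count_bonus(counts, state, p=0.5):
    """Exploration bonus h(s) = 1 / N(s)^p for a discrete state.

    Takes f(s) = 1; p = 0.5 yields a 1/sqrt(N(s)) bonus. Repeated
    visits increase N(s), so the bonus decays toward zero.
    """
    counts[state] += 1
    return 1.0 / counts[state] ** p

counts = defaultdict(int)
b1 = count_bonus(counts, "s0")   # first visit:  N(s0) = 1 -> bonus 1.0
b2 = count_bonus(counts, "s0")   # second visit: N(s0) = 2 -> bonus 1/sqrt(2)
```

As the comments note, the bonus shrinks with every revisit, which is exactly the property that breaks down in continuous state spaces where $N(s)$ rarely exceeds one.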

Another formulation for an intrinsic reward encourages the agent to perform actions that reduce the error in its ability to predict the outcome of its actions [1, 2, 3, 13, 14]. Using deep neural networks to predict the agent’s next state given its current state and action, [1] showed that intrinsic reward (i.e. prediction error) can be used to learn good exploration policies, even in the absence of extrinsic rewards. Burda et al. systematically investigated how the choice of feature representation affects curiosity-driven learning across a diverse range of environments [2]. They showed that random features work well on the tasks an agent was trained on, but learned features tend to generalize better to unseen scenarios. Follow-up work by Burda et al. showed that distilling features from a randomly initialized network, and combining non-episodic extrinsic and intrinsic rewards with different discount factors, was able to solve the challenging Atari game Montezuma’s Revenge [3].

# Proposed Methods

The agent is composed of two submodules: a policy $\pi_{\theta_{P}}(s_t)$ and a dynamics model $f_{\theta_{D}}(s_t, a_t)$. The policy takes the current state as input and outputs the action to be executed. The dynamics model takes the agent’s current state and action as input and outputs a prediction of the next state.

The prediction error of the dynamics model is used to generate the dense intrinsic rewards $r^{i}_{t}$ at every timestep. We calculate it as follows:

$$r^{i}_{t} = \frac{1}{2}\left\lVert f_{\theta_{D}}(s_t, a_t) - s_{t+1}\right\rVert^{2}_{2}$$

The reward $R_t$ that is supplied to the agent at each timestep is the sum of the extrinsic reward $r^{e}_{t}$ from the environment and the intrinsic reward $r^{i}_{t}$:

$$R_t = r^{e}_{t} + r^{i}_{t}$$

The network parameters $\theta_{D}$ of the dynamics model are optimized by minimizing the loss function:

$$L(\theta_{D}) = \frac{1}{2}\left\lVert f_{\theta_{D}}(s_t, a_t) - s_{t+1}\right\rVert^{2}_{2}$$

Thus, the optimization problem that we need to solve for the agent is:

$$\max_{\theta_{P}} \; \mathbb{E}_{\pi_{\theta_{P}}}\!\left[\sum_{t} \gamma^{t} R_{t}\right]$$

with the dynamics model parameters $\theta_{D}$ trained jointly by minimizing the prediction loss above.

Although it is possible to use any on-policy reinforcement learning algorithm with this formulation, I chose Proximal Policy Optimization (PPO) [15] to train the agent’s policy due to its ease of implementation and improved stability over other policy gradient algorithms.
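To make the reward composition concrete, here is a minimal sketch of a single timestep. The dimensions and values are toy choices for illustration (a 4-D state, an untrained model that predicts "no change"), not the actual Fetch quantities:

```python
import numpy as np

def intrinsic_reward(pred_next, true_next):
    # Curiosity bonus: half the squared prediction error, large when
    # the dynamics model is surprised by a transition.
    return 0.5 * np.sum((pred_next - true_next) ** 2)

# Toy transition (4-D state for illustration, not the Fetch dims).
s_t    = np.zeros(4)
s_next = s_t + 0.1        # true next state from the environment
pred   = s_t              # an untrained model predicts "no change"

r_e = -1.0                # sparse extrinsic reward: task not yet solved
r_i = intrinsic_reward(pred, s_next)
R_t = r_e + r_i           # total reward fed to the policy optimizer
```

Early in training the prediction error (and hence $r^{i}_{t}$) dominates in unfamiliar states; as the dynamics model improves, the bonus decays and the extrinsic signal takes over.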

# Experimental Setup

This section describes the experimental design and environments that I used to evaluate whether adding intrinsic rewards as an exploration bonus aids in solving sparse robotics tasks.

## Environments

For this project, I used the standard Fetch robotics environments [16] provided in the OpenAI gym [17]. The Fetch robotics environments are a set of benchmark tasks for continuous control of robotic manipulation. The tasks include reaching, pushing, pick-and-place, and sliding. The Fetch robot is a 7-degree-of-freedom robotic arm with a two-fingered parallel gripper as its end effector. The robot’s arm is controlled using position control of the end effector. The action space consists of four dimensions: three dimensions control the change in position along each of the Cartesian axes, while the fourth controls the opening and closing of the gripper. The state space of each environment includes the position ($x$, $y$, $z$) and velocity ($v_x$, $v_y$, $v_z$) of the gripper, and the joint states and velocities of the gripper fingers. If an object is present in the scene, the state also includes the position, linear velocity, and angular velocity of the object, as well as the position and linear velocity of the object relative to the gripper.
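For illustration, the 4-D action described above can be assembled as follows. The helper name is mine, and clipping to $[-1, 1]$ is an assumption about the normalized action range:

```python
import numpy as np

def make_action(dx, dy, dz, gripper):
    # 4-D Fetch-style action: Cartesian position deltas for the end
    # effector plus one dimension for opening/closing the gripper.
    # Components are clipped to an assumed normalized range [-1, 1].
    return np.clip(np.array([dx, dy, dz, gripper], dtype=np.float64), -1.0, 1.0)

a = make_action(0.5, -0.2, 1.7, 0.0)   # the dz component is clipped to 1.0
```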

Random Agents

### Reaching

The Reaching task is the simplest task in the Fetch robotics suite. The agent controls the $x$, $y$, and $z$ position of the robot’s gripper while the gripper fingers are blocked. The task is considered solved when the gripper is within some threshold distance of a given target. The state space of this environment is 13-dimensional.

### Pushing

The Pushing task is more complex than the Reaching task, as it includes an additional object in the scene and hence more degrees of freedom. As in the Reaching task, only the 3D position of the gripper is controlled while the fingers are blocked. The Pushing task is considered solved once the agent has pushed a block to within some threshold distance of a given target location, which lies within the workspace of the robot. The state space of this environment is 28-dimensional.

### Pick and Place

The Pick and Place task requires the agent to grasp a block in its workspace and lift it to a specified location. The state space of this environment is 28-dimensional.

### Sliding

Similar to the Pushing task, the Sliding task is solved once the robot slides a puck to a given target location. Unlike the Pushing task, the target location lies outside the workspace of the robot. The state space of this environment is 28-dimensional.

## Environment Modifications

I made a number of modifications to the environments to allow more control over the extrinsic reward, observation type, and termination of an episode.

### Rewards

Each environment is classified by the type of extrinsic reward it provides:

• Dense - at every time step the agent receives a reward that is proportional to the distance from a desired goal
• Sparse - the agent receives a reward of -1 at every time step and a reward of 0 if it solves the task
• Very Sparse - the agent receives a reward of 0 at every time step and a reward of 1 if it solves the task
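A minimal sketch of the three reward schemes, assuming success is defined by a distance threshold to the goal. The function name is mine, and the dense shaping shown (negative distance) is one plausible choice rather than the exact shaping used by gym:

```python
def extrinsic_reward(distance, threshold, scheme):
    # Map the distance to the goal onto a reward under each scheme.
    # "dense" shaping as negative distance is an assumption.
    solved = distance < threshold
    if scheme == "dense":
        return -distance
    if scheme == "sparse":
        return 0.0 if solved else -1.0
    if scheme == "very_sparse":
        return 1.0 if solved else 0.0
    raise ValueError(f"unknown scheme: {scheme}")
```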

### Resets

The episode reset behavior can be configured as follows:

• Never reset the environment before the maximum number of time steps
• Reset the environment early only after successful attempts
• Reset the environment early only when the task becomes impossible
• Reset the environment early after both successes and failures

### Observation types

The following observations can be used:

• State space (i.e. kinematics) of the robot (default)
• RGB images from three different views
• Depth maps corresponding to the RGB images above
• Touch sensor data to measure contacts between the gripper and the environment

## Network Architecture and Training Details

All agents in this project are trained using kinematic inputs, i.e. the positions and velocities of the robot’s end effector as well as any objects in the scene. The agent’s policy and value function networks are both parameterized as feedforward neural networks. The policy network consists of two hidden layers: the first with a ReLU activation, the second with a Tanh activation. The output of the second layer is fed to another module that models the robot’s actions as a Gaussian distribution (i.e. predicts the mean and standard deviation). The value function network also contains two hidden layers, both with ReLU activations. Both the policy and value networks take the robot’s state as input.

The agent’s dynamics model is a three-layer neural network with ReLU activations on the two hidden layers and a Tanh activation on the last layer. The dynamics model takes as input the state ($s_t$) of the robot and its current action ($a_t$) and outputs a prediction of either (1) the robot’s next state ($\hat s_{t+1}$) or (2) the change in the robot’s state ($\Delta \hat s_t$).
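A minimal numpy sketch of this forward pass. The weight initialization is illustrative, and the layer sizes shown (64 hidden units, Reaching-task state dimension) are taken from the hyperparameter discussion below for concreteness:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # One (W, b) pair per layer; sizes = [input, hidden, hidden, output].
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dynamics_forward(params, s_t, a_t):
    # ReLU on the two hidden layers, Tanh on the output layer,
    # matching the architecture described above.
    x = np.concatenate([s_t, a_t])
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return np.tanh(x @ W + b)      # prediction of s_{t+1} (or Δs_t)

state_dim, action_dim, hidden = 13, 4, 64      # Reaching-task sizes
params = init_mlp([state_dim + action_dim, hidden, hidden, state_dim])
pred = dynamics_forward(params, np.zeros(state_dim), np.zeros(action_dim))
```

Note that the Tanh output bounds predictions to $[-1, 1]$, which implicitly assumes the (change in) state is normalized to that range.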

All networks are trained separately using the Adam optimizer with different learning rates.

# Results

As mentioned above, the Reaching task is the simplest task in the Fetch robotics suite. It is therefore a great environment to verify the baseline PPO implementation, as well as to investigate whether adding dense intrinsic rewards improves exploration and learning of sparse tasks. As shown in the plots below, all agents readily solve the Reaching task, converging to a 100% success rate by the end of training.

Learning progress on the Reaching Task

Unlike the Reaching task, the baseline PPO agent was unable to solve the Pushing task (red curve in the plots below). However, adding intrinsic rewards as an exploration bonus led to rapid convergence: by the end of training, the agent was able to solve $>96\%$ of episodes. Interestingly, some of the learned policies had a propensity to roll the block while pushing it to the target. I suspect this is because the environment does not provide detailed information about the block, such as its size and moment of inertia. Under these conditions, the dynamics model cannot accurately predict the state of the block, so the agent is incentivized to move the block in interesting ways as it pushes it toward the goal.

Learning progress on the Pushing Task

## Analysis of Hyperparameters for the Pushing Task

The Pushing Task is the simplest non-trivial task in the OpenAI Robotics Suite. It therefore provided a good environment to investigate how tuning the various hyperparameters affects overall performance. This section details the analyses used to select the hyperparameters used in most of the experiments.

### Number of Hidden Units

Tuning the number of neurons in the hidden layers is an important aspect of training neural networks: if the network is too large, it risks over-fitting; if it is too small, it may fail to learn anything at all. In this experiment, I varied the size of the hidden layers from 16 to 1024 neurons. In general, larger networks performed better than smaller ones. For most of the experiments shown, I settled on 64 neurons per hidden layer, since 64-unit networks performed comparably to the larger networks while having fewer parameters and requiring less wall-clock time to train.

### Number of Parallel Workers

Since PPO is on-policy, training requires data collected with the current policy. This is difficult because the distributions of states, rewards, and actions change every time the model is updated. One way around this issue is to parallelize the environment and use multiple workers to collect a large and varied amount of training data. The number of workers required to solve a given task is likely proportional to the complexity of the task: solving the Reaching task with four or fewer workers is relatively easy, while solving the Pushing task with so few workers is vastly more difficult (data not shown). In this experiment, I tuned the number of parallel workers collecting rollout data with the current policy. With only 8 workers, the model learns up to a certain point before performance begins to degrade. In contrast, increasing the number of workers to 16 and beyond greatly improved convergence. Based on these results and the number of cores available on my CPU, I chose 32 parallel workers for all of the experiments.

### Discount Factor $\gamma$

The discount factor $\gamma$ is a critical hyperparameter that determines how much the agent values future rewards. I performed experiments comparing $\gamma = 0.99$, $0.95$, and $0.90$; of the three values, $0.95$ performed best.
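To see why $\gamma$ matters here, consider the discounted return of a sparse-reward episode that never succeeds (a reward of $-1$ at every step). The episode length below is an illustrative choice:

```python
def discounted_return(rewards, gamma):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., computed backwards.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Under the sparse scheme (-1 per step until success), a smaller gamma
# down-weights distant failures more aggressively, shortening the
# effective horizon the agent optimizes over.
g95 = discounted_return([-1.0] * 50, 0.95)   # less negative
g99 = discounted_return([-1.0] * 50, 0.99)   # more negative
```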

### Dynamics Model Learning Rate

The prediction error of the dynamics model is used to generate the intrinsic reward that encourages the agent to explore. As such, the intrinsic rewards should be relatively large at the beginning of training, when the agent lacks an understanding of the environment, and smaller once the agent is able to predict the consequences of its actions. If the dynamics model learns too quickly, the intrinsic rewards will shrink prematurely and the agent won’t explore efficiently. If, on the other hand, the dynamics model learns too slowly, persistently large intrinsic rewards will result in excessive exploration. To determine which learning rate would be best for the Fetch robotics tasks, I varied the dynamics model learning rate over three orders of magnitude. I observed that, while the agent was able to learn under every setting of the learning rate, smaller learning rates caused the agent to learn fastest.

## Solving the Pick and Place Task

The Pick and Place task was readily solved using the same set of hyperparameters that were used to solve the Pushing task. As can be seen in the plots below, agents trained on the Pick and Place task were able to solve $\approx 80 \%$ of the episodes by the end of training.

Learning progress on the Pick and Place Task

The Sliding task proved to be the most challenging. Its difficulty is likely due to the agent having fewer timesteps to interact with the puck, owing to the low friction between the puck and the surface of the table. Unlike Reaching and Pushing, trained agents were only able to solve $\sim 80\%$ of episodes near the end of training. Furthermore, networks with hidden layers of size 64 were insufficient for solving the Sliding task (data not shown); sliding required networks with at least 128 neurons. The Sliding task was also sensitive to the dynamics model learning rate. It is also worth mentioning that the Sliding task required 40 million environment steps, roughly double the number of environment interactions required to solve the Pushing task.

Learning progress on the Sliding Task

# Summary

Consistent with previous works, my results demonstrate that curiosity-driven exploration can be used to solve challenging tasks with sparse rewards. In particular, I have shown that using the prediction error as the intrinsic reward can encourage an agent to solve a diverse set of robotics tasks. In addition to the intrinsic reward signal, properly tuning various hyperparameters plays a significant role in ensuring that the agent can learn in the presence of sparse rewards. This was especially true for the Sliding task, as it seemed the most sensitive to the selection of hyperparameters. Surprisingly, while one set of hyperparameters was sufficient for solving Reaching, Pushing, and Pick and Place, Sliding required significant tuning and many more environment interactions.

# Future Directions

The work presented here provides an excellent starting point for future research. In this section, I will briefly discuss a few ideas that are motivating my next set of experiments.

## Learning from Pixels

Learning control policies from pixels is currently an active area of research. Building off of my current results, I will perform experiments comparing how different learned feature representations for images may affect intrinsic motivation and exploration. In particular, I will investigate feature representations such as random convolutional neural network features (RF), variational autoencoder (VAE) features, and inverse dynamics features (IDF) [2].

## Combining Multiple Modalities

While previous papers have largely focused on learning a feature space for a single modality such as images for Atari games or joint kinematics for continuous control, it is not clear which modalities are most important for solving robotic manipulation tasks. Robots are often equipped with one or more sensors that measure various aspects of their state and environment: (1) encoders that measure positions and velocities of its joints; (2) cameras that provide visual perception; and (3) tactile sensors that measure contacts with the environment. I therefore intend to address the following question:

• Does including additional sensor modalities in the feature space result in better exploration policies?

Although there are numerous ways to combine the feature representations for the different modalities, I will initially focus on learning vector representations, and concatenating those features as input to the dynamics model to generate intrinsic rewards. In a separate set of experiments I will also investigate whether the learned multimodal features are able to improve policy learning.
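As a sketch of this concatenation scheme, where the per-modality feature names and dimensions are hypothetical choices of mine:

```python
import numpy as np

# Hypothetical per-modality embeddings (names and sizes are illustrative).
kin_feat   = np.zeros(32)    # kinematics encoder output
img_feat   = np.zeros(128)   # image encoder output
touch_feat = np.zeros(16)    # tactile encoder output

# Early fusion: concatenate the modality features into a single vector
# that the dynamics model would consume to generate intrinsic rewards.
fused = np.concatenate([kin_feat, img_feat, touch_feat])
```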
