Many robotics problems are naturally formulated such that the extrinsic rewards to the agent are either sparse or missing altogether. These problems can be extremely difficult to solve as the environment provides little to no feedback to guide the agent toward accomplishing its goal. Previous works have shown that agents that train using prediction error as an intrinsic reward are able to learn across a wide range of domains, including Atari games and continuous control tasks [1, 2, 3]. In this project, I use curiosity-driven exploration to solve challenging robotics tasks with sparse rewards. Following these previous works, I formulate the intrinsic reward as the error in the agent’s ability to predict its next state, given its current state and executed action. My results demonstrate that this approach is capable of solving several difficult robotic manipulation tasks in simulation.
The goal of reinforcement learning is for an agent to learn to solve a given task by maximizing its total expected reward. Instead of relying on external instructions, the agent learns how to choose actions by exploring and interacting directly with the environment. Reinforcement learning problems can roughly be sorted into two categories: 1) those in which the agent receives dense rewards, and 2) those in which the agent receives sparse (or no) rewards.
In the first case, the environment provides a continuous stream of feedback to the agent in the form of dense scalar rewards. These rewards, received at every time step, guide the agent toward choosing the best actions to solve the task. This approach has seen many successes, including solving challenging Atari games [4] and physical control problems with continuous state and action spaces [5].
In the second case, which is the focus of this project, the environment provides little to no feedback to the agent. Robotics problems are great examples of the sparse-reward settings that are so common in the real world. Consider, for example, a robotic agent tasked with clearing a table. Rather than designing a complicated reward function that accounts for the myriad subtasks involved, a more natural approach is to supply a reward only once every item has been removed and the goal is finally met.
The agent must therefore learn the requisite skills in the absence of any feedback from the environment, yet it is unlikely to stumble upon a good policy by chance. One way to overcome this challenge is to carefully engineer a reward function that generates extrinsic rewards to guide the agent's progress. This approach, however, is saddled with the difficult chore of designing a custom reward function for every environment; and a hand-crafted design may inadvertently fail to specify the task well enough to deter undesirable behaviors. Alternatively, we may opt for methods that encourage the agent to explore and learn new skills in the absence of any external rewards. In this project, I explore learning a reward function that is intrinsic to the agent in order to solve sparse-reward problems.
Intrinsic Motivation in Reinforcement Learning
One source of inspiration for solving sparse-reward problems comes from the field of developmental psychology, namely the study of motivation. Psychologists distinguish two types of motivation: extrinsic motivation, which is driven by rewards from the environment, and intrinsic motivation, which arises from within the agent itself.
In reinforcement learning, intrinsic motivation, or curiosity, is often formulated to encourage the agent to perform actions that lead to the discovery of novel states [6, 7, 8, 9, 10, 11]. For simple discrete environments, this amounts to keeping track of the state-visitation counts $N(s)$ and modeling novelty as some function $f(N(s))$ that decreases as $N(s)$ grows [11, 12]. As $N(s)$ increases, the state becomes less novel and the agent receives less reward. While this works well for small discrete Markov Decision Processes, counts-based methods fail in large and continuous state spaces, where an agent is unlikely to visit a given state multiple times. Bellemare et al. [11] addressed this problem by learning a parameterized density model from which an approximation to state-visitation counts, called pseudo-counts, can be derived. Their pseudo-counts measure how often an agent has visited similar states, and were converted into exploration bonuses that significantly improved exploration in a number of challenging Atari games.
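To make the count-based idea concrete, here is a minimal sketch for small discrete state spaces. The $1/\sqrt{N(s)}$ decay and the scale `beta` are common illustrative choices, not the exact formulation from the cited works:

```python
from collections import defaultdict
import math

class CountBonus:
    """Count-based exploration bonus for small discrete state spaces.

    The bonus shrinks as a state is visited more often, so novel states
    yield larger intrinsic rewards.
    """

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s): state-visitation counts

    def bonus(self, state):
        # Record the visit, then return a reward that decays with the count.
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```

On the first visit to a state the bonus is `beta`; by the fourth visit it has halved, mirroring how novelty fades with repeated visits.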
Another formulation of intrinsic reward encourages the agent to perform actions that reduce the error in its ability to predict the outcome of those actions [1, 2, 3, 13, 14]. Using deep neural networks to predict the agent's next state given its current state and action, Pathak et al. [1] showed that this prediction error can serve as an intrinsic reward for learning good exploration policies, even in the absence of extrinsic rewards. Burda et al. [2] systematically investigated how the choice of feature representation affects curiosity-driven learning across a diverse range of environments. They showed that random features work well for evaluating trained tasks, but learned features tend to generalize better to unseen scenarios. Follow-up work by Burda et al. [3] showed that distilling features from a randomly initialized network, and combining non-episodic extrinsic and intrinsic rewards with different discount factors, was able to solve the challenging Atari game Montezuma's Revenge.
The agent is composed of two submodules: a policy $\pi(s_t; \theta_P)$ and a dynamics model $f(s_t, a_t; \theta_D)$. The policy takes as input the current state and outputs the action to be executed. The dynamics model takes as input the agent's current state and action and outputs a prediction of the next state.
The prediction error of the dynamics model is used to generate a dense intrinsic reward at every timestep, calculated as:

$$r^i_t = \frac{1}{2} \left\lVert f(s_t, a_t; \theta_D) - s_{t+1} \right\rVert_2^2$$
The reward supplied to the agent at each timestep is the sum of the extrinsic reward $r^e_t$ from the environment and the intrinsic reward $r^i_t$:

$$r_t = r^e_t + r^i_t$$
The network parameters $\theta_D$ of the dynamics model are optimized by minimizing the loss function:

$$L(\theta_D) = \frac{1}{2} \left\lVert f(s_t, a_t; \theta_D) - s_{t+1} \right\rVert_2^2$$
Thus, the optimization problem that we need to solve for the agent is:

$$\max_{\theta_P} \; \mathbb{E}_{\pi(\cdot \mid s_t; \theta_P)} \left[ \sum_t \gamma^t r_t \right] \quad \text{while} \quad \min_{\theta_D} \; L(\theta_D)$$
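To make the reward computation concrete, here is a minimal NumPy sketch. The `model` argument stands in for any learned dynamics model, and the scale `eta` is a hypothetical weighting term, not a value from this project:

```python
import numpy as np

def intrinsic_reward(model, s, a, s_next):
    """Squared prediction error of the dynamics model: ||f(s, a) - s'||^2."""
    pred = model(np.concatenate([s, a]))
    return float(np.sum((pred - s_next) ** 2))

def total_reward(r_ext, r_int, eta=1.0):
    """Reward fed to the policy: extrinsic plus scaled intrinsic reward."""
    return r_ext + eta * r_int
```

Early in training the dynamics model predicts poorly, so the intrinsic term dominates and drives exploration; as predictions improve, the agent falls back on the extrinsic signal.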
Although it is possible to use any on-policy reinforcement learning algorithm with this formulation, I chose Proximal Policy Optimization (PPO) [15] to train the agent's policy due to its ease of implementation and improved stability over other policy gradient algorithms.
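For reference, the stability of PPO comes from its clipped surrogate objective, which limits how far a single update can move the policy. A minimal NumPy version, where `ratio` is the per-sample probability ratio between the new and old policies, might look like:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio:     pi_new(a|s) / pi_old(a|s) for each sample in the batch
    advantage: estimated advantage for each sample
    eps:       clipping range (0.2 is the commonly used default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum makes the objective pessimistic,
    # so large policy changes are never rewarded.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

The policy is updated by ascending the gradient of this objective; the clip keeps updates conservative even when the advantage estimates are noisy.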
This section describes the experimental design and environments that I used to evaluate whether adding intrinsic rewards as an exploration bonus aids in solving sparse robotics tasks.
For this project, I used the standard Fetch robotics environments [16] provided in the OpenAI gym [17]. The Fetch robotics environments are a set of benchmark tasks for continuous control of robotic manipulation; the tasks include reaching, pushing, pick-and-place, and sliding. The Fetch robot is a 7-degree-of-freedom robotic arm with a parallel pinch gripper as its end effector. The arm is controlled using position control of the end effector. The action space consists of four dimensions: three control the change in position of the end effector along each of the coordinate axes, while the fourth controls the opening and closing of the gripper. The state space of each environment includes the Cartesian position and linear velocity of the gripper, along with the joint positions and velocities of the gripper fingers. If an object is present in the scene, the state also includes the position, linear velocity, and angular velocity of the object, as well as the position and linear velocity of the object relative to the gripper.
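The gym goal-based environments return observations as dictionaries with `observation`, `achieved_goal`, and `desired_goal` keys, which must be flattened into a single vector before being fed to a feedforward network. A minimal sketch follows; concatenating the state with the desired goal is one common choice, not necessarily the exact preprocessing used in this project:

```python
import numpy as np

def flatten_obs(obs_dict):
    """Concatenate a goal-conditioned gym observation dict into a flat vector.

    The key names follow the gym GoalEnv convention used by the Fetch tasks.
    The achieved goal is omitted here because it is derivable from the state.
    """
    return np.concatenate([
        obs_dict["observation"],
        obs_dict["desired_goal"],
    ])
```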
Reaching
The Reaching task is the simplest task in the Fetch robotics suite. The agent can control the x, y, and z position of the robot's gripper while the gripper fingers are blocked. The task is considered solved when the robot's gripper is within some threshold distance of a given target. The state space of this environment is 13-dimensional.
Pushing
The Pushing task is more complex than the Reaching task, as it includes an additional object in the scene and hence more degrees of freedom. Like the Reaching task, only the 3D position of the gripper is controlled, and the fingers are blocked. The Pushing task is considered solved once the agent has pushed a block to within some threshold distance of a given target location. Note that the target is within the workspace of the robot. The state space of this environment is 28-dimensional.
Pick and Place
The Pick and Place task requires the agent to grab a block in its workspace and lift it to a specified location. The state space of this environment is 28-dimensional.
Sliding
Similar to the Pushing task, the Sliding task is solved once the robot slides a puck to a given target location. Unlike the Pushing task, the target location is not within the workspace of the robot. The state space of this environment is 28-dimensional.
I made a number of modifications to the environments to allow more control over the extrinsic reward, the observation type, and the termination of an episode.
The environments are classified based on the types of extrinsic rewards provided by the environment:
- Dense - at every time step the agent receives a reward proportional to the negative distance to the desired goal
- Sparse - the agent receives a reward of -1 at every time step and a reward of 0 if it solves the task
- Very Sparse - the agent receives a reward of 0 at every time step and a reward of 1 if it solves the task
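The three reward variants above can be summarized in a single helper. The function below is an illustrative sketch, with `threshold` denoting the success radius around the goal:

```python
def extrinsic_reward(distance, threshold, mode):
    """Extrinsic reward variants for a goal-reaching task.

    - 'dense':       negative distance to the goal at every step
    - 'sparse':      -1 per step, 0 once the task is solved
    - 'very_sparse': 0 per step, 1 once the task is solved
    """
    solved = distance < threshold
    if mode == "dense":
        return -distance
    if mode == "sparse":
        return 0.0 if solved else -1.0
    if mode == "very_sparse":
        return 1.0 if solved else 0.0
    raise ValueError(f"unknown reward mode: {mode}")
```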
I made versions of each task that:
- Never reset the environment before the maximum number of time steps
- Only reset the environment early after successful attempts
- Only reset the environment early when the task becomes impossible
- Reset the environment early after both successes and failures
The following observations can be used:
- State space (i.e. kinematics) of the robot (default)
- RGB images from three different views
- Depth maps corresponding to the RGB images above
- Touch sensor data to measure contacts between the gripper and the environment
Network Architecture and Training Details
All agents in this project are trained using kinematic inputs, i.e. the positions and velocities of the robot's end effector as well as any objects in the scene. The agent's policy and value function networks are both parameterized as feedforward neural networks. The policy network consists of two hidden layers: the first layer has a ReLU activation, while the second layer has a Tanh activation. The output of the second layer is fed to another module that models the robot's actions as a Gaussian distribution (i.e. it predicts the mean and standard deviation). The value function network also contains two hidden layers, both of which have ReLU activations. Both the policy and value networks take the robot's state as input.
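A forward pass of the policy described above can be sketched in NumPy. The `params` dictionary and the state-independent log-std head are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def mlp_policy(state, params):
    """Two-hidden-layer policy: ReLU, then Tanh, then a Gaussian head.

    params is a hypothetical dict of weight matrices and biases; the head
    outputs the action mean, with a state-independent log standard deviation.
    """
    h1 = np.maximum(0.0, state @ params["W1"] + params["b1"])  # ReLU layer
    h2 = np.tanh(h1 @ params["W2"] + params["b2"])             # Tanh layer
    mean = h2 @ params["Wout"] + params["bout"]                # action mean
    std = np.exp(params["log_std"])                            # action std
    return mean, std
```

Actions are then sampled from the resulting Gaussian during rollouts, while the mean can be used directly for deterministic evaluation.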
The agent’s dynamics model is a three-layer neural network with ReLU activations on the two hidden layers and a Tanh activation on the last layer. The dynamics model takes as input the state ($s_t$) of the robot and its current action ($a_t$) and outputs a prediction of either (1) the robot's next state ($s_{t+1}$) or (2) the change in the robot's state ($\Delta s_t = s_{t+1} - s_t$).
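The two prediction targets can be made explicit with a small helper; predicting the change in state keeps the regression targets centered near zero, which is often easier to fit. This is an illustrative sketch rather than the exact training code:

```python
import numpy as np

def dynamics_target(s, s_next, predict_delta=True):
    """Regression target for the dynamics model: next state, or change in state."""
    return s_next - s if predict_delta else s_next

def dynamics_loss(pred, s, s_next, predict_delta=True):
    """Mean squared error between the model's prediction and the chosen target."""
    target = dynamics_target(s, s_next, predict_delta)
    return float(np.mean((pred - target) ** 2))
```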
All networks are trained separately using the Adam optimizer with different learning rates.
Solving the Reaching Task
As mentioned above, the Reaching task is the simplest task in the Fetch robotics suite. It is therefore a great environment for verifying the baseline PPO implementation, as well as for investigating whether adding dense intrinsic rewards improves exploration and learning of sparse tasks. As shown in the plots below, all agents readily solve the Reaching task, converging to a 100% success rate by the end of training.
Learning progress on the Reaching Task
Solving the Pushing Task
Unlike the Reaching task, the baseline PPO was unable to solve the Pushing task (red curve in the plots below). However, adding intrinsic rewards as an exploration bonus led to rapid convergence, and by the end of training the agent was able to solve the task reliably. It is interesting to note that some of the learned policies had a propensity to roll the block while pushing it to the target. I think this is because the environment does not provide detailed information about the block, such as its size and moment of inertia. Under these conditions, the dynamics model cannot accurately predict the state of the block, so the agent is incentivized to move the block in interesting ways as it pushes it to the goal.
Learning progress on the Pushing Task
Analysis of Hyperparameters for the Pushing Task
The Pushing Task is the simplest non-trivial task in the OpenAI Robotics Suite. It therefore provided a good environment to investigate how tuning the various hyperparameters affects overall performance. This section details the analyses used to select the hyperparameters used in most of the experiments.
Number of Hidden Units
Tuning the number of neurons in the hidden layers is an important aspect of training neural networks. If the network is too large, it risks over-fitting the data; if it is too small, it may be unable to learn anything at all. In this experiment, I varied the size of the hidden layers from 16 to 1024 neurons. In general, bigger networks performed better than smaller ones. For most of the experiments shown, I settled on networks with 64 neurons, which performed similarly to the larger networks while having fewer parameters and requiring less wall time to train.
Number of Parallel Workers
Since PPO is on-policy, training requires data collected with the current policy. This is difficult because the distributions of states, rewards, and actions change every time the model is updated. One way to get around this issue is to parallelize the environment and use multiple workers to collect a large and varied amount of training data. The number of workers required to solve a given task is likely proportional to the complexity of the task. For example, solving the Reaching task with four or fewer workers is relatively easy, while solving the Pushing task with so few workers is vastly more difficult (data not shown). In this experiment, I tuned the number of parallel workers collecting rollout data using the current policy. With only 8 workers, the model is only able to learn up to a certain point before performance begins to decrease. In contrast, increasing the number of workers to 16 and beyond greatly improved algorithm convergence. Based on these results and the number of cores available on my CPU, I chose to use 32 parallel workers in all of the experiments.
Discount Factor
The discount factor $\gamma$ is a critical hyperparameter that determines how much the agent values future rewards. I performed experiments comparing three values of $\gamma$; of the three, one value clearly outperformed the other two.
Dynamics Model Learning Rate
The prediction error of the dynamics model is used to generate the intrinsic reward that encourages the agent to explore. As such, the intrinsic rewards should be relatively large at the beginning of training, when the agent lacks understanding of the environment, and smaller once the agent is able to predict the consequences of its actions. If the dynamics model learns too quickly, the intrinsic rewards will prematurely shrink and the agent won't explore efficiently. If, on the other hand, the dynamics model learns too slowly, large intrinsic rewards will result in excessive exploration by the agent. To determine which learning rate would be best for the Fetch robotics tasks, I varied the dynamics model learning rate over three orders of magnitude. I observed that, while the agent was able to learn for every setting of the learning rate, smaller learning rates caused the agent to learn fastest.
Solving the Pick and Place Task
The Pick and Place task was readily solved using the same set of hyperparameters that were used to solve the Pushing task. As can be seen in the plots below, agents trained on the Pick and Place task were able to solve the task reliably by the end of training.
Learning progress on the Pick and Place Task
Solving the Sliding Task
The Sliding task proved to be the most challenging. Its difficulty is likely due to the agent having fewer timesteps to interact with the puck, owing to the low friction between the puck and the surface of the table. Unlike Reaching and Pushing, trained agents were only able to solve a fraction of the episodes near the end of training. Furthermore, networks with hidden layers of size 64 were insufficient for solving the Sliding task (data not shown); instead, sliding required networks with at least 128 neurons. The Sliding task was also sensitive to the dynamics model learning rate. It is also worth mentioning that the Sliding task required 40 million environment steps, roughly double the number of environment interactions required to solve the Pushing task.
Learning progress on the Sliding Task
Consistent with previous works, my results demonstrate that curiosity-driven exploration can be used to solve challenging tasks with sparse rewards. In particular, I have shown that using the prediction error as the intrinsic reward can encourage an agent to solve a diverse set of robotics tasks. In addition to the intrinsic reward signal, properly tuning various hyperparameters plays a significant role in ensuring that the agent can learn in the presence of sparse rewards. This was especially true for the Sliding task, as it seemed the most sensitive to the selection of hyperparameters. Surprisingly, while one set of hyperparameters was sufficient for solving Reaching, Pushing, and Pick and Place, Sliding required significant tuning and many more environment interactions.
The work presented here provides an excellent starting point for future research. In this section, I will briefly discuss a few ideas that are motivating my next set of experiments.
Learning from Pixels
Learning control policies from pixels is currently an active area of research. Building off of my current results, I will perform experiments comparing how learning different feature representations for images may affect intrinsic motivation and exploration. In particular, I will investigate feature representations such as random convolutional neural network features (RF), variational autoencoder (VAE) features, and inverse dynamics features (IDF) [2].
Combining Multiple Modalities
While previous papers have largely focused on learning a feature space for a single modality such as images for Atari games or joint kinematics for continuous control, it is not clear which modalities are most important for solving robotic manipulation tasks. Robots are often equipped with one or more sensors that measure various aspects of their state and environment: (1) encoders that measure positions and velocities of its joints; (2) cameras that provide visual perception; and (3) tactile sensors that measure contacts with the environment. I therefore intend to address the following question:
- Does including additional sensor modalities in the feature space result in better exploration policies?
Although there are numerous ways to combine the feature representations for the different modalities, I will initially focus on learning vector representations, and concatenating those features as input to the dynamics model to generate intrinsic rewards. In a separate set of experiments I will also investigate whether the learned multimodal features are able to improve policy learning.
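A minimal sketch of the concatenation step follows; the modality names and the sorted ordering are illustrative assumptions:

```python
import numpy as np

def fuse_modalities(features):
    """Concatenate per-modality feature vectors into a single input vector.

    `features` is a hypothetical dict mapping modality name -> 1-D feature
    vector (e.g. kinematics, vision, touch); sorting the keys keeps the
    layout deterministic across calls. The fused vector would then be fed
    to the dynamics model to generate intrinsic rewards.
    """
    return np.concatenate([features[k] for k in sorted(features)])
```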
1. Curiosity-driven Exploration by Self-supervised Prediction
2. Large-Scale Study of Curiosity-Driven Learning
3. Exploration by Random Network Distillation
4. Playing Atari with Deep Reinforcement Learning
5. Continuous control with deep reinforcement learning
6. Curious model-building control systems
7. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
8. Formal Theory of Creativity, Fun, and Intrinsic Motivation
9. Reinforcement Driven Information Acquisition In Non-Deterministic Environments
10. Variational Intrinsic Control
11. Unifying Count-Based Exploration and Intrinsic Motivation
12. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
13. Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks
14. EMI: Exploration with Mutual Information
15. Proximal Policy Optimization Algorithms
16. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research
17. OpenAI Gym