# Dolly Docking using Reinforcement Learning

Note

This application is experimental and significant changes might be expected from version to version.

The goal of the Dolly Docking application is to teach a robot to navigate under a cart placed in the line of sight of the robot using a deep neural network (DNN). The input to the DNN is a history of the occupancy grid of the environment in front of the robot, along with the target pose, velocity, and acceleration vectors. The output of the neural network is a velocity profile for the next three timesteps.

This application provides a reference for the modular reinforcement learning workflow in Isaac SDK. It showcases how to train policies (DNNs) using multi-agent scenarios and then deploy them using frozen models. This document first describes how to quickly start with inference and training, then presents details regarding the neural network policy, training workflow, codelets, and gym state machine.

Deep learning policies allow users to combine learning-based planning methods with classical control approaches, effectively utilizing the best of both worlds. DNN policies can be made robust to inconsistencies and errors from computer vision models, as well as from odometry and state estimation, by reproducing those errors in simulation and learning from them. The exploration and exploitation strategies that reinforcement learning algorithms typically employ to learn an optimal policy can help in situations where the map of the environment is unavailable or the costmap of the environment is not trivial to produce or compute.

## Quick Start

### Inference

The Dolly Docking application includes two sample scenes for performing inference in Isaac Sim Unity3D: a Factory of the Future scene and a multi-agent scene.

#### Factory of the Future Scene

Navigate to the isaac_sim_unity3d/builds folder and run the following command:

bob@desktop:~/isaac_sim_unity3d/builds$ ./factory_of_the_future.x86_64 --scene Factory01 --scenario 6

To run inference for the scene, open another terminal window, navigate to the Isaac SDK root folder, and run the following command:

bob@desktop:~/isaac$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/fof_pb_inference.json'


In the simulator, the robot and cart are spawned on the factory floor. The robot repeatedly navigates to the center of the cart, stops there for some time, and is then respawned to repeat the action. A minimum of ~10 FPS is needed in simulation for optimal inference performance. If the robot hits the dolly wheels, it is respawned at its starting position with the cart at a slightly different orientation.

#### Multi-Agent Scene

Navigate to the Isaac Sim Unity3D root folder and run the following command from the builds folder:

bob@desktop:~/isaac_sim_unity3d/builds$ ./sample.x86_64 --scene dolly_docking_training

To run inference for the scene, open another terminal window, navigate to the Isaac SDK root folder, and run the following command:

bob@desktop:~/isaac$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/multi_agent_pb_inference.json'


In the simulator, nine robots and carts are spawned in the scene, along with walls randomly placed on the edges of the dolly. Most of the robots in the scene repeatedly navigate to the center of the cart and then stop at the center for some time before being respawned and repeating the action. If the robot hits the dolly wheels or the walls, it is respawned at a new starting position with the cart at a different orientation.

### Training

To perform training, navigate to the Isaac Sim Unity3D root folder and run the following command from the builds folder:

bob@desktop:~/isaac_sim_unity3d/builds$ ./sample.x86_64 --scene dolly_docking_training

To get started with training, open another terminal window, navigate to the Isaac SDK root folder, and enter the command below:

bob@desktop:~/isaac$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation


This will launch a TensorFlow instance and train it with the data received over TCP. The training configuration can be altered through the config file located at /packages/rl/apps/dolly_navigation/py_components/trainer.config.json.

Once training starts, TensorFlow will periodically output logs to the /tmp/rl_logs folder and checkpoints to the /tmp/rl_checkpoints folder by default. We recommend changing these to persistent storage paths in the trainer.config.json file.
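One way to apply this recommendation is to rewrite the two directory values before launching training. The sketch below is a hypothetical helper, not part of Isaac SDK. It assumes the parameters are named checkpoint_directory and tensorboard_log_directory, as in the JSON Pipeline Parameters section. Because the nesting of trainer.config.json is not described here, it replaces those keys wherever they appear in the parsed JSON.

```python
import json


def set_key_everywhere(node, key, value):
    """Replace `key` with `value` at any depth of a parsed JSON tree.

    Returns the number of replacements made.
    """
    count = 0
    if isinstance(node, dict):
        for name in node:
            if name == key:
                node[name] = value
                count += 1
            else:
                count += set_key_everywhere(node[name], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_key_everywhere(item, key, value)
    return count


def use_persistent_paths(config_text, checkpoint_dir, log_dir):
    """Return config JSON text with both output directories redirected."""
    config = json.loads(config_text)
    set_key_everywhere(config, "checkpoint_directory", checkpoint_dir)
    set_key_everywhere(config, "tensorboard_log_directory", log_dir)
    return json.dumps(config, indent=2)


# Hypothetical config fragment; the layout of the real file may differ.
example = json.dumps({
    "trainer": {
        "checkpoint_directory": "/tmp/rl_checkpoints",
        "tensorboard_log_directory": "/tmp/rl_logs",
    }
})
updated = json.loads(
    use_persistent_paths(example, "/data/rl/checkpoints", "/data/rl/logs")
)
```

The target directories /data/rl/checkpoints and /data/rl/logs are placeholders; any path on persistent storage would serve.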

Each checkpoint instance is composed of three files:

• .meta file: Denotes the graph structure of the model.

• .data file: Stores the values of all saved variables.

• .index file: Stores the list of variable names and shapes.

To view the training progress on TensorBoard, run the following command and open http://localhost:6006 in a browser.

tensorboard --logdir=/tmp/rl_logs


The rl_logs directory also stores the rewards for each epoch in a .txt file. The rewards tend to decrease for the first few iterations before starting to increase.

To stop training, terminate the application using Ctrl+C.

Note

Model training requires a non-trivial amount of resources. We recommend training neural networks on the NVIDIA DGX or a multi-GPU virtual machine instance with at least 100 GB of RAM. Even with a powerful machine, processing the data and training the model takes considerable time. The saved checkpoints can occupy significant disk space if not deleted periodically.

#### Running Inference from a Trained Checkpoint

Once training is terminated, the generated checkpoint files (*.meta, *.index, *.data) can be used for inference.

To run inference using a generated checkpoint in the multi-agent scene, edit the JSON file at packages/rl/apps/dolly_navigation/py_components/multi_agent_py_inference.json. With the simulation running, modify the JSON variable restore_path to include the path to the stored checkpoint.

Tip

If a checkpoint is stored as agent42.ckpt.meta, agent42.ckpt.index, and agent42.ckpt.data in the /tmp/rl_checkpoints folder, the restore_path parameter needs to be set to /tmp/rl_checkpoints/agent42.ckpt

Then run the following command to run inference:

bob@desktop:~/isaac$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/multi_agent_py_inference.json'

To run inference in the Factory of the Future simulation, edit the JSON file at packages/rl/apps/dolly_navigation/py_components/fof_py_inference.json. Modify the JSON variable restore_path to include the path to the stored checkpoint. Then, after starting the factory simulation as described above, run the Isaac SDK app for inference with the following command:

bob@desktop:~/isaac$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/fof_py_inference.json'
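The restore_path convention from the tip above can be captured in a small helper that strips the per-file suffix from any one of the three checkpoint files. This is a sketch, not part of Isaac SDK, and the function name is hypothetical. It also accepts the sharded form that TensorFlow commonly uses for data files (for example .data-00000-of-00001).

```python
import re

# Matches the suffix of any of the three files in a checkpoint instance.
_CHECKPOINT_SUFFIX = re.compile(r"\.(meta|index|data(-\d+-of-\d+)?)$")


def restore_path_from_file(checkpoint_file):
    """Return the restore_path prefix shared by a checkpoint's three files."""
    prefix, replaced = _CHECKPOINT_SUFFIX.subn("", checkpoint_file)
    if replaced == 0:
        raise ValueError(f"not a checkpoint file: {checkpoint_file}")
    return prefix


restore_path = restore_path_from_file("/tmp/rl_checkpoints/agent42.ckpt.meta")
```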


## Codelets

The reinforcement learning workflow in Isaac SDK is inspired by the OpenAI Gym interface.

The following codelets collect information from the simulation for running the reinforcement learning loop:

Once state information has been accumulated, it is passed through codelets that are equivalent to OpenAI Gym before being sent to the Sample Accumulator for storage or to the Python code for inference:

Once Gym publishes the prediction, it flows through the following codelets:

The codelets are connected as shown below:

Note

The auxiliary tensor allows users to store and pass information and flags from the simulation or other SDK components to all the subsequent codelets in the pipeline. The isaac::rl::DollyDockingAuxDecoder is used for this purpose in the docking pipeline.

The reinforcement learning application utilizes the following messages:

## Simulation

The Isaac SDK and simulator communicate using a pub/sub architecture: Data is passed back and forth between the two processes by setting up TCP publishers on the side where the data is created and TCP subscribers on the side where the data is ingested.

For Unity 3D simulation, the application that publishes the ground truth data is packages/navsim/apps/navsim.app.json. This is directly loaded by the dolly_docking_training scene in NavSim.

The application publishes the sensor and environment data to a user-defined port using a TcpPublisher. The training application uses this data and in turn sends teleportation and control commands to the NavSim application, which receives them through a TcpSubscriber node.

In the training scene, each robot publishes its state and environment messages with an index number appended to the end of the channel name starting from an index of 1. For example, the teleport messages for the first robot are received on a TCP channel named teleport1 and so on. All the channel names and connections between SDK and simulation are available in the Python application for dolly docking.
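The indexing scheme can be illustrated with a short sketch that builds the per-agent channel names. Only the teleport base name comes from the text above. The helper name is hypothetical, and any further base names passed to it would need to be checked against the Python application for dolly docking.

```python
def agent_channels(base_names, num_agents):
    """Map each agent index (starting at 1) to its indexed channel names."""
    return {
        index: [f"{base}{index}" for base in base_names]
        for index in range(1, num_agents + 1)
    }


# Nine agents, as in the multi-agent scene.
channels = agent_channels(["teleport"], 9)
```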

The ability to generate unlimited data points through simulation is a powerful asset for bridging the “reality gap” that separates simulated robotics from real experiments. Domain randomization attempts to bridge the reality gap through improved availability of varied data. In the training scene, domain randomization can be achieved in several ways:

• Occupancy grid randomization: Spawn blocks around the dolly and robot for getting varied lidar maps.

• Dolly randomization: Rotate the dolly and translate it from its center pose before each run.

• Target pose noise randomization: Add noise to the target pose of the dolly to simulate errors in pose detection.

• Odometry noise randomization: Add random noise to the odometry information computed from simulation to simulate the odometry noise typically seen in hardware.
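The pose-based randomizations in this list share one pattern: perturb a nominal (x, y, angle) pose by a bounded random offset. The sketch below assumes uniform sampling within a per-coordinate maximum displacement. The distribution actually used by the application is not stated here, and the function name is hypothetical.

```python
import random


def randomize_pose(center, max_displacement, rng=random):
    """Sample a pose uniformly within +/- max_displacement of `center`.

    Both arguments are (x, y, angle) tuples; `rng` is anything that
    provides a `uniform(a, b)` method.
    """
    return tuple(
        value + rng.uniform(-limit, limit)
        for value, limit in zip(center, max_displacement)
    )


# A seeded generator keeps the example reproducible.
rng = random.Random(0)
pose = randomize_pose((1.0, 0.0, 0.0), (0.1, 0.2, 0.3), rng)
```

Passing a seeded random.Random instance keeps runs repeatable when comparing training configurations.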

## Gym State Machine Flow in Isaac SDK

Central to the reinforcement learning workflow of Isaac SDK is a state machine called Gym, which controls the entire lifecycle of all the robots in the multi-agent scene and provides training data for the policy. The state machine performs six distinct steps as shown in the diagram below and provides three components (Birth, Death, and Reward) whose children can be plugged into the state machine dynamically at runtime. The base classes for these components are located in packages/rl/base_components. Gym expects one copy of these components for every agent and calls their functions at the appropriate stage.

The steps of the state machine are as follows:

1. When the state machine begins, it needs to spawn the agents and their environments (dolly, walls, and obstacles) at their initial pose in simulation. To do so, it calls the Birth::spawn function once for each agent in the simulation. The DollyDockingBirth component (child of isaac::rl::Birth) is responsible for sending to simulation teleport messages that place the agents and their environments at the desired pose. It is also responsible for domain randomization in the scene by appending noise to the object poses before publishing the teleport message.

2. Next, the state machine waits to receive the aggregated state tensor from the TensorAggregator codelet. This tensor consists of the latest state of all the agents in simulation, along with any auxiliary information that might be needed.

3. Once the latest agent states are received, the state machine evaluates if any agent should be killed and respawned. It calls the Death::is_dead function once for every agent in simulation. The DollyDockingDeath component (child of isaac::rl::Death) contains logic that decides what constitutes an invalid agent state, for example when an agent registers a CollisionProto from a collision in simulation or has been alive for too long. Agents for whom the function returns true are reset to new poses before the next step.

4. After deciding if an agent is dead or alive, the state machine collects the rewards for each agent. It calls the Reward::evaluate function once for every agent in simulation. The DollyDockingReward component (child of isaac::rl::Reward) contains logic for assigning a reward (float) to an agent based on the most recently concluded transition.

5. Once all the data related to the last transition is collected, the current state, reward, and dead flag along with the auxiliary tensor are published in aggregate form.

6. The neural network receives the current state of the agents and performs a forward pass to output the action tensor, consisting of target velocities that the agents should achieve in the near future. This action tensor is passed through Gym to the TensorDeaggregator.

The cycle continues again from Step 1, resetting only those agents that have just died in the last step.
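The six steps can be summarized as a plain Python sketch. This is not the Isaac SDK implementation, in which these components are codelets. The class and method names mirror Birth::spawn, Death::is_dead, and Reward::evaluate, but their argument lists, the gym_cycle function, and the toy subclasses are all hypothetical.

```python
class Birth:
    def spawn(self, agent_id):
        raise NotImplementedError


class Death:
    def is_dead(self, agent_id, state, aux):
        raise NotImplementedError


class Reward:
    def evaluate(self, agent_id, state, aux):
        raise NotImplementedError


def gym_cycle(agents, births, deaths, rewards, needs_spawn, receive_states, policy):
    """Run one pass through the six steps for all agents."""
    # Step 1: spawn only the agents that are new or have just died.
    for agent in agents:
        if needs_spawn[agent]:
            births[agent].spawn(agent)
    # Step 2: wait for the aggregated state and auxiliary information.
    states, aux = receive_states()
    # Step 3: decide which agents must be killed and respawned.
    dead = {a: deaths[a].is_dead(a, states[a], aux[a]) for a in agents}
    # Step 4: collect the reward for the transition that just ended.
    reward = {a: rewards[a].evaluate(a, states[a], aux[a]) for a in agents}
    # Step 5: publish state, reward, dead flag, and aux in aggregate form.
    transition = {"state": states, "reward": reward, "dead": dead, "aux": aux}
    # Step 6: the policy maps the current states to an action per agent.
    actions = policy(states)
    return transition, actions, dead


# Toy components (hypothetical) to exercise the loop with two agents.
class ToyBirth(Birth):
    def __init__(self):
        self.spawn_count = 0

    def spawn(self, agent_id):
        self.spawn_count += 1


class ToyDeath(Death):
    def is_dead(self, agent_id, state, aux):
        return aux["collided"]


class ToyReward(Reward):
    def evaluate(self, agent_id, state, aux):
        return -abs(state)


agents = [1, 2]
births = {a: ToyBirth() for a in agents}
deaths = {a: ToyDeath() for a in agents}
rewards = {a: ToyReward() for a in agents}


def receive_states():
    return {1: 0.5, 2: -2.0}, {1: {"collided": False}, 2: {"collided": True}}


transition, actions, dead = gym_cycle(
    agents, births, deaths, rewards,
    needs_spawn={a: True for a in agents},
    receive_states=receive_states,
    policy=lambda states: {a: 0.0 for a in states},
)
```

Feeding the returned dead flags back in as needs_spawn on the next call reproduces the behavior of resetting only the agents that have just died.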

## Reinforcement Learning Policy

OpenAI Spinning Up is a great introduction to reinforcement learning. Spinning Up provides a clear and concise implementation of popular reinforcement learning algorithms. Due to its sample efficiency and robustness to various environments and sim2real applications, the Soft Actor Critic algorithm is used for training the docking policy.
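Soft Actor Critic, as presented in Spinning Up, keeps slowly moving target copies of its Q-networks and blends them toward the main networks after each update. The sketch below shows that blending step under the assumption that the polyak parameter of this application follows the Spinning Up convention; the Isaac SDK trainer may differ. Plain lists of floats stand in for network weights, and the function name is hypothetical.

```python
def polyak_update(target_weights, main_weights, polyak):
    """Blend main-network weights into the target network.

    With polyak close to 1, the target network changes slowly.
    """
    return [
        polyak * target + (1.0 - polyak) * main
        for target, main in zip(target_weights, main_weights)
    ]


new_target = polyak_update([1.0, 0.0], [0.0, 1.0], polyak=0.99)
```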

The deep neural network policy is designed as follows:

The network takes in occupancy maps of the environment for the past three timesteps and passes them through a convolutional backbone network. The flattened output of the backbone is concatenated with the target pose of the dolly (in the robot frame), along with velocity and acceleration vectors for the last three timesteps. This combined flat tensor is then passed through two fully connected layers, each of size 256. The network then outputs a one-dimensional tensor of size 6, which is interpreted as the target velocities for the next three timesteps.
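The size-6 output can be read as three (linear, angular) velocity pairs, one per future timestep. The sketch below makes several assumptions: the pairs are interleaved in the tensor, the network output lies in the range -0.5 to +0.5, and the mapping to robot velocities is linear. The function name and the velocity ranges in the example are hypothetical; in the application, this conversion is done by the TensorToCompositeVelocityProfile codelet.

```python
def decode_action(action, action_scale, linear_range, angular_range):
    """Split a flat 6-element action into three (linear, angular) pairs.

    Each element is mapped linearly from `action_scale` (low, high)
    to the corresponding output velocity range (low, high).
    """
    if len(action) != 6:
        raise ValueError("expected 6 values: 2 velocities x 3 timesteps")
    low, high = action_scale

    def rescale(value, out_range):
        out_low, out_high = out_range
        fraction = (value - low) / (high - low)
        return out_low + fraction * (out_high - out_low)

    # Assumed layout: [lin0, ang0, lin1, ang1, lin2, ang2].
    return [
        (rescale(action[i], linear_range), rescale(action[i + 1], angular_range))
        for i in range(0, 6, 2)
    ]


profile = decode_action(
    [-0.5, 0.0, 0.0, 0.5, 0.5, -0.5],
    action_scale=(-0.5, 0.5),
    linear_range=(0.0, 1.0),     # hypothetical, in m/s
    angular_range=(-1.0, 1.0),   # hypothetical, in rad/s
)
```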

## JSON Pipeline Parameters

The following list describes the JSON file parameters in /packages/rl/apps/dolly_navigation/py_components:

• action_dimension: The size of the neural network output tensor (in this case, linear and angular velocity over 3 future timesteps)

• action_scale: The output range of the neural network (-0.5 to +0.5 in this case). This output is scaled to the true robot output by the TensorToCompositeVelocityProfile codelet, whose configurable parameters are described below.

• agent_spawn_randomization: The randomization of the robot pose from its center in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of those coordinates

• agents_per_row: The number of agents in each row of the scene

• angle_allowance: The maximum angle in radians that the agent can rotate in either direction before being killed

• aux_dimension: The size of the auxiliary tensor

• aux_end_of_episode_flag: The position of an aux flag that signifies if an episode or trial has ended because the agent has reached its maximum age

• batch_size: The training batch size for the neural network

• bias: The coefficients a, b, and c applied to the x-coordinate, the y-coordinate, and the angle (radians) in the reward equation: $$a x + b\,|y| - \frac{|y|}{\left|c \tan(|\text{angle}|)\right|}$$

• buffer_threshold: The minimum number of samples collected in the sample accumulator before training begins

• cell_size: The size of a cell in the dynamic observation map in meters

• checkpoint_directory: The directory for storing TensorFlow checkpoints

• collision_penalty: The penalty for colliding with scene objects

• delay_sending: The time to wait before starting the Gym State Machine. This ensures all nodes have had sufficient time to start.

• dividing_space: The distance between two robot-dolly setups in simulation along x and y coordinates

• experience_buffer_size: The size of the Experience Buffer / Sample Accumulator

• gamma: Reward discount factor for reinforcement learning, usually fixed at 0.99

• ideal_docking_pose: Perfect docking coordinate of the robot with respect to the center of the dolly in (x,y)

• learning_rate: The target learning rate for the reinforcement learning policy

• look_back: The number of past steps to store in the state tensor as an input to the neural network

• max_episode_length: The maximum number of steps that constitute a single trial

• max_steps_per_epoch: The number of training iterations per epoch

• mode: The mode in which to run the Isaac application. Set its value to “train” for training a new policy, “inference_pb” for frozen model inference, and “inference_py” for Python checkpoint inference

• num_agents_per_sim: The number of agents present inside a single simulator instance. Note that the number of agents inside a particular scene is fixed, so altering this config does not increase or decrease the number of agents in the scene.

• observation_map_dimension: The side length of the square occupancy map generated from the flatscan and fed to the neural network

• obstacle_scale_randomization: The randomization scale of the blocks from their original sizes. This parameter lists the maximum displacement allowed in the elements of the scale vector

• obstacle_separation: The spawn position of obstacles (walls, blocks, etc) from the center of the dolly along x and y coordinates before randomization is applied

• obstacle_spawn_randomization: The pose randomization of the walls and blocks from their centers in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of these coordinates

• output_angular_velocity_range: The range to which the received angular velocity values are rescaled

• output_linear_velocity_range: The range to which the received linear velocity values are rescaled

• polyak: The averaging coefficient (usually between 0.98 and 0.99). This parameter determines how much of the duplicate network weight gets copied to the original network.

• reaction_time: The delay in seconds, after publication of the action message, at which the state should be recorded by Gym

• restore_path: The path from which to restore a TensorFlow checkpoint or frozen model

• reward_clip_range: The range to which reward values are clipped if exceeded. The values are clipped at both the lower and upper ends of the range.

• sim_instances: The number of Unity instances running in parallel

• start_coordinate: The coordinate from which to spawn agents in simulation

• success_reward: The reward received at every timestep on achieving the goal

• target_pose_noise: The noise to add to the ground truth pose of the dolly to simulate pose detection errors

• target_separation: The distance between the target (dolly) and the robot along x and y coordinates

• target_spawn_randomization: The pose randomization of the target dolly from its center in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of these coordinates

• tensorboard_log_directory: The directory to store all logs

• tick_period: The tick period for codelets in the pipeline

• timestamp_profile: Target timesteps to append to each predicted output of the neural network

• tolerance: Deltas along the x and y directions from the center of the dolly that are considered to be successful docking end poses

• use_pretrained_model: A flag that, if true, indicates a checkpoint model needs to be restored

• wall_thickness: The thickness of the area behind which a hit is marked as solid when integrating a flatscan

• x_allowance: The maximum allowable distance that the agent can move along the respective negative and positive x directions before being killed

• y_allowance: The maximum allowable distance that the agent can move along the respective negative and positive y directions before being killed
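The reward equation given for the bias parameter, combined with reward_clip_range, can be sketched as follows. This sketch assumes that bias is an (a, b, c) triple and that reward_clip_range is a (low, high) pair. The small eps guard against division by zero when the angle is 0 is an addition for this example, not something documented for the application. How collision_penalty and success_reward enter the final reward is not specified here, so they are left out.

```python
import math


def shaped_reward(x, y, angle, bias, clip_range, eps=1e-6):
    """Evaluate a*x + b*|y| - |y| / |c*tan(|angle|)|, then clip it."""
    a, b, c = bias
    # Guard (assumption): keep the denominator away from zero.
    denominator = max(abs(c * math.tan(abs(angle))), eps)
    raw = a * x + b * abs(y) - abs(y) / denominator
    low, high = clip_range
    return min(max(raw, low), high)


# Hypothetical coefficients and clip range, for illustration only.
reward = shaped_reward(1.0, 0.5, math.pi / 4, (1.0, -1.0, 1.0), (-10.0, 10.0))
clipped = shaped_reward(0.0, 1.0, 0.0, (1.0, -1.0, 1.0), (-10.0, 10.0))
```

In the second call the angle is 0, so the guarded third term dominates and the result is clipped to the lower bound.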