This application is experimental and significant changes might be expected from version to version.
The goal of the Dolly Docking application is to teach a robot to navigate under a cart placed in the line of sight of the robot using a deep neural network (DNN). The input to the DNN is a history of the occupancy grid of the environment in front of the robot, along with the target pose, velocity, and acceleration vectors. The output of the neural network is a velocity profile for the next three timesteps.
This application provides a reference for the modular reinforcement learning workflow in Isaac SDK. It showcases how to train policies (DNNs) using multi-agent scenarios and then deploy them using frozen models. This document first describes how to quickly start with inference and training, then presents details regarding the neural network policy, training workflow, codelets, and gym state machine.
Deep learning policies allow users to combine learning-based planning methods with classical control approaches, effectively utilizing the best of both worlds. DNN policies can be made robust to inconsistencies and errors from computer vision models, as well as from odometry and state estimation, by modeling those errors in simulation and learning from them. The exploration and exploitation strategies that reinforcement learning algorithms typically employ to learn an optimal policy can help in situations where a map of the environment is unavailable or where a costmap of the environment is not trivial to produce or compute.
Inference
The Dolly Docking application includes two sample scenes for performing inference in Isaac Sim Unity3D: a Factory of the Future scene and a multi-agent scene.
Factory of the Future Scene
Navigate to the isaac_sim_unity3d/builds folder and run the following command:
bob@desktop:~/isaac_sim_unity3d/builds$ ./factory_of_the_future.x86_64 --scene Factory01 --scenario 6
To run inference for the scene, open another terminal window, navigate to the Isaac SDK root folder, and run the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/fof_pb_inference.json'
In the simulator, the robot and cart are spawned on the factory floor. The robot repeatedly navigates to the center of the cart, stops there for some time, and is then respawned to repeat the action. A minimum of ~10 FPS in simulation is needed for optimal inference performance. If the robot hits the dolly wheels, it is respawned at its starting position with the cart at a slightly different orientation.
Multi-Agent Scene
Navigate to the Isaac Sim Unity3D root folder and run the following command from the builds folder:
bob@desktop:~/isaac_sim_unity3d/builds$ ./sample.x86_64 --scene dolly_docking_training
To run inference for the scene, open another terminal window, navigate to the Isaac SDK root folder, and run the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/multi_agent_pb_inference.json'
In the simulator, nine robots and carts are spawned in the scene, along with walls randomly placed on the edges of the dolly. Most of the robots in the scene repeatedly navigate to the center of the cart and then stop at the center for some time before being respawned and repeating the action. If the robot hits the dolly wheels or the walls, it is respawned at a new starting position with the cart at a different orientation.
Training
To perform training, navigate to the Isaac Sim Unity3D root folder and run the following command from the builds folder:
bob@desktop:~/isaac_sim_unity3d/builds$ ./sample.x86_64 --scene dolly_docking_training
To get started with training, open another terminal window, navigate to the Isaac SDK root folder, and enter the command below:
bob@desktop:~/isaac/sdk$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation
This will launch a TensorFlow instance and train it with the data received over TCP. The training configuration can be altered through the config file located at /packages/rl/apps/dolly_navigation/py_components/trainer.config.json.
Once training starts, TensorFlow periodically outputs logs to the /tmp/rl_logs folder and checkpoints to the /tmp/rl_checkpoints folder by default. We recommend changing these to persistent storage paths in the trainer.config.json file.
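For example, the following sketch rewrites the two paths to persistent storage. It assumes that checkpoint_directory and tensorboard_log_directory are top-level keys in trainer.config.json; verify the key layout in your copy of the file before using it:

import json

config_path = "packages/rl/apps/dolly_navigation/py_components/trainer.config.json"

with open(config_path) as f:
    config = json.load(f)

# Hypothetical persistent locations; replace with paths on your own storage.
config["checkpoint_directory"] = "/data/rl_checkpoints"
config["tensorboard_log_directory"] = "/data/rl_logs"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)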
Each checkpoint instance is composed of three files:
- .meta file: Denotes the graph structure of the model.
- .data file: Stores the values of all saved variables.
- .index file: Stores the list of variable names and shapes.
To view the training progress on TensorBoard, run the following command and open http://localhost:6006 in a browser:
tensorboard --logdir=/tmp/rl_logs
The rl_logs directory also stores the rewards for each epoch in a .txt file.
The rewards tend to decrease for the first few iterations before starting to increase.
To stop training, terminate the application using Ctrl+C.
Model training requires a non-trivial amount of resources. We recommend training neural networks on an NVIDIA DGX or a multi-GPU virtual machine instance with at least 100 GB of RAM. Even with a powerful machine, it takes a non-trivial amount of time to process the data and train the model. The saved checkpoints can occupy significant disk space if not deleted periodically.
Running Inference from a Trained Checkpoint
Once training is terminated, the generated checkpoint files (*.meta, *.index, *.data) can be used for inference.
To run inference using a generated checkpoint in the multi-agent scene, edit the JSON file at packages/rl/apps/dolly_navigation/py_components/multi_agent_py_inference.json. With the simulation running, modify the JSON variable restore_path to include the path to the stored checkpoint. If a checkpoint is stored as agent42.ckpt.meta, agent42.ckpt.index, and agent42.ckpt.data in the /tmp/rl_checkpoints folder, the restore_path parameter needs to be set to /tmp/rl_checkpoints/agent42.ckpt.
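As a minimal sketch, the checkpoint can be wired in as shown below. This assumes restore_path, use_pretrained_model, and mode are top-level keys in multi_agent_py_inference.json; check the actual key layout in the file:

import json

config_path = "packages/rl/apps/dolly_navigation/py_components/multi_agent_py_inference.json"

with open(config_path) as f:
    config = json.load(f)

# Point restore_path at the checkpoint prefix, which covers the
# .meta, .index, and .data files of the saved checkpoint.
config["restore_path"] = "/tmp/rl_checkpoints/agent42.ckpt"
config["use_pretrained_model"] = True
config["mode"] = "inference_py"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)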
Then run the following command to start inference:
bob@desktop:~/isaac/sdk$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/multi_agent_py_inference.json'
To run inference in the Factory of the Future simulation, edit the JSON file at packages/rl/apps/dolly_navigation/py_components/fof_py_inference.json and modify the JSON variable restore_path to include the path to the stored checkpoint. Then, after starting the factory simulation as described above, run the Isaac SDK app for inference with the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/rl/apps/dolly_navigation/py_components:dolly_navigation -- --config='packages/rl/apps/dolly_navigation/py_components/fof_py_inference.json'
The reinforcement learning workflow in Isaac SDK is inspired by the OpenAI Gym interface.
The following codelets collect information from the simulation for running the reinforcement learning loop:
- DifferentialBaseOdometry
- RangeScanFlattening
- RangeScanToObservationMap
- DollyDockingStateDecoder
- DollyDockingAuxDecoder
Once state information has been accumulated, it is passed through codelets that together form the OpenAI Gym equivalent before being sent to the Sample Accumulator for storage or to the Python code for inference:
- TensorAggregator
- TensorDeaggregator
- StateMachineGymFlow
- DollyDockingBirth
- DollyDockingDeath
- DollyDockingReward
- DollyDockingStateNoiser
- TemporalBatching
Once Gym publishes the prediction, it flows through the following codelets:
- TensorToCompositeVelocityProfile
- DifferentialBaseVelocityIntegrator
- CompositeToDifferentialTrajectoryConverter
- DifferentialBaseControl
The codelets are connected as shown below:

The auxiliary tensor allows users to store and pass information and flags from the simulation or other SDK components to all subsequent codelets in the pipeline. The isaac::rl::DollyDockingAuxDecoder codelet is used for this purpose in the docking pipeline.
The reinforcement learning application also relies on several Isaac message types for exchanging data between the SDK and the simulator.
The Isaac SDK and simulator communicate using a pub/sub architecture: Data is passed back and forth between the two processes by setting up TCP publishers on the side where the data is created and TCP subscribers on the side where the data is ingested.
For Unity 3D simulation, the application that publishes the ground truth data is packages/navsim/apps/navsim.app.json, which is loaded directly by the dolly_docking_training scene in NavSim. This application publishes the sensor and environment data to a user-defined port through a TcpPublisher. The training application consumes this data and, in turn, sends teleportation and control commands back to the NavSim application, which receives them through a TcpSubscriber node.
In the training scene, each robot publishes its state and environment messages with an index number appended to the end of the channel name, starting from an index of 1. For example, the teleport messages for the first robot are received on a TCP channel named teleport1, and so on. All the channel names and connections between the SDK and simulation are available in the Python application for dolly docking.
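As an illustration of this naming scheme (the variable names below are hypothetical; nine agents matches the sample multi-agent scene), the per-agent teleport channels can be enumerated as:

num_agents = 9  # the sample multi-agent scene spawns nine robots
# Per-agent channel names: the 1-based agent index is appended to the base name.
teleport_channels = [f"teleport{i}" for i in range(1, num_agents + 1)]
# ['teleport1', 'teleport2', ..., 'teleport9']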
Being able to generate unlimited data points through simulation is a powerful asset, bridging the “reality gap” that separates simulated robotics from real experiments. Domain randomization attempts to bridge the reality gap through improved availability of varied data. In the training scene, domain randomization can be achieved in several ways:
- Occupancy grid randomization: Spawn blocks around the dolly and robot for getting varied lidar maps.
- Dolly randomization: Rotate the dolly and translate it from its center pose before each run.
- Target pose noise randomization: Add noise to the target pose of the dolly to simulate errors in pose detection.
- Odometry noise randomization: Add random noise to the odometry information computed from simulation to mimic the noise typically seen on hardware.
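These randomization knobs map to parameters described later in this document (agent_spawn_randomization, target_spawn_randomization, obstacle_spawn_randomization, and target_pose_noise). The sketch below uses placeholder values and an assumed list-of-displacements format; take the actual value format from the shipped JSON files:

# Placeholder values for illustration only; the exact value format of each
# parameter should be taken from the shipped JSON files, not from this sketch.
domain_randomization = {
    # Maximum robot displacement from its spawn center in x, y, and angle.
    "agent_spawn_randomization": [0.1, 0.1, 0.1],
    # Maximum dolly displacement from its center pose in x, y, and angle.
    "target_spawn_randomization": [0.1, 0.1, 0.1],
    # Maximum wall/block displacement from their centers in x, y, and angle.
    "obstacle_spawn_randomization": [0.2, 0.2, 0.2],
    # Noise added to the ground-truth dolly pose to mimic pose-detection error.
    "target_pose_noise": [0.05, 0.05, 0.05],
}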
Central to the reinforcement learning workflow of Isaac SDK is a state machine called Gym, which controls the entire lifecycle of all the robots in the multi-agent scene and provides training data for the policy. The state machine performs six distinct steps, as shown in the diagram below, and provides three components (Birth, Death, and Reward) whose children can be plugged into the state machine dynamically at runtime. The base classes for these components are located in packages/rl/base_components. Gym expects one copy of these components for every agent and calls their functions at the appropriate stage.

The steps of the state machine are as follows:
1. When the state machine begins, it needs to spawn the agents and their environments (dolly, walls, and obstacles) at their initial poses in simulation. To do so, it calls the Birth::spawn function once for each agent in the simulation. The DollyDockingBirth component (child of isaac::rl::Birth) is responsible for sending teleport messages to simulation that place the agents and their environments at the desired poses. It is also responsible for domain randomization in the scene, adding noise to the object poses before publishing the teleport messages.
2. Next, the state machine waits to receive the aggregated state tensor from the TensorAggregator codelet. This tensor consists of the latest state of all the agents in simulation, along with any auxiliary information that might be needed.
3. Once the latest agent states are received, the state machine evaluates whether any agent should be killed and respawned. It calls the Death::is_dead function once for every agent in simulation. The DollyDockingDeath component (child of isaac::rl::Death) contains the logic that decides what constitutes an invalid agent state: for example, an agent that registers a CollisionProto from a collision in simulation, or an agent that has been alive for too long. Agents for which the function returns true are reset to new poses before the next step.
4. After deciding whether an agent is dead or alive, the state machine collects the rewards for each agent. It calls the Reward::evaluate function once for every agent in simulation. The DollyDockingReward component (child of isaac::rl::Reward) contains the logic for assigning a reward (a float) to an agent based on the most recently concluded transition.
5. Once all the data related to the last transition is collected, the current state, reward, and dead flag along with the auxiliary tensor are published in aggregate form.
6. The neural network receives the current state of the agents and performs a forward pass to output the action tensor, consisting of target velocities that the agents should achieve in the near future. This action tensor is passed through Gym to the TensorDeaggregator.
The cycle then continues from Step 1, resetting only those agents that died in the last step.
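The following Python sketch summarizes one cycle of this state machine. Birth.spawn, Death.is_dead, and Reward.evaluate mirror the component functions named above; the transport callables (receive_states, publish_transition, send_actions) and agent attributes are hypothetical stand-ins for the TensorAggregator/TensorDeaggregator connections, not actual Isaac SDK APIs:

def gym_cycle(agents, birth, death, reward, policy,
              receive_states, publish_transition, send_actions):
    # 1. Spawn every agent (and its dolly, walls, and obstacles) that needs a pose reset.
    for agent in agents:
        if agent.needs_spawn:
            birth.spawn(agent)

    # 2. Wait for the aggregated state tensor of all agents (plus auxiliary data).
    states, aux = receive_states()

    # 3. Decide which agents must be killed and respawned before the next cycle.
    dead = [death.is_dead(agent, states[i]) for i, agent in enumerate(agents)]

    # 4. Collect a scalar reward for each agent's most recent transition.
    rewards = [reward.evaluate(agent, states[i]) for i, agent in enumerate(agents)]

    # 5. Publish state, reward, and dead flags along with the auxiliary tensor.
    publish_transition(states, rewards, dead, aux)

    # 6. Forward the policy's action tensor (target velocities) to the deaggregator.
    send_actions(policy(states))

    # Only agents flagged as dead are reset in step 1 of the next cycle.
    for agent, is_dead in zip(agents, dead):
        agent.needs_spawn = is_dead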
OpenAI Spinning Up is a good introduction to reinforcement learning and provides clear, concise implementations of popular reinforcement learning algorithms. The Soft Actor-Critic algorithm is used for training the docking policy due to its sample efficiency and its robustness across varied environments and sim2real applications.
The deep neural network policy is designed as follows:

The network takes in occupancy maps of the environment for the past three timesteps and passes them through a convolutional backbone network. The flattened output of the backbone is appended to the target pose of the dolly (in the robot frame), along with the velocity and acceleration vectors for the last three timesteps. This combined linear tensor is then passed through two fully connected layers, each of size 256. The network outputs a one-dimensional tensor of size 6, which represents the target velocities for the next three timesteps.
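A minimal Keras sketch of this shape is given below, for illustration only: the backbone layout, the occupancy-map resolution, and the size of the pose/velocity input are assumptions; only the three-step history, the two 256-unit dense layers, and the 6-dimensional output follow from the description above.

import tensorflow as tf

MAP_DIM = 64     # assumed side length of the square occupancy map
LOOK_BACK = 3    # past timesteps of occupancy maps, velocities, and accelerations
OUTPUT_DIM = 6   # linear and angular velocity for the next three timesteps

maps = tf.keras.Input(shape=(MAP_DIM, MAP_DIM, LOOK_BACK), name="occupancy_maps")
# Target pose (x, y, angle) plus assumed 2-D velocity and 2-D acceleration per past step.
extras = tf.keras.Input(shape=(3 + 4 * LOOK_BACK,), name="pose_and_velocities")

# Assumed convolutional backbone; the real filter counts are not documented here.
x = tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu")(maps)
x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(x)
x = tf.keras.layers.Flatten()(x)

x = tf.keras.layers.Concatenate()([x, extras])
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
# The documented output range of roughly -0.5 to +0.5 is rescaled downstream
# by the TensorToCompositeVelocityProfile codelet.
out = tf.keras.layers.Dense(OUTPUT_DIM, name="velocity_profile")(x)

policy = tf.keras.Model(inputs=[maps, extras], outputs=out)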
The following list describes the JSON file parameters in /packages/rl/apps/dolly_navigation/py_components:
- action_dimension: The size of the neural network output tensor (in this case, linear and angular velocity over 3 future timesteps)
- action_scale: The output range of the neural network (-0.5 to +0.5 in this case). This output is scaled to the true robot output by the TensorToCompositeVelocityProfile codelet, which has configurable parameters listed below.
- agent_spawn_randomization: The randomization of the robot pose from its center in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of those coordinates.
- agents_per_row: The number of agents in each row of the scene
- angle_allowance: The maximum allowable radians that the agent can rotate in either direction of the axis before being killed
- aux_dimension: The size of the auxiliary tensor
- aux_end_of_episode_flag: The position of an aux flag that signifies whether an episode or trial has ended because the agent has reached its maximum age
- batch_size: The training batch size for the neural network
- bias: The coefficients for the x-coordinate, y-coordinate, and angle (radians) in the reward equation \(ax + b|y| - |y| / |c \tan(|angle|)|\) (see the sketch after this list)
- buffer_threshold: The minimum number of samples collected in the sample accumulator before training begins
- cell_size: The size of a cell in the dynamic observation map, in meters
- checkpoint_directory: The directory for storing TensorFlow checkpoints
- collision_penalty: The penalty for colliding with scene objects
- delay_sending: The time to wait before starting the Gym state machine. This ensures all nodes have had sufficient time to start.
- dividing_space: The distance between two robot-dolly setups in simulation along the x and y coordinates
- experience_buffer_size: The size of the experience buffer (sample accumulator)
- gamma: The reward discount factor for reinforcement learning, usually fixed at 0.99
- ideal_docking_pose: The perfect docking coordinate of the robot with respect to the center of the dolly, in (x, y)
- learning_rate: The target learning rate for the reinforcement learning policy
- look_back: The number of past steps to store in the state tensor as an input to the neural network
- max_episode_length: The maximum number of steps that constitute a single trial
- max_steps_per_epoch: The number of training iterations per epoch
- mode: The mode in which to run the Isaac application. Set its value to "train" for training a new policy, "inference_pb" for frozen-model inference, or "inference_py" for Python checkpoint inference.
- num_agents_per_sim: The number of agents present inside a single simulator instance. Note that the number of agents inside a particular scene is fixed, so altering this config does not increase or decrease the number of agents in the scene.
- observation_map_dimension: The side of the square occupancy map generated from the flatscan and fed to the neural network
- obstacle_scale_randomization: The randomization of the block scales from their original sizes. This parameter lists the maximum displacement allowed in the elements of the scale vector.
- obstacle_separation: The spawn position of obstacles (walls, blocks, etc.) from the center of the dolly along the x and y coordinates before randomization is applied
- obstacle_spawn_randomization: The pose randomization of the walls and blocks from their centers in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of these coordinates.
- output_angular_velocity_range: The range to which the received angular velocity indices are rescaled
- output_linear_velocity_range: The range to which the received linear velocity indices are rescaled
- polyak: The averaging coefficient (usually between 0.98 and 0.99). This parameter determines how much of the duplicate network weights get copied to the original network.
- reaction_time: The delay in seconds, after publication of the action message, before the state is read
- restore_path: The path from which to restore a TensorFlow checkpoint or frozen model
- reward_clip_range: The range to which reward values are trimmed if exceeded. Values are trimmed on both sides of the number scale.
- sim_instances: The number of Unity instances running in parallel
- start_coordinate: The coordinate from which to spawn agents in simulation
- success_reward: The reward received at every timestep on achieving the goal
- target_pose_noise: The noise to add to the ground-truth pose of the dolly to simulate pose-detection errors
- target_separation: The distance between the target (dolly) and the robot along the x and y coordinates
- target_spawn_randomization: The pose randomization of the target dolly from its center in x, y, and angle coordinates. This parameter lists the maximum displacement allowed in each of these coordinates.
- tensorboard_log_directory: The directory in which to store all logs
- tick_period: The tick period for codelets in the pipeline
- timestamp_profile: The target timesteps to append to each predicted output of the neural network
- tolerance: The deltas along the x and y directions from the center of the dolly that are considered successful docking end poses
- use_pretrained_model: A flag that, if true, indicates a checkpoint model needs to be restored
- wall_thickness: The thickness of the area behind a hit that is marked as solid when integrating a flatscan
- x_allowance: The maximum allowable distance that the agent can move along the negative and positive x directions before being killed
- y_allowance: The maximum allowable distance that the agent can move along the negative and positive y directions before being killed
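As a sketch of the bias term above (the coefficient values are placeholders, not the shipped defaults):

import math

def docking_shaping_reward(x, y, angle, bias=(1.0, -1.0, 1.0)):
    # a*x + b*|y| - |y| / |c * tan(|angle|)|, with (a, b, c) taken from bias.
    # Note that, as written, the last term is singular at angle = 0.
    a, b, c = bias
    return a * x + b * abs(y) - abs(y) / abs(c * math.tan(abs(angle)))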