Object detection and 3D pose estimation play a crucial role in robotics. They are needed in a variety of applications such as navigation, object manipulation, and inspection. The 3D Object Pose Estimation application in Isaac SDK provides the framework to train pose estimation for any model completely in simulation, and to test and run the inference in simulation as well as the real world.
This application uses the Isaac SDK Pose CNN Decoder architecture for its 3D pose estimation model. This architecture is based on the work by Sundermeyer et al. and PoseCNN by Xiang et al. Given an RGB image, the algorithm first detects objects from a known set of objects with available 3D CAD models using any object detection model available in Isaac SDK, then estimates their 3D poses using the Pose CNN Decoder model. This application and network architecture aim at low-latency, real-time object detection and 3D pose estimation, leveraging GPU acceleration while achieving good accuracy.
Compared to the autoencoder-based 3D pose estimation algorithm in Isaac SDK, this application has the following attributes:
More user-friendly: It eliminates the intermediate steps of generating a codebook and performing codebook lookups.
Can handle multiple object classes with a single trained model: In contrast, the autoencoder-based approach requires a separate trained model and codebook for each object class.
More robust against occlusions: The network learns to directly predict the object center from the bounding box parameters and input image.

3D Pose Estimation Algorithm
This application consists of two modules:
An inference module
A Pose CNN Decoder training module
Inference Module
The inference module consists of two sub-modules: object detection and 3D pose estimation. The object detection sub-module accepts an RGB image and determines bounding boxes for objects of interest using any object detection model available in Isaac SDK; currently, the sample application files support Isaac's DetectNet object detection inference model. The 3D pose estimation sub-module crops the image based on the bounding boxes and estimates the 3D pose of the object in each cropped image. To estimate the 3D pose of an object, this module requires a trained Pose CNN Decoder neural network that accepts as inputs the cropped image and its corresponding bounding box parameters, namely the bounding box size and center.
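For intuition, the following is a minimal sketch, not the actual Isaac SDK codelet, of how a detection's bounding box can be turned into the two inputs the pose network expects. The crop resolution, bounding box format, and normalization are assumptions for illustration:

import numpy as np
import cv2  # assumption: OpenCV is available for cropping and resizing

def make_pose_net_inputs(rgb_image, bbox, crop_size=(128, 128)):
    """Crop the RGB image around a detection and build the bounding box input.
    bbox is (row_min, col_min, row_max, col_max) in pixels; crop_size is an
    assumed network input resolution, not necessarily what Isaac SDK uses."""
    r1, c1, r2, c2 = bbox
    crop = rgb_image[r1:r2, c1:c2]
    crop = cv2.resize(crop, crop_size).astype(np.float32) / 255.0
    rows, cols = rgb_image.shape[:2]
    # Bounding box parameters: center and size, normalized by the image dimensions.
    bbox_params = np.array([
        (r1 + r2) / 2.0 / rows,   # center row
        (c1 + c2) / 2.0 / cols,   # center column
        (r2 - r1) / float(rows),  # height
        (c2 - c1) / float(cols),  # width
    ], dtype=np.float32)
    return crop, bbox_params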
Pose CNN Decoder Training Module
The Pose CNN Decoder training module trains the neural network for the object or list of objects. This module uses a variant of the encoder-decoder architecture, along with regression layers, to estimate the 3D pose of the object. It trains the decoder network to reconstruct the input RGB image as a segmentation image that is invariant to occlusions, forcing the network to learn the object shape regardless of occlusions and other nuisance factors such as background and lighting conditions.
The training module requires simulated color and segmentation image data based on the user-provided 3D CAD model of the object. This application also provides a simulated data generation module using the Unity Engine and IsaacSim Unity3D, which generates simulated data for an object given its 3D CAD model.
The Pose CNN Decoder network takes two inputs: a cropped RGB image and the corresponding bounding box parameters. The network uses two encoders to extract features from the image and from the bounding box parameters, then concatenates the two feature spaces. From the concatenated features, a decoder reconstructs the cropped image as a de-noised segmentation image, while regression layers estimate the translation and rotation parameters of the object's 3D pose.
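The Keras-style sketch below illustrates this two-encoder/one-decoder layout with regression heads. Layer types, sizes, and counts are illustrative assumptions and do not reproduce the exact Isaac SDK network; only the input and output names mirror those used later in the freeze and UFF conversion commands:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_pose_cnn_decoder_sketch(crop_shape=(128, 128, 3)):
    """Illustrative two-encoder/decoder/regression layout (not the Isaac SDK model)."""
    image_in = layers.Input(shape=crop_shape, name="encoder_input")
    bbox_in = layers.Input(shape=(4,), name="encoder_bbox_input")
    # Image encoder: a small convolutional stack.
    x = image_in
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    image_feat = layers.Flatten()(x)
    # Bounding box encoder: a small fully connected stack.
    b = layers.Dense(32, activation="relu")(bbox_in)
    # Concatenated feature space shared by the decoder and the regression heads.
    feat = layers.Concatenate()([image_feat, b])
    latent = layers.Dense(16 * 16 * 64, activation="relu")(feat)
    # Decoder: reconstructs a de-noised segmentation image of the object.
    d = layers.Reshape((16, 16, 64))(latent)
    for filters in (64, 32):
        d = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(d)
    segmentation = layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                                          activation="sigmoid", name="decoder_output")(d)
    # Regression heads: 3 translation parameters and a 4-element rotation quaternion.
    translation = layers.Dense(3, name="translation_output")(feat)
    rotation = layers.Dense(4, name="rotation_output_1")(feat)
    return Model([image_in, bbox_in], [segmentation, translation, rotation])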

The decoder that outputs the segmentation image is trained to remove occlusions and fill in the occluded pixels of the object. The output translation parameters are the object center x and y coordinates in the camera image frame and the real-world depth from the camera. The four rotation parameters are the quaternion representing the rotation of the object in the camera frame. The segmentation mask is used again in the pose refinement step to crop out the relevant depth-image point cloud, and the 3D pose estimate is used as a starting point to refine the pose.
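As an illustration of how these outputs map to a metric pose, the sketch below back-projects the predicted image-frame center and depth through a pinhole camera model. This is standard camera geometry, not Isaac SDK API; the function and parameter names are placeholders:

import numpy as np

def outputs_to_pose(center_row, center_col, depth,
                    quaternion, focal_length, optical_center):
    """Convert network outputs to a camera-frame pose (illustrative only).
    center_row/center_col: predicted object center in the image (pixels).
    depth: predicted distance from the camera along the optical axis (meters).
    quaternion: predicted rotation (w, x, y, z) in the camera frame.
    focal_length: focal length in pixels; optical_center: (row, col) in pixels."""
    # Pinhole back-projection of the object center to a 3D point (z is the depth axis).
    x = (center_col - optical_center[1]) * depth / focal_length
    y = (center_row - optical_center[0]) * depth / focal_length
    translation = np.array([x, y, depth])
    # Normalize the quaternion so it represents a valid rotation.
    rotation = np.asarray(quaternion, dtype=np.float64)
    rotation /= np.linalg.norm(rotation)
    return translation, rotation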
Acquiring and labeling the real-world data required to train models for 3D pose estimation is extremely challenging, time consuming, and prone to error. The problem is especially aggravated for robotics: robots are required to perform pose estimation in a wide variety of specialized scenarios, so collecting large amounts of accurate and labeled real-world data for every such scenario is prohibitive, which in turn slows down the rate of adoption of these models.
This application trains the network for 3D pose estimation entirely in simulation, using features available in simulators, such as domain randomization and procedural generation, to bridge the sim-to-real gap.
The Pose CNN Decoder training stage involves two broad steps:
Generating simulated data
Running the Pose CNN Decoder training pipeline
Generating Simulated Data
Training the Pose CNN Decoder model requires four different types of data:
A rendered RGB image
A corresponding segmentation image
A de-noised version of the segmentation image (i.e. a version of the segmentation image that is invariant to occlusions)
The ground truth 3D pose of the object
Generating data for sample objects from scene binary: Industrial Dolly and Industrial Box
A sample “factory of the future” scene binary, which generates the above data for two objects, a dolly and a box, is available in the isaac_sim_unity3d repository in builds. Run the binary with the following command within the IsaacSim Unity3D release build to start sending samples for dolly pose estimation training:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 8
This command runs the Factory01 Unity scene and scenario 8, which is the dolly pose estimation training scenario.
Use the following command to run the binary within the IsaacSim Unity3D release build to start sending the samples for box pose estimation training:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 10
IsaacSim Unity3D communicates with the Isaac SDK via TCP sockets. The isaac/packages/navsim/apps/navsim.app.json application sets up a connection to publish the simulation data to a user-defined port using a TcpPublisher node. The data is received by the training application at packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json using a TcpSubscriber.
Generating data for custom objects from scene source file
A sample scene to generate training data for custom objects is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation/. Follow the instructions on the IsaacSim Unity3D page to open the pose_cnn_decoder_training.unity scene in the Unity Editor. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.
The scene is configured to stream all four required labels for each frame to Isaac SDK using TCP. The default object of interest in the scene is a dolly. Follow these steps to set up the scene for a new object:
If the object of interest is already in the group in packages/Nvidia/Samples/ObjectPoseEstimation/PoseEstimationObjectsGroup, you can skip steps 1-3.
1. Load Unity, upload the 3D CAD model and textures of the object, and create a prefab.
2. In the prefab, click Add Component and add the LabelSetter script. Enter a Label Name of your choice. This script creates segmentation images and computes bounding boxes.
3. Add the prefab to the PoseEstimationObjectsGroup in packages/Nvidia/Samples/ObjectPoseEstimation. To accomplish this, either increase the element size by 1 or swap an existing prefab with the prefab of your object by dragging and dropping the prefab into the list.
4. Drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene into the Hierarchy panel and remove any existing scenes in the panel.
5. Provide the label name set in the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. The label name is used in IsaacSim to render the segmentation image of the object in a scene.
6. Set a Name for the Dolly element in the Class Labels section. This name is used as the label for the object in Isaac SDK.
7. Save the scene and play it.
8. In Isaac SDK, set the scene parameter in the data.simulation.scenario_manager component to the scene name pose_cnn_decoder_training.
9. In Isaac SDK, add the prefab name of the object to the robot_prefab parameter in the data.simulation.scenario_manager component and also to the class_names parameter in the data.detection3_convertor component in packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json.
Note: This is the name of the object prefab and not the Class Label name. For example, if you are using the “TrashCan02.prefab” Unity object with the LabelSetter name in the prefab set to “trashcan”, then the robot_prefab parameter in the Isaac app should be TrashCan02, and the expression name in ClassLabelManager in Unity can be any name of your choice. Adding the prefab name to the scenario manager in the Isaac app enables spawning of the object in the scene when you run the training app as described in the sections below. Adding the prefab name to data.detection3_convertor is used to filter and encode the ground truth poses used to train the network. A sketch of these settings for the TrashCan02 example is shown below.
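The following is a compact summary of the values set in steps 8-9 for the TrashCan02 example, expressed as a Python dict purely for illustration. It does not reproduce the actual layout of training.app.json; it only pairs each parameter named above with the value it should receive, and whether class_names takes a list or a single string should be checked against the shipped file:

# Illustrative summary of the Isaac-side settings for the TrashCan02 example.
training_app_settings = {
    "data.simulation.scenario_manager": {
        "scene": "pose_cnn_decoder_training",  # scene name set in step 8
        "robot_prefab": "TrashCan02",          # prefab name, not the Class Label name
    },
    "data.detection3_convertor": {
        "class_names": ["TrashCan02"],         # filters/encodes ground truth poses (list assumed)
    },
}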
At this point, you should see the camera positions, objects, lighting, and background substances for walls and floors changing every frame. You can adjust the randomization frequency of the backgrounds, camera positions, etc. as needed in the procedural camera GameObject and its child CameraGroup GameObject.
Running the Pose CNN Decoder training pipeline
Run the training application with the following command:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:pose_estimation_cnn_training
The training configuration can be set in the packages/object_pose_estimation/apps/pose_cnn_decoder/training/training_config.json file. The logs and checkpoints are stored in /tmp/autoenc_logs/ckpts by default, but this path can be changed using the train_logdir config option in the training_config.json file. By default, the training application runs from training iteration 0 until the iteration number equals the training_steps value given in the training_config.json file.
To start the training from an intermediate checkpoint, set the checkpoint config option to the corresponding checkpoint filename, which by default is of the form model-<checkpoint_number>. The script extracts the number appended to “model-” as the checkpoint number and restarts training from that iteration. For example, if you want to restart from iteration number 10000, set checkpoint to model-10000: the step count will then start at 10000 and end at the value of the training_steps config option, which is 50000 by default.
To view the training progress on TensorBoard, run the following command in the terminal:
tensorboard --logdir=/tmp/autoenc_logs/ckpts
The visualization interface can be accessed at http://localhost:6006. The images received from IsaacSim Unity3D, along with the bounding boxes, can be visualized in Sight at http://localhost:3000.
If the decoder output in TensorBoard looks completely black or completely white within 10 iterations, change the add_noise_to_image setting to “true” in the training_config.json file. This adds noise to the images, which prevents the training variables from saturating. You should now start seeing a combination of black and white pixels instead of a completely black/white image. You can reset the option to “false” after a few hundred iterations to train on cleaner images if desired.
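For convenience, the sketch below edits the options mentioned above programmatically, assuming training_config.json is a flat JSON object containing these keys; other keys in the shipped file are left untouched:

import json

CONFIG = ("packages/object_pose_estimation/apps/pose_cnn_decoder/"
          "training/training_config.json")

with open(CONFIG) as f:
    cfg = json.load(f)

# Only the options discussed in this section are touched here.
cfg["train_logdir"] = "/tmp/autoenc_logs/ckpts"  # where logs and checkpoints are written
cfg["training_steps"] = 50000                    # final training iteration
cfg["checkpoint"] = "model-10000"                # resume from iteration 10000
cfg["add_noise_to_image"] = True                 # use the same literal form ("true"/true) as the shipped file

with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)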
Depending on the type of object, the model may not need the default of 50000 training steps. You can end training sooner if the reconstructed decoder image, which is the output of the model, is clear over multiple training samples. This is a good check to determine the quality of the model. You can view the reconstructed image in the TensorBoard Images section.
The frozen TensorFlow model is only generated at the end of training (i.e. when the number of iterations equals the value of the training_steps config option). If you want to end training before the given number of steps is reached, you can generate the frozen model from an intermediate checkpoint by running the following command:
bob@desktop:~/isaac$ bazel run packages/ml/tools:freeze_tensorflow_model_tool -- --out /tmp/autoenc_logs/ckpts/model-24000 --output_node_name 'translation_output','rotation_output_1','decoder_output' /tmp/autoenc_logs/ckpts/model-24000
The above command uses checkpoint number 24000; change this value as needed.
Conversion to UFF Model
The UFF package contains a set of utilities to convert trained models from various frameworks to a common UFF format. In this application, the UFF parser converts the TensorFlow model to UFF so it can be used for inference. Refer to the NVIDIA TensorRT documentation for more details.
At the end of the training iterations, the TensorFlow model is saved as a .pb file. You then need to convert it to a UFF model using the Python script and UFF parser. For example, at the end of 24000 iterations, the TensorFlow model is saved as model-24000.pb and can be converted to a UFF model using the following command:
bob@desktop:~/isaac$ bazel run packages/ml/tools:tensorflow_to_tensorrt_tool -- --out /tmp/autoenc_logs/ckpts/pose_cnn_model.uff --input_node_name 'encoder_input' --input_node_name 'encoder_bbox_input' --output_node_name 'translation_output' --output_node_name 'rotation_output_1' --output_node_name 'decoder_output' /tmp/autoenc_logs/ckpts/model-24000-frozen.pb
Running inference
There are four available modes of inference based on data collection for object pose estimation: simulation, camera feed, log replay, and image feed. These modes can be run via Python script. As with training, the package provides sample inference for two objects: a dolly for the navigation use case and a box for the manipulation use case.
The inference Python script and the config files for the dolly and box are located in packages/object_pose_estimation/apps/pose_cnn_decoder. The default config file for inference is packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json, which is for the dolly object. Commands to run the four inference modes for both the dolly and box are provided below:
Inference on simulation data is a good first test to check the accuracy of the model. To run the inference application on simulation data for both detection and pose estimation, first play the pose estimation training scene binary, then run the inference command within Isaac SDK. The following are input arguments needed for this mode:
--mode: This should be set to sim for this mode.
--image_channel: The name of the image channel sending the color image from simulation. The default name is color.
--rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
For dolly inference, run scenario 12 using the following command from inside the isaac_sim_unity3d release folder:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 12
Then run dolly inference on simulated data using this command:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel color
For box inference, run scenario 11 using the following command from inside the isaac_sim_unity3d release folder:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 11
Then run box inference on simulated data using this command:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json
To run inference on a real-time camera feed from a RealSense camera, the following input arguments are required:
--mode: This should be set to realsense for this mode.
--rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
--fps: The frame rate of the RGB channel. Currently, only two frame rates, 15 and 30, are supported for the RealSense camera. The default value is 15.
Run dolly inference on a camera feed using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense
Run box inference on a camera feed using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json
To run the inference application using recorded logs, the following input arguments are required:
--mode: This should be set to cask for this mode.
--cask_directory: The path to the sample replay log. The default path is set to the dolly replay log.
--rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
Run dolly inference on sample logs using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --rows 480 --cols 848
Run box inference on sample logs using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --cask_directory external/sortbot_pose_estimation_data/klt_logs
To run the inference application using sample images, the following input arguments are required:
--mode: This should be set to image for this mode.
--image_directory: The path to the sample image. The default is set to the path for the dolly sample image.
--rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
--optical_center_rows and --optical_center_cols: The optical center of the camera, specified as a row and column in the input image. The default values are 360 and 640, respectively.
--focal_length: The focal length, in pixels, of the camera used to capture the image. The default value is 925.74.
Note: The last three parameters are needed to construct the pinhole model of the camera.
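For reference, the sketch below shows the pinhole model those three arguments define, using the default values listed above. This is standard camera geometry, not Isaac SDK code:

import numpy as np

# Defaults from the argument list above.
optical_center_rows, optical_center_cols = 360, 640
focal_length = 925.74  # pixels

# Intrinsic matrix of the pinhole camera model, with (u, v) = (column, row).
K = np.array([
    [focal_length, 0.0,          optical_center_cols],
    [0.0,          focal_length, optical_center_rows],
    [0.0,          0.0,          1.0],
])

# A point at depth z in front of the camera projects to pixel (u, v) via K.
point_camera = np.array([0.1, 0.0, 2.0])  # meters, camera frame
u, v, w = K @ point_camera
print(u / w, v / w)                       # pixel column, row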
Run dolly inference on a sample image using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --rows 480 --cols 848 --optical_center_rows 240 --optical_center_cols 424 --focal_length 614.573
Run box inference on a sample image using the following command inside Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --image_directory external/sortbot_pose_estimation_data/klt_sample_images/klt_box.png
The inference applications take an RGB image from different input sources and output the estimated 3D poses of the detected objects, along with the segmentation mask of the cropped image. Sample data for the dolly object is provided, allowing you to run the last two inference apps using images and logs. You can visualize the estimated pose in Sight at http://localhost:3000.
There are two types of visualization for the estimated pose:
A 3D bounding box, which requires specification of both the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. Configure these parameters in the viewers/Detections3Viewer component in the inference application files.
A rendering of the CAD model in the scene, which requires the path to the object CAD model and the file names. These correspond to the assetroot and assets parameters, respectively, in the websight component in the inference application files (see the sketch below).
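Below is a minimal, illustrative fragment for the CAD-model rendering case. Only the assetroot and assets parameter names come from this section; the path and file name values are placeholders, and whether assets takes a list should be checked against the shipped inference app files:

# Illustrative values for rendering the CAD model in Sight. The asset path and
# file name are placeholders; point them at your object's CAD model.
websight_settings = {
    "websight": {
        "assetroot": "/path/to/object_cad_models",  # folder containing the CAD model
        "assets": ["dolly"],                        # CAD model file name(s) to render (list assumed)
    },
}
# For the 3D bounding box visualization, set the box size at zero orientation and
# the object-center-to-box-center transform in the viewers/Detections3Viewer
# component; the exact parameter names are listed in the inference app files.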
If you notice lag in mesh versus image rendering in Sight, increase the Change delay value slightly to synchronize the image, rendered mesh, and bounding box. To find this option, right-click the rendering window in Sight.
Sample inference

