Object detection and 3D pose estimation play a crucial role in robotics. They are needed in a variety of applications such as navigation, object manipulation, and inspection. The 3D Object Pose Estimation application in the Isaac SDK provides the framework to train pose estimation for any model completely in simulation, and to test and run the inference in simulations, as well as the real world.
This application uses the Isaac SDK Pose CNN Decoder architecture for its 3D pose estimation model. This architecture is based on the work by Sundermeyer et al. and PoseCNN by Xiang et al. Given an RGB image, the algorithm first detects objects from a known set of objects with available 3D CAD models using any object detection model available in Isaac SDK, then estimates their 3D poses using the Pose CNN Decoder model. This application and network architecture aim to deliver low-latency, real-time object detection and 3D pose estimation with good accuracy by leveraging GPU acceleration.
Compared to the autoencoder-based 3D pose estimation algorithm in Isaac SDK, this application has the following attributes:
- More user-friendly: It eliminates an extra intermediate step to generate the codebook and perform codebook lookup.
- Can handle multiple object classes with a single trained model: In contrast, the autoencoder approach requires a separate trained model and codebook for each object class.
- More robust against occlusions: The network learns to directly predict the object center from the bounding box parameters and input image.

3D Pose Estimation Algorithm
This application consists of two modules:
- An inference module
- A Pose CNN Decoder training module
Inference Module
The inference module consists of two sub-modules: object detection and 3D pose estimation. The object detection sub-module accepts an RGB image and determines bounding boxes for objects of interest using any object detection model available in the Isaac SDK. Currently, the sample application files support Isaac SDK's DetectNet object detection inference model. The 3D pose estimation sub-module crops the images based on the bounding boxes and estimates the 3D poses of the objects in each of the cropped images. To estimate the 3D pose of an object, this module requires a trained Pose CNN Decoder neural network that accepts as inputs the cropped images and their corresponding bounding box parameters, namely the bounding box size and center.
Pose CNN Decoder Training Module
The Pose CNN Decoder training module trains the neural network for the object or list of objects. This module uses a variant of the encoder-decoder architecture, along with the regression layers, to estimate the 3D pose of the object. It trains the decoder neural network to reconstruct the input RGB image as a segmentation image that is invariant to occlusions, thus forcing the network to learn the object shape irrespective of the occlusions and other noise, background, and lighting conditions.
The training module requires simulated color and segmentation image data based on the user-provided 3D CAD model of the object. This application also provides a simulated data generation module using the Unity Engine and Isaac Sim Unity3D, which generates simulated data for an object given its 3D CAD model. However, the training application is agnostic to the simulation platform used to generate the data, so Isaac Sim Unity3D can be replaced with any other simulation engine, such as Isaac Sim based on NVIDIA's Omniverse Kit.
The Pose CNN Decoder network takes two inputs: a cropped RGB image and the bounding box parameters. The network uses two encoders to extract features from the image and the bounding box, and concatenates the two feature spaces. From the concatenated feature space, the network decodes the cropped image as a de-noised segmentation image and estimates the translation and rotation parameters of the object's 3D pose by posing them as a regression problem.
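As a concrete illustration of this two-encoder layout, here is a minimal Keras-style sketch. It is not the SDK's Pose CNN Decoder implementation: layer counts and sizes are hypothetical, and only the input and output tensor names loosely follow the node names used in the model-conversion commands later in this section.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pose_cnn_decoder(image_shape=(128, 128, 3), latent_dim=128):
    """Illustrative two-encoder / decoder-plus-regression layout.
    Layer counts and sizes are hypothetical, not the SDK's actual network."""
    # Encoder 1: cropped RGB image -> feature vector
    image_in = tf.keras.Input(shape=image_shape, name="encoder_input")
    x = layers.Conv2D(32, 5, strides=2, activation="relu")(image_in)
    x = layers.Conv2D(64, 5, strides=2, activation="relu")(x)
    x = layers.Conv2D(128, 5, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    image_feat = layers.Dense(latent_dim, activation="relu")(x)

    # Encoder 2: bounding box parameters (center and size) -> feature vector
    bbox_in = tf.keras.Input(shape=(4,), name="encoder_bbox_input")
    bbox_feat = layers.Dense(32, activation="relu")(bbox_in)

    # Concatenated feature space shared by all output heads
    latent = layers.Concatenate()([image_feat, bbox_feat])

    # Decoder head: reconstruct a de-noised segmentation image
    d = layers.Dense(16 * 16 * 64, activation="relu")(latent)
    d = layers.Reshape((16, 16, 64))(d)
    d = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(d)
    d = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(d)
    seg_out = layers.Conv2DTranspose(1, 5, strides=2, padding="same",
                                     activation="sigmoid", name="decoder_output")(d)

    # Regression heads: 3 translation parameters and a 4-element rotation quaternion
    trans_out = layers.Dense(3, name="translation_output")(latent)
    rot_out = layers.Dense(4, name="rotation_output")(latent)

    return tf.keras.Model(inputs=[image_in, bbox_in],
                          outputs=[seg_out, trans_out, rot_out])
```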

3D Pose Estimation DNN Architecture
The decoder that outputs the segmentation image is trained to remove occlusions and fill the occluded pixels of the object. The output translation parameters are the object center x and y coordinates in the camera image frame, and depth from the camera in the real world. The four rotation parameters are the quaternions of the rotation of the object in the camera frame. The segmentation mask is used again in the pose refinement step to crop out the relevant depth-image point cloud, and the 3D pose estimate is used as a starting point to refine the pose.
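To make the translation parameterization concrete, the following is a minimal sketch of recovering the full 3D object center from the predicted image-frame center and depth, assuming a standard pinhole camera model; the focal length and principal point values are illustrative only.

```python
import numpy as np

def backproject_center(cx, cy, depth, focal, principal_point):
    """Recover the 3D object center from the predicted image-frame center
    (cx, cy) and depth, assuming a standard pinhole camera model."""
    fx, fy = focal
    px, py = principal_point
    x = (cx - px) * depth / fx
    y = (cy - py) * depth / fy
    return np.array([x, y, depth])

# Example: predicted center at pixel (640, 360), 2.5 m from the camera
# (intrinsics values here are illustrative only).
t = backproject_center(640.0, 360.0, 2.5,
                       focal=(925.74, 925.74), principal_point=(640.0, 360.0))
```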
Acquiring and labeling the real-world data required to train models for 3D pose estimation is extremely challenging, time consuming, and prone to error. The problem is especially aggravated for robotics: robots are required to perform pose estimation in a wide variety of specialized scenarios, so collecting large amounts of accurate and labeled real-world data for every such scenario is prohibitive, which in turn slows down the rate of adoption of these models.
This application trains the network for 3D pose estimation entirely on simulated data and bridges the sim-to-real gap through features available in simulators, such as domain randomization and procedural generation techniques.
The Pose CNN Decoder training stage involves two broad steps:
- Generating simulated data
- Running the Pose CNN Decoder-training pipeline
Generating Simulated Data
Training the Pose CNN Decoder model requires three different types of data:
- A rendered RGB image
- A de-noised version of the segmentation image (i.e., a version of the segmentation image that is invariant to occlusions)
- 3D poses of the camera and the object of interest, used to compute the 3D pose of the object with respect to the camera
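For the last item, the pose of the object with respect to the camera can be composed from the two recorded world-frame poses as camera_T_object = inverse(world_T_camera) * world_T_object. The sketch below shows this composition with homogeneous transforms; the (w, x, y, z) quaternion convention and the example pose values are assumptions, not values from the scene.

```python
import numpy as np

def pose_to_matrix(q_wxyz, t_xyz):
    """4x4 homogeneous transform from a quaternion in (w, x, y, z) order
    and a translation vector."""
    w, x, y, z = np.asarray(q_wxyz, dtype=float) / np.linalg.norm(q_wxyz)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t_xyz
    return T

# Hypothetical poses reported by the simulator, both in the world frame.
world_T_camera = pose_to_matrix((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.5))
world_T_object = pose_to_matrix((0.707, 0.0, 0.0, 0.707), (2.0, 0.5, 0.0))

# Training label: pose of the object in the camera frame.
camera_T_object = np.linalg.inv(world_T_camera) @ world_T_object
```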
Generating data for sample objects from scene binary: Industrial Dolly and Industrial Box
A sample “factory of the future” scene binary, which generates the above data for two objects, a dolly and a box, is available in the isaac_sim_unity3d repository in builds. Run the binary with the following command within the Isaac Sim Unity3D release build to start sending the samples for dolly pose estimation training:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 8
This command runs the Factory01 Unity scene with scenario 8, the dolly pose estimation training scenario. The training data in this scenario is collected from the viewpoints of the camera attached to the robot.
Use the following command to run the binary within the Isaac Sim Unity3D release build to start sending the samples for box pose estimation training:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 10
Isaac Sim Unity3D communicates with the Isaac SDK via TCP sockets. The isaac/sdk/packages/navsim/apps/navsim.app.json application sets up a connection to publish the simulation data to a user-defined port using a TcpPublisher node. The data is received by the training application at packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json using a TcpSubscriber.
Generating data for custom objects from scene source file
A sample scene to generate training data for custom objects is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation. Follow the instructions on the Isaac Sim Unity3D page to open the pose_cnn_decoder_training.unity scene in the Unity Editor. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.
The scene is configured to stream all three required labels for each frame to Isaac SDK using TCP. The default object of interest in the scene is a dolly. Follow these steps to set up the scene for a new object:
If the object of interest is already in the group in packages/Nvidia/Samples/ObjectPoseEstimation/PoseEstimationObjectsGroup, you can skip steps 1-3.
- Load Unity, upload the 3D CAD model and textures of the object, and create a prefab.
- In the prefab, click Add Component and add the LabelSetter script. Enter a Label Name of your choice. This script creates segmentation images and computes bounding boxes.
- Add the prefab to the PoseEstimationObjectsGroup in packages/Nvidia/Samples/ObjectPoseEstimation. To accomplish this, either increase the element size by 1 or swap an existing prefab with the prefab of your object by dragging and dropping the prefab into the list.
- Drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene into the Hierarchy panel and remove any existing scenes in the panel.
- Provide the label name set to the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. The label name is used in Isaac Sim to render the segmentation image of the object in a scene. The expression name in ClassLabelManager can be any name of your choice and is set by default to “Target”.
- Save the scene and play it.
At this point, you should see the camera positions, objects, lighting, and background substances for walls and floors changing every frame. You can adjust the randomization frequency of the backgrounds, camera positions, etc. as needed in the procedural_camera GameObject and its child CameraGroup, as well as the procedural GameObjects.
It is recommended to adjust the UniformRange values for the Position and Target configuration in the CameraGroup child object of procedural_camera based on the size of the object. The goal is to ensure that the data samples cover all possible viewpoints and distance ranges from the camera. Adding a non-zero uniform range to the Target parameter ensures that the object is not always rendered in the center of the image frame, which is important for getting diverse training samples for translation.
The camera model is set by default to use a vertical FOV of 42 degrees and an image resolution of 1280 (width) x 720 (height). If you wish to train for another camera model, these values can be changed in Field of View, Width, and Height in the EncoderColorCamera and DecoderSegmentationCamera child objects of CameraGroup.
A basic scene setup for pose estimation training, using a sample cube, is also available in Isaac Sim based on the Omniverse Kit. Details on the scene setup are provided in the documentation here. The scene is scripted to send the poses and the encoder and decoder images needed for a sample cube through TCP, using the same channel names as the Unity scene setup. Hence, the training pipeline in Isaac SDK can be re-used for the Omniverse-based training scene as well. However, it is up to you to add all the required domain randomizations, like background, lighting, texture, and color, and to set up the camera pose randomizations needed for the training to ensure that the data samples cover the requirements of the inference setup. More details on domain randomization in Isaac Sim can be found here.
Running the Pose CNN Decoder Training Pipeline

Pose CNN Decoder Training Application Graph
Isaac SDK offers two modes for training the pose estimation model: online and offline. During online training, the data is generated in simulation on the fly and ingested into the training pipeline. The main advantage of this mode of training is the availability of an effectively unlimited amount of data to train the network. However, running simulation for data generation and training the network simultaneously is a computation-intensive task and can slow down the training if more than one GPU is not available. In such cases, it is useful to perform training offline, where you first generate and save a finite amount of labeled data for training and then perform training separately. The figure above illustrates the training application for online training and the codelet connections. In case of offline training, the SampleAccumulator node at the end is replaced with the data generation node.
Storing generated data for offline training and validation
Before offline training starts, a fixed amount of data is generated and saved by running the simulation scene and the provided data generation application (packages/object_pose_estimation/apps/pose_cnn_decoder/training/generate_training_validation_data.app.json) in the Isaac SDK repository. This application also generates validation data to assess the performance of the trained network on an unseen, fixed dataset as the training progresses. You can skip this step if you want to do online training without validation during training.
Before running the application, set the number of required training and validation data samples in the GeneratePoseCnnDecoderData component with the respective config parameters num_training_samples and num_validation_samples in the application packages/object_pose_estimation/apps/pose_cnn_decoder/training/generate_training_validation_data.app.json. In case of online training, set num_training_samples to 0.
Also, when using the custom Unity scene, set the scene parameter to the scene name pose_cnn_decoder_training and the robot_prefab parameter to the prefab name of the object in the data.simulation.scenario_manager component. To use the factory of the future scene instead for the industrial dolly and box, set the scene name to Factory01.
To generate the data, run the application using the following command inside the Isaac SDK repository:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:generate_training_validation_data
The training and validation data are saved by default in /tmp/pose_data/training and /tmp/pose_data/validation, respectively.
Running the training application
For online training, the simulation scene must still be running in parallel.
The training configuration can be set in the packages/object_pose_estimation/apps/pose_cnn_decoder/training/training_config.json file. The training mode is configured by setting the "online_training" config parameter to “true” for online training and “false” for offline training.
For online training with the custom Unity scene, set the scene config parameter to pose_cnn_decoder_training and robot_prefab to the prefab name of the object to be spawned in the simulation. The default scene name is “Factory01”, which corresponds to the factory of the future scene.
To train industrial dolly online using the provided factory of the future scene, set the decoding_full_object parameter to false in the training config file. This trains the decoder to decode only the frame of the dolly without the wheels, so that the pose is more robust to wheel randomization for this use case.
In case of offline training, the path to the saved training data is set in training_data_dir. The default directory is /tmp/pose_data/training, which is the directory generated for storing training data during the offline data generation step.
The GPU used for training can be set with the config parameter gpu_visible_device and is set by default to “0”. If you have multiple GPUs on your system, it is recommended to run the training on a different GPU from the one running the simulation in case of online training, to increase the training speed.
The logs and checkpoints are stored in /tmp/autoenc_logs/ckpts by default, but this path can be changed using the train_logdir config option in the configuration file. By default, the training application runs from training iteration 0 until the iteration number equals the training_steps value given in the training_config.json file.
To start the training from an intermediate checkpoint, set the checkpoint config option to the corresponding checkpoint filename, which is by default in the form model-<checkpoint_number>. The script extracts the number following “model-” as the checkpoint number and restarts from that iteration number. For example, if you want to restart from iteration number 10000, set checkpoint to model-10000. The step number will then start from 10000 and end at the value of the training_steps config option, which is set to 50000 by default.
The default learning rate of the training is set to 2e-4 in the config option learning_rate and can be reduced to 1e-4 after the first 20000 iterations or so to facilitate the training and reduce the errors further.
If you need to validate the model during training on fixed simulation data, set the validation config option to “true” and specify the path to the saved validation data in validation_data_dir. This is set to /tmp/pose_data/validation by default, which is the directory generated for storing validation data during the offline data generation step. During the validation step, pose estimation error metrics like the median error in translation and rotation are computed and stored, along with the corresponding iteration number, in the logs directory. The number of validation data samples and the frequency of the validation are set in the config options num_validation_data and validation_every, respectively.
Computing the error metrics from the model outputs requires information about the pinhole model of the actual camera and the dimensions of the image before cropping; these values are provided in the config options pinhole_focal, pinhole_center, and image_dim. The defaults for these parameters are set to the training scene setup with a camera vertical FOV of 42 degrees and an image resolution of 720 x 1280. You also need to specify the rotation and translation error thresholds used to compute the accuracy metric (the fraction of validation data within the set error thresholds) via the config options rotation_err_threshold (default is 10 degrees) and translation_err_threshold, respectively. The error metrics are computed over the samples whose ground truth depth (distance from the camera) is within the range of interest specified by the depth_roi config option. The default is set to [1.7, 3.5], which means the mean error metrics of the pose are computed only across the samples in the validation data whose distance from the camera is in the range of 1.7 m to 3.5 m. Setting it to “None” computes the metrics across all the validation data samples.
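For clarity, the depth_roi filtering behaves as in the small helper below; this is a sketch of the described semantics, not the SDK's code.

```python
import numpy as np

def filter_by_depth_roi(depths, depth_roi=(1.7, 3.5)):
    """Boolean mask of validation samples whose ground-truth distance from the
    camera falls inside the depth range of interest.
    depth_roi=None keeps all samples."""
    depths = np.asarray(depths)
    if depth_roi is None:
        return np.ones(depths.shape, dtype=bool)
    lo, hi = depth_roi
    return (depths >= lo) & (depths <= hi)
```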
The parameters after the validation config options are mainly related to the pose network architecture and can be left at their default values, unless you need to experiment with the network architecture, such as the number of filters in the encoder and decoder, the filter sizes, and the latent vector and fully connected layer sizes, which change the size of the network. There are also options to set the translation and rotation loss functions using the config params translation_loss and rotation_loss. The available options for translation loss are “L1” and “L2”, which compute the L1 and L2 losses of the translation values. The following are the available options for rotation loss:
- “AcosLoss”: The default value, where the error function is the arccosine of the quaternion dot product of the predicted and ground truth rotations, which is the rotation error in radians.
- “PLoss”: This loss function computes the L2 error of the distance between the corners of a unit cube after rotating them with the predicted and ground truth rotations.
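The following is a minimal NumPy sketch of these two loss functions, for intuition only; it is not the SDK's implementation, and the unit-cube corner convention and the use of the absolute dot product (to handle the q/-q ambiguity) are assumptions.

```python
import numpy as np

def acos_loss(q_pred, q_gt):
    """Arccosine of the quaternion dot product between the predicted and
    ground-truth rotations (unit quaternions assumed). The absolute value
    handles the q / -q ambiguity; clipping guards against round-off."""
    d = np.clip(np.abs(np.dot(q_pred, q_gt)), -1.0, 1.0)
    return np.arccos(d)

def p_loss(R_pred, R_gt):
    """L2 error between the corners of a unit cube rotated by the predicted
    and ground-truth rotation matrices."""
    corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                                  for y in (-0.5, 0.5)
                                  for z in (-0.5, 0.5)])
    diff = corners @ R_pred.T - corners @ R_gt.T
    return np.mean(np.sum(diff ** 2, axis=1))
```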
Lastly, there are two other important config options related to the axis of symmetry of the object: symmetry_axes, which is set to the index of the rotation axis of symmetry, with 0, 1, and 2 representing the x, y, and z axes respectively, and num_rotation_symmetry, which is set to the number of rotational symmetries around the symmetry_axes. These config parameters are important for computing the rotation loss function while considering the symmetry of the object.
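As a quick reference, the sketch below collects the config options discussed above into one place by loading and overriding them in training_config.json. It assumes these options are plain top-level keys of the file; the values shown are either the documented defaults or, where marked, hypothetical examples. Check the shipped file for the exact layout before using it.

```python
import json

config_path = ("packages/object_pose_estimation/apps/pose_cnn_decoder/training/"
               "training_config.json")

with open(config_path) as f:
    config = json.load(f)

# Training mode and data locations (assumed to be top-level keys of the file).
config["online_training"] = "false"               # "true" for online training
config["training_data_dir"] = "/tmp/pose_data/training"
config["train_logdir"] = "/tmp/autoenc_logs/ckpts"
config["gpu_visible_device"] = "0"
config["learning_rate"] = 2e-4
config["training_steps"] = 50000
config["checkpoint"] = "model-10000"              # hypothetical resume point

# Validation during training.
config["validation"] = "true"
config["validation_data_dir"] = "/tmp/pose_data/validation"
config["num_validation_data"] = 500               # hypothetical value
config["validation_every"] = 1000                 # hypothetical value
config["rotation_err_threshold"] = 10.0           # degrees
config["translation_err_threshold"] = 0.1         # hypothetical value, meters
config["depth_roi"] = [1.7, 3.5]

# Loss functions and object symmetry.
config["translation_loss"] = "L1"
config["rotation_loss"] = "AcosLoss"
config["symmetry_axes"] = 2                       # z axis
config["num_rotation_symmetry"] = 2               # hypothetical: 180-degree symmetry

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```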
Now, run the training application with the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:pose_estimation_cnn_training
To view the training progress on TensorBoard, run the following command in the terminal:
tensorboard --logdir=/tmp/autoenc_logs/ckpts
The visualization interface can be accessed at http://localhost:6006.
The images received from the simulation engine, along with the bounding boxes, can be visualized in Sight at http://localhost:3000.
If the decoder output in TensorBoard looks completely black or completely white within 10 iterations, change the add_noise_to_image setting to “true” in the training_config.json file. This adds noise to the images, which prevents the training variables from saturating. You should now start seeing a combination of black and white pixels instead of a completely black/white image. You can reset the option to “false” after a few hundred iterations to train on cleaner images if desired.
Depending on the type of object, the model may not need the default of 50000 training steps. You can end training sooner if the reconstructed decoder image, which is the output of the model, is clear over multiple training samples. This is a good check to determine the quality of the model. You can view the reconstructed image in the TensorBoard Images section. Also, you can evaluate the model on sample simulation data to make sure the error metrics are within acceptable limits for the use case. Instructions for performing pose estimation evaluation are detailed in the final sections of this chapter.
The frozen TensorFlow model is only generated at the end of training (i.e., when the number of iterations equals the value of the training_steps config option). If you want to end training before the given steps are finished, you can generate the frozen model from the latest saved checkpoint by running the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/ml/tools:freeze_tensorflow_model_tool -- --out /tmp/autoenc_logs/ckpts/model-24000 --output_node_name 'translation_output','rotation_output_1','decoder_output' /tmp/autoenc_logs/ckpts/model-24000
The above command assumes the last checkpoint number is 24000; change this value as needed.
Conversion to UFF Model
The UFF package contains a set of utilities to convert trained models from various frameworks to a common UFF format. In this application, the UFF parser converts the TensorFlow model to UFF so it can be used for inference. Refer to the NVIDIA TensorRT documentation for more details.
At the end of the training iterations, the TensorFlow model is saved as a .pb file. You then need to convert it to a UFF model using the Python script and UFF parser. For example, at the end of 24000 iterations, the TensorFlow model is saved as model-24000.pb and can be converted to a UFF model using the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/ml/tools:tensorflow_to_tensorrt_tool -- --out /tmp/autoenc_logs/ckpts/pose_cnn_model.uff --input_node_name 'encoder_input' --input_node_name 'encoder_bbox_input' --output_node_name 'translation_output' --output_node_name 'rotation_output_1' --output_node_name 'decoder_output' /tmp/autoenc_logs/ckpts/model-24000-frozen.pb
Running Inference
It is advised to use the same camera for inference as was used for training.
The inference application handles changes in image resolution and focal length as long as the aspect ratio of the inference image is the same as that of the training data. The default training camera model uses a vertical FOV of 42 degrees. However, if the camera model parameters for the inference camera differ widely from the training camera, we recommend training the model with the inference camera parameters and tuning the camera randomization parameters in the training scene such that the data samples cover the object of interest spawned across the complete image frame. If the FOV or the image dimensions of the training data are changed, these parameters must be set in the detection_pose_estimation.object_pose_estimation.pose_estimation component in the inference configuration file used.
The end-to-end inference for estimating the pose of the object using the Pose CNN Decoder architecture requires running the object detection model followed by pose estimation. Before running the end-to-end inference with both models, a sample inference application in simulation is provided with ground truth detections, so that you can first test the performance of the pose estimation model trained using the instructions in the sections above. The same training scene can be used to run this inference application. To do this, first play the scene used for training (for example, the factory_of_the_future scene with scenario 11 for the dolly). You can also use the custom scene packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity in Unity for your object of interest. In Isaac SDK, set the class label name in the object_pose_estimation.detection_filter component with the config parameter allowlist_labels in sdk/packages/object_pose_estimation/apps/pose_cnn_decoder/pose_estimation_inference_sim_groundtruth_detection.app.json.
Then run the pose estimation inference application from the Isaac SDK repository using the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:pose_estimation_inference_sim_groundtruth_detection
Visualize the inference in Sight at localhost:3000.

Pose Estimation Inference Application Graph
The figure above illustrates a full end-to-end sample inference application.
For the full end-to-end inference of the pose estimation framework, there are four available modes of inference based on the data source for object pose estimation: simulation, camera feed, log replay, and image feed. These modes can be run via a Python script. As with training, the package provides sample inference for two objects: a dolly for the navigation use case and a box for the manipulation use case. For inference on a custom object, modify the configuration parameters in the default inference configuration file /packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json and use it to run the different inference modes in place of the provided sample config files for dolly and box. Parameters like the path to the UFF file generated from pose estimation training and the parameters of the object detection model (path to the trained model, confidence threshold, etc.) are all set in this config file.
The current release of the Isaac SDK uses the Jetpack version of TensorRT 7.1.3, which does not support dynamic batch sizes for UFF models. Hence, when there are multiple instances of the object in the image, the pose estimation inference is currently set up to infer the pose of the single object instance that overlaps the most with the user-defined center portion of the image. The extent of this middle portion of the image can be set with the "focus_factor" config parameter in the "detection_pose_estimation.object_pose_estimation.detection_filter" component. The focus region checked for the extent of overlap is then the pixel column range [(1 - focus_factor)/2, (1 + focus_factor)/2] * image_columns.
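The helper below illustrates that focus-region formula and a column-overlap test that could be used to pick the instance to keep; it is a sketch of the described behavior, not the detection_filter component's actual code.

```python
def focus_region_columns(focus_factor, image_columns):
    """Pixel column range of the central focus region:
    [(1 - f)/2, (1 + f)/2] * image_columns."""
    lo = (1.0 - focus_factor) / 2.0 * image_columns
    hi = (1.0 + focus_factor) / 2.0 * image_columns
    return lo, hi

def column_overlap(bbox_columns, focus_factor, image_columns):
    """Horizontal overlap (in pixels) between a detection's bounding box
    columns [col_min, col_max] and the focus region. The instance with the
    largest overlap would be kept; this helper is illustrative only."""
    lo, hi = focus_region_columns(focus_factor, image_columns)
    col_min, col_max = bbox_columns
    return max(0.0, min(col_max, hi) - max(col_min, lo))

# Example: with focus_factor = 0.5 and a 1280-column image, the focus
# region spans columns [320, 960].
print(focus_region_columns(0.5, 1280))
```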
The inference Python script and the config files for dolly and box are located in /packages/object_pose_estimation/apps/pose_cnn_decoder.
Commands to run the four inference modes for both industrial dolly and box are provided below:
Inference on simulation data is a good first test to check the accuracy of the model. To run the inference application on simulation data for both detection and pose estimation, first play the pose estimation training scene binary, then run the inference command within Isaac SDK. The following are input arguments needed for this mode:
- --mode: This should be set to sim for this mode.
- --image_channel: The name of the image channel sending the color image from simulation. The default name is color.
- --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
- --scenario_scene and --scenario_robot_prefab: The name of the simulation scene to load and the prefab name of the object of interest to spawn, respectively, if the object needs to be spawned from Isaac SDK. The default option is set to None, which is valid if the object of interest is already set up in the scene, as in the factory_of_the_future scene.
For dolly inference, run scenario 12 using the following command from inside the isaac_sim_unity3d release folder:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 12
Then run industrial dolly inference on simulated data using this command:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel color --intrinsics_channel color_intrinsics --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json
For box inference, run inference scenario 11 using the following command from inside the isaac_sim_unity3d release folder:
bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 11
Then run box inference on simulated data using this command:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json
For inference on a custom object using the sample scene in Unity, open and play the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene in the Unity Editor. You can increase the Frame Interval config option for both the procedural and procedural_camera GameObjects so that the data is not randomized every frame. Then run the inference on simulated data, for example for Dolly.prefab, using this command:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --scenario_scene pose_cnn_decoder_training --scenario_robot_prefab Dolly --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json
To run inference on a real-time camera feed from a RealSense camera, the following input arguments are required:
- --mode: This should be set to realsense for this mode.
- --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
- --fps: The frame rate of the RGB channel. Currently, only two frame rates, 15 and 30, are supported for the RealSense camera. The default value is 15.
Run dolly inference on a camera feed using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json
Run box inference on a camera feed using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json
To run the inference application using recorded logs, the following input arguments are required:
- --mode: This should be set to cask for this mode.
- --cask_directory: The path to the sample replay log. The default path is set to the dolly replay log.
- --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
Run dolly inference on sample logs using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --rows 480 --cols 848 --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json
Run box inference on sample logs using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --cask_directory external/sortbot_pose_estimation_data/klt_logs
To run the inference application using sample images, the following input arguments are required:
- --mode: This should be set to image for this mode.
- --image_directory: The path to the sample image. The default is set to the path for the dolly sample image.
- --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
- --optical_center_rows and --optical_center_cols: The optical center specified as a row and column in the input image. The default values are 360 and 640, respectively.
- --focal_length: The focal length of the camera used to capture the image, in pixels. The default value is 925.74.
The last three parameters are needed to construct the pinhole model of the camera.
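For reference, these three arguments map onto a pinhole intrinsics matrix as sketched below; a square-pixel camera (one focal length for both axes) is assumed here.

```python
import numpy as np

def pinhole_matrix(focal_length, optical_center_rows, optical_center_cols):
    """3x3 intrinsics matrix assembled from the image-mode arguments above,
    assuming square pixels (a single focal length for both axes)."""
    return np.array([
        [focal_length, 0.0,          optical_center_cols],
        [0.0,          focal_length, optical_center_rows],
        [0.0,          0.0,          1.0],
    ])

# Defaults from the image mode: focal length 925.74 px, optical center (360, 640).
K = pinhole_matrix(925.74, 360, 640)
```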
Run dolly inference on a sample image using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --rows 480 --cols 848 --optical_center_rows 240 --optical_center_cols 424 --focal_length 614.573 --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json
Run box inference on a sample image using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --image_directory external/industrial_dolly_sortbot_sample_images/klt_sample_images/klt_box.png
To run the inference application using ROS input topics, the following input ROS topic names are used by default:
Note: ROS must be installed before running the test. See the ROSBridge section for more info.
- /depth
- /rgb
- /camera_info
Once the above ROS topics are published, run dolly inference on them using the following command inside Isaac SDK:
bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_ros_bridge_app
The inference applications take an RGB image from different input sources and output the estimated 3D poses of the detected objects, along with the segmentation mask of the cropped image. Sample data of the dolly and box objects is provided, allowing you to run the last two inference apps using images and logs. You can visualize the estimated pose in Sight at http://localhost:3000.
There are two types of visualization for the estimated pose:
- A 3D bounding box, which requires specification of both the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. Configure these parameters in the viewers/Detections3Viewer component in the inference application files. The box_dimensions parameter must be set to the dimensions of the object of interest. The object_T_box_center parameter must be set to the pose of the object center: it is a pose vector of size 7, with the first four elements representing the rotation quaternion in (w, x, y, z) order and the last three representing the x, y, z components of the translation vector. Hence, the default vector [1, 0, 0, 0, 0, 0, 0] represents a zero rotation and translation pose and can be left as-is if the origin of the CAD model/OBJ file of the object is at its center. If not, an appropriate offset to the center must be added in the translation component of the pose vector (see the pose-vector sketch after this list).
- A rendering of the CAD model in the scene, which requires the path to the object CAD model and the file names. These correspond to the assetroot and assets parameters, respectively, in the websight component in the inference application files.
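As referenced above, here is a small sketch of assembling the object_T_box_center pose vector; the 0.125 m offset in the second example is purely hypothetical.

```python
def object_T_box_center(rotation_wxyz=(1, 0, 0, 0), translation_xyz=(0, 0, 0)):
    """Build the 7-element pose vector expected by Detections3Viewer:
    quaternion (w, x, y, z) followed by translation (x, y, z)."""
    return list(rotation_wxyz) + list(translation_xyz)

# Default: identity rotation, zero offset (CAD origin already at the object center).
print(object_T_box_center())                          # [1, 0, 0, 0, 0, 0, 0]

# Hypothetical example: CAD origin at the base of the object, box center 0.125 m above it.
print(object_T_box_center(translation_xyz=(0, 0, 0.125)))
```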
If you notice lag between the mesh and the image rendering in Sight, increase the Change delay value slightly to synchronize the image, rendered mesh, and bounding box. To find this option, right-click the rendering window in Sight.
Another inference application, /packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_deploy_app.json, is provided without the sample log data for the industrial dolly and box, which helps in faster deployment of the package. It is therefore recommended to use this application instead of detection_pose_estimation_cnn_inference_app.json in case of package deployment. To use the app, replace detection_pose_estimation_cnn_inference_app in the commands listed above with detection_pose_estimation_cnn_inference_deploy_app to run inference in the different modes. The app includes sample images for the industrial dolly and box to test the inference in image mode.
Sample inference


Model evaluation is crucial both for improving the accuracy and robustness of the model and for estimating the capability and limitations of the model for performing a certain task. For evaluation of pose estimation models, Isaac SDK uses three primary metrics to assess the accuracy of the model for a required task:
- Median absolute translation error of x, y, and z components
- Median absolute rotation error
- Model accuracy, which represents the fraction of the evaluation data that meet the thresholds set for median translation and rotation errors.
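A minimal sketch of how these three metrics can be computed from per-sample errors is shown below; the exact accuracy criterion used by the pipeline (per-component vs. norm-based translation test), the units, and the threshold values are assumptions here.

```python
import numpy as np

def pose_metrics(t_pred, t_gt, rot_err_rad, t_thresh=0.05, r_thresh=np.deg2rad(10)):
    """Aggregate pose-evaluation metrics over N samples.
    t_pred, t_gt: (N, 3) predicted / ground-truth translations in meters.
    rot_err_rad: (N,) per-sample rotation errors in radians (e.g. from an
    arccosine-of-quaternion-dot-product error). Thresholds are illustrative."""
    abs_t_err = np.abs(t_pred - t_gt)                    # (N, 3)
    median_t_err = np.median(abs_t_err, axis=0)          # per x, y, z component
    median_r_err = np.median(rot_err_rad)
    # Accuracy: fraction of samples within both error thresholds.
    ok = (np.linalg.norm(t_pred - t_gt, axis=1) < t_thresh) & (rot_err_rad < r_thresh)
    accuracy = float(np.mean(ok))
    return median_t_err, median_r_err, accuracy
```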
The evaluation pipeline also outputs two kinds of plots:
- Box plots of the error distribution of the translation and rotation errors for data at different object distances from the camera.
- Accuracy plots for translation and rotation errors showing the model accuracy (i.e. the fraction of data meeting the set error threshold) for different thresholds.
Finally, the outlier data samples, along with the 3D bounding boxes of the predicted poses, are saved as an animation that helps in analyzing the failure cases of the model.

Diagram of Pose Estimation Model Evaluation Pipeline in Isaac SDK
The pose evaluation pipeline in Isaac SDK is set up to ingest data in Isaac log/cask format to ensure that the pipeline is agnostic to the source of evaluation data, i.e. simulation or real data. The figure above shows the three stages of the evaluation framework in Isaac SDK: evaluation data generation, which generates the image and ground truth casks, pose estimation inference along with recording, which generates the predicted pose casks, and finally the computation of evaluation metrics over this generated cask data.
Evaluation Data Collection Using Simulation
Simulation provides easy access to an unlimited amount of labeled data for evaluation. The evaluation pipeline is set up to consume data in Isaac log/cask format, so data samples from simulation along with the ground truth pose are recorded in cask format using an Isaac SDK application.
To collect evaluation data for a custom object using the sample scene in Unity, open and play the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene in the Unity Editor. This collects the evaluation data from the same training scene. You can make any number of changes to the scene settings during data collection, including changing the seed number for randomization of the camera, obstacles, etc., to collect diverse samples for evaluation.
At this point, we assume that the prefab of the object of interest is already created and is also listed in the PoseEstimationObjects Group during training. If not, follow steps 1 through 6 in the Generating data for custom objects from scene source file section above before proceeding further.
Run the simulation data recording application in Isaac SDK. The following are important input arguments needed for this mode:
- --mode: This should be set to pose for pose estimation data collection.
- --image_channel: The name of the image channel sending the RGB data in simulation.
- --intrinsics_channel: The name of the image intrinsics channel containing the pinhole model of the camera. This must be set to the same name as the image_channel set above, usually with an _intrinsics extension. For example, if the image_channel param is set to encoder_color, then the intrinsics_channel must be set to encoder_color_intrinsics.
- --load_scene: The name of the scene to be loaded using the scenario manager.
- --load_robot_prefab: The name of the object prefab to be loaded in the scene.
- --detections3viewer_box_dimensions: Dimensions of the 3D bounding box used to view the ground truth pose from sim. For example, for the industrial dolly, this can be set to “[0.9, 1.32, 0.25]”.
- --detections3viewer_object_T_box_center: 7-element pose vector of the object-to-bounding-box-center transformation, used to view the 3D bounding box in Sight. The default is set to “[1, 0, 0, 0, 0, 0, 0]”, indicating zero rotation and translation.
- --runtime: Run time of the simulation to collect the data samples. The default is set to 20 seconds. Increase this value to collect more evaluation data samples.
The evaluation data can then be collected using the following command in Isaac SDK, which uses Dolly.prefab as an example:
bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/record_sim_ground_truth:record_sim_ground_truth -- --mode pose --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --scenario_scene pose_cnn_decoder_training --scenario_robot_prefab Dolly
The command connects to the simulation engine, collects RGB images along with ground truth pose of the dolly, and records it as an Isaac Log that can be later replayed during model evaluation. Note that the evaluation pipeline currently supports only single-instance data samples to make it easier to associate the ground truth pose with the object of interest. By default, the image data cask is saved to /tmp/data/raw and contains one channel (‘color’). The ground truth data cask is saved to /tmp/data/ground_truth and contains one channel (‘bodies’). Each run of the application saves one image and ground truth data cask pair. This way, multiple sets of data can be collected for evaluation in the form of multiple logs.
Pose Estimation Evaluation
Once we have the collected evaluation data along with the ground truth for the pose, the evaluation of the model is performed in two stages.
- Pose estimation inference recording
- Evaluation metric computation
Pose Estimation Inference Recording
In this step, the application replays all the image cask files in a given input directory, runs the pose estimation inference application, and records the estimated pose as a cask log for each input image cask file.
The following are the important input arguments needed for this script:
- --inference_app: The path to the application file that replays a log, performs inference, and records results.
- --config: The config file from which to load inference parameters for the above inference app.
- --raci_metadata: Saves JSON metadata along with the cask to <app_uuid>_md.json. This is required to use the following steps in the evaluation pipeline.
- --no_raci_metadata: No metadata saved.
- --input_cask_workspace: The workspace containing the input cask files. Input image logs must be in the data/raw directory inside this workspace.
- --output_cask_workspace: The output cask files are written to data/<output_directory_name> inside this workspace. If this parameter is not set, it is assumed to be the same as the input_cask_workspace.
- --output_directory_name: Base directory name to which to write the predictions cask output. Cask files are created in <output_cask_workspace>/data/<output_directory_name>. The default is set to “predictions”.
Assuming that the input image casks are stored in /tmp/data/raw, the predicted pose casks can be generated by running the following command from the Isaac SDK directory:
bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/evaluation_inference:evaluation_inference -- --inference_config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json --inference_app packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/pose_cnn_decoder_inference_record.app.json --input_cask_workspace /tmp
With the above command, the output prediction pose casks are stored in /tmp/data/predictions. The output pose estimation casks are given the same names as the input image casks, with an additional tag so that the image cask and the corresponding pose casks can be associated in the next step for model evaluation.
Sample logs for the industrial dolly in the factory of the future scene, captured from robot camera viewpoints, are also provided to try out the evaluation pipeline with the provided industrial dolly pose detection models. To run inference and record the predictions on the sample log, run the following command:
bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/evaluation_inference:evaluation_inference -- --inference_config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json --inference_app packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/pose_cnn_decoder_inference_record.app.json --input_cask_workspace external/industrial_dolly_pose_evaluation_data --output_cask_workspace /tmp
The output prediction casks are stored in /tmp/data/predictions.
Evaluation Metrics Computation
Once the image, ground truth, and prediction data are collected, the evaluation metrics can be
computed. This step reads the full list of casks in the image cask directory and their
corresponding ground truth data and predictions, computes framewise metrics, and aggregates
the metrics. The configuration file located
in packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/evaluation_config.json
is used to set the translation and rotation error thresholds, the pinhole camera model, and the
object dimensions. The default values are set to the values for the dolly model.
The following are input arguments needed for this script:
- config: The path to the config file for evaluation. The default is set to packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/evaluation_config.json.
- --use_2d_detections: If set to true, predicted 2D detections are assumed to be one of the channels of the predicted poses cask. The detections are visualized for the outliers.
- --image_cask_dir: The path to the image cask directory. Only the image logs must be placed in this directory. The data is aggregated over all the logs in this directory. The default path is set to /tmp/data/raw.
- --gt_cask_dir: The path to the ground truth pose cask directory corresponding to the image casks in the image_cask_dir. The default path is set to /tmp/data/ground_truth.
- --predicted_cask_dir: The path to the predicted pose cask directory corresponding to the image casks in the image_cask_dir. It is expected to contain 2D detections as well if use_2d_detections is set to true. The default path is set to /tmp/data/predictions.
- --results_dir: The path to store the evaluation results. The directory is created if not already present. The default path is set to /tmp/pose_evaluation_results.
The names of the channels used for the color image, ground truth pose, and predicted pose are set in the evaluation config file with the config parameters image_channel, gt_pose_channel, and pred_pose_channel, respectively. The default channel names are color, gt_poses, and predicted_poses, respectively.
The pose evaluation from the generated cask files can be run with the following command:
bob@desktop:~/isaac/sdk$ bazel run //packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation:pose_cask_evaluation -- --image_cask_dir='/tmp/data/raw' --gt_cask_dir='/tmp/data/ground_truth' --predicted_cask_dir='/tmp/data/predictions'
To compute evaluation metrics over the provided sample logs instead, run the following command:
bob@desktop:~/isaac/sdk$ bazel run //packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation:pose_cask_evaluation -- --image_cask_dir='external/industrial_dolly_pose_evaluation_data/data/raw' --gt_cask_dir='external/industrial_dolly_pose_evaluation_data/data/ground_truth' --predicted_cask_dir='/tmp/data/predictions'
Pose metrics like the median absolute rotation and translation errors and the accuracy of the model are written to a JSON file, which also contains a KPI pass/fail metric indicating whether the errors and model accuracy meet the thresholds set in the config file. In addition to the metrics JSON file, the box plots of the error distribution and the accuracy plots, along with the outlier samples, are saved in /tmp/pose_evaluation_results.
This evaluation pipeline is provided as a tool to aid in improving model performance. If evaluation results do not meet standards, consider modifying the training data to better reflect the distribution of the data used for evaluation. The outlier results can aid in error analysis: Investigate the failure cases and outliers on which the model performs poorly, and use these cases to inspire the training scene in simulation.