# 3D Object Pose Estimation with Pose CNN Decoder

Object detection and 3D pose estimation play a crucial role in robotics. They are needed in a variety of applications such as navigation, object manipulation, and inspection. The 3D Object Pose Estimation application in the Isaac SDK provides the framework to train pose estimation for any model completely in simulation, and to test and run the inference in simulations, as well as the real world.

This application uses the Isaac SDK Pose CNN Decoder architecture for its 3D pose estimation model. This architecture is based on the work by Sundermeyer et al. and PoseCNN by Xiang et al. Given an RGB image, the algorithm first detects objects from a known set of objects with available 3D CAD models using any object detection model available in Isaac SDK, then estimates their 3D poses using the Pose CNN Decoder model. This application and network architecture aims at low-latency, real-time object detection and 3D pose estimation by leveraging GPU acceleration while achieving good accuracy.

Compared to the autoencoder-based 3D pose estimation algorithm in Isaac SDK, this application has the following attributes:

• More user-friendly: It eliminates an extra intermediate step to generate the codebook and perform codebook lookup.
• Can handle multiple object classes with a single trained model: On the other hand, the autoencoder requires a separate trained model and codebook for each object class.
• More robust against occlusions: The network learns to directly predict the object center from the bounding box parameters and input image.

## Application Overview

3D Pose Estimation Algorithm

This application consists of two modules:

• An inference module
• A Pose CNN Decoder training module

### Inference Module

The inference module consists of two sub-modules: object detection and 3D pose estimation. The object detection sub-module accepts an RGB image and determines bounding boxes for objects of interest using any object detection model available in the Isaac SDK. Currently, the sample application files support the Isaac’s DetectNet object detection inference model. The 3D pose estimation sub-module crops the images based on the bounding boxes and estimates the 3D poses of the objects in each of the cropped images. To estimate the 3D pose of an object, this module requires a trained Pose CNN Decoder neural network that accepts as inputs the cropped images and their corresponding bounding box parameters, namely the bounding box size and center.

### Pose CNN Decoder Training Module

The Pose CNN Decoder training module trains the neural network for the object or list of objects. This module uses a variant of the encoder-decoder architecture, along with the regression layers, to estimate the 3D pose of the object. It trains the decoder neural network to reconstruct the input RGB image as a segmentation image that is invariant to occlusions, thus forcing the network to learn the object shape irrespective of the occlusions and other noise, background, and lighting conditions.

The training module requires simulated color and segmentation image data based on the user-provided 3D CAD model of the object. This application also provides a simulated data generation module using the Unity Engine and Isaac Sim Unity3D, which generates simulated data for an object given its 3D CAD model. However, the training application is agnostic to the simulation platform used to generate the data and can be replaced with any other simulation engines like Isaac Sim based on NVIDIA’s Omniverse Kit.

## Pose CNN Decoder Architecture

The Pose CNN Decoder network takes two inputs: A cropped RGB image and the bounding box parameters. The network uses two encoders to extract the features of the image, as well as the bounding box, and concatenates the feature space. The network uses the concatenated feature space to decode the cropped image as a de-noised segmentation image and estimates the translation and rotation parameters of the 3D pose of the object by posing it as a regression problem.

3D Pose Estimation DNN Architecture

The decoder that outputs the segmentation image is trained to remove occlusions and fill the occluded pixels of the object. The output translation parameters are the object center x and y coordinates in the camera image frame, and depth from the camera in the real world. The four rotation parameters are the quaternions of the rotation of the object in the camera frame. The segmentation mask is used again in the pose refinement step to crop out the relevant depth-image point cloud, and the 3D pose estimate is used as a starting point to refine the pose.

## Pose CNN Decoder Training

Acquiring and labeling the real-world data required to train models for 3D pose estimation is extremely challenging, time consuming, and prone to error. The problem is especially aggravated for robotics: robots are required to perform pose estimation in a wide variety of specialized scenarios, so collecting large amounts of accurate and labeled real-world data for every such scenario is prohibitive, which in turn slows down the rate of adoption of these models.

This application uses simulated data, training the network for 3D pose estimation entirely in simulation, to bridge the sim-to-real gap through features available in simulators, such as domain randomization and procedural generation techniques.

The Pose CNN Decoder training stage involves two broad steps:

1. Generating simulated data
2. Running the Pose CNN Decoder-training pipeline

### Generating Simulated Data

Training the Pose CNN Decoder model requires three different types of data:

• A rendered RGB image
• A de-noised version of the segmentation image (i.e., a version of the segmentation image that is invariant to occlusions)
• 3D poses of camera and the object of interest to compute the 3D pose of the object with respect to camera

#### Generating data for sample objects from scene binary: Industrial Dolly and Industrial Box

A sample “factory of the future” scene binary, which generates the above data for two objects, a dolly and box, is available in the isaac_sim_unity3d repository in builds. Run the binary with the following command within the Isaac Sim Unity3D release build to start sending the samples for dolly pose estimation training:

Copy
Copied!

bob@desktop:~/isaac_sim_unity3d$./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 8  This command runs the Factory01 Unity scene and scenario 8, the dolly pose estimation training scenario. The training data in this scenario is collected with view points of the camera attached to the robot. Use the following command to run the binary within the Isaac Sim Unity3D release build to start sending the samples for box pose estimation training: Copy Copied!  bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 10


Isaac Sim Unity3D communicates with the Isaac SDK via TCP sockets. The isaac/sdk/packages/navsim/apps/navsim.app.json application sets up a connection to publish the simulation data to a user-defined port using a TcpPublisher node. The data is received by the training application at packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json using TcpSubscriber.

#### Generating data for custom objects from scene source file

A sample scene to generate training data for custom objects is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation. Follow the instructions on the Isaac Sim Unity3D page to open the pose_cnn_decoder_training.unity scene in Unity Editor. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.

The scene is configured to stream all three required labels for each frame to Isaac SDK using TCP. The default object of interest in the scene is a dolly. Follow these steps to set up the scene for a new object:

Note

If the object of interest is already in the group in packages/Nvidia/Samples/ObjectPoseEstimation/PoseEstimationObjectsGroup, you can skip steps 1-3.

1. Load Unity, upload the 3D CAD model and textures of the object, and create a prefab.
2. In the prefab, click Add Component and add the LabelSetter script. Enter a Label Name of your choice. This script creates segmentation images and computes bounding boxes.
3. Add the prefab to the PoseEstimationObjectsGroup in packages/Nvidia/Samples/ObjectPoseEstimation. To accomplish this, either increase the element size by 1 or swap an existing prefab with the prefab of your object by dragging and dropping the prefab into the list.
4. Drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene into the Hierarchy panel and remove any existing scenes in the panel.
5. Provide the label name set to the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. The label name is used in Isaac Sim to render the segmentation image of the object in a scene. The expression name in ClassLabelManager can be any name of your choice and is set by default to “Target”.
6. Save the scene and play it.

At this point, you should see the camera positions, objects, lighting, and background substances for walls and floors changing every frame. You can adjust the randomization frequency of the backgrounds, camera positions, etc. as needed in the procedural_camera GameObject and its child CameraGroup, as well as procedural GameObjects. It is recommended to adjust the UniformRange values for Position and Target configuration in CameraGroup child Object in procedural camera based on the size of the object. The goal is to ensure that the data samples comprise of all the possible view points and distance ranges from camera. Adding non-zero uniform range to Target parameter ensures that the object is not always rendered in the center of the image frame, which is important to get diverse training samples for translation.

Note

The camera model is set by default to use vertical FOV of 42 degrees and image resolution of 1280 (Width) x 720 (Height). If you wish to train for some other camera model, these values can be changed in Field of View, and Width and Height in EncoderColorCamera and DecoderSegmentationCamera that are child objects to CameraGroup.

A basic scene setup for pose estimation training is also available in Isaac Sim using OV Kit to get started using sample cube. Details on the scene setup are provided in the documentation here. The scene is scripted to send the poses, encoder, and decoder images needed for a sample cube through TCP using the same channel names as Unity scene setup. Hence, the training pipeline in Isaac SDK can be re-used for the Omniverse based training scene as well. However, it is up to the users to add all the required domain randomizations like background, lighting, texture, color, and set up the camera pose randomizations needed for the training to ensure that the data samples cover the requirements of the inference setup. More details on domain randomization in Isaac Sim can be found here.

### Running the Pose CNN Decoder Training Pipeline

Pose CNN Decoder Training Application Graph

Isaac SDK offers two modes for training the pose estimation model: online and offline. During online training, the data is generated in simulation on the fly and ingested into the training pipeline. The main advantage of this model of training is the availability of infinite amount of data to train the networks. However, running simulation for data generation and training the network simultaneously is a computation-intensive task and can slow down the training in the absence of availability of more than one GPU. In such cases, it is useful to perform training offline, where the users first generate and save a finite amount of labeled data for training and then perform training separately. The figure above illustrates the training application for online training and the codelet connections. In case of offline training, the SampleAccumulator node at the end is replaced with the data generation node.

#### Storing generated data for offline training and validation

Before offline training starts, a fixed amount of data is generated and saved by running the simulation scene and the provided data generation application (packages/object_pose_estimation/apps/pose_cnn_decoder/training/generate_training_validation_data.app.json) in the Isaac SDK repository. This application also generates validation data to assess the performance of the trained network on an unseen fixed dataset as the training progresses. You can skip this step if you want to do online training without validation during training.

Before running the application, set the number of required training and validation data samples in the GeneratePoseCnnDecoderData component with the respective config parameters num_training_samples and num_validation_samples in the application packages/object_pose_estimation/apps/pose_cnn_decoder/training/generate_training_validation_data.app.json. In case of online training, set the num_training_samples to 0. Also, set the scene parameter to the scene name pose_cnn_decoder_training when using the custom Unity scene and the prefab name of the object to the robot_prefab parameter in the data.simulation.scenario_manager component. To use factory of the future scene instead for industrial dolly and box, set the scene name to Factory01. To generate the data, run the application using the following command inside the Isaac SDK repository

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:generate_training_validation_data  The training and validation data are saved by default in /tmp/pose_data/training and /tmp/pose_data/validation, respectively. #### Running training application For online training, the simulation scene must be still running in parallel. The training configuration can be set in the packages/object_pose_estimation/apps/pose_cnn_decoder/training/training_config.json file. The training mode is configured by setting the "online_training" config parameter to “true” in case of online training and “false” for offline training. In case of online training, set the config parameters scene to pose_cnn_decoder_training and robot_prefab to prefab name of the object to to be spawned in the simulation for custom Unity scene. The default scene name is “Factory01” that corresponds to the factory of the future scene. To train Industrial Dolly online using the provided factory of the future scene, set decoding_full_object parameter to false in the training config file. This enables the decoder to train to decode only the frame of the dolly without the wheels so that the pose is more robust to wheel randomization for this use-case. In case of offline training, the path to saved training data is set in training_data_dir. The default directory is /tmp/pose_data/training, which is the directory generated for storing validation data during the offline data generation step. The gpu to run the training can be set using config parameter gpu_visible_device and is set by default to “0”. If you have multiple GPUs on your system, it is recommended to run the training on a gpu different from the one running the simulation in case of online training to increase the training speed. The logs and checkpoints are stored in /tmp/autoenc_logs/ckpts by default, but this path can be changed using the train_logdir config option in the configuration file. By default, the training application runs from training iteration 0 until the iteration number equals the training_step value given in the training_config.json file. To start the training from an intermediate checkpoint, set the checkpoint config option to the corresponding checkpoint filename, which is by default in the form model-<checkpoint_number>. The script extracts the number, appending “model-” as the checkpoint number, and restarts from that iteration number. For example, if you want to restart from iteration number 10000, set the checkpoint to model-10000. The step number will then start from 10000 and end at the value of the training_steps config option, which is set to 50000 by default. The default learning rate of the training is set to 2e-4 in the config option learning_rate and can be reduced to 1e-4 after first 20000 iterations or so to facilitate the training and reduce the errors further. If you need to validate the model during training on a fixed simulation data, set the validation config option to “true” and specify the path to saved validation data in validation_data_dir. This is set to /tmp/pose_data/validation by default, which is the directory generated for storing validation data during the offline data generation step. During the validation step, the pose estimation error metrics like median error in translation and rotation are computed and stored, along with the corresponding iteration number in the logs directory. The number of validation data samples and the frequency of the validation are set in config options num_validation_data and validation_every respectively. Computing the error metrics from the model outputs require information about the pinhole model of the actual camera and the dimensions of the image before cropping–these values are provided in the config options pinhole_focal, pinhole_center and image_dim. The default options for these parameters are set to the training scene setup with camera vertical FOV of 42 degrees and image resolution of 720 x 1280. The user also needs to specify the rotation and translation error thresholds to compute the accuracy metric (fraction of validation data within the set error thresholds) using the config options rotation_err_threshold (default is 10 degrees) and translation_err_threshold, respectively. The error metrics are computed over the samples whose ground truth depth (distance from the camera) is within the range of interest specified by the depth_roi config option. The default is set to [1.7, 3.5], which means the mean error metrics of the pose are computed only across the samples in the validation data whose distance of the object from the camera is in range 1.7 m to 3.5 m. Setting it to “None” computes the metrics across all the validation data samples. The parameters after the validation config options are mainly related to the pose network architecture and can be left with the default values, unless you need to experiment with the network architecture values, such as the number of filters in the encoder, decoder architectures, filter sizes, latent vector, and fully connected layer sizes that change the size of the network. There are also options to set the translation and rotation loss functions using the config params translation_loss and rotation_loss. The available options for translation loss are “L1”, “L2”, which compute the L1 and L2 losses of translation values. The following are available options for rotation loss: • “AcosLoss”: The default value, where the error function is the arccosine of the quaternion dot product of the prediction and estimated labels, which is the rotation error in radians. • “PLoss”: This loss function computes the L2 error of the distance of the corners of unit cube after rotating them with predicted and ground truth rotations. Lastly, there are two other important config options corresponding to the axis of symmetry of the object: symmetry_axes, which is set to the index of the rotation axis of symmetry with 0, 1 and 2 representing x, y, and z axes respectively, and num_rotation_symmetry, which is set to the number of axis of symmetry around the “symmetry_axes”. These config parameters are important for computing the rotation loss function by considering the symmetry of the object. Now, run the training application with the following command: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:pose_estimation_cnn_training


To view the training progress on TensorBoard, run the following command in the terminal:

Copy
Copied!

tensorboard --logdir=/tmp/autoenc_logs/ckpts


The visualization interface can be accessed at http://localhost:6006. The images received from simulation engine, along with the bounding boxes, can be visualized in Sight at http://localhost:3000.

Note

If the decoder output in TensorBoard looks completely black or completely white within 10 iterations, change the add_noise_to_image setting to “true” in the training_config.json file. This adds noise to the images, which prevents the training variables from saturating. You should now start seeing a combination of black and white pixels instead of a completely black/white image. You can reset the option to “false” after a few hundred iterations to train on cleaner images if desired.

Note

Depending on the type of object, the model may not need the default of 50000 training steps. You can end training sooner if the reconstructed decoder image, which is the output of the model, is clear over multiple training samples. This is a good check to determine the quality of the model. You can view the reconstructed image in the TensorBoard Images section. Also, you can evaluate the model on sample simulation data to make sure the error metrics are within acceptable limits for the use case. Instructions for performing pose estimation evaluation are detailed in the final sections of this chapter.

The frozen TensorFlow model is only generated at the end of training (i.e. when the number of iterations equals the value of the training_steps config option). So, if you want to end training before the given steps are finished, you can generate the frozen model from this checkpoint by running the following command:

Copy
Copied!



### Running Inference

Note

It is advised to use the same camera for inference as well as the training. The inference application handles changes in image resolution and focal length as long as the aspect ratio of the inference image is the same as the training data. The default training camera model uses a vertical FOV of 42 degrees. However, if the camera model parameters for the inference camera differ widely from the training camera, we recommend training the model with inference camera parameters and tuning the camera randomization parameters in the training scene such that the data samples cover the object of interest spawned across the complete image frame. If the FOV or the image dimensions of the training data are changed, these parameters must be set in detection_pose_estimation.object_pose_estimation.pose_estimation component in the inference configuration file used.

The end-to-end inference for the estimating the pose of the object using Pose CNN Decoder architecture requires the inference of object detection model followed by pose estimation. Before running the end-to-end inference with both the models, a sample inference application in simulation is provided with ground truth detections so that the users can first test the performance of the pose estimation model trained using the instructions provided in the sections above. The same training scene can be used to run this inference application. To do this, first play the scene used for training (for example, the factory_of_the_future scene with scenario 11 for the dolly). You can also use the custom scene packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity in Unity for your object of interest. In Isaac SDK, set the class label name in object_pose_estimation.detection_filter component with config parameter allowlist_labels in sdk/packages/object_pose_estimation/apps/pose_cnn_decoder/pose_estimation_inference_sim_groundtruth_detection.app.json. Then run the pose estimation inference application from Isaac SDK repository using the following command:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:pose_estimation_inference_sim_groundtruth_detection  Visualize the inference in Sight at localhost:3000. Pose Estimation Inference Application Graph The figure above illustrates a full end-to-end sample inference application. For the full end-to-end inference of the pose estimation framework, there are four available modes of inference based on data collection for object pose estimation: simulation, camera feed, log replay, and image feed. These modes can be run via Python script. As with training, the package provides sample inference for two objects: a dolly for the navigation usecase and a box for the manipulation use case. In case of inference for custom object, modify the configuration parameters in the default inference configuration file /packages/object_pose_estimation/apps/pose_cnn_decoder/ detection_pose_estimation_cnn_inference.config.json and use this to run the different inference modes in place of provided sample config files for dolly and box. Parameters like path to generated uff file from pose estimation training, parameters for object detection model like path to trained model, confidence threshold, etc. are all set in this config file. Note Current release of the Isaac SDK uses the Jetpack version of TensorRT 7.1.3, which does not support dynamic batch size for UFF models. Hence, currently, the pose estimation inference is set up to infer the pose of the single object instance that overlaps the most with the user-defined center portion of the image in case of multiple instances of the object in the image. The extent of the size of this middle portion of the image can be set with the "focus_factor" config parameter in "detection_pose_estimation.object_pose_estimation.detection_filter". Then, the focus region to check for the extent of overlap is the pixel column range [(1 - focus_factor)/2, (1 + focus_factor)/2] * image_columns. The inference Python script and the config files for dolly and box are located in /packages/object_pose_estimation/apps/pose_cnn_decoder. Commands to run the four inference modes for both industrial dolly and box are provided below: 1. Inference on simulation data is a good first test to check the accuracy of the model. To run the inference application on simulation data for both detection and pose estimation, first play the pose estimation training scene binary, then run the inference command within Isaac SDK. The following are input arguments needed for this mode: • --mode: This should be set to sim for this mode. • --image_channel: The name of the image channel sending the color image from simulation. The default name is color. • --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively. • --scenario_scene and --scenario_robot_prefab: The name of the simulation scene to load and the prefab name of the object interest to spawn respectively, if the object needs to be spawned from Isaac SDK. The default option is set to None and is valid if the object of interest is already set up in the scene like the factory_of_the_future scene. For dolly inference, run scenario 12 using the following command from inside the isaac_sim_unity3d release folder: Copy Copied!  bob@desktop:~/isaac/sdk$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 12


Then run industrial dolly inference on simulated data using this command:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel color --intrinsics_channel color_intrinsics --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json  For box inference, run inference scenario 11 using the following command from inside the isaac_sim_unity3d release folder: Copy Copied!  bob@desktop:~/isaac/sdk$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 11


Then run box inference on simulated data using this command:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json  For inference on a custom object using the sample scene in Unity, open and play the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene in Unity Editor. You can increase the Frame Interval config option number for both procedural and procedural camera Game Objects so that the data is not randomized every frame. Then run the inference on simulated data, for example for Dolly.prefab using this command: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --scenario_scene pose_cnn_decoder_training --scenario_robot_prefab Dolly --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json


2. To run inference on a real-time camera feed from a RealSense camera, the following input arguments are required:

• --mode: This should be set to realsense for this mode.
• --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
• --fps: The frame rate of the RGB channel. Currently, only two frame rates, 15 and 30, are supported for the RealSense camera. The default value is 15.

Run dolly inference on a camera feed using the following command inside Isaac SDK:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json  Run box inference on a camera feed using the following command inside Isaac SDK: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json


3. To run the inference application using recorded logs, the following input arguments are required:

• --mode: This should be set to cask for this mode.
• --cask_directory: The path to the sample replay log. The default path is set to the dolly replay log.
• --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.

Run dolly inference on sample logs using the following command inside Isaac SDK:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --rows 480 --cols 848 --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json  Run box inference on sample logs using the following command inside Isaac SDK: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --cask_directory external/sortbot_pose_estimation_data/klt_logs


4. To run the inference application using sample images, the following input arguments are required

• --mode: This should be set to image for this mode.
• --image_directory: The path to the sample image. The default is set to the path for the dolly sample image.
• --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively.
• --optical_center_rows and --optical_center_cols: The optical center specified as a row and column in the input image. The default values are 360 and 640, respectively.
• --focal_length: The focal length of the camera used to capture the image in pixels. The default value is 925.74.

The last three parameters are needed to construct the pinhole model of the camera.

Run dolly inference on a sample image using the following command inside Isaac SDK:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --rows 480 --cols 848 --optical_center_rows 240 --optical_center_cols 424 --focal_length 614.573 --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json  Run box inference on a sample image using the following command inside Isaac SDK: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --image_directory external/industrial_dolly_sortbot_sample_images/klt_sample_images/klt_box.png


5. To run the inference application using ROS input topics, the following input ROS topic names are used by default

Note

ROS must be installed before running the test. See the ROSBridge section for more info.

• /depth
• /rgb
• /camera_info

Once the above ROS topics are published, run dolly inference on them using the following command inside Isaac SDK:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_ros_bridge_app  The inference applications take an RGB image from different input sources and outputs the estimated 3D poses of the detected objects along with the segmentation mask of the cropped image. Sample data of the dolly and box objects is provided, allowing you to run the last two inference apps using images and logs. You can visualize the estimated pose in Sight at http://localhost:3000. There are two types of visualization for the estimated pose: • A 3D bounding box, which requires specification of both the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. Configure these parameters in the viewers/Detections3Viewer component in the inference application files. The parameter box_dimensions must be set to the dimensions of the object of interest. Parameter object_T_box_center must be set to the pose of the object center. It is a pose vector of size 7, the first four representing the rotation quaternion in order (w, x, y, z) and the last three represent the x, y, z components of the translation vector. Hence, the default vector [1, 0, 0, 0, 0, 0, 0] represents zero rotation and translation pose and can be left to this setting if the origin of the CAD model / OBJ file of the object is at its center. If not, appropriate offset to the center must be added in the translation component of the pose vector. • A rendering of the CAD model in the scene, which requires the path to the object CAD model and file names. These correspond to the assetroot and assets parameters respectively in the websight component in the inference application files. If you notice lag in mesh versus image rendering in Sight, increase the Change delay value slightly to synchronize the image, rendered mesh, and bounding box. To find this option, right-click the rendering window in Sight. Another inference application /packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_deploy_app.json is provided without including the sample log data for industrial dolly and box that helps in faster deployment of the package. So, it is recommended to use this application instead of detection_pose_estimation_cnn_inference_app.json in case of package deployment. To use the app, replace detection_pose_estimation_cnn_inference_app in the commands listed above with detection_pose_estimation_cnn_inference_deploy_app to run inference in different modes. The app includes sample images for industrial dolly and box to test the inference in image mode. #### Sample inference ## 3D Object Pose Estimation Evaluation Model evaluation is crucial for both improving the accuracy and robustness of the model and estimating the capability and limitations of the model for performing certain task. For evaluation of pose estimation models, three primary metrics are used in Isaac SDK for assessing the accuracy of the model to perform a required task: 1. Median absolute translation error of x, y, and z components 2. Median absolute rotation error 3. Model accuracy, which represents the fraction of the evaluation data that meet the thresholds set for median translation and rotation errors. The evaluation pipeline also outputs two kinds of plots: 1. Box plots of the error distribution of the translation and rotation errors for data at different object distances from the camera. 2. Accuracy plots for translation and rotation errors showing the model accuracy (i.e. the fraction of data meeting the set error threshold) for different thresholds. Finally, the outlier data samples along with the 3D bounding boxes of the predicted poses are saved as animation that helps in analyzing the failure cases of the model. Diagram of Pose Estimation Model Evaluation Pipeline in Isaac SDK The pose evaluation pipeline in Isaac SDK is set up to ingest data in Isaac log/cask format to ensure that the pipeline is agnostic to the source of evaluation data, i.e. simulation or real data. The figure above shows the three stages of the evaluation framework in Isaac SDK: evaluation data generation, which generates the image and ground truth casks, pose estimation inference along with recording, which generates the predicted pose casks, and finally the computation of evaluation metrics over this generated cask data. ### Evaluation Data Collection Using Simulation Simulation provides easy access to an unlimited amount of labeled data for evaluation. The evaluation pipeline is set up to consume data in Isaac log/cask format, so data samples from simulation along with the ground truth pose are recorded in cask format using an Isaac SDK application. For inference on a custom object using the sample scene in Unity, open and play the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene in Unity Editor. This collects the evaluation data from the same training scene. You can make any number of changes to the scene settings, including changing the seed number for randomization of camera, obstacles, etc. during inference to collect diverse samples for training. Note At this point, we assume that the prefab of the object of interest is already created and is also listed in the PoseEstimationObjects Group during training. If not, follow steps 1 through 6 in the Generating data for custom objects from scene source file section above before proceeding further. Run the simulation data recording application in Isaac SDK. The following are important input arguments needed for this mode: • --mode: This should be set to pose for pose estimation data collection. • --image_channel: The name of the image channel sending the RGB data in simulation. • --intrinsics_channel: The name of the image instrinsics channel containing the pinhole model of the camera. This must be set to the same name as the image_channel set above, usually with an _intrinsics extension. For example, if the image_channel param is set to encoder_color, then the intrinsics_channel must be set to encoder_color_intrinsics. • --load_scene: The name of the scene to be loaded using scenario manager. • --load_robot_prefab: The name of the object prefab to be loaded in the scene. • --detections3viewer_box_dimensions: Dimensions of the 3D bounding box to view the groundtruth pose from sim. For example, for industrial dolly, this can be set to “[0.9, 1.32, 0.25]”. • --detections3viewer_object_T_box_center: 7 element pose vector of the object to bounding box center transformation to view the 3D bounding box in Sight. The default is set to “[1, 0, 0, 0, 0, 0, 0]”, indicating zero rotation and translation. • --runtime: Run time of the simulation to collect the data samples. The default is set to 20 seconds. Increase this value to collect more evaluation data samples. Then the evaluation data can be collected using the following command in Isaac SDK, which uses Dolly.prefab as an example: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/record_sim_ground_truth:record_sim_ground_truth -- --mode pose --image_channel encoder_color --intrinsics_channel encoder_color_intrinsics --scenario_scene pose_cnn_decoder_training --scenario_robot_prefab Dolly


The command connects to the simulation engine, collects RGB images along with ground truth pose of the dolly, and records it as an Isaac Log that can be later replayed during model evaluation. Note that the evaluation pipeline currently supports only single-instance data samples to make it easier to associate the ground truth pose with the object of interest. By default, the image data cask is saved to /tmp/data/raw and contains one channel (‘color’). The ground truth data cask is saved to /tmp/data/ground_truth and contains one channel (‘bodies’). Each run of the application saves one image and ground truth data cask pair. This way, multiple sets of data can be collected for evaluation in the form of multiple logs.

### Pose Estimation Evaluation

Once we have the collected evaluation data along with the ground truth for the pose, the evaluation of the model is performed in two stages.

1. Pose estimation inference recording
2. Evaluation metric computation

#### Pose Estimation Inference Recording

In this step, the application replays all the image cask files in a given input directory, runs the pose estimation inference application, and records the estimated pose as a cask log for each input image cask file.

The following are the important input arguments needed for this script:

• --inference_app: The path to the application file that replays a log, performs inference, and records results.
• --config: The config file to load for inference parameters for the above inference app.
• --raci_metadata: Saves JSON metadata along with cask to <app_uuid>_md.json. This is required to use the following steps in the evaluation pipeline.
• --no_raci_metadata: No metadata saved.
• --input_cask_workspace: The workspace containing the input cask files. Input image logs must be in the data/raw directory inside this workspace.
• --output_cask_workspace: The output cask files are written in data/<output_directory_name> inside this workspace. If this parameter is not set, it is assumed to be same as the input_cask_workspace.
• --output_directory_name: Base directory name to which to write the predictions cask output. Cask files are created in <output_cask_workspace>/data/<output_directory_name>. The default is set to “predictions”.

Assuming that the input image casks are stored in /tmp/data/raw, the predicted pose casks can be generated by running the following command from the Isaac SDK directory:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run packages/ml/apps/evaluation_inference:evaluation_inference -- --inference_config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json --inference_app packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/pose_cnn_decoder_inference_record.app.json --input_cask_workspace /tmp  With the above command, the output prediction pose casks are stored in the path /tmp/data/predictions. The output pose estimation casks are named with the same name as input image casks, with an additional tag so that the image cask and the corresponding pose casks can be associated in the next step for model evaluation. Sample logs for industrial dolly in factory of future scene from robot camera view points are also provided to try out the evaluation pipeline with the provided industrial dolly pose detection models. To run inference and record the predictions on the sample log, run the following command: Copy Copied!  bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/evaluation_inference:evaluation_inference -- --inference_config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference.config.json --inference_app packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/pose_cnn_decoder_inference_record.app.json --input_cask_workspace external/industrial_dolly_pose_evaluation_data --output_cask_workspace /tmp


The output prediction casks are stored in /tmp/data/predictions.

#### Evaluation Metrics Computation

Once the image, ground truth, and prediction data are collected, the evaluation metrics can be computed. This step reads the full list of casks in the image cask directory and their corresponding ground truth data and predictions, computes framewise metrics, and aggregates the metrics. The configuration file located in packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/evaluation_config.json is used to set the translation and rotation error thresholds, the pinhole camera model, and the object dimensions. The default values are set to the values for the dolly model.

The following are input arguments needed for this script:

• config: The path to the config file for evaluation. The default is set to packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation/evaluation_config.json.
• --use_2d_detections: If set to true, predicted 2D detections are assumed to be one of the channels of the predicted poses cask. The detections are visualized for the outliers.
• --image_cask_dir: The path to the image cask directory. Only the image logs must be placed in this directory. The data is aggregated over all the logs in this directory. The default path is set to /tmp/data/raw.
• --gt_cask_dir: The path to the ground truth pose cask directory corresponding to the image casks in the image_cask_dir. The default path is set to /tmp/data/ground_truth.
• --predicted_cask_dir: The path to the predicted pose cask directory corresponding to the image casks in the image_cask_dir. It is expected to contain 2D detections as well if use_2d_detections is set to true. The default path is set to /tmp/data/predictions.
• --results_dir: The path to store the evaluation results. The directory is created if not already present. The default path is set to /tmp/pose_evaluation_results.

The name of the channels used for the image cask for color image, ground truth pose, and predicted pose are set in the evaluation config file with the config parameters image_channel, gt_pose_channel, and pred_pose_channel, respectively. The default channel names are color, gt_poses, and predicted_poses, respectively.

The pose evaluation from the generated cask files can be run with the following command:

Copy
Copied!

bob@desktop:~/isaac/sdk$bazel run //packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation:pose_cask_evaluation -- --image_cask_dir='/tmp/data/raw' --gt_cask_dir='/tmp/data/ground_truth' --predicted_cask_dir='/tmp/data/predictions'  To compute evaluation metrics over the provided sample logs instead, run the following command: Copy Copied!  bob@desktop:~/isaac/sdk$  bazel run //packages/object_pose_estimation/apps/pose_cnn_decoder/evaluation:pose_cask_evaluation -- --image_cask_dir='external/industrial_dolly_pose_evaluation_data/data/raw' --gt_cask_dir='external/industrial_dolly_pose_evaluation_data/data/ground_truth' --predicted_cask_dir='/tmp/data/predictions'


Pose metrics like median absolute rotation, translation errors, and the accuracy of the model are written in JSON file. It also has a KPI pass/fail metric that indicates if the errors and model accuracy met the set thresholds in the config file. In addition to the metrics JSON file, the box plots of the error distribution and the accuracy plots along with the outlier samples are saved in /tmp/pose_evaluation_results.

This evaluation pipeline is provided as a tool to aid in improving model performance. If evaluation results do not meet standards, consider modifying the training data to better reflect the distribution of the data used for evaluation. The outlier results can aid in error analysis: Investigate the failure cases and outliers on which the model performs poorly, and use these cases to inspire the training scene in simulation.

© Copyright 2018-2020, NVIDIA Corporation. Last updated on Feb 1, 2023.