# 3D Object Pose Estimation with Pose CNN Decoder¶

Object detection and 3D pose estimation play a crucial role in robotics. They are needed in a variety of applications such as navigation, object manipulation, and inspection. The 3D Object Pose Estimation application in Isaac SDK provides the framework to train pose estimation for any model completely in simulation, and to test and run the inference in simulation as well as the real world.

This application uses the Isaac SDK Pose CNN Decoder architecture for its 3D pose estimation model. This architecture is based on the work by Sundermeyer et al. and PoseCNN by Xiang et al. Given an RGB image, the algorithm first detects objects from a known set of objects with available 3D CAD models using any object detection model available in Isaac SDK, then estimates their 3D poses using the Pose CNN Decoder model. This application and network architecture aims at low-latency, real-time object detection and 3D pose estimation by leveraging GPU acceleration while achieving good accuracy.

Compared to the autoencoder-based 3D pose estimation algorithm in Isaac SDK, this application has the following attributes:

• More user-friendly: It eliminates an extra intermediate step to generate the codebook and perform codebook lookup.

• Can handle multiple object classes with a single trained model: On the other hand, the autoencoder requires a separate trained model and codebook for each object class.

• More robust against occlusions: The network learns to directly predict the object center from the bounding box parameters and input image.

## Application Overview¶

3D Pose Estimation Algorithm

This application consists of two modules:

• An inference module

• A Pose CNN Decoder training module

### Inference Module¶

The inference module consists of two sub-modules: object detection and 3D pose estimation. The object detection sub-module accepts an RGB image and determines bounding boxes for objects of interest using any object detection model available in Isaac SDK. Currently the sample application files support the Isaac’s DetectNet object detection inference model. The 3D pose estimation sub-module crops the images based on the bounding boxes and estimates the 3D poses of the objects in each of the cropped images. To estimate the 3D pose of an object, this module requires a trained Pose CNN Decoder neural network that accepts as inputs the cropped images and their corresponding bounding box parameters, namely the bounding box size and center.

### Pose CNN Decoder Training Module¶

The Pose CNN Decoder training module trains the neural network for the object or list of objects. This module uses a variant of the encoder-decoder architecture, along with the regression layers, to estimate the 3D pose of the object. It trains the decoder neural network to reconstruct the input RGB image as a segmentation image that is invariant to occlusions, thus forcing the network to learn the object shape irrespective of the occlusions and other noise like background and lighting conditions.

The training module requires simulated color and segmentation image data based on the user-provided 3D CAD model of the object. This application also provides a simulated data generation module using the Unity Engine and IsaacSim Unity3D, which generates simulated data for an object given its 3D CAD model.

## Pose CNN Decoder Architecture¶

The Pose CNN Decoder network takes two inputs: A cropped RGB image and the bounding box parameters. The network uses two encoders to extract the features of the image, as well as the bounding box, and concatenates the feature space. The network uses the concatenated feature space to decode the cropped image as a de-noised segmentation image and estimates the translation and rotation parameters of the 3D pose of the object by posing it as a regression problem.

The decoder that outputs the segmentation image is trained to remove occlusions and fill the occluded pixels of the object. The output translation parameters are the object center x and y coordinates in the camera image frame and depth from the camera in the real world. The four rotation parameters are the quaternions of the rotation of the object in the camera frame. The segmentation mask is used again in the pose refinement step to crop out the relevant depth-image point cloud, and the 3D pose estimate is used as a starting point to refine the pose.

## Pose CNN Decoder Training¶

Acquiring and labeling the real-world data required to train models for 3D pose estimation is extremely challenging, time consuming, and prone to error. The problem is especially aggravated for robotics: robots are required to perform pose estimation in a wide variety of specialized scenarios, so collecting large amounts of accurate and labeled real-world data for every such scenario is prohibitive, which in turn slows down the rate of adoption of these models.

This application uses simulated data, training the network for 3D pose estimation entirely in simulation, to bridge the sim-to-real gap through features available in simulators such as domain randomization and procedural generation techniques.

The Pose CNN Decoder training stage involves two broad steps:

1. Generating simulated data

2. Running the Pose CNN Decoder-training pipeline

### Generating Simulated Data¶

Training the autoencoder model requires four different types of data:

• A rendered RGB image

• A corresponding segmentation image

• A de-noised version of the segmentation image (i.e. a version of the segmentation image that is invariant to occlusions)

#### Generating data for sample objects from scene binary: Industrial Dolly and Industrial Box¶

A sample “factory of the future” scene binary, which generates the above data for two objects, a dolly and box, is available in the isaac_sim_unity3d repository in builds. Run the binary with the following command within the IsaacSim Unity3D release build to start sending the samples for dolly pose estimation training:

bob@desktop:~/isaac_sim_unity3d$./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 8  This command runs the Factory01 Unity scene and scenario 8, which is the dolly pose estimation training scenario. Use the following command to run the binary within the IsaacSim Unity3D release build to start sending the samples for box pose estimation training: bob@desktop:~/isaac_sim_unity3d$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 10


IsaacSim Unity3D communicates with the Isaac SDK via TCP sockets. The isaac/packages/navsim/apps/navsim.app.json application sets up a connection to publish the simulation data to a user-defined port using a TcpPublisher node. The data is received by the training application at packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json using TcpSubscriber.

#### Generating data for custom objects from scene source file¶

A sample scene to generate training data for custom objects is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation/. Follow the instructions on the IsaacSim Unity3D page to open the pose_cnn_decoder_training.unity scene in Unity Editor. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.

The scene is configured to stream all four required labels for each frame to Isaac SDK using TCP. The default object of interest in the scene is a dolly. Follow these steps to set up the scene for a new object:

Note

If the object of interest is already in the group in packages/Nvidia/Samples/ObjectPoseEstimation/PoseEstimationObjectsGroup, you can skip steps 1-3.

1. Load Unity, upload the 3D CAD model and textures of the object, and create a prefab.

2. In the prefab, click Add Component and add the LabelSetter script. Enter a Label Name of your choice. This script creates segmentation images and computes bounding boxes.

3. Add the prefab to the PoseEstimationObjectsGroup in packages/Nvidia/Samples/ObjectPoseEstimation. To accomplish this, either increase the element size by 1 or swap an existing prefab with the prefab of your object by dragging and dropping the prefab into the list.

4. Drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_cnn_decoder_training.unity scene into the Hierarchy panel and remove any existing scenes in the panel.

5. Provide the label name set to the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. The label name is used in IsaacSim to render the segmentation image of the object in a scene.

6. Set a Name for the Dolly element in the Class Labels section. This name is used as the label for the object in Isaac SDK.

7. Save the scene and play it.

8. In Isaac SDK, set the :code:’scene’ parameter in the data.simulation.scenario_manager component to the scene name pose_cnn_decoder_training.

9. In Isaac SDK, add the prefab name of the object to the robot_prefab parameter in the data.simulation.scenario_manager component and also to the class_names parameter in the data.detection3_convertor component in packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json.

Note

This is the name of the object prefab and not the Class Label name.

For example, if you are using the “TrashCan02.prefab” Unity object with the LabelSetter name in the prefab “trashcan”, then the robot_prefab parameter in the Isaac app should be TrashCan02 and the expression name in ClassLabelManager in Unity can be any name of your choice.

Adding the prefab name to scenario manager in the Isaac app will enable spawning of the object in the scene when you run the training app as described in the sections below. Adding the prefab name to data.detection3_convertor is used to filter and encode the ground truth poses used to train the network.

At this point, you should see the camera positions, objects, lighting, and background substances for walls and floors changing every frame. You can adjust the randomization frequency of the backgrounds, camera positions, etc. as needed in the procedural camera GameObject and its child CameraGroup GameObject.

### Running the Pose CNN Decoder training pipeline¶

Run the training application with the following command:

bob@desktop:~/isaac$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:pose_estimation_cnn_training  The training configuration can be set in the packages/object_pose_estimation/apps/pose_cnn_decoder/training/training_config.json file. The logs and checkpoints are stored in /tmp/autoenc_logs/ckpts by default, but this path can be changed using the train_logdir config option in the training_config.json file. By default, the training application runs from training iteration 0 until the iteration number equals the training_step value given in the training_config.json file. To start the training from an intermediate checkpoint, set the checkpoint config option to the corresponding checkpoint filename, which is by default in the form model-<checkpoint_number>. The script extracts the number, appending “model-” as the checkpoint number, and restarts from that iteration number. For example, if you want to restart from iteration number 10000, set the checkpoint to model-10000: The step number will then start from 10000 and end at the value of the training_steps config option, which is 50000 by default. To view the training progress on TensorBoard, run the following command in the terminal: tensorboard --logdir=/tmp/autoenc_logs/ckpts  The visualization interface can be accessed at http://localhost:6006. The images received from Isaac_sim_unity3d, along with the bounding boxes, can be visualized in Sight at http://localhost:3000. Note If the decoder output in TensorBoard looks completely black or completely white within 10 iterations, change the add_noise_to_image setting to “true” in the training_config.json file. This adds noise to the images, which prevents the training variables from saturating. You should now start seeing a combination of black and white pixels instead of a completely black/white image. You can reset the option to “false” after a few hundred iterations to train on cleaner images if desired. Note Depending on the type of object, the model may not need the default of 50000 training steps. You can end training sooner if the reconstructed decoder image, which is the output of the model, is clear over multiple training samples. This is a good check to determine the quality of the model. You can view the reconstructed image in the TensorBoard Images section. The frozen TensorFlow model is only generated at the end of training (i.e. when the number of iterations equals the value of the training_steps config option). So, if you want to end training before the given steps are finished, you can generate the frozen model from this checkpoint by running the following command: bob@desktop:~/isaac$ bazel run packages/ml/tools:freeze_tensorflow_model_tool -- --out /tmp/autoenc_logs/ckpts/model-24000 --output_node_name 'translation_output','rotation_output_1','decoder_output' /tmp/autoenc_logs/ckpts/model-24000


The above command sets the last checkpoint number to 24000. You can change this value as needed.

### Conversion to UFF Model¶

The UFF package contains a set of utilities to convert trained models from various frameworks to a common UFF format. In this application, the UFF parser converts the Tensorflow model to UFF so it can be used for codebook generation and inference. Refer to the NVIDIA TensorRT Documentation for more details.

At the end of the training iterations, the Tensorflow model is saved as a .pb file. You then need to convert it to the UFF model using the python script and UFF parser. For example, at the end of 24000 iterations, the Tensorflow model is saved as model-24000.pb and can be converted to a UFF model using the following command:

bob@desktop:~/isaac$bazel run packages/ml/tools:tensorflow_to_tensorrt_tool -- --out /tmp/autoenc_logs/ckpts/pose_cnn_model.uff --input_node_name 'encoder_input' --input_node_name 'encoder_bbox_input' --output_node_name 'translation_output' --output_node_name 'rotation_output_1' --output_node_name 'decoder_output' /tmp/autoenc_logs/ckpts/model-24000-frozen.pb  ### Running inference¶ There are four available modes of inference based on data collection for object pose estimation: simulation, camera feed, log replay, and image feed. These modes can be run via Python script. As with training, the package provides sample inference for two objects: a dolly for the navigation use case and a box for the manipulation use case. The inference Python script and the config files for dolly and box are located in /packages/object_pose_estimation/apps/pose_cnn_decoder. The default config file for inference is located at /packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_dolly.config.json for the dolly object. Commands to run the four inference modes for both dolly and box are provided below: 1. Inference on simulation data is a good first test to check the accuracy of the model. To run the inference application on simulation data for both detection and pose estimation, first play the pose estimation training scene binary, then run the inference command within Isaac SDK. The following are input arguments needed for this mode: • --mode: This should be set to sim for this mode. • --image_channel: The name of the image channel sending the color image from simulation. The default name is color. • --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively. For dolly inference, run scenario 12 using the following command from inside the isaac_sim_unity3d release folder: bob@desktop:~/isaac$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 12


Then run dolly inference on simulated data using this command:

bob@desktop:~/isaac$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel color  For box inference, run inference scenario 11 using the following command from inside the isaac_sim_unity3d release folder: bob@desktop:~/isaac$ ./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 11


Then run box inference on simulated data using this command:

bob@desktop:~/isaac$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode sim --image_channel encoder_color --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json  2. To run inference on a real-time camera feed from a RealSense camera, the following input arguments are required: • --mode: This should be set to realsense for this mode. • --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively. • --fps: The frame rate of the RGB channel. Currently, only two frame rates, 15 and 30, are supported for the RealSense camera. The default value is 15. Run dolly inference on a camera feed using the following command inside Isaac SDK: bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense


Run box inference on a camera feed using the following command inside Isaac SDK:

bob@desktop:~/isaac$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode realsense --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json  3. To run the inference application using recorded logs, the following input arguments are required: • --mode: This should be set to cask for this mode. • --cask_directory: The path to the sample replay log. The default path is set to the dolly replay log. • --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively. Run dolly inference on sample logs using the following command inside Isaac SDK: bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --rows 480 --cols 848


Run box inference on sample logs using the following command inside Isaac SDK:

bob@desktop:~/isaac$bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode cask --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --cask_directory external/sortbot_pose_estimation_data/klt_logs  4. To run the inference application using sample images, the following input arguments are required • --mode: This should be set to image for this mode. • --image_directory: The path to the sample image. The default is set to the path for the dolly sample image. • --rows and --cols: The number of rows and columns of the input image. The default values are 720 and 1280, respectively. • --optical_center_rows and --optical_center_cols: The optical center specified as a row and column in the input image. The default values are 360 and 640, respectively. • --focal_length: The focal length of the camera used to capture the image in pixels. The default value is 925.74. Note The last three parameters are needed to construct the pinhole model of the camera. Run dolly inference on a sample image using the following command inside Isaac SDK: bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --rows 480 --cols 848 --optical_center_rows 240 --optical_center_cols 424 --focal_length 614.573


Run box inference on a sample image using the following command inside Isaac SDK:

bob@desktop:~/isaac\$ bazel run packages/object_pose_estimation/apps/pose_cnn_decoder:detection_pose_estimation_cnn_inference_app -- --mode image --config packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json --image_directory external/sortbot_pose_estimation_data/klt_sample_images/klt_box.png


The inference applications take an RGB image from different input sources and outputs the estimated 3D poses of the detected objects along with the segmentation mask of the cropped image. Sample data of the dolly object is provided, allowing you to run the last two inference apps using images and logs. You can visualize the estimated pose in Sight at http://localhost:3000.

There are two types of visualization for the estimated pose:

• A 3D bounding box, which requires specification of both the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. Configure these parameters in the viewers/Detections3Viewer component in the inference application files.

• A rendering of the CAD model in the scene, which requires the path to the object CAD model and file names. These correspond to the assetroot and assets parameters respectively in the websight component in the inference application files.

If you notice lag in mesh versus image rendering in Sight, increase the Change delay value slightly to synchronize the image, rendered mesh, and bounding box. To find this option, right-click the rendering window in Sight.