3D Object Pose Estimation with AutoEncoder
Object detection and 3D pose estimation play a crucial role in robotics. They are needed in a variety of applications such as navigation, object manipulation, and inspection. The 3D Object Pose Estimation application in Isaac SDK provides the framework to train pose estimation for any model completely in simulation, and to test and run the inference in simulation as well as the real world.
The 3D pose estimation model used in this application is based on the work by Sundermeyer et al. Given an RGB image, the algorithm first detects objects from a known set of objects with available 3D CAD models using any object detection model available in Isaac SDK, and then estimates their 3D pose using an autoencoder model. This application aims at low-latency, real-time object detection and 3D pose estimation by leveraging GPU acceleration while achieving good accuracy.
3D Pose Estimation Algorithm
This application consists of three modules:
An inference module
An autoencoder-training module
A codebook-generation module
The inference module consists of two sub-modules: object detection and 3D pose estimation. The object detection sub-module takes an RGB image and determines bounding boxes for objects of interest using any object detection model available in Isaac SDK. The 3D pose estimation sub-module crops the image based on the bounding boxes and estimates the 3D poses of the objects in each of the cropped images. To estimate the 3D pose of an object, this module requires a trained encoder neural network for the object and a pose codebook for the object.
The autoencoder-training module trains the encoder neural network for the object. This module uses a variant of the Denoising Autoencoder called the Augmented Autoencoder. The autoencoder is trained to reconstruct the input RGB image such that the reconstruction is invariant to any changes other than the orientation of the object. These changes include, but are not limited to, background, lighting condition, occlusion, scale, and translation. In other words, the network learns to preserve the orientation of the object and normalize all other aspects.
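The training objective described above can be pictured as pairing an augmented input with a clean target. The following is a minimal sketch under simplifying assumptions, not the Isaac SDK training code: images are grayscale nested lists, and the helper names make_training_pair and reconstruction_loss are hypothetical.

```python
import random


def make_training_pair(obj, mask, rng, max_shift=2):
    """Build one (augmented input, clean target) pair from a rendered view.

    obj  : H x W list of pixel intensities in [0, 1] (grayscale for brevity)
    mask : H x W list of 0/1 flags, 1 where the pixel belongs to the object
    """
    h, w = len(obj), len(obj[0])
    # Target: the object alone on a black background, with no augmentation.
    target = [[obj[y][x] * mask[y][x] for x in range(w)] for y in range(h)]

    # Input: same orientation, but random background, brightness, and shift.
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    gain = rng.uniform(0.6, 1.4)
    augmented = [[rng.random() for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ty, tx = y + dy, x + dx
            if mask[y][x] and 0 <= ty < h and 0 <= tx < w:
                augmented[ty][tx] = min(1.0, obj[y][x] * gain)
    return augmented, target


def reconstruction_loss(prediction, target):
    """Mean squared error between a reconstruction and the clean target."""
    count = len(target) * len(target[0])
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(prediction, target)
                for p, t in zip(pred_row, target_row))
    return total / count


rng = random.Random(0)
obj = [[0.0] * 8 for _ in range(8)]
mask = [[0] * 8 for _ in range(8)]
for y in range(3, 5):
    for x in range(3, 5):
        obj[y][x], mask[y][x] = 0.8, 1

augmented, target = make_training_pair(obj, mask, rng)
print(reconstruction_loss(target, target))  # a perfect reconstruction scores 0.0
```

Because the loss is always measured against the clean target, the only information the network can usefully keep from the augmented input is the appearance of the object at its current orientation.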
The codebook-generation module generates a pose codebook for the object using the trained encoder. This codebook is used by the 3D pose estimation module to find the best match for the orientation of the object at the time of inference.
Both the autoencoder-training and codebook-generation modules require simulated data based on the user-provided 3D CAD model of the object. This application also provides a simulated data generation module, using the Unity Engine and Isaac_sim_unity3d, that generates the simulated data for an object given its 3D CAD model.
The following three sections describe each of the above three modules (autoencoder training, codebook generation, and inference) in more detail.
Acquiring and labeling real-world data required to train models for 3D pose estimation is extremely challenging, time consuming, and prone to error. The problem is especially aggravated for robotics: robots are required to perform pose estimation in a wide variety of specialized scenarios, so collecting large amounts of accurate and labeled real-world data for every such scenario is prohibitive, which in turn slows down the rate of adoption of these models.
This application trains the autoencoder models for 3D pose estimation entirely on simulated data, and bridges the sim-to-real gap with simulator features such as domain randomization and procedural generation techniques.
The autoencoder training stage involves two broad steps:
Generating simulated data
Running the autoencoder-training pipeline
Generating simulated data
Training the autoencoder model requires four different types of data:
A rendered RGB image
A corresponding segmentation image
A de-noised version of the RGB image (i.e. a version of the RGB image that is invariant to background, occlusion, lighting and translation)
A corresponding de-noised segmentation image
This application allows you to generate the above data by setting up a scene using the Unity engine and Isaac_sim_unity3d and then streaming it to the Isaac SDK.
A sample scene to generate the above data is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation/. Follow the instructions on the IsaacSim Unity3D page to open the scene in Unity Editor. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.
The important components of the scene are as follows:
The color camera and the segmentation camera that respectively render the actual RGB and segmentation images.
The color camera and the segmentation camera that respectively render the de-noised color and segmentation images.
To make the images invariant to background and occlusions, a culling mask is used for both the cameras.
To make the rendered color image invariant to lighting conditions, a custom albedo shader is added to the color camera.
To render the images invariant to translation, the LookAtFromFixedDistance script is added to both the cameras so they are always at a constant distance from the object of interest.
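The fixed-distance constraint used for the de-noised cameras can be illustrated with a little vector math. This is a sketch of the idea only, not the Unity script: the function name look_at_from_fixed_distance and its arguments are assumptions made for this example.

```python
import math


def look_at_from_fixed_distance(direction, distance, target=(0.0, 0.0, 0.0)):
    """Place a camera `distance` away from `target` along `direction`.

    Returns the camera position and the unit forward vector that points
    from the camera back to the target.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [c / norm for c in direction]
    position = [t + distance * u for t, u in zip(target, unit)]
    forward = [-u for u in unit]
    return position, forward


position, forward = look_at_from_fixed_distance((1.0, 2.0, 2.0), distance=1.5)
print(position, forward)
print(math.dist(position, (0.0, 0.0, 0.0)))  # stays at the requested distance
```

Whatever viewing direction is randomized, the object keeps the same apparent size and stays centered, which is what removes translation and scale from the de-noised images.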
The scene is configured such that it can stream all four required labels for each frame to Isaac SDK using TCP. The default object of interest in the scene is a dolly. Follow these steps to set up the scene for a new object:
If the object of interest is already in the group in packages/Nvidia/Samples/ObjectPoseEstimation/PoseEstimationObjectsGroup, you can skip steps 1-3.
Load Unity, upload the 3D CAD model and textures of the object, and create a prefab.
In the prefab, click Add Component and add the LabelSetter script. Enter a Label Name of your choice. This script creates segmentation images and computes bounding boxes.
Add the prefab to the PoseEstimationObjectsGroup in packages/Nvidia/Samples/ObjectPoseEstimation. To accomplish this, either increase the element size by 1 or swap an existing prefab with the prefab of your object by dragging and dropping the prefab into the list.
Drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_estimation_training.unity scene into the Hierarchy panel and remove any existing scenes in the panel.
Enter the label name assigned to the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. The label name is used in isaac_sim_unity3d to render the segmentation image of the object in a scene.
Set a Name for the Dolly element in the Class Labels section. This name is used as the label for the object in Isaac SDK.
Save the scene and play it.
In Isaac SDK, add the prefab name of the object to the robot_prefab parameter in the data.simulation.scenario_manager component in packages/object_pose_estimation/apps/autoencoder/training.app.json.
This is the name of the object prefab and not the Class Label name.
For example, if you are using the “TrashCan02.prefab” Unity object with the LabelSetter name in the prefab set to “trashcan”, then the robot_prefab parameter in the Isaac app should be TrashCan02, and the expression name in ClassLabelManager in Unity can be any name of your choice.
Adding the prefab name in the Isaac app will enable spawning of the object in the scene when you run the training app as described in the sections below.
At this point, you should see the camera positions, objects, lighting, and background substances for walls and floors changing every frame. You can adjust the randomization frequency of the backgrounds, camera positions, etc. as needed in the procedural camera GameObject and its child CameraGroup GameObject.
Isaac_sim_unity3d communicates with the Isaac SDK via TCP sockets. isaac/packages/navsim/apps/navsim.app.json publishes the simulation data to a user-defined port using a TcpPublisher node. The data is received by the training application, packages/object_pose_estimation/apps/autoencoder/training.app.json, using TcpSubscriber.
Running the autoencoder training pipeline
Run the training application with the following command:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:autoencoder_training
Make sure that the robot_prefab name specified in the app file matches the prefab name of the GameObject in the Unity scene.
The training configuration can be set in the
The logs and checkpoints are stored in /tmp/autoenc_logs/ckpts by default, but this path can be changed in the training configuration JSON file. By default, the training application runs from training iteration 0 until the iteration number equals the training_steps value given in the training_config.json file.
To start the training from an intermediate checkpoint, set the checkpoint config option to the corresponding checkpoint filename, which is by default in the form model-<checkpoint_number>. The script extracts the number that follows “model-” as the checkpoint number and restarts from that iteration number. For example, if you want to restart from iteration number 10000, set the checkpoint config option to model-10000. The step number will then start from 10000 and end at the value of the training_steps config option, which is 50000 by default.
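The resume logic described above can be sketched in a few lines. This is an illustration of the naming convention only, not the actual training script; the function names resume_step and remaining_steps are hypothetical.

```python
def resume_step(checkpoint_name, prefix="model-"):
    """Extract the iteration number from a checkpoint name like 'model-10000'."""
    if not checkpoint_name.startswith(prefix):
        raise ValueError(f"unexpected checkpoint name: {checkpoint_name!r}")
    return int(checkpoint_name[len(prefix):])


def remaining_steps(checkpoint_name=None, training_steps=50000):
    """Iterations still to run, from the restored step up to training_steps."""
    start = resume_step(checkpoint_name) if checkpoint_name else 0
    return range(start, training_steps)


steps = remaining_steps("model-10000")
print(steps.start, steps.stop, len(steps))  # 10000 50000 40000
```

With no checkpoint set, the same helper yields the full range from 0 to training_steps, matching the default behavior.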
To view the training progress on TensorBoard, run the following command in the terminal:
The visualization interface can be accessed at
The images received from Isaac_sim_unity3d, along with the bounding boxes, can be visualized in
Depending on the type of object, the model may not need the default of 50000 training steps. You can end training sooner if the reconstructed decoder image, which is the output of the model, is clear over multiple training samples. This is a good check to determine the quality of the model. You can view the reconstructed image in the TensorBoard Images section.
The frozen TensorFlow model is only generated at the end of training (i.e. when the number of iterations equals the value of the training_steps config option). If you want to end training before it completes the given number of steps, you can generate the frozen model from the latest checkpoint by running the following command:
bob@desktop:~/isaac$ bazel run packages/ml/tools:freeze_tensorflow_model_tool -- --out /tmp/autoenc_logs/ckpts/model-24000 --output_node_name encoder_output/BiasAdd /tmp/autoenc_logs/ckpts/model-24000
The above command assumes that the last checkpoint number is 24000. You can change this value as needed.
The autoencoder training module has the following message types:
ColorCameraProto: Holds a color image and camera-intrinsic information.
TensorListProto: Defines a list of TensorProto messages, which are mainly used to pass around tensors.
SegmentationCameraProto: Holds an image containing the class label for every pixel in the image. It also contains camera-intrinsic information, similar to ColorCameraProto. This is used to compute ground-truth 2D bounding boxes.
RigidBody3GroupProto: Holds information about a rigid body like position, velocity, acceleration.
Detections2Proto: Holds the absolute 2D bounding box coordinates and class name.
Detections3Proto: Holds the 3D pose of objects detected with respect to the sensor frame and their class names.
The autoencoder training module uses the following codelets:
TcpSubscriber: Used by the training application to receive data from the simulator. Five TcpSubscribers are used in this example; together they receive the encoder and decoder color images and their detection labels, as well as the poses of the rigid bodies of interest from the simulation.
LabelToBoundingBox: Takes in a SegmentationCameraProto and outputs a Detections2Proto. This codelet is responsible for computing the ground truth bounding boxes from the object class and instance labels. The bounding boxes are published as Detections2Proto with confidence 1.0.
DetectionImageExtraction: Takes in a ColorCameraProto containing the color image and camera parameters and Detections2Proto containing the list of bounding boxes. The codelet crops the detected objects in the image and publishes them as a TensorListProto of size (N x W x H x 3).
TensorSynchronization: Takes in two or more TensorListProto inputs and synchronizes them according to their acquisition time. This codelet ensures that all labels received by the training code are synchronized for every data sample.
SampleAccumulator: Takes in the training data labels as a TensorListProto and stores them in a buffer. This codelet is bound to the Python training script such that the training script can directly sample from this buffer using the acquire_samples() function, which converts the TensorListProto into a list of numpy arrays with corresponding dimensions and passes the list to the Python training script.
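The steps performed by LabelToBoundingBox and DetectionImageExtraction can be sketched as follows. This is a simplified illustration, not the codelet implementation: it assumes images stored as nested lists, a single instance per label, and inclusive pixel coordinates, and the function names are hypothetical.

```python
def label_to_bounding_box(segmentation, label):
    """Return (x_min, y_min, x_max, y_max), inclusive, for pixels equal to label.

    Returns None when the label does not appear in the segmentation image.
    """
    xs, ys = [], []
    for y, row in enumerate(segmentation):
        for x, value in enumerate(row):
            if value == label:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def crop(image, box):
    """Cut the bounding box region out of an image stored as rows of pixels."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


segmentation = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# Dummy RGB image whose red channel encodes the pixel position.
image = [[(10 * y + x, 0, 0) for x in range(5)] for y in range(4)]

box = label_to_bounding_box(segmentation, label=1)
patch = crop(image, box)
print(box)                        # (1, 1, 3, 2)
print(len(patch), len(patch[0]))  # 2 3
```

In the real pipeline each crop would also be resized to the network input size before the crops are stacked into one tensor per frame.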
Conversion to UFF Model
The UFF package contains a set of utilities to convert trained models from various frameworks to a common UFF format. In this application, the UFF parser converts the TensorFlow model to UFF so that it can be used for codebook generation and inference. Refer to the NVIDIA TensorRT documentation for more details.
At the end of the training iterations, the TensorFlow model is saved as a .pb file. You then need to convert it to the UFF model using the Python script and UFF parser.
For example, at the end of 24000 iterations, the TensorFlow model is saved as model-24000.pb and can be converted to a UFF model using the following command:
bob@desktop:~/isaac$ bazel run packages/ml/tools:tensorflow_to_tensorrt_tool -- --out /tmp/autoenc_logs/ckpts/ae_model.uff --input_node_name encoder_input --output_node_name encoder_output/BiasAdd /tmp/autoenc_logs/ckpts/model-24000-frozen.pb
To determine the orientation of an object during inference, we use a codebook containing the latent space representations of the object at different poses. The latent vector of the test image during run time is compared with the codebook vectors to find the best match. More details on the codebook generation setup and requirements can be found in the paper by Sundermeyer et al.
The Codebook Generation module performs the following operations:
Generates a list of poses sampled uniformly from a sphere, with the object of interest at the center.
Sends teleportation commands with the sampled poses to the simulator.
Receives the color image containing the object appearance at each of those poses.
Writes the codebook file containing the encoder latent vectors and their corresponding pose and bounding box information as a JSON file.
This module requires the UFF model as input to generate the codebook of sampled poses. Uniform spherical sampling can be replaced by any other sampling strategy to achieve better performance, depending on your use case. You can also sample only parts of the sphere by limiting the range of the pitch, roll and yaw angles in the config options in the /packages/object_pose_estimation/apps/autoencoder/codebook_generation.app.json application file.
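As an example of swapping in a different sampling strategy, the sketch below uses a Fibonacci lattice, which is not the method used by the application, to place roughly uniform viewpoints on a sphere and pairs each with a set of in-plane roll angles. The function name sample_viewpoints and its parameters are assumptions made for this example.

```python
import math


def sample_viewpoints(num_points, radius, num_rolls=1):
    """Return (position, roll_radians) pairs spread roughly evenly on a sphere.

    The object is assumed to sit at the origin; each position is a camera
    location at distance `radius` from it.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(num_points):
        z = 1.0 - (2.0 * i + 1.0) / num_points  # evenly spaced heights in (-1, 1)
        ring = math.sqrt(1.0 - z * z)           # radius of the circle at height z
        theta = golden_angle * i
        position = (radius * ring * math.cos(theta),
                    radius * ring * math.sin(theta),
                    radius * z)
        for k in range(num_rolls):
            views.append((position, 2.0 * math.pi * k / num_rolls))
    return views


views = sample_viewpoints(num_points=100, radius=1.2, num_rolls=4)
print(len(views))  # 400
```

Restricting the sampled region, for example to the upper hemisphere, would amount to filtering the positions by angle before adding the rolls.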
Setting up a Scene in Unity
The scene to generate the codebook in Unity is available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectPoseEstimation/. Follow these steps to set up the scene for codebook generation:
Load Unity, drag the packages/Nvidia/Samples/ObjectPoseEstimation/pose_estimation_codebook_generation.unity scene into the Hierarchy panel, and remove any existing scenes in the panel. Then drag the object prefab into the scene and set its position to origin.
Enter the label name assigned to the prefab in the Class Label Rules > dolly > Expression field. The default name is “dolly”, which is the label name of the “Dolly” prefab. This label name is used in isaac_sim_unity3d to render the segmentation image of the object in a scene.
Set a Name for the Dolly element in the Class Labels section. This name is used as the label for the object in Isaac SDK.
All the settings in this ClassLabelManager script must match the settings of the same GameObject in the training scene.
Save the scene and play it.
Generating the Codebook
The codebook is generated using the UFF model saved at the end of training. Before running the codebook generation app in Isaac SDK, configure the following options in the /packages/object_pose_estimation/apps/autoencoder/codebook_generation.app.json application file:
Add the prefab name of the object to the robot_prefab option in the simulation.scenario_manager component in the application file. The default value is “Dolly”.
Set the whiteList_labels option in the FilterDetectionByLabel component to the Name given to the object in the Class Labels section of the “ClassLabelManager” GameObject in Unity. The default name is “Dolly”. During the visualization of bounding boxes in Sight, the boxes are tagged with this name.
Set the path of the saved UFF model in the model_file_path option in the
You can also set the number of sampling points, in-plane rolls, range of pitch, and roll/yaw angles for sampling the poses from the sphere in the codebook_view_sampler component in the
Run the codebook generation app from the terminal using the following command:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:codebook_generation
At this point, in the Unity scene you should see the camera position changing every frame according to the poses sampled from the uniform sphere, which is the default option. The rendered object can be visualized in Sight at http://localhost:3000. The codebook is saved by default in
Many of the codelets used in the autoencoder-training module are re-used in the codebook-generation module. The additional codelets used are listed below:
CodebookSampler: Samples camera poses for codebook generation assuming the object under observation is placed at the origin and publishes one sampled pose per tick. It uses icosahedron subdivision to uniformly sample viewpoints from a sphere of a given radius, and additionally applies in-plane rotation (roll) along the camera axis.
TcpPublisher: Sends data from Isaac SDK to the simulator. For codebook generation, one TcpPublisher is used for sending poses that the camera should teleport to in the simulation.
Detections3Encoder: Takes in a Detections3Proto containing the pose information and encodes it into a tensor. The tensor is published as a TensorListProto containing pose information for all the detected objects.
RigidBodyToDetections3: Takes in a RigidBody3GroupProto containing a list of 3D rigid bodies with poses in the Isaac SDK coordinate frame, converts the poses, if needed, to a reference frame given by one of the input rigid bodies, and publishes the list of 3D rigid body poses in the reference frame as Detections3Proto. This codelet is used to compute the pose of the rigid bodies obtained from simulation with respect to the camera frame. The resulting output pose is used to train the pose estimation model.
TensorRTInference: Loads the frozen neural network model into memory, generates an optimized TensorRT engine, evaluates the model using the inputs to the network as a TensorListProto, and publishes the network output, which consists of the embedding vectors for all instances of an object class, as a TensorListProto.
ImagePoseEncoder: Takes in ColorCameraProto for camera-intrinsic properties, Detections2Proto containing the list of bounding boxes, and RigidBody3GroupProto to get the 3D pose of the rigid body. It publishes seven labels extracted from these inputs as a single tensor in a TensorListProto for generating the codebook.
CodebookWriter: Takes in two TensorListProto messages, one containing the list of embeddings and the other containing the corresponding decoded information received from the ImagePoseEncoder codelet, and publishes them as a JsonProto. In this example, the codes are the latent vectors from the encoder, and the decoded information consists of the poses for those vectors along with bounding box and camera parameters.
JsonWriter: Takes in JsonProto messages and writes them as a JSON file. This codelet collects the pose encodings from the CodebookWriter codelet and writes them to the finished codebook file that will be used by the inference module.
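The frame change performed by RigidBodyToDetections3 in the list above, expressing an object pose relative to the camera, can be sketched with plain rigid-transform algebra. This is an illustration under assumed conventions, not the codelet's code: a pose is a (3x3 rotation, translation) pair, and the function names are hypothetical.

```python
def invert(pose):
    """Invert a rigid transform given as (3x3 rotation rows, translation)."""
    rot, trans = pose
    rot_t = [[rot[r][c] for r in range(3)] for c in range(3)]  # transpose
    inv_trans = [-sum(rot_t[i][j] * trans[j] for j in range(3)) for i in range(3)]
    return rot_t, inv_trans


def compose(a, b):
    """Return the transform a * b (apply b first, then a)."""
    rot_a, trans_a = a
    rot_b, trans_b = b
    rot = [[sum(rot_a[i][k] * rot_b[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    trans = [sum(rot_a[i][k] * trans_b[k] for k in range(3)) + trans_a[i]
             for i in range(3)]
    return rot, trans


def pose_in_camera_frame(world_T_camera, world_T_object):
    """camera_T_object = inverse(world_T_camera) * world_T_object."""
    return compose(invert(world_T_camera), world_T_object)


# Camera at (1, 0, 0), rotated 90 degrees about the world z axis.
world_T_camera = ([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                  [1.0, 0.0, 0.0])
# Object at (1, 2, 0) with no rotation.
world_T_object = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                  [1.0, 2.0, 0.0])

rotation, translation = pose_in_camera_frame(world_T_camera, world_T_object)
print(translation)  # [2.0, 0.0, 0.0]
```

The object sits two units along the camera's own x axis, which is the kind of camera-relative pose the codebook entries and training labels need.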
The end-to-end inference of the pose estimation model from a single RGB image includes two primary subgraphs.
The object-detection subgraph, which can make use of any object detection model in Isaac SDK
The pose estimation subgraph, which runs the encoder network and estimates the 3D pose based on the best match of the latent vector from the generated codebook. The inference app instantiates one pose estimation subgraph per object class. Multiple instances of an object class can be processed at a time in a single pose estimation subgraph.
Each object with a distinct CAD model requires a corresponding trained model and generated codebook. You must then add a separate subgraph to the inference application for each object. Different instances of the same object in the scene are automatically passed to a single subgraph corresponding to that object class.
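The routing of detections to per-class subgraphs can be pictured as a simple grouping step. The sketch below is illustrative only: the detection tuples, label names, and the function route_detections are assumptions for this example rather than Isaac SDK types.

```python
from collections import defaultdict


def route_detections(detections, known_labels):
    """Group (label, box) detections by label, dropping unknown classes.

    Each resulting group would be handed to the pose estimation subgraph
    that holds the trained encoder and codebook for that object class.
    """
    routed = defaultdict(list)
    for label, box in detections:
        if label in known_labels:
            routed[label].append(box)
    return dict(routed)


detections = [
    ("dolly", (10, 20, 80, 120)),
    ("trashcan", (200, 40, 260, 140)),
    ("dolly", (300, 30, 360, 130)),
    ("person", (400, 10, 450, 200)),  # no trained model, so it is ignored
]
routed = route_detections(detections, known_labels={"dolly", "trashcan"})
print({label: len(boxes) for label, boxes in routed.items()})
# {'dolly': 2, 'trashcan': 1}
```

Both dolly instances end up in the same group, so a single subgraph can process them as one batch.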
There are four inference applications based on the mode of data collection: inference on simulation data as a first test to check the accuracy of the model, inference in real-time using the camera feed, inference using a sample image, and inference using recorded logs.
First, follow the same configuration steps for the inference application files as instructed for the /packages/object_pose_estimation/apps/autoencoder/codebook_generation.app.json application file in the Codebook Generation section. The inference application files are located in
To run the inference application on the simulation data, first play the pose estimation training scene in Unity, then run the following command within Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:pose_estimation_inference_sim
To run the inference application using the real-time camera feed, run the following command within Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:pose_estimation_inference_camerafeed
To run the inference application using a sample image, run the following command within Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:pose_estimation_inference_imagefeeder
To run the inference application using recorded logs, run the following command within Isaac SDK:
bob@desktop:~/isaac$ bazel run packages/object_pose_estimation/apps/autoencoder:pose_estimation_inference_replay
The inference applications take an RGB image from different input sources and output the estimated 3D poses of the detected objects. Sample data of the dolly object is provided to run the last two inference apps using images and logs. The estimated pose can be automatically visualized in Sight at
There are two types of visualization for the estimated pose:
A 3D bounding box, which requires specification of both the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. Configure these parameters in the ObjectDetectionViewer component in the inference application files.
A rendering of the CAD model in the scene, which requires the path to the object CAD model and file names. These correspond to the assets parameters respectively in the websight component in the inference application files.
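To show how the two bounding box parameters combine with an estimated pose, the sketch below computes the eight corners of the 3D box in the camera frame. It is an illustration under assumptions: the offset is treated as a pure translation, and the function name box_corners and its argument layout are hypothetical, not the ObjectDetectionViewer configuration format.

```python
from itertools import product


def box_corners(size, center_offset, rotation, translation):
    """Return the 8 corners of an oriented 3D box in the camera frame.

    size          : box extents (x, y, z) at zero orientation
    center_offset : translation from the object center to the box center
    rotation      : 3x3 rotation rows of the estimated object pose
    translation   : position of the object center in the camera frame
    """
    half = [s / 2.0 for s in size]
    corners = []
    for signs in product((-1.0, 1.0), repeat=3):
        # Corner in the object frame, then rotated and moved to the camera frame.
        local = [center_offset[i] + signs[i] * half[i] for i in range(3)]
        corners.append(tuple(
            sum(rotation[r][c] * local[c] for c in range(3)) + translation[r]
            for r in range(3)))
    return corners


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corners = box_corners(size=(1.0, 2.0, 0.5), center_offset=(0.0, 0.0, 0.25),
                      rotation=identity, translation=(0.0, 0.0, 2.0))
print(len(corners))                                             # 8
print(min(c[2] for c in corners), max(c[2] for c in corners))   # 2.0 2.5
```

The offset matters when the CAD model origin is not at the geometric center, for example when it sits on the base of the object as in this example.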
The codelets used in the inference applications are listed below:
RealSense: Enables data ingestion from the RealSense camera. Encodes the raw image as a ColorCameraProto message containing the RGB image and camera-intrinsic information.
FilterDetectionsByLabel: Takes in a list of bounding box detections as Detections2Proto, filters the detections of the required object class, and publishes them as Detections2Proto.
DetectionImageExtraction: Takes in a ColorCameraProto containing the color image and camera parameters and a Detections2Proto containing the list of bounding boxes. The codelet crops the detected objects in the image and publishes them as a TensorListProto of size (N x W x H x 3).
TensorRTInference: Loads the frozen neural network model into memory, generates an optimized TensorRT engine, evaluates the model using the inputs to the network as TensorListProto, and publishes the network output, which consists of the embedding vectors for all instances of an object class, as TensorListProto.
CodebookLookup: Takes in a TensorListProto containing the embeddings and, for each embedding, finds the top K best matches among the codes in a codebook using cosine similarity. It publishes the corresponding labels of the best-matched embedding vector for all instances as TensorListProto.
PoseEstimation: Takes in the TensorListProto containing the labels of the best matched vector from the codebook and publishes the pose estimate of the object as Detections3Proto.
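The matching step performed by CodebookLookup can be sketched as a cosine-similarity search. This is a minimal illustration, not the codelet implementation: the codebook is a small in-memory list, the labels are placeholder strings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(embedding, codebook, k=1):
    """Return the k (similarity, label) pairs most similar to the embedding.

    codebook is a list of (code vector, label) entries, where the label
    stands in for the stored pose and bounding box information.
    """
    scored = [(cosine_similarity(embedding, code), label)
              for code, label in codebook]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


codebook = [
    ([1.0, 0.0, 0.0], "yaw_000"),
    ([0.0, 1.0, 0.0], "yaw_090"),
    ([-1.0, 0.0, 0.0], "yaw_180"),
]
matches = top_k_matches([0.9, 0.2, 0.0], codebook, k=2)
print([label for _, label in matches])  # ['yaw_000', 'yaw_090']
```

Cosine similarity ignores the length of the latent vector, so only its direction decides which stored orientation is the closest match.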