Object Detection Pipeline

Creating a pipeline for object detection involves the following tasks:

  • Selection of a pretrained network
  • Fine-tuning the selected network with synthetic data from Unreal Engine 4
  • Converting the tuned model to TensorFlow or TensorRT for inference
  • Running inference with TensorFlow or TensorRT on either the host or the target system

Selecting a Pretrained Network

The network selected for Isaac object detection is You Only Look Once (YOLO), a state-of-the-art, real-time object detection system. Version 3 achieves both high precision and high speed on the COCO data set. The alternative tiny-YOLO network is faster still, without a great sacrifice of precision. YOLO was selected because it has the following characteristics:

  • Easily trained with TensorFlow or Keras
  • Easily trained further with a small set of synthetic data
  • Supports TensorRT inference
  • Includes a suitable license on dataset and network

Some target devices may not have enough memory to run a network like yolov3. In that case, run tiny-yolov3 instead. See the tiny-yolov3 instructions for how to run it.

Sample results using the YOLO v3 network, with detected objects shown in bounding boxes of different colors, are shown in the following figure:


Training / Fine-Tuning the Network

The training application receives camera images and bounding box protos from the Unreal Engine 4 (UE4) simulation over the Isaac UE4 bridge. The application uses that data to generate training pairs, which are used to train the YOLO network to perform multi-class object detection. In Isaac Sim, the objects to detect include persons, bowling pins, potted plants, trash cans, balls, and chairs. (The objects presented in simulation can be adjusted as needed.) The training script is written in Python with minimal interfacing overhead to the C++ side of the Isaac SDK.

Configure UE4 to connect with several existing Isaac SDK components to set up the pipeline, as shown in the following diagram:


Message Types

Messages are of the following types:

  • TensorProto: Defines an n-dimensional tensor that forms the basis of images and tensor pairs for training.
  • TensorListProto: Defines a list of TensorProto messages. TensorListProto messages are mainly used to pass around tensors.
  • ImageProto: A special case of TensorProto for tensors limited to three dimensions.
  • ColorCameraProto: Holds a color image, including camera information, and is sent by UE4 by default. In this example, only the color image data is used.
  • Detections2Proto: Holds the absolute bounding box coordinates and class name.
  • RigidBody3GroupProto: Defines the position and rotation of a given actor in UE4. It is used to randomly change the spawn location of Carter so that training images are captured against different backgrounds.

Codelets

The training pipeline uses the following codelets:


  • TcpSubscriber: Used by the training application to receive data from the UE4 simulator. Two TcpSubscribers are used in this example: one receives the color image and the other receives the detection label from the simulation.
  • ColorCameraEncoder: Takes in a ColorCameraProto and outputs the image stored in a 3D tensor (WxHx3). The codelet supports downsampling, which reduces the image to a smaller user-defined size. The tensor is published as a TensorListProto containing only one tensor.
  • DetectionEncoder: Filters bounding box information from Detections2Proto. First, DetectionEncoder takes a list of object names and gets the bounding boxes from the Detections2Proto whose class names are listed. Then, candidate bounding boxes whose areas are below the area threshold are filtered out. Finally, the bounding box coordinates and unique ID are encoded into the TensorListProto.
  • TensorSynchronization: Takes in two TensorListProto inputs and synchronizes them according to their acquisition time. This codelet makes sure that the training code gets synchronized color image and detection label data.
  • SampleAccumulator: Takes in the training pairs (image tensor and detection label tensor) as a TensorListProto and stores them in a buffer. This codelet is bound to the Python script so that the training script can sample directly from this buffer with the acquire_samples() function. The acquire_samples() function converts the TensorListProto into a list of numpy arrays with corresponding dimensions and passes it to Python.
  • Teleportation: Publishes RigidBody3GroupProto in a pre-defined way to randomly change the spawn location. Note that the randomness of object positions is not achieved by this codelet, but by the Domain Randomization Plugin in UE4.
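
The two filter stages described for DetectionEncoder can be sketched in Python as follows. This is an illustration only, not the codelet's implementation: the Detection tuple, the filter_detections function, and the use of the class index as the encoded ID are assumptions made for this example.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical stand-in for one entry of a Detections2Proto."""
    class_name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def filter_detections(detections: List[Detection],
                      class_names: List[str],
                      min_area: float) -> List[List[float]]:
    """Keep boxes of listed classes whose area is at least min_area.

    Each kept box is encoded as [x_min, y_min, x_max, y_max, class_id].
    Using the index in class_names as the ID is an assumption of this sketch.
    """
    encoded = []
    for det in detections:
        if det.class_name not in class_names:
            continue  # class is not in the list of names to train on
        area = (det.x_max - det.x_min) * (det.y_max - det.y_min)
        if area < min_area:
            continue  # box area is below the area threshold
        encoded.append([det.x_min, det.y_min, det.x_max, det.y_max,
                        float(class_names.index(det.class_name))])
    return encoded
```

Filtering by class first and by area second mirrors the order given in the DetectionEncoder description above.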

Setup Example

Before running the object detection training pipeline, the bridge configuration, codelet configuration, and training configuration must be set up properly.

Bridge Configuration

UE4 (Unreal Engine 4) is used to generate the images for training. The Carter robot and the training assets must be properly configured in the bridge configuration file.

To Set Up UE4

  1. The default map for training is CarterWarehouse_P. Other maps can be used instead if desired.
  2. Change the “graphs” and “configs” in apps/samples/yolo/bridge_config/carter_rgb_detection.json to absolute paths in the host system.
  3. The file apps/samples/yolo/bridge_config/carter_rgb_detection.config.json is used to configure sim_bridge between the Isaac SDK and UE4. In A_CarterGroup, make sure the actor name matches the camera-carrying Carter defined in UE4. Change the pose to set a different initial position if desired.
  4. Modify A_DRGroup to set up how different training assets are spawned and how Domain Randomization is applied in the scene during training. In random_meshes, be sure to assign a unique name to each category of meshes in the class field. You can also change the randomization of meshes, materials, and transportation for each category of meshes in the random_meshes field.
  5. You can modify the Domain Randomization in existing_light_config and existing_mesh_config. Under existing_mesh_config, add to ignored_classes the names of any classes that should not be changed along with the background’s texture and color.
  6. In CarterLeftCamera, specify the class names that you want to train in classes under bounding_box_settings. Add the category names that correspond to the names you defined in step 4.

Codelet Configuration

The file yolo_training_apps.config.json is used to set codelet parameters. In detection_encoder, set the list of category names for training (make sure they exist as defined in the last step), and set the lower bound of the area for each object.

The position and orientation of Carter during training are set through the Teleportation codelet. In camera_teleportation, set the time interval between teleports of the camera/Carter, the range of xy coordinates and yaw angle, and other characteristics.
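
To illustrate what a range of xy coordinates and yaw angle means for teleportation, the following Python sketch draws a pose uniformly from such ranges. It is a sketch under stated assumptions: the sample_pose function, the uniform distribution, and the example ranges are hypothetical and do not describe how the Teleportation codelet is implemented.

```python
import math
import random


def sample_pose(x_range, y_range, yaw_range, rng=random):
    """Draw one (x, y, yaw) pose uniformly from the given (min, max) ranges.

    Uniform sampling is an assumption of this sketch.
    """
    x = rng.uniform(*x_range)
    y = rng.uniform(*y_range)
    yaw = rng.uniform(*yaw_range)
    return x, y, yaw


# Hypothetical ranges: a 10 m x 4 m area and a full turn of yaw in radians.
pose = sample_pose((-5.0, 5.0), (0.0, 4.0), (-math.pi, math.pi))
```

Wider ranges give more varied backgrounds, at the cost of more poses in which no trained object is in view.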

If configured properly, the scene with bounding box overlays in Sight looks similar to the following:


Training Configurations

The training configuration is in the file apps/samples/yolo/keras-yolo3/configs/isaac_object_detection.json, which has the following elements:

  • classes_path: Contains the category names of all trained objects. These should match the category names set in detection_encoder.
  • anchors_path: Contains the path to the anchor points file which determines whether the YOLO or tiny-YOLO model is trained.
  • lr_stage1: The learning rate of stage 1, when only the heads of the YOLO network are trained.
  • lr_stage2: The learning rate of stage 2, when all of the layers are fine-tuned.
  • gamma: The learning rate decay rate.
  • batch_size_stage1: The batch size of stage 1.
  • batch_size_stage2: The batch size of stage 2.
  • num_epochs_stage1: The number of epochs to train for stage 1.
  • num_epochs_total: The number of epochs to train in total (including both stages).

The log_dir subdirectory contains log files, including trained models.
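
The two-stage parameters above can be read as a simple schedule: epochs before num_epochs_stage1 use the stage 1 settings, and the remaining epochs up to num_epochs_total use the stage 2 settings. The following Python sketch shows that reading. The stage_settings function and all numeric values are hypothetical, and gamma is left out because the exact decay schedule is not described here.

```python
def stage_settings(config: dict, epoch: int) -> dict:
    """Return the stage, learning rate, and batch size for a 0-based epoch."""
    if not 0 <= epoch < config["num_epochs_total"]:
        raise ValueError("epoch is outside the configured training range")
    if epoch < config["num_epochs_stage1"]:
        # Stage 1: only the heads of the network are trained.
        return {"stage": 1,
                "lr": config["lr_stage1"],
                "batch_size": config["batch_size_stage1"]}
    # Stage 2: all layers are fine-tuned.
    return {"stage": 2,
            "lr": config["lr_stage2"],
            "batch_size": config["batch_size_stage2"]}


# Illustrative values only; they are not the defaults of the sample.
example_config = {
    "lr_stage1": 1e-3,
    "lr_stage2": 1e-4,
    "batch_size_stage1": 32,
    "batch_size_stage2": 8,
    "num_epochs_stage1": 50,
    "num_epochs_total": 100,
}
```

With these example values, stage 2 covers the last 50 epochs, because num_epochs_total counts both stages.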

To Train

After finishing the setup steps and launching UE4 simulation, run the following command:

bazel run //apps/samples/yolo:yolo_training

To monitor the training process in Tensorboard, navigate to the log_dir and run the following command:

tensorboard --logdir .

View training images with bounding box overlays in Sight.

Freezing the Model

After selecting the trained model with the best AP score, freeze the model in either TensorFlow or Darknet format. You can export the model to pb, onnx, and h5 formats using export_model.py. Given the generated h5 file, you can then export the model to Darknet format using keras_to_darknet.py. These models can be used in the inference codelet. Note that keras_to_darknet.py only supports the h5 format in this release.

python keras_to_darknet.py [config_file] [keras_weights_file] [out_file] [num_classes]

Make sure the number of classes in the yolo section of the darknet configuration is updated with the number of classes trained. The number of filters in the convolutional layer before the yolo layer is:

filters = (num_classes + K) * B

B is the number of bounding boxes a cell on the feature map can predict, 3 in the case of yolov3 and yolov3-tiny. K is the sum of the number of bounding box attributes and confidence, in this case: 4 + 1 = 5.
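
As a worked example of the formula above, the following Python sketch computes the filter count. The function name yolo_filters is hypothetical; the defaults B = 3 and K = 5 are the values stated above.

```python
def yolo_filters(num_classes: int, boxes_per_cell: int = 3,
                 box_attributes: int = 5) -> int:
    """filters = (num_classes + K) * B, with K = 5 and B = 3 by default."""
    return (num_classes + box_attributes) * boxes_per_cell


print(yolo_filters(6))   # 33, for six trained classes
print(yolo_filters(80))  # 255, for the 80 COCO classes
```

For the six object categories listed earlier (persons, bowling pins, potted plants, trash cans, balls, and chairs), the convolutional layer before each yolo layer therefore needs 33 filters.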


Inference

The following diagram shows the codelets used in inference.


Messages Involved

  • ColorCameraProto: Holds a color image, including camera information. This is one of the default message types sent by UE4. In this example, only the color image is used, not the camera information.
  • Detections2Proto: Holds the absolute bounding box coordinates, the class name, and the detection confidence.

Codelets Involved

The object detection pipeline involves the codelets described in this section.

ZED Camera/Image Feeder

This is a driver codelet that retrieves images from the ZED camera, or from an image path, at the specified resolution and frame rate. The codelet outputs ColorCameraProto messages.

YoloTensorRT (TensorRT Inference)

NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning. When training is complete, the frozen model is saved in Darknet format. The Darknet model is then loaded by the YoloTensorRT codelet, which performs the following operations:

  • Loads the Darknet model and converts it to TensorRT
  • Optimizes the TensorRT model based on the target device and saves the model
  • Downscales the image to the inference resolution provided as input, maintaining the aspect ratio (default is 416x416)
  • Normalizes the image so that the pixel values are in the range 0-1
  • Converts the RGB-interleaved image to RGB planar tensor format
  • Allocates the output buffer required for inference
  • Provides the normalized RGB TensorRT data as input to the inference API for YOLO (with a TensorRT backend)
  • Runs inference on the model and provides as output the unprocessed detections from the network
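
The image preparation steps in the list above (aspect-preserving downscale, normalization to 0-1, and interleaved-to-planar conversion) can be sketched in plain Python. This is not the YoloTensorRT implementation: the function names are hypothetical, and the sketch only computes the scaled size, since how the codelet pads or positions the scaled image is not described here.

```python
def scaled_size(width, height, net_width=416, net_height=416):
    """Largest size that fits the network input while keeping aspect ratio."""
    scale = min(net_width / width, net_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_normalized_planar(pixels, width, height):
    """Convert interleaved 8-bit RGB (R0, G0, B0, R1, ...) to planar floats.

    The result holds all red values, then all green, then all blue,
    each scaled to the range 0-1.
    """
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer does not match the image size")
    return [pixels[i] / 255.0
            for channel in range(3)
            for i in range(channel, len(pixels), 3)]


print(scaled_size(1280, 720))  # (416, 234)
print(to_normalized_planar([255, 0, 0, 0, 255, 0], 2, 1))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

In the second call, a two-pixel image (one red pixel, one green pixel) becomes a red plane [1.0, 0.0], a green plane [0.0, 1.0], and a blue plane [0.0, 0.0].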

The output buffer is in the following format:

{{bounding_box1{x1, y1, x2, y2}, objectness, {probability0, probability1, …, probability<N>}}, {bounding_box2{x1, y1, x2, y2}, objectness, {probability0, probability1, …, probability<N>}}, …, {bounding_box<K>{x1, y1, x2, y2}, objectness, {probability0, probability1, …, probability<N>}}}

Here, bounding_box<K> represents the Kth bounding box with minimum and maximum coordinates, objectness represents the confidence that an object is present inside the bounding box, and probability<i> represents the confidence that the object belongs to class <i>.


BoundingBox2DetectionDecoder

The BoundingBox2DetectionDecoder codelet decodes the output detections from the network in the following steps:

  1. It parses the output buffer from YoloTensorRT into the BoundingBoxDetection struct.
  2. It applies the confidence threshold to the detections.
  3. It applies NonMaximumSuppression to the detections. NonMaximumSuppression is used to make sure that in object detection, a particular object is identified only once.
  4. It outputs a list of bounding boxes in the format BoundingBoxDetection.
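
The decoding steps above can be sketched in Python as follows. This is a minimal sketch, not the codelet's code. It assumes that a detection's confidence is the objectness multiplied by the highest class probability, and that suppression is applied per class; neither detail is stated here. The function names and the tuple layout are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def decode(raw, confidence_threshold, nms_threshold):
    """Threshold raw detections and apply non-maximum suppression.

    raw is a list of (box, objectness, class_probabilities) tuples.
    Returns a list of (confidence, class_id, box) tuples.
    """
    candidates = []
    for box, objectness, probs in raw:
        class_id = max(range(len(probs)), key=probs.__getitem__)
        confidence = objectness * probs[class_id]  # assumed scoring rule
        if confidence > confidence_threshold:
            candidates.append((confidence, class_id, box))

    # Visit the most confident boxes first; drop any box that overlaps
    # an already kept box of the same class by more than nms_threshold.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for cand in candidates:
        if all(k[1] != cand[1] or iou(k[2], cand[2]) <= nms_threshold
               for k in kept):
            kept.append(cand)
    return kept
```

Sorting by confidence before suppression ensures that, among overlapping boxes for the same object, the most confident one is the one that survives.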

To Prepare to Run Inference

Set the configuration parameters in apps/samples/yolo/yolo_tensorrt.app.json to run the application.

Configuration parameters for YoloTensorRT include the following:

  • yolo_dimensions (default: (416, 416)): The image resolution. Each dimension should be a multiple of 32, to ensure YOLO network support.
  • batch_size (default: 1): The number of images run simultaneously during inference. This value is restricted by the memory available on the device running the application.
  • weights_file_path: The path to the weights file.
  • config_file_path: The path to the YOLO network configuration describing the structure of the network.
  • tensorrt_folder_path: The path to store the optimized YOLO TensorRT network.
  • network_type: The type of YOLO architecture to run inference on. Currently supported architectures are “yolov3” and “yolov3-tiny”.
  • num_classes: The number of classes trained on.
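
The constraints stated for these parameters can be checked before launching the application. The following Python sketch is a hypothetical helper, not part of the Isaac SDK. It assumes the parameters have been read into a dictionary keyed by the parameter names above, and it assumes yolov3 as the default network_type.

```python
SUPPORTED_NETWORK_TYPES = ("yolov3", "yolov3-tiny")


def check_yolo_config(config: dict) -> list:
    """Return a list of problems found in a YoloTensorRT parameter dict."""
    problems = []
    width, height = config.get("yolo_dimensions", (416, 416))
    if width % 32 or height % 32:
        problems.append("yolo_dimensions must be multiples of 32")
    if config.get("batch_size", 1) < 1:
        problems.append("batch_size must be at least 1")
    if config.get("network_type", "yolov3") not in SUPPORTED_NETWORK_TYPES:
        problems.append("network_type must be yolov3 or yolov3-tiny")
    if config.get("num_classes", 0) < 1:
        problems.append("num_classes must be at least 1")
    return problems


print(check_yolo_config({"yolo_dimensions": (416, 416), "num_classes": 6}))
# []
print(check_yolo_config({"yolo_dimensions": (400, 416), "num_classes": 6}))
# ['yolo_dimensions must be multiples of 32']
```

Returning a list of problems, rather than raising on the first one, lets all misconfigured parameters be reported at once.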

Configuration parameters of BoundingBox2DetectionDecoder include the following:

  • confidence_threshold: The probability threshold for inference. A detection is considered a true positive if its confidence is greater than this value.
  • nms_threshold: Non-maximum suppression prunes away boxes that have a high intersection-over-union (IOU) overlap with previously selected boxes. The nms_threshold is the IOU value above which two boxes are considered to overlap too much.
  • labels_file_path: Path to the labels file listing all classes the network is trained on.

To Run Inference on the Host System

Run inference on the host system with the following commands:

bazel build ...
bazel run apps/samples/yolo/yolo_tensorrt_inference

On the first run, the network is optimized, which takes longer than subsequent runs that use the cached model.

To Run Inference on Jetson Nano, Jetson TX2, or Jetson Xavier

For maximum performance, run the following commands to maximize the GPU/CPU frequency as well as CPU cores:

sudo nvpmodel -m 0
sudo ~/jetson_clocks.sh

Run inference on the Jetson system with the following commands:

./engine/build/deploy.sh -d jetpack42 -h <IP> -p //apps/samples/yolo:yolo_tensorrt-pkg
ssh <IP>
cd ~/deploy/<username>/yolo_tensorrt-pkg

Load Sight in your browser at http://localhost:3000 to see inference results and the input camera feed.

The default architecture for inference is yolov3. To run inference on tiny-yolov3, update the following parameters in the YOLO application config file:

  • yolo_dimensions (default: (416, 416)): The image resolution. Each dimension should be a multiple of 32, to ensure YOLO network support.
  • batch_size (default: 1): The number of images run simultaneously during inference. This value is restricted by the memory available on the device running the application.
  • weights_file_path: The path to the Tiny-YoloV3 weights file.
  • config_file_path: The path to the Tiny-YoloV3 network configuration describing the structure of the network.
  • tensorrt_folder_path: The path to store the optimized Tiny-YoloV3 TensorRT network.
  • network_type (default: yolov3): Set the YOLO architecture type to yolov3-tiny.
  • num_classes: The number of classes trained on.

Compile and run the inference on the target using the above procedures.