Object Detection with YOLO

Creating a pipeline for object detection with YOLO involves the following tasks:

  • Selection of a pretrained network
  • Fine-tuning the selected network with synthetic data from Unity (using IsaacSim Unity3D)
  • Converting the tuned model to TensorFlow or TensorRT for inference
  • Running inference with TensorFlow or TensorRT on either the host or the target system

Selecting a Pretrained Network

The You Only Look Once (YOLO) network has the following characteristics:

  • Easily trained with TensorFlow or Keras
  • Easily trained further with a small set of synthetic data
  • Supports TensorRT inference
  • Includes a suitable license on dataset and network
  • Delivers state-of-the-art, real-time object detection: version 3 achieves both high precision and high speed on the COCO dataset, and the alternative tiny-YOLO network runs even faster without a great sacrifice of precision

Some target devices may not have enough memory to run a network like yolov3. In that case, run tiny-yolov3 instead. See To Run inference on the Tiny Yolov3 Architecture for instructions.

Sample results using the YOLO v3 network, with detected objects shown in bounding boxes of different colors, are shown in the following figure:


Training YOLO with NavSim

An example application for training YOLO with NavSim is provided in apps/samples/yolo/yolo_training_navsim.app.json. You need to make a few modifications to the apps/samples/yolo/keras-yolo3/configs/isaac_object_detection.json config file to run the training sample with the NavSim “object_expo” scene:

  1. Change the app_filename parameter to apps/samples/yolo/yolo_training_navsim.app.json.
  2. Change the classes_path parameter to apps/samples/yolo/keras-yolo3/model_data/object_classes_navsim.txt.
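
The two changes above can also be applied with a short script. The following is a minimal sketch, assuming the config file is a flat JSON object with top-level app_filename and classes_path keys. That layout is an assumption, and the helper names are hypothetical, so check them against your copy of the file.

```python
import json

# Values taken from the two steps above.
NAVSIM_OVERRIDES = {
    "app_filename": "apps/samples/yolo/yolo_training_navsim.app.json",
    "classes_path": "apps/samples/yolo/keras-yolo3/model_data/object_classes_navsim.txt",
}


def apply_navsim_overrides(config):
    """Return a copy of the training config with the NavSim values set."""
    updated = dict(config)
    updated.update(NAVSIM_OVERRIDES)
    return updated


def rewrite_config_file(path):
    """Load the JSON config at `path`, apply the overrides, and write it back."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(apply_navsim_overrides(config), f, indent=2)
```

All other parameters in the file are left unchanged.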

You can see the data used for training at http://localhost:3001/. When training, ignore warning messages about skipping loading of weights for layers.

The log directory can be changed in /apps/samples/yolo/keras-yolo3/configs/issac_object_expo.json. To see training progress, run TensorBoard with the following command:

tensorboard --logdir ~/yolo3_logs

Training with Your Own AssetBundle

The default scene uses objects from the AssetBundle in navsim_Data/StreamingAssets/AssetBundles/warehouseobjects. You can run the scene with a different AssetBundle using the command line argument --assetBundle, followed by the path to your own AssetBundle. The path can be relative to navsim_Data/StreamingAssets or absolute. For example, there is another AssetBundle in navsim_Data/StreamingAssets named “owenobjects”, and you can generate training data using this AssetBundle instead with the following command:

bob@desktop:~/isaac/packages/navsim/unity$ ./navsim.x86_64 --scene object_expo \
    --assetBundle AssetBundles/owenobjects

The asset randomizer draws from all the Prefabs in the AssetBundle and uses the name of each Prefab as its class label. To train with your own models, follow the procedures in the Unity documentation to create an AssetBundle with all the Prefabs to train on, and make sure their names match the desired class labels. You can optionally provide a JSON file with the --labels command line argument to override the labels of certain Prefabs. For example, create a labels.json file in the packages/navsim/unity/navsim_Data/StreamingAssets folder with the following contents:

{
  "PalletWood01": "wood_pallet",
  "PalletWood02": "wood_pallet",
  "PalletPlastic": "plastic_pallet"
}

Then run the “object_expo” scene:

bob@desktop:~/isaac/packages/navsim/unity$ ./navsim.x86_64 --scene object_expo \
    --assetBundle AssetBundles/owenobjects --labels labels.json

You can see in Sight that both PalletWood01 and PalletWood02 are labeled as “wood_pallet”, and the PalletPlastic is labeled as “plastic_pallet”.

When training, make sure to update the apps/samples/yolo/keras-yolo3/model_data/object_classes_navsim.txt file and DetectionEncoder configuration in apps/samples/yolo/yolo_training_navsim.app.json with the correct object labels from the new AssetBundle.
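
The labeling rule described in this section (Prefab name by default, overridden by the --labels file) can be sketched as follows to work out which class labels the training configuration needs. This is an illustration, not the simulator's implementation. The function names are hypothetical, "CardboardBox" is a made-up Prefab used only to show the fallback, and writing one label per line to the classes file is an assumption based on the usual keras-yolo3 convention.

```python
def resolve_labels(prefab_names, overrides=None):
    """Map each Prefab name to its class label, applying optional overrides."""
    overrides = overrides or {}
    return {name: overrides.get(name, name) for name in prefab_names}


def class_list(prefab_names, overrides=None):
    """Return the unique class labels in first-seen order."""
    labels = resolve_labels(prefab_names, overrides)
    unique = []
    for name in prefab_names:
        if labels[name] not in unique:
            unique.append(labels[name])
    return unique


# Contents of the labels.json example above.
overrides = {
    "PalletWood01": "wood_pallet",
    "PalletWood02": "wood_pallet",
    "PalletPlastic": "plastic_pallet",
}
# "CardboardBox" is hypothetical; it has no override and keeps its own name.
prefabs = ["PalletWood01", "PalletWood02", "PalletPlastic", "CardboardBox"]

# Candidate contents for the classes file (assumed: one label per line).
classes_text = "\n".join(class_list(prefabs, overrides))
```

The same list of labels is what the DetectionEncoder configuration needs to contain.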

Training Configurations

The training configuration is in the apps/samples/yolo/keras-yolo3/configs/isaac_object_detection.json file, which has the following elements:

  • classes_path: The path to the file listing the category names of all trained objects. These names should match the names configured in detection_encoder.
  • anchors_path: The path to the anchor points file, which determines whether the YOLO or tiny-YOLO model is trained.
  • lr_stage1: The learning rate of stage 1, when only the heads of the YOLO network are trained.
  • lr_stage2: The learning rate of stage 2, when all of the layers are fine-tuned.
  • gamma: The learning rate decay rate.
  • batch_size_stage1: The batch size of stage 1.
  • batch_size_stage2: The batch size of stage 2.
  • num_epochs_stage1: The number of epochs to train for stage 1.
  • num_epochs_total: The number of epochs to train in total (including both stages).

The log_dir subdirectory contains log files, including trained models.
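
Before launching a long training run, it can help to check that the configuration is internally consistent. The following is a minimal sketch that uses only the parameter names listed above. The checks themselves, the function names, and all example values are assumptions for illustration, not part of the training sample.

```python
REQUIRED_KEYS = [
    "classes_path", "anchors_path", "lr_stage1", "lr_stage2", "gamma",
    "batch_size_stage1", "batch_size_stage2",
    "num_epochs_stage1", "num_epochs_total",
]


def check_training_config(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in cfg]
    if problems:
        return problems
    if cfg["num_epochs_stage1"] > cfg["num_epochs_total"]:
        problems.append("num_epochs_stage1 exceeds num_epochs_total")
    for key in ("lr_stage1", "lr_stage2"):
        if cfg[key] <= 0:
            problems.append(key + " must be positive")
    for key in ("batch_size_stage1", "batch_size_stage2"):
        if cfg[key] < 1:
            problems.append(key + " must be at least 1")
    return problems


def stage2_epochs(cfg):
    """num_epochs_total covers both stages, so stage 2 gets the remainder."""
    return cfg["num_epochs_total"] - cfg["num_epochs_stage1"]


# Placeholder values for illustration only; they are not recommended settings.
example = {
    "classes_path": "path/to/classes.txt",
    "anchors_path": "path/to/anchors.txt",
    "lr_stage1": 1e-3, "lr_stage2": 1e-4, "gamma": 0.5,
    "batch_size_stage1": 32, "batch_size_stage2": 8,
    "num_epochs_stage1": 20, "num_epochs_total": 50,
}
```

With these placeholder values, stage 2 runs for the 30 epochs that remain after stage 1.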

To Train

After finishing the setup steps and launching the simulation, run the following command:

bazel run //apps/samples/yolo:yolo_training

To monitor the training process in TensorBoard, navigate to the log_dir and run the following command:

tensorboard --logdir .

View training images with bounding box overlays in Sight.

Freezing the Model

After selecting the trained model with the best AP score, freeze it in either TensorFlow or Darknet format. Use export_model.py to export the model to pb, onnx, and h5 formats. Then use keras_to_darknet.py to convert the generated h5 file to Darknet format. The resulting models can be used in the inference codelet. Note that keras_to_darknet.py only supports the h5 format in this release.

python keras_to_darknet.py [config_file] [keras_weights_file] [out_file] [num_classes]

Make sure the number of classes in the yolo section of the darknet configuration is updated with the number of classes trained. The number of filters in the convolutional layer before the yolo layer is:

filters = (num_classes + K) * B

B is the number of bounding boxes a cell on the feature map can predict, 3 in the case of yolov3 and yolov3-tiny. K is the sum of the number of bounding box attributes and confidence, in this case: 4 + 1 = 5.
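
As a quick check of the formula above, the following sketch computes the filter count for a given number of classes. The function name is hypothetical. The defaults B = 3 and K = 5 are the values stated in the text.

```python
def yolo_filters(num_classes, boxes_per_cell=3, box_attributes=5):
    """filters = (num_classes + K) * B for the layer before each yolo layer."""
    return (num_classes + box_attributes) * boxes_per_cell


print(yolo_filters(80))  # 255, for the 80 COCO classes
print(yolo_filters(3))   # 24, for a model trained on 3 classes
```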


The following diagram describes the different codelets used in inference.


Messages Involved

  • ColorCameraProto: This message type holds a color image along with camera information. This example uses only the color image, not the camera information.
  • Detections2Proto: This message type holds absolute bounding box coordinates, the class name, and the detection confidence.

Codelets Involved

The object detection pipeline involves the codelets described in this section.

Zed Camera/Image Feeder

This driver codelet retrieves images from the ZED camera or from an image path at the specified resolution and frame rate. The codelet outputs ColorCameraProto messages.

YoloTensorRT (TensorRT Inference)

NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference. When training is complete, the frozen model is saved in Darknet format. The YoloTensorRT codelet then loads the Darknet model and performs the following operations:

  • Loads the Darknet model and converts it to TensorRT
  • Optimizes the TensorRT model for the target device and saves the model
  • Downscales the image to the inference resolution provided as input (416x416 by default), maintaining the aspect ratio
  • Normalizes the image so that the pixel values are in the range 0-1
  • Converts the RGB-interleaved image to RGB planar tensor format
  • Allocates the output buffer required for inference
  • Provides the normalized RGB TensorRT data as input to the inference API for YOLO (with a TensorRT backend)
  • Runs inference on the model and provides the unprocessed detections from the network as output
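
The image preparation steps in the list above (aspect-preserving resize, normalization to 0-1, and interleaved-to-planar conversion) can be sketched in plain Python as follows. This is an illustration only, not the codelet's implementation. The function names are hypothetical, and centering the resized image with equal padding on both sides is an assumption, since the text states only that the aspect ratio is maintained.

```python
def letterbox_geometry(src_w, src_h, dst_w=416, dst_h=416):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    The padding values assume the resized image is centered in the
    destination; that placement is an assumption of this sketch.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2


def to_planar_normalized(image):
    """Convert rows of (r, g, b) byte triples to three planes of floats in [0, 1]."""
    planes = [[], [], []]
    for row in image:
        for channel in range(3):
            planes[channel].append([pixel[channel] / 255.0 for pixel in row])
    return planes


# A 1280x720 frame keeps its 16:9 shape inside the 416x416 network input.
print(letterbox_geometry(1280, 720))  # (416, 234, 0, 91)
```

The planar result holds all red values first, then all green values, then all blue values.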

The output buffer is in the following format:

{{bounding_box1{x1, y1, x2, y2}, objectness, {probability0, probability1, ..., probability<N>}}, {bounding_box2{x1, y1, x2, y2}, objectness, {probability0, probability1, ..., probability<N>}}, ..., {bounding_box<K>{x1, y1, x2, y2}, objectness, {probability0, probability1, ..., probability<N>}}}

Here, bounding_box<K> represents the Kth bounding box with minimum and maximum coordinates, objectness represents the confidence that an object is present inside the bounding box, and probability<i> represents the confidence that the object belongs to class <i>.
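
To make the buffer layout concrete, the following sketch splits such a buffer into per-box records. It assumes the buffer is available as a flat sequence of floats with four coordinates, one objectness value, and one probability per class for each box. That memory layout, the function name, and the dictionary keys are assumptions for illustration.

```python
def parse_detections(buffer, num_classes):
    """Split a flat output buffer into one record per bounding box."""
    stride = 4 + 1 + num_classes  # box coordinates + objectness + class scores
    if len(buffer) % stride != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    detections = []
    for start in range(0, len(buffer), stride):
        x1, y1, x2, y2, objectness = buffer[start:start + 5]
        detections.append({
            "box": (x1, y1, x2, y2),
            "objectness": objectness,
            "probabilities": list(buffer[start + 5:start + stride]),
        })
    return detections
```

For a network trained on two classes, each record spans seven values, so a buffer of 14 floats yields two detections.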


The BoundingBox2DetectionDecoder codelet decodes the output detections from the network in the following steps:

  1. It parses the output buffer from YoloTensorRT into the BoundingBoxDetection struct.
  2. It applies the confidence threshold to the detections.
  3. It applies NonMaximumSuppression to the detections, which ensures that each object is identified only once.
  4. It outputs a list of bounding boxes in the format BoundingBoxDetection.
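
Steps 2 and 3 can be sketched as follows. This is a generic greedy non-maximum suppression in plain Python, not the codelet's implementation. Representing each detection as a (box, confidence, class_id) tuple and suppressing only boxes of the same class are both assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decode(detections, confidence_threshold, nms_threshold):
    """Filter by confidence, then keep the best box of each overlapping group.

    detections: list of (box, confidence, class_id) tuples.
    """
    candidates = sorted(
        (d for d in detections if d[1] > confidence_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep the box unless it overlaps too much with a stronger box
        # of the same class that was already selected.
        if all(det[2] != k[2] or iou(det[0], k[0]) <= nms_threshold for k in kept):
            kept.append(det)
    return kept
```

Processing candidates in descending order of confidence means that, within a group of overlapping boxes, the highest-scoring box is the one that survives.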

To Prepare to Run Inference

Set the configuration parameters in apps/samples/yolo/yolo_tensorrt.app.json to run the application.

Running this application for the first time builds the TensorRT engine for the given target device and caches this optimized YOLO TensorRT network file in the folder specified by the tensorrt_folder_path parameter. On subsequent runs, the application attempts to load this cached .engine file if the configuration parameters did not change. To force the application to rebuild the TensorRT engine, delete the .engine file from the tensorrt_folder_path location.
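
Forcing a rebuild amounts to removing the cached engine files. The following is a minimal sketch, assuming the cached files sit directly in the tensorrt_folder_path folder and use the .engine extension mentioned above. The function name is hypothetical.

```python
from pathlib import Path


def clear_cached_engines(tensorrt_folder_path):
    """Delete cached .engine files and return the names that were removed."""
    removed = []
    for engine in sorted(Path(tensorrt_folder_path).glob("*.engine")):
        engine.unlink()
        removed.append(engine.name)
    return removed
```

Other files in the folder, such as weights or configuration files, are left untouched.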

Configuration parameters for YoloTensorRT include the following:

  • yolo_dimensions (Default: (416, 416)): The image resolution. Each dimension should be a multiple of 32 to ensure YOLO network support.
  • batch_size (Default: 1): The number of images run simultaneously during inference. This value is restricted by the memory available on the device running the application.
  • weights_file_path: The path to the weights file.
  • config_file_path: The path to the YOLO network configuration describing the structure of the network.
  • tensorrt_folder_path: The path to store the optimized YOLO TensorRT network.
  • network_type: The type of YOLO architecture to run inference on. The currently supported architectures are “yolov3” and “yolov3-tiny”.
  • num_classes: The number of classes trained on.
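
The constraints stated in this list can be checked before starting the application. The sketch below assumes the parameters have been read into a Python dictionary keyed by the names above. The function name and the checks are illustrative and are not part of the sample application.

```python
SUPPORTED_NETWORKS = ("yolov3", "yolov3-tiny")


def check_inference_config(cfg):
    """Return a list of problems found in the YoloTensorRT parameters."""
    problems = []
    for side in cfg.get("yolo_dimensions", (416, 416)):
        if side <= 0 or side % 32 != 0:
            problems.append("yolo_dimensions must be positive multiples of 32")
            break
    if cfg.get("batch_size", 1) < 1:
        problems.append("batch_size must be at least 1")
    if cfg.get("network_type", "yolov3") not in SUPPORTED_NETWORKS:
        problems.append("network_type must be yolov3 or yolov3-tiny")
    if cfg.get("num_classes", 0) < 1:
        problems.append("num_classes must be at least 1")
    return problems
```

For example, a resolution of (400, 400) is reported as a problem because 400 is not a multiple of 32, while (416, 416) and (608, 608) pass the check.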

Configuration parameters of BoundingBox2DetectionDecoder include the following:

  • confidence_threshold: The probability threshold for inference. A detection is considered a true positive if its confidence is greater than the value of this parameter.
  • nms_threshold: Non-maximum suppression prunes away boxes that have a high intersection-over-union (IOU) overlap with previously selected boxes. The nms_threshold is the IOU value above which two boxes are considered to overlap too much.
  • labels_file_path: The path to the labels file listing all classes the network is trained on.

To Run Inference on the Host System

Run inference on the host system with the following commands:

bazel build ...
bazel run apps/samples/yolo/yolo_tensorrt_inference

On the first run, optimizing the network takes longer than subsequent runs, which load the cached model.

To Run Inference on Jetson Nano, Jetson TX2, or Jetson Xavier

For maximum performance, run the following commands to maximize the GPU and CPU clock frequencies and enable all CPU cores:

sudo nvpmodel -m 0
sudo ~/jetson_clocks.sh

Deploy //apps/samples/yolo:yolo_tensorrt-pkg to the Jetson system as explained in Deploying and Running on Jetson.

Run inference on the Jetson system with the following commands:

cd ~/deploy/<username>/yolo_tensorrt-pkg

Load Sight in your browser at http://localhost:3000 to see inference results and the input camera feed.

To Run inference on the Tiny Yolov3 Architecture

The default architecture for inference is yolov3. To run inference on tiny-yolov3, update the following parameters in the YOLO application config file:

  • yolo_dimensions (Default: (416, 416)): The image resolution. Each dimension should be a multiple of 32 to ensure YOLO network support.
  • batch_size (Default: 1): The number of images run simultaneously during inference. This value is restricted by the memory available on the device running the application.
  • weights_file_path: The path to the Tiny-YoloV3 weights file.
  • config_file_path: The path to the Tiny-YoloV3 network configuration describing the structure of the network.
  • tensorrt_folder_path: The path to store the optimized Tiny-YoloV3 TensorRT network.
  • network_type (Default: yolov3): Set the YOLO architecture type to yolov3-tiny.
  • num_classes: The number of classes trained on.

Compile and run the inference on the target using the above procedures.

Sample Inference Output