Object Detection with DetectNetv2

Isaac SDK supports a training/inference pipeline for object detection with DetectNetv2. For this pipeline, DetectNetv2 utilizes the ResNet backbone feature extractor. ResNet is an industrial network that is on par with MobileNet and InceptionNet (two common backbone models for feature extraction). The NVIDIA Transfer Learning Toolkit (TLT) can be used to train, fine-tune, and prune DetectNetv2 models for object detection.



The following sections explain how to:

  1. Generate dataset images from IsaacSim for Unity3D.

  2. Train a pre-trained DetectNetv2 model on the generated dataset.

  3. Run inference on various inputs using the Isaac TensorRT Inference codelet.


Training a DetectNetv2 model involves generating simulated data and using TLT to train a model on this data. Isaac SDK provides a sample model, based on ResNet18, that has been trained using this pipeline to detect a single object: the dolly shown below. The following step-by-step instructions walk through the process of how this model was trained. Use these steps as guidelines to train models on your own objects.


  1. Set up IsaacSim for Unity3D to generate simulated images for the objects of interest.

  1. Open the sample scene to generate data, available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectDetection/. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.

  2. Objects are spawned in the procedural > objects GameObject. The list of objects for training is in packages/Nvidia/Samples/ObjectDetection/ObjectDetectionAssetGroup. By default, this AssetGroup contains the dolly prefab. Modify the list of GameObjects to match the list of objects you wish to train on by increasing the size of the ObjectDetectionAssetGroup and dragging each new prefab into this list. Each prefab in this list should contain a LabelSetter component that contains the name of the object.

    If you would like each label from the prefab to be associated with the same instance, add an InstanceLabelGroup to the prefab as well. For example, if each wheel in the dolly prefab has the “wheel” label, an InstanceLabelGroup component in the game object containing all the wheels would result in one bounding box containing all wheels, instead of four separate boxes, one per wheel.

  3. Modify the MaxCount and MaxTrials parameters in the procedural > objects > Collider Asset Spawner component to reflect the number of objects to spawn each frame. The maxCount parameter specifies the number of objects to spawn. The maxPickTrials and maxPlaceTrials values denote how many times each object should be placed again if the initial spawning location is invalid. Additionally, the Dropout parameter under procedural > objects > Collider Asset Spawner represents the probability of an asset being “dropped out” of the frame (the default value is 0.2). Increasing this value will result in a dataset with more negative samples, which should be present in the dataset to minimize false positives during inference.

  4. Modify the ClassLabelManager game object in the scene. By default, it contains one class label rule (dolly) and two class labels (one for background, and one for dolly). Modify this such that there is one class label rule and one class label per object in your ObjectDetectionAssetGroup. Set the “name” and “expression” fields to the label of the object; this should match the string that was set as the label in the LabelSetter component of the prefab. Make sure that the rule index of each object class label is the same as its class label index (for example, the dolly uses index 1 by default). The index value is written to the pixels of the label image that is later used to generate bounding boxes. Leave the “Default Label” field set to 0, as it is the value used to populate all the pixels that are not associated with objects (background pixels).
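To illustrate how a class label index in a label image can lead to a bounding box, here is a minimal sketch in plain Python. It is not the Isaac SDK implementation: the helper name `bounding_box_for_index` and the toy label image are assumptions made for illustration, and the sketch treats all pixels of one class as a single object rather than separating instances.

```python
from typing import List, Optional, Tuple


def bounding_box_for_index(
    label_image: List[List[int]], class_index: int
) -> Optional[Tuple[int, int, int, int]]:
    """Hypothetical helper: tight box around all pixels equal to class_index.

    Returns (row_min, col_min, row_max, col_max), or None if the index
    does not appear in the label image.
    """
    rows = [r for r, row in enumerate(label_image) if class_index in row]
    cols = [c for row in label_image for c, v in enumerate(row) if v == class_index]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# 0 is the "Default Label" (background); 1 is the dolly class label index.
label_image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

dolly_box = bounding_box_for_index(label_image, 1)
print(dolly_box)  # (1, 1, 2, 3)
```

Because background pixels keep the value 0, any non-zero index that matches a class label can be located this way, which is why the rule index and class label index need to agree.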

  1. Generate a dataset in KITTI format with simulated images of the objects of interest.

  1. Configure parameters for the dataset in packages/ml/apps/generate_kitti_dataset/generate_kitti_dataset.app.json. Here the config can be modified to vary, among other parameters, the output resolution of the images (for best results, use dimensions that are multiples of 16), number of training images, and number of testing images to create. The default application generates a dataset of 10k training images and 100 testing images; all images are in PNG format, with a resolution of 640x368.

  2. Run the following application to generate a dataset for input to the TLT training pipeline:


    bazel run packages/ml/apps/generate_kitti_dataset

    On completion, the application will create a directory (/tmp/unity3d_kitti_dataset by default) with the following structure:


    unity3d_kitti_dataset/
      training/
        image_2/    [training images]
          000001.png
          000002.png
          ...
        label_2/    [training labels in kitti format]
          000001.txt
          000002.txt
          ...
      testing/
        image_2/    [testing images]
          000001.png
          000002.png
          ...
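Before moving on, it can be useful to confirm that the generated directory has the layout described above. The following is a small sketch, not part of Isaac SDK: the function name `check_kitti_layout` is hypothetical, and the demonstration builds a throwaway directory rather than reading /tmp/unity3d_kitti_dataset, so point `root` at your own dataset when using it.

```python
import tempfile
from pathlib import Path


def check_kitti_layout(root: Path) -> dict:
    """Count files and list training images that lack a matching label file."""
    train_images = {p.stem for p in (root / "training" / "image_2").glob("*.png")}
    train_labels = {p.stem for p in (root / "training" / "label_2").glob("*.txt")}
    test_images = {p.stem for p in (root / "testing" / "image_2").glob("*.png")}
    return {
        "training_images": len(train_images),
        "training_labels": len(train_labels),
        "testing_images": len(test_images),
        "missing_labels": sorted(train_images - train_labels),
    }


# Demonstration on a temporary directory that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "unity3d_kitti_dataset"
    for sub in ("training/image_2", "training/label_2", "testing/image_2"):
        (root / sub).mkdir(parents=True)
    for name in ("000001", "000002"):
        (root / "training" / "image_2" / f"{name}.png").touch()
    # Only the first training image gets a label, so 000002 is reported.
    (root / "training" / "label_2" / "000001.txt").touch()
    (root / "testing" / "image_2" / "000001.png").touch()
    report = check_kitti_layout(root)

print(report)
```

An empty `missing_labels` list indicates that every training image has a label file with the same name.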

  1. Create a local directory called tlt-experiments to mount in the docker container. Move the unity3d_kitti_dataset directory into this directory.

  2. Follow these instructions from IVA to set up docker and NGC.

  3. Start a docker container and mount the directory with the commands outlined here. The docker container includes all the necessary files to train a DetectNetv2 model.

  4. Navigate to the /workspace/examples/detectnet_v2/ directory in the docker image.

  5. Copy the /workspace/examples/detectnet_v2/specs folder into your workspace/tlt-experiments folder. We will later modify these specs in the mounted folder so that the training specs persist after the docker container is terminated.

  6. Start a Jupyter notebook server as described in the TLT documentation:


    jupyter notebook --ip --allow-root

  7. Open the detectnet_v2.ipynb notebook and follow the instructions, taking into account these special instructions for each step.

  1. Set up env variables:

    • $KEY: Create a “key”, which will be used to protect trained models and must be known at inference time to access model weights.

    • $USER_EXPERIMENT_DIR: Leave this set to /workspace/tlt-experiments.

    • $DATA_DOWNLOAD_DIR: Set this to the path of your unity3d_kitti_dataset.

    • $SPECS_DIR: Set this to the path of the specs directory that you copied into the mounted tlt-experiments folder earlier.

  2. Verify the downloaded dataset. Skip the first two cells, which download a KITTI object detection dataset into the $DATA_DOWNLOAD_DIR specified above. The simulated dataset from Unity3D should already be at this path, so run the last two cells of this section to validate your simulated dataset.

  3. Prepare tf records from the KITTI format dataset. Modify the $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt file to reflect the correct dataset path. An example is provided below for training dolly detection.


    kitti_config {
      root_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
      image_dir_name: "image_2"
      label_dir_name: "label_2"
      image_extension: ".png"
      partition_mode: "random"
      num_partitions: 2
      val_split: 14
      num_shards: 10
    }
    image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"

    Then run the cells as instructed in the notebook. The cell containing the tlt-dataset-convert command will output a message regarding the class map, such as the one below. Note the value under “Label in tfrecords file”. This value will be used as the key when writing the training configuration in the “Modify the training parameters” step below.


    2020-05-09 01:30:12,694 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
    Label in GT: Label in tfrecords file
    dolly: dolly
    For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

  4. Download the pre-trained model: Run the cells as instructed in the notebook.

  5. Modify the training parameters for object classes in $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt for your use case:

  1. First, change the dataset_config > data_sources > image_directory_path and tfrecords_path to the training folder inside your generated dataset:


    dataset_config {
      data_sources {
        tfrecords_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/tfrecords/kitti_trainval/*"
        image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
      }

  2. Update the list of target_class_mapping parameters, adding one for each object class. For each object, the key field of this struct should exactly match the corresponding “Label in tfrecords file” value reported by tlt-dataset-convert when preparing the tf records.


    target_class_mapping {
      key: "dolly"
      value: "dolly"
    }

  3. Edit the output_image_width and output_image_height parameters under augmentation_config > preprocessing.


    preprocessing {
      output_image_width: 640
      output_image_height: 368
      ...
    }

  4. Under the postprocessing_config header, make sure there is one target_class_config configuration per object class. Leave the clustering_config set to default values.


    target_class_config {
      key: "dolly"
      value {
        clustering_config {
          ...
        }
      }
    }

  5. Use the default values for the model_config section.

  6. Modify the evaluation_config section. Edit the validation_period_during_training parameter to change the number of epochs between validation steps. Make sure there is one minimum_detection_ground_truth_overlap and one evaluation_box_config struct for each object class, using the default values within the struct:


    evaluation_config {
      validation_period_during_training: 10
      first_validation_epoch: 1
      minimum_detection_ground_truth_overlap {
        key: "dolly"
        value: 0.5
      }
      evaluation_box_config {
        key: "dolly"
        value {
          ...
        }
      }
      ...
    }

  7. In cost_function_config, make sure that there is one target_classes struct per object class, using the default values within the struct.


    The cost_function_config section contains parameters for setting the per-class weights used when calculating the loss (cost).

  8. Modify the training_config section. In this example, the images are 640x368, so the batch_size_per_gpu can be increased to 16 for faster learning, thus allowing for reduction of the num_epochs to 100. Use the default values for the learning_rate, regularizer, optimizer, and cost_scaling parameters, keeping in mind that these can be adjusted if needed. By default, the training will output a model checkpoint every 10 epochs; modify the checkpoint_interval parameter to change this frequency.

  9. Modify the bbox_rasterizer_config section to have one target_class_config per object class. For the dolly object, these values were used:


    bbox_rasterizer_config {
      target_class_config {
        key: "dolly"
        value: {
          cov_center_x: 0.5
          cov_center_y: 0.5
          cov_radius_x: 0.4
          cov_radius_y: 0.4
          bbox_min_radius: 1.0
        }
      }
      ...
    }

For more guidance on these training parameters, see the TLT documentation and this blog post.
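Several sections of the training spec need one entry per object class, which becomes repetitive when training on more than one object. The sketch below shows one way to generate the target_class_mapping entries from a list of class names. It is an illustration only: the function name `render_target_class_mappings` and the second class name "pallet" are hypothetical, and the sketch assumes each class maps to a value equal to its key, as in the dolly example.

```python
from typing import Iterable


def render_target_class_mappings(class_names: Iterable[str]) -> str:
    """Render one target_class_mapping block per class name.

    Each key must match the "Label in tfrecords file" value for that class.
    Assumption: the mapped value is the same string as the key.
    """
    blocks = []
    for name in class_names:
        blocks.append(
            "target_class_mapping {\n"
            f'  key: "{name}"\n'
            f'  value: "{name}"\n'
            "}"
        )
    return "\n".join(blocks)


# "pallet" is a made-up second class used only to show the repetition.
spec_text = render_target_class_mappings(["dolly", "pallet"])
print(spec_text)
```

The printed text can be pasted into the dataset_config section of the spec file. The same pattern could be adapted for the other per-class structs, using the default values from the provided spec.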

  1. Run TLT training using the tlt-train command, as shown in the notebook.

  2. Evaluate the trained model. Run the tlt-evaluate command as shown in the notebook to evaluate the final trained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

  3. Prune the trained model to reduce the number of parameters, thus decreasing inference runtimes and the overall size of the model. To prune, run the tlt-prune command as shown in the notebook. Read the pruning instructions and adjust the pruning threshold accordingly. A pth value of 0.01 is a good starting point for detectnet_v2 models. We recommend a pruning ratio between 0.1 and 0.3.

  4. Retrain the pruned model by modifying the $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt file, similar to $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt. Update the model_config so that the load_graph option is set to true. Make sure to also set the correct path to the pruned model from the previous step in the pretrained_model_file parameter under model_config.

  5. Evaluate the retrained model. Run the tlt-evaluate command as shown in the notebook to evaluate the retrained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

  6. Edit the $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt file to set inference parameters. In the inferencer_config, set the target classes and inference dimensions accordingly, and provide the correct path to the model to be used for inference. In the bbox_handler_config, make sure there is one classwise_bbox_handler_config per class with the appropriate key, in addition to the default classwise_bbox_handler_config.

  1. Visualize inferences using the tlt-infer command as shown in the notebook. Update the -i flag to the testing directory of the simulated dataset and the -m flag to the path to the retrained model.

  2. After the model is trained, pruned, and evaluated to your satisfaction, export it using the tlt-export command under the “Deploy!” section of the notebook. This will provide you with a file of .etlt format, which you can then use for inference with Isaac.

A sample DetectNetv2 model that was trained using the above workflow is provided. This model was trained on a different dolly than the one shown above, but with the same configuration. In addition, a sample inference application is provided in packages/detect_net/apps, utilizing the detect_net_inference subgraph located in the same folder. With this app, you can do the following:

  • Run inference on a set of real images:

    bazel run packages/detect_net/apps:detect_net_inference_app -- --mode image --rows 480 --cols 848

  • Run inference on a recorded Isaac log:

    bazel run packages/detect_net/apps:detect_net_inference_app -- --mode cask --rows 480 --cols 848

  • Run inference on an image stream from Isaac Sim Unity3D:

    bazel run packages/detect_net/apps:detect_net_inference_app -- --mode sim

  • Run inference on a camera feed from an Intel Realsense camera:

    bazel run packages/detect_net/apps:detect_net_inference_app -- --mode realsense

  • Run inference on a camera feed from a V4L camera (be sure to adjust the framerate and resolution according to your camera):

    bazel run packages/detect_net/apps:detect_net_inference_app -- --mode v4l --fps 30 --rows 448 --cols 800

  • Run inference on a Jetson device. See the Developing Codelets in Python page to learn more about deploying a Python app to a Jetson device.


When performing inference on the sample model, the resolution of input images must be greater than or equal to 640x368. The inference application uses the ColorCameraEncoder codelet to downscale input images to match the network input resolution, which is 640x368 for the provided sample dolly detection network. However, ColorCameraEncoder does not support upscaling, so images that are input to the inference applications cannot have a smaller resolution than the network input resolution in either dimension.
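The resolution constraint above can be expressed as a simple check before launching the inference application. This is a sketch only: the function name `can_downscale_to_network` is hypothetical, and the default network size is the 640x368 input of the sample dolly model, so substitute the input size of your own model.

```python
# Input resolution of the provided sample dolly detection network.
NETWORK_ROWS = 368
NETWORK_COLS = 640


def can_downscale_to_network(
    rows: int,
    cols: int,
    net_rows: int = NETWORK_ROWS,
    net_cols: int = NETWORK_COLS,
) -> bool:
    """Return True if the image is at least as large as the network input
    in both dimensions, so that only downscaling is required."""
    return rows >= net_rows and cols >= net_cols


print(can_downscale_to_network(480, 848))  # True
print(can_downscale_to_network(448, 800))  # True
print(can_downscale_to_network(360, 640))  # False: 360 rows is below 368
```

The first two calls use the --rows and --cols values from the sample commands above. The last one shows a 640x360 image being rejected because it would need upscaling in the vertical dimension.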

These applications can be modified to run inference on your own trained models. To do so, modify the --model_file_path command line argument for the sample application. Be sure to also modify the --etlt_password parameter accordingly. Note that if the input tensor info changes, the “detect_net_inference.tensor_encoder” configuration must be changed to match in the detect_net_inference subgraph.

The object pose estimation pipeline is one of the many use cases for DetectNet. For more sample applications and models, please refer to the 3D Object Pose Estimation with Pose CNN Decoder documentation.


This sample was trained on a limited dataset and is not guaranteed to work in every situation and lighting condition. To improve model accuracy in a custom environment, you can train your own model using the instructions provided above.

Evaluation of a model can help improve the model in several ways:

  1. Data validation: A model is only as good as the data it was trained on. There are many aspects to a training dataset that can affect performance: data integrity, class balance/imbalance, etc.

  2. Model improvement: Developers may wish to make incremental changes to model architectures, hyperparameters, etc. in order to explore their effects on performance.

One of the most common metrics used to evaluate object detection models is Average Precision (AP). Precision is calculated as \(\text{true positives} / (\text{true positives} + \text{false positives})\), and average precision (AP) is the precision averaged over image frames. Average recall (AR) is also an important measure, where recall is \(\text{true positives} / (\text{true positives} + \text{false negatives})\). Precision quantifies how well each prediction made by the network matches a ground truth object, while recall captures how many ground truth objects are identified by the network.
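As a worked example of these formulas, the sketch below computes precision and recall per frame from counts of true positives, false positives, and false negatives, then averages them over frames. The per-frame counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not necessarily the one used by the Isaac evaluation application.

```python
from typing import List, Tuple


def precision(tp: int, fp: int) -> float:
    """true positives / (true positives + false positives)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """true positives / (true positives + false negatives)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Made-up per-frame counts: (true positives, false positives, false negatives).
frames: List[Tuple[int, int, int]] = [(2, 0, 1), (1, 1, 0)]

average_precision = mean([precision(tp, fp) for tp, fp, _ in frames])
average_recall = mean([recall(tp, fn) for tp, _, fn in frames])

print(average_precision)  # 0.75
print(round(average_recall, 4))  # 0.8333
```

In the first frame every prediction is correct but one object is missed, so precision is 1.0 and recall is 2/3. In the second frame one of two predictions is wrong, so precision is 0.5 and recall is 1.0.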

The basic values needed to calculate the above metrics are the true positive (TP), false positive (FP), and false negative (FN) scores. In other words, we need to build a confusion matrix for the inference results. To determine if a prediction and a ground truth bounding box match well enough to consider it a true positive, we use the IOU (Intersection over Union) threshold. IOU is a measure of how much two bounding boxes overlap (0 being no overlap, and 1 being an exact match). Setting a lower IOU threshold corresponds to higher tolerance for bounding box errors. We define true positives as the bounding box pairs for which the IOU score is greater than the IOU threshold. The following image shows the ground truth box in black and the predicted bounding box in green for a sample image.
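The IOU computation and the thresholding step can be sketched as follows. This is an illustrative implementation, not the one used by the Isaac evaluation application: the box format (x_min, y_min, x_max, y_max), the helper names, and the greedy one-to-one matching strategy are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over Union: 0 for no overlap, 1 for an exact match."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def count_matches(
    predictions: List[Box], ground_truths: List[Box], threshold: float
) -> Tuple[int, int, int]:
    """Greedily match predictions to ground truth boxes; return (TP, FP, FN)."""
    unmatched = list(ground_truths)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) > threshold:
            unmatched.remove(best)  # each ground truth box is matched once
            tp += 1
    fp = len(predictions) - tp
    fn = len(unmatched)
    return tp, fp, fn


ground_truths = [(0.0, 0.0, 10.0, 10.0)]
predictions = [(1.0, 1.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]

print(iou(predictions[0], ground_truths[0]))           # 0.81
print(count_matches(predictions, ground_truths, 0.5))   # (1, 1, 0)
print(count_matches(predictions, ground_truths, 0.95))  # (0, 2, 1)
```

The first prediction overlaps the ground truth box with an IOU of 0.81, so it counts as a true positive at a threshold of 0.5 but not at 0.95. This shows how raising the threshold lowers the tolerance for bounding box errors.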


We provide an application to compute these confusion matrices and AP/AR scores across multiple IOU thresholds. This application evaluates the cart detection model being used in the cart delivery application. The app will output confusion matrices and AP/AR metrics to /tmp/object_detection_metrics by default.

To run the application on a recorded log that captures the cart delivery scenario, run the following:


bazel run packages/ml/apps/evaluate_object_detection:verify_confusion_matrices -- --mode log

This will produce statistics in /tmp/object_detection_metrics/object_detection_metrics.json. The metrics below were computed on the default log. The average precision and recall are computed for each IOU threshold. For example, in the metrics below, the AP50 (AP for IOU threshold 0.5) is 90.55%.


{
  "trial_name": "object_detection_metrics_2020-05-20",
  "iou_thresholds": [0.5, 0.8, 0.95],
  "statistics": [
    {
      "class_name": "Dolly",
      "precisions": [0.9055374592833876, 0.5597176981541803, 0.14332247557003258],
      "recalls": [0.38108293351610695, 0.23554946310258168, 0.06031528444139822],
      "area_under_curve": 0.16822016775579188
    }
  ]
}
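The metrics file pairs each entry in "precisions" and "recalls" with the IOU threshold at the same position in "iou_thresholds". The sketch below reads that structure and prints one row per threshold. To stay self-contained it parses the sample values shown above from a string. To read the real output, load /tmp/object_detection_metrics/object_detection_metrics.json with `json.load` instead.

```python
import json

# Sample metrics copied from the output shown above.
metrics_json = """
{
  "trial_name": "object_detection_metrics_2020-05-20",
  "iou_thresholds": [0.5, 0.8, 0.95],
  "statistics": [
    {
      "class_name": "Dolly",
      "precisions": [0.9055374592833876, 0.5597176981541803, 0.14332247557003258],
      "recalls": [0.38108293351610695, 0.23554946310258168, 0.06031528444139822],
      "area_under_curve": 0.16822016775579188
    }
  ]
}
"""

metrics = json.loads(metrics_json)

# Build (class, IOU threshold, AP %, AR %) rows by zipping the parallel lists.
rows = []
for stats in metrics["statistics"]:
    for threshold, p, r in zip(
        metrics["iou_thresholds"], stats["precisions"], stats["recalls"]
    ):
        rows.append((stats["class_name"], threshold, round(100 * p, 2), round(100 * r, 2)))

for class_name, threshold, ap, ar in rows:
    print(f"{class_name}: IOU {threshold:.2f}  AP {ap:.2f}%  AR {ar:.2f}%")
```

The first row reproduces the AP50 value of 90.55% quoted above.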

The application can also be run on a simulation scene as follows:


bazel run packages/ml/apps/evaluate_object_detection:verify_confusion_matrices -- --mode sim

Two evaluation scenes are provided in the Factory of the Future scene. Run either of these alongside the evaluation application. The application will run for 100 seconds and output the metrics after completion.

Scenario 17 spawns a cart in front of the camera at various angles and positions between 1.5 and 2.5 meters from the camera. To run Scenario 17, use the following command from within the IsaacSim release folder:


./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 17

Scenario 18 spawns multiple carts along the robot’s path as it drives along the factory floor. To run Scenario 18, use the following command from within the IsaacSim release folder:


./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 18

© Copyright 2018-2020, NVIDIA Corporation. Last updated on Feb 1, 2023.