Object Detection with DetectNetv2

Isaac SDK supports a training and inference pipeline for object detection with DetectNetv2. In this pipeline, DetectNetv2 uses a ResNet backbone as its feature extractor. ResNet is a widely used network that is on par with MobileNet and InceptionNet, two other common backbone models for feature extraction. The NVIDIA Transfer Learning Toolkit (TLT) can be used to train, fine-tune, and prune DetectNetv2 models for object detection.

[Figure: Isaac-TLT integration diagram, with legend]

The following sections explain how to:

  1. Generate dataset images from IsaacSim for Unity3D.
  2. Train a pre-trained DetectNetv2 model on the generated dataset.
  3. Run inference on various inputs using the Isaac TensorRT Inference codelet.

Training on Simulated Images in TLT

Training a DetectNetv2 model involves generating simulated data and using TLT to train a model on this data. Isaac SDK provides a sample model, based on ResNet18, that has been trained using this pipeline to detect a single object: the dolly shown below. The following step-by-step instructions walk through the process of how this model was trained. Use these steps as guidelines to train models on your own objects.

  1. Set up IsaacSim for Unity3D to generate simulated images.

    a. Open the sample scene to generate data, available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectDetection/. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.

    b. Objects are spawned in the procedural > objects GameObject. The list of objects for training is in packages/Nvidia/Samples/ObjectDetection/ObjectDetectionAssetGroup. By default, this AssetGroup contains the dolly prefab. Modify the list of GameObjects to match the list of objects you wish to use to train the detector. Each prefab in this list should contain a LabelSetter component that contains the name of the object.

      Be sure to modify the MaxCount and MaxTrials in the procedural > objects > Collider Asset Spawner component to reflect the number of objects to spawn each frame. Additionally, the Dropout parameter under procedural > objects > Collider Asset Spawner represents the probability of an asset being “dropped out” of the frame (the default value is 0.2). Increasing this value results in a dataset with more negative samples, which should be present in the dataset to minimize false positives during inference.
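To get a rough feel for how Dropout affects the share of negative samples, the following sketch estimates the fraction of frames that contain no target object. It assumes each of the MaxCount objects is dropped independently with probability Dropout. That is a simplifying assumption of this example, not documented spawner behavior, and spawn attempts that fail within MaxTrials would raise the real fraction further. The function name is hypothetical.

```python
def negative_frame_fraction(dropout: float, max_count: int) -> float:
    """Estimate the fraction of frames that contain no target object.

    Assumes every one of the max_count objects is dropped independently
    with probability `dropout` (an assumption of this sketch).
    """
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must be in [0, 1]")
    if max_count < 1:
        raise ValueError("max_count must be at least 1")
    # A frame is negative only when all objects are dropped.
    return dropout ** max_count


# Default Dropout of 0.2 with one, two, and three objects per frame.
for count in (1, 2, 3):
    print(count, round(negative_frame_fraction(0.2, count), 4))
```

Under these assumptions, a single-object scene with the default Dropout yields roughly one negative frame in five, and the share falls quickly as MaxCount grows.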

  2. Generate a dataset in KITTI format with simulated images of the object.

    a. Configure parameters for the dataset in packages/ml/apps/generate_kitti_dataset/generate_kitti_dataset.app.json. Here you can change, among other parameters, the output resolution of the images (for best results, use dimensions that are multiples of 16), the number of training images, and the number of testing images. The default application generates a dataset of 10,000 training images and 100 testing images; all images are in PNG format, with a resolution of 368x640.

    b. Run the following application to generate a dataset for input to the TLT training pipeline:

    bazel run packages/ml/apps/generate_kitti_dataset

    On completion, the application will create a directory (/tmp/unity3d_kitti_dataset by default) with the following structure:

    unity3d_kitti_dataset/
        training/
            image_2/   [training images]
            label_2/   [training labels in KITTI format]
        testing/
            image_2/   [testing images]
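Before moving the dataset into the container, it can help to confirm that the generator produced the expected folders. The sketch below is one possible check, not part of Isaac SDK. It assumes the training/ and testing/ sub-directories implied by the paths used later in this section, and the usual KITTI convention that each image NNN.png has a label file NNN.txt. Both function names are hypothetical.

```python
from pathlib import Path

# Sub-directories assumed to exist under the dataset root.
EXPECTED_DIRS = ("training/image_2", "training/label_2", "testing/image_2")


def missing_dirs(root):
    """Return the expected sub-directories that are absent under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def unlabeled_training_images(root):
    """Return the stems of training PNGs without a matching .txt label."""
    training = Path(root) / "training"
    images = {p.stem for p in (training / "image_2").glob("*.png")}
    labels = {p.stem for p in (training / "label_2").glob("*.txt")}
    return sorted(images - labels)
```

For example, `missing_dirs("/tmp/unity3d_kitti_dataset")` should return an empty list after a successful run with the default output path. Whether frames without objects receive an empty label file is not stated here, so treat a non-empty result from `unlabeled_training_images` as something to inspect rather than as an error.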
  3. Create a local directory called tlt-experiments to mount in the docker container. Move the unity3d_kitti_dataset directory into this directory.

  4. Follow these instructions from IVA to set up docker and NGC.

  5. Start a docker container and mount the directory with the commands outlined here. The docker container includes all the necessary files to train a DetectNetv2 model.

  6. Navigate to the /workspace/examples/detectnet_v2/ directory in the docker image.

  7. Copy the /workspace/examples/detectnet_v2/specs folder into your /workspace/tlt-experiments folder. We will later modify these specs in the mounted folder so that the training specs persist after the docker container is terminated.

  8. Start a Jupyter notebook server as described in the TLT documentation:

    jupyter notebook --ip 0.0.0.0 --allow-root

  9. Open the detectnet_v2.ipynb notebook and follow the instructions, taking into account these special instructions for each step.

  1. Set up env variables:

    • $KEY: Create a “key”, which will be used to protect trained models and must be known at inference time to access model weights.
    • $USER_EXPERIMENT_DIR: Leave this set to /workspace/tlt-experiments.
    • $DATA_DOWNLOAD_DIR: Set this to the path of your unity3d_kitti_dataset.
    • $SPECS_DIR: Set this to the path of the specs directory that you copied into the mounted folder earlier.
  2. Verify the downloaded dataset. Skip the first two cells, which download a KITTI object detection dataset into the $DATA_DOWNLOAD_DIR specified above. The simulated dataset from Unity3D should already be at this path, so run the last two cells of this section to validate your simulated dataset.

  3. Prepare TFRecords from the KITTI format dataset. Modify the $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt file to reflect the correct dataset path. Then run the cells as instructed in the notebook. An example is provided below for training dolly detection.

    kitti_config {
        root_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
        image_dir_name: "image_2"
        label_dir_name: "label_2"
        image_extension: ".png"
        partition_mode: "random"
        num_partitions: 2
        val_split: 14
        num_shards: 10
    }
    image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
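As a quick sanity check on the val_split value, the sketch below computes how many images end up in each partition. It assumes val_split is a percentage of the dataset held out for validation and that the count is truncated to an integer; the converter may round differently, so treat the numbers as approximate. The function name is hypothetical.

```python
def split_counts(num_images: int, val_split_percent: float):
    """Return (training, validation) image counts for a percentage split."""
    if not 0 <= val_split_percent <= 100:
        raise ValueError("val_split_percent must be between 0 and 100")
    num_val = int(num_images * val_split_percent / 100)
    return num_images - num_val, num_val


# Default dataset of 10,000 training images with val_split: 14.
print(split_counts(10_000, 14))  # (8600, 1400)
```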
  4. Download the pre-trained model: Run the cells as instructed in the notebook.

  5. Modify the training parameters for object classes in $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt for your use case:

  1. First, change the dataset_config > data_sources > image_directory_path to the training folder inside your generated dataset:

    dataset_config {
        data_sources {
            tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
            image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
        }
    }
  2. Update the list of target_class_mapping parameters, adding one for each object class. For each object, the key field of this struct should match the label set via the LabelSetter component when you set up the scene in IsaacSim for Unity3D.

    target_class_mapping {
        key: "dolly"
        value: "dolly"
    }
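A mismatch between the labels written by the simulator and the key fields of target_class_mapping is easy to miss. The following sketch collects the class names found in KITTI label lines and reports any that have no mapping key. It relies on the standard KITTI layout, in which each line starts with the class name followed by 14 numeric fields. It compares names exactly, which is an assumption of this example. The function names and the sample label line are made up for illustration.

```python
KITTI_FIELD_COUNT = 15  # class name followed by 14 numeric fields


def label_classes(lines):
    """Return the set of class names found in KITTI label lines."""
    classes = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < KITTI_FIELD_COUNT:
            raise ValueError(
                f"line {number}: expected {KITTI_FIELD_COUNT} fields, "
                f"got {len(fields)}"
            )
        classes.add(fields[0])
    return classes


def unmapped_classes(lines, mapping_keys):
    """Return label classes that have no target_class_mapping key."""
    return sorted(label_classes(lines) - set(mapping_keys))


# Made-up label line: class, truncation, occlusion, alpha, 4 bbox values,
# 3 dimensions, 3 location values, rotation.
sample = ["dolly 0.00 0 0.00 100.00 120.00 220.00 260.00 "
          "0.00 0.00 0.00 0.00 0.00 0.00 0.00"]
print(unmapped_classes(sample, ["dolly"]))  # []
```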
  3. Edit the output_image_width and output_image_height parameters under augmentation_config > preprocessing.

    preprocessing {
        output_image_width: 640
        output_image_height: 368
    }
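Because the dataset step recommends image dimensions that are multiples of 16, it is worth checking custom values before training. This small sketch validates a dimension and rounds it up to the next multiple when needed. The helper names are hypothetical.

```python
STRIDE = 16  # multiple recommended for image dimensions in this workflow


def is_valid_dim(value: int, stride: int = STRIDE) -> bool:
    """Return True when value is a positive multiple of stride."""
    return value > 0 and value % stride == 0


def round_up_to_stride(value: int, stride: int = STRIDE) -> int:
    """Round value up to the nearest multiple of stride."""
    return -(-value // stride) * stride


print(is_valid_dim(640), is_valid_dim(368))  # True True
print(round_up_to_stride(360))  # 368
```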
  4. Under the postprocessing_config header, make sure there is one target_class_config configuration per object class. Leave the clustering_config set to default values.

    target_class_config {
        key: "dolly"
        value {
            clustering_config {
                # default clustering parameters from the sample spec, unchanged
            }
        }
    }
  5. Use the default values for the model_config section.

  6. Modify the evaluation_config section. Edit the validation_period_during_training parameter to change the number of epochs between validation steps. Make sure there is one minimum_detection_ground_truth_overlap and one evaluation_box_config struct for each object class, using the default values within the struct:

    evaluation_config {
        validation_period_during_training: 10
        first_validation_epoch: 1
        minimum_detection_ground_truth_overlap {
            key: "dolly"
            value: 0.5
        }
        evaluation_box_config {
            key: "dolly"
            value {
                # default values from the sample spec, unchanged
            }
        }
    }
  7. In cost_function_config, make sure that there is one target_classes struct per object class, using the default values within the struct.


    The cost_function_config section contains the per-class weights used to calculate the loss (cost).

  8. Modify the training_config section. In this example, the images are 368x640, so the batch_size_per_gpu can be increased to 16 for faster learning, thus allowing for reduction of the num_epochs to 100. Use the default values for the learning_rate, regularizer, optimizer, and cost_scaling parameters, keeping in mind that these can be adjusted if needed. By default, the training will output a model checkpoint every 10 epochs; modify the checkpoint_interval parameter to change this frequency.
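To see what these training_config values imply, the sketch below estimates the number of optimizer steps per epoch and the epochs at which checkpoints are written. It assumes 8,600 training images (the 10,000 generated images minus a 14 percent validation split), a single GPU, and a checkpoint at the end of every checkpoint_interval epochs. The function names are hypothetical.

```python
import math


def steps_per_epoch(num_train_images: int, batch_size_per_gpu: int,
                    num_gpus: int = 1) -> int:
    """Return the number of batches needed to see every image once."""
    return math.ceil(num_train_images / (batch_size_per_gpu * num_gpus))


def checkpoint_epochs(num_epochs: int, checkpoint_interval: int):
    """Return the epochs after which a checkpoint is expected."""
    return list(range(checkpoint_interval, num_epochs + 1, checkpoint_interval))


print(steps_per_epoch(8600, 16))          # 538
print(len(checkpoint_epochs(100, 10)))    # 10
```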

  9. Modify the bbox_rasterizer_config section to have one target_class_config per object class. For the dolly object, these values were used:

    bbox_rasterizer_config {
        target_class_config {
            key: "dolly"
            value: {
                cov_center_x: 0.5
                cov_center_y: 0.5
                cov_radius_x: 0.4
                cov_radius_y: 0.4
                bbox_min_radius: 1.0
            }
        }
    }

For more guidance on these training parameters, see the TLT documentation and this blog post.

  1. Run TLT training using the tlt-train command, as shown in the notebook.

  2. Evaluate the trained model. Run the tlt-evaluate command as shown in the notebook to evaluate the final trained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

  3. Prune the trained model to reduce the number of parameters, which decreases inference runtime and the overall size of the model. To prune, run the tlt-prune command as shown in the notebook. Read the pruning instructions and adjust the pruning threshold accordingly. A pth value of 0.01 is a good starting point for detectnet_v2 models. We recommend a pruning ratio between 0.1 and 0.3.
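The recommended range can be checked with a few lines of arithmetic. In the sketch below, the pruning ratio is taken to mean the pruned parameter count divided by the original parameter count, which is an assumption about how the recommendation is measured. The parameter counts are made up, and the advice strings reflect the general rule that a higher pth prunes more aggressively.

```python
def pruning_ratio(pruned_params: int, original_params: int) -> float:
    """Return pruned / original parameter count."""
    if original_params <= 0:
        raise ValueError("original_params must be positive")
    return pruned_params / original_params


def pruning_advice(ratio: float, low: float = 0.1, high: float = 0.3) -> str:
    """Suggest how to adjust pth so the ratio lands in [low, high]."""
    if ratio > high:
        return "increase pth"   # model is still too large
    if ratio < low:
        return "decrease pth"   # model was pruned too aggressively
    return "ok"


# Hypothetical parameter counts before and after pruning.
ratio = pruning_ratio(2_000_000, 10_000_000)
print(ratio, pruning_advice(ratio))  # 0.2 ok
```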

  4. Retrain the pruned model by modifying the $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt file, similar to $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt. Update the model_config so that the load_graph option is set to true. Make sure to also set the correct path to the pruned model from the previous step in the pretrained_model_file parameter under model_config.

  5. Evaluate the retrained model. Run the tlt-evaluate command as shown in the notebook to evaluate the retrained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

  6. Edit the $SPECS_DIR/detectnet_v2_clusterfile_kitti.json file to set inference parameters. An example of the clusterfile is shown below for the dolly detector.

    {
        "dbscan_criterion": "IOU",
        "dbscan_eps": {
            "dolly": 0.3
        },
        "dbscan_min_samples": {
            "dolly": 0.05
        },
        "min_cov_to_cluster": {
            "dolly": 0.005
        },
        "min_obj_height": {
            "dolly": 4,
            "default": 2
        },
        "target_classes": ["dolly"],
        "confidence_th": {
            "dolly": 0.6
        },
        "confidence_model": {
            "dolly": { "kind": "aggregate_cov"}
        },
        "output_map": {
            "dolly" : "dolly"
        },
        "color": {
            "dolly": "white"
        },
        "postproc_classes": ["dolly"],
        "image_height": 384,
        "image_width": 640,
        "stride": 16
    }
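When adapting the clusterfile to several classes, it is easy to forget an entry in one of the per-class sections. The following sketch checks a clusterfile dictionary for that kind of omission. The rules it enforces (one entry per target class in each section, image dimensions that are multiples of the stride) are conventions inferred from the dolly example, not a documented schema. The function name and the `EXAMPLE` variable are hypothetical.

```python
import json

# Sections that hold one entry per class in the dolly example.
PER_CLASS_SECTIONS = (
    "dbscan_eps", "dbscan_min_samples", "min_cov_to_cluster",
    "min_obj_height", "confidence_th", "confidence_model",
    "output_map", "color",
)


def clusterfile_problems(config):
    """Return a list of human-readable problems found in a clusterfile."""
    problems = []
    for section in PER_CLASS_SECTIONS:
        entries = config.get(section, {})
        for name in config.get("target_classes", []):
            if name not in entries:
                problems.append(f"{section}: no entry for class '{name}'")
    stride = config.get("stride", 0)
    for dim in ("image_height", "image_width"):
        value = config.get(dim, 0)
        if stride <= 0 or value <= 0 or value % stride != 0:
            problems.append(
                f"{dim}: {value} is not a positive multiple of stride {stride}"
            )
    return problems


EXAMPLE = json.loads("""
{
    "dbscan_criterion": "IOU",
    "dbscan_eps": {"dolly": 0.3},
    "dbscan_min_samples": {"dolly": 0.05},
    "min_cov_to_cluster": {"dolly": 0.005},
    "min_obj_height": {"dolly": 4, "default": 2},
    "target_classes": ["dolly"],
    "confidence_th": {"dolly": 0.6},
    "confidence_model": {"dolly": {"kind": "aggregate_cov"}},
    "output_map": {"dolly": "dolly"},
    "color": {"dolly": "white"},
    "postproc_classes": ["dolly"],
    "image_height": 384,
    "image_width": 640,
    "stride": 16
}
""")
print(clusterfile_problems(EXAMPLE))  # []
```

To check a file on disk, load it with `json.load` and pass the resulting dictionary to `clusterfile_problems`.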
  1. Visualize inferences using the tlt-infer command as shown in the notebook. Update the -i flag to the testing directory of the simulated dataset and the -m flag to the path to the retrained model.

  2. After the model is trained, pruned, and evaluated to your satisfaction, export it using the tlt-export command under the “Deploy!” section of the notebook. This produces a file in .etlt format, which you can then use for inference with Isaac.

    !tlt-export $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
        -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector_dolly_368x640.etlt \
        --outputs output_cov/Sigmoid,output_bbox/BiasAdd \
        --enc_key $KEY \
        --input_dims 3,368,640 \
        --export_module detectnet_v2

TensorRT Inference on TLT models

A sample DetectNetv2 model that was trained using the above workflow is provided. In addition, three sample inference applications are provided in packages/detect_net/apps, all utilizing the detect_net_inference subgraph also located in the same folder.

  • detect_net_inference_imagefeeder: Runs inference on a set of real images.

    bazel run packages/detect_net/apps:detect_net_inference_imagefeeder

  • detect_net_inference_camerafeed: Runs inference on a camera feed from an Intel Realsense camera.

    bazel run packages/detect_net/apps:detect_net_inference_camerafeed

  • detect_net_inference_replay: Runs inference on a recorded Isaac log.

    bazel run packages/detect_net/apps:detect_net_inference_replay

These applications can be modified to run inference on your own trained models. To do so, modify the configuration for the detect_net_inference.tensor_r_t_inference codelet in any of the sample apps. Be sure to also modify the etlt_password and input/output tensor info parameters in this codelet accordingly. Note that if the input tensor info changes, the detect_net_inference.tensor_encoder configuration must be changed to match.
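When updating the input and output tensor info for a custom model, the expected tensor dimensions can be derived from the image size and the number of classes. The sketch below assumes a channels-first layout and an output grid with a stride of 16. It also assumes one coverage channel and four bounding-box channels per class for the output_cov/Sigmoid and output_bbox/BiasAdd tensors named in the export command. These are assumptions of this example, so compare the result against your exported model. The function name is hypothetical.

```python
def detectnet_v2_tensor_dims(num_classes: int, height: int, width: int,
                             stride: int = 16):
    """Return expected [C, H, W] dims for the input and both outputs."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    if height % stride or width % stride:
        raise ValueError("height and width must be multiples of the stride")
    grid_h, grid_w = height // stride, width // stride
    return {
        "input": [3, height, width],
        "output_cov/Sigmoid": [num_classes, grid_h, grid_w],
        "output_bbox/BiasAdd": [4 * num_classes, grid_h, grid_w],
    }


# Single-class dolly detector trained on 368x640 images.
for name, dims in detectnet_v2_tensor_dims(1, 368, 640).items():
    print(name, dims)
```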


This sample was trained on a limited dataset and is not guaranteed to work in every situation and lighting condition. To improve model accuracy in a custom environment, you can train your own model using the instructions provided above.