Isaac SDK supports a training/inference pipeline for object detection with DetectNetv2. For this pipeline, DetectNetv2 uses the ResNet backbone as its feature extractor. ResNet is an industry-standard network on par with MobileNet and InceptionNet, two other common backbone models for feature extraction. The NVIDIA Transfer Learning Toolkit (TLT) can be used to train, fine-tune, and prune DetectNetv2 models for object detection.


The following sections explain how to:
Generate dataset images from IsaacSim for Unity3D.
Train a pre-trained DetectNetv2 model on the generated dataset.
Run inference on various inputs using the Isaac TensorRT Inference codelet.

Training a DetectNetv2 model involves generating simulated data and using TLT to train a model on this data. Isaac SDK provides a sample model, based on ResNet18, that has been trained using this pipeline to detect a single object: the dolly shown below. The following step-by-step instructions walk through the process of how this model was trained. Use these steps as guidelines to train models on your own objects.

Set up IsaacSim for Unity3D to generate simulated images for the objects of interest.
Open the sample scene used to generate data, available in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectDetection/. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.
Objects are spawned in the procedural > objects GameObject. The list of objects for training is in packages/Nvidia/Samples/ObjectDetection/ObjectDetectionAssetGroup. By default, this AssetGroup contains the dolly prefab. Modify the list of GameObjects to match the list of objects you wish to train on by increasing the size of the ObjectDetectionAssetGroup and dragging each new prefab into this list.
Each prefab in this list should contain a LabelSetter component that contains the name of the object. If you would like each label from the prefab to be associated with the same instance, add an InstanceLabelGroup component to the prefab as well. For example, if each wheel in the dolly prefab has the "wheel" label, an InstanceLabelGroup component in the game object containing all the wheels results in one bounding box containing all wheels, instead of four separate boxes, one per wheel.
Modify the MaxCount and MaxTrials parameters in the procedural > objects > Collider Asset Spawner component to reflect the number of objects to spawn each frame. The maxCount parameter specifies the number of objects to spawn. The maxPickTrials and maxPlaceTrials values denote how many times each object should be placed again if the initial spawning location is invalid. Additionally, the Dropout parameter under procedural > objects > Collider Asset Spawner represents the probability of an asset being “dropped out” of the frame (the default value is 0.2). Increasing this value will result in a dataset with more negative samples, which should be present in the dataset to minimize false positives during inference.
Modify the ClassLabelManager game object in the scene. By default, it contains one class label rule (dolly) and two class labels (one for the background and one for the dolly). Modify this so that there is one class label rule and one class label per object in your ObjectDetectionAssetGroup. Set the "name" and "expression" fields to the label of the object; this should match the string that was set as the label in LabelSetter in step (c). Make sure that the rule index of each object class label is the same as its class label index (for example, the dolly uses index 1 by default). The index value is used to set the pixels in the label image that is later used to generate bounding boxes. Leave the "Default Label" field set to 0, as this is the value used to populate all pixels that are not associated with objects (background pixels).
Generate a dataset in KITTI format with simulated images of the objects of interest.
Configure parameters for the dataset in packages/ml/apps/generate_kitti_dataset/generate_kitti_dataset.app.json. Here the config can be modified to vary, among other parameters, the output resolution of the images (for best results, use dimensions that are multiples of 16), the number of training images, and the number of testing images to create. The default application generates a dataset of 10k training images and 100 testing images; all images are in PNG format with a resolution of 640x368. Run the following application to generate a dataset for input to the TLT training pipeline:
bazel run packages/ml/apps/generate_kitti_dataset
On completion, the application creates a directory (/tmp/unity3d_kitti_dataset by default) with the following structure:

unity3d_kitti_dataset/
    training/
        image_2/    [training images]
            000001.png
            000002.png
            ...
        label_2/    [training labels in kitti format]
            000001.txt
            000002.txt
            ...
    testing/
        image_2/    [testing images]
            000001.png
            000002.png
            ...
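Each file in label_2 describes the objects in the corresponding image, one line per object, using the KITTI label format: the class name, followed by truncation, occlusion, and observation-angle fields, the bounding box in pixels (left, top, right, bottom), and 3D fields that DetectNetv2 does not use and leaves at zero. As a rough illustration only (the coordinates below are made-up values, not taken from the sample dataset), a single dolly annotation could look like this:

dolly 0.00 0 0.00 241.00 162.00 398.00 297.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00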
Create a local directory called tlt-experiments to mount in the docker container. Move the unity3d_kitti_dataset directory into this directory.
Follow these instructions from IVA to set up docker and NGC.
Start a docker container and mount the directory with the commands outlined here. The docker container includes all the necessary files to train a DetectNetv2 model.
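The exact docker invocation comes from the NGC setup instructions above; as a minimal sketch, mounting the local tlt-experiments directory into the container typically looks something like the following (the image name is a placeholder for whichever TLT container you pulled from NGC, and the host path is a placeholder for your local directory):

docker run --runtime=nvidia -it -v /path/to/tlt-experiments:/workspace/tlt-experiments <tlt-container-image-from-ngc> /bin/bash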
Navigate to the /workspace/examples/detectnet_v2/ directory in the docker image.
Copy the /workspace/examples/detectnet_v2/specs folder into your /workspace/tlt-experiments folder. We will later modify these specs in the mounted folder so that the training specs persist after the docker container is terminated.
Start a Jupyter notebook server as described in the TLT documentation:
jupyter notebook --ip 0.0.0.0 --allow-root
Open the detectnet_v2.ipynb notebook and work through it, taking into account the special instructions below for each step.
Set up env variables:
$KEY: Create a "key", which will be used to protect trained models and must be known at inference time to access model weights.
$USER_EXPERIMENT_DIR: Leave this set to /workspace/tlt-experiments.
$DATA_DOWNLOAD_DIR: Set this to the path of your unity3d_kitti_dataset.
$SPECS_DIR: Set this to the path of the copied specs directory within the mounted folder from step #6.
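With the directory layout described above, the environment-variable cell in the notebook would look roughly like the following sketch (the key and paths are placeholders; substitute your own values):

%env KEY=<your_key>
%env USER_EXPERIMENT_DIR=/workspace/tlt-experiments
%env DATA_DOWNLOAD_DIR=/workspace/tlt-experiments/unity3d_kitti_dataset
%env SPECS_DIR=/workspace/tlt-experiments/specs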
Verify the downloaded dataset. Skip the first two cells, which download a KITTI object detection dataset into the $DATA_DOWNLOAD_DIR specified above. The simulated dataset from Unity3D should already be at this path, so run the last two cells of this section to validate your simulated dataset.
Prepare tfrecords from the KITTI format dataset. Modify the $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt file to reflect the correct dataset path. An example is provided below for training dolly detection.

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"

Then run the cells as instructed in the notebook. The cell containing the tlt-dataset-convert command will output a message regarding the class map such as the one below. Note the "label in tfrecords file" value: it will be used as the key when writing the training configuration in step (e).

2020-05-09 01:30:12,694 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
dolly: dolly
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.
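For reference, the conversion cell in the notebook invokes tlt-dataset-convert with the dataset spec file and an output location. The sketch below assumes the paths used in this guide; the flag names and tfrecords output path follow the TLT 2.0 notebook and may differ in other versions, so treat the notebook cell as authoritative:

tlt-dataset-convert -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval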
Download the pre-trained model: Run the cells as instructed in the notebook.
Modify the training parameters for object classes in $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt for your use case:
First, change the dataset_config > data_sources > image_directory_path and tfrecords_path to the training folder inside your generated dataset:

dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
  }
  ...
}
Update the list of target_class_mapping parameters, adding one for each object class. For each object, the key field of this struct should exactly match the corresponding "label in tfrecords file" from step 9c.

target_class_mapping {
  key: "dolly"
  value: "dolly"
}
Edit the output_image_width and output_image_height parameters under augmentation_config > preprocessing.

preprocessing {
  output_image_width: 640
  output_image_height: 368
  ...
}
Under the postprocessing_config header, make sure there is one target_class_config configuration per object class. Leave the clustering_config set to default values.

target_class_config {
  key: "dolly"
  value {
    clustering_config {
      ...
    }
  }
}
Use the default values for the model_config section.
Modify the evaluation_config section. Edit the validation_period_during_training parameter to change the number of epochs between validation steps. Make sure there is one minimum_detection_ground_truth_overlap and one evaluation_box_config struct for each object class, using the default values within the struct:

evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "dolly"
    value: 0.5
  }
  evaluation_box_config {
    key: "dolly"
    value {
      ...
    }
  }
  ...
}
In cost_function_config, make sure that there is one target_classes struct per object class, using the default values within the struct.

Note: The cost_function_config section contains parameters for setting weights per class for calculation of the loss, or cost.
Modify the training_config section. In this example, the images are 640x368, so the batch_size_per_gpu can be increased to 16 for faster learning, which allows the num_epochs to be reduced to 100. Use the default values for the learning_rate, regularizer, optimizer, and cost_scaling parameters, keeping in mind that these can be adjusted if needed. By default, training outputs a model checkpoint every 10 epochs; modify the checkpoint_interval parameter to change this frequency.
Modify the bbox_rasterizer_config section to have one target_class_config per object class. For the dolly object, these values were used:

bbox_rasterizer_config {
  target_class_config {
    key: "dolly"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  ...
}
For more guidance on these training parameters, see the TLT documentation and this blog post.
Run TLT training using the tlt-train command, as shown in the notebook.
Evaluate the trained model. Run the tlt-evaluate command as shown in the notebook to evaluate the final trained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.
Prune the trained model to reduce the number of parameters, thus decreasing inference runtimes and the overall size of the model. To prune, run the tlt-prune command as shown in the notebook. Read the pruning instructions and adjust the pruning threshold accordingly. A pth value of 0.01 is a good starting point for detectnet_v2 models. We recommend a pruning ratio between 0.1 and 0.3.
Retrain the pruned model by modifying the $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt file, similar to $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt. Update the model_config so that the load_graph option is set to true. Make sure to also set the correct path to the pruned model from the previous step in the pretrained_model_file parameter under model_config.
Evaluate the retrained model. Run the tlt-evaluate command as shown in the notebook, this time on the retrained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.
Edit the $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt file to set inference parameters. In the inferencer_config, set the target classes and inference dimensions accordingly, and provide the correct path to the model to be used for inference. In the bbox_handler_config, make sure there is one classwise_bbox_handler_config per class with the appropriate key, in addition to the default classwise_bbox_handler_config.
Visualize inferences using the tlt-infer command as shown in the notebook. Update the -i flag to the testing directory of the simulated dataset and the -m flag to the path of the retrained model.
After the model is trained, pruned, and evaluated to your satisfaction, export it using the tlt-export command under the "Deploy!" section of the notebook. This provides you with a file in .etlt format, which you can then use for inference with Isaac.
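The export cell pairs the retrained .tlt model with the same $KEY used during training. The sketch below is only indicative: the positional module argument and the -m/-k/-o flags follow the TLT 2.0 notebook and are assumptions here, and the model paths are placeholders, so defer to the notebook cell for the exact invocation:

tlt-export detectnet_v2 -m <path/to/retrained_model.tlt> -k $KEY -o <path/to/output_model.etlt>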
A sample DetectNetv2 model that was trained using the above workflow is provided. This model was trained on a different dolly than the one shown above, but with the same configuration. In addition, a sample inference application is provided in packages/detect_net/apps, utilizing the detect_net_inference subgraph located in the same folder. With this app, you can do the following:
Run inference on a set of real images:
bazel run packages/detect_net/apps:detect_net_inference_app -- --mode image --rows 480 --cols 848
Run inference on a recorded Isaac log:
bazel run packages/detect_net/apps:detect_net_inference_app -- --mode cask --rows 480 --cols 848
Run inference on an image stream from Isaac Sim Unity3D:
bazel run packages/detect_net/apps:detect_net_inference_app -- --mode sim
Run inference on a camera feed from an Intel Realsense camera:
bazel run packages/detect_net/apps:detect_net_inference_app -- --mode realsense
Run inference on a camera feed from a V4L camera (be sure to adjust the framerate and resolution according to your camera):
bazel run packages/detect_net/apps:detect_net_inference_app -- --mode v4l --fps 30 --rows 448 --cols 800
Run inference on a Jetson device. See the Developing Codelets in Python page to learn more about deploying a Python app to a Jetson device.
When performing inference on the sample model, the resolution of input images must be greater than or equal to 640x368. The inference application uses the ColorCameraEncoder codelet to downscale input images to match the network input resolution, which is 640x368 for the provided sample dolly detection network. However, ColorCameraEncoder does not support upscaling, so images that are input to the inference applications cannot have a smaller resolution than the network input resolution in either dimension.
These applications can be modified to run inference on your own trained models. To do so, modify the --model_file_path command line argument for the sample application. Be sure to also modify the --etlt_password parameter accordingly. Note that if the input tensor info changes, the "detect_net_inference.tensor_encoder" configuration must be changed to match in the detect_net_inference subgraph.
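For example, a minimal sketch of running the image-mode application above with a custom exported model might look like the following (the model path is a placeholder, and $KEY stands for the key used when the model was exported):

bazel run packages/detect_net/apps:detect_net_inference_app -- --mode image --rows 480 --cols 848 --model_file_path /path/to/your_model.etlt --etlt_password $KEY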
The object pose estimation pipeline is one of the many use cases for DetectNet. For more sample applications and models, please refer to the 3D Object Pose Estimation with Pose CNN Decoder documentation.
This sample was trained on a limited dataset and is not guaranteed to work in every situation and lighting condition. To improve model accuracy in a custom environment, you can train your own model using the instructions provided above.
Evaluation of a model can help improve the model in several ways:
Data validation: A model is only as good as the data it was trained on. There are many aspects to a training dataset that can affect performance: data integrity, class balance/imbalance, etc.
Model improvement: Developers may wish to make incremental changes to model architectures, hyperparameters, etc. in order to explore their effects on performance.
One of the most common metrics used to evaluate object detection models is Average Precision (AP). Precision is calculated as \(\text{true positives} / (\text{true positives} + \text{false positives})\), and AP is the precision averaged over image frames. Average recall (AR) is also an important measure, where recall is \(\text{true positives} / (\text{true positives} + \text{false negatives})\). Precision quantifies how well each prediction made by the network matches a ground truth object, while recall captures how many ground truth objects are identified by the network.
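For example, if at a given IOU threshold a model produces 90 true positives, 10 false positives, and 30 false negatives on a set of frames (these counts are illustrative, not taken from the metrics below), its precision is \(90 / (90 + 10) = 0.90\) and its recall is \(90 / (90 + 30) = 0.75\).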
The basic values needed to calculate the above metrics are the true positive (TP), false positive (FP), and false negative (FN) scores. In other words, we need to build a confusion matrix for the inference results. To determine if a prediction and a ground truth bounding box match well enough to consider it a true positive, we use the IOU (Intersection over Union) threshold. IOU is a measure of how much two bounding boxes overlap (0 being no overlap, and 1 being an exact match). Setting a lower IOU threshold corresponds to higher tolerance for bounding box errors. We define true positives as the bounding box pairs for which the IOU score is greater than the IOU threshold. The following image shows the ground truth box in black and the predicted bounding box in green for a sample image.
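Concretely, for a predicted box \(B_p\) and a ground truth box \(B_{gt}\), \(IOU = \text{area}(B_p \cap B_{gt}) / \text{area}(B_p \cup B_{gt})\); with a threshold of 0.5, as used for the AP50 metric below, the pair counts as a true positive whenever this ratio exceeds 0.5.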

We provide an application to compute these confusion matrices and AP/AR scores across multiple IOU
thresholds. This application evaluates the cart detection model being used in the cart delivery
application. The app will output confusion matrices and AP/AR metrics to /tmp/object_detection_metrics by default.
To run the application on a recorded log that captures the cart delivery scenario, run the following:
bazel run packages/ml/apps/evaluate_object_detection:verify_confusion_matrices -- --mode log
This will produce statistics in /tmp/object_detection_metrics/object_detection_metrics.json.
The metrics below were computed on the default log. The average precision and recall are computed
for each IOU threshold. For example, in the metrics below, the AP50 (AP for IOU threshold 0.5) is
90.55%.
{
"trial_name": "object_detection_metrics_2020-05-20",
"iou_thresholds": [
0.5,
0.8,
0.95
],
"statistics": [
{
"class_name": "Dolly",
"precisions": [
0.9055374592833876,
0.5597176981541803,
0.14332247557003258
],
"recalls": [
0.38108293351610695,
0.23554946310258168,
0.06031528444139822
],
"area_under_curve": 0.16822016775579188
}
]
}
The application can also be run on a simulation scene as follows:
bazel run packages/ml/apps/evaluate_object_detection:verify_confusion_matrices -- --mode sim
Two evaluation scenarios are provided in the Factory of the Future scene. Run either of these alongside the evaluation application. The application will run for 100 seconds and output the metrics after completion.
Scenario 17 spawns a cart in front of the camera at various angles and positions between 1.5 and 2.5 meters from the camera. To run Scenario 17, use the following command from within the IsaacSim release folder:
./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 17
Scenario 18 spawns multiple carts along the robot’s path as it drives along the factory floor. To run Scenario 18, use the following command from within the IsaacSim release folder:
./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 18