NVIDIA TAO Toolkit v5.2.0

DINO

DINO is an object-detection model included in the TAO Toolkit. It supports the following tasks:

  • convert

  • train

  • evaluate

  • inference

  • export

These tasks can be invoked from the TAO Toolkit Launcher using the following command-line convention:

tao model dino <sub_task> <args_per_subtask>

where args_per_subtask are the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.
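As an illustration of this convention, the following Python sketch assembles the argument list for a subtask. The helper name build_tao_dino_command is hypothetical and not part of TAO; it only composes the command string and does not run it.

```python
import shlex


def build_tao_dino_command(subtask, spec_path, results_dir=None, overrides=None):
    """Assemble `tao model dino <subtask> -e <spec> [-r <dir>] [key=value ...]`."""
    supported = {"convert", "train", "evaluate", "inference", "export"}
    if subtask not in supported:
        raise ValueError(f"unsupported subtask: {subtask!r}")
    argv = ["tao", "model", "dino", subtask, "-e", spec_path]
    if results_dir is not None:
        argv += ["-r", results_dir]
    # Spec overrides are appended as key=value pairs,
    # for example evaluate.checkpoint=/path/to/model.pth.
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


cmd = build_tao_dino_command(
    "evaluate",
    "/path/to/spec.yaml",
    results_dir="/path/to/results/",
    overrides={"evaluate.checkpoint": "/path/to/model.pth"},
)
print(shlex.join(cmd))
```

Keeping the arguments as a list until the last moment avoids quoting mistakes if the command is later passed to subprocess.run.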

DINO expects directories of images for training or validation and annotated JSON files in COCO format.

Note

The category_id values in your COCO JSON file should start from 1, because 0 is reserved for the background class. In addition, dataset.num_classes should be set to the largest category_id + 1. For instance, even though COCO uses only 80 classes, its largest category_id is 90, so dataset.num_classes should be set to 91.
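The rule in this note can be checked with a few lines of Python. This is a minimal sketch: the function name num_classes_from_coco is hypothetical, and the inline sample stands in for a real annotation file that would normally be read with json.load().

```python
import json


def num_classes_from_coco(annotation):
    """Return the largest category id + 1 for a parsed COCO annotation dict."""
    category_ids = [category["id"] for category in annotation["categories"]]
    if min(category_ids) < 1:
        raise ValueError("category ids must start from 1; 0 is the background class")
    return max(category_ids) + 1


# Tiny stand-in for the "categories" section of a COCO annotation file.
sample = json.loads(
    '{"categories": ['
    '{"id": 1, "name": "person"}, '
    '{"id": 3, "name": "car"}, '
    '{"id": 90, "name": "toothbrush"}]}'
)
print(num_classes_from_coco(sample))  # 91
```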

Sharding the Data (Optional)

Note

Sharding is not necessary if the annotation is already in JSON format and your dataset is smaller than the COCO dataset. This subtask also assumes that your dataset is in KITTI format.

For a large dataset, you can optionally use convert to shard the dataset into smaller chunks to reduce the memory burden. In this process, KITTI-based annotations are converted into smaller sharded JSON files, similar to other object detection networks. Here is an example spec file for converting KITTI-based folders into multiple sharded JSON files.

input_source: /workspace/tao-experiments/data/sequence.txt
results_dir: /workspace/tao-experiments/sharded
image_dir_name: images
label_dir_name: labels
num_shards: 32
num_partitions: 1
mapping_path: /path/to/your_category_mapping

The details of each parameter are summarized in the table below:

Parameter | Data Type | Default | Description | Supported Values
input_source | string | None | The .txt file listing data sources |
results_dir | string | None | The output directory where sharded JSON files will be stored |
image_dir_name | string | None | The relative path to the directory containing images from the path listed in the input_source .txt file |
label_dir_name | string | None | The relative path to the directory containing JSON data from the path listed in the input_source .txt file |
num_shards | unsigned int | 32 | The number of shards per partition | >0
num_partitions | unsigned int | 1 | The number of partitions in the data | >0
mapping_path | string | None | Path to a JSON file containing the class mapping |

The category mapping should contain the class mapping for your dataset and be in alphabetical order. The default mapping is shown below:

DEFAULT_TARGET_CLASS_MAPPING = {
    "Person": "person",
    "Person Group": "person",
    "Rider": "person",
    "backpack": "bag",
    "face": "face",
    "large_bag": "bag",
    "person": "person",
    "person group": "person",
    "person_group": "person",
    "personal_bag": "bag",
    "rider": "person",
    "rolling_bag": "bag",
    "rollingbag": "bag",
    "largebag": "bag",
    "personalbag": "bag"
}
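To make the role of the mapping concrete, the sketch below resolves raw label names to target classes and serializes a mapping as JSON text. The four-entry mapping is a hypothetical subset, and the assumption that the file at mapping_path is a flat JSON object of raw name to target name is inferred from the default mapping above rather than confirmed.

```python
import json

# Hypothetical subset of a category mapping: raw label name -> target class.
category_mapping = {
    "backpack": "bag",
    "face": "face",
    "person": "person",
    "rider": "person",
}

raw_labels = ["rider", "backpack", "person", "face"]
# Resolve each raw label to its target class.
resolved = [category_mapping[name] for name in raw_labels]
print(resolved)  # ['person', 'bag', 'person', 'face']

# Serialize the mapping as the JSON text a mapping file might contain
# (assumed layout: one flat JSON object).
mapping_json = json.dumps(category_mapping, indent=2)
print(mapping_json)
```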

The following example shows how to use the convert command:

tao model dino convert -e /path/to/spec.yaml

The training experiment spec file for DINO includes model, train, and dataset parameters. Here is an example spec file for training a DINO model with a fan_small backbone on a COCO dataset.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  num_classes: 91
  batch_size: 4
  workers: 8
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  pretrained_model_path: /path/to/your-fan-small-pretrained-model
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [11]
  num_epochs: 12

Parameter | Data Type | Default | Description | Supported Values
model | dict config | | The configuration of the model architecture |
train | dict config | | The configuration of the training task |
dataset | dict config | | The configuration of the dataset |
evaluate | dict config | | The configuration of the evaluation task |
inference | dict config | | The configuration of the inference task |
export | dict config | | The configuration of the ONNX export task |
gen_trt_engine | dict config | | The configuration of the TensorRT generation task. Only used in tao deploy |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | None | The directory where experiment results are saved |

model

The model parameter provides options to change the DINO architecture.

model:
  pretrained_model_path: /path/to/your-fan-small-pretrained-model
  backbone: fan_small
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048

Parameter | Datatype | Default | Description | Supported Values
pretrained_backbone_path | string | None | The optional path to the pretrained backbone file | string to the path
backbone | string | resnet_50 | The backbone name of the model. GCViT, FAN, ResNet 50, and NVDINOv2 are supported. | resnet_50, gc_vit_xxtiny, gc_vit_xtiny, gc_vit_tiny, gc_vit_small, gc_vit_base, gc_vit_large, fan_tiny, fan_small, fan_base, fan_large, vit_large_nvdinov2
train_backbone | bool | True | A flag specifying whether to train the backbone or not | True, False
num_feature_levels | unsigned int | 4 | The number of feature levels to use in the model | 1, 2, 3, 4, 5
return_interm_indices | int list | [1, 2, 3, 4] | The index of feature levels to use in the model. The length must match num_feature_levels. | [0, 1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2], [1]
dec_layers | unsigned int | 6 | The number of decoder layers in the transformer | >0
enc_layers | unsigned int | 6 | The number of encoder layers in the transformer | >0
num_queries | unsigned int | 900 | The number of queries | >0
dim_feedforward | unsigned int | 2048 | The dimension of the feedforward network | >0
num_select | unsigned int | 300 | The number of top-K predictions selected during post-processing | >0
use_dn | bool | True | A flag specifying whether to enable contrastive de-noising training in DINO | True, False
dn_number | unsigned int | 100 | The number of de-noising queries in DINO | >0
dn_box_noise_scale | float | 1.0 | The scale of noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied. | >=0
dn_label_noise_ratio | float | 0.5 | The scale of noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied. | >=0
pe_temperatureH | unsigned int | 20 | The temperature applied to the height dimension of Positional Sine Embedding | >0
pe_temperatureW | unsigned int | 20 | The temperature applied to the width dimension of Positional Sine Embedding | >0
fix_refpoints_hw | signed int | -1 | If this value is -1, width and height are learned separately for each box. If this value is -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed number. | >0, -1, -2
dropout_ratio | float | 0.0 | The probability to drop hidden units | 0.0 ~ 1.0
cls_loss_coef | float | 2.0 | The relative weight of the classification error in the matching cost | >0.0
bbox_loss_coef | float | 5.0 | The relative weight of the L1 error of the bounding box coordinates in the matching cost | >0.0
giou_loss_coef | float | 2.0 | The relative weight of the GIoU loss of the bounding box in the matching cost | >0.0
focal_alpha | float | 0.25 | The alpha in the focal loss | >0.0
aux_loss | bool | True | A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer) | True, False
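The constraint between num_feature_levels and return_interm_indices can be checked before launching a job. The following is a small sketch; check_feature_levels is a hypothetical helper, and the list of accepted index sets is copied from the table above rather than taken from the TAO source.

```python
# Index sets listed as supported values for return_interm_indices.
SUPPORTED_INDICES = [
    [0, 1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3],
    [1, 2],
    [1],
]


def check_feature_levels(num_feature_levels, return_interm_indices):
    """Raise ValueError when the two model settings disagree."""
    if list(return_interm_indices) not in SUPPORTED_INDICES:
        raise ValueError(f"unsupported return_interm_indices: {return_interm_indices}")
    if len(return_interm_indices) != num_feature_levels:
        raise ValueError(
            f"return_interm_indices has {len(return_interm_indices)} entries, "
            f"but num_feature_levels is {num_feature_levels}"
        )


check_feature_levels(4, [1, 2, 3, 4])     # passes silently
check_feature_levels(5, [0, 1, 2, 3, 4])  # passes silently
```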

train

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
  num_epochs: 12
  checkpoint_interval: 1
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1

Parameter | Datatype | Default | Description | Supported Values
optim | dict config | | The config for the optimizer, including the learning rate, learning scheduler, and weight decay | >0
num_epochs | unsigned int | 12 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0
clip_grad_norm | float | 0.1 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. | >=0
precision | string | fp32 | Specifying fp16 enables mixed-precision training. Training with fp16 can help save GPU memory. | fp32, fp16
distributed_strategy | string | ddp | The multi-GPU training strategy. DDP (Distributed Data Parallel) and Sharded DDP are supported. | ddp, ddp_sharded
activation_checkpoint | bool | True | A True value instructs train to recompute activations in the backward pass to save GPU memory, rather than storing them. | True, False
resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from |
pretrained_model_path | string | | The path to the pretrained model checkpoint to load for fine-tuning |
num_gpus | unsigned int | 1 | The number of GPUs to use | >0
num_nodes | unsigned int | 1 | The number of nodes. If the value is larger than 1, multi-node is enabled | >0
freeze | string list | [] | The list of layer names in the model to freeze. Example: ["backbone", "transformer.encoder", "input_proj"] |
verbose | bool | False | Whether to print detailed learning rate scaling from the optimizer | True, False

optim

The optim parameter defines the config for the optimizer in training, including the learning rate, learning scheduler, and weight decay.

optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [11]
  lr_decay: 0.1

Parameter | Datatype | Default | Description | Supported Values
lr | float | 1e-4 | The initial learning rate for training the model, excluding the backbone | >0.0
lr_backbone | float | 1e-5 | The initial learning rate for training the backbone | >0.0
lr_linear_proj_mult | float | 0.1 | The initial learning rate for training the linear projection layer | >0.0
momentum | float | 0.9 | The momentum for the AdamW optimizer | >0.0
weight_decay | float | 1e-4 | The weight decay coefficient | >0.0
lr_scheduler | string | MultiStep | The learning scheduler. MultiStep: decrease the lr by lr_decay at each step listed in lr_steps. StepLR: decrease the lr by lr_decay at every lr_step_size. | MultiStep/StepLR
lr_decay | float | 0.1 | The decreasing factor for the learning rate scheduler | >0.0
lr_decay_rate | float | 0.65 | The layer-wise learning decay rate used for ViT only | >0.0
lr_steps | int list | [11] | The steps to decrease the learning rate for the MultiStep scheduler | int list
lr_step_size | unsigned int | 11 | The steps to decrease the learning rate for the StepLR scheduler | >0
lr_monitor | string | val_loss | The monitor value for the AutoReduce scheduler | val_loss/train_loss
optimizer | string | AdamW | The optimizer to use during training | AdamW/SGD
dataset

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/images/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    classmap: /path/to/coco/annotations/coco_classmap.txt
  num_classes: 91
  batch_size: 4
  workers: 8

Parameter | Datatype | Default | Description | Supported Values
train_data_sources | list dict | | The training data sources. image_dir: the directory that contains the training images. json_file: the path to the training annotation JSON file in COCO format. |
val_data_sources | list dict | | The validation data sources. image_dir: the directory that contains the validation images. json_file: the path to the validation annotation JSON file in COCO format. |
test_data_sources | dict | | The test data sources for evaluation. image_dir: the directory that contains the test images. json_file: the path to the test annotation JSON file in COCO format. |
infer_data_sources | dict | | The data sources for inference. image_dir: the directory that contains the inference images. classmap: the path to the .txt file that contains class names. |
augmentation | dict config | | The parameters to define the augmentation method |
num_classes | unsigned int | 91 | The number of classes in the training data | >0
batch_size | unsigned int | 4 | The batch size for training and validation | >0
workers | unsigned int | 8 | The number of parallel workers processing data | >0
train_sampler | string | default_sampler | The minibatch sampling method. Non-default sampling methods can be enabled for multi-node jobs. This config doesn't have any effect if dataset_type isn't set to default. | default_sampler, non_uniform_sampler, uniform_sampler
dataset_type | string | serialized | If set to default, the standard CocoDetection dataset structure from torchvision is used, which loads the COCO annotation in every subprocess. This leads to redundant copies of the data and can cause RAM usage to explode if workers is high. If set to serialized, the data is serialized through pickle and torch.Tensor, which allows the data to be shared across subprocesses. As a result, RAM usage can be greatly reduced. | serialized, default

augmentation

The augmentation parameter contains hyperparameters for augmentation.

augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800

Parameter | Datatype | Default | Description | Supported Values
scales | int list | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] | A list of sizes to perform random resize. |
input_mean | float list | [0.485, 0.456, 0.406] | The input mean for RGB frames: (input - mean) / std | float list / size=1 or 3
input_std | float list | [0.229, 0.224, 0.225] | The input standard deviation for RGB frames: (input - mean) / std | float list / size=1 or 3
horizontal_flip_prob | float | 0.5 | The probability for horizontal flip during training | >=0
train_random_resize | int list | [400, 500, 600] | A list of sizes to perform random resize for training data | int list
train_random_crop_min | unsigned int | 384 | The minimum random crop size for training data | >0
train_random_crop_max | unsigned int | 600 | The maximum random crop size for training data | >0
random_resize_max_size | unsigned int | 1333 | The maximum random resize size for training data | >0
test_random_resize | unsigned int | 800 | The random resize size for test data | >0
fixed_padding | bool | True | A flag specifying whether to resize the image (with no padding) to (sorted(scales)[-1], random_resize_max_size) to prevent a CPU memory leak. | True/False
fixed_random_crop | unsigned int | | A flag to enable Large Scale Jittering, which is used for ViT backbones. The resulting image resolution is fixed to fixed_random_crop. | Divisible by 32
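A few of these augmentation settings reduce to simple arithmetic, sketched below. The helper names are hypothetical, the pixel values are assumed to be scaled to the range 0 to 1 before normalization, and the fixed-padding size reads the expression in the table as the largest entry of scales paired with random_resize_max_size.

```python
INPUT_MEAN = [0.485, 0.456, 0.406]
INPUT_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=INPUT_MEAN, std=INPUT_STD):
    """Apply (input - mean) / std per channel to one RGB pixel in [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


def fixed_padding_size(scales, random_resize_max_size):
    """Target size used with fixed_padding: (largest scale, max resize size)."""
    return (sorted(scales)[-1], random_resize_max_size)


def is_valid_fixed_random_crop(size):
    """fixed_random_crop must be a positive multiple of 32."""
    return size > 0 and size % 32 == 0


scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
print(fixed_padding_size(scales, 1333))  # (800, 1333)
print(is_valid_fixed_random_crop(1024))  # True
print(normalize_pixel((0.485, 0.456, 0.406)))  # (0.0, 0.0, 0.0)
```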

Example spec file for ViT backbones

Note

The following spec file is only relevant for TAO Toolkit versions 5.2 and later. A Vision Transformer (ViT) requires a different augmentation and learning rate decay to work as the backbone of a detector.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /path/to/coco/val2017/
      json_file: /path/to/coco/annotations/instances_val2017.json
  num_classes: 91
  batch_size: 4
  workers: 8
  augmentation:
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    fixed_random_crop: 1024
    random_resize_max_size: 1024
    test_random_resize: 1024
    fixed_padding: True
model:
  pretrained_model_path: /path/to/nvdinov2_patch16_model
  backbone: vit_large_nvdinov2
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [11]
    layer_decay_rate: 0.65
  num_epochs: 12

To train a DINO model, use this command:

tao model dino train [-h] -e <experiment_spec> [-r <results_dir>] [-k <key>]

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

  • -r, --results_dir: The path to the folder where the experiment outputs should be written. If this argument is not specified, the results_dir from the spec file will be used.

  • -k, --key: A user-specific encoding key to save or load a .tlt model. If this argument is not specified, the model checkpoint will not be encrypted.

  • --gpus: The number of GPUs used to run training

  • --num_nodes: The number of nodes used to run training. If this value is larger than 1, distributed multi-node training is enabled.

  • -h, --help: Show this help message and exit.

Sample Usage

Here’s an example of the train command:

tao model dino train -e /path/to/spec.yaml

Optimizing Resources for Training DINO

Training DINO on a standard dataset like COCO requires strong GPUs (e.g., V100 or A100) with at least 15 GB of VRAM and a large amount of CPU memory. This section outlines some strategies you can use to launch training with limited resources.

Optimize GPU Memory

There are various ways to optimize GPU memory usage. One obvious trick is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. Hence, we recommend the following configurations to reduce GPU memory consumption.

  • Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations instead of caching them in memory reduces memory usage.

  • Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shares the gradient calculation across different processes to help reduce GPU memory.

  • Try using a more lightweight backbone like fan_tiny, or freeze the backbone by setting model.train_backbone to False.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
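The settings in the list above can be collected in one place and printed in dotted key=value form, the same form used for spec overrides such as evaluate.checkpoint=... elsewhere in this document. This is only a sketch: the flatten helper is hypothetical, the batch size of 2 is an arbitrary example value, and whether the train subtask accepts these keys on the command line is an assumption; they can always be placed in the spec file instead.

```python
def flatten(config, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form."""
    items = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items.update(flatten(value, dotted))
        else:
            items[dotted] = value
    return items


# Memory-saving settings taken from the list above.
memory_saving = {
    "train": {
        "precision": "fp16",
        "activation_checkpoint": True,
        "distributed_strategy": "ddp_sharded",
    },
    "model": {"backbone": "fan_tiny"},
    "dataset": {"batch_size": 2},  # arbitrary example value
}

for key, value in flatten(memory_saving).items():
    print(f"{key}={value}")
```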

Optimize CPU Memory

To speed up data loading, it is common practice to set a high number of workers to spawn multiple processes. However, this can cause the CPU to run out of memory if your annotation file is very large. Hence, we recommend the following configurations to reduce CPU memory consumption.

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before the batch formulation. Due to random resize and random crop augmentation during training, the resulting image resolution after transform can vary across images. Such variable image resolutions can cause a memory leak, in which CPU memory usage slowly builds up until the process runs out of memory in the middle of training. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.

evaluate

The evaluate parameter defines the hyperparameters of the evaluate process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1

Parameter | Datatype | Default | Description | Supported Values
checkpoint | string | | Path to PyTorch model to evaluate |
trt_engine | string | | Path to TensorRT model to evaluate. Should be only used with tao deploy |
num_gpus | unsigned int | 1 | The number of GPUs to use | >0
conf_threshold | float | 0.0 | Confidence threshold to filter predictions | >=0

To run evaluation with a DINO model, use this command:

tao model dino evaluate [-h] -e <experiment_spec> [-r <results_dir>] [-k <key>] evaluate.checkpoint=<model to be evaluated>

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

Optional Arguments

  • -k, --key: A user-specific encoding key to save or load a .tlt model. If this value is not specified, a .pth model must be used.

  • -r, --results_dir: The directory where the evaluation result is stored

  • evaluate.checkpoint: The .tlt or .pth model to be evaluated

Sample Usage

Here’s an example of using the evaluate command:

tao model dino evaluate -e /path/to/spec.yaml -r /path/to/results/ evaluate.checkpoint=/path/to/model.pth

inference

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    person: red
    car: blue

Parameter | Datatype | Default | Description | Supported Values
checkpoint | string | | Path to the PyTorch model to run inference with |
trt_engine | string | | Path to the TensorRT model to run inference with. Should be only used with tao deploy |
num_gpus | unsigned int | 1 | The number of GPUs to use | >0
conf_threshold | float | 0.5 | Confidence threshold to filter predictions | >=0
color_map | dict | | Color map of the bounding boxes for each class | string dict

The inference tool for DINO models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels on a directory of images.

tao model dino inference [-h] -e <experiment spec file> [-r <results_dir>] [-k <key>] inference.checkpoint=<model to be inferenced>

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

Optional Arguments

  • -k, --key: A user-specific encoding key to save or load a .tlt model. If this value is not specified, a .pth model must be used

  • -r, --results_dir: The directory where the inference result is stored

  • inference.checkpoint: The .tlt or .pth model to inference

Sample Usage

Here’s an example of using the inference command:

tao model dino inference -e /path/to/spec.yaml -r /path/to/results/ inference.checkpoint=/path/to/model.pth
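To illustrate how conf_threshold and the KITTI-format labels relate, the sketch below filters a list of detections and formats the survivors as KITTI label lines. The detection dictionaries, the helper names, the use of a greater-than-or-equal comparison, and the zero-filled 3D fields are all assumptions made for illustration; the exact layout written by the inference tool is not specified here.

```python
def filter_by_confidence(detections, conf_threshold=0.5):
    """Keep detections whose score reaches the threshold."""
    return [det for det in detections if det["score"] >= conf_threshold]


def to_kitti_line(det):
    """Format one detection as a KITTI label line with the 3D fields zeroed."""
    x1, y1, x2, y2 = det["bbox"]
    # KITTI fields: type, truncated, occluded, alpha, bbox (4 values),
    # dimensions (3), location (3), rotation_y, then an optional score.
    return (
        f"{det['label']} 0.00 0 0.00 "
        f"{x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f} "
        f"0.00 0.00 0.00 0.00 0.00 0.00 0.00 {det['score']:.3f}"
    )


# Hypothetical model output for one image.
detections = [
    {"label": "person", "score": 0.91, "bbox": (10.0, 20.0, 110.0, 220.0)},
    {"label": "car", "score": 0.32, "bbox": (200.0, 50.0, 400.0, 180.0)},
]

for det in filter_by_confidence(detections, conf_threshold=0.5):
    print(to_kitti_line(det))
```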

export

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Parameter | Datatype | Default | Description | Supported Values
checkpoint | string | | The path to the PyTorch model to export |
onnx_file | string | | The path to the .onnx file |
on_cpu | bool | True | If this value is True, the DMHA module will be exported as standard PyTorch. If this value is False, the module will be exported using the TRT Plugin. | True, False
opset_version | unsigned int | 12 | The opset version of the exported ONNX | >0
input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3
input_width | unsigned int | 960 | The input width | >0
input_height | unsigned int | 544 | The input height | >0
batch_size | int | -1 | The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size. | >=-1

tao model dino export [-h] -e <experiment spec file> [-r <results_dir>] [-k <key>] export.checkpoint=<model to export> export.onnx_file=<onnx path>

Required Arguments

  • -e, --experiment_spec: The path to an experiment spec file

Optional Arguments

  • -k, --key: A user-specific encoding key to save or load a .tlt model. If this value is not specified, a .pth model must be used

  • -r, --results_dir: The directory where the export result is stored

  • export.checkpoint: The .tlt or .pth model to export

  • export.onnx_file: The path where the .etlt or .onnx model will be saved

Sample Usage

Here’s an example of using the export command:

tao model dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx

For DINO, refer to the Integrating a Deformable DETR Model documentation for more information about deploying the model to DeepStream.

© Copyright 2024, NVIDIA. Last updated on Mar 18, 2024.