Grounding DINO#

Grounding DINO is an open-vocabulary object-detection model included in TAO. Through joint training on text and image data, Grounding DINO accepts a wide range of text inputs and outputs the corresponding bounding boxes.

It supports the following tasks:

  • train

  • evaluate

  • inference

  • export

Each task is explained in detail in the following sections.

Note

  • Throughout this documentation are references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Data Input for Grounding DINO#

Grounding DINO expects directories of images, training annotations as JSONL files in ODVG format, and validation annotations as JSON files in COCO format.
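As a rough illustration of the difference between the two annotation formats, the sketch below turns one COCO image entry into a detection-style ODVG record. The record layout (filename, height, width, and detection.instances with a corner-format bbox, a label, and a category) follows the ODVG convention used by open-source Grounding DINO fine-tuning tooling and is an assumption here; verify it against the files produced by TAO Data Services. coco_image_to_odvg and to_jsonl are hypothetical helper names, not TAO APIs.

```python
import json


def coco_image_to_odvg(image, annotations, id_to_name, id_to_label):
    """Build one ODVG-style detection record (assumed layout) from COCO data.

    image: COCO image dict with file_name, height, and width.
    annotations: COCO annotation dicts for this image (bbox is [x, y, w, h]).
    id_to_name: original category_id -> category name.
    id_to_label: original category_id -> contiguous label starting at 0.
    """
    instances = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]
        instances.append({
            # Assumption: ODVG stores corner coordinates, not width/height.
            "bbox": [x, y, x + w, y + h],
            "label": id_to_label[ann["category_id"]],
            "category": id_to_name[ann["category_id"]],
        })
    return {
        "filename": image["file_name"],
        "height": image["height"],
        "width": image["width"],
        "detection": {"instances": instances},
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Each line of the resulting JSONL file describes one image, which is why the training json_file entries in the spec use the .jsonl extension.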

Note

Unlike other object detection networks in TAO, the category_id values in your COCO JSON file for Grounding DINO must start from 0 and be contiguous, meaning that category IDs range from 0 to num_classes - 1. Because the original COCO annotations do not have contiguous category IDs, convert them first; see tao dataset annotations convert in the TAO Data Service documentation.
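To make the requirement concrete, here is a minimal sketch of the remapping itself. It is not the TAO Data Service converter, only an illustration of what "contiguous, starting from 0" means; make_category_ids_contiguous is a hypothetical helper that operates on an already-loaded COCO dictionary.

```python
import copy


def make_category_ids_contiguous(coco):
    """Return a copy of a COCO dict whose category ids run from 0 to N - 1,
    together with the old-to-new id mapping."""
    remapped = copy.deepcopy(coco)
    # Sort by the original id so the relative order of categories is preserved.
    old_ids = sorted(cat["id"] for cat in remapped["categories"])
    old_to_new = {old: new for new, old in enumerate(old_ids)}
    for cat in remapped["categories"]:
        cat["id"] = old_to_new[cat["id"]]
    for ann in remapped["annotations"]:
        ann["category_id"] = old_to_new[ann["category_id"]]
    return remapped, old_to_new
```

For the standard COCO annotations, whose 80 category IDs span 1 to 90 with gaps, this kind of mapping yields IDs 0 to 79.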

Creating an Experiment Spec File#

The training experiment spec file for Grounding DINO includes model, train, and dataset parameters.

With the FTMS Client, use the following commands to get an experiment spec file for Grounding DINO:

BASE_EXPERIMENT_ID=$(tao grounding_dino list-base-experiments | jq -r '.[0].id')
SPECS=$(tao grounding_dino get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

With TAO Launcher, the following is an example spec file for fine-tuning a Grounding DINO model with a swin_tiny_224_1k backbone on a COCO dataset:

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
  val_data_sources:
    - image_dir: /path/to/coco/val2017/
      json_file: /path/to/coco/annotations/instances_val2017_contiguous.json  # category ids need to be contiguous
  max_labels: 80  # Max number of positive + negative labels passed to the text encoder
  batch_size: 4
  workers: 8
  dataset_type: serialized  # To reduce the system memory usage
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True  # Adding bias in the contrastive embedding layer for training stability
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20]
  num_epochs: 30
  freeze: ["backbone.0", "bert"]  # if only finetuning
  pretrained_model_path: /path/to/your-gdino-pretrained-model  # if only finetuning
  precision: bf16  # for efficient training

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

encryption_key

string

FALSE

results_dir

string

/results

FALSE

wandb

collection

FALSE

model

collection

Configurable parameters to construct the model for a Grounding DINO experiment.

FALSE

dataset

collection

Configurable parameters to construct the dataset for a Grounding DINO experiment.

FALSE

train

collection

Configurable parameters to construct the trainer for a Grounding DINO experiment.

FALSE

evaluate

collection

Configurable parameters to construct the evaluator for a Grounding DINO experiment.

FALSE

inference

collection

Configurable parameters to construct the inferencer for a Grounding DINO experiment.

FALSE

export

collection

Configurable parameters to construct the exporter for a Grounding DINO experiment.

FALSE

gen_trt_engine

collection

Configurable parameters to construct the TensorRT engine builder for a Grounding DINO experiment.

FALSE

model#

The model parameter provides options to change the Grounding DINO architecture.

model:
  pretrained_model_path: /path/to/your-gdino-pretrained-model
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

pretrained_backbone_path

string

[Optional] Path to a pretrained backbone file.

FALSE

backbone

string

The backbone name of the model. The TAO implementation of Grounding DINO supports Swin backbones.

swin_tiny_224_1k

swin_tiny_224_1k,swin_base_224_22k,swin_base_384_22k,swin_large_224_22k,swin_large_384_22k

FALSE

num_queries

int

The number of queries

900

1

inf

TRUE

num_feature_levels

int

The number of feature levels to use in the model

4

1

5

FALSE

set_cost_class

float

The relative weight of the classification error in the matching cost.

1.0

0.0

inf

FALSE

set_cost_bbox

float

The relative weight of the L1 error of the bounding box coordinates in the matching cost.

5.0

0.0

inf

FALSE

set_cost_giou

float

The relative weight of the GIoU loss of the bounding box in the matching cost.

2.0

0.0

inf

FALSE

cls_loss_coef

float

The relative weight of the classification error in the final loss.

2.0

0.0

inf

FALSE

bbox_loss_coef

float

The relative weight of the L1 error of the bounding box coordinates in the final loss.

5.0

0.0

inf

FALSE

giou_loss_coef

float

The relative weight of the GIoU loss of the bounding box in the final loss.

2.0

0.0

inf

FALSE

num_select

int

The number of top-K predictions selected during post-process

300

1

TRUE

interm_loss_coef

float

1.0

FALSE

no_interm_box_loss

bool

No intermediate bbox loss.

False

FALSE

pre_norm

bool

Flag to add layer norm in the encoder or not.

False

FALSE

two_stage_type

string

Type of two stage in DINO

standard

standard,no

FALSE

decoder_sa_type

string

Type of decoder self attention.

sa

sa,ca_label,ca_content

FALSE

embed_init_tgt

bool

Flag to add target embedding

True

FALSE

fix_refpoints_hw

int

If this value is -1, width and height are learned separately for each box. If this value is -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed number.

-1

-2

inf

FALSE

pe_temperatureH

int

The temperature applied to the height dimension of the positional sine embedding.

20

1

inf

FALSE

pe_temperatureW

int

The temperature applied to the width dimension of the positional sine embedding.

20

1

inf

FALSE

return_interm_indices

list

The index of feature levels to use in the model. The length must match num_feature_levels.

[1, 2, 3, 4]

FALSE

use_dn

bool

A flag specifying whether to enable contrastive de-noising training in DINO.

True

FALSE

dn_number

int

The number of denoising queries in DINO.

0

0

inf

FALSE

dn_box_noise_scale

float

The scale of noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied.

1.0

0.0

inf

FALSE

dn_label_noise_ratio

float

The scale of the noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied.

0.5

0.0

FALSE

focal_alpha

float

The alpha value in the focal loss.

0.25

FALSE

focal_gamma

float

The gamma value in the focal loss.

2.0

FALSE

clip_max_norm

float

0.1

FALSE

nheads

int

Number of heads

8

FALSE

dropout_ratio

float

The probability to drop hidden units.

0.0

0.0

1.0

FALSE

hidden_dim

int

Dimension of the hidden units.

256

FALSE

enc_layers

int

Number of encoder layers in the transformer.

6

1

TRUE

dec_layers

int

Number of decoder layers in the transformer.

6

1

TRUE

dim_feedforward

int

Dimension of the feedforward network.

2048

1

FALSE

dec_n_points

int

Number of reference points in the decoder.

4

1

FALSE

enc_n_points

int

Number of reference points in the encoder.

4

1

FALSE

aux_loss

bool

A flag specifying whether to use auxiliary decoding losses (loss at each decoder layer).

True

FALSE

dilation

bool

A flag specifying whether to enable dilation in the backbone.

False

FALSE

train_backbone

bool

Flag to set backbone weights as trainable or frozen. When set to False, the backbone weights are frozen.

True

FALSE

text_encoder_type

string

BERT encoder type. If only the name of the type is provided, the weights are downloaded from the Hugging Face Hub. If a path is provided, the weights are loaded from the local path.

bert-base-uncased

FALSE

max_text_len

int

Maximum text length of BERT.

256

1

FALSE

class_embed_bias

bool

Flag to set bias in the contrastive embedding.

False

FALSE

log_scale

string

[Optional] The initial value of a learnable parameter to multiply with the similarity matrix to normalize the output. Defaults to None. If set to 'auto', the similarity matrix is normalized by a fixed value sqrt(d_c), where d_c is the channel number. If set to 'none' or None, no normalization is applied.

none

FALSE

loss_types

list

Losses to be used during training.

['labels', 'boxes']

FALSE

backbone_names

list

Prefix of the tensor names corresponding to the backbone.

['backbone.0', 'bert']

FALSE

linear_proj_names

list

Linear projection layer names.

['reference_points', 'sampling_offsets']

FALSE
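To show how set_cost_class, set_cost_bbox, and set_cost_giou interact, the sketch below computes the matching cost for a single prediction and ground-truth pair as a weighted sum, using the default weights from the table. It is a simplification under stated assumptions: the classification cost is passed in as a precomputed number, boxes are (x1, y1, x2, y2) tuples with non-zero area, and the GIoU term enters as negative GIoU (the DETR-family convention). The function names are hypothetical and this is not TAO's matcher.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2); empty boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with non-zero area."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def matching_cost(pred_box, gt_box, class_cost,
                  set_cost_class=1.0, set_cost_bbox=5.0, set_cost_giou=2.0):
    """Weighted sum of classification, L1 box, and GIoU costs for one pair."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return (set_cost_class * class_cost
            + set_cost_bbox * l1
            - set_cost_giou * giou(pred_box, gt_box))
```

Raising set_cost_bbox or set_cost_giou relative to set_cost_class makes the matcher favor well-localized predictions over confidently classified ones. The cls_loss_coef, bbox_loss_coef, and giou_loss_coef parameters play the analogous role in the final loss.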

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [10, 20]
    lr_decay: 0.1
  num_epochs: 30
  checkpoint_interval: 1
  precision: bf16
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1
  freeze: ["backbone.0", "bert"]
  pretrained_model_path: /path/to/pretrained/model

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

num_gpus

int

The number of GPUs to run the train job.

1

1

FALSE

gpu_ids

list

List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus.

[0]

FALSE

num_nodes

int

Number of nodes to run the training on. If > 1, then multi-node is enabled.

1

FALSE

seed

int

The seed for the initializer in PyTorch. If < 0, disable fixed seed.

1234

-1

inf

FALSE

cudnn

collection

FALSE

num_epochs

int

Number of epochs to run the training.

10

1

inf

TRUE

checkpoint_interval

int

The interval (in epochs) at which a checkpoint is saved. Helps resume training.

1

1

FALSE

validation_interval

int

The interval (in epochs) at which evaluation is run on the validation dataset.

1

1

FALSE

resume_training_checkpoint_path

string

Path to the checkpoint to resume training from.

FALSE

results_dir

string

Path to where all the assets generated from a task are stored.

FALSE

freeze

list

List of layer names to freeze. Example: ["backbone", "transformer.encoder", "input_proj"].

[]

FALSE

pretrained_model_path

string

Path to a pre-trained Grounding DINO model to initialize the current training from.

FALSE

clip_grad_norm

float

Amount to clip the gradient by L2 Norm. A value of 0.0 specifies no clipping.

0.1

FALSE

is_dry_run

bool

Whether to run the trainer in Dry Run mode. This serves as a good means to validate the spec file and run a sanity check on the trainer without actually initializing and running the trainer.

False

FALSE

optim

collection

Hyper parameters to configure the optimizer.

FALSE

precision

string

Precision to run the training on.

fp32

fp16,fp32,bf16

FALSE

distributed_strategy

string

The multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported.

ddp

ddp,fsdp

FALSE

activation_checkpoint

bool

When set to True, activations are recomputed in the backward pass rather than stored, which saves GPU memory.

True

FALSE

verbose

bool

Flag to enable printing of detailed learning rate scaling from the optimizer.

False

FALSE

optim#

The optim parameter defines the config for the optimizer in training, including the learning rate, learning scheduler, and weight decay.

optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [10, 20]
  lr_decay: 0.1

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

optimizer

string

Type of optimizer used to train the network.

AdamW

AdamW,SGD

FALSE

monitor_name

string

The metric value to be monitored for the AutoReduce Scheduler.

val_loss

val_loss,train_loss

FALSE

lr

float

The initial learning rate for training the model, excluding the backbone.

0.0002

TRUE

lr_backbone

float

The initial learning rate for training the backbone.

2e-05

TRUE

lr_linear_proj_mult

float

The initial learning rate for training the linear projection layer.

0.1

TRUE

momentum

float

The momentum for the AdamW optimizer.

0.9

TRUE

weight_decay

float

The weight decay coefficient.

0.0001

TRUE

lr_scheduler

string

The learning rate scheduler. MultiStep: decrease the lr by lr_decay at each step listed in lr_steps. StepLR: decrease the lr by lr_decay every lr_step_size steps.

MultiStep

MultiStep,StepLR

FALSE

lr_steps

list

The steps at which the learning rate must be decreased. This is applicable only with the MultiStep LR scheduler.

[10]

FALSE

lr_step_size

int

The number of steps between learning rate decreases in the StepLR scheduler.

10

TRUE

lr_decay

float

The decreasing factor for the learning rate scheduler.

0.1

TRUE
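The two schedulers can be compared with a short worked example. The sketch below assumes the scheduler advances once per epoch and follows the usual MultiStep and StepLR definitions (the same rules as PyTorch's MultiStepLR and StepLR); the function names are illustrative only.

```python
def multistep_lr(base_lr, epoch, lr_steps, lr_decay):
    """Learning rate under MultiStep: decay once per milestone already reached."""
    passed = sum(1 for step in lr_steps if epoch >= step)
    return base_lr * lr_decay ** passed


def step_lr(base_lr, epoch, lr_step_size, lr_decay):
    """Learning rate under StepLR: decay once every lr_step_size epochs."""
    return base_lr * lr_decay ** (epoch // lr_step_size)
```

With lr: 0.0002, lr_steps: [10, 20], and lr_decay: 0.1 as in the example spec, the learning rate is 2e-4 for epochs 0 to 9, about 2e-5 for epochs 10 to 19, and about 2e-6 from epoch 20 onward.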

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/refcoco.jsonl  # grounding dataset which doesn't require label_map
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/coco/annotations/instances_val2017_contiguous.json  # category ids need to be contiguous
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      captions: ["black cat", "car"]
  max_labels: 80
  batch_size: 4
  workers: 8

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

train_data_sources

list

The list of data sources for training. image_dir: the directory that contains the training images. json_file: the path of the JSONL file, which uses the ODVG training-annotation format. label_map: (optional) the path of the label mapping, required only for detection datasets.

[{'image_dir': '', 'json_file': '', 'label_map': ''}, {'image_dir': '', 'json_file': ''}]

FALSE

val_data_sources

collection

The data source for validation. image_dir: the directory that contains the validation images. json_file: the path of the JSON file, which uses the COCO validation-annotation format. Note: category IDs must start from 0 to calculate the validation loss. Run the Data Services annotations convert task to make the categories contiguous.

{'image_dir': '', 'json_file': ''}

FALSE

test_data_sources

collection

The data source for testing. image_dir: the directory that contains the test images. json_file: the path of the JSON file, which uses the COCO test-annotation format.

{'image_dir': '', 'json_file': ''}

FALSE

infer_data_sources

collection

The data source for inference. image_dir: the list of directories that contain the inference images. captions: the list of captions to run inference on.

{'image_dir': [''], 'captions': ['']}

FALSE

batch_size

int

The batch size for training and validation

4

1

inf

TRUE

workers

int

The number of parallel workers processing data

8

1

inf

TRUE

pin_memory

bool

Flag to enable the dataloader to allocate page-locked memory for faster transfer of data between the CPU and GPU.

True

FALSE

dataset_type

string

If set to default, the standard map-style dataset structure from torch is followed, which loads the ODVG annotations in every subprocess. This leads to redundant copies of the data and can cause RAM usage to grow sharply if workers is high. If set to serialized, the data is serialized through pickle and torch.Tensor, which allows the data to be shared across subprocesses. As a result, RAM usage can be greatly reduced.

serialized

serialized,default

FALSE

max_labels

int

The total number of labels to sample from. After sampling positive labels, random negative samples are sampled so that the total number of labels is equal to max_labels. For detection datasets, negative labels are categories not present in the image. For grounding datasets, negative labels are phrases in the original caption not present in the image. Setting a higher max_labels may improve the robustness of the model at the cost of longer training time.

50

1

inf

FALSE

eval_class_ids

list

IDs of the classes for evaluation.

[1]

FALSE

augmentation

collection

Configuration parameters for data augmentation.

FALSE
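The max_labels behavior for a detection dataset can be sketched as follows: keep the positive labels for an image, then pad with randomly chosen negative categories until max_labels is reached. This is an illustration of the idea described in the table, not TAO's sampling code; sample_labels and its arguments are hypothetical names.

```python
import random


def sample_labels(positives, all_categories, max_labels, rng=None):
    """Return the positive labels followed by random negatives, up to max_labels."""
    rng = rng or random.Random()
    # Deduplicate while preserving order, and never exceed max_labels.
    kept = list(dict.fromkeys(positives))[:max_labels]
    kept_set = set(kept)
    # Negatives are categories that are not present in the image.
    candidates = [c for c in all_categories if c not in kept_set]
    num_negatives = min(max_labels - len(kept), len(candidates))
    return kept + rng.sample(candidates, num_negatives)
```

With 80 COCO categories and max_labels: 80, every category is passed to the text encoder for each image. A smaller value trades some exposure to negatives for shorter text inputs and faster training.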

augmentation#

The augmentation parameter contains hyperparameters for augmentation.

augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

scales

list

A list of sizes to perform random resize on.

[480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

FALSE

input_mean

list

The input mean for RGB frames

[0.485, 0.456, 0.406]

FALSE

input_std

list

The input standard deviation per pixel for RGB frames

[0.229, 0.224, 0.225]

FALSE

train_random_resize

list

A list of sizes to perform random resize for training data

[400, 500, 600]

FALSE

horizontal_flip_prob

float

The probability for horizontal flip during training

0.5

0.0

1.0

TRUE

train_random_crop_min

int

The minimum random crop size for training data

384

1

inf

TRUE

train_random_crop_max

int

The maximum random crop size for training data

600

1

inf

TRUE

random_resize_max_size

int

The maximum random resize size for training data

1333

1

inf

TRUE

test_random_resize

int

The random resize size for test data

800

1

inf

TRUE

fixed_padding

bool

A flag specifying whether to resize the image (with no padding) to (sorted(scales)[-1], random_resize_max_size) to prevent a CPU memory leak.

True

FALSE

fixed_random_crop

int

A flag to enable Large Scale Jittering, which is used for ViT backbones. The resulting image resolution is fixed to fixed_random_crop.

1024

1

inf

FALSE
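To see how scales and random_resize_max_size relate, the sketch below computes the output size of a shorter-side resize that is capped on the longer side. It assumes the DETR-style rule commonly used by DINO-family models (pick a target for the shorter side from scales, then shrink it if the longer side would exceed the maximum); whether TAO applies exactly this rule is an assumption, and resized_size is a hypothetical name.

```python
def resized_size(width, height, scale, max_size):
    """Output (width, height) after a shorter-side resize capped by max_size."""
    short, long = min(width, height), max(width, height)
    # Shrink the target if the longer side would otherwise exceed max_size.
    if long / short * scale > max_size:
        scale = int(round(max_size * short / long))
    if width <= height:
        return scale, int(scale * height / width)
    return int(scale * width / height), scale
```

For a 640x480 image with a sampled scale of 800 and a maximum of 1333, the shorter side becomes 800 and the longer side 1066. For a 2000x1000 image the cap applies and the result is 1332x666. Lowering the values in scales or random_resize_max_size therefore reduces the tensor sizes seen during training.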

Training the Model#

To train a Grounding DINO model, use this command:

TRAIN_JOB_ID=$(tao grounding_dino create-job \
  --kind experiment \
  --name "grounding_dino_train" \
  --action train \
  --workspace-id $WORKSPACE_ID \
  --specs "$TRAIN_SPECS" \
  --train-datasets '["'$DATASET_ID'"]' \
  --eval-dataset "$DATASET_ID" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')
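Because the FTMS Client takes the spec as JSON, one way to prepare the string passed through --specs is to start from the default schema and override individual fields. The sketch below assumes the default spec is available as a JSON string (for example, the value captured by get-job-schema earlier); apply_overrides and the dotted-key convention are illustrative choices, not part of the TAO client.

```python
import json


def apply_overrides(specs_json, overrides):
    """Return a JSON spec string with dotted-key overrides applied."""
    specs = json.loads(specs_json)
    for dotted_key, value in overrides.items():
        node = specs
        *parents, leaf = dotted_key.split(".")
        # Walk (and create, if missing) the nested sections down to the leaf.
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return json.dumps(specs)
```

For example, apply_overrides(default_specs, {"train.num_epochs": 30, "train.precision": "bf16"}) returns a string in which only those two fields differ from the defaults.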

tao model grounding_dino train [-h] -e <experiment_spec>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

The following arguments are optional to run the command.

  • -h, --help: Show this help message and exit.

Sample Usage

The following is an example of the train command:

tao model grounding_dino train -e /path/to/spec.yaml

Optimizing Resources for Training Grounding DINO#

Training Grounding DINO on a standard dataset like COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM and a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. One trick is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. We recommend setting the following configurations to optimize GPU consumption:

  • Set train.precision to bf16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations instead of caching them in memory reduces memory usage.

  • Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training. This shards model parameters, gradients, and optimizer states across processes to help reduce GPU memory.

  • Try using more lightweight backbones like swin_tiny_224_1k or freeze the backbone through setting model.train_backbone to False.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
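The effect of precision and resolution on memory can be estimated with simple arithmetic: a tensor's size is its element count times the bytes per element (4 for fp32, 2 for fp16 and bf16). The sketch below applies this to an input batch; it is a back-of-the-envelope aid, not a profiler, and tensor_bytes is a hypothetical helper. Mixed precision typically keeps some tensors in fp32, so overall savings vary by model.

```python
import math

# Storage size per element for each precision option in train.precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def tensor_bytes(shape, precision):
    """Bytes needed to store a dense tensor of the given shape and precision."""
    return math.prod(shape) * BYTES_PER_ELEMENT[precision]
```

A batch of four 3-channel images at 800x1333 takes 51,187,200 bytes in fp32 and 25,593,600 bytes in bf16. Halving the batch size or reducing the resolution shrinks the figure proportionally.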

Optimize CPU Memory#

To speed up data loading, you typically set a high number of workers to spawn multiple processes. However, this can exhaust CPU memory if your annotation file is very large. We recommend the following configurations to reduce CPU memory consumption:

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before the batch formulation. Because of random resize and random crop augmentation during training, the resulting image resolution after the transform can vary across images. Such variable image resolutions can cause a memory leak, with CPU memory usage slowly growing until the process runs out of memory in the middle of training. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
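The idea behind a serialized dataset can be sketched with the standard library alone: instead of holding many small Python objects (which each worker process ends up copying), the records are pickled into one contiguous buffer and decoded on access. The dataset_type description mentions pickle and torch.Tensor; this sketch substitutes a bytes buffer and a plain list of offsets to stay self-contained, so treat it as an illustration rather than TAO's implementation. SerializedRecords is a hypothetical class name.

```python
import pickle


class SerializedRecords:
    """Store records as one bytes buffer plus offsets instead of Python objects."""

    def __init__(self, records):
        chunks = [pickle.dumps(r, protocol=pickle.HIGHEST_PROTOCOL) for r in records]
        # offsets[i] and offsets[i + 1] delimit record i inside the buffer.
        self._offsets = [0]
        for chunk in chunks:
            self._offsets.append(self._offsets[-1] + len(chunk))
        self._buffer = b"".join(chunks)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        start, end = self._offsets[index], self._offsets[index + 1]
        # Only the requested record is deserialized.
        return pickle.loads(self._buffer[start:end])
```

Each access pays a small deserialization cost, but the annotation data lives in a single large object that worker processes can share rather than duplicate.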

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluate process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

num_gpus

int

1

FALSE

gpu_ids

list

[0]

FALSE

num_nodes

int

1

FALSE

checkpoint

string

???

FALSE

results_dir

string

FALSE

input_width

int

Width of the input image tensor.

1

FALSE

input_height

int

Height of the input image tensor.

1

FALSE

trt_engine

string

Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy.

FALSE

conf_threshold

float

The value of the confidence threshold to be used when filtering out the final list of boxes.

0.0

FALSE

To run evaluation with a Grounding DINO model, use this command:

EVAL_JOB_ID=$(tao grounding_dino create-job \
  --kind experiment \
  --name "grounding_dino_evaluate" \
  --action evaluate \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --eval-dataset "$DATASET_ID" \
  --specs "$EVALUATE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

tao model grounding_dino evaluate [-h] -e <experiment_spec> \
                                      evaluate.checkpoint=<model to be evaluated>

Required Arguments

The following arguments are required.

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment.

Optional Arguments

The following arguments are optional to run the command.

  • evaluate.checkpoint: The .pth model to be evaluated.

Sample Usage

The following is an example of using the evaluate command:

tao model grounding_dino evaluate -e /path/to/spec.yaml evaluate.checkpoint=/path/to/model.pth

Running Inference with a Grounding DINO Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    "black cat": red
    car: blue
dataset:
  infer_data_sources:
    image_dir: /data/raw-data/val2017/
    captions: ["black cat", "car"]

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

num_gpus

int

1

FALSE

gpu_ids

list

[0]

FALSE

num_nodes

int

1

FALSE

checkpoint

string

???

FALSE

results_dir

string

FALSE

trt_engine

string

Path to the TensorRT engine to be used for inference. This only works with tao-deploy.

FALSE

color_map

collection

Class-wise dictionary with colors to render boxes.

FALSE

conf_threshold

float

The value of the confidence threshold to be used when filtering out the final list of boxes.

0.5

FALSE

is_internal

bool

Flag to render with internal directory structure.

False

FALSE

input_width

int

Width of the input image tensor.

960

32

FALSE

input_height

int

Height of the input image tensor.

544

32

FALSE

outline_width

int

Width in pixels of the bounding box outline.

3

1

FALSE

The inference tool for Grounding DINO models can be used to visualize bounding boxes and generate frame-by-frame KITTI format labels on a directory of images.
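For readers unfamiliar with KITTI labels, the sketch below shows how detections that pass conf_threshold could be written as KITTI-style lines: the class name, three unused fields, the 2D box, seven zeroed 3D fields, and the score. The exact formatting that TAO emits is not specified here, so the number formatting, the use of a greater-than-or-equal comparison, and the replacement of spaces in captions with underscores are all assumptions made for illustration; to_kitti_line and kitti_labels are hypothetical helpers.

```python
def to_kitti_line(label, box, score):
    """Format one detection as a KITTI label line with unused 3D fields zeroed."""
    x1, y1, x2, y2 = box
    name = label.replace(" ", "_")  # Assumption: keep the line space-delimited.
    return (f"{name} 0.00 0 0.00 "
            f"{x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f} "
            f"0.00 0.00 0.00 0.00 0.00 0.00 0.00 {score:.3f}")


def kitti_labels(detections, conf_threshold=0.5):
    """Drop detections below conf_threshold and format the rest as KITTI lines."""
    kept = [d for d in detections if d["score"] >= conf_threshold]
    return [to_kitti_line(d["label"], d["box"], d["score"]) for d in kept]
```

Raising inference.conf_threshold shortens each label file by discarding low-confidence boxes before they are written or rendered.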

INFER_JOB_ID=$(tao grounding_dino create-job \
  --kind experiment \
  --name "grounding_dino_inference" \
  --action inference \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --inference-dataset "$DATASET_ID" \
  --specs "$INFERENCE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

tao model grounding_dino inference [-h] -e <experiment_spec> \
                                        inference.checkpoint=<model to run inference with>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment.

Optional Arguments

The following arguments are optional to run the command.

  • inference.checkpoint: The .pth model to run inference with.

Sample Usage

The following is an example of using the inference command:

tao model grounding_dino inference -e /path/to/spec.yaml inference.checkpoint=/path/to/model.pth

Exporting the Model#

export#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 17
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Field

value_type

Description

default_value

valid_min

valid_max

valid_options

automl_enabled

results_dir

string

Path to where all the assets generated from a task are stored.

FALSE

gpu_id

int

The index of the GPU to build the TensorRT engine.

0

FALSE

checkpoint

string

Path to the checkpoint file to run export.

???

FALSE

onnx_file

string

Path to the onnx model file.

???

FALSE

on_cpu

bool

Flag to export CPU compatible model.

False

FALSE

input_channel

int

Number of channels in the input Tensor.

3

3

FALSE

input_width

int

Width of the input image tensor.

960

32

FALSE

input_height

int

Height of the input image tensor.

544

32

FALSE

opset_version

int

Operator set version of the ONNX model used to generate the TensorRT engine.

17

1

FALSE

batch_size

int

The batch size of the input Tensor for the engine. A value of -1 implies dynamic tensor shapes.

-1

-1

FALSE

verbose

bool

Flag to enable verbose TensorRT logging.

False

FALSE

To export a Grounding DINO model, use this command:

EXPORT_JOB_ID=$(tao grounding_dino create-job \
  --kind experiment \
  --name "grounding_dino_export" \
  --action export \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --specs "$EXPORT_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

tao model grounding_dino export [-h] -e <experiment_spec> \
                                     export.checkpoint=<model to export> \
                                     export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file.

Optional Arguments

The following arguments are optional to run the command.

  • export.checkpoint: The .pth model to export.

  • export.onnx_file: The path where the .onnx model is saved.

Sample Usage

The following is an example of using the export command:

tao model grounding_dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx

TensorRT Engine Generation, Validation, and int8 Calibration#

For deployment, refer to TAO Deploy documentation for Grounding DINO.