Mask Grounding DINO#

Mask Grounding DINO is an open-vocabulary instance segmentation model included in TAO. It supports the following tasks:

  • train

  • evaluate

  • inference

  • export

These tasks can be invoked from the TAO Launcher using the following convention on the command line:

tao model mask_grounding_dino <sub_task> <args_per_subtask>

where args_per_subtask refers to the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.
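For example, the train subtask described later in this page is invoked as:

tao model mask_grounding_dino train -e /path/to/spec.yaml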

Data Input for Mask Grounding DINO#

Mask Grounding DINO expects directories of images for training and validation. Training annotations must be JSONL files in ODVG format, and validation annotations must be JSON files in COCO format.
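For reference, a single record (one line) of an ODVG JSONL file for a detection source typically looks like the following. This is a sketch based on the public ODVG convention, with placeholder values, rather than a normative schema:

{"filename": "000001.jpg", "height": 480, "width": 640, "detection": {"instances": [{"bbox": [10.0, 20.0, 200.0, 180.0], "label": 0, "category": "cat"}]}}

Grounding (VG) sources instead carry a grounding entry that pairs a caption with region boxes and phrases.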

Note

Unlike other instance segmentation models in TAO, the category_id values in your COCO JSON file for Mask Grounding DINO must start from 0 and be contiguous: category IDs must range from 0 to num_classes - 1. Because the original COCO annotations do not use contiguous category IDs, use the TAO Data Services task tao dataset annotations convert to remap them.
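If you want to perform the remapping outside of TAO, the following minimal Python sketch (not a TAO tool; file names are placeholders) rewrites a COCO annotation file so that category IDs become contiguous and zero-based:

import json

# Load the original COCO annotation file.
with open("instances_train2017.json") as f:
    coco = json.load(f)

# Map the original, possibly sparse category IDs to contiguous
# IDs in the range [0, num_classes - 1].
id_map = {old: new for new, old in enumerate(sorted(c["id"] for c in coco["categories"]))}

# Apply the mapping to both the categories and the annotations.
for c in coco["categories"]:
    c["id"] = id_map[c["id"]]
for ann in coco["annotations"]:
    ann["category_id"] = id_map[ann["category_id"]]

with open("instances_train2017_contiguous.json", "w") as f:
    json.dump(coco, f)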

Creating an Experiment Spec File#

BASE_EXPERIMENT_ID=$(tao mask_grounding_dino list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mask_grounding_dino get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
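One way to keep a local, editable copy of the retrieved default spec (the file name is arbitrary):

echo "$SPECS" > train_specs.json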

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

The training experiment spec file for Mask Grounding DINO includes model, train, and dataset parameters. This is an example spec file for finetuning a Mask Grounding DINO model with a swin_tiny_224_1k backbone on a COCO dataset.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl  # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl  # category ids need to be contiguous
    data_type: VG # or OD
  max_labels: 80  # Max number of positive + negative labels passed to the text encoder
  batch_size: 4
  workers: 8
  dataset_type: serialized  # To reduce the system memory usage
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True  # Adding bias in the contrastive embedding layer for training stability
  num_region_queries: 100 # 0 if not using ReLA, otherwise, the number of region queries
  loss_types: ['labels', 'boxes', 'masks', 'rela'] # Remove 'rela' if not using ReLA
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20]
  num_epochs: 30
  freeze: ["backbone.0", "bert"]  # if only finetuning
  pretrained_model_path: /path/to/your-gdino-pretrained-model  # if only finetuning
  precision: bf16  # for efficient training

| Field | value_type | Description | default_value | automl_enabled |
|---|---|---|---|---|
| encryption_key | string | | | False |
| results_dir | string | | /results | False |
| wandb | collection | | | False |
| model | collection | Configurable parameters to construct the model for a Mask Grounding DINO experiment. | | False |
| dataset | collection | Configurable parameters to construct the dataset for a Mask Grounding DINO experiment. | | False |
| train | collection | Configurable parameters to construct the trainer for a Mask Grounding DINO experiment. | | False |
| evaluate | collection | Configurable parameters to construct the evaluator for a Mask Grounding DINO experiment. | | False |
| inference | collection | Configurable parameters to construct the inferencer for a Mask Grounding DINO experiment. | | False |
| export | collection | Configurable parameters to construct the exporter for a Mask Grounding DINO experiment. | | False |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for a Mask Grounding DINO experiment. | | False |

model#

The model parameter provides options to change the Mask Grounding DINO architecture.

model:
  pretrained_model_path: /path/to/your-gdino-pretrained-model
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_region_queries: 100
  loss_types: ['labels', 'boxes', 'masks', 'rela']

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_backbone_path | string | [Optional] Path to a pretrained backbone file. | | | | | False |
| backbone | string | Backbone name of the model. The TAO implementation of Grounding DINO supports Swin. | swin_tiny_224_1k | | | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k | False |
| num_queries | int | Number of queries. | 900 | 1 | inf | | True |
| num_feature_levels | int | Number of feature levels to use in the model. | 4 | 1 | 5 | | False |
| set_cost_class | float | Relative weight of the classification error in the matching cost. | 1.0 | 0.0 | inf | | False |
| set_cost_bbox | float | Relative weight of the L1 error of the bounding-box coordinates in the matching cost. | 5.0 | 0.0 | inf | | False |
| set_cost_giou | float | Relative weight of the GIoU loss of the bounding box in the matching cost. | 2.0 | 0.0 | inf | | False |
| cls_loss_coef | float | Relative weight of the classification error in the final loss. | 2.0 | 0.0 | inf | | False |
| bbox_loss_coef | float | Relative weight of the L1 error of the bounding-box coordinates in the final loss. | 5.0 | 0.0 | inf | | False |
| giou_loss_coef | float | Relative weight of the GIoU loss of the bounding box in the final loss. | 2.0 | 0.0 | inf | | False |
| rela_nt_loss_coef | float | Relative weight of the No-Target loss of the region query in the final loss. | 1.0 | 0.0 | inf | | False |
| rela_minimap_loss_coef | float | Relative weight of the Minimap loss of the region query in the final loss. | 0.5 | 0.0 | inf | | False |
| rela_union_mask_loss_coef | float | Relative weight of the Union Mask loss of the region query in the final loss. | 2.0 | 0.0 | inf | | False |
| num_select | int | Number of top-K predictions selected during post-processing. | 300 | 1 | | | True |
| num_region_queries | int | Number of region queries: 0 if not using ReLA; otherwise, the number of region queries. | 100 | 0 | | | True |
| interm_loss_coef | float | | 1.0 | | | | False |
| no_interm_box_loss | bool | True: No intermediate bbox loss. | False | | | | False |
| pre_norm | bool | True: Add layer norm in the encoder. | False | | | | False |
| two_stage_type | string | Type of two-stage scheme in DINO. | standard | | | standard, no | False |
| decoder_sa_type | string | Type of decoder self-attention. | sa | | | sa, ca_label, ca_content | False |
| embed_init_tgt | bool | True: Add target embedding. | True | | | | False |
| fix_refpoints_hw | int | If -1, width and height are learned separately for each box; if -2, a shared width and height are learned; a value greater than 0 learns with that fixed value. | -1 | -2 | inf | | False |
| pe_temperatureH | int | Temperature applied to the height dimension of the positional sine embedding. | 20 | 1 | inf | | False |
| pe_temperatureW | int | Temperature applied to the width dimension of the positional sine embedding. | 20 | 1 | inf | | False |
| return_interm_indices | list | Indices of the feature levels to use in the model. The length must match num_feature_levels. | [1, 2, 3, 4] | | | | False |
| use_dn | bool | True: Enable contrastive de-noising training in DINO. | True | | | | False |
| dn_number | int | Number of de-noising queries in DINO. | 0 | 0 | inf | | False |
| dn_box_noise_scale | float | Scale of the noise applied to boxes during contrastive de-noising. If 0, noise is not applied. | 1.0 | 0.0 | inf | | False |
| dn_label_noise_ratio | float | Scale of the noise applied to labels during contrastive de-noising. If 0, noise is not applied. | 0.5 | 0.0 | | | False |
| focal_alpha | float | Alpha value in the focal loss. | 0.25 | | | | False |
| focal_gamma | float | Gamma value in the focal loss. | 2.0 | | | | False |
| clip_max_norm | float | | 0.1 | | | | False |
| nheads | int | Number of attention heads. | 8 | | | | False |
| dropout_ratio | float | Probability of dropping hidden units. | 0.0 | 0.0 | 1.0 | | False |
| hidden_dim | int | Dimension of the hidden units. | 256 | | | | False |
| enc_layers | int | Number of encoder layers in the transformer. | 6 | 1 | | | True |
| dec_layers | int | Number of decoder layers in the transformer. | 6 | 1 | | | True |
| dim_feedforward | int | Dimension of the feed-forward network. | 2048 | 1 | | | False |
| dec_n_points | int | Number of reference points in the decoder. | 4 | 1 | | | False |
| enc_n_points | int | Number of reference points in the encoder. | 4 | 1 | | | False |
| aux_loss | bool | True: Use auxiliary decoding losses (a loss at each decoder layer). | True | | | | False |
| dilation | bool | True: Enable dilation in the backbone. | False | | | | False |
| train_backbone | bool | True: Backbone weights are trainable; False: backbone weights are frozen. | True | | | | False |
| text_encoder_type | string | BERT encoder type. If only a type name is provided, the weights are downloaded from the Hugging Face Hub; if a path is provided, the weights are loaded from that local path. | bert-base-uncased | | | | False |
| max_text_len | int | Maximum text length of BERT. | 256 | 1 | | | False |
| class_embed_bias | bool | True: Add a bias in the contrastive embedding. | False | | | | False |
| log_scale | string | [Optional] Initial value of a learnable parameter multiplied with the similarity matrix to normalize the output. If set to 'auto', the similarity matrix is normalized by a fixed value sqrt(d_c), where d_c is the channel count; if set to 'none' or None, no normalization is applied. | none | | | | False |
| loss_types | list | Losses to be used during training. | ['labels', 'boxes'] | | | | False |
| backbone_names | list | Prefixes of the tensor names corresponding to the backbone. | ['backbone.0', 'bert'] | | | | False |
| linear_proj_names | list | Linear projection layer names. | ['reference_points', 'sampling_offsets'] | | | | False |
| has_mask | bool | True: Enable the mask head in Grounding DINO. | True | | | | False |
| mask_loss_coef | float | Relative weight of the mask error in the final loss. | 2.0 | | | | False |
| dice_loss_coef | float | Relative weight of the dice loss of the segmentation in the final loss. | 5.0 | | | | False |

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [10, 20]
    lr_decay: 0.1
  num_epochs: 30
  checkpoint_interval: 1
  precision: bf16
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1
  freeze: ["backbone.0", "bert"]
  pretrained_model_path: /path/to/pretrained/model

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the train job. | 1 | 1 | | | False |
| gpu_ids | list | List of GPU IDs to run training on. The length of gpu_ids must match train.num_gpus. | [0] | | | | False |
| num_nodes | int | Number of nodes for training. Values greater than 1 enable multi-node training. | 1 | | | | False |
| seed | int | Seed for the PyTorch initializer. Values less than 0 disable the fixed seed. | 1234 | -1 | inf | | False |
| cudnn | collection | cuDNN configuration. | | | | | False |
| num_epochs | int | Number of training epochs. | 10 | 1 | inf | | True |
| checkpoint_interval | int | Interval (in epochs) at which checkpoints are saved. Helps resume training. | 1 | 1 | | | False |
| validation_interval | int | Interval (in epochs) at which evaluation runs on the validation dataset. | 1 | 1 | | | False |
| resume_training_checkpoint_path | string | Path to a checkpoint from which to resume training. | | | | | False |
| results_dir | string | Path to store all assets generated from a task. | | | | | False |
| freeze | list | Layers to freeze. Example: ["backbone", "transformer.encoder", "input_proj"]. | [] | | | | False |
| pretrained_model_path | string | Path to a pretrained Mask Grounding DINO model used for initialization. | | | | | False |
| clip_grad_norm | float | Clip gradients by this L2 norm. A value of 0.0 disables gradient clipping. | 0.1 | | | | False |
| is_dry_run | bool | True: Run the trainer in dry-run mode, which validates the spec file and runs a sanity check without initializing the trainer. | False | | | | False |
| optim | collection | Hyperparameters for the optimizer configuration. | | | | | False |
| precision | string | Training precision. | fp32 | | | fp16, fp32, bf16 | False |
| distributed_strategy | string | Multi-GPU training strategy. Supports DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel). | ddp | | | ddp, fsdp | False |
| activation_checkpoint | bool | True: Recompute activations in the backward pass instead of storing intermediate activations, to save GPU memory. | True | | | | False |
| verbose | bool | True: Print detailed optimizer learning-rate information. | False | | | | False |

optim#

The optim parameter defines the config for the optimizer in training, including the learning rate, learning-rate scheduler, and weight decay.

optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [10, 20]
  lr_decay: 0.1
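As a worked example of the MultiStep scheduler above: with lr_decay = 0.1 and lr_steps = [10, 20], the model learning rate is 0.0002 for epochs 0-9, 0.00002 for epochs 10-19, and 0.000002 from epoch 20 onward; lr_backbone follows the same schedule from its own initial value.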

| Field | value_type | Description | default_value | valid_options | automl_enabled |
|---|---|---|---|---|---|
| optimizer | string | Optimizer type for training. | AdamW | AdamW, SGD | False |
| monitor_name | string | Metric monitored by the AutoReduce scheduler. | val_loss | val_loss, train_loss | False |
| lr | float | Initial learning rate for the model (excluding the backbone). | 0.0002 | | True |
| lr_backbone | float | Initial learning rate for the backbone. | 2e-05 | | True |
| lr_linear_proj_mult | float | Initial learning-rate multiplier for the linear projection layers. | 0.1 | | True |
| momentum | float | Momentum for the AdamW optimizer. | 0.9 | | True |
| weight_decay | float | Weight decay coefficient. | 0.0001 | | True |
| lr_scheduler | string | Learning-rate scheduler type. MultiStep decreases lr by lr_decay at each entry in lr_steps; StepLR decreases lr by lr_decay every lr_step_size steps. | MultiStep | MultiStep, StepLR | False |
| lr_steps | list | Steps at which lr decreases (for MultiStep). | [10] | | False |
| lr_step_size | int | Number of steps between lr decreases (for StepLR). | 10 | | True |
| lr_decay | float | Factor by which the scheduler decreases lr. | 0.1 | | True |

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl  # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl  # category ids need to be contiguous
    data_type: VG # or OD
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
    data_type: OD # or VG
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    data_type: OD # or VG
    captions: ["black cat", "car"] # or json file that contains the image path and captions
  max_labels: 80
  batch_size: 4
  workers: 8

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train_data_sources | list | List of training data sources. Each source specifies image_dir (directory containing training images), json_file (path to a JSONL file in ODVG format), and an optional label_map (label-mapping path for detection datasets). | [{'image_dir': '', 'json_file': '', 'label_map': ''}, {'image_dir': '', 'json_file': ''}] | | | | False |
| val_data_sources | collection | Validation data source, specifying image_dir (directory containing validation images), json_file (path to a JSON file in COCO format), and data_type (OD or VG). Category IDs must start from 0 to calculate the validation loss; run the Data Services annotation conversion to make categories contiguous. | {'image_dir': '', 'json_file': '', 'data_type': ''} | | | | False |
| test_data_sources | collection | Test data source, specifying image_dir (directory containing test images), json_file (path to a JSON file in COCO format), and data_type (OD or VG). | {'image_dir': '', 'json_file': '', 'data_type': ''} | | | | False |
| infer_data_sources | collection | Inference data source, specifying image_dir (directory containing inference images), data_type (OD or VG), captions (list of captions; used for OD inference only), and json_file (path to a JSON file with image_path and caption pairs; used for VG). | {'image_dir': '', 'data_type': ''} | | | | False |
| batch_size | int | Batch size for training and validation. | 4 | 1 | inf | | True |
| workers | int | Number of parallel data-loader workers. | 8 | 1 | inf | | True |
| pin_memory | bool | True: Allocate page-locked memory for faster CPU-to-GPU data transfer. | True | | | | False |
| dataset_type | string | Dataset structure type. default is a standard map-style dataset that loads the ODVG annotations in every subprocess, which can increase RAM usage; serialized shares annotations across subprocesses after serializing them through pickle and torch.Tensor. | serialized | | | serialized, default | False |
| max_labels | int | Total number of labels to sample per image. After the positive labels are set, negative labels are sampled until max_labels is reached. For OD datasets, negative labels are categories absent from the image; for grounding datasets, they are phrases not present in the image captions. A higher max_labels may improve robustness at the cost of longer training. | 50 | 1 | inf | | False |
| eval_class_ids | list | Class IDs used for evaluation. | [1] | | | | False |
| augmentation | collection | Data augmentation parameters. | | | | | False |
| has_mask | bool | True: Load mask annotations from the dataset. | | | | | False |

augmentation#

The augmentation parameter contains hyperparameters for augmentation.

augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800

| Field | value_type | Description | default_value | valid_min | valid_max | automl_enabled |
|---|---|---|---|---|---|---|
| scales | list | Sizes used for random resize. | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] | | | False |
| input_mean | list | Input mean for RGB frames. | [0.485, 0.456, 0.406] | | | False |
| input_std | list | Input standard deviation per pixel for RGB frames. | [0.229, 0.224, 0.225] | | | False |
| train_random_resize | list | Sizes used for random resize of training data. | [400, 500, 600] | | | False |
| horizontal_flip_prob | float | Probability of a horizontal flip during training. | 0.5 | 0.0 | 1.0 | True |
| train_random_crop_min | int | Minimum random crop size for training data. | 384 | 1 | inf | True |
| train_random_crop_max | int | Maximum random crop size for training data. | 600 | 1 | inf | True |
| random_resize_max_size | int | Maximum random resize size for training data. | 1333 | 1 | inf | True |
| test_random_resize | int | Resize size for test data. | 800 | 1 | inf | True |
| fixed_padding | bool | True: Pad images to the fixed size (sorted(scales)[-1], random_resize_max_size) before batch formulation. This helps prevent a CPU memory leak. | True | | | False |
| fixed_random_crop | int | Crop size for Large Scale Jittering, which determines the resulting image resolution. 0 disables cropping. | 1024 | 1 | inf | False |

Training the Model#

To train a Mask Grounding DINO model, use this command:

TRAIN_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_train" \
  --action train \
  --workspace-id $WORKSPACE_ID \
  --specs "$TRAIN_SPECS" \
  --train-datasets '["'$DATASET_ID'"]' \
  --eval-dataset "$DATASET_ID" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino train [-h] -e <experiment_spec>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

The following arguments are optional to run the command.

  • -h, --help: Show this help message and exit.

Sample Usage

This is an example of the train command:

tao model mask_grounding_dino train -e /path/to/spec.yaml
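Assuming the same Hydra-style override syntax shown for the evaluate, inference, and export subtasks below, individual spec values can also be overridden on the command line. For example, to train on two GPUs (train.num_gpus is documented in the train table above):

tao model mask_grounding_dino train -e /path/to/spec.yaml train.num_gpus=2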

Optimizing Resources for Training Mask Grounding DINO#

Training Mask Grounding DINO on a standard dataset like COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM, as well as a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. One option is to reduce dataset.batch_size, but this can make training take longer than usual. Instead, we recommend the following configurations to optimize GPU consumption; they are combined in the example spec fragment after this list.

  • Set train.precision to bf16 to enable automatic mixed-precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations instead of caching them in memory reduces memory usage.

  • Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training, which shards gradient computation across processes and helps reduce GPU memory.

  • Try a more lightweight backbone, such as swin_tiny_224_1k, or freeze the backbone by setting model.train_backbone to False.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
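A spec fragment combining these settings might look like the following; every field is documented in the tables above, and the batch size is illustrative:

model:
  backbone: swin_tiny_224_1k
  train_backbone: False
train:
  precision: bf16
  activation_checkpoint: True
  distributed_strategy: fsdp
dataset:
  batch_size: 2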

Optimize CPU Memory#

To speed up data loading, it is common practice to set a high number of workers, which spawns multiple processes. However, if your annotation file is very large, this can exhaust CPU memory. We recommend the following configurations to optimize CPU consumption; a combined spec fragment follows the list.

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before batch formulation. Because of the random resize and random crop augmentations used during training, the resulting image resolution can vary across images. These variable resolutions can cause a memory leak, with CPU memory slowly accumulating until training runs out of memory. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
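A spec fragment combining these settings (the reduced worker count is an additional illustrative measure, since fewer workers also lowers CPU memory pressure):

dataset:
  dataset_type: serialized
  workers: 4
  augmentation:
    fixed_padding: True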

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3

| Field | value_type | Description | default_value | automl_enabled |
|---|---|---|---|---|
| num_gpus | int | | 1 | False |
| gpu_ids | list | | [0] | False |
| num_nodes | int | | 1 | False |
| checkpoint | string | | ??? | False |
| results_dir | string | | | False |
| input_width | int | Width of the input image tensor. | 1 | False |
| input_height | int | Height of the input image tensor. | 1 | False |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy. | | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 | False |

To run evaluation with a Mask Grounding DINO model, use this command:

EVAL_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_evaluate" \
  --action evaluate \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --eval-dataset "$DATASET_ID" \
  --specs "$EVALUATE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino evaluate [-h] -e <experiment_spec> \
                                     evaluate.checkpoint=<model to be evaluated>

Required Arguments

The following arguments are required.

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

Optional Arguments

The following arguments are optional to run the command.

  • evaluate.checkpoint: The .pth model to be evaluated

Sample Usage

This is an example of using the evaluate command:

tao model mask_grounding_dino evaluate -e /path/to/spec.yaml evaluate.checkpoint=/path/to/model.pth

Running Inference with a Mask Grounding DINO Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    "black cat": red
    car: blue
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3
dataset:
  infer_data_sources:
    image_dir: /data/raw-data/val2017/
    captions: ["black cat", "cat"] # or json file that contains the image path and captions for VG
    data_type: OD # or VG

| Field | value_type | Description | default_value | valid_min | automl_enabled |
|---|---|---|---|---|---|
| num_gpus | int | | 1 | | False |
| gpu_ids | list | | [0] | | False |
| num_nodes | int | | 1 | | False |
| checkpoint | string | | ??? | | False |
| results_dir | string | | | | False |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with tao-deploy. | | | False |
| color_map | collection | Class-wise dictionary of colors used to render boxes. | | | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 | | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 | | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 | | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 | | False |
| is_internal | bool | True: Render with the internal directory structure. | False | | False |
| input_width | int | Width of the input image tensor. | 960 | 32 | False |
| input_height | int | Height of the input image tensor. | 544 | 32 | False |
| outline_width | int | Width, in pixels, of the bounding-box outline. | 3 | 1 | False |

The inference tool for Mask Grounding DINO models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels on a directory of images.

INFERENCE_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_inference" \
  --action inference \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --inference-dataset "$DATASET_ID" \
  --specs "$INFERENCE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino inference [-h] -e <experiment spec file>
                        inference.checkpoint=<model to run inference with>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

Optional Arguments

The following arguments are optional to run the command.

  • inference.checkpoint: The .pth model to run inference with

Sample Usage

This is an example of using the inference command:

tao model mask_grounding_dino inference -e /path/to/spec.yaml inference.checkpoint=/path/to/model.pth
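Because the inference tool writes frame-by-frame KITTI-format labels, a minimal Python sketch such as the following can read back the class name and 2D box from each line. The output path is a placeholder; adjust it to your results directory:

# Parse one KITTI-format label file produced by inference (path is a placeholder).
with open("/results/inference/labels/000001.txt") as f:
    for line in f:
        fields = line.split()
        class_name = fields[0]
        # In KITTI 2D labels, columns 4-7 hold the box as (x1, y1, x2, y2).
        x1, y1, x2, y2 = map(float, fields[4:8])
        print(class_name, x1, y1, x2, y2)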

Exporting the Model#

export#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 17
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

| Field | value_type | Description | default_value | valid_min | automl_enabled |
|---|---|---|---|---|---|
| results_dir | string | Path where all the assets generated from a task are stored. | | | False |
| gpu_id | int | Index of the GPU used to build the TensorRT engine. | 0 | | False |
| checkpoint | string | Path to the checkpoint file to run export on. | ??? | | False |
| onnx_file | string | Path to the ONNX model file. | ??? | | False |
| on_cpu | bool | True: Export a CPU-compatible model. | False | | False |
| input_channel | int | Number of channels in the input tensor. | 3 | 3 | False |
| input_width | int | Width of the input image tensor. | 960 | 32 | False |
| input_height | int | Height of the input image tensor. | 544 | 32 | False |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine. | 17 | 1 | False |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | False |
| verbose | bool | True: Enable verbose TensorRT logging. | False | | False |

EXPORT_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_export" \
  --action export \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --specs "$EXPORT_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino export [-h] -e <experiment spec file>
                      export.checkpoint=<model to export>
                      export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file

Optional Arguments

The following arguments are optional to run the command.

  • export.checkpoint: The .pth model to export

  • export.onnx_file: The path where the .onnx model is saved

Sample Usage

This is an example of using the export command:

tao model mask_grounding_dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx
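After export, you can sanity-check the generated ONNX file with the onnx Python package (this assumes the package is installed; it is not a TAO command):

import onnx

# Load and structurally validate the exported model (path is a placeholder).
model = onnx.load("/path/to/model.onnx")
onnx.checker.check_model(model)
print("Inputs:", [i.name for i in model.graph.input])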

TensorRT Engine Generation, Validation, and INT8 Calibration#

For deployment, refer to the TAO Deploy documentation for Mask Grounding DINO.