Mask Grounding DINO#

Mask Grounding DINO is an open-vocabulary instance segmentation model included in TAO. It supports the following tasks:

  • train

  • evaluate

  • inference

  • export

These tasks can be invoked from the TAO Launcher using the following convention on the command line:

tao model mask_grounding_dino <sub_task> <args_per_subtask>

where args_per_subtask refers to the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.
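For example, the train subtask described later in this page is invoked as:

tao model mask_grounding_dino train -e /path/to/spec.yaml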

Data Input for Mask Grounding DINO#

Mask Grounding DINO expects directories of images for training and validation. Training annotations must be JSONL files in ODVG format, and validation annotations must be JSON files in COCO format.
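For reference, a single record (one line) of an ODVG JSONL file for a detection source typically looks like the following. This is a sketch based on the public ODVG convention, with placeholder values, rather than a normative schema:

{"filename": "000001.jpg", "height": 480, "width": 640, "detection": {"instances": [{"bbox": [10.0, 20.0, 200.0, 180.0], "label": 0, "category": "cat"}]}}

Grounding (VG) sources instead carry a grounding entry that pairs a caption with region boxes and phrases.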

Note

Unlike other instance segmentation models in TAO, the category_id values in your COCO JSON file for Mask Grounding DINO must start from 0 and be contiguous: category IDs must range from 0 to num_classes - 1. Because the original COCO annotations do not use contiguous category IDs, use the TAO Data Services task tao dataset annotations convert to remap them.
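If you want to perform the remapping outside of TAO, the following minimal Python sketch (not a TAO tool; file names are placeholders) rewrites a COCO annotation file so that category IDs become contiguous and zero-based:

import json

# Load the original COCO annotation file.
with open("instances_train2017.json") as f:
    coco = json.load(f)

# Map the original, possibly sparse category IDs to contiguous
# IDs in the range [0, num_classes - 1].
id_map = {old: new for new, old in enumerate(sorted(c["id"] for c in coco["categories"]))}

# Apply the mapping to both the categories and the annotations.
for c in coco["categories"]:
    c["id"] = id_map[c["id"]]
for ann in coco["annotations"]:
    ann["category_id"] = id_map[ann["category_id"]]

with open("instances_train2017_contiguous.json", "w") as f:
    json.dump(coco, f)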

Creating an Experiment Spec File#

BASE_EXPERIMENT_ID=$(tao mask_grounding_dino list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mask_grounding_dino get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
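One way to keep a local, editable copy of the retrieved default spec (the file name is arbitrary):

echo "$SPECS" > train_specs.json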

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

The training experiment spec file for Mask Grounding DINO includes model, train, and dataset parameters. This is an example spec file for finetuning a Mask Grounding DINO model with a swin_tiny_224_1k backbone on a COCO dataset.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl  # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl  # category ids need to be contiguous
    data_type: VG # or OD
  max_labels: 80  # Max number of positive + negative labels passed to the text encoder
  batch_size: 4
  workers: 8
  dataset_type: serialized  # To reduce the system memory usage
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True  # Adding bias in the contrastive embedding layer for training stability
  num_region_queries: 100 # 0 if not using ReLA, otherwise, the number of region queries
  loss_types: ['labels', 'boxes', 'masks', 'rela'] # Remove 'rela' if not using ReLA
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20]
  num_epochs: 30
  freeze: ["backbone.0", "bert"]  # if only finetuning
  pretrained_model_path: /path/to/your-gdino-pretrained-model  # if only finetuning
  precision: bf16  # for efficient training

| Field | value_type | Description | default_value | automl_enabled |
|---|---|---|---|---|
| encryption_key | string | | | False |
| results_dir | string | | /results | False |
| wandb | collection | | | False |
| model | collection | Configurable parameters to construct the model for a Mask Grounding DINO experiment. | | False |
| dataset | collection | Configurable parameters to construct the dataset for a Mask Grounding DINO experiment. | | False |
| train | collection | Configurable parameters to construct the trainer for a Mask Grounding DINO experiment. | | False |
| evaluate | collection | Configurable parameters to construct the evaluator for a Mask Grounding DINO experiment. | | False |
| inference | collection | Configurable parameters to construct the inferencer for a Mask Grounding DINO experiment. | | False |
| export | collection | Configurable parameters to construct the exporter for a Mask Grounding DINO experiment. | | False |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for a Mask Grounding DINO experiment. | | False |

model#

The model parameter provides options to change the Mask Grounding DINO architecture.

model:
  pretrained_model_path: /path/to/your-gdino-pretrained-model
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_region_queries: 100
  loss_types: ['labels', 'boxes', 'masks', 'rela']

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_backbone_path | string | [Optional] Path to a pretrained backbone file. | | | | | False |
| backbone | string | Backbone name of the model. The TAO implementation of Grounding DINO supports Swin. | swin_tiny_224_1k | | | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k | False |
| num_queries | int | Number of queries. | 900 | 1 | inf | | True |
| num_feature_levels | int | Number of feature levels to use in the model. | 4 | 1 | 5 | | False |
| set_cost_class | float | Relative weight of the classification error in the matching cost. | 1.0 | 0.0 | inf | | False |
| set_cost_bbox | float | Relative weight of the L1 error of the bounding-box coordinates in the matching cost. | 5.0 | 0.0 | inf | | False |
| set_cost_giou | float | Relative weight of the GIoU loss of the bounding box in the matching cost. | 2.0 | 0.0 | inf | | False |
| cls_loss_coef | float | Relative weight of the classification error in the final loss. | 2.0 | 0.0 | inf | | False |
| bbox_loss_coef | float | Relative weight of the L1 error of the bounding-box coordinates in the final loss. | 5.0 | 0.0 | inf | | False |
| giou_loss_coef | float | Relative weight of the GIoU loss of the bounding box in the final loss. | 2.0 | 0.0 | inf | | False |
| rela_nt_loss_coef | float | Relative weight of the No-Target loss of the region query in the final loss. | 1.0 | 0.0 | inf | | False |
| rela_minimap_loss_coef | float | Relative weight of the Minimap loss of the region query in the final loss. | 0.5 | 0.0 | inf | | False |
| rela_union_mask_loss_coef | float | Relative weight of the Union Mask loss of the region query in the final loss. | 2.0 | 0.0 | inf | | False |
| num_select | int | Number of top-K predictions selected during post-processing. | 300 | 1 | | | True |
| num_region_queries | int | Number of region queries: 0 if not using ReLA; otherwise, the number of region queries. | 100 | 0 | | | True |
| interm_loss_coef | float | | 1.0 | | | | False |
| no_interm_box_loss | bool | True: No intermediate bbox loss. | False | | | | False |
| pre_norm | bool | True: Add layer norm in the encoder. | False | | | | False |
| two_stage_type | string | Type of two-stage scheme in DINO. | standard | | | standard, no | False |
| decoder_sa_type | string | Type of decoder self-attention. | sa | | | sa, ca_label, ca_content | False |
| embed_init_tgt | bool | True: Add target embedding. | True | | | | False |
| fix_refpoints_hw | int | If -1, width and height are learned separately for each box; if -2, a shared width and height are learned; a value greater than 0 learns with that fixed value. | -1 | -2 | inf | | False |
| pe_temperatureH | int | Temperature applied to the height dimension of the positional sine embedding. | 20 | 1 | inf | | False |
| pe_temperatureW | int | Temperature applied to the width dimension of the positional sine embedding. | 20 | 1 | inf | | False |
| return_interm_indices | list | Indices of the feature levels to use in the model. The length must match num_feature_levels. | [1, 2, 3, 4] | | | | False |
| use_dn | bool | True: Enable contrastive de-noising training in DINO. | True | | | | False |
| dn_number | int | Number of de-noising queries in DINO. | 0 | 0 | inf | | False |
| dn_box_noise_scale | float | Scale of the noise applied to boxes during contrastive de-noising. If 0, noise is not applied. | 1.0 | 0.0 | inf | | False |
| dn_label_noise_ratio | float | Scale of the noise applied to labels during contrastive de-noising. If 0, noise is not applied. | 0.5 | 0.0 | | | False |
| focal_alpha | float | Alpha value in the focal loss. | 0.25 | | | | False |
| focal_gamma | float | Gamma value in the focal loss. | 2.0 | | | | False |
| clip_max_norm | float | | 0.1 | | | | False |
| nheads | int | Number of attention heads. | 8 | | | | False |
| dropout_ratio | float | Probability of dropping hidden units. | 0.0 | 0.0 | 1.0 | | False |
| hidden_dim | int | Dimension of the hidden units. | 256 | | | | False |
| enc_layers | int | Number of encoder layers in the transformer. | 6 | 1 | | | True |
| dec_layers | int | Number of decoder layers in the transformer. | 6 | 1 | | | True |
| dim_feedforward | int | Dimension of the feed-forward network. | 2048 | 1 | | | False |
| dec_n_points | int | Number of reference points in the decoder. | 4 | 1 | | | False |
| enc_n_points | int | Number of reference points in the encoder. | 4 | 1 | | | False |
| aux_loss | bool | True: Use auxiliary decoding losses (a loss at each decoder layer). | True | | | | False |
| dilation | bool | True: Enable dilation in the backbone. | False | | | | False |
| train_backbone | bool | True: Backbone weights are trainable; False: backbone weights are frozen. | True | | | | False |
| text_encoder_type | string | BERT encoder type. If only a type name is provided, the weights are downloaded from the Hugging Face Hub; if a path is provided, the weights are loaded from that local path. | bert-base-uncased | | | | False |
| max_text_len | int | Maximum text length of BERT. | 256 | 1 | | | False |
| class_embed_bias | bool | True: Add a bias in the contrastive embedding. | False | | | | False |
| log_scale | string | [Optional] Initial value of a learnable parameter multiplied with the similarity matrix to normalize the output. If set to 'auto', the similarity matrix is normalized by a fixed value sqrt(d_c), where d_c is the channel count; if set to 'none' or None, no normalization is applied. | none | | | | False |
| loss_types | list | Losses to be used during training. | ['labels', 'boxes'] | | | | False |
| backbone_names | list | Prefixes of the tensor names corresponding to the backbone. | ['backbone.0', 'bert'] | | | | False |
| linear_proj_names | list | Linear projection layer names. | ['reference_points', 'sampling_offsets'] | | | | False |
| has_mask | bool | True: Enable the mask head in Grounding DINO. | True | | | | False |
| mask_loss_coef | float | Relative weight of the mask error in the final loss. | 2.0 | | | | False |
| dice_loss_coef | float | Relative weight of the dice loss of the segmentation in the final loss. | 5.0 | | | | False |

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [10, 20]
    lr_decay: 0.1
  num_epochs: 30
  checkpoint_interval: 1
  precision: bf16
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1
  freeze: ["backbone.0", "bert"]
  pretrained_model_path: /path/to/pretrained/model

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the train job. | 1 | 1 | | | False |
| gpu_ids | list | List of GPU IDs to run training on. The length of gpu_ids must match train.num_gpus. | [0] | | | | False |
| num_nodes | int | Number of nodes for training. Values greater than 1 enable multi-node training. | 1 | | | | False |
| seed | int | Seed for the PyTorch initializer. Values less than 0 disable the fixed seed. | 1234 | -1 | inf | | False |
| cudnn | collection | cuDNN configuration. | | | | | False |
| num_epochs | int | Number of training epochs. | 10 | 1 | inf | | True |
| checkpoint_interval | int | Interval (in epochs) at which checkpoints are saved. Helps resume training. | 1 | 1 | | | False |
| validation_interval | int | Interval (in epochs) at which evaluation runs on the validation dataset. | 1 | 1 | | | False |
| resume_training_checkpoint_path | string | Path to a checkpoint from which to resume training. | | | | | False |
| results_dir | string | Path to store all assets generated from a task. | | | | | False |
| freeze | list | Layers to freeze. Example: ["backbone", "transformer.encoder", "input_proj"]. | [] | | | | False |
| pretrained_model_path | string | Path to a pretrained Mask Grounding DINO model used for initialization. | | | | | False |
| clip_grad_norm | float | Clip gradients by this L2 norm. A value of 0.0 disables gradient clipping. | 0.1 | | | | False |
| is_dry_run | bool | True: Run the trainer in dry-run mode, which validates the spec file and runs a sanity check without initializing the trainer. | False | | | | False |
| optim | collection | Hyperparameters for the optimizer configuration. | | | | | False |
| precision | string | Training precision. | fp32 | | | fp16, fp32, bf16 | False |
| distributed_strategy | string | Multi-GPU training strategy. Supports DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel). | ddp | | | ddp, fsdp | False |
| activation_checkpoint | bool | True: Recompute activations in the backward pass instead of storing intermediate activations, to save GPU memory. | True | | | | False |
| verbose | bool | True: Print detailed optimizer learning-rate information. | False | | | | False |

optim#

The optim parameter defines the config for the optimizer in training, including the learning rate, learning-rate scheduler, and weight decay.

optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [10, 20]
  lr_decay: 0.1
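As a worked example of the MultiStep scheduler above: with lr_decay = 0.1 and lr_steps = [10, 20], the model learning rate is 0.0002 for epochs 0-9, 0.00002 for epochs 10-19, and 0.000002 from epoch 20 onward; lr_backbone follows the same schedule from its own initial value.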

| Field | value_type | Description | default_value | valid_options | automl_enabled |
|---|---|---|---|---|---|
| optimizer | string | Optimizer type for training. | AdamW | AdamW, SGD | False |
| monitor_name | string | Metric monitored by the AutoReduce scheduler. | val_loss | val_loss, train_loss | False |
| lr | float | Initial learning rate for the model (excluding the backbone). | 0.0002 | | True |
| lr_backbone | float | Initial learning rate for the backbone. | 2e-05 | | True |
| lr_linear_proj_mult | float | Initial learning-rate multiplier for the linear projection layers. | 0.1 | | True |
| momentum | float | Momentum for the AdamW optimizer. | 0.9 | | True |
| weight_decay | float | Weight decay coefficient. | 0.0001 | | True |
| lr_scheduler | string | Learning-rate scheduler type. MultiStep decreases lr by lr_decay at each entry in lr_steps; StepLR decreases lr by lr_decay every lr_step_size steps. | MultiStep | MultiStep, StepLR | False |
| lr_steps | list | Steps at which lr decreases (for MultiStep). | [10] | | False |
| lr_step_size | int | Number of steps between lr decreases (for StepLR). | 10 | | True |
| lr_decay | float | Factor by which the scheduler decreases lr. | 0.1 | | True |

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl  # odvg format
      label_map:  /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl  # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl  # category ids need to be contiguous
    data_type: VG # or OD
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
    data_type: OD # or VG
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    data_type: OD # or VG
    captions: ["black cat", "car"] # or json file that contains the image path and captions
  max_labels: 80
  batch_size: 4
  workers: 8

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train_data_sources | list | List of training data sources. Each source specifies image_dir (directory containing training images), json_file (path to a JSONL file in ODVG format), and an optional label_map (label-mapping path for detection datasets). | [{'image_dir': '', 'json_file': '', 'label_map': ''}, {'image_dir': '', 'json_file': ''}] | | | | False |
| val_data_sources | collection | Validation data source, specifying image_dir (directory containing validation images), json_file (path to a JSON file in COCO format), and data_type (OD or VG). Category IDs must start from 0 to calculate the validation loss; run the Data Services annotation conversion to make categories contiguous. | {'image_dir': '', 'json_file': '', 'data_type': ''} | | | | False |
| test_data_sources | collection | Test data source, specifying image_dir (directory containing test images), json_file (path to a JSON file in COCO format), and data_type (OD or VG). | {'image_dir': '', 'json_file': '', 'data_type': ''} | | | | False |
| infer_data_sources | collection | Inference data source, specifying image_dir (directory containing inference images), data_type (OD or VG), captions (list of captions; used for OD inference only), and json_file (path to a JSON file with image_path and caption pairs; used for VG). | {'image_dir': '', 'data_type': ''} | | | | False |
| batch_size | int | Batch size for training and validation. | 4 | 1 | inf | | True |
| workers | int | Number of parallel data-loader workers. | 8 | 1 | inf | | True |
| pin_memory | bool | True: Allocate page-locked memory for faster CPU-to-GPU data transfer. | True | | | | False |
| dataset_type | string | Dataset structure type. default is a standard map-style dataset that loads the ODVG annotations in every subprocess, which can increase RAM usage; serialized shares annotations across subprocesses after serializing them through pickle and torch.Tensor. | serialized | | | serialized, default | False |
| max_labels | int | Total number of labels to sample per image. After the positive labels are set, negative labels are sampled until max_labels is reached. For OD datasets, negative labels are categories absent from the image; for grounding datasets, they are phrases not present in the image captions. A higher max_labels may improve robustness at the cost of longer training. | 50 | 1 | inf | | False |
| eval_class_ids | list | Class IDs used for evaluation. | [1] | | | | False |
| augmentation | collection | Data augmentation parameters. | | | | | False |
| has_mask | bool | True: Load mask annotations from the dataset. | | | | | False |

augmentation#

The augmentation parameter contains hyperparameters for augmentation.

augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800

| Field | value_type | Description | default_value | valid_min | valid_max | automl_enabled |
|---|---|---|---|---|---|---|
| scales | list | Sizes used for random resize. | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] | | | False |
| input_mean | list | Input mean for RGB frames. | [0.485, 0.456, 0.406] | | | False |
| input_std | list | Input standard deviation per pixel for RGB frames. | [0.229, 0.224, 0.225] | | | False |
| train_random_resize | list | Sizes used for random resize of training data. | [400, 500, 600] | | | False |
| horizontal_flip_prob | float | Probability of a horizontal flip during training. | 0.5 | 0.0 | 1.0 | True |
| train_random_crop_min | int | Minimum random crop size for training data. | 384 | 1 | inf | True |
| train_random_crop_max | int | Maximum random crop size for training data. | 600 | 1 | inf | True |
| random_resize_max_size | int | Maximum random resize size for training data. | 1333 | 1 | inf | True |
| test_random_resize | int | Resize size for test data. | 800 | 1 | inf | True |
| fixed_padding | bool | True: Pad images to the fixed size (sorted(scales)[-1], random_resize_max_size) before batch formulation. This helps prevent a CPU memory leak. | True | | | False |
| fixed_random_crop | int | Crop size for Large Scale Jittering, which determines the resulting image resolution. 0 disables cropping. | 1024 | 1 | inf | False |

Training the Model#

To train a Mask Grounding DINO model, use this command:

TRAIN_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_train" \
  --action train \
  --workspace-id $WORKSPACE_ID \
  --specs "$TRAIN_SPECS" \
  --train-datasets '["'$DATASET_ID'"]' \
  --eval-dataset "$DATASET_ID" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino train [-h] -e <experiment_spec>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

The following arguments are optional to run the command.

  • -h, --help: Show this help message and exit.

Sample Usage

This is an example of the train command:

tao model mask_grounding_dino train -e /path/to/spec.yaml
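Assuming the same Hydra-style override syntax shown for the evaluate, inference, and export subtasks below, individual spec values can also be overridden on the command line. For example, to train on two GPUs (train.num_gpus is documented in the train table above):

tao model mask_grounding_dino train -e /path/to/spec.yaml train.num_gpus=2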

Optimizing Resources for Training Mask Grounding DINO#

Training Mask Grounding DINO on a standard dataset like COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM, as well as a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. One option is to reduce dataset.batch_size, but this can make training take longer than usual. Instead, we recommend the following configurations to optimize GPU consumption; they are combined in the example spec fragment after this list.

  • Set train.precision to bf16 to enable automatic mixed-precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations instead of caching them in memory reduces memory usage.

  • Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training, which shards gradient computation across processes and helps reduce GPU memory.

  • Try a more lightweight backbone, such as swin_tiny_224_1k, or freeze the backbone by setting model.train_backbone to False.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
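A spec fragment combining these settings might look like the following; every field is documented in the tables above, and the batch size is illustrative:

model:
  backbone: swin_tiny_224_1k
  train_backbone: False
train:
  precision: bf16
  activation_checkpoint: True
  distributed_strategy: fsdp
dataset:
  batch_size: 2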

Optimize CPU Memory#

To speed up data loading, it is common practice to set a high number of workers, which spawns multiple processes. However, if your annotation file is very large, this can exhaust CPU memory. We recommend the following configurations to optimize CPU consumption; a combined spec fragment follows the list.

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before batch formulation. Because of the random resize and random crop augmentations used during training, the resulting image resolution can vary across images. These variable resolutions can cause a memory leak, with CPU memory slowly accumulating until training runs out of memory. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
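A spec fragment combining these settings (the reduced worker count is an additional illustrative measure, since fewer workers also lowers CPU memory pressure):

dataset:
  dataset_type: serialized
  workers: 4
  augmentation:
    fixed_padding: True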

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3

| Field | value_type | Description | default_value | automl_enabled |
|---|---|---|---|---|
| num_gpus | int | | 1 | False |
| gpu_ids | list | | [0] | False |
| num_nodes | int | | 1 | False |
| checkpoint | string | | ??? | False |
| results_dir | string | | | False |
| input_width | int | Width of the input image tensor. | 1 | False |
| input_height | int | Height of the input image tensor. | 1 | False |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy. | | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 | False |

To run evaluation with a Mask Grounding DINO model, use this command:

EVAL_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_evaluate" \
  --action evaluate \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --eval-dataset "$DATASET_ID" \
  --specs "$EVALUATE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino evaluate [-h] -e <experiment_spec> \
                                     evaluate.checkpoint=<model to be evaluated>

Required Arguments

The following arguments are required.

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

Optional Arguments

The following arguments are optional to run the command.

  • evaluate.checkpoint: The .pth model to be evaluated

Sample Usage

This is an example of using the evaluate command:

tao model mask_grounding_dino evaluate -e /path/to/spec.yaml evaluate.checkpoint=/path/to/model.pth

Running Inference with a Mask Grounding DINO Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    "black cat": red
    car: blue
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3
dataset:
  infer_data_sources:
    image_dir: /data/raw-data/val2017/
    captions: ["black cat", "cat"] # or json file that contains the image path and captions for VG
    data_type: OD # or VG

| Field | value_type | Description | default_value | valid_min | automl_enabled |
|---|---|---|---|---|---|
| num_gpus | int | | 1 | | False |
| gpu_ids | list | | [0] | | False |
| num_nodes | int | | 1 | | False |
| checkpoint | string | | ??? | | False |
| results_dir | string | | | | False |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with tao-deploy. | | | False |
| color_map | collection | Class-wise dictionary of colors used to render boxes. | | | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 | | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 | | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 | | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 | | False |
| is_internal | bool | True: Render with the internal directory structure. | False | | False |
| input_width | int | Width of the input image tensor. | 960 | 32 | False |
| input_height | int | Height of the input image tensor. | 544 | 32 | False |
| outline_width | int | Width, in pixels, of the bounding-box outline. | 3 | 1 | False |

The inference tool for Mask Grounding DINO models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels on a directory of images.

INFERENCE_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_inference" \
  --action inference \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --inference-dataset "$DATASET_ID" \
  --specs "$INFERENCE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino inference [-h] -e <experiment spec file>
                        inference.checkpoint=<model to run inference with>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

Optional Arguments

The following arguments are optional to run the command.

  • inference.checkpoint: The .pth model to run inference with

Sample Usage

This is an example of using the inference command:

tao model mask_grounding_dino inference -e /path/to/spec.yaml inference.checkpoint=/path/to/model.pth
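Because the inference tool writes frame-by-frame KITTI-format labels, a minimal Python sketch such as the following can read back the class name and 2D box from each line. The output path is a placeholder; adjust it to your results directory:

# Parse one KITTI-format label file produced by inference (path is a placeholder).
with open("/results/inference/labels/000001.txt") as f:
    for line in f:
        fields = line.split()
        class_name = fields[0]
        # In KITTI 2D labels, columns 4-7 hold the box as (x1, y1, x2, y2).
        x1, y1, x2, y2 = map(float, fields[4:8])
        print(class_name, x1, y1, x2, y2)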

Exporting the Model#

export#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 17
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

| Field | value_type | Description | default_value | valid_min | automl_enabled |
|---|---|---|---|---|---|
| results_dir | string | Path where all the assets generated from a task are stored. | | | False |
| gpu_id | int | Index of the GPU used to build the TensorRT engine. | 0 | | False |
| checkpoint | string | Path to the checkpoint file to run export on. | ??? | | False |
| onnx_file | string | Path to the ONNX model file. | ??? | | False |
| on_cpu | bool | True: Export a CPU-compatible model. | False | | False |
| input_channel | int | Number of channels in the input tensor. | 3 | 3 | False |
| input_width | int | Width of the input image tensor. | 960 | 32 | False |
| input_height | int | Height of the input image tensor. | 544 | 32 | False |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine. | 17 | 1 | False |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | False |
| verbose | bool | True: Enable verbose TensorRT logging. | False | | False |

EXPORT_JOB_ID=$(tao mask_grounding_dino create-job \
  --kind experiment \
  --name "mask_grounding_dino_export" \
  --action export \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --specs "$EXPORT_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask_grounding_dino export [-h] -e <experiment spec file>
                      export.checkpoint=<model to export>
                      export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file

Optional Arguments

The following arguments are optional to run the command.

  • export.checkpoint: The .pth model to export

  • export.onnx_file: The path where the .onnx model is saved

Sample Usage

This is an example of using the export command:

tao model mask_grounding_dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx
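After export, you can sanity-check the generated ONNX file with the onnx Python package (this assumes the package is installed; it is not a TAO command):

import onnx

# Load and structurally validate the exported model (path is a placeholder).
model = onnx.load("/path/to/model.onnx")
onnx.checker.check_model(model)
print("Inputs:", [i.name for i in model.graph.input])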

TensorRT Engine Generation, Validation, and INT8 Calibration#

For deployment, refer to the TAO Deploy documentation for Mask Grounding DINO.