
Mask Grounding DINO

Mask Grounding DINO is an open-vocabulary instance segmentation model included in TAO. It supports the following tasks:

  • train

  • evaluate

  • inference

  • export

These tasks can be invoked from the TAO Launcher using the following convention on the command-line:


tao model mask_grounding_dino <sub_task> <args_per_subtask>

where args_per_subtask represents the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.

Mask Grounding DINO expects directories of images for training and validation. Training annotations must be JSONL files in ODVG format, and validation annotations must be JSON files in COCO format.

Note

Unlike other instance segmentation models in TAO, Mask Grounding DINO requires the category_id values in your COCO JSON file to start from 0 and to be contiguous: the categories must range from 0 to num_classes - 1. Because the original COCO annotations do not have contiguous category IDs, use the TAO Data Service tao dataset annotations convert to remap them.
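
For reference, below is a minimal sketch of a single ODVG JSONL entry and its accompanying label map. The layout follows the open-source ODVG convention, and all file names and values are illustrative placeholders:

{"filename": "000000391895.jpg", "height": 360, "width": 640, "detection": {"instances": [{"bbox": [359.2, 146.2, 471.6, 359.7], "label": 0, "category": "person"}]}}

The label map is a JSON file mapping the contiguous category IDs to category names:

{"0": "person", "1": "bicycle", "2": "car"}

To remap non-contiguous COCO annotations, the Data Services action mentioned above can be invoked with a conversion spec file (the path shown here is a placeholder):

tao dataset annotations convert -e /path/to/convert_spec.yaml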

The training experiment spec file for Mask Grounding DINO includes model, train, and dataset parameters. This is an example spec file for finetuning a Mask Grounding DINO model with a swin_tiny_224_1k backbone on a COCO dataset.


dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl # ODVG format
      label_map: /path/to/coco/annotations/instances_train2017_labelmap.json
  val_data_sources:
    - image_dir: /path/to/coco/val2017/
      json_file: /path/to/coco/annotations/instances_val2017_contiguous.json # category ids need to be contiguous
  max_labels: 80 # Max number of positive + negative labels passed to the text encoder
  batch_size: 4
  workers: 8
  dataset_type: serialized # To reduce system memory usage
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True # Adding bias in the contrastive embedding layer for training stability
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20]
  num_epochs: 30
  freeze: ["backbone.0", "bert"] # if only finetuning
  pretrained_model_path: /path/to/your-gdino-pretrained-model # if only finetuning
  precision: bf16 # for efficient training

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| encryption_key | string | | | | | | FALSE |
| results_dir | string | | /results | | | | FALSE |
| wandb | collection | | | | | | FALSE |
| model | collection | Configurable parameters to construct the model for a Mask Grounding DINO experiment. | | | | | FALSE |
| dataset | collection | Configurable parameters to construct the dataset for a Mask Grounding DINO experiment. | | | | | FALSE |
| train | collection | Configurable parameters to construct the trainer for a Mask Grounding DINO experiment. | | | | | FALSE |
| evaluate | collection | Configurable parameters to construct the evaluator for a Mask Grounding DINO experiment. | | | | | FALSE |
| inference | collection | Configurable parameters to construct the inferencer for a Mask Grounding DINO experiment. | | | | | FALSE |
| export | collection | Configurable parameters to construct the exporter for a Mask Grounding DINO experiment. | | | | | FALSE |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for a Mask Grounding DINO experiment. | | | | | FALSE |

model

The model parameter provides options to change the Mask Grounding DINO architecture.


model:
  pretrained_model_path: /path/to/your-gdino-pretrained-model
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  loss_types: ['labels', 'boxes', 'masks']

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_backbone_path | string | [Optional] Path to a pretrained backbone file. | | | | | FALSE |
| backbone | string | The backbone name of the model. The TAO implementation of Grounding DINO supports Swin backbones. | swin_tiny_224_1k | | | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k | FALSE |
| num_queries | int | The number of queries. | 900 | 1 | inf | | TRUE |
| num_feature_levels | int | The number of feature levels to use in the model. | 4 | 1 | 5 | | FALSE |
| set_cost_class | float | The relative weight of the classification error in the matching cost. | 1.0 | 0.0 | inf | | FALSE |
| set_cost_bbox | float | The relative weight of the L1 error of the bounding-box coordinates in the matching cost. | 5.0 | 0.0 | inf | | FALSE |
| set_cost_giou | float | The relative weight of the GIoU loss of the bounding box in the matching cost. | 2.0 | 0.0 | inf | | FALSE |
| cls_loss_coef | float | The relative weight of the classification error in the final loss. | 2.0 | 0.0 | inf | | FALSE |
| bbox_loss_coef | float | The relative weight of the L1 error of the bounding-box coordinates in the final loss. | 5.0 | 0.0 | inf | | FALSE |
| giou_loss_coef | float | The relative weight of the GIoU loss of the bounding box in the final loss. | 2.0 | 0.0 | inf | | FALSE |
| num_select | int | The number of top-K predictions selected during post-processing. | 300 | 1 | | | TRUE |
| interm_loss_coef | float | | 1.0 | | | | FALSE |
| no_interm_box_loss | bool | Flag to disable the intermediate bounding-box loss. | False | | | | FALSE |
| pre_norm | bool | Flag to add layer normalization in the encoder. | False | | | | FALSE |
| two_stage_type | string | The type of two-stage mode used in DINO. | standard | | | standard, no | FALSE |
| decoder_sa_type | string | The type of decoder self-attention. | sa | | | sa, ca_label, ca_content | FALSE |
| embed_init_tgt | bool | Flag to add target embedding. | True | | | | FALSE |
| fix_refpoints_hw | int | If -1, the width and height are learned separately for each box. If -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed value. | -1 | -2 | inf | | FALSE |
| pe_temperatureH | int | The temperature applied to the height dimension of the positional sine embedding. | 20 | 1 | inf | | FALSE |
| pe_temperatureW | int | The temperature applied to the width dimension of the positional sine embedding. | 20 | 1 | inf | | FALSE |
| return_interm_indices | list | The indices of the feature levels to use in the model. The length must match num_feature_levels. | [1, 2, 3, 4] | | | | FALSE |
| use_dn | bool | A flag specifying whether to enable contrastive de-noising training in DINO. | True | | | | FALSE |
| dn_number | int | The number of de-noising queries in DINO. | 0 | 0 | inf | | FALSE |
| dn_box_noise_scale | float | The scale of the noise applied to boxes during contrastive de-noising. If this value is 0, noise is not applied. | 1.0 | 0.0 | inf | | FALSE |
| dn_label_noise_ratio | float | The scale of the noise applied to labels during contrastive de-noising. If this value is 0, noise is not applied. | 0.5 | 0.0 | | | FALSE |
| focal_alpha | float | The alpha value in the focal loss. | 0.25 | | | | FALSE |
| focal_gamma | float | The gamma value in the focal loss. | 2.0 | | | | FALSE |
| clip_max_norm | float | | 0.1 | | | | FALSE |
| nheads | int | The number of attention heads. | 8 | | | | FALSE |
| dropout_ratio | float | The probability of dropping hidden units. | 0.0 | 0.0 | 1.0 | | FALSE |
| hidden_dim | int | The dimension of the hidden units. | 256 | | | | FALSE |
| enc_layers | int | The number of encoder layers in the transformer. | 6 | 1 | | | TRUE |
| dec_layers | int | The number of decoder layers in the transformer. | 6 | 1 | | | TRUE |
| dim_feedforward | int | The dimension of the feedforward network. | 2048 | 1 | | | FALSE |
| dec_n_points | int | The number of reference points in the decoder. | 4 | 1 | | | FALSE |
| enc_n_points | int | The number of reference points in the encoder. | 4 | 1 | | | FALSE |
| aux_loss | bool | A flag specifying whether to use auxiliary decoding losses (a loss at each decoder layer). | True | | | | FALSE |
| dilation | bool | A flag specifying whether to enable dilation in the backbone. | False | | | | FALSE |
| train_backbone | bool | Flag to set the backbone weights as trainable or frozen. When set to False, the backbone weights are frozen. | True | | | | FALSE |
| text_encoder_type | string | The BERT encoder type. If only a type name is provided, the weights are downloaded from the Hugging Face Hub. If a path is provided, the weights are loaded from the local path. | bert-base-uncased | | | | FALSE |
| max_text_len | int | The maximum text length of BERT. | 256 | 1 | | | FALSE |
| class_embed_bias | bool | Flag to set a bias in the contrastive embedding. | False | | | | FALSE |
| log_scale | string | [Optional] The initial value of a learnable parameter that multiplies the similarity matrix to normalize the output. If set to 'auto', the similarity matrix is normalized by a fixed value sqrt(d_c), where d_c is the channel number. If set to 'none' or None, no normalization is applied. | none | | | | FALSE |
| loss_types | list | The losses to use during training. | ['labels', 'boxes'] | | | | FALSE |
| backbone_names | list | The prefixes of the tensor names corresponding to the backbone. | ['backbone.0', 'bert'] | | | | FALSE |
| linear_proj_names | list | The names of the linear projection layers. | ['reference_points', 'sampling_offsets'] | | | | FALSE |
| has_mask | bool | Flag to enable the mask head in Grounding DINO. | True | | | | FALSE |
| mask_loss_coef | float | The relative weight of the mask error in the final loss. | 2.0 | | | | FALSE |
| dice_loss_coef | float | The relative weight of the dice loss of the segmentation in the final loss. | 5.0 | | | | FALSE |

train

The train parameter defines the hyperparameters of the training process.


train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [10, 20]
    lr_decay: 0.1
  num_epochs: 30
  checkpoint_interval: 1
  precision: bf16
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1
  freeze: ["backbone.0", "bert"]
  pretrained_model_path: /path/to/pretrained/model

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the train job. | 1 | 1 | | | FALSE |
| gpu_ids | list | The list of GPU IDs to run training on. The length of this list must be equal to train.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | The number of nodes to run training on. If the value is greater than 1, multi-node training is enabled. | 1 | | | | FALSE |
| seed | int | The seed for the initializer in PyTorch. If the value is less than 0, the fixed seed is disabled. | 1234 | -1 | inf | | FALSE |
| cudnn | collection | | | | | | FALSE |
| num_epochs | int | The number of epochs to run training for. | 10 | 1 | inf | | TRUE |
| checkpoint_interval | int | The interval (in epochs) at which a checkpoint is saved. Helps resume training. | 1 | 1 | | | FALSE |
| validation_interval | int | The interval (in epochs) at which evaluation is triggered on the validation dataset. | 1 | 1 | | | FALSE |
| resume_training_checkpoint_path | string | The path to the checkpoint to resume training from. | | | | | FALSE |
| results_dir | string | The path where all the assets generated from a task are stored. | | | | | FALSE |
| freeze | list | The list of layer names to freeze. Example: ["backbone", "transformer.encoder", "input_proj"]. | [] | | | | FALSE |
| pretrained_model_path | string | The path to a pretrained Mask Grounding DINO model to initialize the current training from. | | | | | FALSE |
| clip_grad_norm | float | The amount to clip the gradient by its L2 norm. A value of 0.0 specifies no clipping. | 0.1 | | | | FALSE |
| is_dry_run | bool | Whether to run the trainer in dry-run mode. This is a good way to validate the spec file and run a sanity check on the trainer without actually initializing and running it. | False | | | | FALSE |
| optim | collection | The hyperparameters to configure the optimizer. | | | | | FALSE |
| precision | string | The precision to run the training with. | fp32 | | | fp16, fp32, bf16 | FALSE |
| distributed_strategy | string | The multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported. | ddp | | | ddp, fsdp | FALSE |
| activation_checkpoint | bool | A True value instructs the trainer to recompute activations in the backward pass, rather than storing them, to save GPU memory. | True | | | | FALSE |
| verbose | bool | Flag to enable printing of detailed learning-rate scaling from the optimizer. | False | | | | FALSE |

optim

The optim parameter defines the configuration for the optimizer during training, including the learning rate, learning-rate scheduler, and weight decay.


optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [10, 20]
  lr_decay: 0.1

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| optimizer | string | The type of optimizer used to train the network. | AdamW | | | AdamW, SGD | FALSE |
| monitor_name | string | The metric to monitor for the AutoReduce scheduler. | val_loss | | | val_loss, train_loss | FALSE |
| lr | float | The initial learning rate for training the model, excluding the backbone. | 0.0002 | | | | TRUE |
| lr_backbone | float | The initial learning rate for training the backbone. | 2e-05 | | | | TRUE |
| lr_linear_proj_mult | float | The initial learning rate for training the linear projection layer. | 0.1 | | | | TRUE |
| momentum | float | The momentum for the AdamW optimizer. | 0.9 | | | | TRUE |
| weight_decay | float | The weight-decay coefficient. | 0.0001 | | | | TRUE |
| lr_scheduler | string | The learning-rate scheduler. MultiStep decreases the learning rate by lr_decay at each step in lr_steps; StepLR decreases the learning rate by lr_decay every lr_step_size steps. | MultiStep | | | MultiStep, StepLR | FALSE |
| lr_steps | list | The steps at which the learning rate must be decreased. Applicable only with the MultiStep scheduler. | [10] | | | | FALSE |
| lr_step_size | int | The number of steps after which the learning rate is decreased with the StepLR scheduler. | 10 | | | | TRUE |
| lr_decay | float | The decrease factor for the learning-rate scheduler. | 0.1 | | | | TRUE |

dataset

The dataset parameter defines the dataset source, training batch size, and augmentation.


dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl # ODVG format
      label_map: /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/refcoco.jsonl # grounding dataset, which doesn't require a label_map
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/coco/annotations/instances_val2017_contiguous.json # category ids need to be contiguous
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
  infer_data_sources:
    - image_dir: /path/to/coco/images/val2017/
      captions: ["blackcat", "car"]
  max_labels: 80
  batch_size: 4
  workers: 8

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train_data_sources | list | The list of data sources for training. image_dir is the directory that contains the training images; json_file is the path of the JSONL annotation file in ODVG format; label_map is the (optional) path of the label map, required only for detection datasets. | [{'image_dir': '', 'json_file': '', 'label_map': ''}, {'image_dir': '', 'json_file': ''}] | | | | FALSE |
| val_data_sources | collection | The data source for validation. image_dir is the directory that contains the validation images; json_file is the path of the COCO-format JSON annotation file. Note that the category IDs need to start from 0 to calculate the validation loss. Run the Data Services annotation convert action to make the categories contiguous. | {'image_dir': '', 'json_file': ''} | | | | FALSE |
| test_data_sources | collection | The data source for testing. image_dir is the directory that contains the test images; json_file is the path of the COCO-format JSON annotation file. | {'image_dir': '', 'json_file': ''} | | | | FALSE |
| infer_data_sources | collection | The data source for inference. image_dir is the list of directories that contain the inference images; captions is the list of captions to run inference with. | {'image_dir': [''], 'captions': ['']} | | | | FALSE |
| batch_size | int | The batch size for training and validation. | 4 | 1 | inf | | TRUE |
| workers | int | The number of parallel workers processing data. | 8 | 1 | inf | | TRUE |
| pin_memory | bool | Flag to enable the dataloader to allocate page-locked memory for faster transfer of data between the CPU and GPU. | True | | | | FALSE |
| dataset_type | string | If set to default, the standard map-style torch dataset is used, which loads the ODVG annotations in every subprocess; this leads to redundant copies of data and can exhaust RAM if workers is high. If set to serialized, the data is serialized through pickle and torch.Tensor, which allows it to be shared across subprocesses, so RAM usage can be greatly reduced. | serialized | | | serialized, default | FALSE |
| max_labels | int | The total number of labels to sample. After sampling positive labels, negative labels are randomly sampled so that the total number of labels equals max_labels. For a detection dataset, negative labels are categories not present in the image. For a grounding dataset, negative labels are phrases in the original caption not present in the image. A higher max_labels may improve the robustness of the model at the cost of longer training time. | 50 | 1 | inf | | FALSE |
| eval_class_ids | list | The IDs of the classes for evaluation. | [1] | | | | FALSE |
| augmentation | collection | Configuration parameters for data augmentation. | | | | | FALSE |
| has_mask | bool | Flag to load mask annotations from the dataset. | | | | | FALSE |

augmentation

The augmentation parameter contains hyperparameters for augmentation.


augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| scales | list | A list of sizes to perform random resize. | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] | | | | FALSE |
| input_mean | list | The input mean for RGB frames. | [0.485, 0.456, 0.406] | | | | FALSE |
| input_std | list | The per-pixel input standard deviation for RGB frames. | [0.229, 0.224, 0.225] | | | | FALSE |
| train_random_resize | list | A list of sizes to perform random resize for training data. | [400, 500, 600] | | | | FALSE |
| horizontal_flip_prob | float | The probability of a horizontal flip during training. | 0.5 | 0.0 | 1.0 | | TRUE |
| train_random_crop_min | int | The minimum random crop size for training data. | 384 | 1 | inf | | TRUE |
| train_random_crop_max | int | The maximum random crop size for training data. | 600 | 1 | inf | | TRUE |
| random_resize_max_size | int | The maximum random resize size for training data. | 1333 | 1 | inf | | TRUE |
| test_random_resize | int | The random resize size for test data. | 800 | 1 | inf | | TRUE |
| fixed_padding | bool | A flag specifying whether to resize the image (with no padding) to (sorted(scales)[-1], random_resize_max_size) to prevent a CPU memory leak. | True | | | | FALSE |
| fixed_random_crop | int | A flag to enable large-scale jittering, which is used for ViT backbones. The resulting image resolution is fixed to fixed_random_crop. | 1024 | 1 | inf | | FALSE |

To train a Mask Grounding DINO model, use this command:


tao model mask_grounding_dino train [-h] -e <experiment_spec>

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

  • -h, --help: Show this help message and exit.

Sample Usage

This is an example of the train command:


tao model mask_grounding_dino train -e /path/to/spec.yaml

Optimizing Resources for Training Mask Grounding DINO

Training Mask Grounding DINO on a standard dataset like COCO requires powerful GPUs (for example, V100/A100) with at least 15 GB of VRAM, as well as a large amount of CPU memory. This section outlines some strategies you can use to launch training with limited resources.

Optimize GPU Memory

There are various ways to optimize GPU memory usage. One trick is to reduce dataset.batch_size, but this can cause your training to take longer than usual. Instead, we recommend the following settings to optimize GPU consumption; a consolidated spec sketch follows the list.

  • Set train.precision to bf16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. By recomputing the activations instead of caching them in memory, memory usage is reduced.

  • Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training. This shards gradient computation across processes, helping to reduce GPU memory usage.

  • Try a more lightweight backbone like swin_tiny_224_1k, or freeze the backbone by setting model.train_backbone to False.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
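
As a consolidated example, this spec fragment combines the recommendations above; the values are illustrative and should be adapted to your setup:

train:
  precision: bf16               # automatic mixed precision
  activation_checkpoint: True   # recompute activations in the backward pass
  distributed_strategy: fsdp    # Fully Sharded Data Parallel
model:
  backbone: swin_tiny_224_1k    # lightweight backbone
  train_backbone: False         # optionally freeze the backbone entirely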

Optimize CPU Memory

To speed up data loading, it is common practice to set a high number of workers so that multiple processes are spawned. However, this can exhaust CPU memory if your annotation file is very large. We therefore recommend the settings below to optimize CPU consumption; a spec sketch follows the list.

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before batch formulation. Because of the random resize and random crop augmentations applied during training, the resulting image resolution can vary from image to image. These variable resolutions can cause a memory leak in which CPU memory slowly accumulates until the job runs out of memory mid-training. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
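
The corresponding spec fragment for the CPU-memory recommendations is sketched below; the lowered workers value is a hypothetical extra measure, not required by the settings above:

dataset:
  dataset_type: serialized      # share serialized annotations across dataloader subprocesses
  workers: 4                    # hypothetical: lower the worker count if RAM is still tight
  augmentation:
    fixed_padding: True         # pad images before batching to stabilize CPU memory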

evaluate

The evaluate parameter defines the hyperparameters of the evaluation process.


evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the evaluation job. | 1 | | | | FALSE |
| gpu_ids | list | The list of GPU IDs to run evaluation on. | [0] | | | | FALSE |
| num_nodes | int | The number of nodes to run evaluation on. | 1 | | | | FALSE |
| checkpoint | string | The path to the checkpoint to evaluate. | ??? | | | | FALSE |
| results_dir | string | | | | | | FALSE |
| input_width | int | Width of the input image tensor. | | 1 | | | FALSE |
| input_height | int | Height of the input image tensor. | | 1 | | | FALSE |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy. | | | | | FALSE |
| conf_threshold | float | The confidence threshold used when filtering the final list of boxes. | 0.0 | | | | FALSE |

To run evaluation with a Mask Grounding DINO model, use this command:


tao model mask_grounding_dino evaluate [-h] -e <experiment_spec> \
                                       evaluate.checkpoint=<model to be evaluated>

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

Optional Arguments

  • evaluate.checkpoint: The .pth model to be evaluated

Sample Usage

This is an example of using the evaluate command:


tao model mask_grounding_dino evaluate -e /path/to/spec.yaml evaluate.checkpoint=/path/to/model.pth

inference

The inference parameter defines the hyperparameters of the inference process.


inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    "blackcat": red
    car: blue
dataset:
  infer_data_sources:
    image_dir: /data/raw-data/val2017/
    captions: ["blackcat", "cat"]

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the inference job. | 1 | | | | FALSE |
| gpu_ids | list | The list of GPU IDs to run inference on. | [0] | | | | FALSE |
| num_nodes | int | The number of nodes to run inference on. | 1 | | | | FALSE |
| checkpoint | string | The path to the checkpoint to run inference with. | ??? | | | | FALSE |
| results_dir | string | | | | | | FALSE |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with tao-deploy. | | | | | FALSE |
| color_map | collection | A class-wise dictionary of the colors used to render boxes. | | | | | FALSE |
| conf_threshold | float | The confidence threshold used when filtering the final list of boxes. | 0.5 | | | | FALSE |
| is_internal | bool | Flag to render with the internal directory structure. | False | | | | FALSE |
| input_width | int | Width of the input image tensor. | 960 | 32 | | | FALSE |
| input_height | int | Height of the input image tensor. | 544 | 32 | | | FALSE |
| outline_width | int | Width, in pixels, of the bounding-box outline. | 3 | 1 | | | FALSE |

The inference tool for Mask Grounding DINO models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels on a directory of images.


tao model mask_grounding_dino inference [-h] -e <experiment_spec> inference.checkpoint=<model to run inference with>

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

Optional Arguments

  • inference.checkpoint: The .pth model to run inference with

Sample Usage

This is an example of using the inference command:


tao model mask_grounding_dino inference -e /path/to/spec.yaml inference.checkpoint=/path/to/model.pth
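
For reference, the KITTI-format labels mentioned above are plain-text files with one object per line, following the standard 15-column KITTI layout plus an optional confidence score. The line below is an illustrative placeholder, not actual tool output:

car 0.00 0 0.00 100.00 120.00 200.00 260.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.95

The columns are: class name, truncation, occlusion, observation angle (alpha), the 2D box corners (x1, y1, x2, y2), the 3D dimensions and location (unused for 2D prediction), rotation, and score.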

export

The export parameter defines the hyperparameters of the export process.


export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 17
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | The path where all the assets generated from a task are stored. | | | | | FALSE |
| gpu_id | int | The index of the GPU to build the TensorRT engine with. | 0 | | | | FALSE |
| checkpoint | string | The path to the checkpoint file to run export on. | ??? | | | | FALSE |
| onnx_file | string | The path to the ONNX model file. | ??? | | | | FALSE |
| on_cpu | bool | Flag to export a CPU-compatible model. | False | | | | FALSE |
| input_channel | int | The number of channels in the input tensor. | 3 | 3 | | | FALSE |
| input_width | int | Width of the input image tensor. | 960 | 32 | | | FALSE |
| input_height | int | Height of the input image tensor. | 544 | 32 | | | FALSE |
| opset_version | int | The operator-set version of the ONNX model used to generate the TensorRT engine. | 17 | 1 | | | FALSE |
| batch_size | int | The batch size of the input tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | | | FALSE |
| verbose | bool | Flag to enable verbose TensorRT logging. | False | | | | FALSE |

To export a Mask Grounding DINO model, use this command:

tao model mask_grounding_dino export [-h] -e <experiment_spec> export.checkpoint=<model to export> export.onnx_file=<onnx path>

Required Arguments

  • -e, --experiment_spec: The path to an experiment spec file

Optional Arguments

  • export.checkpoint: The .pth model to export

  • export.onnx_file: The path where the .onnx model is saved

Sample Usage

This is an example of using the export command:


tao model mask_grounding_dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx
