Mask Grounding DINO#
Mask Grounding DINO is an open-vocabulary instance segmentation model included in TAO. It supports the following tasks:
train
evaluate
inference
export
These tasks can be invoked from the TAO Launcher using the following convention on the command-line:
tao model mask_grounding_dino <sub_task> <args_per_subtask>
where args_per_subtask are the command-line arguments required for a given subtask. Each
subtask is explained in detail in the following sections.
Data Input for Mask Grounding DINO#
Mask Grounding DINO expects training annotations to be JSONL files in ODVG format and validation annotations to be JSON files in COCO format, each paired with a directory of images.
Note
Unlike other instance segmentation models in TAO, the category_id values in the COCO JSON file for Mask Grounding DINO
must start from 0, and the category IDs must be contiguous, ranging from 0 to num_classes - 1.
Because the original COCO annotations do not have contiguous category IDs, use the TAO Data Services task
tao dataset annotations convert to remap them. A sketch of the ODVG training annotation format follows this note.
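For reference, each line of an ODVG JSONL file is a standalone JSON object. The sketch below is illustrative only (field values are made up, and the entry is wrapped here for readability; the files produced by the annotation conversion service are the source of truth). It shows the general shape of one object-detection entry and a matching label map that maps contiguous IDs to category names:

{"filename": "000000391895.jpg", "height": 360, "width": 640,
 "detection": {"instances": [
    {"bbox": [359.2, 146.2, 471.6, 359.7], "label": 0, "category": "person"},
    {"bbox": [339.9, 22.2, 493.8, 322.9], "label": 1, "category": "bicycle"}]}}

The corresponding label map file (for example, instances_train2017_labelmap.json) would then contain:

{"0": "person", "1": "bicycle"}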
Creating an Experiment Spec File#
BASE_EXPERIMENT_ID=$(tao mask_grounding_dino list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mask_grounding_dino get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
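The returned default spec can be adjusted with jq before job submission. For example, to derive the TRAIN_SPECS payload used in the training command later in this page (the field paths here are illustrative and assume the schema mirrors the spec file structure shown below):

TRAIN_SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 30 | .dataset.batch_size = 4')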
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
The training experiment spec file for Mask Grounding DINO includes model, train, and dataset parameters.
This is an example spec file for finetuning a Mask Grounding DINO model with a swin_tiny_224_1k backbone on a COCO dataset.
dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl # odvg format
      label_map: /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl # category ids need to be contiguous
    data_type: VG # or OD
  max_labels: 80 # Max number of positive + negative labels passed to the text encoder
  batch_size: 4
  workers: 8
  dataset_type: serialized # To reduce the system memory usage
  augmentation:
    scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
    input_mean: [0.485, 0.456, 0.406]
    input_std: [0.229, 0.224, 0.225]
    horizontal_flip_prob: 0.5
    train_random_resize: [400, 500, 600]
    train_random_crop_min: 384
    train_random_crop_max: 600
    random_resize_max_size: 1333
    test_random_resize: 800
model:
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True # Adding bias in the contrastive embedding layer for training stability
  num_region_queries: 100 # 0 if not using ReLA; otherwise, the number of region queries
  loss_types: ['labels', 'boxes', 'masks', 'rela'] # Remove the rela loss if not using ReLA
train:
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20]
  num_epochs: 30
  freeze: ["backbone.0", "bert"] # if only finetuning
  pretrained_model_path: /path/to/your-gdino-pretrained-model # if only finetuning
  precision: bf16 # for efficient training
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
|  | string |  |  |  |  |  | False |
| results_dir | string |  | /results |  |  |  | False |
|  | collection |  |  |  |  |  | False |
| model | collection | Configurable parameters to construct the model for a Mask Grounding DINO experiment. |  |  |  |  | False |
| dataset | collection | Configurable parameters to construct the dataset for a Mask Grounding DINO experiment. |  |  |  |  | False |
| train | collection | Configurable parameters to construct the trainer for a Mask Grounding DINO experiment. |  |  |  |  | False |
| evaluate | collection | Configurable parameters to construct the evaluator for a Mask Grounding DINO experiment. |  |  |  |  | False |
| inference | collection | Configurable parameters to construct the inferencer for a Mask Grounding DINO experiment. |  |  |  |  | False |
| export | collection | Configurable parameters to construct the exporter for a Mask Grounding DINO experiment. |  |  |  |  | False |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for a Mask Grounding DINO experiment. |  |  |  |  | False |
model#
The model parameter provides options to change the Mask Grounding DINO architecture.
model:
  pretrained_model_path: /path/to/your-gdino-pretrained-model
  backbone: swin_tiny_224_1k
  train_backbone: True
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  dropout_ratio: 0.0
  dim_feedforward: 2048
  log_scale: auto
  class_embed_bias: True
  num_region_queries: 100
  loss_types: ['labels', 'boxes', 'masks', 'rela']
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_backbone_path | string | [Optional] Path to a pretrained backbone file. |  |  |  |  | False |
| backbone | string | Backbone name of the model. The TAO implementation of Grounding DINO supports Swin. | swin_tiny_224_1k |  |  | swin_tiny_224_1k, swin_base_224_22k, swin_base_384_22k, swin_large_224_22k, swin_large_384_22k | False |
| num_queries | int | Number of queries. | 900 | 1 | inf |  | True |
| num_feature_levels | int | Number of feature levels to use in the model. | 4 | 1 | 5 |  | False |
|  | float | Relative weight of the classification error in the matching cost. | 1.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the L1 error of the bounding box coordinates in the matching cost. | 5.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the GIoU loss of the bounding box in the matching cost. | 2.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the classification error in the final loss. | 2.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the L1 error of the bounding box coordinates in the final loss. | 5.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the GIoU loss of the bounding box in the final loss. | 2.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the No-Target loss of the region query in the final loss. | 1.0 | 0.0 | inf |  | False |
|  | float | Relative weight of the Minimap loss of the region query in the final loss. | 0.5 | 0.0 | inf |  | False |
|  | float | Relative weight of the Union Mask loss of the region query in the final loss. | 2.0 | 0.0 | inf |  | False |
| num_select | int | Number of top-K predictions selected during post-processing. | 300 | 1 |  |  | True |
| num_region_queries | int | Number of region queries: 0 if not using ReLA; otherwise, the number of region queries. | 100 | 0 |  |  | True |
|  | float |  | 1.0 |  |  |  | False |
|  | bool | True: Disable the intermediate bbox loss. | False |  |  |  | False |
|  | bool | True: Add layer norm in the encoder. | False |  |  |  | False |
|  | string | Type of two-stage scheme in DINO. | standard |  |  | standard, no | False |
|  | string | Type of decoder self-attention. | sa |  |  | sa, ca_label, ca_content | False |
|  | bool | True: Add target embedding. | True |  |  |  | False |
|  | int | If -1, width and height are learned separately for each box. If -2, a shared width and height are learned. A value greater than 0 specifies learning with a fixed number. | -1 | -2 | inf |  | False |
|  | int | Temperature applied to the height dimension of the positional sine embedding. | 20 | 1 | inf |  | False |
|  | int | Temperature applied to the width dimension of the positional sine embedding. | 20 | 1 | inf |  | False |
|  | list | Indices of the feature levels to use in the model. The length must match num_feature_levels. | [1, 2, 3, 4] |  |  |  | False |
|  | bool | True: Enable contrastive de-noising training in DINO. | True |  |  |  | False |
|  | int | Number of denoising queries in DINO. | 0 | 0 | inf |  | False |
|  | float | Scale of the noise applied to boxes during contrastive de-noising. If 0, noise is not applied. | 1.0 | 0.0 | inf |  | False |
|  | float | Scale of the noise applied to labels during contrastive de-noising. If 0, noise is not applied. | 0.5 | 0.0 |  |  | False |
|  | float | Alpha value in the focal loss. | 0.25 |  |  |  | False |
|  | float | Gamma value in the focal loss. | 2.0 |  |  |  | False |
|  | float |  | 0.1 |  |  |  | False |
|  | int | Number of attention heads. | 8 |  |  |  | False |
| dropout_ratio | float | Probability of dropping hidden units. | 0.0 | 0.0 | 1.0 |  | False |
|  | int | Dimension of the hidden units. | 256 |  |  |  | False |
| enc_layers | int | Number of encoder layers in the transformer. | 6 | 1 |  |  | True |
| dec_layers | int | Number of decoder layers in the transformer. | 6 | 1 |  |  | True |
| dim_feedforward | int | Dimension of the feedforward network. | 2048 | 1 |  |  | False |
|  | int | Number of reference points in the decoder. | 4 | 1 |  |  | False |
|  | int | Number of reference points in the encoder. | 4 | 1 |  |  | False |
|  | bool | True: Use auxiliary decoding losses (a loss at each decoder layer). | True |  |  |  | False |
|  | bool | True: Enable dilation in the backbone. | False |  |  |  | False |
| train_backbone | bool | True: Backbone weights are trainable. False: Backbone weights are frozen. | True |  |  |  | False |
|  | string | BERT encoder type. If only a type name is provided, the weights are downloaded from the Hugging Face Hub. If a path is provided, the weights are loaded from that local path. | bert-base-uncased |  |  |  | False |
|  | int | Maximum text length of BERT. | 256 | 1 |  |  | False |
| class_embed_bias | bool | True: Add a bias in the contrastive embedding layer. | False |  |  |  | False |
| log_scale | string | [Optional] Initial value of a learnable parameter that multiplies the similarity matrix to normalize the output. | none |  |  |  | False |
| loss_types | list | Losses to use during training. | ['labels', 'boxes'] |  |  |  | False |
|  | list | Prefixes of the tensor names corresponding to the backbone. | ['backbone.0', 'bert'] |  |  |  | False |
|  | list | Linear projection layer names. | ['reference_points', 'sampling_offsets'] |  |  |  | False |
|  | bool | True: Enable the mask head in Grounding DINO. | True |  |  |  | False |
|  | float | Relative weight of the mask error in the final loss. | 2.0 |  |  |  | False |
|  | float | Relative weight of the dice loss of the segmentation in the final loss. | 5.0 |  |  |  | False |
train#
The train parameter defines the hyperparameters of the training process.
train:
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [10, 20]
    lr_decay: 0.1
  num_epochs: 30
  checkpoint_interval: 1
  precision: bf16
  distributed_strategy: ddp
  activation_checkpoint: True
  num_gpus: 8
  num_nodes: 1
  freeze: ["backbone.0", "bert"]
  pretrained_model_path: /path/to/pretrained/model
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the train job. | 1 | 1 |  |  | False |
| gpu_ids | list | List of GPU IDs to run training on. The length of this list must match num_gpus. | [0] |  |  |  | False |
| num_nodes | int | Number of nodes for training. A value greater than 1 enables multi-node training. | 1 |  |  |  | False |
| seed | int | Seed for the PyTorch initializer. A value less than 0 disables the fixed seed. | 1234 | -1 | inf |  | False |
|  | collection | cuDNN configuration. |  |  |  |  | False |
| num_epochs | int | Number of training epochs. | 10 | 1 | inf |  | True |
| checkpoint_interval | int | Interval (in epochs) at which checkpoints are saved. Helps resume training. | 1 | 1 |  |  | False |
| validation_interval | int | Interval (in epochs) at which evaluation runs on the validation dataset. | 1 | 1 |  |  | False |
| resume_training_checkpoint_path | string | Path to a checkpoint from which to resume training. |  |  |  |  | False |
| results_dir | string | Path to store all the assets generated from a task. |  |  |  |  | False |
| freeze | list | Layers to freeze. Example: ["backbone", "transformer.encoder", "input_proj"]. | [] |  |  |  | False |
| pretrained_model_path | string | Path to a pretrained Deformable DETR model used for initialization. |  |  |  |  | False |
| clip_grad_norm | float | Clip gradients by this L2 norm. A value of 0.0 disables gradient clipping. | 0.1 |  |  |  | False |
|  | bool | True: Run the trainer in dry-run mode, which validates the spec file and runs a sanity check without initializing the trainer. | False |  |  |  | False |
| optim | collection | Hyperparameters for the optimizer configuration. |  |  |  |  | False |
| precision | string | Training precision. | fp32 |  |  | fp16, fp32, bf16 | False |
| distributed_strategy | string | Multi-GPU training strategy. DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are supported. | ddp |  |  | ddp, fsdp | False |
| activation_checkpoint | bool | True: Recompute activations in the backward pass instead of storing intermediate activations, saving GPU memory. | True |  |  |  | False |
|  | bool | True: Enable detailed logging of the optimizer learning rates. | False |  |  |  | False |
optim#
The optim parameter defines the optimizer configuration used during training, including the
learning rate, learning rate scheduler, and weight decay.
optim:
  lr: 0.0002
  lr_backbone: 0.00002
  momentum: 0.9
  weight_decay: 0.0001
  lr_scheduler: MultiStep
  lr_steps: [10, 20]
  lr_decay: 0.1
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
|  | string | Optimizer type for training. | AdamW |  |  | AdamW, SGD | False |
|  | string | Metric to monitor during training. | val_loss |  |  | val_loss, train_loss | False |
| lr | float | Initial learning rate for the model, excluding the backbone. | 0.0002 |  |  |  | True |
| lr_backbone | float | Initial learning rate for the backbone. | 2e-05 |  |  |  | True |
|  | float | Initial learning rate multiplier for the linear projection layers. | 0.1 |  |  |  | True |
| momentum | float | Momentum for the AdamW optimizer. | 0.9 |  |  |  | True |
| weight_decay | float | Weight decay coefficient. | 0.0001 |  |  |  | True |
| lr_scheduler | string | Learning rate scheduler type. | MultiStep |  |  | MultiStep, StepLR | False |
| lr_steps | list | Epochs at which the learning rate decreases (for MultiStep). | [10] |  |  |  | False |
|  | int | Number of epochs between learning rate decreases (for StepLR). | 10 |  |  |  | True |
| lr_decay | float | Factor by which the scheduler decreases the learning rate. | 0.1 |  |  |  | True |
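With the MultiStep scheduler configured as above, the learning rate is multiplied by lr_decay at each epoch listed in lr_steps. A minimal Python sketch of this assumed behavior (plain Python, not a TAO API):

def lr_at_epoch(epoch, base_lr=2e-4, lr_steps=(10, 20), lr_decay=0.1):
    # One decay factor is applied for every boundary the epoch has passed.
    return base_lr * lr_decay ** sum(epoch >= step for step in lr_steps)

print(lr_at_epoch(5), lr_at_epoch(15), lr_at_epoch(25))  # approx. 0.0002, 2e-05, 2e-06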
dataset#
The dataset parameter defines the dataset source, training batch size, and
augmentation.
dataset:
  train_data_sources:
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/coco/annotations/instances_train2017.jsonl # odvg format
      label_map: /path/to/coco/annotations/instances_train2017_labelmap.json
    - image_dir: /path/to/coco/train2017/
      json_file: /path/to/refcoco-like/annotations/instances_train2017.jsonl # odvg format
  val_data_sources:
    image_dir: /path/to/coco/val2017/
    json_file: /path/to/refcoco-like/annotations/instances_val2017_contiguous.jsonl # category ids need to be contiguous
    data_type: VG # or OD
  test_data_sources:
    image_dir: /path/to/coco/images/val2017/
    json_file: /path/to/coco/annotations/instances_val2017.json
    data_type: OD # or VG
  infer_data_sources:
    image_dir: /path/to/coco/images/val2017/
    data_type: OD # or VG
    captions: ["black cat", "car"] # or a json file that contains the image paths and captions
  max_labels: 80
  batch_size: 4
  workers: 8
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train_data_sources | list | List of training data sources. Each source specifies an image_dir, a json_file in ODVG JSONL format, and optionally a label_map. | [{'image_dir': '', 'json_file': '', 'label_map': ''}, {'image_dir': '', 'json_file': ''}] |  |  |  | False |
| val_data_sources | collection | Validation data source. Category IDs must start from 0 to calculate validation loss; run the Data Services annotation conversion to make the categories contiguous. | {'image_dir': '', 'json_file': '', 'data_type': ''} |  |  |  | False |
| test_data_sources | collection | Test data source. | {'image_dir': '', 'json_file': '', 'data_type': ''} |  |  |  | False |
| infer_data_sources | collection | Inference data source. | {'image_dir': '', 'data_type': ''} |  |  |  | False |
| batch_size | int | Batch size for training and validation. | 4 | 1 | inf |  | True |
| workers | int | Number of parallel data loader workers. | 8 | 1 | inf |  | True |
| pin_memory | bool | True: Allocate page-locked memory for faster CPU-to-GPU data transfer. | True |  |  |  | False |
| dataset_type | string | Dataset structure type. | serialized |  |  | serialized, default | False |
| max_labels | int | Total number of labels to sample per image. After the positive labels, negative labels are sampled until this total is reached. A higher value results in more negative prompts being passed to the text encoder. | 50 | 1 | inf |  | False |
|  | list | Class IDs used for evaluation. | [1] |  |  |  | False |
| augmentation | collection | Data augmentation parameters. |  |  |  |  | False |
|  | bool | True: Load mask annotations from the dataset. |  |  |  |  | False |
augmentation#
The augmentation parameter contains hyperparameters for augmentation.
augmentation:
  scales: [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
  input_mean: [0.485, 0.456, 0.406]
  input_std: [0.229, 0.224, 0.225]
  horizontal_flip_prob: 0.5
  train_random_resize: [400, 500, 600]
  train_random_crop_min: 384
  train_random_crop_max: 600
  random_resize_max_size: 1333
  test_random_resize: 800
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| scales | list | Sizes at which to perform random resize. | [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800] |  |  |  | False |
| input_mean | list | Input mean for RGB frames. | [0.485, 0.456, 0.406] |  |  |  | False |
| input_std | list | Input standard deviation per channel for RGB frames. | [0.229, 0.224, 0.225] |  |  |  | False |
| train_random_resize | list | Sizes at which to perform random resize on training data. | [400, 500, 600] |  |  |  | False |
| horizontal_flip_prob | float | Probability of a horizontal flip during training. | 0.5 | 0.0 | 1.0 |  | True |
| train_random_crop_min | int | Minimum random crop size for training data. | 384 | 1 | inf |  | True |
| train_random_crop_max | int | Maximum random crop size for training data. | 600 | 1 | inf |  | True |
| random_resize_max_size | int | Maximum random resize size for training data. | 1333 | 1 | inf |  | True |
| test_random_resize | int | Random resize size for test data. | 800 | 1 | inf |  | True |
| fixed_padding | bool | True: Resize images to (sorted(scales)[-1], random_resize_max_size) without additional padding. This prevents a CPU memory leak. | True |  |  |  | False |
|  | int | Determines the resulting image resolution. A value of 0 disables Large Scale Jittering (cropping). | 1024 | 1 | inf |  | False |
Training the Model#
To train a Mask Grounding DINO model, use this command:
TRAIN_JOB_ID=$(tao mask_grounding_dino create-job \
--kind experiment \
--name "mask_grounding_dino_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs "$TRAIN_SPECS" \
--train-datasets '["'$DATASET_ID'"]' \
--eval-dataset "$DATASET_ID" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mask_grounding_dino train [-h] -e <experiment_spec>
Required Arguments
The following arguments are required to run the command.
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
The following arguments are optional to run the command.
-h, --help: Show this help message and exit.
Sample Usage
This is an example of the train command:
tao model mask_grounding_dino train -e /path/to/spec.yaml
Optimizing Resources for Training Mask Grounding DINO#
Training Mask Grounding DINO on a standard dataset like COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM, as well as substantial CPU memory. This section outlines strategies for launching training with limited resources.
Optimize GPU Memory#
There are various ways to optimize GPU memory usage. One option is to reduce dataset.batch_size; however, this can make training take longer than usual.
We recommend the following settings to optimize GPU memory consumption; a combined spec snippet follows the list.
* Set train.precision to bf16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.
* Set train.activation_checkpoint to True to enable activation checkpointing. By recomputing the activations instead of caching them in memory, memory usage improves.
* Set train.distributed_strategy to fsdp to enable Fully Sharded Data Parallel training. This shards gradient calculation across processes to help reduce GPU memory.
* Try more lightweight backbones like swin_tiny_224_1k, or freeze the backbone by setting model.train_backbone to False.
* Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
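Combined in spec form, the recommendations above look like this (values are illustrative):

train:
  precision: bf16
  activation_checkpoint: True
  distributed_strategy: fsdp
model:
  backbone: swin_tiny_224_1k
  train_backbone: False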
Optimize CPU Memory#
To speed up data loading, it is common practice to set a high number of workers to spawn multiple processes. However, this can cause the CPU to run out of memory if the annotation file is very large. We recommend the following settings to optimize CPU memory consumption; a combined spec snippet follows the list.
* Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across subprocesses.
* Set dataset.augmentation.fixed_padding to True so that images are padded before batch formulation. Because of the random resize and random crop augmentations used during training, the image resolution after transformation can vary across images. Such variable resolutions can cause a memory leak, with CPU memory slowly accumulating until the job runs out of memory mid-training. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
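In spec form, these two recommendations are:

dataset:
  dataset_type: serialized
  augmentation:
    fixed_padding: True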
Evaluating the Model#
evaluate#
The evaluate parameter defines the hyperparameters of the evaluate process.
evaluate:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.0
  num_gpus: 1
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int |  | 1 |  |  |  | False |
| gpu_ids | list |  | [0] |  |  |  | False |
| num_nodes | int |  | 1 |  |  |  | False |
| checkpoint | string |  | ??? |  |  |  | False |
| results_dir | string |  |  |  |  |  | False |
| input_width | int | Width of the input image tensor. |  | 1 |  |  | False |
| input_height | int | Height of the input image tensor. |  | 1 |  |  | False |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This is only used with TAO Deploy. |  |  |  |  | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 |  |  |  | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 |  |  |  | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 |  |  |  | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 |  |  |  | False |
To run evaluation with a Mask Grounding DINO model, use this command:
EVAL_JOB_ID=$(tao mask_grounding_dino create-job \
--kind experiment \
--name "mask_grounding_dino_evaluate" \
--action evaluate \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--eval-dataset "$DATASET_ID" \
--specs "$EVALUATE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mask_grounding_dino evaluate [-h] -e <experiment_spec> \
evaluate.checkpoint=<model to be evaluated>
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment
Optional Arguments
The following arguments are optional to run the command.
evaluate.checkpoint: The .pth model to be evaluated
Sample Usage
This is an example of using the evaluate command:
tao model mask_grounding_dino evaluate -e /path/to/spec.yaml evaluate.checkpoint=/path/to/model.pth
Running Inference with a Mask Grounding DINO Model#
inference#
The inference parameter defines the hyperparameters of the inference process.
inference:
  checkpoint: /path/to/model.pth
  conf_threshold: 0.5
  num_gpus: 1
  color_map:
    "black cat": red
    car: blue
  ioi_threshold: 0.5
  nms_threshold: 0.2
  text_threshold: 0.3
dataset:
  infer_data_sources:
    image_dir: /data/raw-data/val2017/
    captions: ["black cat", "cat"] # or a json file that contains the image paths and captions for VG
    data_type: OD # or VG
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int |  | 1 |  |  |  | False |
| gpu_ids | list |  | [0] |  |  |  | False |
| num_nodes | int |  | 1 |  |  |  | False |
| checkpoint | string |  | ??? |  |  |  | False |
| results_dir | string |  |  |  |  |  | False |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This is only used with TAO Deploy. |  |  |  |  | False |
| color_map | collection | Class-wise dictionary of colors used to render boxes. |  |  |  |  | False |
| conf_threshold | float | Confidence threshold on box scores for filtering final masks and boxes. | 0.0 |  |  |  | False |
| ioi_threshold | float | Intersection-over-instance (IoI) threshold between the ReLA output and instance masks for filtering final masks and boxes. | 0.5 |  |  |  | False |
| nms_threshold | float | Non-maximum suppression threshold on boxes for filtering final masks and boxes. | 0.2 |  |  |  | False |
| text_threshold | float | Text threshold for extracting phrases from expressions. | 0.3 |  |  |  | False |
|  | bool | True: Render outputs following the internal directory structure. | False |  |  |  | False |
| input_width | int | Width of the input image tensor. | 960 | 32 |  |  | False |
| input_height | int | Height of the input image tensor. | 544 | 32 |  |  | False |
|  | int | Width in pixels of the bounding box outline. | 3 | 1 |  |  | False |
The inference tool for Mask Grounding DINO models can be used to visualize bboxes and generate frame-by-frame KITTI-format labels on a directory of images.
INFERENCE_JOB_ID=$(tao mask_grounding_dino create-job \
--kind experiment \
--name "mask_grounding_dino_inference" \
--action inference \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--inference-dataset "$DATASET_ID" \
--specs "$INFERENCE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mask_grounding_dino inference [-h] -e <experiment spec file>
inference.checkpoint=<model to be inferenced>
Required Arguments
The following arguments are required to run the command.
-e, --experiment_spec: The experiment spec file to set up the inference experiment
Optional Arguments
The following arguments are optional to run the command.
inference.checkpoint: The .pth model to run inference on
Sample Usage
This is an example of using the inference command:
tao model mask_grounding_dino inference -e /path/to/spec.yaml inference.checkpoint=/path/to/model.pth
Exporting the Model#
export#
The export parameter defines the hyperparameters of the export process.
export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 17
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1
| Field | value_type | Description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored. |  |  |  |  | False |
| gpu_id | int | Index of the GPU used to build the TensorRT engine. | 0 |  |  |  | False |
| checkpoint | string | Path to the checkpoint file to export. | ??? |  |  |  | False |
| onnx_file | string | Path to the ONNX model file. | ??? |  |  |  | False |
| on_cpu | bool | True: Export a CPU-compatible model. | False |  |  |  | False |
| input_channel | int | Number of channels in the input tensor. | 3 | 3 |  |  | False |
| input_width | int | Width of the input image tensor. | 960 | 32 |  |  | False |
| input_height | int | Height of the input image tensor. | 544 | 32 |  |  | False |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine. | 17 | 1 |  |  | False |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies a dynamic batch size. | -1 | -1 |  |  | False |
| verbose | bool | True: Enable verbose TensorRT logging. | False |  |  |  | False |
EXPORT_JOB_ID=$(tao mask_grounding_dino create-job \
--kind experiment \
--name "mask_grounding_dino_export" \
--action export \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--specs "$EXPORT_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mask_grounding_dino export [-h] -e <experiment spec file>
export.checkpoint=<model to export>
export.onnx_file=<onnx path>
Required Arguments
The following arguments are required to run the command.
-e, --experiment_spec: The path to an experiment spec file
Optional Arguments
The following arguments are optional to run the command.
export.checkpoint: The .pth model to export
export.onnx_file: The path where the .onnx model is saved
Sample Usage
This is an example of using the export command:
tao model mask_grounding_dino export -e /path/to/spec.yaml export.checkpoint=/path/to/model.pth export.onnx_file=/path/to/model.onnx
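After export, you can optionally sanity-check the generated file with the open-source onnx Python package before engine generation. This is a generic check, not a TAO command; a minimal sketch:

import onnx

# Load the exported graph and run ONNX's structural validation.
model = onnx.load("/path/to/model.onnx")
onnx.checker.check_model(model)

# List the graph inputs that the downstream engine builder must bind.
print([inp.name for inp in model.graph.input])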
TensorRT Engine Generation, Validation, and int8 Calibration#
For deployment, refer to TAO Deploy documentation for Mask Grounding DINO.