Mask2Former

Mask2Former supports the following tasks:

  • train

  • evaluate

  • inference

  • export

These tasks may be invoked from the TAO Launcher using the following convention on the command line:

tao model mask2former <sub_task> <args_per_subtask>

where args_per_subtask are the command-line arguments required for a given subtask. Each of these subtasks is explained in the following sections.

Mask2Former supports three types of dataloaders, corresponding to the semantic, panoptic, and instance segmentation tasks.

Each dataloader requires a certain annotation format.

For the semantic segmentation task, each line of the JSONL annotation file encodes the locations of the raw image and the mask groundtruth.
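
As an illustration, a single line of such a JSONL file might pair an image path with its mask path, as in the following sketch (the key names and paths shown here are hypothetical; use the layout expected by your dataset):

{"image": "/datasets/my_dataset/images/train/0001.png", "label": "/datasets/my_dataset/masks/train/0001.png"}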

For the panoptic and instance segmentation tasks, the annotation formats follow the COCO panoptic format and the COCO format, respectively.

Note

The category ids and annotation ids must be greater than 0.

The Mask2Former spec file has six components (model, inference, evaluate, dataset, export, and train) as well as several global parameters, which are described below. The spec file is in YAML format.

Here’s a sample of the Mask2Former spec file:

results_dir: /workspace/mask2former_coco_swint
data:
  contiguous_id: False
  label_map: /tlt3_experiments/mask2former_coco_effvit_b2/colormap.json
  type: 'coco_panoptic'
  train:
    panoptic_json: "/datasets/coco/annotations/panoptic_train2017.json"
    img_dir: "/datasets/coco/train2017"
    panoptic_dir: "/datasets/coco/panoptic_train2017"
    batch_size: 16
    num_workers: 20
  val:
    panoptic_json: "/datasets/coco/annotations/panoptic_val2017.json"
    img_dir: "/datasets/coco/val2017"
    panoptic_dir: "/datasets/coco/panoptic_val2017"
    batch_size: 1
    num_workers: 2
    target_size: [1024, 1024]
  test:
    img_dir: /workspace/test_images/
    batch_size: 1
  augmentation:
    train_min_size: [1024]
    train_max_size: 2560
    train_crop_size: [1024, 1024]
    test_min_size: 1024
    test_max_size: 2560
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 5
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.
  overlap_threshold: 0.8
  mode: "semantic"
  backbone:
    pretrained_weights: "/workspace/mask2former_coco_swint/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 200
inference:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
evaluate:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
export:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
  input_channel: 3
  input_width: 1024
  input_height: 1024
  opset_version: 17

Parameter Data Type Default Description Supported Values
model dict config The configuration of the model architecture
dataset dict config The configuration of the dataset
train dict config The configuration of the training task
evaluate dict config The configuration of the evaluation task
inference dict config The configuration of the inference task
encryption_key string None The encryption key to encrypt and decrypt model files
results_dir string /results The directory where experiment results are saved
export dict config The configuration of the ONNX export task

Model Config

The model configuration (model) defines the Mask2Former model structure. This model is used for training, evaluation, and inference. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT (experimental feature) models.

Field Description Data Type and Constraints Supported Value
backbone The backbone configuration Dict
sem_seg_head The configuration for the segmentation head Dict
mask_former The configuration for the mask2former architecture Dict
mode The post-processing mode string ‘panoptic’, ‘semantic’, ‘instance’
object_mask_threshold Classification confidence threshold float 0.4
overlap_threshold Overlap threshold for panoptic inference float 0.8
test_topk_per_image Keep topk instances per image for instance inference Unsigned int 100
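
For example, a model section configured for instance-segmentation post-processing might look like the following sketch (the values simply reuse the typical values from the table above):

model:
  mode: "instance"              # post-processing mode: panoptic, semantic, or instance
  object_mask_threshold: 0.4    # classification confidence threshold
  overlap_threshold: 0.8        # overlap threshold used during panoptic inference
  test_topk_per_image: 100      # keep the top-k instances per image during instance inference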

Backbone Config

The backbone configuration (backbone) defines the backbone structure. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT models.

Field Description Data Type and Constraints Recommended/Typical Value
type The backbone type str “swin”
pretrained_weights The path to the pretrained backbone model str
swin The configuration for the Swin backbones Dict
efficientvit The configuration for the EfficientViT backbones Dict

Swin Config

The swin configuration (swin) specifies the key parameters in a Swin Transformer backbone.

Field Description Data Type and Constraints Recommended/Typical Value
type The type of Swin Transformer (from tiny to huge) str “large”
pretrain_img_size The image size used in pretraining Unsigned int 384
out_indices The stages to extract feature maps List [0, 1, 2, 3]
out_features The names of the extracted feature maps List [“res2”, “res3”, “res4”, “res5”]
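
As a sketch, a backbone section using a Swin-Large backbone pretrained at 384x384 might look as follows (the weights path is a placeholder, and the window_size value is an assumption based on the standard Swin-Large 384 configuration):

backbone:
  type: "swin"
  pretrained_weights: /path/to/swin_large_patch4_window12_384_22k.pth   # placeholder path
  swin:
    type: "large"
    pretrain_img_size: 384
    window_size: 12                # assumed value for the Swin-Large 384 variant
    ape: False
    out_indices: [0, 1, 2, 3]
    out_features: ["res2", "res3", "res4", "res5"]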

EfficientViT Config

The efficientvit configuration (efficientvit) specifies the key parameters in an EfficientViT backbone.

Field Description Data Type and Constraints Recommended/Typical Value
name The name of the EfficientViT model (“b0”-“b3”, “l0”-“l3”) str “l2”
pretrain_img_size The image size used in pretraining Unsigned int 384
out_indices The stages to extract feature maps List [0, 1, 2, 3]
out_features The names of the extracted feature maps List [“res2”, “res3”, “res4”, “res5”]
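
Similarly, a hedged sketch of an EfficientViT backbone section might look as follows (the type string "efficientvit" and the weights path are assumptions; only "swin" is listed as an example type value in the backbone table above):

backbone:
  type: "efficientvit"                               # assumed type string for EfficientViT
  pretrained_weights: /path/to/efficientvit_l2.pth   # placeholder path
  efficientvit:
    name: "l2"
    pretrain_img_size: 384
    out_indices: [0, 1, 2, 3]
    out_features: ["res2", "res3", "res4", "res5"]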

Data Config

The data configuration (data) defines the data sources, augmentation methods, and pre-processing hyperparameters.

Field Description Data Type and Constraints Recommended/Typical Value
pixel_mean Image mean in RGB order List [0.485, 0.456, 0.406]
pixel_std Image standard deviation in RGB order List [0.229, 0.224, 0.225]
augmentation The augmentation settings Dict
contiguous_id Whether to use contiguous ids bool
label_map The path to the label mapping file string
workers The number of workers to load data for each GPU Unsigned int
train The train dataset config Dict
val The validation dataset config Dict
test The test dataset config Dict
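
For example, the normalization and worker settings, which are not shown in the sample spec above, could be added to the data section as in the following sketch (the label_map path is a placeholder):

data:
  type: 'coco_panoptic'
  contiguous_id: False
  label_map: /path/to/colormap.json    # placeholder path
  pixel_mean: [0.485, 0.456, 0.406]    # image mean in RGB order
  pixel_std: [0.229, 0.224, 0.225]     # image standard deviation in RGB order
  workers: 8                           # data-loading workers per GPU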

Augmentation Config

The augmentation configuration (augmentation) defines the augmentation methods.

Parameter Datatype Description Supported Values
train_min_size int list A list of sizes to perform random resize for training data int list
train_max_size unsigned int The maximum resize size for training data >0
train_crop_size int list The random crop size for training data in [H, W] int list
test_min_size unsigned int The minimum resize size for test data >0
test_max_size unsigned int The maximum resize size for test data >0

Dataset Config

The dataset configuration (dataset) defines the dataset directories, annotation files, and batch sizes for the train, val, and test splits.

Parameter Datatype Description
type str Dataset type (“ade”, “coco”, “coco_panoptic”)
panoptic_json str JSON file in COCO panoptic format
img_dir str Image directory (can be relative path to root_dir)
panoptic_dir str Directory of panoptic segmentation annotation images
root_dir str Root directory to img_dir
annot_file str JSON file in COCO/COCO_panoptic format or JSONL format for image/mask pair
batch_size unsigned int Batch size
num_workers unsigned int Number of workers to process the input data
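
For instance, a train split for the instance segmentation task (COCO format) might be sketched as follows (the paths are placeholders, and the relative img_dir is resolved against root_dir):

data:
  type: 'coco'
  train:
    root_dir: /datasets/coco                                          # placeholder root directory
    img_dir: train2017                                                # relative to root_dir
    annot_file: /datasets/coco/annotations/instances_train2017.json   # COCO-format annotation file
    batch_size: 8
    num_workers: 8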

Train Config

The train configuration defines the hyperparameters of the training process.

train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optim:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

Parameter Datatype Default Description Supported Values
num_gpus unsigned int 1 The number of GPUs to use for distributed training >0
gpu_ids List[int] [0] The indices of the GPUs to use for distributed training
seed unsigned int 1234 The random seed for random, NumPy, and torch >0
num_epochs unsigned int 10 The total number of epochs to run the experiment >0
checkpoint_interval unsigned int 1 The epoch interval at which the checkpoints are saved >0
validation_interval unsigned int 1 The epoch interval at which the validation is run >0
resume_training_checkpoint_path string The intermediate PyTorch Lightning checkpoint to resume training from
results_dir string /results/train The directory to save training results
optim dict config The config for the optimizer, including the learning rate, learning scheduler, and weight decay >0
clip_grad_type str full The type of gradient clip method
clip_grad_norm float 0.1 The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping >=0
precision string fp32 Specifying “fp16” enables mixed-precision training, which can help save GPU memory. fp32, fp16
distributed_strategy string ddp The multi-GPU training strategy. DDP (Distributed Data Parallel) and Sharded DDP are supported. ddp, ddp_sharded
activation_checkpoint bool True A value of True instructs the trainer to recompute activations in the backward pass, rather than storing them, to save GPU memory. True, False
pretrained_model_path string The path to a pretrained model checkpoint to load for fine-tuning
num_nodes unsigned int 1 The number of nodes. If the value is larger than 1, multi-node is enabled >0
freeze string list [] The list of layer names in the model to freeze. Example [“backbone”, “transformer.encoder”, “input_proj”]
verbose bool False Whether to print detailed learning rate scaling from the optimizer True, False
iters_per_epoch unsigned int The number of samples per epoch

Optimizer Config

The optim parameter defines the config for the optimizer in training, including the learning rate, learning scheduler, and weight decay.

Parameter Datatype Default Description Supported Values
lr float 2e-4 The initial learning rate for training the model, excluding the backbone >0.0
momentum float 0.9 The momentum for the AdamW optimizer >0.0
weight_decay float 1e-4 The weight decay coefficient >0.0
lr_scheduler string MultiStep The learning scheduler: MultiStep (decrease the lr by lr_decay from lr_steps) or StepLR (decrease the lr by lr_decay at every lr_step_size) MultiStep/StepLR
gamma float 0.1 The decreasing factor for the learning rate scheduler >0.0
milestones int list [11] The steps to decrease the learning rate for the MultiStep scheduler int list
monitor_name string val_loss The monitor value for the AutoReduce scheduler val_loss/train_loss
type string AdamW The type of optimizer to use during training AdamW/SGD
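
For example, an optim section using the MultiStep scheduler might look like the following sketch (the milestone epochs are illustrative):

optim:
  type: "AdamW"
  lr: 0.0001
  weight_decay: 0.05
  lr_scheduler: "MultiStep"
  milestones: [44, 48]   # epochs at which the learning rate is decreased
  gamma: 0.1             # decreasing factor applied at each milestone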

Evaluation Config

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

Parameter Datatype Default Description Supported Values
checkpoint string Path to PyTorch model to evaluate
trt_engine string Path to the TensorRT model to evaluate. Must only be used with tao deploy
num_gpus unsigned int 1 The number of GPUs to use >0
gpu_ids unsigned int [0] The GPU ids to use
results_dir string /results/evaluate Path to the evaluation results directory

Inference Config

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

Parameter Datatype Default Description Supported Values
checkpoint string Path to the PyTorch model used for inference
trt_engine string Path to the TensorRT model used for inference. Must only be used with tao deploy
num_gpus unsigned int 1 The number of GPUs to use >0
gpu_ids unsigned int [0] The GPU ids to use
results_dir string /results/inference Path to the inference results directory

Export Config

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Parameter Datatype Default Description Supported Values
checkpoint string The path to the PyTorch model to export
onnx_file string The path to the .onnx file
on_cpu bool True If this value is True, the DMHA module will be exported as standard PyTorch. If this value is False, the module will be exported using the TRT Plugin. True, False
opset_version unsigned int 12 The opset version of the exported ONNX >0
input_channel unsigned int 3 The input channel size. Only the value 3 is supported. 3
input_width unsigned int 960 The input width >0
input_height unsigned int 544 The input height >0
batch_size unsigned int -1 The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size. >=-1

Training the Model

To train a Mask2Former model, use this command:

tao model mask2former train [-h] -e <experiment_spec>
                            [results_dir=<global_results_dir>]
                            [model.<model_option>=<model_option_value>]
                            [dataset.<dataset_option>=<dataset_option_value>]
                            [train.<train_option>=<train_option_value>]
                            [train.gpu_ids=<gpu indices>]
                            [train.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.
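
For example, a sketch of overriding a few spec options from the command line might look like this (the spec path and values are placeholders):

tao model mask2former train -e /path/to/spec.yaml \
                            results_dir=/results/mask2former \
                            train.num_epochs=100 \
                            train.num_gpus=2 \
                            train.gpu_ids=[0,1]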

Note

For training, evaluation, and inference, we expose two variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 with gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (in this example, num_gpus is updated to 2).

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. These checkpoints are saved in train.results_dir, like so:

$ ls /results/train
'model_epoch_000.pth'  'model_epoch_001.pth'  'model_epoch_002.pth'  'model_epoch_003.pth'  'model_epoch_004.pth'

The latest checkpoint will also be saved as mask2former_model_latest.pth. Training automatically resumes from mask2former_model_latest.pth, if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.
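
For example, to resume from a specific checkpoint rather than the latest one, you could override the resume path on the command line (the paths shown are placeholders):

tao model mask2former train -e /path/to/spec.yaml \
                            train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth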

The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:

  • Specify a new, empty results directory (Recommended)

  • Remove the latest checkpoint from the results directory

Optimizing Resources for Training Mask2Former

Training Mask2Former on a standard dataset like COCO requires powerful GPUs (for example, V100/A100) with at least 15 GB of VRAM and a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with limited resources.

Optimize GPU Memory

There are various ways to optimize GPU memory usage. A typical option is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. We recommend setting the following configurations to optimize GPU consumption (a combined example follows the list):

  • Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations instead of caching them in memory reduces the memory usage.

  • Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shards the gradient calculation across different processes to help reduce GPU memory usage.

  • Try using a more lightweight backbone, or freeze the backbone by setting train.freeze.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
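
Putting these recommendations together, the train section of a memory-constrained experiment might be sketched as follows (which options you actually need depends on your GPU budget):

train:
  precision: 'fp16'                     # enable automatic mixed precision
  activation_checkpoint: True           # recompute activations in the backward pass
  distributed_strategy: 'ddp_sharded'   # shard gradient calculation across processes
  freeze: ["backbone"]                  # optionally freeze the backbone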

Optimize CPU Memory

To speed up data loading, it is common practice to set a high number of workers to spawn multiple processes. However, this can cause your CPU to run out of memory if the size of your annotation file is very large. If you encounter CPU out-of-memory errors, we recommend lowering the number of data-loading workers (the workers setting in the data config, or num_workers in the individual dataset configs) to reduce CPU consumption.

Evaluating the Model

To run evaluation with a Mask2Former model, use this command:

tao model mask2former evaluate [-h] -e <experiment_spec>
                               evaluate.checkpoint=<model to be evaluated>
                               [evaluate.<evaluate_option>=<evaluate_option_value>]
                               [evaluate.gpu_ids=<gpu indices>]
                               [evaluate.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment.

  • evaluate.checkpoint: The .pth model to be evaluated.

Optional Arguments

Running Inference with a Mask2Former Model

The inference tool for Mask2Former models can be used to visualize bounding boxes and masks. Use this command:

tao model mask2former inference [-h] -e <experiment spec file>
                                inference.checkpoint=<inference model>
                                [inference.<inference_option>=<inference_option_value>]
                                [inference.gpu_ids=<gpu indices>]
                                [inference.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment.

  • inference.checkpoint: The .pth model to run inference on.

Optional Arguments

Exporting the Model

To export a Mask2Former model to ONNX, use this command:

tao model mask2former export [-h] -e <experiment spec file>
                             [results_dir=<results_dir>]
                             export.checkpoint=<model to export>
                             export.onnx_file=<onnx path>

Required Arguments

  • -e, --experiment_spec: The path to an experiment spec file

  • export.checkpoint: The .pth model to export.

  • export.onnx_file: The path where the .etlt or .onnx model is saved.

Optional Arguments

Refer to the Integrating a Mask2Former Model page for more information about deploying a Mask2Former model to DeepStream.
