Mask2former#

Mask2Former supports the following tasks:

train
evaluate
inference
export

Each task is explained in detail in the following sections.

Note

Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.
- For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.
- For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.
The spec format is YAML for TAO Launcher and JSON for FTMS Client.
File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Dataset Format#

Mask2Former supports 3 type of dataloaders corresponding to the semantic, panoptic and instance segmentation tasks.

Each dataloader requires a certain annotation format.

For the semantic segmentation task, each line of the JSONL annotation file encodes the locations of the raw image and the mask groundtruth.

For the panoptic and instance segmentation tasks, the annotation format follows the COCO panoptic and COCO format respectively.

Note

The category ids and annotation ids must be greater than 0.

Creating a Configuration File#

SPECS=$(tao-client mask2former get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Below is a sample Mask2Former spec file. It has six components –model, inference, evaluate, dataset, export, and train–as well as several global parameters, which are described below. The format of the spec file is a YAML file.

Here’s a sample of the Mask2Former spec file:

results_dir: /workspace/mask2former_coco_swint
data:
  contiguous_id: False
  label_map: /tlt3_experiments/mask2former_coco_effvit_b2/colormap.json
  type: 'coco_panoptic'
  train:
    panoptic_json: "/datasets/coco/annotations/panoptic_train2017.json"
    img_dir: "/datasets/coco/train2017"
    panoptic_dir: "/datasets/coco/panoptic_train2017"
    batch_size: 16
    num_workers: 20
  val:
    panoptic_json: "/datasets/coco/annotations/panoptic_val2017.json"
    img_dir: "/datasets/coco/val2017"
    panoptic_dir: "/datasets/coco/panoptic_val2017"
    batch_size: 1
    num_workers: 2
    target_size: [1024, 1024]
  test:
    img_dir: /workspace/test_images/
    batch_size: 1
  augmentation:
    train_min_size: [1024]
    train_max_size: 2560
    train_crop_size: [1024, 1024]
    test_min_size: 1024
    test_max_size: 2560
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 5
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.
  overlap_threshold: 0.8
  mode: "semantic"
  backbone:
    pretrained_weights: "/workspace/mask2former_coco_swint/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 200
inference:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
evaluate:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
export:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
  input_channel: 3
  input_width: 1024
  input_height: 1024
  opset_version: 17

Parameter	Data Type	Default	Description	Supported Values
`model`	dict config	–	The configuration of the model architecture
`dataset`	dict config	–	The configuration of the dataset
`train`	dict config	–	The configuration of the training task
`evaluate`	dict config	–	The configuration of the evaluation task
`inference`	dict config	–	The configuration of the inference task
`encryption_key`	string	None	The encryption key to encrypt and decrypt model files
`results_dir`	string	/results	The directory where experiment results are saved
`export`	dict config	–	The configuration of the ONNX export task

Model Config#

The model configuration (model) defines the Mask2Former model structure. This model is used for training, evaluation, and inference. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT (experimental feature) models.

Field	Description	Data Type and Constraints	Supported Value
backbone	The backbone configuration	Dict
sem_seg_head	The configuration for the segmentation head	Dict
mask_former	The configuration for the mask2former architecture	Dict
mode	The postprocesing mode	string	‘panoptic’, ‘semantic’, ‘instance’
object_mask_threshold	Classification confidence threshold	float	0.4
overlap_threshold	Overlap threshold for panoptic inference	float	0.8
test_topk_per_image	Keep topk instances per image for instance inference	Unsigned int	100

Backbone Config#

The backbone configuration (backbone) defines the backbone structure. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT models.

Field	Description	Data Type and Constraints	Recommended/Typical Value
type	The backbone type	str	“swin”
pretrained_weights	The path to the pretrained backbone model	str
swin	The configuration for the Swin backbones	Dict
efficientvit	The configuration for the EfficientViT backbones	Dict

Swin Config#

The swin configuration (swin) specifies the key parameters in a Swin Transformer backbone.

Field	Description	Data Type and Constraints	Recommended/Typical Value
type	The type of Swin Transformer (from `tiny` to `huge`)	str	“large”
pretrain_img_size	The image size used in pretraining	Unsigned int	384
out_indices	The stages to extract feature maps	List	[0, 1, 2, 3]
out_features	The names of the extracted feature maps	List	[“res2”, “res3”, “res4”, “res5”]

EfficientViT Config#

The efficientvit configuration (efficientvit) specifies the key parameters in a EfficientViT backbone.

Field	Description	Data Type and Constraints	Recommended/Typical Value
name	The name of EfficientViT model (“b0”-“b3”, “l0”-“l3”)	str	“l2”
pretrain_img_size	The image size used in pretraining	Unsigned int	384
out_indices	The stages to extract feature maps	List	[0, 1, 2, 3]
out_features	The names of the extracted feature maps	List	[“res2”, “res3”, “res4”, “res5”]

Data Config#

The data configuration (data) defines the data source, augmentation methods and pre-processing hyperparameters.

Field	Description	Data Type and Constraints	Recommended/Typical Value
pixel_mean	Image mean in RGB order	List	[0.485, 0.456, 0.406]
pixel_std	Image standard deviation in RGB order	List	[0.229, 0.224, 0.225]
augmentation	The augmentation settings	Dict
contiguous_id	Whether to use contiguous ids	bool
label_map	The path to the label mapping file	string
workers	The number of workers to load data for each GPU	Unsigned int
train	The train dataset config	Dict
val	The validation dataset config	Dict
test	The test dataset config	Dict

Augmentation Config#

The augmentation configuration (augmentation) defines the augmentation methods.

Parameter	Datatype	Description	Supported Values
`train_min_size`	int list	A list of sizes to perform random resize for training data	int list
`train_max_size`	unsigned int	The minimum random crop size for training data	>0
`train_crop_size`	int list	The random crop size for training data in [H, W]	int list
`test_min_size`	unsigned int	The minimum resize size for test data	>0
`test_max_size`	unsigned int	The maximum resize size for test data	>0

Dataset Config#

The dataset configuration (dataset) defines the dataset directories, annotation file and batch size for either train, val or test.

Parameter	Datatype	Description
`type`	str	Dataset type (“ade”, “coco”, “coco_panoptic”)
`panoptic_json`	str	JSON file in COCO panoptic format
`img_dir`	str	Image directory (can be relative path to `root_dir`)
`panoptic_dir`	str	Directory of panoptic segmentation annotation images
`root_dir`	str	Root directory to `img_dir`
`annot_file`	str	JSON file in COCO/COCO_panoptic format or JSONL format for image/mask pair
`batch_size`	unsigned int	Batch size
`num_workers`	unsigned int	Number of workers to process the input data

Train Config#

The train configuration defines the hyperparameters of the training process.

train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optim:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

Parameter	Datatype	Default	Description	Supported Values
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed training	>0
`gpu_ids`	List[int]	[0]	The indices of the GPU’s to use for distributed training
`seed`	unsigned int	1234	The random seed for random, NumPy, and torch	>0
`num_epochs`	unsigned int	10	The total number of epochs to run the experiment	>0
`checkpoint_interval`	unsigned int	1	The epoch interval at which the checkpoints are saved	>0
`validation_interval`	unsigned int	1	The epoch interval at which the validation is run	>0
`resume_training_checkpoint_path`	string		The intermediate PyTorch Lightning checkpoint to resume training from
`results_dir`	string	/results/train	The directory to save training results
`optim`	dict config		The config for the optimizer, including the learning rate, learning scheduler, and weight decay	>0
`clip_grad_type`	str	full	The type of gradient clip method
`clip_grad_norm`	float	0.1	amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping	>=0
`precision`	string	fp32	Specifying “fp16” enables precision training. Training with fp16 can help save GPU memory.	fp32, fp16
`distributed_strategy`	string	ddp	The multi-GPU training strategy. DDP (Distributed Data Parallel) and Sharded DDP are supported.	ddp, ddp_sharded
`activation_checkpoint`	bool	True	A True value instructs train to recompute in backward pass to save GPU memory, rather than storing activations.	True, False
`pretrained_model_path`	string		Path to pretrained model checkpoint path to load for finetuning
`num_nodes`	unsigned int	1	The number of nodes. If the value is larger than 1, multi-node is enabled	>0
`freeze`	string list	[]	The list of layer names in the model to freeze. Example [“backbone”, “transformer.encoder”, “input_proj”]
`verbose`	bool	False	Whether to print detailed learning rate scaling from the optimizer	True, False
`iters_per_epoch`	unsigned int		The number of samples per epoch

Optimizer Config#

The optim parameter defines the config for the optimizer in training, including the learning rate, learning scheduler, and weight decay.

Parameter	Datatype	Default	Description	Supported Values
`lr`	float	2e-4	The initial learning rate for training the model, excluding the backbone	>0.0
`momentum`	float	0.9	The momentum for the AdamW optimizer	>0.0
`weight_decay`	float	1e-4	The weight decay coefficient	>0.0
`lr_scheduler`	string	MultiStep	The learning scheduler: * `MultiStep` : Decrease the `lr` by `lr_decay` from `lr_steps` * `StepLR` : Decrease the `lr` by `lr_decay` at every `lr_step_size`	MultiStep/StepLR
`gamma`	float	0.1	The decreasing factor for the learning rate scheduler	>0.0
`milestones`	int list	[11]	The steps to decrease the learning rate for the `MultiStep` scheduler	int list
`monitor_name`	string	val_loss	The monitor value for the `AutoReduce` scheduler	val_loss/train_loss
`type`	string	AdamW	The type of optimizer to use during training	AdamW/SGD

Evaluation Config#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string		Path to PyTorch model to evaluate
`trt_engine`	string		Path to TensorRT model to evaluate. Must be only used with tao deploy
`num_gpus`	unsigned int	1	The number of GPUs to use	>0
`gpu_ids`	unsigned int	[0]	The GPU ids to use
`results_dir`	string	/results/evaluate	Path to the evaluation results directory

Inference Config#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string		Path to PyTorch model to inference
`trt_engine`	string		Path to TensorRT model to inference. Must be only used with tao deploy
`num_gpus`	unsigned int	1	The number of GPUs to use	>0
`gpu_ids`	unsigned int	[0]	The GPU ids to use
`results_dir`	string	/results/inference	Path to the inference results directory

Export Config#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string		The path to the PyTorch model to export
`onnx_file`	string		The path to the `.onnx` file
`on_cpu`	bool	True	If this value is True, the DMHA module will be exported as standard PyTorch. If this value is False, the module will be exported using the TRT Plugin.	True, False
`opset_version`	unsigned int	12	The opset version of the exported ONNX	>0
`input_channel`	unsigned int	3	The input channel size. Only the value 3 is supported.	3
`input_width`	unsigned int	960	The input width	>0
`input_height`	unsigned int	544	The input height	>0
`batch_size`	unsigned int	-1	The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size.	>=-1

Training the Model#

To train a Mask2Former model, use this command:

TRAIN_JOB_ID=$(tao-client mask2former experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former train [-h] -e <experiment_spec>
                    [results_dir=<global_results_dir>]
                    [model.<model_option>=<model_option_value>]
                    [dataset.<dataset_option>=<dataset_option_value>]
                    [train.<train_option>=<train_option_value>]
                    [train.gpu_ids=<gpu indices>]
                    [train.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

-e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.

-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
train.optim.<optim_option>: The optimizer options

Note

For training, evaluation, and inference, we expose 2 variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed, but inconsistent, for example num_gpus = 1, gpu_ids = [0, 1]`, then they are modified to follow the setting with more GPUs, for example num_gpus = 1 -> num_gpus = 2.

In some cases, you may encounter an issue with multi-GPU training resulting in a segmentation fault. You may circumvent this by setting the OMP_NUM_THREADS enviroment variable to 1. Depending upon your model of execution, you may use the following methods to set this variable

CLI Launcher

You may set this env variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file as mentioned in bullet 3 in this section

{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADSR",
            "value": "1"
        }
    ]
}

Docker

You may set environment variables in the docker by setting the -e flag in the docker command line.

docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. These are saved in train.results_dir, like so:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint will also be saved as mask2former_model_latest.pth. Training automatically resumes from mask2former_model_latest.pth, if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:

Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory

Optimizing Resource for Training Mask2Former#

Training Mask2Former requires strong GPUs (for example, V100/A100) with at least 15GB of VRAM and a lot of CPU memory to be trained on a standard dataset like COCO. This section outlines some of the strategies you can use to launch training with only limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. A typical option is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. We recommend setting the following configurations to optimize GPU consumption:

Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.
Set train.activation_checkpoint to True to enable activation checkpointing. By recomputing the activations instead of caching them into memory, the memory usage can be improved.
Set train.distributed_strategy to ddp_sharded to enabled Sharded DDP training. This shares gradient calculation across different processes to help reduce GPU memory.
Try using more lightweight backbones or freeze the backbone through setting train.freeze.
Try changing the augmentation resolution in dataset.augmentation depending on your dataset.

Optimize CPU Memory#

To speed up data loading, it is a common practice to set high number of workers to spawn multiple processes. However, this can cause your CPU memory to become Out of Memory, if the size of your annotation file is very large. We recommend setting the following configurations to optimize CPU consumption.

Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.
Set dataset.augmentation.fixed_padding to True so that images are padded before the batch formulation. Due to random resize and random crop augmentation during training, the resulting image resolution after transform can vary across images. Such variable image resolutions can cause memory leak and the CPU memory to slowly stacks up until it becomes Out of Memory in the middle of training. This is the limitation of PyTorch so we advise setting fixed_padding to True to help stablize the CPU memory usage.

Evaluating the Model#

To run evaluation with a Mask2Former model, use this command:

EVAL_JOB_ID=$(tao-client mask2former experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former evaluate [-h] -e <experiment_spec>
                    evaluate.checkpoint=<model to be evaluated>
                    [evaluate.<evaluate_option>=<evaluate_option_value>]
                    [evaluate.gpu_ids=<gpu indices>]
                    [evaluate.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.
evaluate.checkpoint: The .pth model to be evaluated.

Optional Arguments

The following arguments are optional to run the command.

evaluate.<evaluate_option>: The evaluate options.

Running Inference with Mask2Former Model#

The inference tool for Mask2Former models can be used to visualize bboxes and masks.

INFERENCE_JOB_ID=$(tao-client mask2former experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former inference [-h] -e <experiment spec file>
                    inference.checkpoint=<inference model>
                    [inference.<evaluate_option>=<evaluate_option_value>]
                    [inference.gpu_ids=<gpu indices>]
                    [inference.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

-e, --experiment_spec: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to run inference on.

Optional Arguments

The following arguments are optional to run the command.

inference.<inference_option>: The inference options.

Exporting the Model#

EXPORT_JOB_ID=$(tao-client mask2former experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former export [-h] -e <experiment spec file>
                    [results_dir=<results_dir>]
                    export.checkpoint=<model to export>
                    export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

-e, --experiment_spec: The path to an experiment spec file
export.checkpoint: The .pth model to export.
export.onnx_file: The path where the .etlt or .onnx model is saved.

Optional Arguments

The following arguments are optional to run the command.

export.<export_option>: The export options.

TensorRT Engine Generation and Validation#

For deployment, refer to TAO Deploy documentation for Mask2Former.

Deploying to DeepStream#

Refer to the Integrating a Mask2Former Model page for more information about deploying a Mask2Former model to DeepStream.