Mask2Former#

Mask2Former supports the following tasks:

  • train

  • evaluate

  • inference

  • export

Each task is explained in detail in the following sections.

Note

  • Throughout this documentation are references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Dataset Format#

Mask2Former supports three types of dataloaders, corresponding to the semantic, panoptic, and instance segmentation tasks.

Each dataloader requires a certain annotation format.

For the semantic segmentation task, each line of the JSONL annotation file encodes the locations of the raw image and its ground-truth mask.

For the panoptic and instance segmentation tasks, the annotation format follows the COCO panoptic and COCO formats, respectively.

Note

The category ids and annotation ids must be greater than 0.
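The id constraint above can be checked before launching a job. The following is a minimal Python sketch, assuming an instance-segmentation annotation file with the standard COCO top-level keys `categories` and `annotations`, where each entry carries an integer `id`. The helper name `find_invalid_ids` is hypothetical and is not part of TAO.

```python
import json


def find_invalid_ids(coco: dict) -> dict:
    """Return category and annotation ids that are not greater than 0."""
    bad_categories = [c["id"] for c in coco.get("categories", []) if c["id"] <= 0]
    bad_annotations = [a["id"] for a in coco.get("annotations", []) if a["id"] <= 0]
    return {"categories": bad_categories, "annotations": bad_annotations}


# Tiny in-memory example; in practice, load the annotation file with json.load().
sample = json.loads("""
{
  "categories": [{"id": 0, "name": "background"}, {"id": 1, "name": "person"}],
  "annotations": [{"id": 1, "image_id": 7, "category_id": 1}]
}
""")
report = find_invalid_ids(sample)
print(report)
```

Any id listed in the report would need to be remapped to a positive value before the dataset is used.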

Creating a Configuration File#

BASE_EXPERIMENT_ID=$(tao mask2former list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mask2former get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Below is a sample Mask2Former spec file in YAML format. It has six components (model, inference, evaluate, dataset, export, and train) as well as several global parameters, which are described below.

results_dir: /workspace/mask2former_coco_swint
data:
  contiguous_id: False
  label_map: /tlt3_experiments/mask2former_coco_effvit_b2/colormap.json
  type: 'coco_panoptic'
  train:
    panoptic_json: "/datasets/coco/annotations/panoptic_train2017.json"
    img_dir: "/datasets/coco/train2017"
    panoptic_dir: "/datasets/coco/panoptic_train2017"
    batch_size: 16
    num_workers: 20
  val:
    panoptic_json: "/datasets/coco/annotations/panoptic_val2017.json"
    img_dir: "/datasets/coco/val2017"
    panoptic_dir: "/datasets/coco/panoptic_val2017"
    batch_size: 1
    num_workers: 2
    target_size: [1024, 1024]
  test:
    img_dir: /workspace/test_images/
    batch_size: 1
  augmentation:
    train_min_size: [1024]
    train_max_size: 2560
    train_crop_size: [1024, 1024]
    test_min_size: 1024
    test_max_size: 2560
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 5
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.
  overlap_threshold: 0.8
  mode: "semantic"
  backbone:
    pretrained_weights: "/workspace/mask2former_coco_swint/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 200
inference:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
evaluate:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
export:
  checkpoint: "/workspace/mask2former_coco_swint/train/model_epoch=049.pth"
  input_channel: 3
  input_width: 1024
  input_height: 1024
  opset_version: 17

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture | |
| dataset | dict config | | The configuration of the dataset | |
| train | dict config | | The configuration of the training task | |
| evaluate | dict config | | The configuration of the evaluation task | |
| inference | dict config | | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | | The configuration of the ONNX export task | |

Model Config#

The model configuration (model) defines the Mask2Former model structure. This model is used for training, evaluation, and inference. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT (experimental feature) models.

| Field | Description | Data Type and Constraints | Supported Value |
|---|---|---|---|
| backbone | The backbone configuration | Dict | |
| sem_seg_head | The configuration for the segmentation head | Dict | |
| mask_former | The configuration for the Mask2Former architecture | Dict | |
| mode | The postprocessing mode | string | "panoptic", "semantic", "instance" |
| object_mask_threshold | Classification confidence threshold | float | 0.4 |
| overlap_threshold | Overlap threshold for panoptic inference | float | 0.8 |
| test_topk_per_image | Keep topk instances per image for instance inference | Unsigned int | 100 |

Backbone Config#

The backbone configuration (backbone) defines the backbone structure. A detailed description is included in the table below. Currently, Mask2Former only supports Swin-Transformers and EfficientViT models.

| Field | Description | Data Type and Constraints | Recommended/Typical Value |
|---|---|---|---|
| type | The backbone type | str | "swin" |
| pretrained_weights | The path to the pretrained backbone model | str | |
| swin | The configuration for the Swin backbones | Dict | |
| efficientvit | The configuration for the EfficientViT backbones | Dict | |

Swin Config#

The swin configuration (swin) specifies the key parameters in a Swin Transformer backbone.

| Field | Description | Data Type and Constraints | Recommended/Typical Value |
|---|---|---|---|
| type | The type of Swin Transformer (from tiny to huge) | str | "large" |
| pretrain_img_size | The image size used in pretraining | Unsigned int | 384 |
| out_indices | The stages to extract feature maps | List | [0, 1, 2, 3] |
| out_features | The names of the extracted feature maps | List | ["res2", "res3", "res4", "res5"] |

EfficientViT Config#

The efficientvit configuration (efficientvit) specifies the key parameters in an EfficientViT backbone.

| Field | Description | Data Type and Constraints | Recommended/Typical Value |
|---|---|---|---|
| name | The name of the EfficientViT model ("b0"-"b3", "l0"-"l3") | str | "l2" |
| pretrain_img_size | The image size used in pretraining | Unsigned int | 384 |
| out_indices | The stages to extract feature maps | List | [0, 1, 2, 3] |
| out_features | The names of the extracted feature maps | List | ["res2", "res3", "res4", "res5"] |

Data Config#

The data configuration (data) defines the data source, augmentation methods and pre-processing hyperparameters.

| Field | Description | Data Type and Constraints | Recommended/Typical Value |
|---|---|---|---|
| pixel_mean | Image mean in RGB order | List | [0.485, 0.456, 0.406] |
| pixel_std | Image standard deviation in RGB order | List | [0.229, 0.224, 0.225] |
| augmentation | The augmentation settings | Dict | |
| contiguous_id | Whether to use contiguous ids | bool | |
| label_map | The path to the label mapping file | string | |
| workers | The number of workers to load data for each GPU | Unsigned int | |
| train | The train dataset config | Dict | |
| val | The validation dataset config | Dict | |
| test | The test dataset config | Dict | |

Augmentation Config#

The augmentation configuration (augmentation) defines the augmentation methods.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| train_min_size | int list | A list of sizes to perform random resize for training data | int list |
| train_max_size | unsigned int | The maximum resize size for training data | >0 |
| train_crop_size | int list | The random crop size for training data in [H, W] | int list |
| test_min_size | unsigned int | The minimum resize size for test data | >0 |
| test_max_size | unsigned int | The maximum resize size for test data | >0 |
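How the minimum and maximum sizes interact is not spelled out here. The sketch below assumes the shortest-edge convention common to Detectron2-style pipelines: the shorter image side is scaled to the chosen minimum size, and the result is shrunk again if the longer side would exceed the maximum. This is an assumption about the behavior rather than a description of the TAO implementation, and `resized_shape` is a hypothetical helper.

```python
import random


def resized_shape(height, width, min_size, max_size):
    """Scale so the shorter side equals min_size, capping the longer side at max_size."""
    scale = min_size / min(height, width)
    new_h, new_w = height * scale, width * scale
    longest = max(new_h, new_w)
    if longest > max_size:
        # Shrink again so the longer side does not exceed max_size.
        cap = max_size / longest
        new_h, new_w = new_h * cap, new_w * cap
    return int(new_h + 0.5), int(new_w + 0.5)


# Training: pick one entry of train_min_size at random, then resize.
train_min_size = [1024]
train_max_size = 2560
chosen = random.choice(train_min_size)
print(resized_shape(480, 640, chosen, train_max_size))
```

Under this assumption, a very wide image is limited by the maximum size rather than the minimum size, which is why the two values are usually set together.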

Dataset Config#

The dataset configuration (dataset) defines the dataset directories, annotation file and batch size for either train, val or test.

| Parameter | Datatype | Description |
|---|---|---|
| type | str | Dataset type ("ade", "coco", "coco_panoptic") |
| panoptic_json | str | JSON file in COCO panoptic format |
| img_dir | str | Image directory (can be relative path to root_dir) |
| panoptic_dir | str | Directory of panoptic segmentation annotation images |
| root_dir | str | Root directory to img_dir |
| annot_file | str | JSON file in COCO/COCO_panoptic format or JSONL format for image/mask pair |
| batch_size | unsigned int | Batch size |
| num_workers | unsigned int | Number of workers to process the input data |

Train Config#

The train configuration defines the hyperparameters of the training process.

train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optim:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The config for the optimizer, including the learning rate, learning scheduler, and weight decay | |
| clip_grad_type | str | full | The type of gradient clip method | |
| clip_grad_norm | float | 0.1 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping | >=0 |
| precision | string | fp32 | Specifying "fp16" enables mixed-precision training. Training with fp16 can help save GPU memory. | fp32, fp16 |
| distributed_strategy | string | ddp | The multi-GPU training strategy. DDP (Distributed Data Parallel) and Sharded DDP are supported. | ddp, ddp_sharded |
| activation_checkpoint | bool | True | A True value instructs train to recompute activations in the backward pass to save GPU memory, rather than storing them. | True, False |
| pretrained_model_path | string | | The path to the pretrained model checkpoint to load for fine-tuning | |
| num_nodes | unsigned int | 1 | The number of nodes. If the value is larger than 1, multi-node training is enabled | >0 |
| freeze | string list | [] | The list of layer names in the model to freeze. Example: ["backbone", "transformer.encoder", "input_proj"] | |
| verbose | bool | False | Whether to print detailed learning rate scaling from the optimizer | True, False |
| iters_per_epoch | unsigned int | | The number of samples per epoch | |

Optimizer Config#

The optim parameter defines the config for the optimizer in training, including the learning rate, learning scheduler, and weight decay.

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| lr | float | 2e-4 | The initial learning rate for training the model, excluding the backbone | >0.0 |
| momentum | float | 0.9 | The momentum for the AdamW optimizer | >0.0 |
| weight_decay | float | 1e-4 | The weight decay coefficient | >0.0 |
| lr_scheduler | string | MultiStep | The learning rate scheduler. MultiStep: decrease the lr by gamma at each step listed in milestones. StepLR: decrease the lr by gamma at every lr_step_size. | MultiStep/StepLR |
| gamma | float | 0.1 | The decreasing factor for the learning rate scheduler | >0.0 |
| milestones | int list | [11] | The steps to decrease the learning rate for the MultiStep scheduler | int list |
| monitor_name | string | val_loss | The monitor value for the AutoReduce scheduler | val_loss/train_loss |
| type | string | AdamW | The type of optimizer to use during training | AdamW/SGD |
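To make the interaction of lr, gamma, and milestones concrete, here is a minimal Python sketch of a MultiStep schedule. It assumes the milestones are epoch indices and that the schedule matches the usual multi-step semantics (multiply by gamma once for every milestone already reached). The function `multistep_lr` is hypothetical and only illustrates the arithmetic.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at `epoch`: base_lr times gamma once per milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Values taken from the sample spec: lr 0.0001, milestones [44, 48], default gamma 0.1.
for epoch in (0, 43, 44, 47, 48, 49):
    print(epoch, multistep_lr(1e-4, [44, 48], 0.1, epoch))
```

With these values the rate stays at the initial value until epoch 43, drops by a factor of ten at epoch 44, and drops by another factor of ten at epoch 48.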

Evaluation Config#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to evaluate | |
| trt_engine | string | | The path to the TensorRT model to evaluate. Use only with tao deploy | |
| num_gpus | unsigned int | 1 | The number of GPUs to use | >0 |
| gpu_ids | List[int] | [0] | The GPU ids to use | |
| results_dir | string | /results/evaluate | The path to the evaluation results directory | |

Inference Config#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to run inference with | |
| trt_engine | string | | The path to the TensorRT model to run inference with. Use only with tao deploy | |
| num_gpus | unsigned int | 1 | The number of GPUs to use | >0 |
| gpu_ids | List[int] | [0] | The GPU ids to use | |
| results_dir | string | /results/inference | The path to the inference results directory | |

Export Config#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to export | |
| onnx_file | string | | The path to the .onnx file | |
| on_cpu | bool | True | If this value is True, the DMHA module will be exported as standard PyTorch. If this value is False, the module will be exported using the TRT Plugin. | True, False |
| opset_version | unsigned int | 12 | The opset version of the exported ONNX | >0 |
| input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3 |
| input_width | unsigned int | 960 | The input width | >0 |
| input_height | unsigned int | 544 | The input height | >0 |
| batch_size | int | -1 | The batch size of the ONNX model. If this value is set to -1, the export uses dynamic batch size. | >=-1 |

Training the Model#

To train a Mask2Former model, use this command:

TRAIN_JOB_ID=$(tao mask2former create-job \
  --kind experiment \
  --name "mask2former_train" \
  --action train \
  --workspace-id $WORKSPACE_ID \
  --specs "$TRAIN_SPECS" \
  --train-datasets '["'$DATASET_ID'"]' \
  --eval-dataset "$DATASET_ID" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former train [-h] -e <experiment_spec>
                    [results_dir=<global_results_dir>]
                    [model.<model_option>=<model_option_value>]
                    [dataset.<dataset_option>=<dataset_option_value>]
                    [train.<train_option>=<train_option_value>]
                    [train.gpu_ids=<gpu indices>]
                    [train.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.
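The dotted override syntax maps each command-line argument onto a nested field of the spec. The Python sketch below illustrates that mapping only; it is not the launcher's actual parser, and the helper `apply_override` and its value parsing (JSON literal if possible, plain string otherwise) are assumptions made for the example.

```python
import copy
import json


def apply_override(spec, override):
    """Return a copy of `spec` with one `dotted.key=value` override applied."""
    dotted_key, _, raw_value = override.partition("=")
    try:
        value = json.loads(raw_value)  # numbers, lists, true/false
    except ValueError:
        value = raw_value  # anything else stays a plain string, e.g. fp16
    *parents, leaf = dotted_key.split(".")
    result = copy.deepcopy(spec)
    node = result
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way down
    node[leaf] = value
    return result


spec = {"train": {"num_epochs": 10, "optim": {"lr": 0.0002}}}
updated = apply_override(spec, "train.num_epochs=50")
updated = apply_override(updated, "train.optim.lr=0.0001")
print(updated)
```

The original spec is left untouched, which mirrors the idea that overrides apply to a single run and do not modify the spec file.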

Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed, but are inconsistent, for example num_gpus = 1, gpu_ids = [0, 1], then they are modified to follow the setting that implies more GPUs; in the same example num_gpus is modified from 1 to 2.
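The reconciliation rule can be sketched in Python as follows. The case from the text (num_gpus = 1, gpu_ids = [0, 1]) is covered directly; the opposite case, where num_gpus is larger than the list, is not described in the text, so extending gpu_ids to the first num_gpus indices is an assumption. The name `reconcile_gpus` is hypothetical.

```python
def reconcile_gpus(num_gpus=1, gpu_ids=None):
    """Make num_gpus and gpu_ids agree by following whichever implies more GPUs."""
    gpu_ids = [0] if gpu_ids is None else list(gpu_ids)
    if num_gpus > len(gpu_ids):
        # Assumption: extend to the first num_gpus device indices.
        gpu_ids = list(range(num_gpus))
    else:
        # Documented case: the longer id list wins and num_gpus is raised.
        num_gpus = len(gpu_ids)
    return num_gpus, gpu_ids


print(reconcile_gpus(1, [0, 1]))
```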

In some cases multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you can use one of the following methods to set this variable:

  • CLI Launcher:

    Set the environment variable by adding the following entry to the Envs field of your ~/.tao_mounts.json file, as described in bullet 3 of the section Running the launcher.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADS",
                "value": "1"
            }
        ]
    }

  • Docker:

    You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint will also be saved as mask2former_model_latest.pth. Training automatically resumes from mask2former_model_latest.pth, if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, if you wish to trigger fresh training from scratch, you must do one of the following:

  • Specify a new, empty results directory (Recommended)

  • Remove the latest checkpoint from the results directory
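The resume logic described above can be summarized with a short Python sketch. It only mirrors the precedence stated in the text (an explicit train.resume_training_checkpoint_path first, then mask2former_model_latest.pth in the results directory, otherwise a fresh run); the function `pick_resume_checkpoint` is hypothetical and is not how TAO exposes this behavior.

```python
import os
import tempfile

LATEST_NAME = "mask2former_model_latest.pth"


def pick_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint training would resume from, or None for a fresh run."""
    if resume_training_checkpoint_path:
        # An explicit path supersedes the automatically detected latest checkpoint.
        return resume_training_checkpoint_path
    latest = os.path.join(results_dir, LATEST_NAME)
    return latest if os.path.exists(latest) else None


with tempfile.TemporaryDirectory() as results_dir:
    print(pick_resume_checkpoint(results_dir))  # empty directory: fresh training
    open(os.path.join(results_dir, LATEST_NAME), "w").close()
    print(pick_resume_checkpoint(results_dir))  # latest checkpoint is picked up
```

This is why pointing train.results_dir at a new, empty directory is the simplest way to guarantee a fresh run.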

Optimizing Resources for Training Mask2Former#

Training Mask2Former on a standard dataset like COCO requires strong GPUs (for example, V100/A100) with at least 15GB of VRAM and a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with only limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. A typical option is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. We recommend setting the following configurations to optimize GPU consumption:

  • Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Recomputing the activations in the backward pass instead of caching them reduces memory usage.

  • Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shares gradient calculation across different processes to help reduce GPU memory.

  • Try using more lightweight backbones or freeze the backbone through setting train.freeze.

  • Try changing the augmentation resolution in dataset.augmentation depending on your dataset.

Optimize CPU Memory#

To speed up data loading, it is common practice to set a high number of workers to spawn multiple processes. However, this can cause the CPU to run out of memory if the annotation file is very large. We recommend setting the following configurations to optimize CPU memory consumption.

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before the batch is formed. Because of the random resize and random crop augmentations during training, the image resolution after the transform can vary across images. Such variable resolutions can cause a memory leak in which CPU memory usage slowly grows until the process runs out of memory in the middle of training. This is a limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
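The idea behind padding before batch formation can be illustrated with a small Python sketch. It assumes padding on the bottom and right with a constant fill value, which is a guess at what fixed_padding does rather than a description of the TAO code; `pad_to_fixed_size` is a hypothetical helper and images are plain nested lists to keep the example dependency-free.

```python
def pad_to_fixed_size(image, target_h, target_w, fill=0):
    """Pad a 2-D image (list of rows) on the bottom and right to target_h x target_w."""
    height, width = len(image), len(image[0])
    if height > target_h or width > target_w:
        raise ValueError("image is larger than the target size")
    padded = [row + [fill] * (target_w - width) for row in image]
    padded += [[fill] * target_w for _ in range(target_h - height)]
    return padded


# Two images with different shapes end up with one common shape.
batch = [
    pad_to_fixed_size([[1, 2], [3, 4]], 3, 4),
    pad_to_fixed_size([[5, 6, 7]], 3, 4),
]
shapes = {(len(img), len(img[0])) for img in batch}
print(shapes)
```

Because every sample then has the same shape, batch buffers of a single size can be reused from step to step instead of being reallocated for each new resolution.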

Evaluating the Model#

To run evaluation with a Mask2Former model, use this command:

EVAL_JOB_ID=$(tao mask2former create-job \
  --kind experiment \
  --name "mask2former_evaluate" \
  --action evaluate \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --eval-dataset "$DATASET_ID" \
  --specs "$EVALUATE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former evaluate [-h] -e <experiment_spec>
                    evaluate.checkpoint=<model to be evaluated>
                    [evaluate.<evaluate_option>=<evaluate_option_value>]
                    [evaluate.gpu_ids=<gpu indices>]
                    [evaluate.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment.

  • evaluate.checkpoint: The .pth model to be evaluated.

Optional Arguments

The following arguments are optional to run the command.

Running Inference with Mask2Former Model#

The inference tool for Mask2Former models can be used to visualize bboxes and masks.

INFERENCE_JOB_ID=$(tao mask2former create-job \
  --kind experiment \
  --name "mask2former_inference" \
  --action inference \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --inference-dataset "$DATASET_ID" \
  --specs "$INFERENCE_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former inference [-h] -e <experiment spec file>
                    inference.checkpoint=<inference model>
                    [inference.<inference_option>=<inference_option_value>]
                    [inference.gpu_ids=<gpu indices>]
                    [inference.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment.

  • inference.checkpoint: The .pth model to run inference on.

Optional Arguments

The following arguments are optional to run the command.

Exporting the Model#

EXPORT_JOB_ID=$(tao mask2former create-job \
  --kind experiment \
  --name "mask2former_export" \
  --action export \
  --workspace-id $WORKSPACE_ID \
  --parent-job-id $TRAIN_JOB_ID \
  --specs "$EXPORT_SPECS" \
  --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
  --encryption-key "nvidia_tlt" | jq -r '.id')

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model mask2former export [-h] -e <experiment spec file>
                    [results_dir=<results_dir>]
                    export.checkpoint=<model to export>
                    export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file

  • export.checkpoint: The .pth model to export.

  • export.onnx_file: The path where the .etlt or .onnx model is saved.

Optional Arguments

The following arguments are optional to run the command.

TensorRT Engine Generation and Validation#

For deployment, refer to the TAO Deploy documentation for Mask2Former.

Deploying to DeepStream#

Refer to the Integrating a Mask2Former Model page for more information about deploying a Mask2Former model to DeepStream.