Date: 2025-07-18

To train a Mask2Former model, use this command:

tao model mask2former train [-h] -e <experiment_spec> [results_dir=<global_results_dir>] [model.<model_option>=<model_option_value>] [dataset.<dataset_option>=<dataset_option_value>] [train.<train_option>=<train_option_value>] [train.gpu_ids=<gpu indices>] [train.num_gpus=<number of gpus>]

-e, --experiment_spec: The experiment specification file to set up the training experiment.

You can set optional arguments to override the option values in the experiment spec file.

Note: For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 and gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (in this example, num_gpus = 1 becomes num_gpus = 2).
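The reconciliation rule in the note can be sketched in a few lines of Python. This is an illustration only, not the TAO implementation: the function name reconcile_gpus is hypothetical, and the branch where num_gpus is larger than the gpu_ids list (filled here with device indices 0 to num_gpus - 1) is an assumption, since the note only documents the opposite case.

```python
def reconcile_gpus(num_gpus=1, gpu_ids=None):
    """Return a consistent (num_gpus, gpu_ids) pair, following the setting with more GPUs."""
    # Defaults from the note: num_gpus = 1 and gpu_ids = [0].
    gpu_ids = [0] if gpu_ids is None else list(gpu_ids)
    if num_gpus == len(gpu_ids):
        return num_gpus, gpu_ids
    if len(gpu_ids) > num_gpus:
        # Documented case: gpu_ids names more devices, so num_gpus is raised to match.
        return len(gpu_ids), gpu_ids
    # Assumed case (not stated in the note): num_gpus is larger,
    # so use device indices 0 .. num_gpus - 1.
    return num_gpus, list(range(num_gpus))


print(reconcile_gpus(1, [0, 1]))  # (2, [0, 1])
```

With the example from the note, num_gpus = 1 and gpu_ids = [0, 1] resolve to two GPUs.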

At every train.checkpoint_interval, a PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved. Checkpoints are saved in train.results_dir, like so:

$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is also saved as mask2former_model_latest.pth. Training automatically resumes from mask2former_model_latest.pth if it exists in train.results_dir. If train.resume_training_checkpoint_path is provided, it takes precedence.

The major implication of this logic is that, to trigger fresh training from scratch, you must do one of the following:

Specify a new, empty results directory (Recommended)

Remove the latest checkpoint from the results directory
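The resume precedence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TAO source: the helper name resolve_resume_checkpoint and the path /checkpoints/custom.pth are hypothetical, and a temporary directory stands in for train.results_dir.

```python
import os
import tempfile

LATEST_NAME = "mask2former_model_latest.pth"


def resolve_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint to resume from, or None to train from scratch."""
    # An explicitly provided path takes precedence over the automatic lookup.
    if resume_training_checkpoint_path:
        return resume_training_checkpoint_path
    latest = os.path.join(results_dir, LATEST_NAME)
    return latest if os.path.isfile(latest) else None


with tempfile.TemporaryDirectory() as results_dir:
    # New, empty results directory: nothing to resume, so training starts fresh.
    fresh = resolve_resume_checkpoint(results_dir)
    # Create an empty placeholder file standing in for the latest checkpoint.
    open(os.path.join(results_dir, LATEST_NAME), "wb").close()
    resumed = resolve_resume_checkpoint(results_dir)
    # Hypothetical explicit path; it wins even though a latest checkpoint exists.
    explicit = resolve_resume_checkpoint(results_dir, "/checkpoints/custom.pth")

print(fresh, os.path.basename(resumed), explicit)
```

The first call shows why a new, empty results directory is the simplest way to force fresh training: with no latest checkpoint present and no explicit path, there is nothing to resume from.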

Training Mask2Former on a standard dataset like COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM, as well as a large amount of CPU memory. This section outlines some of the strategies you can use to launch training with limited resources.

There are various ways to optimize GPU memory usage. A typical option is to reduce dataset.batch_size. However, this can cause your training to take longer than usual. We recommend setting the following configurations to optimize GPU memory consumption:

Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

Set train.activation_checkpoint to True to enable activation checkpointing. Activations are recomputed during the backward pass instead of being cached in memory, which reduces memory usage at the cost of extra computation.

Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shares gradient calculation across different processes to help reduce GPU memory usage.

Try using a more lightweight backbone, or freeze the backbone by setting train.freeze.

Try changing the augmentation resolution in dataset.augmentation depending on your dataset.
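As a rough sketch, the memory-related settings above can be collected in one place and rendered as command-line overrides in the key=value form shown in the train command. The helper flatten_overrides, the spec path /path/to/spec.yaml, and the batch size of 1 are illustrative assumptions; train.freeze and dataset.augmentation are left out because their values depend on your model and dataset.

```python
def flatten_overrides(options, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    items = []
    for key, value in options.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as "train" or "dataset".
            items.extend(flatten_overrides(value, dotted))
        else:
            items.append(f"{dotted}={value}")
    return items


memory_saving = {
    "train": {
        "precision": "fp16",                    # automatic mixed precision
        "activation_checkpoint": True,          # recompute activations
        "distributed_strategy": "ddp_sharded",  # Sharded DDP
    },
    # Hypothetical value; choose a batch size that fits your GPU.
    "dataset": {"batch_size": 1},
}

# The spec path is a placeholder, not a real file.
command = ["tao", "model", "mask2former", "train", "-e", "/path/to/spec.yaml"]
command += flatten_overrides(memory_saving)
print(" ".join(command))
```

The same values can instead be written directly into the experiment spec file; passing them as overrides is convenient when comparing memory usage across a few runs.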

To speed up data loading, it is common practice to set a high number of workers to spawn multiple processes. However, this can cause the CPU to run out of memory if your annotation file is very large. We recommend setting the following configurations to optimize CPU memory consumption.