OneFormer#

OneFormer supports the following tasks:

  • Train

  • Evaluate

  • Inference

  • Export

The following sections explain each task in detail.

Note

  • The FTMS Client sections of this documentation reference $EXPERIMENT_ID and $DATASET_ID.

    • For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Dataset Format#

OneFormer supports three types of dataloaders, corresponding to the semantic, panoptic, and instance segmentation tasks.

Each dataloader requires a specific annotation format.

For the semantic segmentation task, each line of the JSONL annotation file encodes the paths to the raw image and the ground-truth mask.

For the panoptic and instance segmentation tasks, the annotation formats follow the COCO panoptic and COCO formats, respectively.
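
For illustration only, a semantic-segmentation JSONL line might look like the following. The key names image and label are assumptions for this sketch, not confirmed field names, and the paths are placeholders:

{"image": "/workspace/datasets/my_dataset/images/0001.jpg", "label": "/workspace/datasets/my_dataset/masks/0001.png"}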

Note

The category IDs and annotation IDs must be greater than 0.

Creating a Configuration File#

SPECS=$(tao-client oneformer get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
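
The returned spec is a JSON document that you can modify before launching a job. As a sketch, assuming the jq utility is available (the field names mirror the YAML spec shown later in this section):

SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 50 | .dataset.train.batch_size = 4')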

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

The OneFormer spec file has six components (model, inference, evaluate, dataset, export, and train), as well as several global parameters, which are described below. The spec file is written in YAML format.

Here’s a sample of the OneFormer spec file:

results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin
dataset:
    train:
        images: /workspace/datasets/coco/train2017
        annotations: /workspace/datasets/coco/annotations/panoptic_train2017.json
        panoptic: /workspace/datasets/coco/panoptic_train2017
        batch_size: 4
        num_workers: 4
    val:
        images: /workspace/datasets/coco/val2017
        annotations: /workspace/datasets/coco/annotations/panoptic_val2017.json
        panoptic: /workspace/datasets/coco/panoptic_val2017
        batch_size: 4
        num_workers: 4
    test:
        images: /workspace/datasets/coco/val2017
        annotations: /workspace/datasets/coco/annotations/panoptic_val2017.json
        panoptic: /workspace/datasets/coco/panoptic_val2017
        batch_size: 4
        num_workers: 4
    image_size: 1024
    label_map: /workspace/datasets/coco/label_map.json
    cutmix_prob: 0.0
model:
    backbone:
        name: D2SwinTransformer
        freeze_at: 0
        swin:
            embed_dim: 192
            depths: [2, 2, 18, 2]
            num_heads: [6, 12, 24, 48]
            window_size: 12
            mlp_ratio: 4.0
            patch_size: 4
            patch_norm: true
            ape: false
            pretrain_img_size: 384
            qkv_bias: true
            qk_scale: null
            attn_drop_rate: 0.0
            drop_rate: 0.0
            drop_path_rate: 0.3
            out_features: [res2, res3, res4, res5]
            out_indices: [0, 1, 2, 3]
            use_checkpoint: false
    one_former:
        num_object_queries: 150
    sem_seg_head:
        num_classes: 133
    test:
        test_topk_per_image: 100
        object_mask_threshold: 0.8
train:
    num_epochs: 50
    num_gpus: 8
    num_nodes: 4
    pretrained_model: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin_base/train/model_epoch_006_step_25879.pth
    pretrained_backbone:
    precision: 32
    iters_per_epoch: 15000
evaluate:
    checkpoint: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/train/model_epoch_001_step_01850.pth
    num_gpus: 1
    gpu_ids: [0]
    results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/eval
inference:
    mode: semantic
    results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/inference
    images_dir: /workspace/datasets/coco/val2017
    image_size: [1024, 1024]
    checkpoint: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/train/model_epoch_001_step_01850.pth

The top-level parameters of the spec file are described below:

  • model (dict config): Configuration of the model architecture

  • dataset (dict config): Configuration of the dataset

  • train (dict config): Configuration of the training task

  • evaluate (dict config): Configuration of the evaluation task

  • inference (dict config): Configuration of the inference task

  • encryption_key (string, default None): Encryption key to encrypt and decrypt model files

  • results_dir (string, default /results): Directory where experiment results are saved

  • export (dict config): Configuration of the ONNX export task

Model Config#

The model configuration (model) defines the OneFormer model structure, which is used for training, evaluation, and inference. Its fields are described below. OneFormer currently supports only Swin Transformer and EfficientViT (experimental) backbones.

  • backbone (dict): Backbone configuration

  • one_former (dict): Configuration for the OneFormer architecture

  • sem_seg_head (dict): Configuration for the segmentation head

  • text_encoder (dict): Configuration for the text encoder

  • mode (string): Postprocessing mode. Supported values: "panoptic", "semantic", "instance"

  • object_mask_threshold (float): Classification confidence threshold. Typical value: 0.4

  • overlap_threshold (float): Overlap threshold for panoptic inference. Typical value: 0.8

  • test_topk_per_image (unsigned int): Number of top-k instances to keep per image for instance inference. Typical value: 100

Backbone Configuration#

The backbone configuration (backbone) defines the backbone structure. Its fields are described below. OneFormer currently supports only Swin Transformer and EfficientViT models.

  • type (str): Backbone type. Typical value: "swin"

  • pretrained_weights (str): Path to the pretrained backbone model

  • swin (dict): Configuration for the Swin backbone

Swin Configuration#

The swin configuration (swin) specifies the key parameters in a Swin Transformer backbone.

  • embed_dim (unsigned int): Dimension of the embedding. Typical value: 192

  • depths (list): Number of layers in each stage. Typical value: [2, 2, 18, 2]

  • num_heads (list): Number of attention heads in each stage. Typical value: [6, 12, 24, 48]

  • window_size (unsigned int): Size of the window for local attention. Typical value: 12

  • mlp_ratio (float): Ratio of the MLP hidden dimension to the embedding dimension. Typical value: 4.0

  • patch_size (unsigned int): Size of the patch for the patch embedding. Typical value: 4

  • patch_norm (bool): Whether to normalize the patch embedding. Typical value: True

  • ape (bool): Whether to use absolute positional encoding. Typical value: False

  • qkv_bias (bool): Whether to use bias in the QKV projection. Typical value: True

  • qk_scale (float): Scale factor for the QK projection. Typical value: None

  • attn_drop_rate (float): Dropout rate for the attention. Typical value: 0.0

  • drop_rate (float): Dropout rate for the MLP. Typical value: 0.0

  • drop_path_rate (float): Drop path rate for the MLP. Typical value: 0.3

  • out_features (list): Names of the extracted feature maps. Typical value: ["res2", "res3", "res4", "res5"]

  • out_indices (list): Stages from which to extract feature maps. Typical value: [0, 1, 2, 3]

  • use_checkpoint (bool): Whether to use checkpointing for the transformer. Typical value: False

  • pretrain_img_size (unsigned int): Image size used in pretraining. Typical value: 384

Data Config#

The data configuration (data) defines the data source, augmentation methods, and preprocessing hyperparameters.

  • pixel_mean (list): Image mean in RGB order. Typical value: [123.675, 116.28, 103.53]

  • pixel_std (list): Image standard deviation in RGB order. Typical value: [58.395, 57.12, 57.375]

  • augmentation (dict): Augmentation settings

  • contiguous_id (bool): Whether to use contiguous IDs

  • label_map (string): Path of the label mapping file

  • workers (unsigned int): Number of workers to load data for each GPU

  • train (dict): Train dataset configuration

  • val (dict): Validation dataset configuration

  • test (dict): Test dataset configuration
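
Putting these fields together, a data block might look like the following sketch. This is illustrative only: it assumes these fields sit under the top-level dataset key from the sample spec, and the paths and values are placeholders.

dataset:
    pixel_mean: [123.675, 116.28, 103.53]
    pixel_std: [58.395, 57.12, 57.375]
    contiguous_id: true
    label_map: /workspace/datasets/coco/label_map.json
    workers: 4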

Augmentation Config#

The augmentation configuration (augmentation) defines the augmentation methods.

  • train_min_size (int list): List of sizes to randomly resize training data to

  • train_max_size (unsigned int, >0): Maximum resize size for training data

  • train_crop_size (int list): Random crop size for training data in [H, W]

  • test_min_size (unsigned int, >0): Minimum resize size for test data

  • test_max_size (unsigned int, >0): Maximum resize size for test data
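
A sketch of an augmentation block using these parameters follows. The sizes are illustrative placeholders, not recommended values.

dataset:
    augmentation:
        train_min_size: [640, 800, 1024]
        train_max_size: 2048
        train_crop_size: [1024, 1024]
        test_min_size: 800
        test_max_size: 2048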

Dataset Configuration#

The dataset configuration (dataset) defines the dataset directories, annotation file, and batch size for the train, val, and test splits.

  • images (str): Path of the image directory

  • annotations (str): Path of the annotation file

  • panoptic (str): Path of the panoptic directory

  • batch_size (unsigned int): Batch size

  • num_workers (unsigned int): Number of workers to process the input data

Train Configuration#

The train configuration defines the hyperparameters of the training process.

train:
  precision: "fp16"
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optim:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

  • num_gpus (unsigned int, default 1): Number of GPUs to use for distributed training. Supported values: >0

  • gpu_ids (list[int], default [0]): Indices of GPUs to use for distributed training

  • seed (unsigned int, default 1234): Random seed for random, NumPy, and torch. Supported values: >0

  • num_epochs (unsigned int, default 10): Total number of epochs to run the experiment. Supported values: >0

  • checkpoint_interval (unsigned int, default 1): Epoch interval at which checkpoints are saved. Supported values: >0

  • validation_interval (unsigned int, default 1): Epoch interval at which validation is run. Supported values: >0

  • resume_training_checkpoint_path (string): Intermediate PyTorch Lightning checkpoint from which to resume training

  • results_dir (string, default /results/train): Directory to save training results

  • optim (dict config): Configuration for the optimizer, including the learning rate, learning-rate scheduler, and weight decay

  • clip_grad_type (str, default full): Type of gradient clipping method

  • clip_grad_norm (float, default 0.1): Amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. Supported values: >=0

  • precision (string, default fp32): Setting this to fp16 enables mixed-precision training, which can help save GPU memory. Supported values: fp32, fp16

  • distributed_strategy (string, default ddp): Multi-GPU training strategy. Supported values: ddp (Distributed Data Parallel) and ddp_sharded (Sharded DDP)

  • activation_checkpoint (bool, default True): Whether to recompute activations in the backward pass to save GPU memory, rather than storing them. Supported values: True, False

  • pretrained_model_path (string): Path of the pretrained model checkpoint to load for fine-tuning

  • num_nodes (unsigned int, default 1): Number of nodes. If greater than 1, multi-node training is enabled. Supported values: >0

  • freeze (string list, default []): List of layer names in the model to freeze, for example ["backbone", "transformer.encoder", "input_proj"]

  • verbose (bool, default False): Whether to print detailed learning-rate scaling from the optimizer. Supported values: True, False

  • iters_per_epoch (unsigned int): Number of samples per epoch

Optimizer Configuration#

The optim parameter defines the optimizer configuration used during training, including the learning rate, learning-rate scheduler, and weight decay.

  • lr (float, default 2e-4): Initial learning rate for training the model, excluding the backbone. Supported values: >0.0

  • momentum (float, default 0.9): Momentum for the AdamW optimizer. Supported values: >0.0

  • weight_decay (float, default 1e-4): Weight decay coefficient. Supported values: >0.0

  • lr_scheduler (string, default MultiStep): Learning-rate scheduler. MultiStep decreases the learning rate by a factor of gamma at the specified milestones; StepLR decreases it by the same factor at a fixed step interval. Supported values: MultiStep, StepLR

  • gamma (float, default 0.1): Decay factor for the learning-rate scheduler. Supported values: >0.0

  • milestones (int list, default [11]): Epochs at which to decrease the learning rate for the MultiStep scheduler

  • monitor_name (string, default val_loss): Value monitored by the AutoReduce scheduler. Supported values: val_loss, train_loss

  • type (string, default AdamW): Type of optimizer to use during training. Supported values: AdamW, SGD
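
Combined with the train block shown earlier, an optimizer configuration might look like this sketch (the values are illustrative, not tuned):

train:
    optim:
        type: "AdamW"
        lr: 0.0001
        weight_decay: 0.05
        lr_scheduler: "MultiStep"
        milestones: [30, 45]
        gamma: 0.1
        monitor_name: "val_loss"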

Evaluation Configuration#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

  • checkpoint (string): Path to the PyTorch model to evaluate

  • trt_engine (string): Path to the TensorRT engine to evaluate. Must be used only with tao deploy

  • num_gpus (unsigned int, default 1): Number of GPUs to use. Supported values: >0

  • gpu_ids (list[int], default [0]): GPU IDs to use

  • results_dir (string, default /results/evaluate): Path of the evaluation results directory

Inference Configuration#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

  • checkpoint (string): Path to the PyTorch model to run inference with

  • trt_engine (string): Path to the TensorRT engine to run inference with. Must be used only with tao deploy

  • num_gpus (unsigned int, default 1): Number of GPUs to use. Supported values: >0

  • gpu_ids (list[int], default [0]): GPU IDs to use

  • results_dir (string, default /results/inference): Path of the inference results directory

Export Configuration#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

  • checkpoint (string): Path to the PyTorch model to export

  • onnx_file (string): Path to the output .onnx file

  • on_cpu (bool, default True): If True, the DMHA module is exported as standard PyTorch. If False, the module is exported using the TensorRT plugin. Supported values: True, False

  • opset_version (unsigned int, default 12): Opset version of the exported ONNX model. Supported values: >0

  • input_channel (unsigned int, default 3): Input channel size. The only supported value is 3

  • input_width (unsigned int, default 960): Input width. Supported values: >0

  • input_height (unsigned int, default 544): Input height. Supported values: >0

  • batch_size (int, default -1): Batch size of the ONNX model. If -1, the export uses a dynamic batch size. Supported values: >=-1

Training the Model#

To train a OneFormer model, use this command:

TRAIN_JOB_ID=$(tao-client oneformer experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer train [-h] -e <experiment_spec>
                    [results_dir=<global_results_dir>]
                    [model.<model_option>=<model_option_value>]
                    [dataset.<dataset_option>=<dataset_option_value>]
                    [train.<train_option>=<train_option_value>]
                    [train.gpu_ids=<gpu indices>]
                    [train.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

Optional arguments override option values in the experiment spec file.

Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 and gpu_ids = [0, 1]), they are adjusted to follow the setting that implies more GPUs; in this example, num_gpus is changed from 1 to 2.
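
For example, a consistent two-GPU setting can be passed on the command line as follows (the spec path is a placeholder):

tao model oneformer train -e /path/to/spec.yaml train.num_gpus=2 train.gpu_ids=[0,1]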

In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you can use the following methods to set this variable:

  • CLI Launcher:

    You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as described in the Running the launcher section.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADS",
                "value": "1"
            }
        ]
    }
    
  • Docker:

    You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
        -e OMP_NUM_THREADS=1 \
        -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e
    

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is also saved as oneformer_model_latest.pth.

Training automatically resumes from oneformer_model_latest.pth if it exists in train.results_dir.

oneformer_model_latest.pth is superseded by train.resume_training_checkpoint_path if it is provided.

The major implication of this logic is that, if you want to trigger fresh training from scratch, you must either:

  • Specify a new, empty results directory (recommended), or

  • Remove the latest checkpoint from the results directory.
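
As a sketch with the TAO Launcher (the paths are placeholders), resuming from a specific checkpoint or starting a fresh run looks like this:

# Resume from a specific checkpoint instead of oneformer_model_latest.pth
tao model oneformer train -e /path/to/spec.yaml \
    train.resume_training_checkpoint_path=/results/train/model_epoch_002.pth

# Start fresh by pointing results_dir at a new, empty directory
tao model oneformer train -e /path/to/spec.yaml results_dir=/results/train_fresh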

Optimizing Resources for Training OneFormer#

Training OneFormer on a standard dataset such as COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM and a large amount of CPU memory. This section outlines strategies you can use to launch training with limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. A common approach is to reduce dataset.batch_size. However, this can cause your training to take longer than usual.

We recommend the following configuration settings to optimize GPU memory consumption (a combined sketch follows the list):

  • Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Memory usage can be improved by recomputing the activations instead of caching them in memory.

  • Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shards gradient computation across processes to help reduce GPU memory usage.

  • Try using lighter-weight backbones, or freeze the backbone by setting train.freeze.

  • Try changing the augmentation resolution in dataset.augmentation, depending on your dataset.
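
A minimal sketch combining these recommendations in the spec file (the values are illustrative, not tuned):

train:
    precision: "fp16"
    activation_checkpoint: true
    distributed_strategy: "ddp_sharded"
    freeze: ["backbone"]
dataset:
    train:
        batch_size: 2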

Optimize CPU Memory#

To speed up data loading, it is common practice to use many workers to spawn multiple processes. However, this can cause an Out of Memory condition if the annotation file is very large. We recommend the following configuration settings to optimize CPU memory consumption (a combined sketch follows the list):

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before the batch formulation. Due to random resize and random crop augmentation during training, the resulting image resolution after transform can vary across images. Such variable image resolutions can cause memory leaks, causing an Out of Memory condition in the middle of training. This is the limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
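
A minimal sketch of these settings; the placement of fixed_padding under augmentation follows the parameter names above and may need adjusting for your spec version:

dataset:
    dataset_type: serialized
    augmentation:
        fixed_padding: true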

Evaluating the Model#

To run evaluation with a OneFormer model, use this command:

EVAL_JOB_ID=$(tao-client oneformer experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer evaluate [-h] -e <experiment_spec>
                    evaluate.checkpoint=<model to be evaluated>
                    [evaluate.<evaluate_option>=<evaluate_option_value>]
                    [evaluate.gpu_ids=<gpu indices>]
                    [evaluate.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

  • evaluate.checkpoint: The .pth model to be evaluated

Optional Arguments

Running Inference with the OneFormer Model#

The inference tool for OneFormer models can be used to visualize bounding boxes and masks. Use the following command:

INFERENCE_JOB_ID=$(tao-client oneformer experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer inference [-h] -e <experiment spec file>
                    inference.checkpoint=<inference model>
                    [inference.<inference_option>=<inference_option_value>]
                    [inference.gpu_ids=<gpu indices>]
                    [inference.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

  • inference.checkpoint: The .pth model to run inference on

Optional Arguments

Exporting the Model#

To export a OneFormer model, use this command:

EXPORT_JOB_ID=$(tao-client oneformer experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer export [-h] -e <experiment spec file>
                    [results_dir=<results_dir>]
                    export.checkpoint=<model to export>
                    export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file

  • export.checkpoint: The .pth model to export.

  • export.onnx_file: The path where the .etlt or .onnx model is saved.

Optional Arguments

The following arguments are optional to run the command.

TensorRT Engine Generation and Validation#

For deployment, refer to the TAO Deploy documentation for OneFormer.