OneFormer#

OneFormer supports the following tasks:

  • Train

  • Evaluate

  • Inference

  • Export

The following sections explain each task in detail.

Note

  • The FTMS Client sections of this documentation reference $EXPERIMENT_ID and $DATASET_ID.

    • For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Dataset Format#

OneFormer supports three types of dataloaders, corresponding to the semantic, panoptic, and instance segmentation tasks.

Each dataloader requires a specific annotation format.

For the semantic segmentation task, each line of the JSONL annotation file encodes the paths to the raw image and the ground-truth mask.

For the panoptic and instance segmentation tasks, the annotation formats follow the COCO panoptic and COCO formats, respectively.
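
For illustration only, a semantic-segmentation JSONL line might look like the following. The key names image and label are assumptions for this sketch, not confirmed field names, and the paths are placeholders:

{"image": "/workspace/datasets/my_dataset/images/0001.jpg", "label": "/workspace/datasets/my_dataset/masks/0001.png"}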

Note

The category IDs and annotation IDs must be greater than 0.

Creating a Configuration File#

SPECS=$(tao-client oneformer get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
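
The returned spec is a JSON document that you can modify before launching a job. As a sketch, assuming the jq utility is available (the field names mirror the YAML spec shown later in this section):

SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 50 | .dataset.train.batch_size = 4')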

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

The OneFormer spec file has six components (model, inference, evaluate, dataset, export, and train), as well as several global parameters, which are described below. The spec file is written in YAML format.

Here’s a sample of the OneFormer spec file:

results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin
dataset:
    train:
        images: /workspace/datasets/coco/train2017
        annotations: /workspace/datasets/coco/annotations/panoptic_train2017.json
        panoptic: /workspace/datasets/coco/panoptic_train2017
        batch_size: 4
        num_workers: 4
    val:
        images: /workspace/datasets/coco/val2017
        annotations: /workspace/datasets/coco/annotations/panoptic_val2017.json
        panoptic: /workspace/datasets/coco/panoptic_val2017
        batch_size: 4
        num_workers: 4
    test:
        images: /workspace/datasets/coco/val2017
        annotations: /workspace/datasets/coco/annotations/panoptic_val2017.json
        panoptic: /workspace/datasets/coco/panoptic_val2017
        batch_size: 4
        num_workers: 4
    image_size: 1024
    label_map: /workspace/datasets/coco/label_map.json
    cutmix_prob: 0.0
model:
    backbone:
        name: D2SwinTransformer
        freeze_at: 0
        swin:
            embed_dim: 192
            depths: [2, 2, 18, 2]
            num_heads: [6, 12, 24, 48]
            window_size: 12
            mlp_ratio: 4.0
            patch_size: 4
            patch_norm: true
            ape: false
            pretrain_img_size: 384
            qkv_bias: true
            qk_scale: null
            attn_drop_rate: 0.0
            drop_rate: 0.0
            drop_path_rate: 0.3
            out_features: [res2, res3, res4, res5]
            out_indices: [0, 1, 2, 3]
            use_checkpoint: false
    one_former:
        num_object_queries: 150
    sem_seg_head:
        num_classes: 133
    test:
        test_topk_per_image: 100
        object_mask_threshold: 0.8
train:
    num_epochs: 50
    num_gpus: 8
    num_nodes: 4
    pretrained_model: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin_base/train/model_epoch_006_step_25879.pth
    pretrained_backbone:
    precision: 32
    iters_per_epoch: 15000
evaluate:
    checkpoint: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/train/model_epoch_001_step_01850.pth
    num_gpus: 1
    gpu_ids: [0]
    results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/eval
inference:
    mode: semantic
    results_dir: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/inference
    images_dir: /workspace/datasets/coco/val2017
    image_size: [1024, 1024]
    checkpoint: nvidia_tao_pytorch/cv/oneformer/checkpoints/coco/swin/train/model_epoch_001_step_01850.pth

The top-level parameters of the spec file are described below:

  • model (dict config): Configuration of the model architecture

  • dataset (dict config): Configuration of the dataset

  • train (dict config): Configuration of the training task

  • evaluate (dict config): Configuration of the evaluation task

  • inference (dict config): Configuration of the inference task

  • encryption_key (string, default None): Encryption key to encrypt and decrypt model files

  • results_dir (string, default /results): Directory where experiment results are saved

  • export (dict config): Configuration of the ONNX export task

Model Config#

The model configuration (model) defines the OneFormer model structure, which is used for training, evaluation, and inference. Its fields are described below. OneFormer currently supports only Swin Transformer and EfficientViT (experimental) backbones.

  • backbone (dict): Backbone configuration

  • one_former (dict): Configuration for the OneFormer architecture

  • sem_seg_head (dict): Configuration for the segmentation head

  • text_encoder (dict): Configuration for the text encoder

  • mode (string): Postprocessing mode. Supported values: "panoptic", "semantic", "instance"

  • object_mask_threshold (float): Classification confidence threshold. Typical value: 0.4

  • overlap_threshold (float): Overlap threshold for panoptic inference. Typical value: 0.8

  • test_topk_per_image (unsigned int): Number of top-k instances to keep per image for instance inference. Typical value: 100

Backbone Configuration#

The backbone configuration (backbone) defines the backbone structure. Its fields are described below. OneFormer currently supports only Swin Transformer and EfficientViT models.

  • type (str): Backbone type. Typical value: "swin"

  • pretrained_weights (str): Path to the pretrained backbone model

  • swin (dict): Configuration for the Swin backbone

Swin Configuration#

The swin configuration (swin) specifies the key parameters in a Swin Transformer backbone.

  • embed_dim (unsigned int): Dimension of the embedding. Typical value: 192

  • depths (list): Number of layers in each stage. Typical value: [2, 2, 18, 2]

  • num_heads (list): Number of attention heads in each stage. Typical value: [6, 12, 24, 48]

  • window_size (unsigned int): Size of the window for local attention. Typical value: 12

  • mlp_ratio (float): Ratio of the MLP hidden dimension to the embedding dimension. Typical value: 4.0

  • patch_size (unsigned int): Size of the patch for the patch embedding. Typical value: 4

  • patch_norm (bool): Whether to normalize the patch embedding. Typical value: True

  • ape (bool): Whether to use absolute positional encoding. Typical value: False

  • qkv_bias (bool): Whether to use bias in the QKV projection. Typical value: True

  • qk_scale (float): Scale factor for the QK projection. Typical value: None

  • attn_drop_rate (float): Dropout rate for the attention. Typical value: 0.0

  • drop_rate (float): Dropout rate for the MLP. Typical value: 0.0

  • drop_path_rate (float): Drop path rate for the MLP. Typical value: 0.3

  • out_features (list): Names of the extracted feature maps. Typical value: ["res2", "res3", "res4", "res5"]

  • out_indices (list): Stages from which to extract feature maps. Typical value: [0, 1, 2, 3]

  • use_checkpoint (bool): Whether to use checkpointing for the transformer. Typical value: False

  • pretrain_img_size (unsigned int): Image size used in pretraining. Typical value: 384

Data Config#

The data configuration (data) defines the data source, augmentation methods, and preprocessing hyperparameters.

  • pixel_mean (list): Image mean in RGB order. Typical value: [123.675, 116.28, 103.53]

  • pixel_std (list): Image standard deviation in RGB order. Typical value: [58.395, 57.12, 57.375]

  • augmentation (dict): Augmentation settings

  • contiguous_id (bool): Whether to use contiguous IDs

  • label_map (string): Path of the label mapping file

  • workers (unsigned int): Number of workers to load data for each GPU

  • train (dict): Train dataset configuration

  • val (dict): Validation dataset configuration

  • test (dict): Test dataset configuration
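
Putting these fields together, a data block might look like the following sketch. This is illustrative only: it assumes these fields sit under the top-level dataset key from the sample spec, and the paths and values are placeholders.

dataset:
    pixel_mean: [123.675, 116.28, 103.53]
    pixel_std: [58.395, 57.12, 57.375]
    contiguous_id: true
    label_map: /workspace/datasets/coco/label_map.json
    workers: 4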

Augmentation Config#

The augmentation configuration (augmentation) defines the augmentation methods.

  • train_min_size (int list): List of sizes to randomly resize training data to

  • train_max_size (unsigned int, >0): Maximum resize size for training data

  • train_crop_size (int list): Random crop size for training data in [H, W]

  • test_min_size (unsigned int, >0): Minimum resize size for test data

  • test_max_size (unsigned int, >0): Maximum resize size for test data
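
A sketch of an augmentation block using these parameters follows. The sizes are illustrative placeholders, not recommended values.

dataset:
    augmentation:
        train_min_size: [640, 800, 1024]
        train_max_size: 2048
        train_crop_size: [1024, 1024]
        test_min_size: 800
        test_max_size: 2048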

Dataset Configuration#

The dataset configuration (dataset) defines the dataset directories, annotation file, and batch size for the train, val, and test splits.

  • images (str): Path of the image directory

  • annotations (str): Path of the annotation file

  • panoptic (str): Path of the panoptic directory

  • batch_size (unsigned int): Batch size

  • num_workers (unsigned int): Number of workers to process the input data

Train Configuration#

The train configuration defines the hyperparameters of the training process.

train:
  precision: "fp16"
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optim:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

  • num_gpus (unsigned int, default 1): Number of GPUs to use for distributed training. Supported values: >0

  • gpu_ids (list[int], default [0]): Indices of GPUs to use for distributed training

  • seed (unsigned int, default 1234): Random seed for random, NumPy, and torch. Supported values: >0

  • num_epochs (unsigned int, default 10): Total number of epochs to run the experiment. Supported values: >0

  • checkpoint_interval (unsigned int, default 1): Epoch interval at which checkpoints are saved. Supported values: >0

  • validation_interval (unsigned int, default 1): Epoch interval at which validation is run. Supported values: >0

  • resume_training_checkpoint_path (string): Intermediate PyTorch Lightning checkpoint from which to resume training

  • results_dir (string, default /results/train): Directory to save training results

  • optim (dict config): Configuration for the optimizer, including the learning rate, learning-rate scheduler, and weight decay

  • clip_grad_type (str, default full): Type of gradient clipping method

  • clip_grad_norm (float, default 0.1): Amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. Supported values: >=0

  • precision (string, default fp32): Setting this to fp16 enables mixed-precision training, which can help save GPU memory. Supported values: fp32, fp16

  • distributed_strategy (string, default ddp): Multi-GPU training strategy. Supported values: ddp (Distributed Data Parallel) and ddp_sharded (Sharded DDP)

  • activation_checkpoint (bool, default True): Whether to recompute activations in the backward pass to save GPU memory, rather than storing them. Supported values: True, False

  • pretrained_model_path (string): Path of the pretrained model checkpoint to load for fine-tuning

  • num_nodes (unsigned int, default 1): Number of nodes. If greater than 1, multi-node training is enabled. Supported values: >0

  • freeze (string list, default []): List of layer names in the model to freeze, for example ["backbone", "transformer.encoder", "input_proj"]

  • verbose (bool, default False): Whether to print detailed learning-rate scaling from the optimizer. Supported values: True, False

  • iters_per_epoch (unsigned int): Number of samples per epoch

Optimizer Configuration#

The optim parameter defines the optimizer configuration used during training, including the learning rate, learning-rate scheduler, and weight decay.

  • lr (float, default 2e-4): Initial learning rate for training the model, excluding the backbone. Supported values: >0.0

  • momentum (float, default 0.9): Momentum for the AdamW optimizer. Supported values: >0.0

  • weight_decay (float, default 1e-4): Weight decay coefficient. Supported values: >0.0

  • lr_scheduler (string, default MultiStep): Learning-rate scheduler. MultiStep decreases the learning rate by a factor of gamma at the specified milestones; StepLR decreases it by the same factor at a fixed step interval. Supported values: MultiStep, StepLR

  • gamma (float, default 0.1): Decay factor for the learning-rate scheduler. Supported values: >0.0

  • milestones (int list, default [11]): Epochs at which to decrease the learning rate for the MultiStep scheduler

  • monitor_name (string, default val_loss): Value monitored by the AutoReduce scheduler. Supported values: val_loss, train_loss

  • type (string, default AdamW): Type of optimizer to use during training. Supported values: AdamW, SGD
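
Combined with the train block shown earlier, an optimizer configuration might look like this sketch (the values are illustrative, not tuned):

train:
    optim:
        type: "AdamW"
        lr: 0.0001
        weight_decay: 0.05
        lr_scheduler: "MultiStep"
        milestones: [30, 45]
        gamma: 0.1
        monitor_name: "val_loss"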

Evaluation Configuration#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

  • checkpoint (string): Path to the PyTorch model to evaluate

  • trt_engine (string): Path to the TensorRT engine to evaluate. Must be used only with tao deploy

  • num_gpus (unsigned int, default 1): Number of GPUs to use. Supported values: >0

  • gpu_ids (list[int], default [0]): GPU IDs to use

  • results_dir (string, default /results/evaluate): Path of the evaluation results directory

Inference Configuration#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

  • checkpoint (string): Path to the PyTorch model to run inference with

  • trt_engine (string): Path to the TensorRT engine to run inference with. Must be used only with tao deploy

  • num_gpus (unsigned int, default 1): Number of GPUs to use. Supported values: >0

  • gpu_ids (list[int], default [0]): GPU IDs to use

  • results_dir (string, default /results/inference): Path of the inference results directory

Export Configuration#

The export parameter defines the hyperparameters of the export process.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

  • checkpoint (string): Path to the PyTorch model to export

  • onnx_file (string): Path to the output .onnx file

  • on_cpu (bool, default True): If True, the DMHA module is exported as standard PyTorch. If False, the module is exported using the TensorRT plugin. Supported values: True, False

  • opset_version (unsigned int, default 12): Opset version of the exported ONNX model. Supported values: >0

  • input_channel (unsigned int, default 3): Input channel size. The only supported value is 3

  • input_width (unsigned int, default 960): Input width. Supported values: >0

  • input_height (unsigned int, default 544): Input height. Supported values: >0

  • batch_size (int, default -1): Batch size of the ONNX model. If -1, the export uses a dynamic batch size. Supported values: >=-1

Training the Model#

To train a OneFormer model, use this command:

TRAIN_JOB_ID=$(tao-client oneformer experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer train [-h] -e <experiment_spec>
                    [results_dir=<global_results_dir>]
                    [model.<model_option>=<model_option_value>]
                    [dataset.<dataset_option>=<dataset_option_value>]
                    [train.<train_option>=<train_option_value>]
                    [train.gpu_ids=<gpu indices>]
                    [train.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

Optional arguments override option values in the experiment spec file.

Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 and gpu_ids = [0, 1]), they are adjusted to follow the setting that implies more GPUs; in this example, num_gpus is changed from 1 to 2.
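
For example, a consistent two-GPU setting can be passed on the command line as follows (the spec path is a placeholder):

tao model oneformer train -e /path/to/spec.yaml train.num_gpus=2 train.gpu_ids=[0,1]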

In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you can use the following methods to set this variable:

  • CLI Launcher:

    You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as described in the Running the launcher section.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADS",
                "value": "1"
            }
        ]
    }
    
  • Docker:

    You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
        -e OMP_NUM_THREADS=1 \
        -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e
    

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is also saved as oneformer_model_latest.pth.

Training automatically resumes from oneformer_model_latest.pth if it exists in train.results_dir.

oneformer_model_latest.pth is superseded by train.resume_training_checkpoint_path if it is provided.

The major implication of this logic is that, if you want to trigger fresh training from scratch, you must either:

  • Specify a new, empty results directory (recommended), or

  • Remove the latest checkpoint from the results directory.
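
As a sketch with the TAO Launcher (the paths are placeholders), resuming from a specific checkpoint or starting a fresh run looks like this:

# Resume from a specific checkpoint instead of oneformer_model_latest.pth
tao model oneformer train -e /path/to/spec.yaml \
    train.resume_training_checkpoint_path=/results/train/model_epoch_002.pth

# Start fresh by pointing results_dir at a new, empty directory
tao model oneformer train -e /path/to/spec.yaml results_dir=/results/train_fresh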

Optimizing Resources for Training OneFormer#

Training OneFormer on a standard dataset such as COCO requires powerful GPUs (for example, V100 or A100) with at least 15 GB of VRAM and a large amount of CPU memory. This section outlines strategies you can use to launch training with limited resources.

Optimize GPU Memory#

There are various ways to optimize GPU memory usage. A common approach is to reduce dataset.batch_size. However, this can cause your training to take longer than usual.

We recommend the following configuration settings to optimize GPU memory consumption (a combined sketch follows the list):

  • Set train.precision to fp16 to enable automatic mixed precision training. This can reduce your GPU memory usage by 50%.

  • Set train.activation_checkpoint to True to enable activation checkpointing. Memory usage can be improved by recomputing the activations instead of caching them in memory.

  • Set train.distributed_strategy to ddp_sharded to enable Sharded DDP training. This shards gradient computation across processes to help reduce GPU memory usage.

  • Try using lighter-weight backbones, or freeze the backbone by setting train.freeze.

  • Try changing the augmentation resolution in dataset.augmentation, depending on your dataset.
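
A minimal sketch combining these recommendations in the spec file (the values are illustrative, not tuned):

train:
    precision: "fp16"
    activation_checkpoint: true
    distributed_strategy: "ddp_sharded"
    freeze: ["backbone"]
dataset:
    train:
        batch_size: 2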

Optimize CPU Memory#

To speed up data loading, it is common practice to use many workers to spawn multiple processes. However, this can cause an Out of Memory condition if the annotation file is very large. We recommend the following configuration settings to optimize CPU memory consumption (a combined sketch follows the list):

  • Set dataset.dataset_type to serialized so that the COCO-based annotation data can be shared across different subprocesses.

  • Set dataset.augmentation.fixed_padding to True so that images are padded before the batch formulation. Due to random resize and random crop augmentation during training, the resulting image resolution after transform can vary across images. Such variable image resolutions can cause memory leaks, causing an Out of Memory condition in the middle of training. This is the limitation of PyTorch, so we advise setting fixed_padding to True to help stabilize CPU memory usage.
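
A minimal sketch of these settings; the placement of fixed_padding under augmentation follows the parameter names above and may need adjusting for your spec version:

dataset:
    dataset_type: serialized
    augmentation:
        fixed_padding: true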

Evaluating the Model#

To run evaluation with a OneFormer model, use this command:

EVAL_JOB_ID=$(tao-client oneformer experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer evaluate [-h] -e <experiment_spec>
                    evaluate.checkpoint=<model to be evaluated>
                    [evaluate.<evaluate_option>=<evaluate_option_value>]
                    [evaluate.gpu_ids=<gpu indices>]
                    [evaluate.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

  • evaluate.checkpoint: The .pth model to be evaluated

Optional Arguments

Running Inference with the OneFormer Model#

The inference tool for OneFormer models can be used to visualize bounding boxes and masks. Use the following command:

INFERENCE_JOB_ID=$(tao-client oneformer experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer inference [-h] -e <experiment spec file>
                    inference.checkpoint=<inference model>
                    [inference.<inference_option>=<inference_option_value>]
                    [inference.gpu_ids=<gpu indices>]
                    [inference.num_gpus=<number of gpus>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

  • inference.checkpoint: The .pth model to run inference on

Optional Arguments

Exporting the Model#

To export a OneFormer model, use this command:

EXPORT_JOB_ID=$(tao-client oneformer experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

tao model oneformer export [-h] -e <experiment spec file>
                    [results_dir=<results_dir>]
                    export.checkpoint=<model to export>
                    export.onnx_file=<onnx path>

Required Arguments

The following arguments are required to run the command.

  • -e, --experiment_spec: The path to an experiment spec file

  • export.checkpoint: The .pth model to export.

  • export.onnx_file: The path where the .etlt or .onnx model is saved.

Optional Arguments

The following arguments are optional to run the command.

TensorRT Engine Generation and Validation#

For deployment, refer to the TAO Deploy documentation for OneFormer.