Masked Autoencoders (MAE)#

Introduction#

Masked Autoencoders (MAE) are a self-supervised learning technique designed to learn powerful visual representations without the need for labeled data. Inspired by masked language modeling approaches in NLP (such as BERT), MAEs operate by randomly masking portions of an input image and training a model to reconstruct the missing areas. This encourages the model to understand the global structure and semantics of the image in order to accurately fill in the blanks.

The key idea behind MAE is to make the learning task sufficiently challenging and meaningful so that the model must capture high-level information about the input data. Unlike traditional autoencoders, MAEs only encode the visible patches and reconstruct the full image, making them both memory-efficient and effective at learning general-purpose features.
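During pretraining, the model is optimized with a reconstruction loss computed only over the masked patches. As a sketch (using the normalized-pixel-loss option described later, where each target patch is standardized by its own mean and standard deviation):

\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \left\lVert \hat{x}_i - \frac{x_i - \mu_i}{\sigma_i} \right\rVert_2^2

where M is the set of masked patches (a fraction mask_ratio of all patches), x_i and \hat{x}_i are the original and reconstructed pixel values of patch i, and \mu_i, \sigma_i are the per-patch mean and standard deviation.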

Benefits#

  • Label-efficient learning: MAEs do not require manually annotated data, making them ideal for large-scale, unlabeled datasets.

  • Strong representations: Features learned via MAE pretraining can be fine-tuned or transferred to various downstream tasks such as classification, segmentation, and detection.

  • Scalability: The MAE architecture is highly scalable and can leverage modern transformer-based backbones.

Note

The MAE training and finetuning pipelines are compatible with model checkpoints released in the ConvNeXt-V2 repository, allowing users to leverage pretrained models for transfer learning.

Each task is explained in detail in the following sections.

Note

  • Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Data Input for MAE#

MAE expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
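For example, a dataset root might be laid out as follows (the paths are illustrative and mirror the values used in the spec examples below):

/data/
├── train/
│   ├── image_0001.jpg
│   └── image_0002.png
├── val/
│   └── image_0101.jpg
└── test/
    └── image_0201.jpg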

Creating an Experiment Spec File#

The training experiment spec file for MAE includes the following elements:

  • model

  • train

  • evaluate

  • inference

  • export

  • gen_trt_engine

  • dataset

Use the following command to create an experiment spec file for MAE:

SPECS=$(tao-client mae get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
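The returned spec is a JSON document. To inspect or adjust it before launching a job, you can pretty-print it to a file and reload it; this sketch assumes the jq utility is available (any JSON tool works):

echo "$SPECS" | jq . > mae_train_specs.json
# Edit mae_train_specs.json as needed, then reload it:
SPECS=$(cat mae_train_specs.json)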

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Parameter | Data Type | Description | Automl Enabled
--------- | --------- | ----------- | --------------
model | collection | Configurable parameters to construct the model for an MAE experiment. | False
dataset | collection | Configurable parameters to construct the dataset for an MAE experiment. | False
train | collection | Configurable parameters to construct the trainer for an MAE experiment. | False
evaluate | collection | Configurable parameters to construct the evaluator for an MAE experiment. | False
inference | collection | Configurable parameters to construct the inferencer for an MAE experiment. | False
export | collection | Configurable parameters to construct the exporter for an MAE experiment. | False
gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for an MAE experiment. | False

model#

The model parameter provides options to change the MAE architecture.

model:
  arch: convnextv2_base
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
arch | string | convnextv2_base | The model architecture to use | convnextv2_atto, convnextv2_femto, convnextv2_pico, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge, vit_base_patch16, vit_large_patch16, vit_huge_patch14, hiera_tiny_224, hiera_small_224, hiera_base_224, hiera_large_224, hiera_huge_224
num_classes | int | 1000 | The number of classes for classification | >0
drop_path_rate | float | 0.1 | The drop path rate for stochastic depth | >=0.0
global_pool | bool | True | Whether to use global pooling in the model | True/False
decoder_depth | int | 1 | The depth of the MAE decoder | >0
decoder_embed_dim | int | 512 | The embedding dimension of the MAE decoder | >0
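For example, to pretrain with a ViT-Base encoder instead of the default ConvNeXt-V2 backbone, only the arch field needs to change; decoder_depth and decoder_embed_dim control the lightweight decoder used for reconstruction during the pretrain stage. The values below are illustrative, not tuned recommendations:

model:
  arch: vit_base_patch16
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512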

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources: /data/train/
  val_data_sources: /data/val/
  test_data_sources: /data/test/
  batch_size: 32
  num_workers_per_gpu: 2
  augmentation:
    input_size: 224
    mean:
    - 0.485
    - 0.456
    - 0.406
    std:
    - 0.229
    - 0.224
    - 0.225
    min_scale: 0.1
    max_scale: 2.0
    smoothing: 0.1
    color_jitter: 0.0
    auto_aug: rand-m9-mstd0.5-inc1
    mixup: 0.8
    cutmix: 1.0
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    mixup_mode: batch

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
train_data_sources | string | | The directory containing training images |
val_data_sources | string | | The directory containing validation images |
batch_size | int | 3 | The batch size for training and validation | >0
num_workers_per_gpu | int | 2 | The number of workers per GPU for data loading | >0
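As noted earlier, FTMS Client consumes the same fields in JSON form. A sketch of the dataset block as it might appear inside the JSON spec (paths are illustrative):

{
    "dataset": {
        "train_data_sources": "/data/train/",
        "val_data_sources": "/data/val/",
        "batch_size": 32,
        "num_workers_per_gpu": 2
    }
}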

augmentation#

The augmentation parameter contains hyperparameters for data augmentation.

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
input_size | int | 224 | The input image size | >0
mean | float list | [0.485, 0.456, 0.406] | The mean values for image normalization | list of 3 values
std | float list | [0.229, 0.224, 0.225] | The standard deviation values for image normalization | list of 3 values
min_scale | float | 0.1 | The minimum scale for random resizing | >0.0
max_scale | float | 2.0 | The maximum scale for random resizing | >0.0
min_ratio | float | 0.1 | The minimum ratio for random resizing | >0.0
max_ratio | float | 2.0 | The maximum ratio for random resizing | >0.0
smoothing | float | 0.1 | The label smoothing value | >=0.0
color_jitter | float | 0.0 | The color jittering strength | >=0.0
auto_aug | string | rand-m9-mstd0.5-inc1 | The auto augmentation policy |
mixup | float | 0.8 | The mixup alpha value | >=0.0
cutmix | float | 1.0 | The cutmix alpha value | >=0.0
mixup_prob | float | 1.0 | The probability of applying mixup | >=0.0
mixup_switch_prob | float | 0.5 | The probability of switching between mixup and cutmix | >=0.0
mixup_mode | string | batch | The mixup mode | batch, pair, elem
interpolation | string | random | The interpolation method | random, bilinear
hflip | float | 0.5 | The probability of horizontal flipping | >=0.0
re_prob | float | 0.0 | The probability of random erasing | >=0.0
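Mixup, CutMix, label smoothing, and auto-augment are classification-style augmentations and are generally only meaningful for the finetune stage; during the pretrain stage the random masking itself provides most of the regularization. A possible simplified recipe for pretraining (illustrative values, not a tuned recommendation):

augmentation:
  input_size: 224
  hflip: 0.5
  color_jitter: 0.0
  mixup: 0.0
  cutmix: 0.0
  smoothing: 0.0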

train#

The train parameter defines the hyperparameters of the training process.

train:
  stage: pretrain
  accum_grad_batches: 1
  precision: fp32
  distributed_strategy: ddp
  optim:
    type: AdamW
    monitor_name: train_loss
    lr: 2e-4
    backbone_multiplier: 0.1
    momentum: 0.9
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: MultiStep
    milestones: [88, 96]
    gamma: 0.1
    warmup_epochs: 1
  norm_pix_loss: True
  mask_ratio: 0.75

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
stage | string | pretrain | The training stage (pretrain or finetune) | pretrain, finetune
accum_grad_batches | int | 1 | The number of gradient accumulation steps | >0
precision | string | fp32 | The training precision | fp32, bf16, fp16
distributed_strategy | string | ddp | The distributed training strategy | ddp, fsdp
norm_pix_loss | bool | True | Whether to use normalized pixel loss | True/False
mask_ratio | float | 0.75 | The ratio of patches to mask | >0.0, <1.0
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | | The directory to save training results |

optim#

The optim parameter defines the configuration of the optimizer used during training.

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
type | string | AdamW | The optimizer type | AdamW
monitor_name | string | train_loss | The metric to monitor for learning rate scheduling | train_loss, val_loss
lr | float | 2e-4 | The learning rate | >0.0
backbone_multiplier | float | 0.1 | The learning rate multiplier for the backbone | >0.0
momentum | float | 0.9 | The momentum value | >0.0
weight_decay | float | 0.05 | The weight decay coefficient | >=0.0
layer_decay | float | 0.75 | The layer-wise learning rate decay | >0.0
lr_scheduler | string | MultiStep | The learning rate scheduler type | MultiStep, cosine
milestones | int list | [88, 96] | The epochs at which to decay the learning rate |
gamma | float | 0.1 | The learning rate decay factor | >0.0
warmup_epochs | int | 1 | The number of warmup epochs | >=0
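For the finetune stage, a cosine schedule with a short warmup and layer-wise learning-rate decay is a common pairing. A sketch that uses only the fields documented above (the values are illustrative, not tuned recommendations):

train:
  stage: finetune
  num_epochs: 100
  optim:
    type: AdamW
    lr: 1e-3
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: cosine
    warmup_epochs: 5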

Training the Model#

Use the following command to run MAE training:

TRAIN_JOB_ID=$(tao-client mae experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, training on a multi-node cluster can bring significant speedups.

Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:

kubectl get nodes -o wide

You should see multiple nodes listed. If you do not, contact your cluster administrator to add more nodes to your cluster.

To run a multi-node training job through FTMS, you can modify the following fields in the training job spec:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, the default value of 1 GPU per node and 1 node will be used.

Note

The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster. The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
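If you are preparing the spec with the remote client, the same fields can be patched into the JSON returned by get-spec before submitting the job. This sketch assumes the jq utility is available:

SPECS=$(echo "$SPECS" | jq '.train.num_gpus = 8 | .train.num_nodes = 2')
TRAIN_JOB_ID=$(tao-client mae experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")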

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results

Field | Data Type | Description | Supported Values | Automl Enabled
----- | --------- | ----------- | ---------------- | --------------
checkpoint | string | The path to the model checkpoint to evaluate | | False
results_dir | string | The directory to save evaluation results | |
num_gpus | unsigned int | The number of GPUs to use for distributed evaluation | >0 |
gpu_ids | List[int] | The indices of the GPUs to use for distributed evaluation | |
trt_engine | string | The path to the TensorRT model to evaluate. Only used with TAO Deploy. | |
Note

The evaluation pipeline only supports the checkpoints from the finetune stage.
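Before launching evaluation, you can fetch the spec for the evaluate action and point it at a specific finetuned checkpoint. This sketch assumes the get-spec pattern shown for the train action also applies to the evaluate action and that jq is available; the checkpoint path is illustrative:

SPECS=$(tao-client mae get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)
SPECS=$(echo "$SPECS" | jq '.evaluate.checkpoint = "/path/to/finetuned_model.pth"')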

To run evaluation with an MAE model, use this command:

EVAL_JOB_ID=$(tao-client mae experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Running Inference with an MAE Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results

Field | Data Type | Description | Supported Values | Automl Enabled
----- | --------- | ----------- | ---------------- | --------------
checkpoint | string | The path to the model checkpoint to run inference with | | False
results_dir | string | The directory to save inference results | |
num_gpus | unsigned int | The number of GPUs to use for distributed inference | >0 |
gpu_ids | List[int] | The indices of the GPUs to use for distributed inference | |
trt_engine | string | The path to the TensorRT model to run inference with. Only used with TAO Deploy. | |
Note

The inference pipeline only supports the checkpoints from the finetune stage.

To run inference with an MAE model, use this command:

INFERENCE_JOB_ID=$(tao-client mae experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Exporting the Model#

export#

The export parameter defines the hyperparameters for exporting the model.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
checkpoint | string | | The path to the PyTorch model to export |
onnx_file | string | | The path to the .onnx file |
on_cpu | bool | True | If True, the DMHA module is exported as standard PyTorch; if False, it is exported using the TensorRT plugin. | True, False
opset_version | unsigned int | 12 | The opset version of the exported ONNX model | >0
input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3
input_width | unsigned int | 960 | The input width | >0
input_height | unsigned int | 544 | The input height | >0
batch_size | int | -1 | The batch size of the ONNX model. Set this value to -1 to use a dynamic batch size. | >=-1

Note

The export pipeline supports checkpoints from both the pretrain and finetune stages. When exporting a finetune-stage model, the output tensor contains the classification logits; when exporting a pretrain-stage model, the output tensor contains the backbone features before the classification head.
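Because MAE models are trained on square crops (input_size is 224 by default), you may want the export resolution to match the training resolution rather than the 960x544 values shown above. A sketch using only the documented fields (paths are illustrative):

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 224
  input_height: 224
  batch_size: -1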

To export an MAE model, use this command:

EXPORT_JOB_ID=$(tao-client mae experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

TensorRT Engine Generation#

For deployment, refer to the TAO Deploy documentation.