Masked Autoencoders (MAE)#
Introduction#
Masked Autoencoders (MAE) are a self-supervised learning technique designed to learn powerful visual representations without the need for labeled data. Inspired by masked language modeling approaches in NLP (such as BERT), MAEs operate by randomly masking portions of an input image and training a model to reconstruct the missing areas. This encourages the model to understand the global structure and semantics of the image in order to accurately fill in the blanks.
The key idea behind MAE is to make the learning task sufficiently challenging and meaningful so that the model must capture high-level information about the input data. Unlike traditional autoencoders, MAEs only encode the visible patches and reconstruct the full image, making them both memory-efficient and effective at learning general-purpose features.
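To make the idea concrete, the following minimal PyTorch sketch (illustrative only, not TAO code) masks a random subset of patches, keeps only the visible ones for the encoder, and computes the reconstruction loss on the masked positions:

# Minimal illustration of the MAE pretraining objective (not TAO code).
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, patch_dim)
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # random score per patch
    ids_shuffle = noise.argsort(dim=1)              # lowest scores are kept
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)                   # 1 = masked, 0 = visible
    return visible, mask

# The encoder sees only the visible patches; the decoder reconstructs all
# patches, and the loss is averaged over the masked positions only.
patches = torch.randn(2, 196, 768)                  # e.g. 14x14 patches of a 224x224 image
visible, mask = random_masking(patches)
reconstruction = torch.randn_like(patches)          # stand-in for the decoder output
loss = (((reconstruction - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()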
Benefits#
Label-efficient learning: MAEs do not require manually annotated data, making them ideal for large-scale, unlabeled datasets.
Strong representations: Features learned via MAE pretraining can be fine-tuned or transferred to various downstream tasks such as classification, segmentation, and detection.
Scalability: The MAE architecture is highly scalable and can leverage modern transformer-based backbones.
Note
The MAE training and finetuning pipelines are compatible with model checkpoints released in the ConvNeXt-V2 repository, allowing users to leverage pretrained models for transfer learning.
Each task is explained in detail in the following sections.
Note
Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.
For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.
For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.
The spec format is YAML for TAO Launcher and JSON for FTMS Client.
File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.
Data Input for MAE#
MAE expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
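As a quick sanity check before pointing a spec at a directory, you can count the supported images it contains. This is an illustrative snippet, not a TAO command, and the path is hypothetical:

# Count files with supported extensions in a candidate data directory (illustrative only).
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp"}
data_dir = Path("/data/train")                      # hypothetical path
images = [p for p in data_dir.iterdir() if p.suffix.lower() in SUPPORTED]
print(f"{len(images)} supported images found in {data_dir}")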
Creating an Experiment Spec File#
The training experiment spec file for MAE includes the following elements:
model
train
evaluate
inference
export
gen_trt_engine
dataset
Use the following command to create an experiment spec file for MAE:
SPECS=$(tao-client mae get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Here is an example spec file for training a ConvNeXtV2 backbone:
dataset:
  train_data_sources: /data/train/
  val_data_sources: /data/val/
  test_data_sources: /data/test/
  batch_size: 32
  num_workers_per_gpu: 2
  augmentation:
    input_size: 224
    mean:
      - 0.485
      - 0.456
      - 0.406
    std:
      - 0.229
      - 0.224
      - 0.225
    min_scale: 0.1
    max_scale: 2.0
    smoothing: 0.1
    color_jitter: 0.0
    auto_aug: rand-m9-mstd0.5-inc1
    mixup: 0.8
    cutmix: 1.0
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    mixup_mode: batch
model:
  arch: convnextv2_base
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512
train:
  stage: pretrain # finetune
  accum_grad_batches: 1
  precision: bf16
  distributed_strategy: ddp
  optim:
    type: AdamW
    monitor_name: train_loss
    lr: 2e-4
    backbone_multiplier: 0.1
    momentum: 0.9
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: cosine # MultiStep
    warmup_epochs: 40
  norm_pix_loss: True
  mask_ratio: 0.75
  num_epochs: 600
| Parameter | Data Type | Description | Automl Enabled |
|---|---|---|---|
| model | collection | Configurable parameters to construct the model for an MAE experiment. | False |
| dataset | collection | Configurable parameters to construct the dataset for an MAE experiment. | False |
| train | collection | Configurable parameters to construct the trainer for an MAE experiment. | False |
| evaluate | collection | Configurable parameters to construct the evaluator for an MAE experiment. | False |
| inference | collection | Configurable parameters to construct the inferencer for an MAE experiment. | False |
| export | collection | Configurable parameters to construct the exporter for an MAE experiment. | False |
| gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for an MAE experiment. | False |
model#
The model parameter provides options to change the MAE architecture.
model:
  arch: convnextv2_base
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| arch | string | convnextv2_base | The model architecture to use | convnextv2_atto, convnextv2_femto, convnextv2_pico, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge, vit_base_patch16, vit_large_patch16, vit_huge_patch14, hiera_tiny_224, hiera_small_224, hiera_base_224, hiera_large_224, hiera_huge_224 |
| num_classes | int | 1000 | The number of classes for classification | >0 |
| drop_path_rate | float | 0.1 | The drop path rate for stochastic depth | >=0.0 |
| global_pool | bool | True | Whether to use global pooling in the model | True/False |
| decoder_depth | int | 1 | The depth of the MAE decoder | >0 |
| decoder_embed_dim | int | 512 | The embedding dimension of the MAE decoder | >0 |
dataset#
The dataset parameter defines the dataset source, training batch size, and augmentation.
dataset:
  train_data_sources: /data/train/
  val_data_sources: /data/val/
  test_data_sources: /data/test/
  batch_size: 32
  num_workers_per_gpu: 2
  augmentation:
    input_size: 224
    mean:
      - 0.485
      - 0.456
      - 0.406
    std:
      - 0.229
      - 0.224
      - 0.225
    min_scale: 0.1
    max_scale: 2.0
    smoothing: 0.1
    color_jitter: 0.0
    auto_aug: rand-m9-mstd0.5-inc1
    mixup: 0.8
    cutmix: 1.0
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    mixup_mode: batch
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_data_sources | string | | The directory containing training images | |
| val_data_sources | string | | The directory containing validation images | |
| batch_size | int | 3 | The batch size for training and validation | >0 |
| num_workers_per_gpu | int | 2 | The number of workers per GPU for data loading | >0 |
augmentation#
The augmentation parameter contains hyperparameters for data augmentation.
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| input_size | int | 224 | The input image size | >0 |
| mean | float list | [0.485, 0.456, 0.406] | The mean values for image normalization | list of 3 values |
| std | float list | [0.229, 0.224, 0.225] | The standard deviation values for image normalization | list of 3 values |
| min_scale | float | 0.1 | The minimum scale for random resizing | >0.0 |
| max_scale | float | 2.0 | The maximum scale for random resizing | >0.0 |
| min_ratio | float | 0.1 | The minimum ratio for random resizing | >0.0 |
| max_ratio | float | 2.0 | The maximum ratio for random resizing | >0.0 |
| smoothing | float | 0.1 | The label smoothing value | >=0.0 |
| color_jitter | float | 0.0 | The color jittering strength | >=0.0 |
| auto_aug | string | rand-m9-mstd0.5-inc1 | The auto augmentation policy | |
| mixup | float | 0.8 | The mixup alpha value | >=0.0 |
| cutmix | float | 1.0 | The cutmix alpha value | >=0.0 |
| mixup_prob | float | 1.0 | The probability of applying mixup | >=0.0 |
| mixup_switch_prob | float | 0.5 | The probability of switching between mixup and cutmix | >=0.0 |
| mixup_mode | string | batch | The mixup mode | batch, pair, elem |
| interpolation | string | random | The interpolation method | random, bilinear |
| hflip | float | 0.5 | The probability of horizontal flipping | >=0.0 |
| re_prob | float | 0.0 | The probability of random erasing | >=0.0 |
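The mixup-related parameters above appear to follow the conventions popularized by the timm library. As a rough sketch of how such settings are typically consumed during finetuning (this assumes a timm-style Mixup helper and is not the TAO implementation):

# Rough sketch of how the mixup/cutmix settings above are typically consumed
# (assumes a timm-style Mixup helper; this is not the TAO implementation).
from timm.data import Mixup

mixup_fn = Mixup(
    mixup_alpha=0.8,        # dataset.augmentation.mixup
    cutmix_alpha=1.0,       # dataset.augmentation.cutmix
    prob=1.0,               # dataset.augmentation.mixup_prob
    switch_prob=0.5,        # dataset.augmentation.mixup_switch_prob
    mode="batch",           # dataset.augmentation.mixup_mode
    label_smoothing=0.1,    # dataset.augmentation.smoothing
    num_classes=1000,       # model.num_classes
)
# During finetuning, images and labels are mixed before the forward pass:
# images, targets = mixup_fn(images, targets)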
train#
The train parameter defines the hyperparameters of the training process.
train:
  stage: pretrain
  accum_grad_batches: 1
  precision: fp32
  distributed_strategy: ddp
  optim:
    type: AdamW
    monitor_name: train_loss
    lr: 2e-4
    backbone_multiplier: 0.1
    momentum: 0.9
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: MultiStep
    milestones: [88, 96]
    gamma: 0.1
    warmup_epochs: 1
  norm_pix_loss: True
  mask_ratio: 0.75
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| stage | string | pretrain | The training stage (pretrain or finetune) | pretrain, finetune |
| accum_grad_batches | int | 1 | The number of gradient accumulation steps | >0 |
| precision | string | fp32 | The training precision | fp32, bf16, fp16 |
| distributed_strategy | string | ddp | The distributed training strategy | ddp, fsdp |
| norm_pix_loss | bool | True | Whether to use normalized pixel loss | True/False |
| mask_ratio | float | 0.75 | The ratio of patches to mask | >0.0, <1.0 |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | | The directory to save training results | |
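The norm_pix_loss and mask_ratio parameters control the pretraining objective. For intuition, here is a minimal sketch of a per-patch normalized pixel loss in the style of the original MAE paper; the exact TAO implementation may differ:

# Per-patch normalized pixel loss, as described in the original MAE paper (for intuition only).
import torch

def mae_loss(pred, target, mask, norm_pix_loss=True):
    # pred, target: (batch, num_patches, patch_dim); mask: (batch, num_patches), 1 = masked
    if norm_pix_loss:
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)      # per-patch MSE
    return (loss * mask).sum() / mask.sum()         # averaged over masked patches only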
optim#
The optim parameter defines the config for the optimizer in training.
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| type | string | AdamW | The optimizer type | AdamW |
| monitor_name | string | train_loss | The metric to monitor for learning rate scheduling | train_loss, val_loss |
| lr | float | 2e-4 | The learning rate | >0.0 |
| backbone_multiplier | float | 0.1 | The learning rate multiplier for the backbone | >0.0 |
| momentum | float | 0.9 | The momentum value | >0.0 |
| weight_decay | float | 0.05 | The weight decay coefficient | >=0.0 |
| layer_decay | float | 0.75 | The layer-wise learning rate decay | >0.0 |
| lr_scheduler | string | MultiStep | The learning rate scheduler type | MultiStep, cosine |
| milestones | int list | [88, 96] | The epochs at which to decay the learning rate | |
| gamma | float | 0.1 | The learning rate decay factor | >0.0 |
| warmup_epochs | int | 1 | The number of warmup epochs | >=0 |
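layer_decay applies layer-wise learning rate decay during finetuning: layers closer to the input receive progressively smaller learning rates. A simplified sketch of the idea (illustrative only; the actual parameter-to-layer mapping depends on the backbone):

# Layer-wise learning rate decay: layer i (of L) gets lr * layer_decay ** (L - i).
# Illustrative sketch only; the parameter-to-layer assignment depends on the backbone.
base_lr = 2e-4
layer_decay = 0.75
num_layers = 12

for layer_id in range(num_layers + 1):              # 0 = embedding, num_layers = head
    scale = layer_decay ** (num_layers - layer_id)
    print(f"layer {layer_id:2d}: lr = {base_lr * scale:.2e}")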
Training the Model#
Use the following command to run MAE training:
TRAIN_JOB_ID=$(tao-client mae experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Multi-Node Training with FTMS
Distributed training is supported through FTMS. For large models, multi-node clusters can significantly speed up training.
Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:
kubectl get nodes -o wide
You should see multiple nodes listed. If you do not see multiple nodes, please contact your cluster administrator to get more nodes added to your cluster.
To run a multi-node training job through FTMS, you can modify the following fields in the training job spec:
{
    "train": {
        "num_gpus": 8,   // Number of GPUs per node
        "num_nodes": 2   // Number of nodes to use for training
    }
}
If these fields are not specified, the default value of 1 GPU per node and 1 node will be used.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
tao model mae train [-h] -e <experiment_spec_file>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
train.optim.<optim_option>: The optimizer options.
Note
For training, evaluation, and inference, we expose 2 variables for each respective task: num_gpus
and gpu_ids
, which
default to 1
and [0]
, respectively. If both are passed, but inconsistent, for example num_gpus = 1
,
gpu_ids = [0, 1]`, then they are modified to follow the setting with more GPUs, for example num_gpus = 1 -> num_gpus = 2
.
In some cases, you may encounter an issue with multi-GPU training that results in a segmentation fault. You can work around this by setting the OMP_NUM_THREADS environment variable to 1. Depending on how you run TAO, you can set this variable using one of the following methods:
CLI Launcher
You can set this environment variable by adding the following to the Envs field of your ~/.tao_mounts.json file:
{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}
Docker
You can set environment variables in the Docker container by passing the -e flag on the docker command line.
docker run -it --rm --gpus all \
-e OMP_NUM_THREADS=1 \
-v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved. These checkpoints are saved in train.results_dir, as shown below:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
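To resume an interrupted run, set train.resume_training_checkpoint_path to one of these files. The following illustrative helper (not a TAO command) picks the most recent checkpoint:

# Pick the most recent checkpoint to pass as train.resume_training_checkpoint_path (illustrative only).
from pathlib import Path

results_dir = Path("/results/train")                # train.results_dir
checkpoints = sorted(results_dir.glob("model_epoch_*.pth"))
if checkpoints:
    print(f"resume from: {checkpoints[-1]}")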
Evaluating the Model#
evaluate#
The evaluate parameter defines the hyperparameters of the evaluation process.
evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results
| Parameter | Datatype | Description | Supported Values | Automl Enabled |
|---|---|---|---|---|
| checkpoint | string | The path to the model checkpoint to evaluate | | False |
| results_dir | string | The directory to save evaluation results | | |
| num_gpus | unsigned int | The number of GPUs to use for distributed evaluation | >0 | |
| gpu_ids | List[int] | The indices of the GPUs to use for distributed evaluation | | |
| trt_engine | string | The path to the TensorRT model to evaluate. Only used with TAO Deploy. | | |
Note
The evaluation pipeline only supports checkpoints from the finetune stage.
To run evaluation with an MAE model, use this command:
EVAL_JOB_ID=$(tao-client mae experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mae evaluate [-h] -e <experiment_spec>
evaluate.checkpoint=<model to be evaluated>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment
evaluate.checkpoint: The .pth model to be evaluated.
Optional Arguments
The following arguments are optional to run the command.
evaluate.<evaluate_option>: The evaluate options.
Running Inference with an MAE Model#
inference#
The inference parameter defines the hyperparameters of the inference process.
inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results
| Parameter | Datatype | Description | Supported Values | Automl Enabled |
|---|---|---|---|---|
| checkpoint | string | The path to the model checkpoint to run inference with | | False |
| results_dir | string | The directory to save inference results | | |
| num_gpus | unsigned int | The number of GPUs to use for distributed inference | >0 | |
| gpu_ids | List[int] | The indices of the GPUs to use for distributed inference | | |
| trt_engine | string | The path to the TensorRT model to run inference with. Only used with TAO Deploy. | | |
Note
The inference pipeline only supports checkpoints from the finetune stage.
To run inference with an MAE model, use this command:
INFERENCE_JOB_ID=$(tao-client mae experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mae inference [-h] -e <experiment_spec>
inference.checkpoint=<model to be inferenced>
[inference.<inference_option>=<inference_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the inference experiment
inference.checkpoint: The .pth model to use for inference.
Optional Arguments
The following arguments are optional to run the command.
inference.<inference_option>: The inference options.
Exporting the Model#
export#
The export parameter defines the hyperparameters for exporting the model.
export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | | The path to the PyTorch model to export | |
| onnx_file | string | | The path where the exported .onnx model is saved | |
| on_cpu | bool | True | If this value is True, the DMHA module is exported as standard PyTorch. If this value is False, the module is exported using the TRT Plugin. | True, False |
| opset_version | unsigned int | 12 | The opset version of the exported ONNX model | >0 |
| input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3 |
| input_width | unsigned int | 960 | The input width | >0 |
| input_height | unsigned int | 544 | The input height | >0 |
| batch_size | unsigned int | -1 | The batch size of the ONNX model. If this value is set to -1, the export uses a dynamic batch size. | >=-1 |
Note
The export pipeline supports checkpoints from both the pretrain and finetune stages.
When exporting a finetune-stage model, the output tensor is the classification logits.
When exporting a pretrain-stage model, the output tensor is the backbone features before the classification head.
To export an MAE model, use this command:
EXPORT_JOB_ID=$(tao-client mae experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mae export [-h] -e <experiment_spec>
export.checkpoint=<model to export>
export.onnx_file=<onnx path>
[export.<export_option>=<export_option_value>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the export experiment
export.checkpoint: The .pth model to export.
export.onnx_file: The path where the .onnx model is saved.
Optional Arguments
The following arguments are optional to run the command.
export.<export_option>: The export options.
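As an optional check (not a TAO command), you can inspect the exported ONNX model with onnxruntime to confirm its input shape and whether the output is classification logits or backbone features. The path and input shape below are assumptions based on the example export spec:

# Optional check of the exported ONNX model with onnxruntime (not a TAO command).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/path/to/model.onnx")          # export.onnx_file
input_meta = session.get_inputs()[0]
print("input :", input_meta.name, input_meta.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)

# Run a dummy batch; the shape follows export.input_channel/input_height/input_width.
dummy = np.random.rand(1, 3, 544, 960).astype(np.float32)
outputs = session.run(None, {input_meta.name: dummy})
print("output tensor shape:", outputs[0].shape)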
TensorRT Engine Generation#
For deployment, refer to the TAO Deploy documentation.