Masked Autoencoders (MAE)#

Introduction#

Masked Autoencoders (MAE) are a self-supervised learning technique designed to learn powerful visual representations without the need for labeled data. Inspired by masked language modeling approaches in NLP (such as BERT), MAEs operate by randomly masking portions of an input image and training a model to reconstruct the missing areas. This encourages the model to understand the global structure and semantics of the image in order to accurately fill in the blanks.

The key idea behind MAE is to make the learning task sufficiently challenging and meaningful so that the model must capture high-level information about the input data. Unlike traditional autoencoders, MAEs only encode the visible patches and reconstruct the full image, making them both memory-efficient and effective at learning general-purpose features.
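During pretraining, the model is optimized with a reconstruction loss computed only over the masked patches. As a sketch (using the normalized-pixel-loss option described later, where each target patch is standardized by its own mean and standard deviation):

\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \left\lVert \hat{x}_i - \frac{x_i - \mu_i}{\sigma_i} \right\rVert_2^2

where M is the set of masked patches (a fraction mask_ratio of all patches), x_i and \hat{x}_i are the original and reconstructed pixel values of patch i, and \mu_i, \sigma_i are the per-patch mean and standard deviation.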

Benefits#

  • Label-efficient learning: MAEs do not require manually annotated data, making them ideal for large-scale, unlabeled datasets.

  • Strong representations: Features learned via MAE pretraining can be fine-tuned or transferred to various downstream tasks such as classification, segmentation, and detection.

  • Scalability: The MAE architecture is highly scalable and can leverage modern transformer-based backbones.

Note

The MAE training and finetuning pipelines are compatible with model checkpoints released in the ConvNeXt-V2 repository, allowing users to leverage pretrained models for transfer learning.

Each task is explained in detail in the following sections.

Note

  • Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Data Input for MAE#

MAE expects input data to be RGB images stored in a single directory. Supported image formats include: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, and .webp.
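For example, a dataset root might be laid out as follows (the paths are illustrative and mirror the values used in the spec examples below):

/data/
├── train/
│   ├── image_0001.jpg
│   └── image_0002.png
├── val/
│   └── image_0101.jpg
└── test/
    └── image_0201.jpg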

Creating an Experiment Spec File#

The training experiment spec file for MAE includes the following elements:

  • model

  • train

  • evaluate

  • inference

  • export

  • gen_trt_engine

  • dataset

Use the following command to create an experiment spec file for MAE:

SPECS=$(tao-client mae get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
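The returned spec is a JSON document. To inspect or adjust it before launching a job, you can pretty-print it to a file and reload it; this sketch assumes the jq utility is available (any JSON tool works):

echo "$SPECS" | jq . > mae_train_specs.json
# Edit mae_train_specs.json as needed, then reload it:
SPECS=$(cat mae_train_specs.json)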

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Parameter | Data Type | Description | Automl Enabled
--------- | --------- | ----------- | --------------
model | collection | Configurable parameters to construct the model for an MAE experiment. | False
dataset | collection | Configurable parameters to construct the dataset for an MAE experiment. | False
train | collection | Configurable parameters to construct the trainer for an MAE experiment. | False
evaluate | collection | Configurable parameters to construct the evaluator for an MAE experiment. | False
inference | collection | Configurable parameters to construct the inferencer for an MAE experiment. | False
export | collection | Configurable parameters to construct the exporter for an MAE experiment. | False
gen_trt_engine | collection | Configurable parameters to construct the TensorRT engine builder for an MAE experiment. | False

model#

The model parameter provides options to change the MAE architecture.

model:
  arch: convnextv2_base
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
arch | string | convnextv2_base | The model architecture to use | convnextv2_atto, convnextv2_femto, convnextv2_pico, convnextv2_nano, convnextv2_tiny, convnextv2_base, convnextv2_large, convnextv2_huge, vit_base_patch16, vit_large_patch16, vit_huge_patch14, hiera_tiny_224, hiera_small_224, hiera_base_224, hiera_large_224, hiera_huge_224
num_classes | int | 1000 | The number of classes for classification | >0
drop_path_rate | float | 0.1 | The drop path rate for stochastic depth | >=0.0
global_pool | bool | True | Whether to use global pooling in the model | True/False
decoder_depth | int | 1 | The depth of the MAE decoder | >0
decoder_embed_dim | int | 512 | The embedding dimension of the MAE decoder | >0
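For example, to pretrain with a ViT-Base encoder instead of the default ConvNeXt-V2 backbone, only the arch field needs to change; decoder_depth and decoder_embed_dim control the lightweight decoder used for reconstruction during the pretrain stage. The values below are illustrative, not tuned recommendations:

model:
  arch: vit_base_patch16
  num_classes: 1000
  drop_path_rate: 0.1
  global_pool: True
  decoder_depth: 1
  decoder_embed_dim: 512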

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_data_sources: /data/train/
  val_data_sources: /data/val/
  test_data_sources: /data/test/
  batch_size: 32
  num_workers_per_gpu: 2
  augmentation:
    input_size: 224
    mean:
    - 0.485
    - 0.456
    - 0.406
    std:
    - 0.229
    - 0.224
    - 0.225
    min_scale: 0.1
    max_scale: 2.0
    smoothing: 0.1
    color_jitter: 0.0
    auto_aug: rand-m9-mstd0.5-inc1
    mixup: 0.8
    cutmix: 1.0
    mixup_prob: 1.0
    mixup_switch_prob: 0.5
    mixup_mode: batch

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
train_data_sources | string | | The directory containing training images |
val_data_sources | string | | The directory containing validation images |
batch_size | int | 3 | The batch size for training and validation | >0
num_workers_per_gpu | int | 2 | The number of workers per GPU for data loading | >0
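As noted earlier, FTMS Client consumes the same fields in JSON form. A sketch of the dataset block as it might appear inside the JSON spec (paths are illustrative):

{
    "dataset": {
        "train_data_sources": "/data/train/",
        "val_data_sources": "/data/val/",
        "batch_size": 32,
        "num_workers_per_gpu": 2
    }
}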

augmentation#

The augmentation parameter contains hyperparameters for data augmentation.

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
input_size | int | 224 | The input image size | >0
mean | float list | [0.485, 0.456, 0.406] | The mean values for image normalization | list of 3 values
std | float list | [0.229, 0.224, 0.225] | The standard deviation values for image normalization | list of 3 values
min_scale | float | 0.1 | The minimum scale for random resizing | >0.0
max_scale | float | 2.0 | The maximum scale for random resizing | >0.0
min_ratio | float | 0.1 | The minimum ratio for random resizing | >0.0
max_ratio | float | 2.0 | The maximum ratio for random resizing | >0.0
smoothing | float | 0.1 | The label smoothing value | >=0.0
color_jitter | float | 0.0 | The color jittering strength | >=0.0
auto_aug | string | rand-m9-mstd0.5-inc1 | The auto augmentation policy |
mixup | float | 0.8 | The mixup alpha value | >=0.0
cutmix | float | 1.0 | The cutmix alpha value | >=0.0
mixup_prob | float | 1.0 | The probability of applying mixup | >=0.0
mixup_switch_prob | float | 0.5 | The probability of switching between mixup and cutmix | >=0.0
mixup_mode | string | batch | The mixup mode | batch, pair, elem
interpolation | string | random | The interpolation method | random, bilinear
hflip | float | 0.5 | The probability of horizontal flipping | >=0.0
re_prob | float | 0.0 | The probability of random erasing | >=0.0
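Mixup, CutMix, label smoothing, and auto-augment are classification-style augmentations and are generally only meaningful for the finetune stage; during the pretrain stage the random masking itself provides most of the regularization. A possible simplified recipe for pretraining (illustrative values, not a tuned recommendation):

augmentation:
  input_size: 224
  hflip: 0.5
  color_jitter: 0.0
  mixup: 0.0
  cutmix: 0.0
  smoothing: 0.0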

train#

The train parameter defines the hyperparameters of the training process.

train:
  stage: pretrain
  accum_grad_batches: 1
  precision: fp32
  distributed_strategy: ddp
  optim:
    type: AdamW
    monitor_name: train_loss
    lr: 2e-4
    backbone_multiplier: 0.1
    momentum: 0.9
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: MultiStep
    milestones: [88, 96]
    gamma: 0.1
    warmup_epochs: 1
  norm_pix_loss: True
  mask_ratio: 0.75

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
stage | string | pretrain | The training stage (pretrain or finetune) | pretrain, finetune
accum_grad_batches | int | 1 | The number of gradient accumulation steps | >0
precision | string | fp32 | The training precision | fp32, bf16, fp16
distributed_strategy | string | ddp | The distributed training strategy | ddp, fsdp
norm_pix_loss | bool | True | Whether to use normalized pixel loss | True/False
mask_ratio | float | 0.75 | The ratio of patches to mask | >0.0, <1.0
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | | The directory to save training results |

optim#

The optim parameter defines the configuration of the optimizer used during training.

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
type | string | AdamW | The optimizer type | AdamW
monitor_name | string | train_loss | The metric to monitor for learning rate scheduling | train_loss, val_loss
lr | float | 2e-4 | The learning rate | >0.0
backbone_multiplier | float | 0.1 | The learning rate multiplier for the backbone | >0.0
momentum | float | 0.9 | The momentum value | >0.0
weight_decay | float | 0.05 | The weight decay coefficient | >=0.0
layer_decay | float | 0.75 | The layer-wise learning rate decay | >0.0
lr_scheduler | string | MultiStep | The learning rate scheduler type | MultiStep, cosine
milestones | int list | [88, 96] | The epochs at which to decay the learning rate |
gamma | float | 0.1 | The learning rate decay factor | >0.0
warmup_epochs | int | 1 | The number of warmup epochs | >=0
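For the finetune stage, a cosine schedule with a short warmup and layer-wise learning-rate decay is a common pairing. A sketch that uses only the fields documented above (the values are illustrative, not tuned recommendations):

train:
  stage: finetune
  num_epochs: 100
  optim:
    type: AdamW
    lr: 1e-3
    weight_decay: 0.05
    layer_decay: 0.75
    lr_scheduler: cosine
    warmup_epochs: 5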

Training the Model#

Use the following command to run MAE training:

TRAIN_JOB_ID=$(tao-client mae experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, training on a multi-node cluster can bring significant speedups.

Verify that your cluster has multiple GPU-enabled nodes available for training by running the following command:

kubectl get nodes -o wide

You should see multiple nodes listed. If you do not, contact your cluster administrator to add more nodes to your cluster.

To run a multi-node training job through FTMS, you can modify the following fields in the training job spec:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, the default value of 1 GPU per node and 1 node will be used.

Note

The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster. The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
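If you are preparing the spec with the remote client, the same fields can be patched into the JSON returned by get-spec before submitting the job. This sketch assumes the jq utility is available:

SPECS=$(echo "$SPECS" | jq '.train.num_gpus = 8 | .train.num_nodes = 2')
TRAIN_JOB_ID=$(tao-client mae experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")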

Evaluating the Model#

evaluate#

The evaluate parameter defines the hyperparameters of the evaluation process.

evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results

Field | Data Type | Description | Supported Values | Automl Enabled
----- | --------- | ----------- | ---------------- | --------------
checkpoint | string | The path to the model checkpoint to evaluate | | False
results_dir | string | The directory to save evaluation results | |
num_gpus | unsigned int | The number of GPUs to use for distributed evaluation | >0 |
gpu_ids | List[int] | The indices of the GPUs to use for distributed evaluation | |
trt_engine | string | The path to the TensorRT model to evaluate. Only used with TAO Deploy. | |
Note

The evaluation pipeline only supports the checkpoints from the finetune stage.
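Before launching evaluation, you can fetch the spec for the evaluate action and point it at a specific finetuned checkpoint. This sketch assumes the get-spec pattern shown for the train action also applies to the evaluate action and that jq is available; the checkpoint path is illustrative:

SPECS=$(tao-client mae get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)
SPECS=$(echo "$SPECS" | jq '.evaluate.checkpoint = "/path/to/finetuned_model.pth"')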

To run evaluation with an MAE model, use this command:

EVAL_JOB_ID=$(tao-client mae experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Running Inference with an MAE Model#

inference#

The inference parameter defines the hyperparameters of the inference process.

inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1
  gpu_ids: [0]
  results_dir: /path/to/results

Field | Data Type | Description | Supported Values | Automl Enabled
----- | --------- | ----------- | ---------------- | --------------
checkpoint | string | The path to the model checkpoint to run inference with | | False
results_dir | string | The directory to save inference results | |
num_gpus | unsigned int | The number of GPUs to use for distributed inference | >0 |
gpu_ids | List[int] | The indices of the GPUs to use for distributed inference | |
trt_engine | string | The path to the TensorRT model to run inference with. Only used with TAO Deploy. | |
Note

The inference pipeline only supports the checkpoints from the finetune stage.

To run inference with an MAE model, use this command:

INFERENCE_JOB_ID=$(tao-client mae experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Exporting the Model#

export#

The export parameter defines the hyperparameters for exporting the model.

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 960
  input_height: 544
  batch_size: -1

Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
checkpoint | string | | The path to the PyTorch model to export |
onnx_file | string | | The path to the .onnx file |
on_cpu | bool | True | If True, the DMHA module is exported as standard PyTorch; if False, it is exported using the TensorRT plugin. | True, False
opset_version | unsigned int | 12 | The opset version of the exported ONNX model | >0
input_channel | unsigned int | 3 | The input channel size. Only the value 3 is supported. | 3
input_width | unsigned int | 960 | The input width | >0
input_height | unsigned int | 544 | The input height | >0
batch_size | int | -1 | The batch size of the ONNX model. Set this value to -1 to use a dynamic batch size. | >=-1

Note

The export pipeline supports checkpoints from both the pretrain and finetune stages. When exporting a finetune-stage model, the output tensor contains the classification logits; when exporting a pretrain-stage model, the output tensor contains the backbone features before the classification head.
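Because MAE models are trained on square crops (input_size is 224 by default), you may want the export resolution to match the training resolution rather than the 960x544 values shown above. A sketch using only the documented fields (paths are illustrative):

export:
  checkpoint: /path/to/model.pth
  onnx_file: /path/to/model.onnx
  on_cpu: False
  opset_version: 12
  input_channel: 3
  input_width: 224
  input_height: 224
  batch_size: -1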

To export an MAE model, use this command:

EXPORT_JOB_ID=$(tao-client mae experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

TensorRT Engine Generation#

For deployment, refer to the TAO Deploy documentation.