The Vision Transformer, commonly referred to as ViT [VISION-MODELS1], serves as a foundational model for image classification tasks in NeMo. Unlike conventional convolutional neural networks, ViT adopts a transformer-like architecture to process image data. In this approach, an image is divided into fixed-size patches, typically 14x14 or 16x16. These patches are linearly embedded and augmented with position embeddings. The resulting sequence of vectors is passed through a standard transformer encoder. In order to facilitate classification, a “classification token” that is learnable is incorporated into the sequence.


ViT models can be instantiated using the MegatronVitClassificationModel class.

Transformer Encoder

NeMo’s implementation of the ViT model leverages its parallel transformer implementation, specifically the nemo.collections.nlp.modules.common.megatron.transformer.ParallelTransformer, to enable model parallelism support in the transformer encoder. This design choice ensures efficient scaling and utilization of resources during training.

Model Model size (M) Hidden size FFN_dim Attention heads Number of layers PatchDim Num Batches (Seq)
B/16 86 768 3072 12 12 16 204
L/16 303 1024 4096 16 24 16 204
H/16 632 1280 5120 16 32 16 204
H/14 632 1280 5120 16 32 14 264
g/14 1011 1408 6144 16 40 14 264
G/14 1843 1664 8192 16 48 14 264

Transformer Encoder


encoder_seq_length: 196 max_position_embeddings: ${.encoder_seq_length} num_layers: 12 hidden_size: 768 ffn_hidden_size: 3072 num_attention_heads: 12 hidden_dropout: 0.1 attention_dropout: 0.

  • encoder_seq_length: Sequence length for the transformer encoder.

  • num_layers, hidden_size, ffn_hidden_size, num_attention_heads: Parameters defining the architecture of the text transformer. The ffn_hidden_size is typically 4 times the hidden_size.

  • hidden_dropout and attention_dropout: Dropout probabilities for the hidden state and attention in the transformer respectively.

Patch & Positional Embedding


vision_pretraining_type: "classify" num_classes: 1000 patch_dim: 16 img_h: 224 img_w: 224 num_channels: 3

  • vision_pretraining_type: Type of MLP head, with support limited to classification tasks now

  • num_classes: Number of labels used for classification

  • patch_dim: Size of the patches the image is divided into.

  • img_h and img_w: Height and width of the input images.

  • num_channels: Number of channels in the input image (e.g., 3 for RGB images).




To Enable

Data parallelism Dataset is read concurrently across multiple GPUs or nodes, allowing for faster data loading and processing. Automatically when training on multi GPUs/nodes
Tensor parallelism Each tensor is split up into multiple chunks, allowing for horizontal parallelism across GPUs. This technique, known as TensorParallel (TP), distributes the model’s tensors across multiple GPUs. During processing, each shard gets processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This approach is inspired by NVIDIA’s Megatron implementation. [Reference]( model.tensor_model_parallel_size
Activation Checkpointing To reduce memory usage, activations of certain layers are cleared and recomputed during a backward pass. This technique is particularly useful for training large models that wouldn’t fit in GPU memory using traditional methods. model.activations_checkpoint_granularity=full, model.activations_checkpoint_method=block, model.activations_checkpoint_num_layers={num_layers_to_check}
Bfloat16 Training Training is conducted in Bfloat16 precision, which offers a balance between the higher precision of FP32 and the memory savings and speed of FP16. trainer.precision=bf16
BF16 O2 Enables O2-level automatic mixed precision, optimizing Bfloat16 precision for better performance. model.megatron_amp_O2=True
Distributed Optimizer The optimization process is distributed across multiple GPUs, reducing memory requirements. This technique distributes the optimizer state across data parallel ranks, rather than replicating it, offering significant memory savings. This approach is inspired by the ZeRO optimization described in the paper “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models” and implemented in NVIDIA’s Megatron. [Reference]("distributed_fused_adam"
Flash Attention V2 FlashAttention is a fast and memory-efficient algorithm to compute exact attention. It speeds up model training and reduces memory requirement by being IO-aware. This approach is particularly useful for large-scale models and is detailed further in the repository linked. [Reference]( model.use_flash_attention=True

Model Training

Below are the highlights of the training and fine-tuning recipe we used:


Model: ViT B/16 Dataset: ImageNet 1K Pretraining: Epochs: 300 Batch Size: 4096 Training Resolution: 224 Optimizer: Adam (0.9, 0.999) Base Learning Rate: 3.00E-03 Learning Rate Decay: Cosine Weight Decay: 0.3 Dropout: 0.1 Fine-tuning: Steps: 20,000 Batch Size: 512 Fine-tuning Resolution: 512 Optimizer: SGD (0.9) Base Learning Rate: 0.003 - 0.06 Learning Rate Decay: Cosine Weight Decay: 0

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2022. arXiv:2205.11487.

Previous Checkpoints
Next Common Collection
© Copyright 2023-2024, NVIDIA. Last updated on Apr 12, 2024.