Vision Transformer

The Vision Transformer, commonly referred to as ViT (Paper), is a foundation model for image classification tasks in NeMo Multimodal. Instead of relying on a traditional convolutional neural network, it applies a transformer architecture to image patches. In ViT, an image is divided into fixed-size patches (usually 14x14 or 16x16 pixels). Each patch is flattened and linearly embedded, and position embeddings are added so that the model retains the spatial order of the patches. The resulting sequence of vectors is fed into a standard transformer encoder. To enable classification, a learnable “classification token” is prepended to the sequence, and its output from the encoder serves as the image representation for the classifier.
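
The input pipeline described above can be illustrated with a small, dependency-free sketch. This is not NeMo code: the function names (`patchify`, `linear`, `vit_input_sequence`), the tiny sizes, and the randomly initialized weights are all assumptions made for illustration, standing in for parameters that a real model would learn. The sketch follows the ordering used in the original ViT paper, where the classification token is prepended first and position embeddings are then added to every token.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches

def linear(vector, weight):
    """Project a vector with a (len(vector) x dim) weight matrix."""
    dim = len(weight[0])
    return [sum(v * w_row[j] for v, w_row in zip(vector, weight)) for j in range(dim)]

def vit_input_sequence(image, patch_size, embed_dim, seed=0):
    """Build the token sequence that would be fed to the transformer encoder."""
    rng = random.Random(seed)
    patches = patchify(image, patch_size)
    patch_dim = len(patches[0])
    # Random values stand in for learned parameters (illustrative assumption).
    weight = [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in range(patch_dim)]
    cls_token = [rng.gauss(0, 0.02) for _ in range(embed_dim)]
    pos_embed = [[rng.gauss(0, 0.02) for _ in range(embed_dim)]
                 for _ in range(len(patches) + 1)]
    # Prepend the classification token, then add position embeddings to every token.
    tokens = [cls_token] + [linear(p, weight) for p in patches]
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# A 32x32 RGB image with 16x16 patches gives 4 patches plus 1 classification token.
image = [[[0.5, 0.5, 0.5] for _ in range(32)] for _ in range(32)]
sequence = vit_input_sequence(image, patch_size=16, embed_dim=8)
print(len(sequence), len(sequence[0]))  # 5 8
```

By the same arithmetic, a 224x224 image with 16x16 patches would yield 196 patches, so the encoder would see 197 tokens including the classification token.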




| Feature | Training | Inference |
| --- | --- | --- |
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.