Vision Transformer

The Vision Transformer, commonly referred to as ViT (Paper), is a foundation model for image classification tasks in NeMo Multimodal. It uses a Transformer-based architecture to process image patches rather than relying on traditional convolutional neural networks. In ViT, an image is divided into fixed-size patches (usually 14x14 or 16x16 pixels), which are linearly embedded and augmented with position embeddings. A learnable “classification token” is prepended to the resulting sequence of vectors, and the sequence is fed into a standard Transformer encoder; the final hidden state of the classification token is used for classification.
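
The PyTorch sketch below illustrates the forward pass just described: patchify, linearly embed, add position embeddings and a classification token, run a Transformer encoder, and classify from that token. It is a minimal, self-contained illustration rather than the NeMo implementation; the hyperparameters (16x16 patches, 768-dim hidden size, 12 layers, 12 heads) are only assumed to mirror a ViT-B/16-style configuration.

```python
import torch
import torch.nn as nn


class MinimalViT(nn.Module):
    """Minimal ViT sketch: patch embedding + [CLS] token + Transformer encoder."""

    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 hidden=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into
        # non-overlapping patches and linearly projecting each one.
        self.patch_embed = nn.Conv2d(in_chans, hidden,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable classification token and position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        x = self.patch_embed(images)                # (B, hidden, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)            # (B, num_patches, hidden)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                   # classify from the [CLS] token


logits = MinimalViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
print(logits.shape)
```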

| Feature                  | Training                                                   | Inference                                                  |
|--------------------------|------------------------------------------------------------|------------------------------------------------------------|
| Data parallelism         | Yes                                                        | N/A                                                        |
| Tensor parallelism       | Yes                                                        | Yes                                                        |
| Pipeline parallelism     | No                                                         | No                                                         |
| Sequence parallelism     | No                                                         | No                                                         |
| Activation checkpointing | Yes (Uniform or Block)                                     | No                                                         |
| FP32/TF32                | Yes                                                        | Yes (FP16 enabled by default)                              |
| AMP/FP16                 | No                                                         | Yes                                                        |
| AMP/BF16                 | Yes                                                        | No                                                         |
| BF16 O2                  | Yes                                                        | No                                                         |
| TransformerEngine/FP8    | No                                                         | No                                                         |
| Multi-GPU                | Yes                                                        | Yes                                                        |
| Multi-Node               | Yes                                                        | Yes                                                        |
| Inference deployment     | N/A                                                        | NVIDIA Triton                                              |
| SW stack support         | Slurm DeepOps/Base Command Manager/Base Command Platform   | Slurm DeepOps/Base Command Manager/Base Command Platform   |
| NVfuser                  | No                                                         | N/A                                                        |
| Distributed Optimizer    | Yes                                                        | N/A                                                        |
| TorchInductor            | No                                                         | N/A                                                        |
| Flash Attention          | Yes                                                        | N/A                                                        |
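
Several of the training-side features in the table are toggled through the NeMo/Hydra configuration. The sketch below builds an OmegaConf override tree for a few of them; the key names (tensor_model_parallel_size, activations_checkpoint_method, trainer.precision, and so on) are assumptions modelled on Megatron-style NeMo configs and should be verified against the ViT configuration files shipped with your NeMo release.

```python
from omegaconf import OmegaConf

# Assumed, Megatron-style override keys -- verify against the actual
# ViT classification config in your NeMo release before use.
overrides = OmegaConf.create({
    "trainer": {
        "devices": 8,          # Multi-GPU
        "num_nodes": 2,        # Multi-Node
        "precision": "bf16",   # AMP/BF16 (FP16 training is not supported)
    },
    "model": {
        "tensor_model_parallel_size": 2,    # tensor parallelism: supported
        "pipeline_model_parallel_size": 1,  # pipeline parallelism: not supported
        "activations_checkpoint_granularity": "full",
        "activations_checkpoint_method": "uniform",  # or "block"
    },
})
print(OmegaConf.to_yaml(overrides))
```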