
Vision Transformer

The Vision Transformer, commonly referred to as ViT (Dosovitskiy et al., 2021), is a foundation model for image classification tasks in NeMo Multimodal. Rather than relying on traditional convolutional neural networks, it processes images with a transformer architecture: the image is divided into fixed-size patches (usually 14x14 or 16x16 pixels), which are linearly embedded and augmented with position embeddings. The resulting sequence of vectors is fed into a standard transformer encoder. To enable classification, a learnable “classification token” is prepended to the sequence, and its final hidden state is used to predict the image class.
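The patch-embedding step can be made concrete with a minimal PyTorch sketch. This is not NeMo’s implementation; the class and parameter names are illustrative, with default sizes matching the ViT-Base configuration (224x224 images, 16x16 patches, hidden dimension 768):

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Patch embedding + [CLS] token + position embeddings, as described in the ViT paper."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_dim=768):
        super().__init__()
        assert image_size % patch_size == 0
        num_patches = (image_size // patch_size) ** 2  # (224 / 16)^2 = 196
        # A strided convolution is equivalent to slicing the image into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

    def forward(self, images):                   # images: (B, C, H, W)
        x = self.proj(images)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the learnable classification token
        return x + self.pos_embed                # add position embeddings

embed = ViTEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -- 196 patches + 1 [CLS] token
```

The resulting sequence is what gets fed into the standard transformer encoder; only the encoder output at the [CLS] position is passed to the classification head.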

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton |
| SW stack support | Slurm DeepOps / Base Command Manager / Base Command Platform | Slurm DeepOps / Base Command Manager / Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
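To make two of the training-side rows concrete, the sketch below combines BF16 autocast (the AMP/BF16 row) with activation checkpointing on a single transformer layer. It uses plain PyTorch rather than NeMo’s internals, and assumes a CUDA device is available:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).cuda()
tokens = torch.randn(2, 197, 768, device="cuda", requires_grad=True)

# Autocast runs the layer in BF16; checkpoint() discards intermediate
# activations in the forward pass and recomputes them during backward,
# trading extra compute for reduced activation memory.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = checkpoint(layer, tokens, use_reentrant=False)
out.float().sum().backward()
```

In NeMo, the analogous behavior is controlled through the training configuration (e.g., selecting BF16 precision and a uniform or block checkpointing scheme) rather than written by hand as above.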