
Vision Transformer

The Vision Transformer, commonly referred to as ViT (Dosovitskiy et al., 2021), is a foundation model for image classification tasks in NeMo Multimodal. Rather than relying on traditional convolutional neural networks, it processes images with a transformer architecture: the image is divided into fixed-size patches (usually 14x14 or 16x16 pixels), which are linearly embedded and augmented with position embeddings. The resulting sequence of vectors is fed into a standard transformer encoder. To enable classification, a learnable “classification token” is prepended to the sequence, and its final hidden state is used to predict the image class.
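The patch-embedding step can be made concrete with a minimal PyTorch sketch. This is not NeMo’s implementation; the class and parameter names are illustrative, with default sizes matching the ViT-Base configuration (224x224 images, 16x16 patches, hidden dimension 768):

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Patch embedding + [CLS] token + position embeddings, as described in the ViT paper."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_dim=768):
        super().__init__()
        assert image_size % patch_size == 0
        num_patches = (image_size // patch_size) ** 2  # (224 / 16)^2 = 196
        # A strided convolution is equivalent to slicing the image into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

    def forward(self, images):                   # images: (B, C, H, W)
        x = self.proj(images)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the learnable classification token
        return x + self.pos_embed                # add position embeddings

embed = ViTEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -- 196 patches + 1 [CLS] token
```

The resulting sequence is what gets fed into the standard transformer encoder; only the encoder output at the [CLS] position is passed to the classification head.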

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton |
| SW stack support | Slurm DeepOps / Base Command Manager / Base Command Platform | Slurm DeepOps / Base Command Manager / Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
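To make two of the training-side rows concrete, the sketch below combines BF16 autocast (the AMP/BF16 row) with activation checkpointing on a single transformer layer. It uses plain PyTorch rather than NeMo’s internals, and assumes a CUDA device is available:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).cuda()
tokens = torch.randn(2, 197, 768, device="cuda", requires_grad=True)

# Autocast runs the layer in BF16; checkpoint() discards intermediate
# activations in the forward pass and recomputes them during backward,
# trading extra compute for reduced activation memory.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = checkpoint(layer, tokens, use_reentrant=False)
out.float().sum().backward()
```

In NeMo, the analogous behavior is controlled through the training configuration (e.g., selecting BF16 precision and a uniform or block checkpointing scheme) rather than written by hand as above.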