The Vision Transformer, commonly referred to as ViT (Paper), is a foundation model for image classification tasks in NeMo Multimodal. It applies a transformer architecture to sequences of image patches rather than relying on traditional convolutional neural networks. In ViT, an image is divided into fixed-size patches (typically 14x14 or 16x16 pixels), which are linearly embedded and augmented with position embeddings. The resulting sequence of vectors is fed into a standard transformer encoder. To enable classification, a learnable “classification token” is prepended to the sequence, and its final representation is used to predict the image class.
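To make the patch-embedding and classification-token flow described above concrete, the following is a minimal PyTorch sketch. The class name `SimpleViT` and the ViT-B/16-style hyperparameters are illustrative assumptions; this is not the NeMo implementation.

```python
# Minimal sketch of the ViT forward pass described above, in plain PyTorch.
# Hyperparameters (patch_size=16, dim=768, depth=12, heads=12) follow a
# ViT-B/16-style configuration and are assumptions for illustration only.
import torch
import torch.nn as nn


class SimpleViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 dim=768, depth=12, heads=12, mlp_dim=3072, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Linear patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(channels, dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable classification token and position embeddings (patches + [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Standard transformer encoder.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        # Classification head applied to the final [CLS] representation.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # images: (B, C, H, W)
        x = self.patch_embed(images)              # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                 # logits from the [CLS] token


logits = SimpleViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

The feature matrix below lists which parallelism, precision, and deployment options are supported for ViT training and inference.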
Feature | Training | Inference
---|---|---
Data parallelism | Yes | N/A
Tensor parallelism | Yes | Yes
Pipeline parallelism | No | No
Sequence parallelism | No | No
Activation checkpointing | Yes (Uniform or Block) | No
FP32/TF32 | Yes | Yes (FP16 enabled by default)
AMP/FP16 | No | Yes
AMP/BF16 | Yes | No
BF16 O2 | Yes | No
TransformerEngine/FP8 | No | No
Multi-GPU | Yes | Yes
Multi-Node | Yes | Yes
Inference deployment | N/A | NVIDIA Triton
SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform
NVfuser | No | N/A
Distributed Optimizer | Yes | N/A
TorchInductor | No | N/A
Flash Attention | Yes | N/A
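As a rough illustration of two of the training-time features in the table above, the snippet below shows what activation checkpointing and BF16 mixed precision look like using plain PyTorch APIs, reusing the `SimpleViT` sketch from earlier. This is an assumption-laden demonstration only; NeMo Multimodal enables the equivalent features through its training configuration rather than hand-written loops like this.

```python
# Illustrative only: activation checkpointing and BF16 autocast with standard
# PyTorch APIs, applied to the SimpleViT sketch above. NeMo drives the
# equivalent features ("Activation checkpointing", "AMP/BF16") via its configs.
import torch
from torch.utils.checkpoint import checkpoint

model = SimpleViT().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(8, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (8,), device="cuda")

# BF16 autocast for the forward pass (the "AMP/BF16" row above).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    patches = model.patch_embed(images).flatten(2).transpose(1, 2)
    cls = model.cls_token.expand(patches.shape[0], -1, -1)
    x = torch.cat([cls, patches], dim=1) + model.pos_embed
    # Recompute encoder activations during backward instead of storing them
    # (the "Activation checkpointing" row above).
    x = checkpoint(model.encoder, x, use_reentrant=False)
    loss = torch.nn.functional.cross_entropy(model.head(x[:, 0]), labels)

loss.backward()
optimizer.step()
```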