Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
CLIP
Contrastive Language-Image Pre-training (CLIP) offers an efficient method for learning image representations using natural language supervision. In essence, CLIP trains an image encoder and a text encoder jointly from scratch, with the goal of predicting the correct pairings within a batch of (image, text) training examples.
During pre-training, the model learns to predict which images and texts form semantically coherent pairs by maximizing the similarity between correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning approach enables CLIP to learn meaningful, contextually rich representations of both visual and textual data.
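This objective can be sketched in a few lines of PyTorch: compute a pairwise similarity matrix between the batch's image and text embeddings, then apply a symmetric cross-entropy in which the matching pairs on the diagonal are the correct classes. The snippet below is a minimal illustration of the idea, not NeMo's implementation; the embedding dimension, batch size, and temperature value are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching (image, text) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: classify the correct text for each image
    # and the correct image for each text.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings for a batch of 8 pairs (dimensions assumed).
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```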
Upon completion of pre-training, CLIP models can be fine-tuned for specialized downstream tasks or used directly for zero-shot learning. For instance, in Stable Diffusion, CLIP's learned text encoder embeds captions into the high-level representations that condition image generation. This approach facilitates seamless image and text representation learning and has proven effective across a diverse range of applications.
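Zero-shot classification follows directly from the same similarity computation: embed one text prompt per candidate class, embed the image, and pick the class whose prompt is most similar. The sketch below is illustrative only; `encode_image` and `encode_text` are hypothetical stand-ins for a trained CLIP model's encoders, not a NeMo API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Pick the class whose text prompt best matches the image.

    `encode_image` / `encode_text` are hypothetical callables standing
    in for a trained CLIP model's encoders.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    image_emb = F.normalize(encode_image(image), dim=-1)  # shape (1, D)
    text_emb = F.normalize(encode_text(prompts), dim=-1)  # shape (C, D)

    # Cosine similarity between the image and every class prompt.
    sims = (image_emb @ text_emb.t()).squeeze(0)          # shape (C,)
    probs = sims.softmax(dim=-1)
    return class_names[int(probs.argmax())], probs

# Dummy encoders with 512-dim outputs, just to show the call pattern.
enc_img = lambda img: torch.randn(1, 512)
enc_txt = lambda txts: torch.randn(len(txts), 512)
label, probs = zero_shot_classify(
    torch.zeros(3, 224, 224), ["cat", "dog", "plane"], enc_img, enc_txt
)
```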
Important
NeMo CLIP now uses a Megatron Core-based implementation.
| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | No | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |