CLIP

Contrastive Language-Image Pre-training (CLIP) offers an efficient method for learning image representations using natural language supervision. In essence, CLIP jointly trains an image encoder and a text encoder from scratch, with the objective of predicting the correct pairings within a batch of (image, text) training examples.

During pre-training, the model learns to predict which images and texts form semantically coherent pairs by maximizing the similarity between correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning objective drives CLIP to learn meaningful, contextually rich representations of both visual and textual data.
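The sketch below illustrates this symmetric contrastive objective for a single batch. It is a minimal, framework-agnostic PyTorch example, not the NeMo implementation: image_embeds and text_embeds stand in for the outputs of the two encoders, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching pairs lie on the diagonal; treat each row/column as a
    # classification problem over the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```

Maximizing the diagonal entries while minimizing the off-diagonal ones is what pulls paired image and text embeddings together in the shared embedding space.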

After pre-training, CLIP models can be fine-tuned for specialized downstream tasks or used directly for zero-shot learning. For instance, Stable Diffusion uses the learned text encoder to embed captions into high-level representations that condition image generation. This approach to joint image and text representation learning has proven effective across a wide range of applications.
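As a rough illustration of zero-shot use, the sketch below classifies an image by comparing its embedding against embeddings of natural-language prompts for each candidate label. The names image_encoder, text_encoder, and tokenize are hypothetical placeholders for a trained CLIP model's components, not a specific NeMo API, and the prompt template is only one common choice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Embed candidate labels via natural-language prompts.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeds = F.normalize(text_encoder(tokenize(prompts)), dim=-1)

    # Embed the image and compare against every class prompt.
    image_embed = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    similarity = (image_embed @ text_embeds.t()).softmax(dim=-1)

    # The most similar prompt gives the predicted class.
    return class_names[similarity.argmax().item()]
```

No task-specific training is needed here; the class set is defined entirely by the text prompts supplied at inference time.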

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps / Base Command Manager / Base Command Platform | Slurm DeepOps / Base Command Manager / Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |