
Contrastive Language-Image Pre-training (CLIP) offers an efficient method for learning image representations using natural language supervision. In essence, CLIP trains an image encoder and a text encoder from scratch, jointly, so that the model learns to predict the correct pairings within a batch of (image, text) training examples.

During pre-training, the model learns to predict which images and texts form a semantically coherent pair by maximizing the similarity between correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive objective encourages CLIP to learn meaningful, contextually rich representations of both visual and textual data.
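This objective is typically implemented as a symmetric cross-entropy over a batch-wise similarity matrix. The sketch below is a minimal illustration of that idea, not NeMo's implementation: the function name is hypothetical, the temperature is fixed here for simplicity (CLIP actually learns it as a parameter), and the feature tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs."""
    # Normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # logits[i, j] = similarity between image i and text j.
    logits = image_features @ text_features.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_img = F.cross_entropy(logits, targets)
    loss_txt = F.cross_entropy(logits.t(), targets)
    return (loss_img + loss_txt) / 2
```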

Upon completion of the pre-training phase, CLIP models can be fine-tuned for specialized downstream tasks or employed directly for zero-shot learning. For instance, Stable Diffusion uses CLIP's learned text encoder to embed captions into high-level representations that condition image generation. This approach enables seamless image and text representation learning and has proven effective across a diverse range of applications.
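Zero-shot classification follows directly from the shared embedding space: class names are embedded as text prompts and compared against the image embedding. The snippet below is a minimal sketch under that assumption; image_encoder and text_encoder are hypothetical stand-ins for the trained CLIP encoders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_features, class_text_features):
    """Pick the class whose prompt embedding is most similar to the image."""
    image_features = F.normalize(image_features, dim=-1)
    class_text_features = F.normalize(class_text_features, dim=-1)
    similarity = image_features @ class_text_features.t()
    return similarity.argmax(dim=-1)

# Usage sketch: embed prompts like "a photo of a {label}" with the text
# encoder, embed query images with the image encoder, then compare:
# preds = zero_shot_classify(image_encoder(images), text_encoder(prompts))
```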




The table below summarizes feature support for CLIP in training and inference:

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |
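As a rough illustration of how the training-side entries above map onto configuration, the sketch below assumes a PyTorch Lightning-style Trainer (the framework NeMo builds on). The device and node counts are illustrative, exact flag names vary across Lightning versions, and Megatron-specific features such as tensor parallelism and the distributed optimizer are configured separately in NeMo's model config rather than shown here.

```python
import pytorch_lightning as pl

# Illustrative only: multi-GPU, multi-node, BF16 mixed-precision training,
# matching "Yes" entries in the training column of the table above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,          # Multi-GPU: 8 GPUs per node (data parallelism via DDP)
    num_nodes=2,        # Multi-Node training
    precision="bf16",   # AMP/BF16; the table lists AMP/FP16 as unsupported for training
    strategy="ddp",     # distributed data parallel across all GPUs
    max_steps=1000,
)
```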