CLIP

Contrastive Language-Image Pre-training (CLIP) offers an efficient method for learning image representations using natural language supervision. In essence, CLIP jointly trains an image encoder and a text encoder from scratch, with the objective of predicting the correct pairings within a batch of (image, text) training examples.

During pre-training, the model learns to predict which images and texts form semantically coherent pairs by maximizing the similarity between correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning objective drives CLIP to learn meaningful, contextually rich representations of both visual and textual data.
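The sketch below illustrates this symmetric contrastive objective for a single batch. It is a minimal, framework-agnostic PyTorch example, not the NeMo implementation: image_embeds and text_embeds stand in for the outputs of the two encoders, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching pairs lie on the diagonal; treat each row/column as a
    # classification problem over the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```

Maximizing the diagonal entries while minimizing the off-diagonal ones is what pulls paired image and text embeddings together in the shared embedding space.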

After pre-training, CLIP models can be fine-tuned for specialized downstream tasks or used directly for zero-shot learning. For instance, Stable Diffusion uses the learned text encoder to embed captions into high-level representations that condition image generation. This approach to joint image and text representation learning has proven effective across a wide range of applications.
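As a rough illustration of zero-shot use, the sketch below classifies an image by comparing its embedding against embeddings of natural-language prompts for each candidate label. The names image_encoder, text_encoder, and tokenize are hypothetical placeholders for a trained CLIP model's components, not a specific NeMo API, and the prompt template is only one common choice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Embed candidate labels via natural-language prompts.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embeds = F.normalize(text_encoder(tokenize(prompts)), dim=-1)

    # Embed the image and compare against every class prompt.
    image_embed = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    similarity = (image_embed @ text_embeds.t()).softmax(dim=-1)

    # The most similar prompt gives the predicted class.
    return class_names[similarity.argmax().item()]
```

No task-specific training is needed here; the class set is defined entirely by the text prompts supplied at inference time.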

| Feature | Training | Inference |
|---|---|---|
| Data parallelism | Yes | N/A |
| Tensor parallelism | Yes | Yes |
| Pipeline parallelism | No | No |
| Sequence parallelism | No | No |
| Activation checkpointing | Yes (Uniform or Block) | No |
| FP32/TF32 | Yes | Yes (FP16 enabled by default) |
| AMP/FP16 | No | Yes |
| AMP/BF16 | Yes | No |
| BF16 O2 | Yes | No |
| TransformerEngine/FP8 | No | No |
| Multi-GPU | Yes | Yes |
| Multi-Node | Yes | Yes |
| Inference deployment | N/A | NVIDIA Triton supported |
| SW stack support | Slurm DeepOps / Base Command Manager / Base Command Platform | Slurm DeepOps / Base Command Manager / Base Command Platform |
| NVfuser | No | N/A |
| Distributed Optimizer | Yes | N/A |
| TorchInductor | No | N/A |
| Flash Attention | Yes | N/A |