CLIP#
The `Contrastive Language-Image Pre-training (CLIP) <https://arxiv.org/pdf/2103.00020.pdf>`_ paper offers an efficient method for learning image representations using natural language supervision. In essence, CLIP trains both an image encoder and a text encoder from scratch. The goal is to predict the correct pairings of a batch of (image, text) training examples by jointly training these encoders.
During pre-training, the model is designed to predict which images and texts form a semantically coherent pair by maximizing the similarity between the correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning approach ensures that CLIP learns meaningful and contextually rich representations of both visual and textual data.
Upon completion of the pre-training phase, CLIP models can be fine-tuned for specialized downstream tasks or directly employed for zero-shot learning. This approach facilitates seamless image and text representation learning and has demonstrated exceptional effectiveness across a diverse range of applications.
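The contrastive objective described above can be written compactly as a symmetric cross-entropy over a batch of image and text embeddings. The following sketch is purely illustrative PyTorch, not the NeMo implementation; the embedding shapes and the fixed temperature value are assumptions (CLIP actually learns the temperature as a parameter).

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        """Symmetric contrastive (InfoNCE) loss over a batch of (image, text) pairs.

        image_features, text_features: [batch, dim] outputs of the two encoders.
        """
        # Normalize so that dot products become cosine similarities
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Similarity matrix: entry (i, j) compares image i with text j
        logits = image_features @ text_features.t() / temperature

        # The correct pairing for each image/text lies on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)

        # Maximize similarity for correct pairs and minimize it for all
        # other pairings, symmetrically over the image and text axes
        loss_images = F.cross_entropy(logits, targets)
        loss_texts = F.cross_entropy(logits.t(), targets)
        return (loss_images + loss_texts) / 2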
To get started with CLIP, follow these steps.
Import from Hugging Face to NeMo 2.0#
The following script downloads the CLIP checkpoint and converts it to NeMo format. The converted checkpoint is stored in the NeMo cache folder at ``~/.cache/nemo``. For example, when used with the NeMo container, the full path is ``/root/.cache/nemo/models/openai/clip-vit-large-patch14``.

The converted checkpoint can be used to initialize the Vision Language Model (VLM) and serves as the starting point for Supervised Fine-Tuning (SFT) of the CLIP model.
.. code-block:: python

    from nemo.collections.llm import import_ckpt
    from nemo.collections import vlm
    from nemo.collections.vlm import ClipConfigL14

    if __name__ == "__main__":
        # Specify the Hugging Face model ID
        hf_model_id = "hf://openai/clip-vit-large-patch14"

        # Import the model and convert to NeMo 2.0 format
        import_ckpt(
            model=vlm.CLIPModel(ClipConfigL14()),  # Model configuration
            source=f"{hf_model_id}",               # Hugging Face model source
        )
NeMo 2.0 Pretraining Recipes#
We provide a default recipe for pretraining CLIP, clip_b32:
from nemo.collections import vlm
pretrain = vlm.clip_b32.pretrain_recipe(
    name="clip_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)
Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Note

The recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
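Because the recipe is such a configuration object, you can customize it by assigning to its attributes before execution; replacing the data module with a real dataset is shown in the Energon section below. The trainer fields in this sketch are illustrative assumptions, not values taken from the recipe itself.

.. code-block:: python

    # `pretrain` is the recipe object created above. Override fields with plain
    # attribute assignment before launching; the field names below are assumptions,
    # so inspect your recipe for the exact attributes it exposes.
    pretrain.trainer.max_steps = 10_000
    pretrain.trainer.val_check_interval = 1_000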
Once you have your final configuration ready, you can execute it using any of the NeMo-Run supported executors. The simplest option is the local executor, which runs the pretraining locally in a separate process. You can use it as follows:

.. code-block:: python

    import nemo_run as run

    run.run(pretrain, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:

.. code-block:: python

    run.run(pretrain, direct=True)
Use the Energon Dataloader#
Given a dataset in WebDataset format, you can use the Energon data loader to prepare the data for use with CLIP. Run the following command from the <data_root> directory to convert the WebDataset format to the Energon format:

.. code-block:: bash

    energon prepare .

When prompted, select CrudeSample as your sample class.
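If your image-text pairs are not yet packaged as WebDataset shards, a short script can create them before you run energon prepare. The sketch below is illustrative only: it uses the third-party webdataset package (not part of NeMo), and the sample list, keys, and shard size are assumptions you should adapt to your data.

.. code-block:: python

    import webdataset as wds

    # Hypothetical list of (image_path, caption) pairs to pack into shards.
    samples = [
        ("images/000001.jpg", "a photo of a cat"),
        ("images/000002.jpg", "a photo of a dog"),
    ]

    # Run this from the <data_root> directory so the shards land where
    # `energon prepare .` expects them. Each shard holds up to 10,000 samples.
    with wds.ShardWriter("shard-%06d.tar", maxcount=10_000) as sink:
        for idx, (image_path, caption) in enumerate(samples):
            with open(image_path, "rb") as f:
                image_bytes = f.read()
            sink.write(
                {
                    "__key__": f"sample{idx:06d}",  # unique key per sample
                    "jpg": image_bytes,             # raw image bytes
                    "txt": caption,                 # paired caption text
                }
            )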
Below is an example of how to set up the Energon data module for CLIP training:
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule
from nemo.collections.vlm.clip.data.clip_data_module import ClipTaskEncoder
# Paths and configuration
data_path = "<path_to_dataset>"
text_seq_length = 80
mbs = 500
gbs = 4000
num_workers = 16
# Load the task encoder for train and validation
train_task_encoder = ClipTaskEncoder(max_length=text_seq_length)
valid_task_encoder = ClipTaskEncoder(max_length=text_seq_length, is_train=False)
data = EnergonMultiModalDataModule(
    data_path,
    seq_length=text_seq_length,
    image_processor=None,
    micro_batch_size=mbs,
    global_batch_size=gbs,
    num_workers=num_workers,
    task_encoder=train_task_encoder,
    tokenizer=train_task_encoder.tokenizer,
    validation_task_encoder=valid_task_encoder,
    image_decode="pil",
    ignore_decoder_errors=True,
)
Replace the MockDataModule in the default recipes with the above data module.
from nemo.collections import vlm
# Define the pretraining recipe
pretrain = vlm.clip_b32.pretrain_recipe(
    name="clip_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)
# Assign the data module defined above to the recipe
pretrain.data = data
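As before, you can launch the updated recipe with any of the supported NeMo-Run executors, for example the local executor:

.. code-block:: python

    import nemo_run as run

    run.run(pretrain, executor=run.LocalExecutor())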
We have also included additional example scripts to further customize CLIP training and inference:
Inference with Hugging Face and NeMo: clip_infer.py
Pretraining: clip_pretrain.py
These scripts allow for flexible and comprehensive training workflows tailored to your requirements. For example, to perform SFT, you can use clip_pretrain.py and pass restore_path as the path to the checkpoint obtained after the Hugging Face conversion to NeMo 2.0.