CLIP#
The `Contrastive Language-Image Pre-training (CLIP) <https://arxiv.org/pdf/2103.00020.pdf>`_ paper offers an efficient method for learning image representations using natural language supervision. In essence, CLIP trains both an image encoder and a text encoder from scratch. The goal is to predict the correct pairings of a batch of (image, text) training examples by jointly training these encoders.
During pre-training, the model is designed to predict which images and texts form a semantically coherent pair by maximizing the similarity between the correct (image, text) pairs while minimizing the similarity between incorrect pairs. This contrastive learning approach ensures that CLIP learns meaningful and contextually rich representations of both visual and textual data.
Upon completion of the pre-training phase, CLIP models can be fine-tuned for specialized downstream tasks or directly employed for zero-shot learning. This approach facilitates seamless image and text representation learning and has demonstrated exceptional effectiveness across a diverse range of applications.
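The contrastive objective described above can be written compactly as a symmetric cross-entropy over a batch of image and text embeddings. The following sketch is purely illustrative PyTorch, not the NeMo implementation; the embedding shapes and the fixed temperature value are assumptions (CLIP actually learns the temperature as a parameter).

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, temperature=0.07):
        """Symmetric contrastive (InfoNCE) loss over a batch of (image, text) pairs.

        image_features, text_features: [batch, dim] outputs of the two encoders.
        """
        # Normalize so that dot products become cosine similarities
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Similarity matrix: entry (i, j) compares image i with text j
        logits = image_features @ text_features.t() / temperature

        # The correct pairing for each image/text lies on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)

        # Maximize similarity for correct pairs and minimize it for all
        # other pairings, symmetrically over the image and text axes
        loss_images = F.cross_entropy(logits, targets)
        loss_texts = F.cross_entropy(logits.t(), targets)
        return (loss_images + loss_texts) / 2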
To get started with CLIP, follow these steps.
Import from Hugging Face to NeMo 2.0#
The following script downloads the CLIP checkpoint and converts it to NeMo format. The converted checkpoint is stored in the NeMo cache folder at ``~/.cache/nemo``. For example, when used with the NeMo container, the full path is ``/root/.cache/nemo/models/openai/clip-vit-large-patch14``.

The converted checkpoint can be used to initialize the Vision Language Model (VLM) and serves as the starting point for Supervised Fine-Tuning (SFT) of the CLIP model.
.. code-block:: python

    from nemo.collections.llm import import_ckpt
    from nemo.collections import vlm
    from nemo.collections.vlm import ClipConfigL14

    if __name__ == "__main__":
        # Specify the Hugging Face model ID
        hf_model_id = "hf://openai/clip-vit-large-patch14"

        # Import the model and convert to NeMo 2.0 format
        import_ckpt(
            model=vlm.CLIPModel(ClipConfigL14()),  # Model configuration
            source=f"{hf_model_id}",               # Hugging Face model source
        )
NeMo 2.0 Pretraining Recipes#
We provide a default recipe for pretraining CLIP, clip_b32:
from nemo.collections import vlm
pretrain = vlm.clip_b32.pretrain_recipe(
    name="clip_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)
Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Note

The recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
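Because the recipe is such a configuration object, you can customize it by assigning to its attributes before execution; replacing the data module with a real dataset is shown in the Energon section below. The trainer fields in this sketch are illustrative assumptions, not values taken from the recipe itself.

.. code-block:: python

    # `pretrain` is the recipe object created above. Override fields with plain
    # attribute assignment before launching; the field names below are assumptions,
    # so inspect your recipe for the exact attributes it exposes.
    pretrain.trainer.max_steps = 10_000
    pretrain.trainer.val_check_interval = 1_000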
Once you have your final configuration ready, you can execute it using any of the NeMo-Run supported executors. The simplest option is the local executor, which runs the pretraining locally in a separate process. You can use it as follows:

.. code-block:: python

    import nemo_run as run

    run.run(pretrain, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:

.. code-block:: python

    run.run(pretrain, direct=True)
Use the Energon Dataloader#
Given a dataset in WebDataset format, you can use the Energon data loader to prepare the data for use with CLIP. Run the following command from the <data_root> directory to convert the WebDataset format to the Energon format:

.. code-block:: bash

    energon prepare .

When prompted, select CrudeSample as your sample class.
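If your image-text pairs are not yet packaged as WebDataset shards, a short script can create them before you run energon prepare. The sketch below is illustrative only: it uses the third-party webdataset package (not part of NeMo), and the sample list, keys, and shard size are assumptions you should adapt to your data.

.. code-block:: python

    import webdataset as wds

    # Hypothetical list of (image_path, caption) pairs to pack into shards.
    samples = [
        ("images/000001.jpg", "a photo of a cat"),
        ("images/000002.jpg", "a photo of a dog"),
    ]

    # Run this from the <data_root> directory so the shards land where
    # `energon prepare .` expects them. Each shard holds up to 10,000 samples.
    with wds.ShardWriter("shard-%06d.tar", maxcount=10_000) as sink:
        for idx, (image_path, caption) in enumerate(samples):
            with open(image_path, "rb") as f:
                image_bytes = f.read()
            sink.write(
                {
                    "__key__": f"sample{idx:06d}",  # unique key per sample
                    "jpg": image_bytes,             # raw image bytes
                    "txt": caption,                 # paired caption text
                }
            )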
Below is an example of how to set up the Energon data module for CLIP training:
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule
from nemo.collections.vlm.clip.data.clip_data_module import ClipTaskEncoder
# Paths and configuration
data_path = "<path_to_dataset>"
text_seq_length = 80
mbs = 500
gbs = 4000
num_workers = 16
# Load the task encoder for train and validation
train_task_encoder = ClipTaskEncoder(max_length=text_seq_length)
valid_task_encoder = ClipTaskEncoder(max_length=text_seq_length, is_train=False)
data = EnergonMultiModalDataModule(
    data_path,
    seq_length=text_seq_length,
    image_processor=None,
    micro_batch_size=mbs,
    global_batch_size=gbs,
    num_workers=num_workers,
    task_encoder=train_task_encoder,
    tokenizer=train_task_encoder.tokenizer,
    validation_task_encoder=valid_task_encoder,
    image_decode="pil",
    ignore_decoder_errors=True,
)
Replace the MockDataModule in the default recipes with the above data module.
from nemo.collections import vlm
# Define the pretraining recipe
pretrain = vlm.clip_b32.pretrain_recipe(
    name="clip_pretrain",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
)
# Assign the data module defined above to the recipe
pretrain.data = data
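As before, you can launch the updated recipe with any of the supported NeMo-Run executors, for example the local executor:

.. code-block:: python

    import nemo_run as run

    run.run(pretrain, executor=run.LocalExecutor())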
We have also included additional example scripts to further customize CLIP training and inference:
Inference with Hugging Face and NeMo: clip_infer.py
Pretraining: clip_pretrain.py
These scripts allow for flexible and comprehensive training workflows tailored to your requirements. For example, to perform SFT, you can use clip_pretrain.py and pass restore_path as the path to the checkpoint obtained after the Hugging Face conversion to NeMo 2.0.