CLIP Introduction#

CLIP (Contrastive Language–Image Pretraining) is a multimodal model that learns to align images and text in a shared embedding space. You can use CLIP to rank images against text queries, classify images without labeled training data, or extract embeddings for downstream similarity search applications.
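As a rough illustration of the ranking use case (not TAO's API), scoring images against a text query reduces to cosine similarity in the shared embedding space. The snippet below uses made-up placeholder embeddings:

```python
import numpy as np

def rank_images(text_emb, image_embs):
    """Rank images by cosine similarity to a text query embedding.

    text_emb: (d,) text embedding; image_embs: (n, d) image embeddings.
    Returns image indices sorted from most to least similar.
    """
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t          # cosine similarities, shape (n,)
    return np.argsort(-scores) # descending order

# Toy example with made-up 4-dim embeddings
text = np.array([1.0, 0.0, 0.0, 0.0])
images = np.array([
    [0.9, 0.1, 0.0, 0.0],  # most similar to the query
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])
print(rank_images(text, images))  # [0 2 1]
```

Real CLIP embeddings have hundreds of dimensions (see the Supported Models table below), but the ranking logic is the same.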

TAO Toolkit supports fine-tuning CLIP models on custom image-text datasets as well as zero-shot evaluation of pretrained models without any training.

Supported tasks:

  • train

  • evaluate

  • inference

  • export

  • gen_trt_engine

For full training, evaluation, inference, and export commands, refer to CLIP Training and Deployment.

For TensorRT engine generation, evaluation, and inference with TAO Deploy, refer to CLIP with TAO Deploy.

For downstream application examples that use CLIP embeddings, refer to Using CLIP Embeddings.

Note

  • The FTMS Client sections throughout this documentation reference $EXPERIMENT_ID and $DATASET_ID.

    • For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Supported Models#

TAO Toolkit supports three backbone families. The table below lists the available model.type values, image resolutions, embedding dimensions, and weights sources for each.

Radio-CLIP is the NVIDIA® C-RADIO vision backbone paired with a SigLIP text encoder. C-RADIO is a large-scale vision foundation model trained via multi-teacher distillation; Radio-CLIP adapts it for image-text alignment by attaching a lightweight SigLIP text adapter. Use model.adaptor_name to select the adapter: siglip (default) or clip.

| Family | model.type values | Image size | Embedding dim | Weights source |
|---|---|---|---|---|
| SigLIP2 | `siglip2-so400m-patch16-256` (default)<br>`siglip2-so400m-patch14-224`<br>`siglip2-so400m-patch14-384`<br>`siglip2-so400m-patch16-384`<br>`siglip2-so400m-patch16-512`<br>`siglip2-so400m-patch16-naflex` (training only) | 224–512 | 1152 | HuggingFace |
| Radio-CLIP | `c-radio_v3-b`<br>`c-radio_v3-l`<br>`c-radio_v3-h`<br>`c-radio_v3-g` | 224 | 768–1536 | torch.hub |
| OpenCLIP / NV-CLIP | `ViT-L-14-SigLIP-CLIPA-224`<br>`ViT-L-14-SigLIP-CLIPA-336`<br>`ViT-H-14-SigLIP-CLIPA-224`<br>`ViT-H-14-SigLIP-CLIPA-336`<br>`ViT-H-14-SigLIP-CLIPA-574` | 224–574 | 768–1024 | HuggingFace |

SigLIP2 — Use when you need a well-rounded image-text model with strong retrieval performance across a range of resolutions. The siglip2-so400m-patch16-256 variant is the default and a good starting point.

Radio-CLIP — Use when you need NVIDIA’s C-RADIO vision backbone with SigLIP text alignment. Radio-CLIP combines a powerful vision foundation model with an efficient text encoder. Set model.adaptor_name to siglip (default) or clip to select the text adapter.

OpenCLIP / NV-CLIP — Use when you need a CLIP-compatible architecture with SigLIP-CLIPA-style training at various resolutions.
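
For example, a minimal model section selecting the Radio-CLIP family with the default SigLIP text adapter might look like the following (a sketch of the relevant spec fields described above; the nesting mirrors the dotted overrides such as model.type used in the commands below):

```yaml
model:
  type: c-radio_v3-l
  adaptor_name: siglip   # or "clip" to select the CLIP text adapter
```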

Note

SigLIP2 and OpenCLIP / NV-CLIP weights load from HuggingFace. Radio-CLIP weights load from torch.hub (NVlabs/RADIO). You need network access or a local mirror on first use.

Limitations#

Warning

siglip2-so400m-patch16-naflex supports dynamic resolution during training but cannot be exported to ONNX or TensorRT in the current version. To deploy your model, use a fixed-resolution SigLIP2 variant such as siglip2-so400m-patch16-384.

  • TensorRT deployment supports FP16 and FP32 precision only.

  • Radio-CLIP models require you to set model.adaptor_name explicitly in the experiment specification.

  • Currently, attention_mask is accepted as an ONNX graph input for backward compatibility only; the model does not use its values and always substitutes an all-ones mask internally. This input is deprecated and scheduled for removal.

Data Input for CLIP#

CLIP supports two dataset formats: a custom filesystem format and WebDataset (WDS) sharded archives. Both formats support multiple dataset entries that are concatenated at training time.

Custom Format#

In the custom format, images and captions reside in separate directories. Each caption file must share the base name of its corresponding image.

Example directory layout:

/data/
  images/
    sample_001.jpg
    sample_002.jpg
    sample_003.png
  captions/
    sample_001.txt
    sample_002.txt
    sample_003.txt

Each caption file contains a single plain-text caption. For example, sample_001.txt:

A golden retriever playing fetch on a grassy field.

Configure a custom dataset in the experiment specification:

dataset:
  train:
    type: custom
    datasets:
      - image_dir: /data/images
        caption_dir: /data/captions
        caption_file_suffix: .txt
        image_list_file: null

  • caption_file_suffix: File extension for caption files. The default is .txt.

  • image_list_file: Optional path to a text file listing image filenames, one per line. When set, the dataloader reads only the listed files rather than scanning the directory. Providing an image list speeds up initialization for large datasets.

  • To use multiple datasets, add additional entries under datasets. TAO concatenates them automatically.

Supported image extensions: .jpg, .jpeg, .png.
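
As a quick sanity check before training, you can verify the image-caption pairing convention described above. This is a sketch, not part of TAO; adjust the paths and suffix to match your spec:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # supported image extensions

def find_unpaired(image_dir, caption_dir, caption_suffix=".txt"):
    """Return image filenames that lack a caption file with the same base name."""
    captions = {p.stem for p in Path(caption_dir).glob(f"*{caption_suffix}")}
    return [
        p.name
        for p in sorted(Path(image_dir).iterdir())
        if p.suffix.lower() in IMAGE_EXTS and p.stem not in captions
    ]

if Path("/data/images").is_dir():
    missing = find_unpaired("/data/images", "/data/captions")
    if missing:
        print(f"{len(missing)} images have no caption, e.g.:", missing[:5])
```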

WebDataset Format#

WebDataset (WDS) stores data in .tar shards. Each shard contains paired image and caption files identified by a shared key. This format supports large-scale training and streaming from cloud storage.

Example shard layout:

/data/shards/
  shard-000000.tar   # Contains: 000001.jpg, 000001.txt, 000002.jpg, ...
  shard-000001.tar
  shard-000002.tar

Supported image keys within a shard: .jpg, .jpeg, .png, .webp.

Supported caption keys within a shard: .txt, .text.

Configure a WDS dataset:

dataset:
  train:
    type: wds
    wds:
      root_dir: /data/shards
      shard_list_file: null
      samples_per_shard: 10000

  • root_dir: Directory containing .tar shard files. TAO discovers all shards automatically.

  • shard_list_file: Optional path to a text file listing shard paths or URLs, one per line. Use this to specify a subset of shards or remote paths.

  • samples_per_shard: Approximate number of samples per shard, used for progress tracking and training resumption.

To stream shards directly from S3, list the URLs in shard_list_file:

s3://my-bucket/clip-data/shard-000000.tar?endpoint=s3.us-east-1.amazonaws.com&region=us-east-1
s3://my-bucket/clip-data/shard-000001.tar?endpoint=s3.us-east-1.amazonaws.com&region=us-east-1
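
If you need to produce such shards from custom-format data, the pairing convention can be sketched with Python's tarfile module (key and extension conventions as above; the webdataset library's ShardWriter is a common alternative for large datasets):

```python
import io
import tarfile
from pathlib import Path

def write_shard(pairs, shard_path):
    """Write (image_path, caption) pairs into one WDS-style .tar shard.

    Each sample gets a shared key: <key>.jpg holds the image bytes and
    <key>.txt holds its caption, matching the shard layout shown above.
    """
    with tarfile.open(shard_path, "w") as tar:
        for idx, (image_path, caption) in enumerate(pairs):
            key = f"{idx:06d}"
            # Image entry keeps its original extension (.jpg, .png, ...)
            tar.add(image_path, arcname=f"{key}{Path(image_path).suffix}")
            # Caption entry under the same key with a .txt extension
            text = caption.encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.txt")
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))
```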

Inference Text Input#

For text embedding extraction during inference, provide a plain text file with one prompt per line:

A dog running on the beach.
A red sports car on a highway.
A bowl of fresh fruit.

Set inference.text_file to the path of this file in your experiment specification.

End-to-End Workflow#

The following two paths illustrate common CLIP workflows. For full command details, refer to CLIP Training and Deployment.

Path A: Fine-Tuning on Custom Data#

Use this path when you have domain-specific image-text pairs and want to fine-tune a pretrained model on your data.

  1. Prepare your dataset

    Organize images and captions in the custom format described above, or prepare WDS shards.

  2. Train

    tao model clip train -e /path/to/experiment_spec.yaml \
      model.type=siglip2-so400m-patch16-256 \
      results_dir=/results/clip_experiment
    
  3. Evaluate

    tao model clip evaluate -e /path/to/experiment_spec.yaml \
      evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth
    

    Example output:

    Image-to-Text: R@1=72.4  R@5=91.2  R@10=95.8  mAP=78.3  MedR=1  AUC=0.94
    Text-to-Image: R@1=68.1  R@5=89.4  R@10=94.2  mAP=74.6  MedR=1  AUC=0.93
    
  4. Export

    tao model clip export -e /path/to/experiment_spec.yaml \
      export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
      export.onnx_file=/results/clip_experiment/export/clip_model.onnx
    
  5. Generate a TRT Engine

    tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml \
      gen_trt_engine.onnx_file=/results/clip_experiment/export/clip_model.onnx \
      gen_trt_engine.trt_engine=/results/clip_experiment/deploy/clip_model.engine \
      gen_trt_engine.tensorrt.data_type=fp16
    
  6. Evaluate or Infer with TRT

    tao deploy clip evaluate -e /path/to/experiment_spec.yaml \
      evaluate.trt_engine=/results/clip_experiment/deploy/clip_model.engine
    

Path B: Zero-Shot Evaluation of a Pretrained Model#

Use this path to evaluate a pretrained CLIP model on your own data without any training. When evaluate.checkpoint is not set, TAO loads pretrained weights directly from HuggingFace or torch.hub.

  1. Configure the experiment specification

    Set dataset.val.datasets to your evaluation data. Leave evaluate.checkpoint unset or set it to null.

  2. Evaluate

    tao model clip evaluate -e /path/to/experiment_spec.yaml \
      model.type=siglip2-so400m-patch16-256
    
  3. Export the pretrained model

    tao model clip export -e /path/to/experiment_spec.yaml \
      model.type=siglip2-so400m-patch16-256 \
      export.onnx_file=/results/clip_zero_shot/export/clip_model.onnx
    
  4. Generate a TRT Engine and Infer

    Follow steps 5 and 6 from Path A.
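
The retrieval metrics reported in step 3 of Path A can be sketched as follows: given an image-text similarity matrix for paired data, Recall@K is the fraction of queries whose true match ranks in the top K. This is a simplified illustration, not TAO's evaluation code:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for paired data: sim[i, j] is the similarity of query i
    to candidate j; the ground-truth match for query i is candidate i."""
    # Indices of the top-k candidates for each query, by similarity
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy 3x3 similarity matrix (rows: text queries, cols: images)
sim = np.array([
    [0.9, 0.2, 0.1],  # query 0: correct match ranked 1st
    [0.3, 0.1, 0.8],  # query 1: correct match ranked 3rd
    [0.2, 0.1, 0.7],  # query 2: correct match ranked 1st
])
print(recall_at_k(sim, 1))  # 2 of 3 queries hit at K=1
```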

Note

The siglip2-so400m-patch16-naflex model variant supports dynamic resolution during training but cannot be exported to ONNX or used for gen_trt_engine. Use a fixed-resolution SigLIP2 variant (e.g., siglip2-so400m-patch16-384) for TRT deployment.