CLIP Introduction#
CLIP (Contrastive Language–Image Pretraining) is a multimodal model that learns to align images and text in a shared embedding space. You can use CLIP to rank images against text queries, classify images without labeled training data, or extract embeddings for downstream similarity search applications.
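The ranking use case above reduces to a cosine-similarity sort once embeddings are extracted. The following is an illustrative sketch only, not the TAO API; the embeddings are toy stand-ins for real model outputs:

```python
import math

# Toy sketch: CLIP maps images and text into one shared embedding space,
# so ranking images against a text query is a cosine-similarity sort.
# These 3-dim vectors stand in for real model embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

image_embeds = {
    "beach_dog.jpg":   [0.1, 0.9, 0.2],
    "city_street.jpg": [0.8, 0.1, 0.3],
    "fruit_bowl.jpg":  [0.2, 0.3, 0.9],
}
text_query = [0.0, 1.0, 0.1]  # e.g. embedding of "a dog running on the beach"

# Sort image names by similarity to the query, best match first.
ranked = sorted(image_embeds, key=lambda k: -cosine(image_embeds[k], text_query))
print(ranked)
```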
TAO Toolkit supports fine-tuning CLIP models on custom image-text datasets as well as zero-shot evaluation of pretrained models without any training.
Supported tasks:
train, evaluate, inference, export, gen_trt_engine
For full training, evaluation, inference, and export commands, refer to CLIP Training and Deployment.
For TensorRT™ engine generation, evaluation, and inference with TAO Deploy, refer to CLIP with TAO Deploy.
For downstream application examples that use CLIP embeddings, refer to Using CLIP Embeddings.
Note
Throughout this documentation are references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.
For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.
For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
The spec format is YAML for TAO Launcher, and JSON for FTMS Client.
File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.
Supported Models#
TAO Toolkit supports three backbone families. The table below lists the
available model.type values, image resolutions, and embedding dimensions
for each.
Radio-CLIP is the NVIDIA® C-RADIO vision backbone paired with a SigLIP text
encoder. C-RADIO is a large-scale vision foundation model trained via
multi-teacher distillation; Radio-CLIP adapts it for image-text alignment by
attaching a lightweight SigLIP text adapter. Use model.adaptor_name to
select the adapter: siglip (default) or clip.
| Family | model.type values | Image size | Embedding dim | Weights source |
|---|---|---|---|---|
| SigLIP2 | siglip2-so400m-patch16-256 (default), siglip2-so400m-patch14-224, siglip2-so400m-patch14-384, siglip2-so400m-patch16-384, siglip2-so400m-patch16-512, siglip2-so400m-patch16-naflex (training only) | 224–512 | 1152 | HuggingFace |
| Radio-CLIP | c-radio_v3-b, c-radio_v3-l, c-radio_v3-h, c-radio_v3-g | 224 | 768–1536 | torch.hub (NVlabs/RADIO) |
| OpenCLIP / NV-CLIP | ViT-L-14-SigLIP-CLIPA-224, ViT-L-14-SigLIP-CLIPA-336, ViT-H-14-SigLIP-CLIPA-224, ViT-H-14-SigLIP-CLIPA-336, ViT-H-14-SigLIP-CLIPA-574 | 224–574 | 768–1024 | HuggingFace |
SigLIP2 — Use when you need a well-rounded image-text model with strong
retrieval performance across a range of resolutions. The
siglip2-so400m-patch16-256 variant is the default and a good starting point.
Radio-CLIP — Use when you need NVIDIA’s C-RADIO vision backbone with SigLIP
text alignment. Radio-CLIP combines a powerful vision foundation model with an
efficient text encoder. Set model.adaptor_name to siglip (default) or
clip to select the text adapter.
OpenCLIP / NV-CLIP — Use when you need a CLIP-compatible architecture with SigLIP-CLIPA-style training at various resolutions.
Note
SigLIP2 and OpenCLIP / NV-CLIP weights load from HuggingFace. Radio-CLIP weights
load from torch.hub (NVlabs/RADIO). You need network access or a
local mirror on first use.
Limitations#
Warning
siglip2-so400m-patch16-naflex supports dynamic resolution during training
but cannot be exported to ONNX or TensorRT in the current version. To deploy
your model, use a fixed-resolution SigLIP2 variant such as
siglip2-so400m-patch16-384.
TRT deployment supports FP16 and FP32.
Radio-CLIP models require you to set model.adaptor_name explicitly in the experiment specification.
Currently, attention_mask is accepted as an ONNX graph input for backward compatibility only; the model does not use its values and always substitutes an all-ones mask internally. This input is deprecated and scheduled for removal.
Data Input for CLIP#
CLIP supports two dataset formats: a custom filesystem format and WebDataset (WDS) sharded archives. Both formats support multiple dataset entries that are concatenated at training time.
Custom Format#
In the custom format, images and captions reside in separate directories. Each caption file must share the base name of its corresponding image.
Example directory layout:
/data/
images/
sample_001.jpg
sample_002.jpg
sample_003.png
captions/
sample_001.txt
sample_002.txt
sample_003.txt
Each caption file contains a single plain-text caption. For example,
sample_001.txt:
A golden retriever playing fetch on a grassy field.
Configure a custom dataset in the experiment specification:
dataset:
train:
type: custom
datasets:
- image_dir: /data/images
caption_dir: /data/captions
caption_file_suffix: .txt
image_list_file: null
caption_file_suffix: File extension for caption files. The default is .txt.
image_list_file: Optional path to a text file listing image filenames, one per line. When set, the dataloader reads only the listed files rather than scanning the directory. Providing an image list speeds up initialization for large datasets.
To use multiple datasets, add additional entries under datasets. TAO concatenates them automatically.
Supported image extensions: .jpg, .jpeg, .png.
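The pairing rule the custom format implies (caption file shares the image's base name) can be sketched as follows. This is an illustrative helper, not TAO's internal dataloader; the directory names are stand-ins:

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def pair_samples(image_dir, caption_dir, caption_suffix=".txt"):
    """Match each image to the caption file sharing its base name."""
    pairs = []
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        cap = Path(caption_dir) / (img.stem + caption_suffix)
        if cap.exists():
            pairs.append((img.name, cap.read_text().strip()))
    return pairs

# Build a tiny example tree mirroring the layout above.
root = Path(tempfile.mkdtemp())
(root / "images").mkdir()
(root / "captions").mkdir()
(root / "images" / "sample_001.jpg").write_bytes(b"")
(root / "images" / "sample_002.png").write_bytes(b"")  # no caption: skipped
(root / "captions" / "sample_001.txt").write_text(
    "A golden retriever playing fetch on a grassy field.")

print(pair_samples(root / "images", root / "captions"))
```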
WebDataset Format#
WebDataset (WDS) stores data in .tar shards. Each shard contains paired
image and caption files identified by a shared key. This format supports
large-scale training and streaming from cloud storage.
Example shard layout:
/data/shards/
shard-000000.tar # Contains: 000001.jpg, 000001.txt, 000002.jpg, ...
shard-000001.tar
shard-000002.tar
Supported image keys within a shard: .jpg, .jpeg, .png, .webp.
Supported caption keys within a shard: .txt, .text.
Configure a WDS dataset:
dataset:
train:
type: wds
wds:
root_dir: /data/shards
shard_list_file: null
samples_per_shard: 10000
root_dir: Directory containing .tar shard files. TAO discovers all shards automatically.
shard_list_file: Optional path to a text file listing shard paths or URLs, one per line. Use this to specify a subset of shards or remote paths.
samples_per_shard: Approximate number of samples per shard, used for progress tracking and training resumption.
To stream shards directly from S3, list the URLs in shard_list_file:
s3://my-bucket/clip-data/shard-000000.tar?endpoint=s3.us-east-1.amazonaws.com&region=us-east-1
s3://my-bucket/clip-data/shard-000001.tar?endpoint=s3.us-east-1.amazonaws.com&region=us-east-1
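A WDS shard is a plain .tar archive whose members are paired by a shared key (e.g. 000001.jpg next to 000001.txt). The following sketch builds one such shard with the standard library; it is not the writer TAO uses internally, just an illustration of the layout:

```python
import io
import tarfile
import tempfile
from pathlib import Path

def write_shard(shard_path, samples):
    """Write (key, image_bytes, caption) samples as paired tar members."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, caption in samples:
            for name, data in ((f"{key}.jpg", image_bytes),
                               (f"{key}.txt", caption.encode("utf-8"))):
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

shard = Path(tempfile.mkdtemp()) / "shard-000000.tar"
write_shard(shard, [("000001", b"\xff\xd8", "A dog running on the beach.")])

with tarfile.open(shard) as tar:
    print(tar.getnames())  # ['000001.jpg', '000001.txt']
```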
Inference Text Input#
For text embedding extraction during inference, provide a plain text file with one prompt per line:
A dog running on the beach.
A red sports car on a highway.
A bowl of fresh fruit.
Set inference.text_file to the path of this file in your experiment
specification.
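A minimal sketch of producing and reading such a prompt file (blank lines are skipped here as a defensive choice; the exact TAO behavior for blank lines is not specified above):

```python
import tempfile
from pathlib import Path

# Write a prompt file in the format described above: one prompt per line.
prompts_file = Path(tempfile.mkdtemp()) / "prompts.txt"
prompts_file.write_text(
    "A dog running on the beach.\n"
    "\n"  # stray blank line, skipped when reading
    "A bowl of fresh fruit.\n")

# Read it back, one prompt per non-blank line.
prompts = [line.strip() for line in prompts_file.read_text().splitlines()
           if line.strip()]
print(prompts)
```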
End-to-End Workflow#
The following two paths illustrate common CLIP workflows. For full command details, refer to CLIP Training and Deployment.
Path A: Fine-Tuning on Custom Data#
Use this path when you have domain-specific image-text pairs and want to fine-tune a pretrained model on your data.
Prepare your dataset
Organize images and captions in the custom format described above, or prepare WDS shards.
Train
tao model clip train -e /path/to/experiment_spec.yaml \
    model.type=siglip2-so400m-patch16-256 \
    results_dir=/results/clip_experiment
Evaluate
tao model clip evaluate -e /path/to/experiment_spec.yaml \
    evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth
Example output:
Image-to-Text: R@1=72.4 R@5=91.2 R@10=95.8 mAP=78.3 MedR=1 AUC=0.94
Text-to-Image: R@1=68.1 R@5=89.4 R@10=94.2 mAP=74.6 MedR=1 AUC=0.93
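Retrieval metrics of this kind are computed from an image-to-text similarity matrix whose diagonal holds the ground-truth pairs. The sketch below shows R@K and median rank on toy scores; it is not TAO's evaluator:

```python
# sim[i][j] is the similarity of image i to text j; the true caption for
# image i is text i (the diagonal).
def recall_at_k(sim, k):
    """Fraction of images whose true caption appears in the top-k texts."""
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in topk
    return hits / len(sim)

def median_rank(sim):
    """Median 1-based rank of the true caption across all images."""
    ranks = []
    for i, row in enumerate(sim):
        order = sorted(range(len(row)), key=lambda j: -row[j])
        ranks.append(order.index(i) + 1)
    ranks.sort()
    return ranks[len(ranks) // 2]

sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.4, 0.5, 0.3]]  # image 2's true caption only ranks 3rd

print(recall_at_k(sim, 1), median_rank(sim))
```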
Export
tao model clip export -e /path/to/experiment_spec.yaml \
    export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
    export.onnx_file=/results/clip_experiment/export/clip_model.onnx
Generate a TRT Engine
tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml \
    gen_trt_engine.onnx_file=/results/clip_experiment/export/clip_model.onnx \
    gen_trt_engine.trt_engine=/results/clip_experiment/deploy/clip_model.engine \
    gen_trt_engine.tensorrt.data_type=fp16
Evaluate or Infer with TRT
tao deploy clip evaluate -e /path/to/experiment_spec.yaml \
    evaluate.trt_engine=/results/clip_experiment/deploy/clip_model.engine
Path B: Zero-Shot Evaluation of a Pretrained Model#
Use this path to evaluate a pretrained CLIP model on your own data without any
training. When evaluate.checkpoint is not set, TAO loads pretrained weights
directly from HuggingFace or torch.hub.
Configure the experiment specification
Set dataset.val.datasets to your evaluation data. Leave evaluate.checkpoint unset or set it to null.
Evaluate
tao model clip evaluate -e /path/to/experiment_spec.yaml \
    model.type=siglip2-so400m-patch16-256
Export the pretrained model
tao model clip export -e /path/to/experiment_spec.yaml \
    model.type=siglip2-so400m-patch16-256 \
    export.onnx_file=/results/clip_zero_shot/export/clip_model.onnx
Generate a TRT Engine and Infer
Follow steps 5 and 6 from Path A.
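Downstream, zero-shot classification with CLIP-style embeddings amounts to embedding one prompt per class, scoring the image against each, and applying a softmax. A minimal sketch with toy embeddings (not real model outputs, and not the TAO API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-ins for one image embedding and one text embedding per class.
image_embed = [0.2, 0.9, 0.1]
class_embeds = {
    "a photo of a dog": [0.1, 0.8, 0.0],
    "a photo of a car": [0.9, 0.1, 0.2],
}

scores = {name: cosine(image_embed, emb) for name, emb in class_embeds.items()}
probs = softmax(list(scores.values()))
predicted = max(scores, key=scores.get)
print(predicted, probs)
```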
Note
The siglip2-so400m-patch16-naflex model variant supports dynamic resolution during
training but cannot be exported to ONNX or used for gen_trt_engine. Use a
fixed-resolution SigLIP2 variant (e.g., siglip2-so400m-patch16-384) for TRT deployment.