CLIP Training, Evaluation, Inference, and Export#
The following sections cover the experiment specification parameters, training and evaluation commands, inference, export, and TRT deployment for CLIP.
For an overview of supported models, data formats, and end-to-end workflows, refer to CLIP Introduction.
Creating an Experiment Specification File#
tao clip get-spec \
--action train \
--output /path/to/experiment_spec.yaml
results_dir: /results/clip_experiment
model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false
train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  loss_type: siglip
  precision: fp16
  grad_checkpointing: false
  grad_clip_norm: null
  distributed_strategy: ddp
  validation_interval: 1
  val_check_interval: null
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine
dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets:
      - image_dir: /data/val/images
        caption_dir: /data/val/captions
    batch_size: 16
    num_workers: 8
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2
evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16
inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16
export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 256
  input_width: 256
  batch_size: -1
  opset_version: 17
gen_trt_engine:
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  trt_engine: /results/clip_experiment/deploy/clip_model.engine
  batch_size: -1
  tensorrt:
    workspace_size: 4096
    data_type: fp16
    min_batch_size: 1
    opt_batch_size: 8
    max_batch_size: 16
model#
model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false
| Field | Data type | Description | Default value | Valid options |
|---|---|---|---|---|
| `type` | string | Backbone architecture. Refer to CLIP Introduction for all valid values. | `siglip2-so400m-patch16-256` | See CLIP Introduction |
| `adaptor_name` | string | Text adaptor for Radio-CLIP models. Set to `null` for other model types. | `null` | |
| `freeze_vision_encoder` | bool | Freeze vision encoder weights during training. | `false` | `true`, `false` |
| `freeze_text_encoder` | bool | Freeze text encoder weights during training. | `false` | `true`, `false` |
| `canonicalize_text` | bool | Lowercase and remove punctuation from captions before tokenization. Enable this only if the pretrained model was trained with text canonicalization. | `false` | `true`, `false` |
train#
train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  loss_type: siglip
  precision: fp16
  distributed_strategy: ddp
  grad_checkpointing: false
  grad_clip_norm: null
  pretrained_model_path: null
  resume_training_checkpoint_path: null
  validation_interval: 1
  val_check_interval: null
| Field | Data type | Description | Default value | Valid options |
|---|---|---|---|---|
| `num_epochs` | int | Total number of training epochs. | 100 | >0 |
| `num_gpus` | int | Number of GPUs per node. | 1 | >0 |
| `num_nodes` | int | Number of nodes for distributed training. | 1 | >0 |
| `checkpoint_interval` | int | Save a checkpoint every N epochs. | 10 | >0 |
| `loss_type` | string | Contrastive loss formulation. Use `siglip` for the sigmoid (SigLIP) loss or `clip` for the standard softmax contrastive loss. | `siglip` | `siglip`, `clip` |
| `precision` | string | Training precision. | `fp16` | |
| `distributed_strategy` | string | Distributed training strategy. Use `fsdp` for models that exceed single-node GPU memory. | `ddp` | `ddp`, `fsdp` |
| `grad_checkpointing` | bool | Enable gradient checkpointing to reduce GPU memory at the cost of additional compute. | `false` | `true`, `false` |
| `grad_clip_norm` | float | Maximum gradient norm for clipping. Set to `null` to disable gradient clipping. | `null` | |
| `pretrained_model_path` | string | Path to a TAO checkpoint to use as the starting point for fine-tuning. When set to `null`, training starts from the default pretrained weights for the selected backbone. | `null` | |
| `resume_training_checkpoint_path` | string | Path to a TAO checkpoint from which to resume an interrupted training run. | `null` | |
| `validation_interval` | int | Run validation every N epochs. | 1 | >0 |
| `val_check_interval` | int | Run validation every N steps. When set, this takes precedence over `validation_interval`. | `null` | |
optim#
train:
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine
| Field | Data type | Description | Default value | Valid options |
|---|---|---|---|---|
| `optimizer_type` | string | Optimizer. Use `adamw` for AdamW or `lamb` for LAMB. | `adamw` | `adamw`, `lamb` |
| `vision_lr` | float | Learning rate for the vision encoder. | 1.0e-4 | >0 |
| `text_lr` | float | Learning rate for the text encoder. | 1.0e-4 | >0 |
| `weight_decay` | float | L2 regularization coefficient. | 1.0e-4 | >=0 |
| `betas` | list[float] | Adam/LAMB beta parameters. | [0.9, 0.95] | |
| `eps` | float | Epsilon for numerical stability. | 1.0e-6 | >0 |
| `warmup_steps` | int | Number of linear warmup steps at the start of training. | 100 | >=0 |
| `scheduler` | string | Learning rate schedule after warmup. | `cosine` | |
dataset#
dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets: []
    batch_size: 16
    num_workers: 8
| Field | Data type | Description | Default value | Valid options |
|---|---|---|---|---|
| `seed` | int | Random seed for data loading and shuffling. | 42 | |
| `type` | string | Dataset format for training. | `custom` | `custom` |
| `datasets` | list | List of dataset entries. Each entry specifies `image_dir`, `caption_dir`, and optionally `caption_file_suffix` and `image_list_file`. | | |
| `batch_size` | int | Batch size per GPU during training. | 16 | >0 |
| `num_workers` | int | Number of dataloader worker processes. | 8 | >=0 |
augmentation#
dataset:
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2
| Field | Data type | Description | Default value |
|---|---|---|---|
| `scale` | list[float] | Random resized crop scale range. | [0.4, 1.0] |
| `color_jitter` | list[float] | Color jitter parameters: application probability followed by brightness, contrast, saturation, and hue strengths. | [0.8, 0.32, 0.32, 0.32, 0.08] |
| `grayscale` | float | Probability of converting an image to grayscale during training. | 0.2 |
evaluate#
evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16
| Field | Data type | Description | Default value |
|---|---|---|---|
| `checkpoint` | string | Path to the TAO training checkpoint. When set to `null`, TAO evaluates the pretrained model. | `null` |
| `batch_size` | int | Batch size for embedding extraction during evaluation. | 16 |
inference#
inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16
| Field | Data type | Description | Default value |
|---|---|---|---|
| `checkpoint` | string | Path to the TAO training checkpoint. When set to `null`, TAO runs inference with the pretrained model. | `null` |
| `datasets` | list | List of image dataset entries. Each entry specifies an `image_dir` containing supported image files. | |
| `text_file` | string | Path to a plain text file with one prompt per line. TAO extracts a text embedding for each prompt. | |
| `batch_size` | int | Batch size for embedding extraction. | 16 |
export#
export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 256
  input_width: 256
  batch_size: -1
  opset_version: 17
| Field | Data type | Description | Default value |
|---|---|---|---|
| `checkpoint` | string | Path to the TAO training checkpoint. When set to `null`, TAO exports the pretrained model. | `null` |
| `onnx_file` | string | Output path for the ONNX file. | |
| `encoder_type` | string | Controls whether TAO exports a single combined encoder or two separate encoders. Refer to Exporting the Model for guidance on which to choose. | `combined` |
| `input_height` | int | Input image height in pixels. | 256 |
| `input_width` | int | Input image width in pixels. | 256 |
| `batch_size` | int | Export batch size. Set to `-1` for a dynamic batch dimension. | -1 |
| `opset_version` | int | ONNX opset version. The minimum supported value is 11. | 17 |
Training the Model#
tao clip get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID \
--output @train_spec.yaml
# Edit train_spec.yaml as needed
TRAIN_JOB_ID=$(tao clip create-job \
--kind experiment \
--name "clip_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs @train_spec.yaml \
--train-dataset-uri "$DATASET_URI" \
--eval-dataset-uri "$DATASET_URI" \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" \
--output json | jq -r '.id')
Multi-Node Training with FTMS
Distributed training is supported through FTMS. For large models, multi-node training can significantly reduce total training time.
Verify that your cluster has multiple GPU-enabled nodes available for training by running this command:
kubectl get nodes -o wide
The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to add more nodes to your cluster.
To run a multi-node training job through FTMS, modify these fields in the training job specification:
{
  "train": {
    "num_gpus": 8,   // Number of GPUs per node
    "num_nodes": 2   // Number of nodes to use for training
  }
}
If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
tao model clip train -e /path/to/experiment_spec.yaml
Required Arguments
-e,--experiment_spec_file: Path to the experiment specification file.
Optional Arguments
-r, --results_dir: Path to the directory for storing results. Overrides results_dir in the specification file.
-g, --num_gpus: Number of GPUs to use for training.
-h, --help: Display the help message.
Sample Usage
tao model clip train -e /path/to/experiment_spec.yaml \
train.num_gpus=4 \
train.num_epochs=50 \
results_dir=/results/clip_run1
Note
To run multi-GPU training, set train.num_gpus in the specification
file or pass it as a command-line override. For multi-node training, set
train.num_nodes and train.distributed_strategy: ddp. Use
distributed_strategy: fsdp for models that exceed single-node GPU memory.
Evaluating the Model#
CLIP evaluation runs bidirectional retrieval across your validation dataset and reports the following metrics:
R@1, R@5, R@10: Recall at k. The fraction of queries for which the correct match appears in the top-k retrieved results.
mAP: Mean average precision across all queries.
Median Rank: The median rank position of the first correct match across all queries. Lower is better.
Mean Rank: The mean rank position of the first correct match. Lower is better.
AUC: Area under the precision-recall curve.
TAO reports all metrics for two directions: image-to-text (given an image, retrieve the matching caption) and text-to-image (given a caption, retrieve the matching image).
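The retrieval metrics above can be reproduced from any image-text similarity matrix. The sketch below is illustrative, not TAO's internal implementation: `retrieval_metrics` is a hypothetical helper, and it assumes the standard pairing where the ground-truth caption for image i is text i.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Image-to-text retrieval metrics from an N x N similarity matrix.

    Assumes the correct caption for image i is text i (diagonal pairing).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # text indices sorted by descending similarity
    # 1-based rank at which each image's correct caption appears.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    metrics["mean_rank"] = float(np.mean(ranks))
    return metrics

# Toy example: 3 images x 3 captions, correct pairs on the diagonal.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.7, 0.6]])
print(retrieval_metrics(sim, ks=(1, 5)))
```

Computing the text-to-image direction is the same calculation on the transposed matrix.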
EVAL_JOB_ID=$(tao clip create-job \
--kind experiment \
--name "clip_evaluate" \
--action evaluate \
--workspace-id $WORKSPACE_ID \
--specs @eval_spec.yaml \
--eval-dataset-uri "$DATASET_URI" \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao model clip evaluate -e /path/to/experiment_spec.yaml
Required Arguments
-e,--experiment_spec_file: Path to the experiment specification file.
Optional Arguments
evaluate.checkpoint: Path to the checkpoint to evaluate. When omitted, TAO evaluates the pretrained model.
-h, --help: Display the help message.
Sample Usage
tao model clip evaluate -e /path/to/experiment_spec.yaml \
evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth
Running Inference#
The inference task extracts image and text embeddings and saves them as HDF5
files in results_dir:
image_embeddings.h5: Contains datasets embeddings (float32, shape N × D) and image_paths (string).
text_embeddings.h5: Contains datasets embeddings (float32, shape N × D) and texts (string).
All embeddings are L2-normalized before saving.
For examples of how to use these embeddings in downstream applications, refer to Using CLIP Embeddings.
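Because the saved embeddings are already L2-normalized, a plain dot product between the two arrays is cosine similarity. A minimal sketch; the commented `h5py` read shows how to load the real files, while random stand-ins with the same layout keep the example self-contained:

```python
import numpy as np

# Loading the saved embeddings would look like this (requires h5py):
#   with h5py.File("image_embeddings.h5") as f:
#       image_embs, image_paths = f["embeddings"][:], f["image_paths"][:]
# Random stand-ins with the same layout (N x D, L2-normalized) for illustration:
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 8)).astype(np.float32)
text_embs = rng.normal(size=(3, 8)).astype(np.float32)
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

# Dot product of unit vectors == cosine similarity.
sim = image_embs @ text_embs.T      # shape (num_images, num_texts)
best_text = sim.argmax(axis=1)      # best-matching prompt index per image
print(sim.shape, best_text)
```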
INFER_JOB_ID=$(tao clip create-job \
--kind experiment \
--name "clip_inference" \
--action inference \
--workspace-id $WORKSPACE_ID \
--specs @inference_spec.yaml \
--eval-dataset-uri "$DATASET_URI" \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao model clip inference -e /path/to/experiment_spec.yaml
Required Arguments
-e,--experiment_spec_file: Path to the experiment specification file.
Optional Arguments
inference.checkpoint: Path to the checkpoint.
inference.text_file: Path to a text file of prompts for text embedding extraction.
-h, --help: Display the help message.
Sample Usage
tao model clip inference -e /path/to/experiment_spec.yaml \
inference.checkpoint=/results/clip_experiment/train/epoch_100.pth \
inference.text_file=/data/prompts.txt \
results_dir=/results/clip_experiment/inference
Exporting the Model#
TAO exports CLIP models to ONNX. You can export either a single combined encoder or two separate encoders, depending on your deployment requirements.
Combined encoder (encoder_type: combined): Produces a single ONNX file
containing both the vision and text encoders.
| Direction | Details |
|---|---|
| Inputs | `image`, `input_ids`, `attention_mask` |
| Outputs | Image and text embeddings, plus the scalar `logit_scale` and `logit_bias` values |
Use this format when you run vision and text encoding together at inference time—for example, in real-time retrieval or classification pipelines.
Separate encoders (encoder_type: separate): Produces two ONNX files:
clip_model_vision.onnx and clip_model_text.onnx.
| Engine | Details |
|---|---|
| Vision | Input: `image`. Outputs the image embedding along with `logit_scale` and `logit_bias`. |
| Text | Inputs: `input_ids` and `attention_mask`. Outputs the text embedding along with `logit_scale` and `logit_bias`. |
Use this format when you want to pre-compute text embeddings offline—for example, to index a fixed set of class names or captions once and then run only the vision encoder at query time. TAO Deploy and trtexec support both combined and separate engine formats.
Note
attention_mask is a required ONNX graph input but its values are not
used—the model always substitutes an all-ones mask internally. Passing
the tokenizer’s mask or np.ones_like(input_ids) produces identical
results. Refer to Usage Notes for ONNX and TensorRT Deployment for details.
Warning
Currently, attention_mask is accepted as an explicit graph input for
backward compatibility only. This input is deprecated and scheduled for
removal. Remove it from your inference pipeline to avoid a future breaking
change.
The export command also produces two artifact files alongside the ONNX output:
<name>_config.yaml: Saved experiment configuration, required by TAO Deploy for engine generation and inference.
<name>_tokenizer/: Saved HuggingFace tokenizer directory, required by TAO Deploy for text preprocessing.
Important
Keep _config.yaml and _tokenizer/ in the same directory as the ONNX
file. TAO Deploy discovers these artifacts automatically. Moving or renaming
them causes engine generation and inference to fail.
Important
For models larger than 2 GB, ONNX export writes two files: the .onnx
file and an external data file that stores the large weight tensors
(an ONNX external data limitation). The
external data file name is set by the ONNX export path configuration. Do
not rename it after export; the .onnx file references it by the exact
name written at export time. Both files must remain in the same directory.
If you move the .onnx file, move the external data file alongside it,
or the engine build cannot succeed.
Warning
siglip2-so400m-patch16-naflex cannot be exported to ONNX. Use a
fixed-resolution variant such as siglip2-so400m-patch16-384 instead.
EXPORT_JOB_ID=$(tao clip create-job \
--kind experiment \
--name "clip_export" \
--action export \
--workspace-id $WORKSPACE_ID \
--specs @export_spec.yaml \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao model clip export -e /path/to/experiment_spec.yaml
Required Arguments
-e,--experiment_spec_file: Path to the experiment specification file.
Optional Arguments
export.checkpoint: Path to the checkpoint.
export.encoder_type: combined or separate.
export.onnx_file: Output path for the ONNX file.
-h, --help: Display the help message.
Sample Usage: Combined Encoder
tao model clip export -e /path/to/experiment_spec.yaml \
export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
export.encoder_type=combined
Sample Usage: Separate Encoders
tao model clip export -e /path/to/experiment_spec.yaml \
export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
export.encoder_type=separate
Usage Notes for ONNX and TensorRT Deployment#
The following notes apply when you load the exported ONNX model or TRT engine directly, outside of TAO Deploy.
Attention Mask Behavior#
attention_mask is present as an ONNX graph input for backward
compatibility, but its values are not used. The model always substitutes an
all-ones mask internally. You can safely pass the tokenizer’s
attention_mask or np.ones_like(input_ids); both produce identical
results.
Sequence Length#
The text inputs (input_ids and attention_mask) must use the same
max_length passed to the tokenizer. CLIP tokenizers typically use 77;
SigLIP2 tokenizers use 64. Passing a different length causes a shape
mismatch at runtime.
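With a HuggingFace tokenizer, passing `padding="max_length"`, `truncation=True`, and the correct `max_length` produces fixed-length inputs. The stdlib-only sketch below shows the equivalent pad-or-truncate step for a raw token id list; `pad_to_max_length` and the pad id of 0 are illustrative assumptions, not a TAO API (use your tokenizer's actual pad token id):

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Pad or truncate a token id list to the fixed length the model expects.

    max_length must match the tokenizer setting the model was exported with
    (typically 77 for CLIP tokenizers, 64 for SigLIP2). pad_id=0 is a
    placeholder; substitute your tokenizer's pad token id.
    """
    ids = list(token_ids[:max_length])
    return ids + [pad_id] * (max_length - len(ids))

ids = pad_to_max_length([101, 2023, 2003, 102], max_length=77)
print(len(ids))  # 77
```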
Dynamic Batch and TensorRT Shape Profiles#
When export.batch_size: -1, the batch dimension is dynamic. When
building a TRT engine with trtexec, provide --minShapes,
--optShapes, and --maxShapes for every input. For a combined
encoder with 77-token sequences:
trtexec --onnx=clip_model.onnx \
--minShapes=image:1x3x256x256,input_ids:1x77,attention_mask:1x77 \
--optShapes=image:8x3x256x256,input_ids:8x77,attention_mask:8x77 \
--maxShapes=image:32x3x256x256,input_ids:32x77,attention_mask:32x77
The attention_mask shape profile must match input_ids exactly. After
the deprecation takes effect, omit attention_mask from all three shape
arguments.
Image Preprocessing#
The image tensor must be preprocessed to the same pixel statistics used
during training. TAO Deploy handles this automatically when running
tao deploy clip inference. When loading the ONNX model directly, apply
the per-channel mean and standard deviation stored in
<name>_config.yaml (exported alongside the ONNX file).
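A minimal normalization sketch for direct ONNX loading. The `(0.5, 0.5, 0.5)` mean/std values and the assumption that preprocessing reduces to scale-then-normalize are placeholders: take the real per-channel values (and any resize/crop steps) from the exported `<name>_config.yaml`.

```python
import numpy as np

def preprocess(image_hwc_uint8, mean, std):
    """Convert an HWC uint8 image to the NCHW float32 tensor the model expects.

    mean/std must come from the exported <name>_config.yaml; the values used
    below are placeholders, not the real configuration.
    """
    x = image_hwc_uint8.astype(np.float32) / 255.0                   # to [0, 1]
    x = (x - np.array(mean, np.float32)) / np.array(std, np.float32) # per-channel normalize
    return x.transpose(2, 0, 1)[None]                                # HWC -> NCHW, add batch dim

img = np.full((256, 256, 3), 255, dtype=np.uint8)   # dummy all-white 256x256 image
batch = preprocess(img, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
print(batch.shape)  # (1, 3, 256, 256)
```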
Logit Scale and Logit Bias#
Both logit_scale and logit_bias are exported as scalar outputs. For
SigLIP-style models, compute the match probability as
sigmoid(logit_scale * dot(image_emb, text_emb) + logit_bias).
For CLIP-style models, logit_bias is zero and softmax over class scores
works as well. When using separate encoders, logit_scale and
logit_bias are available from either encoder; you only need one copy.
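The SigLIP formula above reduces to a few lines. A stdlib-only sketch for one L2-normalized image/text pair; the function name and the `logit_scale`/`logit_bias` values are illustrative (the real scalars come from the exported model's outputs):

```python
import math

def siglip_match_probability(image_emb, text_emb, logit_scale, logit_bias):
    """sigmoid(logit_scale * dot(image_emb, text_emb) + logit_bias)
    for one pair of L2-normalized embeddings (plain lists for brevity)."""
    dot = sum(i * t for i, t in zip(image_emb, text_emb))
    logit = logit_scale * dot + logit_bias
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative values only; real logit_scale / logit_bias are the scalar
# outputs of the exported model.
p = siglip_match_probability([1.0, 0.0], [1.0, 0.0], logit_scale=10.0, logit_bias=-5.0)
print(round(p, 4))  # 0.9933
```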
Deploying with TensorRT#
After exporting to ONNX, convert the model to a TensorRT™ engine for optimized inference. CLIP TRT deployment supports FP16 and FP32. TAO supports both combined and separate encoder formats.
For the full gen_trt_engine, evaluate, and inference commands, refer to
CLIP with TAO Deploy.