CLIP Training, Evaluation, Inference, and Export#

The following sections cover the experiment specification parameters, training and evaluation commands, inference, export, and TRT deployment for CLIP.

For an overview of supported models, data formats, and end-to-end workflows, refer to CLIP Introduction.

Creating an Experiment Specification File#

tao clip get-spec \
  --action train \
  --output /path/to/experiment_spec.yaml

The generated specification file resembles the following:

results_dir: /results/clip_experiment

model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false

train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  loss_type: siglip
  precision: fp16
  grad_checkpointing: false
  grad_clip_norm: null
  distributed_strategy: ddp
  validation_interval: 1
  val_check_interval: null
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine

dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets:
      - image_dir: /data/val/images
        caption_dir: /data/val/captions
    batch_size: 16
    num_workers: 8
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2

evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16

inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16

export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 256
  input_width: 256
  batch_size: -1
  opset_version: 17

gen_trt_engine:
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  trt_engine: /results/clip_experiment/deploy/clip_model.engine
  batch_size: -1
  tensorrt:
    workspace_size: 4096
    data_type: fp16
    min_batch_size: 1
    opt_batch_size: 8
    max_batch_size: 16

model#

model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false

Field

Data type

Description

Default value

Valid options

model.type

string

Backbone architecture. Refer to CLIP Introduction for all valid values.

siglip2-so400m-patch16-256

model.adaptor_name

string

Text adaptor for Radio-CLIP models. Set to siglip or clip. Required when using a Radio-CLIP backbone.

null

siglip, clip

model.freeze_vision_encoder

bool

Freeze vision encoder weights during training.

false

true, false

model.freeze_text_encoder

bool

Freeze text encoder weights during training.

false

true, false

model.canonicalize_text

bool

Lowercase and remove punctuation from captions before tokenization. Enable this only if the pretrained model was trained with text canonicalization.

false

true, false

train#

train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  loss_type: siglip
  precision: fp16
  distributed_strategy: ddp
  grad_checkpointing: false
  grad_clip_norm: null
  pretrained_model_path: null
  resume_training_checkpoint_path: null
  validation_interval: 1
  val_check_interval: null

Field

Data type

Description

Default value

Valid options

train.num_epochs

int

Total number of training epochs.

100

train.num_gpus

int

Number of GPUs per node.

1

train.num_nodes

int

Number of nodes for distributed training.

1

train.checkpoint_interval

int

Save a checkpoint every N epochs.

10

train.loss_type

string

Contrastive loss formulation. Use siglip for sigmoid-based loss — the recommended choice for SigLIP2 and Radio-CLIP — or clip for softmax-based loss.

siglip

siglip, clip

train.precision

string

Training precision.

fp16

fp16, bf16, fp32

train.distributed_strategy

string

Distributed training strategy. Use fsdp for very large models that exceed single-GPU memory.

ddp

ddp, fsdp

train.grad_checkpointing

bool

Enable gradient checkpointing to reduce GPU memory at the cost of additional compute.

false

true, false

train.grad_clip_norm

float

Maximum gradient norm for clipping. Set to null to disable.

null

train.pretrained_model_path

string

Path to a TAO checkpoint to use as the starting point for fine-tuning. When set to null, TAO loads pretrained weights from HuggingFace or torch.hub.

null

train.resume_training_checkpoint_path

string

Path to a TAO checkpoint from which to resume an interrupted training run.

null

train.validation_interval

int

Run validation every N epochs.

1

train.val_check_interval

int

Run validation every N steps. When set, this takes precedence over validation_interval.

null

optim#

train:
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine

Field

Data type

Description

Default value

Valid options

optim.optimizer_type

string

Optimizer. Use lamb for large-batch distributed training.

adamw

adamw, lamb

optim.vision_lr

float

Learning rate for the vision encoder.

1.0e-4

optim.text_lr

float

Learning rate for the text encoder.

1.0e-4

optim.weight_decay

float

L2 regularization coefficient.

1.0e-4

optim.betas

list[float]

Adam/LAMB beta parameters.

[0.9, 0.95]

optim.eps

float

Epsilon for numerical stability.

1.0e-6

optim.warmup_steps

int

Number of linear warmup steps at the start of training.

100

optim.scheduler

string

Learning rate schedule after warmup.

cosine

cosine, constant, linear

dataset#

dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets: []
    batch_size: 16
    num_workers: 8

Field

Data type

Description

Default value

Valid options

dataset.seed

int

Random seed for data loading and shuffling.

42

dataset.train.type

string

Dataset format for training.

custom

custom, wds

dataset.train.datasets

list

List of dataset entries. Each entry specifies image_dir, caption_dir, caption_file_suffix, and image_list_file. Multiple entries are concatenated.

dataset.train.batch_size

int

Batch size per GPU during training.

16

dataset.train.num_workers

int

Number of dataloader worker processes.

8

augmentation#

dataset:
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2

Field

Data type

Description

Default value

augmentation.scale

list[float]

Random resized crop scale range [min, max]. Set to [1.0, 1.0] to disable cropping.

[0.4, 1.0]

augmentation.color_jitter

list[float]

Color jitter parameters: [probability, brightness, contrast, saturation, hue]. Set to [] to disable.

[0.8, 0.32, 0.32, 0.32, 0.08]

augmentation.grayscale

float

Probability of converting an image to grayscale during training.

0.2

evaluate#

evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16

Field

Data type

Description

Default value

evaluate.checkpoint

string

Path to the TAO training checkpoint. When set to null, TAO loads pretrained weights directly — enabling zero-shot evaluation.

null

evaluate.batch_size

int

Batch size for embedding extraction during evaluation.

16

inference#

inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16

Field

Data type

Description

Default value

inference.checkpoint

string

Path to the TAO training checkpoint. When set to null, TAO loads pretrained weights directly.

null

inference.datasets

list

List of image dataset entries. Supported image extensions: .jpg, .jpeg, .png, .bmp, .gif, .webp.

inference.text_file

string

Path to a plain text file with one prompt per line. TAO extracts a text embedding for each prompt.

null

inference.batch_size

int

Batch size for embedding extraction.

16

export#

export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 256
  input_width: 256
  batch_size: -1
  opset_version: 17

Field

Data type

Description

Default value

export.checkpoint

string

Path to the TAO training checkpoint. When set to null, TAO exports the pretrained model directly.

null

export.onnx_file

string

Output path for the ONNX file.

export.encoder_type

string

Controls whether TAO exports a single combined encoder or two separate encoders. Refer to Exporting the Model for guidance on which to choose.

combined

export.input_height

int

Input image height in pixels.

256

export.input_width

int

Input image width in pixels.

256

export.batch_size

int

Export batch size. Set to -1 for a dynamic batch axis.

-1

export.opset_version

int

ONNX opset version. The minimum supported value is 11.

17

Training the Model#

tao clip get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID \
  --output @train_spec.yaml
# Edit train_spec.yaml as needed

TRAIN_JOB_ID=$(tao clip create-job \
  --kind experiment \
  --name "clip_train" \
  --action train \
  --workspace-id $WORKSPACE_ID \
  --specs @train_spec.yaml \
  --train-dataset-uri "$DATASET_URI" \
  --eval-dataset-uri "$DATASET_URI" \
  --base-experiment-id "$BASE_EXPERIMENT_ID" \
  --encryption-key "nvidia_tlt" \
  --output json | jq -r '.id')

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, multi-node training can significantly reduce total training time.

Verify that your cluster has multiple GPU-enabled nodes available for training:

kubectl get nodes -o wide

The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to add nodes to the cluster.

To run a multi-node training job through FTMS, modify these fields in the training job specification:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, FTMS uses the default values of one GPU per node and one node.

Note

The num_gpus value must not exceed the number of GPUs per node in the cluster, and the num_nodes value must not exceed the number of nodes in the cluster.

Alternatively, run training directly from the command line:

tao model clip train -e /path/to/experiment_spec.yaml

Required Arguments

  • -e, --experiment_spec_file: Path to the experiment specification file.

Optional Arguments

  • -r, --results_dir: Path to the directory for storing results. Overrides results_dir in the specification file.

  • -g, --num_gpus: Number of GPUs to use for training.

  • -h, --help: Display the help message.

Sample Usage

tao model clip train -e /path/to/experiment_spec.yaml \
  train.num_gpus=4 \
  train.num_epochs=50 \
  results_dir=/results/clip_run1

Note

To run multi-GPU training, set train.num_gpus in the specification file or pass it as a command-line override. For multi-node training, set train.num_nodes and keep train.distributed_strategy set to ddp. Use distributed_strategy: fsdp for models that exceed single-GPU memory.

Evaluating the Model#

CLIP evaluation runs bidirectional retrieval across your validation dataset and reports the following metrics:

  • R@1, R@5, R@10: Recall at k. The fraction of queries for which the correct match appears in the top-k retrieved results.

  • mAP: Mean average precision across all queries.

  • Median Rank: The median rank position of the first correct match across all queries. Lower is better.

  • Mean Rank: The mean rank position of the first correct match. Lower is better.

  • AUC: Area under the precision-recall curve.

TAO reports all metrics for two directions: image-to-text (given an image, retrieve the matching caption) and text-to-image (given a caption, retrieve the matching image).
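To make the recall metric concrete, the following sketch scores a toy image-to-text similarity matrix. The matrix values and the diagonal ground-truth pairing (image i matches caption i) are assumptions for illustration, not TAO's internal implementation:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct match lands in the top-k results.
    Assumes query i's correct match is item i (diagonal ground truth)."""
    order = np.argsort(-sim, axis=1)                 # best match first
    hits = (order[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3 x 3 image-to-text similarity matrix (rows: images, cols: captions)
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.4, 0.3]])

r1 = recall_at_k(sim, 1)   # row 2 ranks its caption second, so R@1 = 2/3
r2 = recall_at_k(sim, 2)   # within the top 2 every match is found: R@2 = 1.0
```

The same routine applied to the transposed matrix yields the text-to-image direction.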

EVAL_JOB_ID=$(tao clip create-job \
  --kind experiment \
  --name "clip_evaluate" \
  --action evaluate \
  --workspace-id $WORKSPACE_ID \
  --specs @eval_spec.yaml \
  --eval-dataset-uri "$DATASET_URI" \
  --base-experiment-id "$BASE_EXPERIMENT_ID" \
  --encryption-key "nvidia_tlt" | jq -r '.id')
Alternatively, run evaluation directly from the command line:

tao model clip evaluate -e /path/to/experiment_spec.yaml

Required Arguments

  • -e, --experiment_spec_file: Path to the experiment specification file.

Optional Arguments

  • evaluate.checkpoint: Path to the checkpoint to evaluate. When omitted, TAO evaluates the pretrained model.

  • -h, --help: Display the help message.

Sample Usage

tao model clip evaluate -e /path/to/experiment_spec.yaml \
  evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth

Running Inference#

The inference task extracts image and text embeddings and saves them as HDF5 files in results_dir:

  • image_embeddings.h5: Contains an embeddings dataset (float32, shape N × D) and an image_paths dataset (string).

  • text_embeddings.h5: Contains an embeddings dataset (float32, shape N × D) and a texts dataset (string).

All embeddings are L2-normalized before saving.

For examples of how to use these embeddings in downstream applications, refer to Using CLIP Embeddings.
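As a sketch of the file layout described above, the following writes a tiny synthetic image_embeddings.h5 and reads it back the way a downstream application would. All embedding values and paths here are made up for illustration:

```python
import h5py
import numpy as np

# Synthetic embeddings mirroring the described layout.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize, as TAO does
paths = np.array([f"/data/img_{i}.jpg".encode() for i in range(4)])

with h5py.File("image_embeddings.h5", "w") as f:
    f.create_dataset("embeddings", data=emb)
    f.create_dataset("image_paths", data=paths)

# Read it back as a downstream application would.
with h5py.File("image_embeddings.h5", "r") as f:
    loaded = f["embeddings"][:]                      # float32, shape (N, D)
    names = [p.decode() for p in f["image_paths"][:]]

# Because embeddings are L2-normalized, cosine similarity is a dot product.
sim = loaded @ loaded.T
```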

INFER_JOB_ID=$(tao clip create-job \
  --kind experiment \
  --name "clip_inference" \
  --action inference \
  --workspace-id $WORKSPACE_ID \
  --specs @inference_spec.yaml \
  --eval-dataset-uri "$DATASET_URI" \
  --base-experiment-id "$BASE_EXPERIMENT_ID" \
  --encryption-key "nvidia_tlt" | jq -r '.id')
Alternatively, run inference directly from the command line:

tao model clip inference -e /path/to/experiment_spec.yaml

Required Arguments

  • -e, --experiment_spec_file: Path to the experiment specification file.

Optional Arguments

  • inference.checkpoint: Path to the checkpoint.

  • inference.text_file: Path to a text file of prompts for text embedding extraction.

  • -h, --help: Display the help message.

Sample Usage

tao model clip inference -e /path/to/experiment_spec.yaml \
  inference.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  inference.text_file=/data/prompts.txt \
  results_dir=/results/clip_experiment/inference

Exporting the Model#

TAO exports CLIP models to ONNX. You can export either a single combined encoder or two separate encoders, depending on your deployment requirements.

Combined encoder (encoder_type: combined): Produces a single ONNX file containing both the vision and text encoders.

Direction

Details

Inputs

image (B × 3 × H × W, float32), input_ids (B × seq_len, int64), attention_mask (B × seq_len, int64)

Outputs

image_embedding (B × D), text_embedding (B × D), logit_scale (scalar), logit_bias (scalar)

Use this format when you run vision and text encoding together at inference time—for example, in real-time retrieval or classification pipelines.

Separate encoders (encoder_type: separate): Produces two ONNX files: clip_model_vision.onnx and clip_model_text.onnx.

Engine

Details

Vision

Input: image (B × 3 × H × W). Outputs: image_embedding, logit_scale, logit_bias.

Text

Inputs: input_ids, attention_mask. Outputs: text_embedding, logit_scale, logit_bias.

Use this format when you want to pre-compute text embeddings offline—for example, to index a fixed set of class names or captions once and then run only the vision encoder at query time. TAO Deploy and trtexec support both combined and separate engine formats.

Note

attention_mask is a required ONNX graph input, but its values are not used; the model always substitutes an all-ones mask internally. Passing the tokenizer’s mask or np.ones_like(input_ids) produces identical results. Refer to Usage Notes for ONNX and TensorRT Deployment for details.

Warning

Currently, attention_mask is accepted as an explicit graph input for backward compatibility only. This input is deprecated and scheduled for removal. Remove it from your inference pipeline to avoid a future breaking change.

The export command also produces two artifact files alongside the ONNX output:

  • <name>_config.yaml: Saved experiment configuration, required by TAO Deploy for engine generation and inference.

  • <name>_tokenizer/: Saved HuggingFace tokenizer directory, required by TAO Deploy for text preprocessing.

Important

Keep _config.yaml and _tokenizer/ in the same directory as the ONNX file. TAO Deploy discovers these artifacts automatically. Moving or renaming them causes engine generation and inference to fail.

Important

For models larger than 2 GB, ONNX export writes two files: the .onnx file and an external data file that stores the large weight tensors (an ONNX external data limitation). The external data file name is set by the ONNX export path configuration. Do not rename it after export; the .onnx file references it by the exact name written at export time. Both files must remain in the same directory. If you move the .onnx file, move the external data file alongside it, or the engine build cannot succeed.
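When relocating an export, one way to keep the .onnx file and its external data file together is to move everything sharing the ONNX file's name prefix. This is a sketch with hypothetical relative paths; point src and dst at your actual export and deployment directories:

```python
import shutil
from pathlib import Path

# Hypothetical locations for illustration only.
src = Path("export")
dst = Path("deploy")
dst.mkdir(parents=True, exist_ok=True)

# Move the .onnx file and any external data file written alongside it.
# glob on a missing directory simply yields nothing, so this is safe to rerun.
for f in src.glob("clip_model.onnx*"):
    shutil.move(str(f), str(dst / f.name))
```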

Warning

siglip2-so400m-patch16-naflex cannot be exported to ONNX. Use a fixed-resolution variant such as siglip2-so400m-patch16-384 instead.

EXPORT_JOB_ID=$(tao clip create-job \
  --kind experiment \
  --name "clip_export" \
  --action export \
  --workspace-id $WORKSPACE_ID \
  --specs @export_spec.yaml \
  --base-experiment-id "$BASE_EXPERIMENT_ID" \
  --encryption-key "nvidia_tlt" | jq -r '.id')
Alternatively, run export directly from the command line:

tao model clip export -e /path/to/experiment_spec.yaml

Required Arguments

  • -e, --experiment_spec_file: Path to the experiment specification file.

Optional Arguments

  • export.checkpoint: Path to the checkpoint.

  • export.encoder_type: combined or separate.

  • export.onnx_file: Output path for the ONNX file.

  • -h, --help: Display the help message.

Sample Usage: Combined Encoder

tao model clip export -e /path/to/experiment_spec.yaml \
  export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  export.encoder_type=combined

Sample Usage: Separate Encoders

tao model clip export -e /path/to/experiment_spec.yaml \
  export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  export.encoder_type=separate

Usage Notes for ONNX and TensorRT Deployment#

The following notes apply when you load the exported ONNX model or TRT engine directly, outside of TAO Deploy.

Attention Mask Behavior#

attention_mask is present as an ONNX graph input for backward compatibility, but its values are not used. The model always substitutes an all-ones mask internally. You can safely pass the tokenizer’s attention_mask or np.ones_like(input_ids); both produce identical results.

Warning

Currently, attention_mask is accepted as an explicit graph input for backward compatibility only. This input is deprecated and scheduled for removal. Remove it from your inference pipeline to avoid a future breaking change.

Sequence Length#

The text inputs (input_ids and attention_mask) must use the same max_length passed to the tokenizer. CLIP tokenizers typically use 77; SigLIP2 tokenizers use 64. Passing a different length causes a shape mismatch at runtime.
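A minimal sketch of shaping the text inputs to the fixed sequence length. The token ids and pad id below are made-up placeholders; in practice, use your model's tokenizer with its own pad token:

```python
import numpy as np

MAX_LEN = 64   # SigLIP2 tokenizers; use 77 for CLIP tokenizers

def pad_to_max(token_ids, max_len=MAX_LEN, pad_id=0):
    """Pad or truncate token ids to the fixed length the graph expects."""
    token_ids = list(token_ids)[:max_len]
    token_ids += [pad_id] * (max_len - len(token_ids))
    return np.asarray(token_ids, dtype=np.int64)

input_ids = pad_to_max([101, 7592, 2088, 102])[None]   # shape (1, MAX_LEN)
attention_mask = np.ones_like(input_ids)               # values are ignored
```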

Dynamic Batch and TensorRT Shape Profiles#

When export.batch_size is set to -1, the batch dimension is dynamic. When building a TRT engine with trtexec, provide --minShapes, --optShapes, and --maxShapes for every input. For a combined encoder with 77-token sequences:

trtexec --onnx=clip_model.onnx \
  --minShapes=image:1x3x256x256,input_ids:1x77,attention_mask:1x77 \
  --optShapes=image:8x3x256x256,input_ids:8x77,attention_mask:8x77 \
  --maxShapes=image:32x3x256x256,input_ids:32x77,attention_mask:32x77

The attention_mask shape profile must match input_ids exactly. After the deprecation takes effect, omit attention_mask from all three shape arguments.

Image Preprocessing#

The image tensor must be preprocessed to the same pixel statistics used during training. TAO Deploy handles this automatically when running tao deploy clip inference. When loading the ONNX model directly, apply the per-channel mean and standard deviation stored in <name>_config.yaml (exported alongside the ONNX file).
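A sketch of the preprocessing step. The mean and standard deviation values below are placeholders; substitute the per-channel statistics from your <name>_config.yaml:

```python
import numpy as np

# Placeholder statistics; read the real values from <name>_config.yaml.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)

def preprocess(image_hwc):
    """uint8 H x W x 3 image -> float32 1 x 3 x H x W tensor (NCHW)."""
    x = image_hwc.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - MEAN) / STD                       # per-channel normalization
    return np.ascontiguousarray(x.transpose(2, 0, 1))[None]

tensor = preprocess(np.zeros((256, 256, 3), dtype=np.uint8))
```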

Logit Scale and Logit Bias#

Both logit_scale and logit_bias are exported as scalar outputs. For SigLIP-style models, compute the match probability as sigmoid(logit_scale * dot(image_emb, text_emb) + logit_bias). For CLIP-style models, logit_bias is zero, so you can also apply softmax over the class scores. When using separate encoders, logit_scale and logit_bias are available from either encoder; you only need one copy.
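The SigLIP-style scoring rule can be sketched as follows. The scale, bias, and embedding values here are arbitrary placeholders, not the exported model's actual scalars:

```python
import numpy as np

def match_probability(image_emb, text_emb, logit_scale, logit_bias):
    """SigLIP-style match probability for L2-normalized embeddings."""
    logit = logit_scale * np.dot(image_emb, text_emb) + logit_bias
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid

# Placeholder values for illustration only.
img = np.array([0.6, 0.8])
txt_match = np.array([0.6, 0.8])          # same direction as the image
txt_other = np.array([0.8, -0.6])         # orthogonal direction
scale, bias = 10.0, -5.0

p_match = match_probability(img, txt_match, scale, bias)   # near 1
p_other = match_probability(img, txt_other, scale, bias)   # near 0
```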

Deploying with TensorRT#

After exporting to ONNX, convert the model to a TensorRT engine for optimized inference. CLIP TRT deployment supports FP16 and FP32. TAO supports both combined and separate encoder formats.

For the full gen_trt_engine, evaluate, and inference commands, refer to CLIP with TAO Deploy.