RADIO-CLIP object embeddings#

Role in VSS#

The RT-CV microservice (Object Detection and Tracking) can produce object embeddings from detection crops and text embeddings from prompts using a combined CLIP-style ONNX export: vision on TensorRT, text on ONNX Runtime with tokenizer assets. Runtime INI keys and APIs are documented under ReID and Embeddings on that page.

Note

Purpose-built model cards (inputs, outputs, licensing, integration notes) are maintained alongside NGC collaterals—for example RADIO-CLIP and SigLIP_2 under ngc-collaterals/multimodal/docs/purpose_built_models/ in the collaterals repo. Use those files together with the published TAO embedding guides when you need field-level model card detail.

RADIO-CLIP and SigLIP 2 at a glance#

RADIO-CLIP — C-RADIO vision backbones (for example ViT-H/16) paired with a SigLIP text adapter; combined checkpoints suit industrial retrieval and re-identification. For RADIO-CLIP, set model.type to a c-radio_v3-* value and set model.adaptor_name to siglip or clip. Image weights may load via torch.hub (NVlabs/RADIO) on first use. See the RADIO-CLIP model card and RADIO (GitHub).

SigLIP 2 — Vision–language encoders with multiple resolutions (for example 224–512) and joint image–text embeddings for retrieval and ReID-style use cases. Typical TAO entry uses Hugging Face weights; model.type selects the variant (for example siglip2-so400m-patch16-256). See the SigLIP 2 model card and the SigLIP 2 paper.
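Because both families emit joint image–text embeddings, retrieval reduces to cosine similarity between L2-normalized vectors. The following is a minimal numpy sketch, not a TAO or RT-CV API; the 1152-dim random vectors are dummies standing in for SigLIP 2 outputs:

```python
import numpy as np

def cosine_retrieval(image_embs, text_emb, top_k=3):
    """Rank image embeddings by cosine similarity to one text embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    scores = img @ txt                      # cosine similarity per image
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

# Dummy 1152-dim vectors standing in for SigLIP 2 embeddings
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 1152))
query = gallery[2] + 0.01 * rng.normal(size=1152)  # near-duplicate of item 2
order, scores = cosine_retrieval(gallery, query)
# order[0] == 2: the near-duplicate ranks first
```

The same ranking logic applies whether the vectors come from a RADIO-CLIP or a SigLIP 2 export, as long as query and gallery share one embedding space.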

Hardware and software requirements#

Follow the TAO Toolkit Quick Start — Requirements. Use a Linux environment with a supported NVIDIA GPU stack; CLIP training and export use the same TAO Launcher pattern as other embedding models.

Dataset requirements#

TAO CLIP fine-tuning expects image–text pairs. Two dataset layouts are supported and documented in depth in the TAO docs:

Custom directory layout — Images and captions in parallel trees; each .txt caption file shares a basename with its image (images/foo.jpg pairs with captions/foo.txt). Supported image extensions include .jpg, .jpeg, .png.

WebDataset (WDS) — Sharded .tar archives with paired image and caption keys; suitable for large or cloud-hosted corpora (including S3 URLs via shard lists).

For inference-time text embedding, provide a plain-text file with one prompt per line and set inference.text_file in the experiment spec. Full layout examples and YAML snippets are in the CLIP introduction (mirrors the in-repo tlt-docs/docs/text/embedding/clip/overview.rst).
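As a sketch of the custom layout's pairing rule, the hedged Python snippet below (the function name is illustrative, not a TAO API) matches each image to the caption file sharing its basename, the way the loader expects:

```python
from pathlib import Path

# Extensions listed as supported in the TAO docs
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def pair_images_with_captions(image_dir, caption_dir, suffix=".txt"):
    """Return (image, caption) path pairs matched by shared basename."""
    pairs = []
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        cap = Path(caption_dir) / (img.stem + suffix)
        if cap.exists():  # images without a caption are skipped here
            pairs.append((img, cap))
    return pairs
```

Images without a matching caption are silently skipped in this sketch; a real preprocessing pass would likely want to report them instead.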

Supported model.type values (summary)#

The TAO CLIP introduction documents all families. The table below is a condensed reference; see the published page for the authoritative list.

| Family | Example model.type values | Typical image size | Embedding dim | Weights source |
|--------|---------------------------|--------------------|---------------|----------------|
| RADIO-CLIP | c-radio_v3-b, c-radio_v3-l, c-radio_v3-h, c-radio_v3-g | 224 | 768–1536 | torch.hub |
| SigLIP 2 | siglip2-so400m-patch16-256, patch14-224, patch16-384, patch16-512, patch16-naflex (training only; see TAO docs) | 224–512 | 1152 | Hugging Face |
| OpenCLIP / NV-CLIP | ViT-L-14-SigLIP-CLIPA-224, ViT-L-14-SigLIP-CLIPA-336, additional ViT-H variants per TAO docs | 224–574 | 768–1024 | Hugging Face |

For RADIO-CLIP (c-radio_v3-*), set model.adaptor_name to siglip (typical) or clip as required by your spec.

Deployment limitations (read before export)#

  • TensorRT deployment supports FP16 and FP32 for generated engines.

  • The siglip2-so400m-patch16-naflex variant supports dynamic resolution in training but cannot be exported to ONNX or TensorRT in current TAO—choose a fixed-resolution SigLIP 2 variant for RT-CV deployment.

  • attention_mask may appear as an ONNX input for backward compatibility; behavior is documented in the CLIP introduction.

TAO fine-tuning configuration#

Fine-tuning uses a YAML experiment specification file (same pattern as other TAO embedding models). The spec is organized into sections that mirror the CLIP training documentation and the in-repo sources under tlt-docs/docs/text/embedding/clip/training.rst.

Invoke tasks with:

tao model clip <action> -e /path/to/experiment_spec.yaml [overrides]

# TensorRT engine generation uses TAO Deploy:
tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml

Supported actions: train, evaluate, inference, export, and (TAO Deploy) gen_trt_engine.

An experiment specification file consists of several main components:

  • model — backbone type, adaptor_name (required for RADIO-CLIP c-radio_v3-* types), freeze flags, text canonicalization

  • train — epochs, GPUs, loss (siglip / clip), precision, optimizer (train.optim), resume paths

  • dataset — train / val loaders (custom or wds), augmentation

  • evaluate — checkpoint and batch size for retrieval metrics

  • inference — image dirs, text_file for prompts

  • export — ONNX path, encoder_type (combined or separate), input height/width, opset

  • gen_trt_engine — ONNX input, TRT output path, TensorRT precision and batch bounds

For field-by-field tables (defaults and valid options), use the published CLIP training page (sourced from tlt-docs).

Example experiment specification#

The following example follows the TAO Launcher layout for RADIO-CLIP (c-radio_v3-h): custom image–caption dataset, combined ONNX export at 224×224, FP16 TensorRT profile. Adjust model.type / adaptor_name if you use another C-RADIO size or SigLIP 2 (see the next subsection).

results_dir: /results/clip_experiment

model:
  type: c-radio_v3-h
  adaptor_name: siglip
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false

train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  loss_type: siglip
  precision: fp16
  grad_checkpointing: false
  grad_clip_norm: null
  distributed_strategy: ddp
  validation_interval: 1
  val_check_interval: null
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine

dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets:
      - image_dir: /data/val/images
        caption_dir: /data/val/captions
    batch_size: 16
    num_workers: 8
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2

evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16

inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16

export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 224
  input_width: 224
  batch_size: -1
  opset_version: 17

gen_trt_engine:
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  trt_engine: /results/clip_experiment/deploy/clip_model.engine
  batch_size: -1
  tensorrt:
    workspace_size: 4096
    data_type: fp16
    min_batch_size: 1
    opt_batch_size: 8
    max_batch_size: 16

SigLIP 2 alternative (example)#

If you fine-tune a SigLIP 2 backbone instead, set model.type accordingly and leave adaptor_name null unless the TAO spec requires otherwise:

model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null

Launch model fine-tuning#

Below are important fields to set before launching training (match paths to your dataset):

dataset:
  train:
    datasets:
      - image_dir: ??  # training images
        caption_dir: ??  # parallel captions (.txt per image)
        caption_file_suffix: .txt
        image_list_file: null
  val:
    datasets:
      - image_dir: ??
        caption_dir: ??
model:
  type: c-radio_v3-h
  adaptor_name: siglip
train:
  num_epochs: ??  # e.g. 50–100 for domain fine-tune
  num_gpus: ??    # or pass train.num_gpus=4 on the CLI
  pretrained_model_path: null  # or path to a TAO checkpoint
  loss_type: siglip
  precision: fp16

Launch training:

tao model clip train -e /path/to/experiment_spec.yaml \
  results_dir=/results/clip_run1 \
  train.num_gpus=4

Evaluate the fine-tuned model#

Point evaluate.checkpoint at your trained .pth (or leave unset for zero-shot pretrained evaluation per TAO docs).

tao model clip evaluate -e /path/to/experiment_spec.yaml \
  evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth

Run inference (embedding extraction)#

Inference writes HDF5 outputs under results_dir (image and text embeddings). Configure inference.datasets and inference.text_file as in the example spec above.

tao model clip inference -e /path/to/experiment_spec.yaml \
  inference.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  inference.text_file=/data/prompts.txt \
  results_dir=/results/clip_experiment/inference
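The exact dataset names inside the HDF5 outputs are determined by the TAO release, so the hedged reader below simply enumerates every dataset present rather than assuming key names; it is a sketch, not part of the TAO API:

```python
import h5py
import numpy as np

def load_embeddings(h5_path):
    """Collect every dataset in an HDF5 file into a dict of numpy arrays."""
    out = {}
    def collect(name, obj):
        # visititems walks groups and datasets; keep only datasets
        if isinstance(obj, h5py.Dataset):
            out[name] = np.asarray(obj)
    with h5py.File(h5_path, "r") as f:
        f.visititems(collect)
    return out
```

Inspecting the returned dict's keys and array shapes is a quick way to confirm the embedding dimension before wiring the outputs into a vector index.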

Export the fine-tuned model#

Use export.encoder_type: combined for a single ONNX for RT-CV-style deployment (vision + text). Match export.input_height / input_width to the trained resolution. Export also emits _config.yaml and a tokenizer directory beside the ONNX—keep them together for TAO Deploy.

tao model clip export -e /path/to/experiment_spec.yaml \
  export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  export.encoder_type=combined

Generate a TensorRT engine (TAO Deploy)#

tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml \
  gen_trt_engine.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  gen_trt_engine.trt_engine=/results/clip_experiment/deploy/clip_model.engine \
  gen_trt_engine.tensorrt.data_type=fp16

Further export shapes, separate vision/text ONNX, tokenizer artifacts, and ONNX external-data notes are in the Exporting the Model section of the CLIP training documentation.

Integrating fine-tuned weights into RT-CV#

  1. Export a combined image+text ONNX (and tokenizer directory) so vision and text paths stay aligned.

  2. Deploy artifacts beside the RT-CV container or mount them with volumes.

  3. In DeepStream, enable the vision encoder and text embedder blocks; set onnx-model-path and tokenizer-dir; rebuild the TensorRT engine for the vision branch. Keep image size, tokenizer max length, and normalization consistent with training.

  4. Validate REST text embedding APIs, Kafka embedding dimensions, and downstream services (for example Behavior Analytics).

Note

Changing embedding dimension or geometry usually requires re-indexing vectors and rechecking similarity thresholds—see Model customization overview.
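The note above can be made concrete: a vector index built for one embedding dimension cannot accept vectors of another, so swapping backbones (for example a 768-dim vision branch for a 1152-dim SigLIP 2 one) forces a gallery rebuild. A hedged numpy sketch of the guard an ingestion path might apply (the function name is illustrative):

```python
import numpy as np

def validate_for_index(vectors, index_dim):
    """Reject vectors whose dimension does not match the existing index."""
    vectors = np.asarray(vectors, dtype=np.float32)
    if vectors.shape[-1] != index_dim:
        raise ValueError(
            f"embedding dim {vectors.shape[-1]} != index dim {index_dim}; "
            "re-index the gallery and re-check similarity thresholds"
        )
    # L2-normalize so cosine similarity reduces to a dot product
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
```

Even when dimensions match, a backbone swap changes the embedding geometry, which is why similarity thresholds also need re-validation.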

Source material in this repo’s doc trees#

The following paths (local checkouts) align with the published TAO URLs above and are useful when drafting or reviewing details:

  • tlt-docs/docs/text/embedding/index.rst, clip/overview.rst, clip/training.rst, clip/applications.rst

  • ngc-collaterals/multimodal/docs/purpose_built_models/RADIO-CLIP/overview.md, SigLIP_2/overview.md