RADIO-CLIP object embeddings#
Role in VSS#
The RT-CV microservice (Object Detection and Tracking) can produce object embeddings from detection crops and text embeddings from prompts using a combined CLIP-style ONNX export: vision on TensorRT, text on ONNX Runtime with tokenizer assets. Runtime INI keys and APIs are documented under ReID and Embeddings on that page.
Note
Purpose-built model cards (inputs, outputs, licensing, integration notes) are maintained alongside NGC collaterals—for example RADIO-CLIP and SigLIP_2 under ngc-collaterals/multimodal/docs/purpose_built_models/ in the collaterals repo. Use those files together with the published TAO embedding guides when you need field-level model card detail.
RADIO-CLIP and SigLIP 2 at a glance#
RADIO-CLIP — C-RADIO vision backbones (for example ViT-H/16) paired with a SigLIP text adapter; combined checkpoints suit industrial retrieval and re-identification. For RADIO-CLIP, set model.type to a c-radio_v3-* value and set model.adaptor_name to siglip or clip. Image weights may be downloaded via torch.hub (NVlabs/RADIO) on first use. See the RADIO-CLIP model card and RADIO (GitHub).
SigLIP 2 — Vision–language encoders with multiple resolutions (for example 224–512) and joint image–text embeddings for retrieval and ReID-style use cases. Typical TAO entry uses Hugging Face weights; model.type selects the variant (for example siglip2-so400m-patch16-256). See the SigLIP 2 model card and the SigLIP 2 paper.
Hardware and software requirements#
Follow the TAO Toolkit Quick Start — Requirements. Use a Linux environment with a supported NVIDIA GPU stack; CLIP training and export use the same TAO Launcher pattern as other embedding models.
Dataset requirements#
TAO CLIP fine-tuning expects image–text pairs. Two dataset layouts are supported in depth in the TAO docs:
Custom directory layout — Images and captions in parallel trees; each .txt caption file shares a basename with its image (images/foo.jpg ↔ captions/foo.txt). Supported image extensions include .jpg, .jpeg, .png.
WebDataset (WDS) — Sharded .tar archives with paired image and caption keys; suitable for large or cloud-hosted corpora (including S3 URLs via shard lists).
For inference-time text embedding, provide a plain-text file with one prompt per line and set inference.text_file in the experiment spec. Full layout examples and YAML snippets are in the CLIP introduction (mirrors the in-repo tlt-docs/docs/text/embedding/clip/overview.rst).
Supported model.type values (summary)#
The TAO CLIP introduction documents all families. The table below is a condensed reference; see the published page for the authoritative list.
| Family | Example | Typical image size | Embedding dim | Weights source |
|---|---|---|---|---|
| RADIO-CLIP | `c-radio_v3-h` | 224 | 768–1536 | torch.hub (NVlabs/RADIO) |
| SigLIP 2 | `siglip2-so400m-patch16-256` | 224–512 | 1152 | Hugging Face |
| OpenCLIP / NV-CLIP | | 224–574 | 768–1024 | Hugging Face |
For RADIO-CLIP (c-radio_v3-*), set model.adaptor_name to siglip (typical) or clip as required by your spec.
Deployment limitations (read before export)#
- TensorRT deployment supports FP16 and FP32 for generated engines.
- The `siglip2-so400m-patch16-naflex` variant supports dynamic resolution in training but cannot be exported to ONNX or TensorRT in current TAO; choose a fixed-resolution SigLIP 2 variant for RT-CV deployment.
- `attention_mask` may appear as an ONNX input for backward compatibility; behavior is documented in the CLIP introduction.
TAO fine-tuning configuration#
Fine-tuning uses a YAML experiment specification file (same pattern as other TAO embedding models). The spec is organized into sections that mirror the CLIP training documentation and the in-repo sources under tlt-docs/docs/text/embedding/clip/training.rst.
Invoke tasks with:
```shell
tao model clip <action> -e /path/to/experiment_spec.yaml [overrides]

# TensorRT engine generation uses TAO Deploy:
tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml
```
Supported actions: train, evaluate, inference, export, and (TAO Deploy) gen_trt_engine.
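Every action follows the same argv shape, so a thin wrapper can assemble the launcher call programmatically. A sketch, not part of the TAO API (the spec path is hypothetical; overrides use the dotted key=value form shown above):

```python
def tao_clip_cmd(action, spec_path, overrides=None):
    """Build a `tao model clip <action>` argv list with dotted key=value overrides."""
    if action not in {"train", "evaluate", "inference", "export"}:
        # gen_trt_engine goes through `tao deploy clip`, not `tao model clip`
        raise ValueError(f"unsupported action: {action}")
    cmd = ["tao", "model", "clip", action, "-e", spec_path]
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd
```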
An experiment specification file consists of several main components:
- `model`: backbone `type`, `adaptor_name` (required for RADIO-CLIP `c-radio_v3-*` types), freeze flags, text canonicalization
- `train`: epochs, GPUs, loss (`siglip`/`clip`), precision, optimizer (`train.optim`), resume paths
- `dataset`: `train`/`val` loaders (`custom` or `wds`), `augmentation`
- `evaluate`: checkpoint and batch size for retrieval metrics
- `inference`: image dirs, `text_file` for prompts
- `export`: ONNX path, `encoder_type` (`combined` or `separate`), input height/width, opset
- `gen_trt_engine`: ONNX input, TRT output path, TensorRT precision and batch bounds
For field-by-field tables (defaults and valid options), use the published CLIP training page (sourced from tlt-docs).
Example experiment specification#
The following example follows the TAO Launcher layout for RADIO-CLIP (c-radio_v3-h): custom image–caption dataset, combined ONNX export at 224×224, FP16 TensorRT profile. Adjust model.type / adaptor_name if you use another C-RADIO size or SigLIP 2 (see the next subsection).
```yaml
results_dir: /results/clip_experiment
model:
  type: c-radio_v3-h
  adaptor_name: siglip
  freeze_vision_encoder: false
  freeze_text_encoder: false
  canonicalize_text: false
train:
  num_epochs: 100
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval: 10
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  loss_type: siglip
  precision: fp16
  grad_checkpointing: false
  grad_clip_norm: null
  distributed_strategy: ddp
  validation_interval: 1
  val_check_interval: null
  optim:
    optimizer_type: adamw
    vision_lr: 1.0e-4
    text_lr: 1.0e-4
    weight_decay: 1.0e-4
    betas: [0.9, 0.95]
    eps: 1.0e-6
    warmup_steps: 100
    scheduler: cosine
dataset:
  seed: 42
  train:
    type: custom
    datasets:
      - image_dir: /data/train/images
        caption_dir: /data/train/captions
        caption_file_suffix: .txt
        image_list_file: null
    batch_size: 16
    num_workers: 8
  val:
    datasets:
      - image_dir: /data/val/images
        caption_dir: /data/val/captions
    batch_size: 16
    num_workers: 8
  augmentation:
    scale: [0.4, 1.0]
    color_jitter: [0.8, 0.32, 0.32, 0.32, 0.08]
    grayscale: 0.2
evaluate:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  batch_size: 16
inference:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  datasets:
    - image_dir: /data/inference/images
  text_file: /data/inference/prompts.txt
  batch_size: 16
export:
  checkpoint: /results/clip_experiment/train/epoch_100.pth
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  encoder_type: combined
  input_height: 224
  input_width: 224
  batch_size: -1
  opset_version: 17
gen_trt_engine:
  onnx_file: /results/clip_experiment/export/clip_model.onnx
  trt_engine: /results/clip_experiment/deploy/clip_model.engine
  batch_size: -1
  tensorrt:
    workspace_size: 4096
    data_type: fp16
    min_batch_size: 1
    opt_batch_size: 8
    max_batch_size: 16
```
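A few cross-field constraints in a spec like this are easy to break when editing by hand. The following hedged sanity check uses plain dicts standing in for the parsed YAML; the rules are inferred from the key names and deployment limitations on this page, not an official TAO validator:

```python
def check_spec(spec):
    """Flag common cross-field inconsistencies in a CLIP experiment spec dict."""
    problems = []
    trt = spec.get("gen_trt_engine", {}).get("tensorrt", {})
    # TensorRT optimization profile must be ordered: min <= opt <= max
    if not (trt.get("min_batch_size", 1) <= trt.get("opt_batch_size", 1)
            <= trt.get("max_batch_size", 1)):
        problems.append("tensorrt batch bounds must satisfy min <= opt <= max")
    if trt.get("data_type") not in ("fp16", "fp32"):
        problems.append("generated engines support fp16/fp32 only")
    # The deploy step should consume the ONNX that export produced
    if spec.get("gen_trt_engine", {}).get("onnx_file") != spec.get("export", {}).get("onnx_file"):
        problems.append("gen_trt_engine.onnx_file should point at export.onnx_file")
    model = spec.get("model", {})
    if model.get("type", "").startswith("c-radio_v3") and \
            model.get("adaptor_name") not in ("siglip", "clip"):
        problems.append("c-radio_v3-* backbones need adaptor_name: siglip or clip")
    return problems
```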
SigLIP 2 alternative (example)#
If you fine-tune a SigLIP 2 backbone instead, set model.type accordingly and leave adaptor_name null unless the TAO spec requires otherwise:
```yaml
model:
  type: siglip2-so400m-patch16-256
  adaptor_name: null
```
Launch model fine-tuning#
Below are important fields to set before launching training (match paths to your dataset):
```yaml
dataset:
  train:
    datasets:
      - image_dir: ??              # training images
        caption_dir: ??            # parallel captions (.txt per image)
        caption_file_suffix: .txt
        image_list_file: null
  val:
    datasets:
      - image_dir: ??
        caption_dir: ??
model:
  type: c-radio_v3-h
  adaptor_name: siglip
train:
  num_epochs: ??                   # e.g. 50–100 for domain fine-tune
  num_gpus: ??                     # or pass train.num_gpus=4 on the CLI
  pretrained_model_path: null      # or path to a TAO checkpoint
  loss_type: siglip
  precision: fp16
```
Launch training:
```shell
tao model clip train -e /path/to/experiment_spec.yaml \
  results_dir=/results/clip_run1 \
  train.num_gpus=4
```
Evaluate the fine-tuned model#
Point evaluate.checkpoint at your trained .pth (or leave unset for zero-shot pretrained evaluation per TAO docs).
```shell
tao model clip evaluate -e /path/to/experiment_spec.yaml \
  evaluate.checkpoint=/results/clip_experiment/train/epoch_100.pth
```
Run inference (embedding extraction)#
Inference writes HDF5 outputs under results_dir (image and text embeddings). Configure inference.datasets and inference.text_file as in the example spec above.
```shell
tao model clip inference -e /path/to/experiment_spec.yaml \
  inference.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  inference.text_file=/data/prompts.txt \
  results_dir=/results/clip_experiment/inference
```
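The HDF5 layout of the outputs is described in the TAO docs; once image and text vectors are loaded (with h5py or similar), ranking prompts against a detection crop is a cosine-similarity computation. A dependency-free sketch with plain lists (the vectors in the test are toy values, not real model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_prompts(image_vec, prompt_vecs):
    """Return prompt indices sorted by descending similarity to the image."""
    scores = [cosine(image_vec, p) for p in prompt_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```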
Export the fine-tuned model#
Use export.encoder_type: combined for a single ONNX for RT-CV-style deployment (vision + text). Match export.input_height / input_width to the trained resolution. Export also emits _config.yaml and a tokenizer directory beside the ONNX—keep them together for TAO Deploy.
```shell
tao model clip export -e /path/to/experiment_spec.yaml \
  export.checkpoint=/results/clip_experiment/train/epoch_100.pth \
  export.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  export.encoder_type=combined
```
Generate a TensorRT engine (TAO Deploy)#
```shell
tao deploy clip gen_trt_engine -e /path/to/experiment_spec.yaml \
  gen_trt_engine.onnx_file=/results/clip_experiment/export/clip_model.onnx \
  gen_trt_engine.trt_engine=/results/clip_experiment/deploy/clip_model.engine \
  gen_trt_engine.tensorrt.data_type=fp16
```
Further export shapes, separate vision/text ONNX, tokenizer artifacts, and ONNX external-data notes are in the Exporting the Model section of the CLIP training documentation.
Integrating fine-tuned weights into RT-CV#
1. Export a combined image+text ONNX (and tokenizer directory) so the vision and text paths stay aligned.
2. Deploy the artifacts beside the RT-CV container or mount them with volumes.
3. In DeepStream, enable the vision encoder and text embedder blocks; set `onnx-model-path` and `tokenizer-dir`; rebuild the TensorRT engine for the vision branch. Keep image size, tokenizer max length, and normalization consistent with training.
4. Validate the REST text embedding APIs, Kafka embedding dimensions, and downstream services (for example Behavior Analytics).
Note
Changing embedding dimension or geometry usually requires re-indexing vectors and rechecking similarity thresholds—see Model customization overview.
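This caveat can be enforced mechanically at ingest time. A toy guard (a list-backed stand-in for a real vector store, purely illustrative):

```python
class VectorIndex:
    """Toy index that rejects vectors whose dimension doesn't match the store."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vec):
        if len(vec) != self.dim:
            # A new model with a different embedding dim requires re-indexing,
            # not mixing geometries in one store.
            raise ValueError(f"expected dim {self.dim}, got {len(vec)}; re-index after changing models")
        self.vectors.append(list(vec))
```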
Source material in this repo’s doc trees#
The following paths (local checkouts) align with the published TAO URLs above and are useful when drafting or reviewing details:
- `tlt-docs/docs/text/embedding/`: `index.rst`, `clip/overview.rst`, `clip/training.rst`, `clip/applications.rst`
- `ngc-collaterals/multimodal/docs/purpose_built_models/`: `RADIO-CLIP/overview.md`, `SigLIP_2/overview.md`