Cosmos-Embed1 (video embedding)#
Role in VSS#
The Real-Time Embedding microservice loads Cosmos-Embed1 variants by default (for example, the 224p / 336p / 448p Hugging Face checkpoints) and publishes video and text embeddings for semantic search and Kafka consumers. The Cosmos-Embed1 NGC model card lists the pretrained checkpoints. Container variables such as MODEL_PATH, the optional MODEL_IMPLEMENTATION_PATH, and the Triton repository scripts are described under Customizations on Real-Time Embedding.
Note
Full TAO reference for Cosmos-Embed1—including experiment schema tables, dataset configs, and optional Weights & Biases logging—lives in the published embedding chapter and mirrors the in-repo tlt-docs/docs/text/embedding/cosmos_embed1.rst (and cosmos_embed1_tables.rst) sources.
Architecture and tasks#
Cosmos-Embed1 is a dual-encoder video–text model: an EVA-ViT-G visual encoder on sampled frames (typically 224×224), a Q-Former distilled from video features, a BERT-style text encoder, and CLIP-style or SigLIP-style contrastive alignment. LoRA is supported for efficient fine-tuning of visual and Q-Former attention layers.
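For intuition, the CLIP-style alignment can be sketched as a symmetric InfoNCE loss over a batch of paired video/text embeddings, where matched pairs sit on the diagonal of the similarity matrix. The snippet below is a toy pure-Python illustration with stand-in 2-d embeddings and a fixed temperature, not the model's actual training code:

```python
import math

def clip_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched video/text pairs sit on the diagonal."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    v = [norm(e) for e in video_emb]
    t = [norm(e) for e in text_emb]
    # scaled cosine-similarity logits, shape [batch, batch]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]

    def xent_diag(rows):
        # cross-entropy where the correct class for row i is column i
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-video direction
    return 0.5 * (xent_diag(logits) + xent_diag(cols))

# two perfectly matched pairs give a near-zero loss
v = [[1.0, 0.0], [0.0, 1.0]]
loss = clip_loss(v, v)
```

Swapping the text embeddings so no pair matches drives the loss up, which is the gradient signal that pulls matched video/text pairs together.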
TAO tasks:
```shell
tao model cosmos-embed1 <action> -e /path/to/spec.yaml [overrides]
```
Where <action> is one of train, evaluate, inference, or export.
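The trailing [overrides] are dot-path key=value pairs merged on top of the YAML spec. A minimal pure-Python sketch of that merge semantics (the real launcher's parser may differ in details such as type coercion):

```python
import ast

def apply_overrides(spec, overrides):
    """Apply TAO-style dot-path key=value overrides to a nested spec dict."""
    for item in overrides:
        path, _, raw = item.partition("=")
        keys = path.split(".")
        node = spec
        for k in keys[:-1]:
            node = node.setdefault(k, {})
        try:
            value = ast.literal_eval(raw)  # ints, floats, bools, lists
        except (ValueError, SyntaxError):
            value = raw                    # fall back to a plain string
        node[keys[-1]] = value
    return spec

spec = {"train": {"max_iter": 50000, "num_gpus": 1}}
apply_overrides(spec, ["train.max_iter=10000", "results_dir=/results/run1"])
```

After the call, `train.max_iter` is replaced while untouched keys such as `train.num_gpus` keep their spec-file values.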
Hardware requirements#
Minimum (typical single-GPU training at 224p): one NVIDIA GPU with at least 40 GB GPU memory; Ubuntu 20.04+; CUDA 12.1+.
Recommended: multiple A100 or H100 GPUs, fast storage for video I/O, and sufficient CPU cores for dataloaders. Exact needs depend on resolution, batch size, and train.max_iter.
Data input (summary)#
Cosmos-Embed1 uses a JSON or JSONL metadata file mapping each video to a caption (and optional labels). The dataset_type field selects loaders. Common types include:
| `dataset_type` | Description |
|---|---|
| | Synthetic data for pipeline tests (no real video files) |
| `vad_r1` | VAD-R1 anomaly-style video with captions and optional temporal chunks |
| | MSR-VTT-style video–description pairs |
| | Kinetics-style action clips with metadata CSV |
| | Videos referenced by HTTP URLs in metadata |
Field-level requirements (paths, caption_field, chunking) are in the Cosmos-Embed1 TAO chapter. Align your frame sampling and resolution with the variant you deploy (the reference microservice often uses 8 frames per chunk for Cosmos-Embed1—confirm against your runtime config).
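As an illustration, a JSONL metadata file can be generated as below. The field names `video` and `caption` are hypothetical placeholders; use the field names required by your chosen `dataset_type` as documented in the TAO chapter:

```python
import json

# Hypothetical record schema for illustration only; the authoritative
# field names (paths, caption_field, chunking) are in the Cosmos-Embed1
# TAO chapter for each dataset_type.
records = [
    {"video": "clips/cam01_0001.mp4", "caption": "a car crash at an intersection"},
    {"video": "clips/cam01_0002.mp4", "caption": "flooding on a highway"},
]

# JSONL: one JSON object per line
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```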
TAO fine-tuning configuration#
Fine-tuning uses YAML experiment files that follow the ExperimentConfig schema. Full parameter tables live in the published Cosmos-Embed1 chapter; they are generated from tlt-docs/docs/text/embedding/cosmos_embed1_tables.rst.
```shell
tao model cosmos-embed1 <action> -e /path/to/spec.yaml [overrides]
```
Top-level sections in the experiment file:
- `results_dir` — run output root
- `wandb` — optional Weights & Biases logging
- `model` — `pretrained_model_path`, `precision`, `input_hw`, `network` (EVA-ViT, Q-Former, frames, resolution, contrastive type), `lora`, `fsdp`
- `dataset` — `train_dataset`, `val_dataset`, `test_dataset`, `inference_dataset` (each a `SingleDatasetConfig` with `dataset_type`, `metadata`, `data_root`, batching, etc.)
- `train` — `max_iter`, `num_gpus`, `optim`, `loss_weights`, `callbacks`, `ema`, resume paths, `freeze_visual_encoder`, …
- `evaluate` — checkpoint, callbacks (top-K, UMAP), optional embedding cache paths
- `inference` — checkpoint, query config, `k`, corpus cache paths
- `export` — `mode` (video / text / combined / Hugging Face), `onnx_file`, `opset_version`, `hf_output_dir`
Example experiment specification#
Illustrative 224p layout (vad_r1-style metadata). Replace ?? with your paths; align network.num_video_frames and dataset.*.num_video_frames with deployment. When available, start from the packaged cosmos_embed1/configs/experiment_specs/*.yaml files in your TAO Toolkit tree and merge overrides rather than authoring a spec from scratch.
```yaml
results_dir: /results/cosmos_experiment
wandb:
  enable: false
  project: cosmos_embed1
model:
  pretrained_model_path: nvidia/Cosmos-Embed1-224p
  pretrained_model_strict: true
  precision: bf16
  input_hw: [224, 224]
  network:
    embed_dim: 256
    num_query_tokens: 32
    max_txt_len: 128
    num_video_frames: 8
    spatial_resolution: [224, 224]
    contrastive_type: clip
  lora:
    enabled: false
    lora_rank: 8
    lora_alpha: 16
dataset:
  train_dataset:
    dataset_type: vad_r1
    metadata: /data/train.jsonl
    data_root: /data/videos
    batch_size: 4
    num_video_frames: 8
    resolution: [224, 224]
    caption_field: anomaly_type
  val_dataset:
    dataset_type: vad_r1
    metadata: /data/val.jsonl
    data_root: /data/videos
    batch_size: 4
    num_video_frames: 8
    resolution: [224, 224]
train:
  max_iter: 50000
  num_gpus: 1
  num_nodes: 1
  gpu_ids: [0]
  validation_iter: 1000
  checkpoint_iter: 1000
  clip_grad_norm: 0.0
  precision: bf16
  resume_training_checkpoint_path: null
  freeze_visual_encoder: true
  use_captioning_loss: true
  use_text_matching_loss: false
  optim:
    optim: adamw
    lr: 1.0e-5
    weight_decay: 1.0e-5
    betas: [0.9, 0.98]
    warmup_steps: 1000
    policy: cosine
    lr_decay_iters: 50000
  loss_weights:
    contrastive_loss: 1.0
    captioning_loss: 1.0
    matching_loss: 1.0
evaluate:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  num_gpus: 1
inference:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  k: 5
export:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  mode: combined
  opset_version: 17
  batch_size: 1
  simplify: false
```
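The frame-count alignment called out above (network.num_video_frames matching each dataset's num_video_frames) can be enforced with a small sanity check once the spec is loaded into a dict, for example via yaml.safe_load. A sketch:

```python
def check_frame_alignment(spec):
    """Raise if any dataset's num_video_frames disagrees with the model's."""
    target = spec["model"]["network"]["num_video_frames"]
    for name, ds in spec.get("dataset", {}).items():
        if isinstance(ds, dict) and "num_video_frames" in ds:
            if ds["num_video_frames"] != target:
                raise ValueError(
                    f"{name}: num_video_frames={ds['num_video_frames']} "
                    f"does not match model's {target}"
                )

# dict literal stands in for yaml.safe_load(open("spec.yaml"))
spec = {
    "model": {"network": {"num_video_frames": 8}},
    "dataset": {
        "train_dataset": {"num_video_frames": 8},
        "val_dataset": {"num_video_frames": 8},
    },
}
check_frame_alignment(spec)  # passes silently when aligned
```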
LoRA fine-tuning (config fragment)#
Use a LoRA spec (for example finetune_224p_lora.yaml in the TAO starter layouts) and ensure the visual encoder allows PEFT injection (transformer_engine: false on the visual encoder when required by the TAO release):
```yaml
model:
  lora:
    enabled: true
    lora_rank: 8
    lora_alpha: 16
```
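Conceptually, LoRA leaves the frozen weight W untouched and adds a trainable low-rank update scaled by lora_alpha / lora_rank, i.e. W + (alpha/r)·B·A where A is r×d_in and B is d_out×r. A toy illustration with plain Python lists (not the actual PEFT implementation):

```python
def lora_forward(x, W, A, B, alpha, r):
    """y = (W + (alpha/r)·B·A)·x  — toy LoRA-adapted linear layer, no bias."""
    def matvec(M, v):
        return [sum(m * xi for m, xi in zip(row, v)) for row in M]
    def matmul(M, N):
        return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
                 for j in range(len(N[0]))] for i in range(len(M))]
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, d_out x d_in
    Wp = [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
          for i in range(len(W))]
    return matvec(Wp, x)

# identity base weight; rank-1 adapter touching only the first output
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]   # r x d_in
B = [[1.0], [0.0]] # d_out x r
y = lora_forward([1.0, 2.0], W, A, B, alpha=16, r=1)
```

Only A and B receive gradients during fine-tuning, which is why LoRA is cheap for the visual and Q-Former attention layers.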
Launch model fine-tuning#
Update dataset paths and training budget, then run:
```yaml
dataset:
  train_dataset:
    metadata: ??   # JSON or JSONL
    data_root: ??
  val_dataset:
    metadata: ??
    data_root: ??
train:
  max_iter: ??   # e.g. 10000–50000 depending on data
  num_gpus: ??   # or train.num_gpus=-1 for all GPUs
```

```shell
tao model cosmos-embed1 train \
  -e /path/to/finetune_224p.yaml \
  results_dir=/results/my_experiment \
  model.pretrained_model_path=nvidia/Cosmos-Embed1-224p \
  dataset.train_dataset.metadata=/data/train.jsonl \
  dataset.train_dataset.data_root=/data/videos \
  dataset.val_dataset.metadata=/data/val.jsonl \
  dataset.val_dataset.data_root=/data/videos
```
Checkpoints are written under <results_dir>/train/ (for example cosmos_embed1_model_latest.pth).
Evaluate the fine-tuned model#
```shell
tao model cosmos-embed1 evaluate \
  -e /path/to/evaluate_224p.yaml \
  results_dir=/results/my_eval \
  evaluate.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
  dataset.test_dataset.metadata=/data/test.jsonl \
  dataset.test_dataset.data_root=/data/videos
```
Optional: set evaluate.save_dataset_pkl / evaluate.load_dataset_pkl to cache embeddings between runs.
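The save/load pattern amounts to a compute-once cache of encoded embeddings. A minimal sketch of that idea (the file name and structure here are illustrative, not the TAO pickle format):

```python
import os
import pickle

CACHE = "embeddings_cache.pkl"

def get_embeddings(compute_fn, cache_path=CACHE):
    """Load cached embeddings if present; otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embs = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(embs, f)
    return embs

# first call computes and writes the cache; later calls skip the encoder
embs = get_embeddings(lambda: {"clip_a.mp4": [0.1, 0.2]})
```

Delete the cache file whenever the checkpoint or dataset changes, or the stale embeddings will be reused.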
Run inference (text / video search)#
```shell
tao model cosmos-embed1 inference \
  -e /path/to/inference_224p.yaml \
  inference.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
  dataset.inference_dataset.metadata=/data/search_corpus.jsonl \
  dataset.inference_dataset.data_root=/data/videos \
  'inference.query.input_texts=["a car crash", "flooding on highway"]'
```
Use inference.save_dataset_pkl / inference.load_dataset_pkl to cache the encoded corpus.
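Under the hood, text-to-video search reduces to ranking corpus video embeddings by cosine similarity to the query text embedding and keeping the top k. A self-contained sketch with stand-in vectors:

```python
import math

def top_k(query, corpus, k=5):
    """Rank corpus embeddings by cosine similarity to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    ranked = sorted(corpus.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return ranked[:k]

# stand-in 3-d embeddings; real Cosmos-Embed1 embeddings are e.g. 256-d
corpus = {"clip_a.mp4": [1.0, 0.0, 0.0], "clip_b.mp4": [0.0, 1.0, 0.0]}
result = top_k([0.9, 0.1, 0.0], corpus, k=1)
```

In production this ranking is delegated to the vector store (for example Elasticsearch) rather than computed in Python.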
Export ONNX and Hugging Face#
ONNX — export.mode selects video, text, combined, or Hugging Face export paths:
```shell
tao model cosmos-embed1 export \
  -e /path/to/export_onnx_224p.yaml \
  export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
  export.mode=combined
```
Hugging Face layout (for drop-in MODEL_PATH):
```shell
tao model cosmos-embed1 export \
  -e /path/to/export_hf_224p.yaml \
  export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/my_experiment/hf_export
```
Integrating fine-tuned weights into RT-Embedding#
1. Fine-tune with TAO using resolution and sampling that match production (chunk length, frame count, variant).
2. Export to ONNX or Hugging Face layout as required by your runtime.
3. Set `MODEL_PATH` to a `git:` Hugging Face URL or a mounted local directory with the expected structure. If you change the Python implementation or Triton repository layout, set `MODEL_IMPLEMENTATION_PATH` and `MODEL_REPOSITORY_SCRIPT_PATH` consistently with Real-Time Embedding.
4. Re-index Elasticsearch (or other vector stores) if the embedding dimension or semantics change; rerun integration tests for RTSP ingest, file ingest, and Kafka consumers.
Note
For embedding search and agent workflows, see Search Workflow and Model customization overview.
Source material in this repo’s doc trees#
- `tlt-docs/docs/text/embedding/cosmos_embed1.rst` — full narrative, commands, and configuration
- `tlt-docs/docs/text/embedding/cosmos_embed1_tables.rst` — generated parameter tables
- `tlt-docs/docs/text/embedding/index.rst` — embedding overview and launcher pattern