Cosmos-Embed1 (video embedding)#

Role in VSS#

The Real-Time Embedding microservice loads Cosmos-Embed1 variants by default (for example, the 224p / 336p / 448p Hugging Face checkpoints) and publishes video and text embeddings for semantic search and Kafka consumers. The Cosmos-Embed1 NGC model card lists the pretrained checkpoints. Container variables such as MODEL_PATH, the optional MODEL_IMPLEMENTATION_PATH, and the Triton repository scripts are described under Customizations in Real-Time Embedding.

Note

Full TAO reference for Cosmos-Embed1—including experiment schema tables, dataset configs, and optional Weights & Biases logging—lives in the published embedding chapter and mirrors the in-repo tlt-docs/docs/text/embedding/cosmos_embed1.rst (and cosmos_embed1_tables.rst) sources.

Architecture and tasks#

Cosmos-Embed1 is a dual-encoder video–text model: an EVA-ViT-G visual encoder on sampled frames (typically 224×224), a Q-Former distilled from video features, a BERT-style text encoder, and CLIP-style or SigLIP-style contrastive alignment. LoRA is supported for efficient fine-tuning of visual and Q-Former attention layers.
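The CLIP-style alignment mentioned above can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. The numpy sketch below is illustrative only; the temperature value and loss details are assumptions, not the TAO implementation:

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video/text embeddings."""
    # L2-normalize so similarity is cosine, as in CLIP-style alignment
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); matched pairs on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Training pushes matched pairs up the diagonal and mismatched pairs down, which is what makes the resulting embeddings usable for text-to-video retrieval.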

TAO tasks:

tao model cosmos-embed1 <action> -e /path/to/spec.yaml [overrides]

Where <action> is one of train, evaluate, inference, or export.

Hardware requirements#

Minimum (typical single-GPU training at 224p): one NVIDIA GPU with at least 40 GB GPU memory; Ubuntu 20.04+; CUDA 12.1+.

Recommended: multiple A100 or H100 GPUs, fast storage for video I/O, and sufficient CPU cores for dataloaders. Exact needs depend on resolution, batch size, and train.max_iter.

Data input (summary)#

Cosmos-Embed1 uses a JSON or JSONL metadata file mapping each video to a caption (and optional labels). The dataset_type field selects loaders. Common types include:

  • mock — Synthetic data for pipeline tests (no real video files)

  • vad_r1 / vad_r1_chunks — VAD-R1 anomaly-style video with captions and optional temporal chunks

  • msrvtt — MSR-VTT-style video–description pairs

  • kinetics — Kinetics-style action clips with metadata CSV

  • http — Videos referenced by HTTP URLs in metadata

Field-level requirements (paths, caption_field, chunking) are in the Cosmos-Embed1 TAO chapter. Align your frame sampling and resolution with the variant you deploy (the reference microservice often uses 8 frames per chunk for Cosmos-Embed1—confirm against your runtime config).
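As a concrete illustration of the metadata format, this sketch writes a minimal JSONL file. The field names (video, anomaly_type) are assumptions borrowed from the example spec's caption_field; check the Cosmos-Embed1 TAO chapter for the exact schema of your dataset_type:

```python
import json
import os
import tempfile

# Hypothetical records: one JSON object per line, each mapping a video
# (relative to data_root) to the field named by dataset.*.caption_field.
records = [
    {"video": "clips/cam01_0001.mp4", "anomaly_type": "loitering"},
    {"video": "clips/cam01_0002.mp4", "anomaly_type": "normal"},
]

metadata_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(metadata_path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

The resulting file would be referenced as dataset.train_dataset.metadata, with dataset.train_dataset.data_root pointing at the directory containing the clips.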

TAO fine-tuning configuration#

Fine-tuning uses YAML experiment files that follow the ExperimentConfig schema. Full parameter tables live in the published Cosmos-Embed1 chapter; they are generated from tlt-docs/docs/text/embedding/cosmos_embed1_tables.rst.

tao model cosmos-embed1 <action> -e /path/to/spec.yaml [overrides]

Top-level sections in the experiment file:

  • results_dir — run output root

  • wandb — optional Weights & Biases logging

  • model — pretrained_model_path, precision, input_hw, network (EVA-ViT, Q-Former, frames, resolution, contrastive type), lora, fsdp

  • dataset — train_dataset, val_dataset, test_dataset, inference_dataset (each a SingleDatasetConfig with dataset_type, metadata, data_root, batching, etc.)

  • train — max_iter, num_gpus, optim, loss_weights, callbacks, ema, resume paths, freeze_visual_encoder, …

  • evaluate — checkpoint, callbacks (top-K, UMAP), optional embedding cache paths

  • inference — checkpoint, query config, k, corpus cache paths

  • export — mode (video / text / combined / Hugging Face), onnx_file, opset_version, hf_output_dir

Example experiment specification#

Illustrative 224p layout (vad_r1-style metadata). Replace the example paths with your own, and align network.num_video_frames and dataset.*.num_video_frames with your deployment. When available, start from the packaged cosmos_embed1/configs/experiment_specs/*.yaml files in your TAO Toolkit tree and merge overrides rather than authoring a spec from scratch.

results_dir: /results/cosmos_experiment

wandb:
  enable: false
  project: cosmos_embed1

model:
  pretrained_model_path: nvidia/Cosmos-Embed1-224p
  pretrained_model_strict: true
  precision: bf16
  input_hw: [224, 224]
  network:
    embed_dim: 256
    num_query_tokens: 32
    max_txt_len: 128
    num_video_frames: 8
    spatial_resolution: [224, 224]
    contrastive_type: clip
  lora:
    enabled: false
    lora_rank: 8
    lora_alpha: 16

dataset:
  train_dataset:
    dataset_type: vad_r1
    metadata: /data/train.jsonl
    data_root: /data/videos
    batch_size: 4
    num_video_frames: 8
    resolution: [224, 224]
    caption_field: anomaly_type
  val_dataset:
    dataset_type: vad_r1
    metadata: /data/val.jsonl
    data_root: /data/videos
    batch_size: 4
    num_video_frames: 8
    resolution: [224, 224]

train:
  max_iter: 50000
  num_gpus: 1
  num_nodes: 1
  gpu_ids: [0]
  validation_iter: 1000
  checkpoint_iter: 1000
  clip_grad_norm: 0.0
  precision: bf16
  resume_training_checkpoint_path: null
  freeze_visual_encoder: true
  use_captioning_loss: true
  use_text_matching_loss: false
  optim:
    optim: adamw
    lr: 1.0e-5
    weight_decay: 1.0e-5
    betas: [0.9, 0.98]
    warmup_steps: 1000
    policy: cosine
    lr_decay_iters: 50000
  loss_weights:
    contrastive_loss: 1.0
    captioning_loss: 1.0
    matching_loss: 1.0

evaluate:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  num_gpus: 1

inference:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  k: 5

export:
  checkpoint: /results/cosmos_experiment/train/cosmos_embed1_model_latest.pth
  mode: combined
  opset_version: 17
  batch_size: 1
  simplify: false

LoRA fine-tuning (config fragment)#

Use a LoRA spec (for example finetune_224p_lora.yaml in the TAO starter layouts) and ensure the visual encoder allows PEFT injection (set transformer_engine: false on it when the TAO release requires this):

model:
  lora:
    enabled: true
    lora_rank: 8
    lora_alpha: 16
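LoRA keeps the base weight frozen and trains only a scaled low-rank update; with lora_rank: 8 and lora_alpha: 16 as above, the effective weight is W + (alpha / r) · B A. A minimal numpy sketch of the idea (not TAO's injection code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trained."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_out, d_in, r = 64, 32, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # zero-init: adapter starts as a no-op
x = rng.normal(size=(2, d_in))
```

Zero-initializing B means fine-tuning starts from the pretrained behavior and only diverges as the adapter trains, which is why LoRA is cheap and stable for the visual and Q-Former attention layers.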

Launch model fine-tuning#

Update dataset paths and training budget, then run:

dataset:
  train_dataset:
    metadata: ??  # JSON or JSONL
    data_root: ??
  val_dataset:
    metadata: ??
    data_root: ??
train:
  max_iter: ??  # e.g. 10000–50000 depending on data
  num_gpus: ??  # or train.num_gpus=-1 for all GPUs

tao model cosmos-embed1 train \
    -e /path/to/finetune_224p.yaml \
    results_dir=/results/my_experiment \
    model.pretrained_model_path=nvidia/Cosmos-Embed1-224p \
    dataset.train_dataset.metadata=/data/train.jsonl \
    dataset.train_dataset.data_root=/data/videos \
    dataset.val_dataset.metadata=/data/val.jsonl \
    dataset.val_dataset.data_root=/data/videos
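Command-line overrides such as dataset.train_dataset.metadata=… use dotted paths into the YAML tree. Conceptually they behave like the nested-dict update below (an illustration of the pattern, not TAO's actual parser):

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg["a"]["b"]["c"] = value for dotted_key "a.b.c"."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {})  # create intermediate sections as needed
    node[leaf] = value
    return cfg

# Start from a spec-file config and apply CLI-style overrides on top.
cfg = {"dataset": {"train_dataset": {"metadata": "??"}}}
apply_override(cfg, "dataset.train_dataset.metadata", "/data/train.jsonl")
apply_override(cfg, "train.num_gpus", -1)
```

Overrides therefore take precedence over the spec file, so a shared spec can stay generic while per-run paths and budgets come from the command line.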

Checkpoints are written under <results_dir>/train/ (for example cosmos_embed1_model_latest.pth).

Evaluate the fine-tuned model#

tao model cosmos-embed1 evaluate \
    -e /path/to/evaluate_224p.yaml \
    results_dir=/results/my_eval \
    evaluate.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
    dataset.test_dataset.metadata=/data/test.jsonl \
    dataset.test_dataset.data_root=/data/videos

Optional: set evaluate.save_dataset_pkl / evaluate.load_dataset_pkl to cache embeddings between runs.
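The cache flags let repeated evaluations skip recomputing corpus embeddings. The on-disk format is TAO-internal, but the idea is a plain serialize/restore round-trip, sketched here with pickle and placeholder data:

```python
import os
import pickle
import tempfile

# Placeholder corpus; in a real run these would be model-produced embeddings.
corpus = {"video_ids": ["a.mp4", "b.mp4"], "dim": 256}

cache_path = os.path.join(tempfile.mkdtemp(), "corpus_cache.pkl")
with open(cache_path, "wb") as f:
    pickle.dump(corpus, f)       # first run: compute once, then save

with open(cache_path, "rb") as f:
    restored = pickle.load(f)    # later runs: load instead of recomputing
```

Invalidate any such cache whenever the checkpoint, frame sampling, or resolution changes, since stale embeddings would silently skew metrics.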

Export ONNX and Hugging Face#

ONNX — export.mode selects the video, text, combined, or Hugging Face export path:

tao model cosmos-embed1 export \
    -e /path/to/export_onnx_224p.yaml \
    export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
    export.mode=combined

Hugging Face layout (for drop-in MODEL_PATH):

tao model cosmos-embed1 export \
    -e /path/to/export_hf_224p.yaml \
    export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
    export.hf_output_dir=/results/my_experiment/hf_export

Integrating fine-tuned weights into RT-Embedding#

  1. Fine-tune with TAO using resolution and sampling that match production (chunk length, frame count, variant).

  2. Export to ONNX or Hugging Face layout as required by your runtime.

  3. Set MODEL_PATH to a git: Hugging Face URL or a mounted local directory with the expected structure. If you change the Python implementation or Triton repository layout, set MODEL_IMPLEMENTATION_PATH and MODEL_REPOSITORY_SCRIPT_PATH consistently with Real-Time Embedding.

  4. Re-index Elasticsearch (or other vector stores) if embedding dimension or semantics change; rerun integration tests for RTSP ingest, file ingest, and Kafka consumers.
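Step 4 matters because vector stores score by fixed-dimension similarity. A numpy sketch of the cosine top-k ranking a store performs internally (an illustration, not the Elasticsearch API), showing why a changed embedding dimension forces re-indexing:

```python
import numpy as np

def top_k(query, corpus, k=5):
    """Rank corpus rows by cosine similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                    # requires query dim == corpus dim
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 256))               # old index: 256-dim embeddings
query = corpus[42] + 0.01 * rng.normal(size=256)   # near-duplicate of row 42
```

Even when dimensions match, embeddings from a different checkpoint occupy a different space, so rankings against an old index are meaningless; re-index whenever the model changes.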

Note

For embedding search and agent workflows, see Search Workflow and Model customization overview.

Source material in this repo’s doc trees#

  • tlt-docs/docs/text/embedding/cosmos_embed1.rst — full narrative, commands, and configuration

  • tlt-docs/docs/text/embedding/cosmos_embed1_tables.rst — generated parameter tables

  • tlt-docs/docs/text/embedding/index.rst — embedding overview and launcher pattern