Cosmos-Embed1#
Cosmos-Embed1 is a video-text embedding model designed for physical AI applications. It supports the following tasks:
- `train`
- `evaluate`
- `inference`
- `export`
You can invoke these tasks using the TAO Launcher:
tao model cosmos-embed1 <action> -e /path/to/spec.yaml [overrides]
Where:
- `<action>`: One of `train`, `evaluate`, `inference`, or `export`
- `overrides`: Optional dot-notation key-value pairs that override configuration fields
Architecture Overview#
Cosmos-Embed1 is a dual-encoder video-text embedding model that produces a joint embedding space for videos and text. The model architecture combines:
EVA-ViT-G visual encoder: A large Vision Transformer (ViT) backbone that processes sampled video frames at 224 × 224 pixel resolution
Q-Former module: A Querying Transformer that distills visual features into a fixed number of learnable query token embeddings, initialized from a pretrained Bidirectional Encoder Representations from Transformers (BERT) checkpoint
Text encoder: A BERT-based text encoder that projects text captions into the same embedding space as the video encoder
Contrastive loss: Either CLIP (Contrastive Language-Image Pretraining)-style or SigLIP-style contrastive learning to align video and text embeddings
The model supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, which significantly reduces the number of trainable parameters by adding low-rank decomposition matrices to the visual encoder and Q-Former attention layers.
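As a back-of-envelope illustration of the parameter savings, consider a single square projection layer. The dimensions below are illustrative only, not the model's actual layer sizes:

```python
# LoRA freezes the original d_in x d_out weight and trains only two
# low-rank matrices A (d_in x r) and B (r x d_out) added in parallel.
def full_params(d_in, d_out):
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

d = 1408  # hypothetical hidden size, chosen only for illustration
full = full_params(d, d)
lora = lora_params(d, d, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
```

For a single such layer at rank 8, the adapter trains roughly 1% of the parameters that full fine-tuning would.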
Use Cases#
Cosmos-Embed1 is designed for physical AI data curation tasks that require understanding the semantic content of video collections:
Text-to-video retrieval: Find the most semantically relevant videos for a text query, such as “flooding on a highway” or “robotic arm picking an object”
Video-to-video retrieval: Find videos that are semantically similar to a query video
Semantic deduplication: Identify near-duplicate or highly similar videos in a large corpus by comparing their embedding vectors
Targeted filtering: Filter a video dataset by semantic category or anomaly type using text-based queries
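For example, semantic deduplication reduces to thresholding pairwise cosine similarity between embedding vectors. A minimal sketch, assuming the embeddings have already been computed into a NumPy array (the `find_near_duplicates` helper is illustrative, not part of the toolkit):

```python
import numpy as np

def find_near_duplicates(embeddings, threshold=0.95):
    """Return index pairs whose cosine similarity exceeds `threshold`.

    `embeddings` is an (N, D) array of video embeddings, e.g. the
    vectors produced by the video encoder.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    pairs = []
    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):
            if sim[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Tiny synthetic example: rows 0 and 1 are near-identical.
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(find_near_duplicates(emb, threshold=0.95))  # -> [(0, 1)]
```

The O(N²) pairwise loop is fine for small corpora; large collections typically use an approximate nearest-neighbor index instead.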
Hardware Requirements#
The following hardware is required to run Cosmos-Embed1:
Minimum:
One NVIDIA GPU with at least 40 GB of GPU memory (for single-GPU training at 224p resolution)
Ubuntu 20.04 or later
CUDA 12.1 or later
Recommended:
Multiple NVIDIA A100 or H100 GPUs for faster training
High-bandwidth storage for video data access
Multiple CPU cores for parallel data loading
Data Input for Cosmos-Embed1#
Cosmos-Embed1 uses a JSON or JSONL metadata file to describe the video dataset. Each entry in the metadata file maps a video path to a text caption (and optionally a label).
Supported Dataset Types#
Cosmos-Embed1 supports the following dataset types, configured with the `dataset_type` field. For the full list of configuration parameters, see the SingleDatasetConfig table. For metadata format details and required fields per dataset type, see the Dataset Format Reference.
| `dataset_type` | Description |
|---|---|
| `mock` | Generates random data for testing and validation without real video files. Configured entirely through `SingleDatasetConfig` parameters; no metadata file is needed. |
| `vad_r1` | Loads the VAD-R1 video anomaly detection dataset from a JSONL metadata file. VAD-R1 provides Perception-to-Cognition Chain-of-Thought annotations (what, when, where, why, how) for anomalous events in surveillance videos. Requires `metadata`. |
| `vad_r1_chunks` | Loads VAD-R1 with temporal chunking, splitting each video into fixed-duration segments for fine-grained anomaly detection at the chunk level. Requires `metadata`. |
| `msrvtt` | Loads the MSR-VTT video description dataset. MSR-VTT provides 10K web video clips with 200K clip-sentence pairs across diverse categories. Requires `metadata`. |
| `kinetics` | Loads the Kinetics action-recognition dataset. Requires `metadata`. |
| `http` | Loads videos streamed from HTTP/HTTPS URLs listed in a JSON metadata file. Requires `metadata`. |
Metadata File Format#
The metadata file is a JSON or JSONL file where each entry contains at minimum a path to a video file and a caption field. The following example shows the VAD-R1 format:
[
{
"path": "/data/videos/clip_001.mp4",
"anomaly_type": "car crash",
"split": "train",
"total_frames": 450,
"video_duration_sec": 15.0,
"fps": 30.0,
"anomaly_start_frame": 90,
"anomaly_end_frame": 300,
"anomaly_start_sec": 3.0,
"anomaly_end_sec": 10.0,
"what": "a car runs a red light and collides with another vehicle",
"when": "[0.2, 0.67]",
"where": "center of the intersection",
"why": "the driver failed to stop at the red light",
"how": "risk of injury to occupants and traffic disruption"
}
]
The core fields describe the video and the anomaly timing:
- `path`: Path to the video file on disk.
- `anomaly_type`: Category label for the anomaly (e.g., "car crash"). Used as the default caption field.
- `split`: Dataset split, e.g., "train", "val", or "test".
- `total_frames`: Total number of frames in the video.
- `video_duration_sec`: Total video duration in seconds.
- `fps`: Frames per second of the video.
- `anomaly_start_frame`/`anomaly_end_frame`: Start and end frame indices of the annotated anomaly region.
- `anomaly_start_sec`/`anomaly_end_sec`: Start and end timestamps (in seconds) of the annotated anomaly region.
The Perception-to-Cognition Chain-of-Thought (P2C-CoT) reasoning fields provide structured
annotations that describe the anomaly from multiple perspectives. Any of these fields can be
used as alternative caption fields via the caption_field configuration parameter:
- `what`: Describes the anomalous event itself (e.g., "a car runs a red light and collides with another vehicle").
- `when`: Normalized time range of the anomaly as a proportion of the total video duration, expressed as `[start, end]` (e.g., `[0.2, 0.67]`).
- `where`: Spatial location of the anomaly within the video frame (e.g., "center of the intersection").
- `why`: Explains why the event is considered anomalous (e.g., "the driver failed to stop at the red light").
- `how`: Describes the potential consequences or severity of the anomaly (e.g., "risk of injury to occupants and traffic disruption").
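The timing fields are mutually consistent, which makes them easy to sanity-check when preparing your own metadata. A small illustrative validator (the `check_vad_entry` helper is not part of the toolkit):

```python
import json

def check_vad_entry(entry):
    """Sanity-check the timing fields of a VAD-R1 metadata entry and
    convert the normalized `when` range back into seconds."""
    dur = entry["video_duration_sec"]
    fps = entry["fps"]
    # Frame count and duration should agree with the stated fps.
    assert abs(entry["total_frames"] - dur * fps) <= fps, "frame/fps mismatch"
    assert 0 <= entry["anomaly_start_sec"] < entry["anomaly_end_sec"] <= dur
    # `when` stores the anomaly window as fractions of the duration.
    start_frac, end_frac = json.loads(entry["when"])
    return start_frac * dur, end_frac * dur

entry = {
    "video_duration_sec": 15.0, "fps": 30.0, "total_frames": 450,
    "anomaly_start_sec": 3.0, "anomaly_end_sec": 10.0,
    "when": "[0.2, 0.67]",
}
print(check_vad_entry(entry))  # -> approximately (3.0, 10.05)
```

Note how the recovered window (3.0 s to about 10.05 s) matches `anomaly_start_sec`/`anomaly_end_sec`: `when` is just those timestamps divided by the duration.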
For the `vad_r1_chunks` dataset type, each entry also contains a `chunks` array that defines temporal segments of the video. Each chunk specifies a time range, frame range, and whether that segment contains an anomaly:
[
{
"video_path": "/data/videos/clip_001.mp4",
"anomaly_type": "Animals Obstructing Traffic",
"split": "test",
"video_duration_sec": 29.42,
"fps": 30.0,
"what": "a giraffe walks onto the road and blocks traffic",
"chunks": [
{
"chunk_index": 0,
"start_time_sec": 0.0,
"end_time_sec": 5.0,
"duration_sec": 5.0,
"start_frame": 0,
"end_frame": 150,
"is_anomaly": true,
"overlap_ratio": 1.0
},
{
"chunk_index": 1,
"start_time_sec": 5.0,
"end_time_sec": 10.0,
"duration_sec": 5.0,
"start_frame": 150,
"end_frame": 300,
"is_anomaly": true,
"overlap_ratio": 1.0
}
]
}
]
The `chunk_size_sec` configuration parameter controls the duration of each temporal chunk (default: 5.0 seconds). The `is_anomaly` field indicates whether the chunk overlaps with an annotated anomaly region, and `overlap_ratio` indicates the degree of overlap.
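The chunk fields can be derived from the whole-video annotation. The following sketch reconstructs the schema above from a duration, an fps, and an anomaly interval (illustrative only; the toolkit's internal chunking code may differ in rounding details):

```python
import math

def make_chunks(duration_sec, fps, anomaly_start_sec, anomaly_end_sec,
                chunk_size_sec=5.0):
    """Split a video into fixed-duration chunks and flag anomaly overlap."""
    chunks = []
    n = math.ceil(duration_sec / chunk_size_sec)
    for i in range(n):
        start = i * chunk_size_sec
        end = min(start + chunk_size_sec, duration_sec)
        # Length of the intersection between this chunk and the anomaly.
        overlap = max(0.0, min(end, anomaly_end_sec) - max(start, anomaly_start_sec))
        chunks.append({
            "chunk_index": i,
            "start_time_sec": start,
            "end_time_sec": end,
            "start_frame": round(start * fps),
            "end_frame": round(end * fps),
            "is_anomaly": overlap > 0,
            "overlap_ratio": overlap / (end - start),
        })
    return chunks

# A 15 s, 30 fps video with an anomaly from 3 s to 10 s.
chunks = make_chunks(15.0, 30.0, anomaly_start_sec=3.0, anomaly_end_sec=10.0)
print([c["overlap_ratio"] for c in chunks])  # -> [0.4, 1.0, 0.0]
```

A chunk that lies entirely inside the anomaly window gets `overlap_ratio` 1.0, as in the example metadata above.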
The `caption_field` configuration parameter controls which metadata field the model uses as the text caption. You can set it to a single field name (for example, `anomaly_type`) or to a list of field names (for example, `["anomaly_type", "what"]`), in which case the model randomly samples from the available fields during training.
Training Cosmos-Embed1#
Training fine-tunes the video and Q-Former encoders using contrastive and captioning losses. By default, the visual encoder weights are frozen and only the Q-Former and projection heads are trained.
To Train the Model#
tao model cosmos-embed1 train \
-e cosmos_embed1/configs/experiment_specs/finetune_224p.yaml \
results_dir=/results/my_experiment \
model.pretrained_model_path=/data/checkpoints/finetune_224p.pth \
dataset.train_dataset.metadata=/data/train.json \
dataset.train_dataset.data_root=/data/videos \
dataset.val_dataset.metadata=/data/val.json \
dataset.val_dataset.data_root=/data/videos
Required arguments:
- `-e`: Path to the experiment specification file
- `results_dir`: Directory in which to save checkpoints and logs
Optional arguments:
- `model.pretrained_model_path`: Path to a pretrained checkpoint to initialize from; accepts a local file path or a HuggingFace repository ID such as `nvidia/Cosmos-Embed1-224p`
- `train.num_gpus`: Number of GPUs to use; set to `-1` to use all available GPUs; default `1`
- `train.max_iter`: Maximum number of training iterations; default `50000`
- `train.optim.lr`: Learning rate; default `1e-5`
- `dataset.train_dataset.batch_size`: Batch size per GPU; default `4`
LoRA Fine-Tuning#
Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning by adding low-rank adapter matrices to the attention and multilayer perceptron (MLP) layers of the visual encoder and Q-Former.
LoRA requires setting `model.network.visual_encoder.transformer_engine` to `false` because the Parameter-Efficient Fine-Tuning (PEFT) library cannot inject adapters into Transformer Engine layers.
tao model cosmos-embed1 train \
-e cosmos_embed1/configs/experiment_specs/finetune_224p_lora.yaml \
results_dir=/results/my_lora_experiment \
model.pretrained_model_path=/data/checkpoints/finetune_224p.pth \
dataset.train_dataset.metadata=/data/train.json \
dataset.train_dataset.data_root=/data/videos
LoRA-specific arguments:
- `model.lora.enabled`: Whether to enable LoRA; default `false`
- `model.lora.lora_rank`: Rank of the low-rank matrices; default `8`
- `model.lora.lora_alpha`: Scaling factor for LoRA; default `16`
Multi-GPU Training#
To run training on multiple GPUs, set the train.num_gpus parameter:
tao model cosmos-embed1 train \
-e cosmos_embed1/configs/experiment_specs/finetune_224p.yaml \
results_dir=/results/my_experiment \
train.num_gpus=4
To use all available GPUs automatically, set train.num_gpus=-1.
Resuming Training#
To resume training from a checkpoint, set the
train.resume_training_checkpoint_path parameter:
tao model cosmos-embed1 train \
-e cosmos_embed1/configs/experiment_specs/finetune_224p.yaml \
results_dir=/results/my_experiment \
train.resume_training_checkpoint_path=/results/my_experiment/train/cosmos_embed1_model_latest.pth
Training Outputs#
Training saves the following files to <results_dir>/train/:
- `cosmos_embed1_model_latest.pth`: Most recent checkpoint
- `cosmos_embed1_model_<iter>.pth`: Periodic checkpoints saved every `checkpoint_iter` iterations
- `experiment.yaml`: Resolved configuration snapshot for reproducibility
- `console.log`, `debug.log`: Training logs
- `wandb/`: Weights and Biases run data (when Weights and Biases logging is enabled)
Evaluating Cosmos-Embed1#
Evaluation computes top-K hit rate classification metrics and optionally generates Uniform Manifold Approximation and Projection (UMAP) embedding visualizations on a test dataset.
To Evaluate the Model#
tao model cosmos-embed1 evaluate \
-e cosmos_embed1/configs/experiment_specs/evaluate_224p.yaml \
results_dir=/results/my_experiment \
evaluate.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
dataset.test_dataset.metadata=/data/test.json \
dataset.test_dataset.data_root=/data/videos
Required arguments:
- `-e`: Path to the experiment specification file
- `evaluate.checkpoint`: Path to the model checkpoint
Optional arguments:
- `evaluate.callbacks.topk_classification`: Whether to compute top-K hit rate metrics; default `true`
- `evaluate.callbacks.embedding_visualization`: Whether to generate UMAP embedding visualizations; default `false`
- `evaluate.callbacks.top_k_values`: List of K values for top-K hit rate computation; default `[1, 3, 5, 10]`
- `evaluate.num_gpus`: Number of GPUs for evaluation; default `1`
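Conceptually, the top-K hit rate checks whether each video's ground-truth class text is among its K nearest text embeddings. A simplified sketch (the `topk_hit_rate` helper and its synthetic inputs are illustrative; the evaluation callback's exact computation may differ):

```python
import numpy as np

def topk_hit_rate(video_emb, text_emb, labels, k):
    """Fraction of videos whose true class is among the K most similar
    text embeddings. `video_emb` is (N, D), `text_emb` is (C, D) with
    one row per class, and `labels[i]` is video i's class index."""
    # Normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                           # (num_videos, num_classes)
    topk = np.argsort(-sim, axis=1)[:, :k]  # K most similar classes per video
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return sum(hits) / len(hits)

# Four classes with orthogonal "text" embeddings; each "video" embedding
# is a slightly perturbed copy of its class embedding.
rng = np.random.default_rng(0)
text = np.eye(4, 8)
video = text + 0.01 * rng.normal(size=(4, 8))
print(topk_hit_rate(video, text, labels=[0, 1, 2, 3], k=1))  # -> 1.0
```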
Caching Embeddings#
For repeated evaluations against the same dataset, you can cache the model embeddings to a pickle file and reload them on subsequent runs, skipping model inference:
# First run: generate and save embeddings
tao model cosmos-embed1 evaluate -e evaluate_224p.yaml \
evaluate.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
evaluate.save_dataset_pkl=/results/evaluate/embeddings.pkl \
dataset.test_dataset.metadata=/data/test.json
# Subsequent runs: load cached embeddings (skips model inference)
tao model cosmos-embed1 evaluate -e evaluate_224p.yaml \
evaluate.load_dataset_pkl=/results/evaluate/embeddings.pkl
Running Inference with Cosmos-Embed1#
Inference performs a top-K similarity search against a video corpus using text or video queries. The model encodes all videos in the search database and returns the K most similar videos to each query.
To Run Text-to-Video Search#
tao model cosmos-embed1 inference \
-e cosmos_embed1/configs/experiment_specs/inference_224p.yaml \
results_dir=/results/my_experiment \
inference.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
dataset.inference_dataset.metadata=/data/search_database.json \
dataset.inference_dataset.data_root=/data/videos \
'inference.query.input_texts=["a car crash", "flooding on highway"]'
To Run Video-to-Video Search#
tao model cosmos-embed1 inference \
-e cosmos_embed1/configs/experiment_specs/inference_224p.yaml \
inference.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
dataset.inference_dataset.metadata=/data/search_database.json \
'inference.query.input_videos=["/path/to/query1.mp4", "/path/to/query2.mp4"]'
Optional arguments:
- `inference.k`: Number of nearest-neighbor results to return per query; default `5`
- `inference.save_dataset_pkl`: Path to save the encoded search database embeddings for reuse in subsequent searches
- `inference.load_dataset_pkl`: Path to load previously cached search database embeddings, skipping model inference on the corpus
Caching the Search Database#
For repeated searches against the same video corpus, cache the encoded search database to avoid re-running model inference:
# Build and cache the search database
tao model cosmos-embed1 inference -e inference_224p.yaml \
inference.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
dataset.inference_dataset.metadata=/data/videos.json \
inference.save_dataset_pkl=/results/inference/db_embeddings.pkl \
'inference.query.input_texts=["flooding"]'
# Fast subsequent searches using the cached database
tao model cosmos-embed1 inference -e inference_224p.yaml \
inference.load_dataset_pkl=/results/inference/db_embeddings.pkl \
'inference.query.input_texts=["a car crash", "normal traffic"]'
Exporting Cosmos-Embed1#
The export task converts a trained Cosmos-Embed1 checkpoint to Open Neural Network Exchange (ONNX) format for deployment, or to HuggingFace format for sharing and downstream use.
ONNX Export#
Cosmos-Embed1 supports exporting the video encoder, text encoder, or both encoders together:
# Export the video encoder (default)
tao model cosmos-embed1 export \
-e cosmos_embed1/configs/experiment_specs/export_onnx_224p.yaml \
export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
export.mode=video
# Export the text encoder
tao model cosmos-embed1 export \
-e cosmos_embed1/configs/experiment_specs/export_onnx_224p.yaml \
export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
export.mode=text
# Export both encoders in a single ONNX graph
tao model cosmos-embed1 export \
-e cosmos_embed1/configs/experiment_specs/export_onnx_224p.yaml \
export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
export.mode=combined
Required arguments:
export.checkpoint: Path to the model checkpoint
Optional arguments:
- `export.mode`: ONNX export mode; valid options: `video`, `text`, `combined`; default `video`
- `export.onnx_file`: Output ONNX file path; when not set, the path is auto-derived from the checkpoint path and mode
- `export.batch_size`: Batch size for export; set to `-1` for a dynamic batch dimension; default `1`
- `export.opset_version`: ONNX opset version; default `17`
- `export.simplify`: Whether to apply `onnxsim` simplification after export; default `false`
HuggingFace Export#
To export a trained checkpoint to HuggingFace format (sharded safetensors and a
config.json):
tao model cosmos-embed1 export \
-e cosmos_embed1/configs/experiment_specs/export_hf_224p.yaml \
export.checkpoint=/results/my_experiment/train/cosmos_embed1_model_latest.pth \
export.hf_output_dir=/results/my_experiment/hf_export
Creating a Configuration File#
Cosmos-Embed1 uses YAML experiment specification files that follow the ExperimentConfig
dataclass schema. You can start from one of the provided experiment specification files in
cosmos_embed1/configs/experiment_specs/ and override individual fields on the command line
using dot-notation:
tao model cosmos-embed1 train -e finetune_224p.yaml \
results_dir=/my/output \
train.max_iter=10000 \
train.num_gpus=4 \
dataset.train_dataset.batch_size=8
The following sections describe all available configuration parameters.
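Conceptually, each dot-notation override is a nested dictionary update on the experiment spec. A toy model of the behavior (illustrative only; the launcher's real parser also handles lists, quoting, and type validation):

```python
def apply_override(config, assignment):
    """Apply one `a.b.c=value` override to a nested config dict."""
    key, _, raw = assignment.partition("=")
    node = config
    *parents, leaf = key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    # Crude literal handling: try int, then float, then bool, else string.
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = {"true": True, "false": False}.get(raw.lower(), raw)
    node[leaf] = value
    return config

cfg = {"train": {"max_iter": 50000}}
apply_override(cfg, "train.max_iter=10000")
apply_override(cfg, "dataset.train_dataset.batch_size=8")
print(cfg)
# -> {'train': {'max_iter': 10000}, 'dataset': {'train_dataset': {'batch_size': 8}}}
```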
Experiment Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `wandb` | WandbConfig | Weights and Biases logging configuration. Auto-disables if no API key is found. | |
| `model` | ModelConfig | Model configuration. | |
| `dataset` | DatasetConfig | Dataset configuration. | |
| `train` | TrainConfig | Training experiment configuration. | |
| `evaluate` | EvaluateConfig | Evaluation experiment configuration. | |
| `inference` | InferenceConfig | Inference experiment configuration. | |
| `export` | ExportConfig | ONNX export experiment configuration. | |
| `results_dir` | str | Directory to save results, checkpoints, and logs. | `/results` |
| `encryption_key` | Optional[str] | Encryption key for model export (TAO compatibility). | None |
| | str | Model name identifier. | `cosmos_embed1` |
Weights and Biases Logging Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Weights and Biases logging. | False |
| | str | Weights and Biases project name. | `cosmos_embed1` |
| | str | Run group for organizing related runs in the dashboard. | `""` |
| | str | Run name. An empty string auto-generates a name. | `""` |
| | list[str] | List of tags for filtering runs in the dashboard. | `[]` |
| | bool | Save a copy of the training code to Weights and Biases. | False |
| | str | API key. If empty, falls back to the `WANDB_API_KEY` environment variable. | `""` |
Model Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `network` | NetworkConfig | Network architecture configuration. | |
| `pretrained_model_path` | Optional[str] | Path to a pretrained checkpoint. Accepts a local file path (`.pth`, `.safetensors`) or a HuggingFace repo ID. | None |
| | bool | Strict state_dict matching when loading pretrained weights. Missing or unexpected keys raise an error when True. | True |
| | Precision | Training precision. Valid options: `bf16`, `fp16`, `fp32`. | `bf16` |
| | list[int] | Data-loader input resolution [H, W]. Distinct from `model.network.spatial_resolution`. | [224, 224] |
| | FSDPConfig | Fully Sharded Data Parallel configuration for distributed training. | |
| | int | Legacy FSDP shard size used by the model loader. | 8 |
| `lora` | LoRAConfig | LoRA configuration. When enabled, wraps the network with PEFT adapters. Requires `transformer_engine=False`. | |
Network Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `visual_encoder` | VisualEncoderConfig | Visual encoder configuration. | |
| | int | Output embedding dimension for video-text alignment. | 256 |
| | int | Number of learnable query tokens in the Q-Former. | 32 |
| | int | Maximum text token sequence length. | 128 |
| | int | Number of input video frames. | 8 |
| `spatial_resolution` | list[int] | Spatial resolution [H, W] for input video frames. | [224, 224] |
| | TemporalEncodingType | Type of temporal encoding. | `neighboring_token_propagation` |
| | ContrastiveType | Contrastive loss type. Valid options: `clip`, `siglip`. | `clip` |
| | Optional[str] | Path or HuggingFace repo ID for the Q-Former pretrained checkpoint. | None |
| | QueryPoolingType | Query pooling method after the Q-Former. Valid options: `avg`, `attention`, `identity`. | `avg` |
| | bool | Load pretrained BERT weights for the text encoder. | False |
| | bool | Load pretrained weights for the visual encoder from S3 or HuggingFace. | False |
| | int | Number of held-out frames for certain training strategies. | 0 |
Visual Encoder Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | VisualEncoderType | Visual encoder type. | `eva_vit_g` |
| | int | Input image size for the visual encoder. | 224 |
| | bool | Load pretrained visual encoder weights from S3. | False |
| | bool | Use FP8 precision with Transformer Engine (requires `transformer_engine=true`). | False |
| `transformer_engine` | bool | Use Transformer Engine for optimized attention computation. | True |
| | bool | Use gradient checkpointing for activations to reduce memory usage. | False |
| | bool | Use gradient checkpointing for attention (requires `transformer_engine=true`). | False |
FSDP Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Fully Sharded Data Parallel. | False |
| | Optional[int] | FSDP shard group size. None auto-selects one shard per node. | None |
| | Optional[int] | FSDP replica group size. None auto-selects. | None |
Dataset Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `train_dataset` | SingleDatasetConfig | Training dataset configuration. | |
| `val_dataset` | SingleDatasetConfig | Validation dataset configuration (used during training validation). | |
| `test_dataset` | SingleDatasetConfig | Test/evaluation dataset configuration (used by the evaluate action). | |
| `inference_dataset` | SingleDatasetConfig | Inference search database configuration (used by the inference action). | |
Single Dataset Configuration#
The following parameters apply to each of the four dataset splits: `train_dataset`, `val_dataset`, `test_dataset`, and `inference_dataset`.
| Parameter | Type | Description | Default |
|---|---|---|---|
| `dataset_type` | DatasetType | Dataset class to use. Valid options: `mock`, `vad_r1`, `vad_r1_chunks`, `msrvtt`, `kinetics`, `http`. | `mock` |
| `metadata` | Optional[str] | Path to the metadata JSON or JSONL file. | None |
| `data_root` | Optional[str] | Root directory for video data. | None |
| | int | Number of video frames to sample from each video. | 8 |
| | list[int] | Video frame resolution [H, W]. | [224, 224] |
| `batch_size` | int | Batch size per GPU. | 4 |
| | int | Number of dataloader worker processes. | 4 |
| | bool | Drop the last incomplete batch when the dataset size is not divisible by `batch_size`. | True |
| | int | Number of batches to prefetch per worker process. | 2 |
| | bool | Pin memory buffers for faster GPU transfer. | True |
| | Optional[str] | Split filter for VadR1 datasets, e.g., `train`, `test`. None means no filtering. | None |
| | bool | When `caption_field` is a list, randomly sample one field per sample instead of always using the first. | False |
| | dict[str, str] | Remap video file paths, e.g., `{"/old/path/": "/new/path/"}`. | `{}` |
| | bool | Skip dataset entries whose video files are missing. | True |
| `caption_field` | Any | Metadata field(s) to use as captions. String or list of strings, e.g., `anomaly_type`. | `anomaly_type` |
| | Optional[str] | Glob pattern for video files used by MSRVTTDataset and KineticsDataset. | None |
| | dict[str, int] | Mapping from caption text to integer label ID. | `{}` |
| `chunk_size_sec` | float | Duration of each temporal chunk in seconds (VadR1ChunksDataset only). | 5.0 |
| | bool | When True, all normal (non-anomaly) samples share a single label ID instead of per-class labels. | True |
Dataset Format Reference#
| `dataset_type` | Metadata Format | Entry Schema | Required Config Fields |
|---|---|---|---|
| `mock` | None | No metadata file needed; generates random frames. | None |
| `vad_r1` | JSON or JSONL | Each entry: `path`, `anomaly_type`, `split`, frame/timing fields, and P2C-CoT fields (`what`, `when`, `where`, `why`, `how`). | `metadata` |
| `vad_r1_chunks` | JSON or JSONL | Each entry: `video_path`, `anomaly_type`, and a `chunks` array of temporal segments. | `metadata` |
| `msrvtt` | JSON with video/caption pairs | Each entry: a video identifier and its caption(s). | `metadata` |
| `kinetics` | CSV with youtube_id and label | Each row: a `youtube_id` and an action label. | `metadata` |
| `http` | JSON or JSONL | Each entry: an HTTP/HTTPS video URL and a caption field. | `metadata` |
Training Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `optim` | OptimConfig | Optimizer configuration. | |
| | LossWeightsConfig | Per-loss weight configuration. | |
| | int | Random seed for reproducibility. | 1234 |
| `max_iter` | int | Maximum number of training iterations. | 50000 |
| | int | Number of nodes for distributed training. | 1 |
| `num_gpus` | int | Number of GPUs per node. Use -1 to auto-detect all available GPUs, 0 for CPU only. | 1 |
| | list[int] | List of GPU device IDs to use. Overrides `num_gpus` for device selection. | [0] |
| | int | Frequency of validation runs, in iterations. | 1000 |
| `checkpoint_iter` | int | Frequency of checkpoint saves, in iterations. | 1000 |
| | float | Gradient clipping norm. Set to 0.0 to disable gradient clipping. | 0.0 |
| | Precision | Training precision. Valid options: `bf16`, `fp16`, `fp32`. | `bf16` |
| `resume_training_checkpoint_path` | Optional[str] | Path to a checkpoint to resume training from. | None |
| `callbacks` | dict[str, Any] | Dict mapping callback name to parameter overrides. Keys must match `CALLBACK_REGISTRY`. | {wandb, clamp_logit_scale, …} |
| | Optional[int] | Maximum number of validation batches per GPU. None runs the full validation set. | None |
| | bool | Freeze the visual encoder weights during training. | True |
| | bool | Enable the captioning loss during training. | True |
| | bool | Enable the text matching loss during training. | False |
| | EMAConfig | Exponential Moving Average configuration. | |
| | bool | Enable spectral reparameterization. | False |
| | DAMPConfig | DAMP (Decoupled Attention and Momentum Path) training technique configuration. | |
| | bool | Restore optimizer and scheduler state when resuming training. | False |
| | bool | Strict state_dict matching when resuming from a checkpoint. | False |
Training Callbacks#
The `callbacks` field in TrainConfig is a dict that maps callback names to parameter overrides. Each key must match an entry in the `CALLBACK_REGISTRY`. To disable a callback, remove its key from the dict. To customize a callback, set the key to a dict of parameter overrides:
train:
callbacks:
gradient_clip:
clip_norm: 5.0
iter_speed:
every_n: 100
grad_norm_monitor:
every_n: 200
verbose: true
# Add optional callbacks not included by default:
validation_eval: {}
The default callbacks and their parameters are listed below. The
validation_eval callback is not included by default but can be added
to enable evaluation metrics during training validation.
| Callback | Default Parameters | Description |
|---|---|---|
| `wandb` | `{}` | Logs training metrics to Weights and Biases. |
| `clamp_logit_scale` | `{}` | Clamps the logit scale parameter to prevent instability. |
| `logit_parameters_monitor` | `{}` | Logs logit scale and bias parameters. |
| `iter_speed` | `every_n: 50, save_s3: False` | Logs iteration throughput (samples/sec) every N iterations. |
| `gradient_clip` | `clip_norm: 3.0` | Clips gradients to a maximum L2 norm. |
| `grad_norm_monitor` | `every_n: 500, verbose: False` | Logs gradient norms every N iterations. |
| `spectral_norm_monitor` | `every_n: 1000, verbose: True` | Logs spectral norms of weight matrices every N iterations. |
| `ema` | `{}` | Updates the Exponential Moving Average model shadow weights. |
| `log_losses` | `every_n: 50, verbose: True` | Logs all loss components every N iterations. |
| `text_frames_visualizer` | `every_n: 500` | Logs video frame and text caption pairs to Weights and Biases. |
| `pca_feature_map_visualizer` | `every_n: 500` | Logs PCA-projected feature map visualizations to Weights and Biases. |
| `validation_eval` | `{}` | Runs full evaluation metrics during training validation. Not included by default; add to enable. |
Optimizer Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | OptimizerType | Optimizer type. Valid options: `adamw`, `fused_adamw`, `adam`, `sgd`. | `adamw` |
| `lr` | float | Learning rate. | 1e-05 |
| | float | Weight decay coefficient. | 1e-05 |
| | list[float] | Adam and AdamW beta coefficients. | [0.9, 0.98] |
| | int | Number of warmup steps for the learning rate scheduler. | 1000 |
| | LRPolicy | Learning rate schedule policy. Valid options: `cosine`, `linear`, `constant`. | `cosine` |
| | int | Number of iterations over which to decay the learning rate (cosine scheduler). | 50000 |
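With the cosine policy, the learning rate warms up linearly over the warmup steps and then decays toward zero over the decay horizon. One common formulation, sketched below with the table's default values (illustrative; the toolkit's scheduler may differ in details such as a minimum-lr floor):

```python
import math

def lr_at(step, base_lr=1e-5, warmup_steps=1000, decay_steps=50000):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    t = min(step - warmup_steps, decay_steps - warmup_steps)
    progress = t / (decay_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

print(lr_at(0))      # 0.0 at the start of warmup
print(lr_at(1000))   # peak lr (1e-05) at the end of warmup
print(lr_at(50000))  # fully decayed to 0.0
```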
Loss Weights Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | float | Weight for the contrastive loss term. | 1.0 |
| | float | Weight for the captioning loss term. | 1.0 |
| | float | Weight for the text matching loss term. | 1.0 |
LoRA Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `enabled` | bool | Enable LoRA fine-tuning. | False |
| `lora_rank` | int | Rank of the low-rank adapter matrices. Higher rank means more trainable parameters. | 8 |
| `lora_alpha` | int | Alpha scaling factor for LoRA. Typically set to 2× `lora_rank`. | 16 |
| | float | Dropout probability applied to LoRA layers. | 0.1 |
| | LoraBias | Bias handling for LoRA. Valid options: `none`, `all`, `lora_only`. | `none` |
| | bool | Use Rank-Stabilized LoRA for more stable training at higher ranks. | False |
| | bool | Use DoRA (Weight-Decomposed Low-Rank Adaptation). | False |
| | Optional[list[str]] | Module name patterns to apply LoRA to. | `["qkv", "fc1", "fc2", "attn.proj", "query", "value", "key", "dense", "vision_proj", "text_proj", "itm_proj"]` |
| | Optional[list[str]] | Modules to keep fully trainable (bypassing LoRA). | `["temporal_encoding", "query_pooling"]` |
EMA Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Exponential Moving Average weight tracking. | False |
| | float | EMA decay rate. | 0.9999 |
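EMA maintains a shadow copy of the weights, updated each step as `shadow ← decay · shadow + (1 − decay) · params`. A minimal sketch on plain floats (the standard EMA formulation; a decay of 0.5 is used here only to make the numbers visible, since the default 0.9999 moves far more slowly):

```python
def ema_update(shadow, params, decay=0.9999):
    """One EMA step over a flat list of weights."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0, 0.0]
params = [1.0, 2.0]
for _ in range(3):
    shadow = ema_update(shadow, params, decay=0.5)
print(shadow)  # -> [0.875, 1.75]
```

The shadow weights converge toward the live parameters; evaluating with them smooths out step-to-step noise in training.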
DAMP Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable DAMP. | False |
| | float | DAMP beta coefficient. | 0.1 |
| | DAMPMode | DAMP mode. Valid options: `const`, `dynamic`. | `const` |
Evaluation Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | Optional[str] | Path to the model checkpoint for evaluation. | None |
| | int | Maximum number of validation batches to run. -1 runs all batches. | -1 |
| `num_gpus` | int | Number of GPUs for evaluation. | 1 |
| `callbacks` | ValidationEvalConfig | Validation evaluation callback configuration. | |
| `load_dataset_pkl` | Optional[str] | Path to load pre-computed eval embeddings from. When set and the file exists, model inference is skipped. | None |
| `save_dataset_pkl` | Optional[str] | Path to save generated eval embeddings to. When set, embeddings are saved after generation (rank 0 only). | None |
Validation Evaluation Callbacks Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `topk_classification` | bool | Enable top-K hit rate classification metrics. | True |
| `embedding_visualization` | bool | Enable UMAP embedding visualization. | False |
| `top_k_values` | list[int] | List of K values for top-K hit rate computation. | [1, 3, 5, 10] |
| | int | Maximum number of samples to use during evaluation. | 2000 |
Inference Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | Optional[str] | Path to the model checkpoint for inference. | None |
| `query` | QueryConfig | Query inputs (text and/or video) for similarity search. | |
| `num_gpus` | int | Number of GPUs for inference. | 1 |
| `k` | int | Number of nearest-neighbor results to return per query. | 5 |
| `load_dataset_pkl` | Optional[str] | Path to load pre-computed search database embeddings from. When set and the file exists, model inference is skipped. | None |
| `save_dataset_pkl` | Optional[str] | Path to save generated search database embeddings to. When set, embeddings are saved after generation. | None |
Query Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `input_videos` | list[str] | List of video file paths to use as queries. | `[]` |
| `input_texts` | list[str] | List of text strings to use as queries. | `[]` |
Export Configuration#
| Parameter | Type | Description | Default |
|---|---|---|---|
| `checkpoint` | Optional[str] | Path to the model checkpoint for export. | None |
| `onnx_file` | Optional[str] | Output ONNX file path. If None, the path is auto-derived from the checkpoint path and mode. | None |
| `mode` | ExportMode | Export mode. Valid options: `video`, `text`, `combined`, `huggingface`. | `video` |
| `opset_version` | int | ONNX opset version. | 17 |
| `batch_size` | int | Batch size for export. Set to -1 for a dynamic batch dimension. | 1 |
| | bool | Run export on CPU instead of GPU. | False |
| | bool | Print verbose ONNX export information. | False |
| `simplify` | bool | Apply `onnxsim` simplification after export. | False |
| `hf_output_dir` | Optional[str] | Output directory for HuggingFace export. If None, auto-derived from the checkpoint path. Only used when `mode=huggingface`. | None |