ExperimentConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | WandbConfig | Weights and Biases logging configuration. Auto-disables if no API key is found. | |
| | ModelConfig | Model configuration. | |
| | DatasetConfig | Dataset configuration. | |
| | TrainConfig | Training experiment configuration. | |
| | EvaluateConfig | Evaluation experiment configuration. | |
| | InferenceConfig | Inference experiment configuration. | |
| | ExportConfig | ONNX export experiment configuration. | |
| | str | Directory to save results, checkpoints, and logs. | "/results" |
| | Optional[str] | Encryption key for model export (TAO compatibility). | None |
| | str | Model name identifier. | "cosmos_embed1" |

WandbConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Weights and Biases logging. | False |
| | str | Weights and Biases project name. | "cosmos_embed1" |
| | str | Run group for organizing related runs in the dashboard. | "" |
| | str | Run name. Empty string auto-generates a name. | "" |
| | list[str] | List of tags for filtering runs in the dashboard. | [] |
| | bool | Save a copy of the training code to Weights and Biases. | False |
| | str | API key. If empty, falls back to the WANDB_API_KEY env var. | "" |

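
The API-key fallback and the auto-disable behavior described for WandbConfig can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names `resolve_wandb_api_key` and `wandb_enabled` are hypothetical, and only the `WANDB_API_KEY` environment variable comes from the reference above.

```python
import os
from typing import Optional


def resolve_wandb_api_key(config_key: str = "") -> Optional[str]:
    """Return the configured key, else WANDB_API_KEY, else None (hypothetical helper)."""
    if config_key:
        return config_key
    # Fall back to the environment variable named in the WandbConfig table.
    return os.environ.get("WANDB_API_KEY") or None


def wandb_enabled(enable: bool, config_key: str = "") -> bool:
    """Logging stays on only if it was requested and some API key can be found."""
    return enable and resolve_wandb_api_key(config_key) is not None
```

With no key in the config and no environment variable set, `wandb_enabled(True)` returns `False`, which mirrors the "auto-disables if no API key is found" note.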
ModelConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | NetworkConfig | Network architecture configuration. | |
| | Optional[str] | Path to a pretrained checkpoint. Accepts a local file path (.pth, .safetensors) or a HuggingFace repo ID. | None |
| | bool | Strict state_dict matching when loading pretrained weights. Missing or unexpected keys raise an error when True. | True |
| | Precision | Training precision. Valid options: "bf16", "fp16", "fp32". | "bf16" |
| | list[int] | Data-loader input resolution [H, W]. Distinct from model.network.spatial_resolution. | [224, 224] |
| | FSDPConfig | Fully Sharded Data Parallel configuration for distributed training. | |
| | int | Legacy FSDP shard size used by the model loader. | 8 |
| | LoRAConfig | LoRA configuration. When enabled, wraps the network with PEFT adapters. Requires transformer_engine=False. | |

DatasetConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | SingleDatasetConfig | Training dataset configuration. | |
| | SingleDatasetConfig | Validation dataset configuration (used during training validation). | |
| | SingleDatasetConfig | Test/evaluation dataset configuration (used by the evaluate action). | |
| | SingleDatasetConfig | Inference search database configuration (used by the inference action). | |

TrainConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | OptimConfig | Optimizer configuration. | |
| | LossWeightsConfig | Per-loss weight configuration. | |
| | int | Random seed for reproducibility. | 1234 |
| | int | Maximum number of training iterations. | 50000 |
| | int | Number of nodes for distributed training. | 1 |
| | int | Number of GPUs per node. Use -1 to auto-detect all available GPUs, 0 for CPU only. | 1 |
| | list[int] | List of GPU device IDs to use. Overrides num_gpus for device selection. | [0] |
| | int | Frequency of validation runs, in iterations. | 1000 |
| | int | Frequency of checkpoint saves, in iterations. | 1000 |
| | float | Gradient clipping norm. Set to 0.0 to disable gradient clipping. | 0.0 |
| | Precision | Training precision. Valid options: "bf16", "fp16", "fp32". | "bf16" |
| | Optional[str] | Path to a checkpoint to resume training from. | None |
| | dict[str, Any] | Dict mapping callback name to parameter overrides. Keys must match CALLBACK_REGISTRY. | {wandb, clamp_logit_scale, …} |
| | Optional[int] | Maximum number of validation batches per GPU. None runs the full validation set. | None |
| | bool | Freeze the visual encoder weights during training. | True |
| | bool | Enable the captioning loss during training. | True |
| | bool | Enable the text matching loss during training. | False |
| | EMAConfig | Exponential Moving Average configuration. | |
| | bool | Enable spectral reparameterization. | False |
| | DAMPConfig | DAMP (Decoupled Attention and Momentum Path) training technique configuration. | |
| | bool | Restore optimizer and scheduler state when resuming training. | False |
| | bool | Strict state_dict matching when resuming from a checkpoint. | False |

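
To make the gradient clipping norm concrete, here is a minimal pure-Python sketch of clipping by global L2 norm, where a value of 0.0 disables clipping as the table states. The function name `clip_by_global_norm` is hypothetical, and gradients are modeled as a flat list of floats; the actual trainer presumably works on tensors.

```python
import math
from typing import List, Sequence


def clip_by_global_norm(grads: Sequence[float], max_norm: float) -> List[float]:
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    if max_norm <= 0.0:
        # A clipping norm of 0.0 means clipping is disabled.
        return list(grads)
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

For example, gradients `[3.0, 4.0]` have norm 5.0, so a clipping norm of 2.5 scales them by 0.5.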
EvaluateConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | Optional[str] | Path to the model checkpoint for evaluation. | None |
| | int | Maximum number of validation batches to run. -1 runs all batches. | -1 |
| | int | Number of GPUs for evaluation. | 1 |
| | ValidationEvalConfig | Validation evaluation callback configuration. | |
| | Optional[str] | Path to load pre-computed eval embeddings from. When set and the file exists, model inference is skipped. | None |
| | Optional[str] | Path to save generated eval embeddings to. When set, embeddings are saved after generation (rank 0 only). | None |

InferenceConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | Optional[str] | Path to the model checkpoint for inference. | None |
| | QueryConfig | Query inputs (text and/or video) for similarity search. | |
| | int | Number of GPUs for inference. | 1 |
| | int | Number of nearest-neighbor results to return per query. | 5 |
| | Optional[str] | Path to load pre-computed search database embeddings from. When set and the file exists, model inference is skipped. | None |
| | Optional[str] | Path to save generated search database embeddings to. When set, embeddings are saved after generation. | None |

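
The nearest-neighbor lookup that the inference action performs over the search database can be illustrated with a small sketch. It assumes cosine similarity as the ranking metric, which the reference does not state explicitly, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float],
          database: Sequence[Sequence[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k database entries closest to the query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(database)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The default `k=5` matches the default number of results per query in the table; a pre-computed embeddings file would simply supply `database` without running the model.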
ExportConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | Optional[str] | Path to the model checkpoint for export. | None |
| | Optional[str] | Output ONNX file path. If None, the path is auto-derived from the checkpoint path and mode. | None |
| | ExportMode | Export mode. Valid options: "video", "text", "combined", "huggingface". | "video" |
| | int | ONNX opset version. | 17 |
| | int | Batch size for export. Set to -1 for a dynamic batch dimension. | 1 |
| | bool | Run export on CPU instead of GPU. | False |
| | bool | Print verbose ONNX export information. | False |
| | bool | Apply onnxsim simplification after export. | False |
| | Optional[str] | Output directory for HuggingFace export. If None, auto-derived from checkpoint path. Only used when mode=huggingface. | None |

NetworkConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | VisualEncoderConfig | Visual encoder configuration. | |
| | int | Output embedding dimension for video-text alignment. | 256 |
| | int | Number of learnable query tokens in the Q-Former. | 32 |
| | int | Maximum text token sequence length. | 128 |
| | int | Number of input video frames. | 8 |
| | list[int] | Spatial resolution [H, W] for input video frames. | [224, 224] |
| | TemporalEncodingType | Type of temporal encoding. | "neighboring_token_propagation" |
| | ContrastiveType | Contrastive loss type. Valid options: "clip", "siglip". | "clip" |
| | Optional[str] | Path or HuggingFace repo ID for the Q-Former pretrained checkpoint. | None |
| | QueryPoolingType | Query pooling method after the Q-Former. Valid options: "avg", "attention", "identity". | "avg" |
| | bool | Load pretrained BERT weights for the text encoder. | False |
| | bool | Load pretrained weights for the visual encoder from S3 or HuggingFace. | False |
| | int | Number of held-out frames for certain training strategies. | 0 |

FSDPConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Fully Sharded Data Parallel. | False |
| | Optional[int] | FSDP shard group size. None auto-selects one shard per node. | None |
| | Optional[int] | FSDP replica group size. None auto-selects. | None |

SingleDatasetConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | DatasetType | Dataset class to use. Valid options: "mock", "vad_r1", "vad_r1_chunks", "msrvtt", "kinetics", "http". | "mock" |
| | Optional[str] | Path to the metadata JSON or JSONL file. | None |
| | Optional[str] | Root directory for video data. | None |
| | int | Number of video frames to sample from each video. | 8 |
| | list[int] | Video frame resolution [H, W]. | [224, 224] |
| | int | Batch size per GPU. | 4 |
| | int | Number of dataloader worker processes. | 4 |
| | bool | Drop the last incomplete batch when the dataset size is not divisible by batch_size. | True |
| | int | Number of batches to prefetch per worker process. | 2 |
| | bool | Pin memory buffers for faster GPU transfer. | True |
| | Optional[str] | Split filter for VadR1 datasets, e.g., "train", "test". None means no filtering. | None |
| | bool | When caption_field is a list, randomly sample one field per sample instead of always using the first. | False |
| | dict[str, str] | Remap video file paths, e.g., {"/old/path/": "/new/path/"}. | {} |
| | bool | Skip dataset entries whose video files are missing. | True |
| | Any | Metadata field(s) to use as captions. String or list of strings, e.g., "anomaly_type". | "anomaly_type" |
| | Optional[str] | Glob pattern for video files used by MSRVTTDataset and KineticsDataset. | None |
| | dict[str, int] | Mapping from caption text to integer label ID. | {} |
| | float | Duration of each temporal chunk in seconds (VadR1ChunksDataset only). | 5.0 |
| | bool | When True, all normal (non-anomaly) samples share a single label ID instead of per-class labels. | True |

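
The path-remapping option takes a dict such as {"/old/path/": "/new/path/"}. A minimal sketch of how such a mapping could be applied is shown here, assuming prefix replacement with the first matching key winning; the reference does not specify the matching rule, and `remap_video_path` is a hypothetical helper name.

```python
from typing import Dict


def remap_video_path(path: str, mapping: Dict[str, str]) -> str:
    """Replace the first matching old prefix with its new prefix (assumed semantics)."""
    for old_prefix, new_prefix in mapping.items():
        if path.startswith(old_prefix):
            return new_prefix + path[len(old_prefix):]
    # No prefix matched, so the path is returned unchanged.
    return path
```

With the default empty mapping, every path passes through untouched.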
Dataset Format Reference

| dataset_type | Metadata Format | Entry Schema | Required Config Fields |
|---|---|---|---|
| "mock" | None | No metadata file needed. Generates random frames using | |
| "vad_r1" | JSON or JSONL | Each entry: | |
| "vad_r1_chunks" | JSON or JSONL | Each entry: | |
| "msrvtt" | JSON with video/caption pairs | Each entry: | |
| "kinetics" | CSV with youtube_id and label | Each row: | |
| "http" | JSON or JSONL | Each entry: | |

Training Callbacks

| Callback | Default Parameters | Description |
|---|---|---|
| "wandb" | {} | Logs training metrics to Weights and Biases. |
| "clamp_logit_scale" | {} | Clamps the logit scale parameter to prevent instability. |
| "logit_parameters_monitor" | {} | Logs logit scale and bias parameters. |
| "iter_speed" | every_n: 50, save_s3: False | Logs iteration throughput (samples/sec) every N iterations. |
| "gradient_clip" | clip_norm: 3.0 | Clips gradients to a maximum L2 norm. |
| "grad_norm_monitor" | every_n: 500, verbose: False | Logs gradient norms every N iterations. |
| "spectral_norm_monitor" | every_n: 1000, verbose: True | Logs spectral norms of weight matrices every N iterations. |
| "ema" | {} | Updates the Exponential Moving Average model shadow weights. |
| "log_losses" | every_n: 50, verbose: True | Logs all loss components every N iterations. |
| "text_frames_visualizer" | every_n: 500 | Logs video frame and text caption pairs to Weights and Biases. |
| "pca_feature_map_visualizer" | every_n: 500 | Logs PCA-projected feature map visualizations to Weights and Biases. |
| "validation_eval" | {} | Runs full evaluation metrics during training validation. Not included by default; add to enable. |

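
Because the callbacks dict maps each callback name to parameter overrides, one plausible way to picture it is a merge of user overrides on top of the defaults in the table above. The sketch below is an assumption about that behavior, not the library's code: `DEFAULT_CALLBACK_PARAMS` is a hypothetical stand-in for CALLBACK_REGISTRY that lists only a few callbacks, and `resolve_callbacks` is a hypothetical helper.

```python
from typing import Any, Dict

# Hypothetical stand-in for CALLBACK_REGISTRY, using defaults from the table above.
DEFAULT_CALLBACK_PARAMS: Dict[str, Dict[str, Any]] = {
    "iter_speed": {"every_n": 50, "save_s3": False},
    "gradient_clip": {"clip_norm": 3.0},
    "grad_norm_monitor": {"every_n": 500, "verbose": False},
    "log_losses": {"every_n": 50, "verbose": True},
    "validation_eval": {},
}


def resolve_callbacks(overrides: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Merge per-callback overrides onto defaults; unknown names are rejected."""
    resolved: Dict[str, Dict[str, Any]] = {}
    for name, params in overrides.items():
        if name not in DEFAULT_CALLBACK_PARAMS:
            raise KeyError(f"Unknown callback: {name!r}")
        # Override values win over defaults; untouched defaults are kept.
        resolved[name] = {**DEFAULT_CALLBACK_PARAMS[name], **params}
    return resolved
```

For instance, `resolve_callbacks({"log_losses": {"every_n": 10}, "validation_eval": {}})` keeps `verbose: True` for `log_losses` and opts in to `validation_eval`, which is not part of the default set.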
OptimConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | OptimizerType | Optimizer type. Valid options: "adamw", "fused_adamw", "adam", "sgd". | "adamw" |
| | float | Learning rate. | 1e-05 |
| | float | Weight decay coefficient. | 1e-05 |
| | list[float] | Adam and AdamW beta coefficients. | [0.9, 0.98] |
| | int | Number of warmup steps for the learning rate scheduler. | 1000 |
| | LRPolicy | Learning rate schedule policy. Valid options: "cosine", "linear", "constant". | "cosine" |
| | int | Number of iterations over which to decay the learning rate (cosine scheduler). | 50000 |

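
To show how the warmup steps, the cosine policy, and the decay iterations interact, here is a sketch of a warmup-then-cosine schedule using the defaults above. It assumes linear warmup, a decay window that includes the warmup steps, and a floor of zero; the library's scheduler may differ on all three, and `lr_at` and `min_lr` are hypothetical names.

```python
import math


def lr_at(step: int,
          base_lr: float = 1e-5,
          warmup_steps: int = 1000,
          decay_iters: int = 50000,
          min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay (assumed form)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    span = max(1, decay_iters - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at the end of warmup, reaches half of its peak midway through the decay window, and stays at `min_lr` once `decay_iters` is passed.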
LossWeightsConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | float | Weight for the contrastive loss term. | 1.0 |
| | float | Weight for the captioning loss term. | 1.0 |
| | float | Weight for the text matching loss term. | 1.0 |

LoRAConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable LoRA fine-tuning. | False |
| | int | Rank of the low-rank adapter matrices. Higher rank means more trainable parameters. | 8 |
| | int | Alpha scaling factor for LoRA. Typically set to 2× lora_rank. | 16 |
| | float | Dropout probability applied to LoRA layers. | 0.1 |
| | LoraBias | Bias handling for LoRA. Valid options: "none", "all", "lora_only". | "none" |
| | bool | Use Rank-Stabilized LoRA for more stable training at higher ranks. | False |
| | bool | Use DoRA (Weight-Decomposed Low-Rank Adaptation). | False |
| | Optional[list[str]] | Module name patterns to apply LoRA to. | ["qkv", "fc1", "fc2", "attn.proj", "query", "value", "key", "dense", "vision_proj", "text_proj", "itm_proj"] |
| | Optional[list[str]] | Modules to keep fully trainable (bypassing LoRA). | ["temporal_encoding", "query_pooling"] |

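
The relationship between rank and alpha is easier to see as a number. In standard LoRA the adapter update is scaled by alpha / rank, and Rank-Stabilized LoRA scales by alpha / sqrt(rank) instead. The sketch below computes both; `lora_scaling` is a hypothetical helper, and it describes the general technique rather than this library's internals.

```python
import math


def lora_scaling(rank: int = 8, alpha: int = 16, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update: alpha/rank, or alpha/sqrt(rank) for rsLoRA."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    if use_rslora:
        return alpha / math.sqrt(rank)
    return alpha / rank
```

With the defaults (rank 8, alpha 16) the standard scaling is 2.0, which is why setting alpha to twice the rank is a common starting point: the effective scale stays constant as the rank changes.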
EMAConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable Exponential Moving Average weight tracking. | False |
| | float | EMA decay rate. | 0.9999 |

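
For reference, the usual EMA update rule is shadow = decay × shadow + (1 − decay) × weight, applied after each optimizer step. The sketch below applies it to plain lists of floats; `ema_update` is a hypothetical name, and the library's callback may add details such as warmup of the decay rate.

```python
from typing import List, Sequence


def ema_update(shadow: Sequence[float],
               weights: Sequence[float],
               decay: float = 0.9999) -> List[float]:
    """One EMA step: move each shadow weight a fraction (1 - decay) toward the live weight."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

A decay of 0.9999 moves the shadow weights by only 0.01% of the gap per step, so the average spans roughly the last ten thousand iterations.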
DAMPConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable DAMP. | False |
| | float | DAMP beta coefficient. | 0.1 |
| | DAMPMode | DAMP mode. Valid options: "const", "dynamic". | "const" |

ValidationEvalConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | bool | Enable top-K hit rate classification metrics. | True |
| | bool | Enable UMAP embedding visualization. | False |
| | list[int] | List of K values for top-K hit rate computation. | [1, 3, 5, 10] |
| | int | Maximum number of samples to use during evaluation. | 2000 |

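
A top-K hit rate is commonly defined as the fraction of samples whose true label appears among the K highest-ranked predictions. The sketch below computes it for several K values at once under that definition, which is an assumption about what this metric means here; `top_k_hit_rate` and its argument layout are hypothetical.

```python
from typing import Dict, Sequence


def top_k_hit_rate(ranked_labels: Sequence[Sequence[str]],
                   true_labels: Sequence[str],
                   ks: Sequence[int] = (1, 3, 5, 10)) -> Dict[int, float]:
    """Fraction of samples whose true label is within the first k ranked predictions.

    ranked_labels[i] holds the predicted labels for sample i, best match first.
    """
    total = len(true_labels)
    if total == 0:
        return {k: 0.0 for k in ks}
    rates: Dict[int, float] = {}
    for k in ks:
        hits = sum(1 for ranked, truth in zip(ranked_labels, true_labels)
                   if truth in ranked[:k])
        rates[k] = hits / total
    return rates
```

The hit rate can only stay equal or grow as K increases, so the default list [1, 3, 5, 10] gives a quick view of how far down the ranking correct matches tend to sit.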
QueryConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | list[str] | List of video file paths to use as queries. | [] |
| | list[str] | List of text strings to use as queries. | [] |

VisualEncoderConfig Fields

| Parameter | Type | Description | Default |
|---|---|---|---|
| | VisualEncoderType | Visual encoder type. | "eva_vit_g" |
| | int | Input image size for the visual encoder. | 224 |
| | bool | Load pretrained visual encoder weights from S3. | False |
| | bool | Use FP8 precision with Transformer Engine (requires transformer_engine=true). | False |
| | bool | Use Transformer Engine for optimized attention computation. | True |
| | bool | Use gradient checkpointing for activations to reduce memory usage. | False |
| | bool | Use gradient checkpointing for attention (requires transformer_engine=true). | False |