ExperimentConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| wandb | WandbConfig | Weights and Biases logging configuration. Auto-disables if no API key is found. | Weights and Biases Logging Configuration |
| model | ModelConfig | Model configuration. | Model Configuration |
| dataset | DatasetConfig | Dataset configuration. | Dataset Configuration |
| train | TrainConfig | Training experiment configuration. | Training Configuration |
| evaluate | EvaluateConfig | Evaluation experiment configuration. | Evaluation Configuration |
| inference | InferenceConfig | Inference experiment configuration. | Inference Configuration |
| export | ExportConfig | ONNX export experiment configuration. | Export Configuration |
| results_dir | str | Directory to save results, checkpoints, and logs. | "/results" |
| encryption_key | Optional[str] | Encryption key for model export (TAO compatibility). | None |
| model_name | str | Model name identifier. | "cosmos_embed1" |

WandbConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| enable | bool | Enable Weights and Biases logging. | False |
| project | str | Weights and Biases project name. | "cosmos_embed1" |
| group | str | Run group for organizing related runs in the dashboard. | "" |
| name | str | Run name. Empty string auto-generates a name. | "" |
| tags | list[str] | List of tags for filtering runs in the dashboard. | [] |
| save_code | bool | Save a copy of the training code to Weights and Biases. | False |
| api_key | str | API key. If empty, falls back to the WANDB_API_KEY env var. | "" |
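The key fallback described for api_key, together with the auto-disable behavior noted for the wandb field, can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names `resolve_wandb_api_key` and `wandb_is_active` are hypothetical, and the `env` parameter exists only to make the logic easy to exercise.

```python
import os
from typing import Mapping, Optional


def resolve_wandb_api_key(api_key: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured key, or fall back to WANDB_API_KEY (may be empty)."""
    env = os.environ if env is None else env
    return api_key or env.get("WANDB_API_KEY", "")


def wandb_is_active(enable: bool, api_key: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed rule: logging runs only when enabled and some API key resolves."""
    return enable and bool(resolve_wandb_api_key(api_key, env))
```

With `enable=True`, an empty `api_key`, and no WANDB_API_KEY in the environment, `wandb_is_active` returns False, which mirrors the documented auto-disable.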

ModelConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| network | NetworkConfig | Network architecture configuration. | Network Configuration |
| pretrained_model_path | Optional[str] | Path to a pretrained checkpoint. Accepts a local file path (.pth, .safetensors) or a HuggingFace repo ID. | None |
| pretrained_model_strict | bool | Strict state_dict matching when loading pretrained weights. Missing or unexpected keys raise an error when True. | True |
| precision | Precision | Training precision. Valid options: "bf16", "fp16", "fp32". | "bf16" |
| input_hw | list[int] | Data-loader input resolution [H, W]. Distinct from model.network.spatial_resolution. | [224, 224] |
| fsdp | FSDPConfig | Fully Sharded Data Parallel configuration for distributed training. | FSDP Configuration |
| fsdp_shard_size | int | Legacy FSDP shard size used by the model loader. | 8 |
| lora | LoRAConfig | LoRA configuration. When enabled, wraps the network with PEFT adapters. Requires transformer_engine=False. | LoRA Configuration |
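Because transformer_engine defaults to True in VisualEncoderConfig, enabling LoRA requires an explicit override of that field. A pre-flight check along these lines makes the constraint concrete. The function name `check_lora_compat` is hypothetical, and the dotted paths in the message are inferred from the nesting of the config tables.

```python
def check_lora_compat(lora_enabled: bool, transformer_engine: bool) -> None:
    """Raise if LoRA is enabled while Transformer Engine is still on."""
    if lora_enabled and transformer_engine:
        raise ValueError(
            "model.lora.enabled=True requires "
            "model.network.visual_encoder.transformer_engine=False"
        )
```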

DatasetConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| train_dataset | SingleDatasetConfig | Training dataset configuration. | Single Dataset Configuration |
| val_dataset | SingleDatasetConfig | Validation dataset configuration (used during training validation). | Single Dataset Configuration |
| test_dataset | SingleDatasetConfig | Test/evaluation dataset configuration (used by the evaluate action). | Single Dataset Configuration |
| inference_dataset | SingleDatasetConfig | Inference search database configuration (used by the inference action). | Single Dataset Configuration |

TrainConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| optim | OptimConfig | Optimizer configuration. | Optimizer Configuration |
| loss_weights | LossWeightsConfig | Per-loss weight configuration. | Loss Weights Configuration |
| seed | int | Random seed for reproducibility. | 1234 |
| max_iter | int | Maximum number of training iterations. | 50000 |
| num_nodes | int | Number of nodes for distributed training. | 1 |
| num_gpus | int | Number of GPUs per node. Use -1 to auto-detect all available GPUs, 0 for CPU only. | 1 |
| gpu_ids | list[int] | List of GPU device IDs to use. Overrides num_gpus for device selection. | [0] |
| validation_iter | int | Frequency of validation runs, in iterations. | 1000 |
| checkpoint_iter | int | Frequency of checkpoint saves, in iterations. | 1000 |
| clip_grad_norm | float | Gradient clipping norm. Set to 0.0 to disable gradient clipping. | 0.0 |
| precision | Precision | Training precision. Valid options: "bf16", "fp16", "fp32". | "bf16" |
| resume_training_checkpoint_path | Optional[str] | Path to a checkpoint to resume training from. | None |
| callbacks | dict[str, Any] | Dict mapping callback name to parameter overrides. Keys must match CALLBACK_REGISTRY. | {wandb, clamp_logit_scale, …} |
| max_val_iter | Optional[int] | Maximum number of validation batches per GPU. None runs the full validation set. | None |
| freeze_visual_encoder | bool | Freeze the visual encoder weights during training. | True |
| use_captioning_loss | bool | Enable the captioning loss during training. | True |
| use_text_matching_loss | bool | Enable the text matching loss during training. | False |
| ema | EMAConfig | Exponential Moving Average configuration. | EMA Configuration |
| spectral_reparam | bool | Enable spectral reparameterization. | False |
| damp | DAMPConfig | DAMP (Decoupled Attention and Momentum Path) training technique configuration. | DAMP Configuration |
| load_training_state | bool | Restore optimizer and scheduler state when resuming training. | False |
| strict_resume | bool | Strict state_dict matching when resuming from a checkpoint. | False |
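The interaction between num_gpus and gpu_ids can be read as a small decision procedure. The sketch below is one plausible interpretation, not confirmed behavior: it assumes the special values 0 and -1 are checked first and that a non-empty gpu_ids list otherwise wins. The function `resolve_devices` and its `available` argument are hypothetical.

```python
from typing import List, Optional


def resolve_devices(num_gpus: int, gpu_ids: Optional[List[int]], available: int) -> List[int]:
    """Return the GPU indices to use; an empty list means CPU only."""
    if num_gpus == 0:
        return []  # CPU only
    if num_gpus == -1:
        return list(range(available))  # auto-detect: use every visible GPU
    if gpu_ids:
        return list(gpu_ids)  # explicit IDs override the count (assumed precedence)
    return list(range(num_gpus))
```

With the documented defaults (num_gpus=1, gpu_ids=[0]) this selects device 0 only.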

EvaluateConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| checkpoint | Optional[str] | Path to the model checkpoint for evaluation. | None |
| max_val_batches | int | Maximum number of validation batches to run. -1 runs all batches. | -1 |
| num_gpus | int | Number of GPUs for evaluation. | 1 |
| callbacks | ValidationEvalConfig | Validation evaluation callback configuration. | Validation Evaluation Callbacks Configuration |
| load_dataset_pkl | Optional[str] | Path to load pre-computed eval embeddings from. When set and the file exists, model inference is skipped. | None |
| save_dataset_pkl | Optional[str] | Path to save generated eval embeddings to. When set, embeddings are saved after generation (rank 0 only). | None |
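The load_dataset_pkl and save_dataset_pkl fields describe a load-or-compute cache. A minimal sketch of that flow, assuming the file is an ordinary Python pickle (suggested by the field names but not stated) and using a hypothetical `compute` callable in place of model inference:

```python
import os
import pickle
from typing import Any, Callable, Optional


def get_embeddings(
    compute: Callable[[], Any],
    load_dataset_pkl: Optional[str] = None,
    save_dataset_pkl: Optional[str] = None,
    rank: int = 0,
) -> Any:
    """Load cached embeddings if present, otherwise compute and optionally save them."""
    # Reuse cached embeddings when the path is set and the file exists.
    if load_dataset_pkl and os.path.exists(load_dataset_pkl):
        with open(load_dataset_pkl, "rb") as f:
            return pickle.load(f)
    embeddings = compute()  # stands in for running the model
    # Only rank 0 writes, so distributed workers do not race on the file.
    if save_dataset_pkl and rank == 0:
        with open(save_dataset_pkl, "wb") as f:
            pickle.dump(embeddings, f)
    return embeddings
```

Pointing both fields at the same path gives a compute-once behavior: the first run writes the file and later runs skip inference.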

InferenceConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| checkpoint | Optional[str] | Path to the model checkpoint for inference. | None |
| query | QueryConfig | Query inputs (text and/or video) for similarity search. | Query Configuration |
| num_gpus | int | Number of GPUs for inference. | 1 |
| k | int | Number of nearest-neighbor results to return per query. | 5 |
| load_dataset_pkl | Optional[str] | Path to load pre-computed search database embeddings from. When set and the file exists, model inference is skipped. | None |
| save_dataset_pkl | Optional[str] | Path to save generated search database embeddings to. When set, embeddings are saved after generation. | None |
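To illustrate what k controls, here is a dependency-free sketch of a top-k nearest-neighbor lookup over a small embedding database. It assumes cosine similarity as the ranking metric, which the field descriptions do not confirm; `cosine` and `top_k` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], database: List[Sequence[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Return (database index, similarity) for the k most similar entries, best first."""
    scored = [(i, cosine(query, emb)) for i, emb in enumerate(database)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

If the database holds fewer than k entries, the slice simply returns all of them.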

ExportConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| checkpoint | Optional[str] | Path to the model checkpoint for export. | None |
| onnx_file | Optional[str] | Output ONNX file path. If None, the path is auto-derived from the checkpoint path and mode. | None |
| mode | ExportMode | Export mode. Valid options: "video", "text", "combined", "huggingface". | "video" |
| opset_version | int | ONNX opset version. | 17 |
| batch_size | int | Batch size for export. Set to -1 for a dynamic batch dimension. | 1 |
| on_cpu | bool | Run export on CPU instead of GPU. | False |
| verbose | bool | Print verbose ONNX export information. | False |
| simplify | bool | Apply onnxsim simplification after export. | False |
| hf_output_dir | Optional[str] | Output directory for HuggingFace export. If None, auto-derived from checkpoint path. Only used when mode=huggingface. | None |
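A batch_size of -1 typically translates into a dynamic first axis at export time. With `torch.onnx.export`, that is expressed through its `dynamic_axes` argument. The helper below only builds that mapping and is a sketch: `build_dynamic_axes` and the tensor names "video" and "embedding" are hypothetical and not taken from the exporter.

```python
from typing import Dict, List, Optional


def build_dynamic_axes(batch_size: int, tensor_names: List[str]) -> Optional[Dict[str, Dict[int, str]]]:
    """Return a dynamic_axes mapping when batch_size is -1, else None (static batch)."""
    if batch_size == -1:
        # Mark axis 0 of every named input/output as a symbolic "batch" dimension.
        return {name: {0: "batch"} for name in tensor_names}
    return None
```

The result could then be passed as `dynamic_axes=build_dynamic_axes(cfg_batch_size, ["video", "embedding"])`, where `cfg_batch_size` stands for the configured export.batch_size.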

NetworkConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| visual_encoder | VisualEncoderConfig | Visual encoder configuration. | Visual Encoder Configuration |
| embed_dim | int | Output embedding dimension for video-text alignment. | 256 |
| num_query_tokens | int | Number of learnable query tokens in the Q-Former. | 32 |
| max_txt_len | int | Maximum text token sequence length. | 128 |
| num_video_frames | int | Number of input video frames. | 8 |
| spatial_resolution | list[int] | Spatial resolution [H, W] for input video frames. | [224, 224] |
| temporal_encoding_type | TemporalEncodingType | Type of temporal encoding. | "neighboring_token_propagation" |
| contrastive_type | ContrastiveType | Contrastive loss type. Valid options: "clip", "siglip". | "clip" |
| qformer_pretrain_ckpt | Optional[str] | Path or HuggingFace repo ID for the Q-Former pretrained checkpoint. | None |
| query_pooling_type | QueryPoolingType | Query pooling method after the Q-Former. Valid options: "avg", "attention", "identity". | "avg" |
| pretrained_text_encoder | bool | Load pretrained BERT weights for the text encoder. | False |
| pretrained_visual_encoder | bool | Load pretrained weights for the visual encoder from S3 or HuggingFace. | False |
| num_heldout_frames | int | Number of held-out frames for certain training strategies. | 0 |

FSDPConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Enable Fully Sharded Data Parallel. | False |
| shard_size | Optional[int] | FSDP shard group size. None auto-selects one shard per node. | None |
| replica_size | Optional[int] | FSDP replica group size. None auto-selects. | None |

SingleDatasetConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| dataset_type | DatasetType | Dataset class to use. Valid options: "mock", "vad_r1", "vad_r1_chunks", "msrvtt", "kinetics", "http". | "mock" |
| metadata | Optional[str] | Path to the metadata JSON or JSONL file. | None |
| data_root | Optional[str] | Root directory for video data. | None |
| num_video_frames | int | Number of video frames to sample from each video. | 8 |
| resolution | list[int] | Video frame resolution [H, W]. | [224, 224] |
| batch_size | int | Batch size per GPU. | 4 |
| workers | int | Number of dataloader worker processes. | 4 |
| drop_last | bool | Drop the last incomplete batch when the dataset size is not divisible by batch_size. | True |
| prefetch_factor | int | Number of batches to prefetch per worker process. | 2 |
| pin_memory | bool | Pin memory buffers for faster GPU transfer. | True |
| split | Optional[str] | Split filter for VadR1 datasets, e.g., "train", "test". None means no filtering. | None |
| random_caption | bool | When caption_field is a list, randomly sample one field per sample instead of always using the first. | False |
| path_prefix_mapping | dict[str, str] | Remap video file paths, e.g., {"/old/path/": "/new/path/"}. | {} |
| skip_missing_files | bool | Skip dataset entries whose video files are missing. | True |
| caption_field | Any | Metadata field(s) to use as captions. String or list of strings, e.g., "anomaly_type". | "anomaly_type" |
| mp4_urls | Optional[str] | Glob pattern for video files used by MSRVTTDataset and KineticsDataset. | None |
| caption_to_label | dict[str, int] | Mapping from caption text to integer label ID. | {} |
| chunk_size_sec | float | Duration of each temporal chunk in seconds (VadR1ChunksDataset only). | 5.0 |
| shared_normal_label | bool | When True, all normal (non-anomaly) samples share a single label ID instead of per-class labels. | True |
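Two of these fields describe small per-entry transformations: path_prefix_mapping rewrites video paths, and caption_field combined with random_caption picks which metadata field supplies the caption. The sketch below shows one way those rules could work. Both function names are hypothetical, and the first-matching-prefix rule is an assumption.

```python
import random
from typing import Any, Dict, List, Optional, Union


def remap_path(path: str, path_prefix_mapping: Dict[str, str]) -> str:
    """Replace the first matching prefix; leave the path unchanged otherwise."""
    for old, new in path_prefix_mapping.items():
        if path.startswith(old):
            return new + path[len(old):]
    return path


def pick_caption(
    entry: Dict[str, Any],
    caption_field: Union[str, List[str]],
    random_caption: bool = False,
    rng: Optional[random.Random] = None,
) -> Any:
    """Use the single field, the first listed field, or a randomly chosen listed field."""
    if isinstance(caption_field, str):
        return entry[caption_field]
    rng = rng or random
    field = rng.choice(caption_field) if random_caption else caption_field[0]
    return entry[field]
```

For example, `remap_path("/old/path/a.mp4", {"/old/path/": "/new/path/"})` yields "/new/path/a.mp4", matching the mapping example in the table.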

Dataset Format Reference

| dataset_type | Metadata Format | Entry Schema | Required Config Fields |
| --- | --- | --- | --- |
| "mock" | None | No metadata file needed. Generates random frames using resolution and num_video_frames. | |
| "vad_r1" | JSON or JSONL | Each entry: path (video file path), anomaly_type (caption). Optional: split, start, end, total_frames, what, when, where, why, how. | metadata, data_root |
| "vad_r1_chunks" | JSON or JSONL | Each entry: video_path, anomaly_type. Optional: split, chunks (list of chunk dicts with start_time_sec, end_time_sec, is_anomaly). | metadata, data_root |
| "msrvtt" | JSON with video/caption pairs | Each entry: video_id, caption. Video files located via mp4_urls glob pattern. | mp4_urls, metadata |
| "kinetics" | CSV with youtube_id and label | Each row: youtube_id, label. Video files located via mp4_urls glob pattern. | mp4_urls, metadata |
| "http" | JSON or JSONL | Each entry: url (HTTP/HTTPS video URL), captions (list of caption strings). Optional: video_id, caption_to_label. | metadata |
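As a quick way to sanity-check a metadata file against the schemas above, the following sketch parses either layout and reports missing required keys. It assumes the JSON form is a top-level array of entry objects and the JSONL form is one object per line; `parse_metadata`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical names, and only two dataset types are covered.

```python
import json
from typing import Any, Dict, List, Set


def parse_metadata(text: str) -> List[Dict[str, Any]]:
    """Parse metadata given either as one JSON array or as JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


# Required per-entry keys, copied from the Entry Schema column for two dataset types.
REQUIRED_KEYS: Dict[str, Set[str]] = {
    "vad_r1": {"path", "anomaly_type"},
    "http": {"url", "captions"},
}


def missing_keys(entry: Dict[str, Any], dataset_type: str) -> Set[str]:
    """Return the required keys that the entry lacks."""
    return REQUIRED_KEYS[dataset_type] - set(entry)
```

A vad_r1 entry such as `{"path": "a.mp4", "anomaly_type": "fire"}` passes with an empty result, while an entry without anomaly_type is reported.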

Training Callbacks

| Callback | Default Parameters | Description |
| --- | --- | --- |
| "wandb" | {} | Logs training metrics to Weights and Biases. |
| "clamp_logit_scale" | {} | Clamps the logit scale parameter to prevent instability. |
| "logit_parameters_monitor" | {} | Logs logit scale and bias parameters. |
| "iter_speed" | every_n: 50, save_s3: False | Logs iteration throughput (samples/sec) every N iterations. |
| "gradient_clip" | clip_norm: 3.0 | Clips gradients to a maximum L2 norm. |
| "grad_norm_monitor" | every_n: 500, verbose: False | Logs gradient norms every N iterations. |
| "spectral_norm_monitor" | every_n: 1000, verbose: True | Logs spectral norms of weight matrices every N iterations. |
| "ema" | {} | Updates the Exponential Moving Average model shadow weights. |
| "log_losses" | every_n: 50, verbose: True | Logs all loss components every N iterations. |
| "text_frames_visualizer" | every_n: 500 | Logs video frame and text caption pairs to Weights and Biases. |
| "pca_feature_map_visualizer" | every_n: 500 | Logs PCA-projected feature map visualizations to Weights and Biases. |
| "validation_eval" | {} | Runs full evaluation metrics during training validation. Not included by default; add to enable. |
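Since train.callbacks maps callback names to parameter overrides and its keys must match CALLBACK_REGISTRY, resolving it amounts to validating names and layering overrides on the defaults. A minimal sketch of that idea follows. `DEFAULT_PARAMS` is a hypothetical stand-in for the registry that holds only a few rows from the table, and `resolve_callbacks` is not part of the documented API.

```python
from typing import Any, Dict

# Hypothetical stand-in for CALLBACK_REGISTRY: name -> default parameters from the table.
DEFAULT_PARAMS: Dict[str, Dict[str, Any]] = {
    "iter_speed": {"every_n": 50, "save_s3": False},
    "gradient_clip": {"clip_norm": 3.0},
    "log_losses": {"every_n": 50, "verbose": True},
    "validation_eval": {},
}


def resolve_callbacks(callbacks: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Reject unknown names, then merge each override dict over its defaults."""
    unknown = set(callbacks) - set(DEFAULT_PARAMS)
    if unknown:
        raise KeyError(f"Unknown callbacks: {sorted(unknown)}")
    return {name: {**DEFAULT_PARAMS[name], **overrides} for name, overrides in callbacks.items()}
```

Passing `{"gradient_clip": {"clip_norm": 1.0}, "validation_eval": {}}` would tighten clipping and opt in to the evaluation callback, under the assumptions above.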

OptimConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| optim | OptimizerType | Optimizer type. Valid options: "adamw", "fused_adamw", "adam", "sgd". | "adamw" |
| lr | float | Learning rate. | 1e-05 |
| weight_decay | float | Weight decay coefficient. | 1e-05 |
| betas | list[float] | Adam and AdamW beta coefficients. | [0.9, 0.98] |
| warmup_steps | int | Number of warmup steps for the learning rate scheduler. | 1000 |
| policy | LRPolicy | Learning rate schedule policy. Valid options: "cosine", "linear", "constant". | "cosine" |
| lr_decay_iters | int | Number of iterations over which to decay the learning rate (cosine scheduler). | 50000 |
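To see how lr, warmup_steps, policy, and lr_decay_iters might combine, the sketch below computes a learning rate per step. The exact formula used by the trainer is not documented here, so treat this as an assumption: linear warmup up to lr, then decay toward zero, with the decay span taken as lr_decay_iters minus warmup_steps. `lr_at` is a hypothetical name.

```python
import math


def lr_at(step: int, lr: float = 1e-5, warmup_steps: int = 1000,
          lr_decay_iters: int = 50000, policy: str = "cosine") -> float:
    """Learning rate at a given step: linear warmup, then the chosen decay policy."""
    if warmup_steps > 0 and step < warmup_steps:
        return lr * (step + 1) / warmup_steps
    if policy == "constant":
        return lr
    span = max(1, lr_decay_iters - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)  # 0 at end of warmup, 1 when fully decayed
    if policy == "linear":
        return lr * (1.0 - progress)
    if policy == "cosine":
        return lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown policy: {policy}")
```

Under these assumptions, keeping lr_decay_iters equal to train.max_iter (both default to 50000) makes the decay finish exactly at the last iteration.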

LossWeightsConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| contrastive_loss | float | Weight for the contrastive loss term. | 1.0 |
| captioning_loss | float | Weight for the captioning loss term. | 1.0 |
| matching_loss | float | Weight for the text matching loss term. | 1.0 |
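These weights presumably scale each loss term before summation, with train.use_captioning_loss and train.use_text_matching_loss deciding which terms take part. A sketch of that weighted sum, with the gating behavior assumed rather than confirmed and `total_loss` a hypothetical name:

```python
from typing import Dict


def total_loss(losses: Dict[str, float], weights: Dict[str, float],
               use_captioning_loss: bool = True, use_text_matching_loss: bool = False) -> float:
    """Weighted sum of the enabled loss terms; the contrastive term is always included."""
    total = weights["contrastive_loss"] * losses["contrastive_loss"]
    if use_captioning_loss:
        total += weights["captioning_loss"] * losses["captioning_loss"]
    if use_text_matching_loss:
        total += weights["matching_loss"] * losses["matching_loss"]
    return total
```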

LoRAConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Enable LoRA fine-tuning. | False |
| lora_rank | int | Rank of the low-rank adapter matrices. Higher rank means more trainable parameters. | 8 |
| lora_alpha | int | Alpha scaling factor for LoRA. Typically set to 2× lora_rank. | 16 |
| lora_dropout | float | Dropout probability applied to LoRA layers. | 0.1 |
| bias | LoraBias | Bias handling for LoRA. Valid options: "none", "all", "lora_only". | "none" |
| use_rslora | bool | Use Rank-Stabilized LoRA for more stable training at higher ranks. | False |
| use_dora | bool | Use DoRA (Weight-Decomposed Low-Rank Adaptation). | False |
| target_modules | Optional[list[str]] | Module name patterns to apply LoRA to. | ["qkv", "fc1", "fc2", "attn.proj", "query", "value", "key", "dense", "vision_proj", "text_proj", "itm_proj"] |
| modules_to_save | Optional[list[str]] | Modules to keep fully trainable (bypassing LoRA). | ["temporal_encoding", "query_pooling"] |
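The relationship between lora_rank, lora_alpha, and use_rslora comes down to the factor that scales the adapter update. Standard LoRA uses alpha / rank, while Rank-Stabilized LoRA uses alpha / sqrt(rank). The helper below is an illustrative sketch of those two formulas. `lora_scaling` is a hypothetical name, not a function from PEFT or this tool.

```python
import math


def lora_scaling(lora_rank: int = 8, lora_alpha: int = 16, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(lora_rank)  # rank-stabilized variant
    return lora_alpha / lora_rank  # standard LoRA
```

With the defaults (rank 8, alpha 16) the standard scaling is 2.0, which is why alpha is commonly kept at twice the rank when the rank changes.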

EMAConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Enable Exponential Moving Average weight tracking. | False |
| beta | float | EMA decay rate. | 0.9999 |
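For reference, the usual exponential moving average update that a decay rate like beta controls is shadow = beta * shadow + (1 - beta) * param. The sketch below applies it to plain floats instead of tensors. `ema_update` is a hypothetical helper, and the tool's actual update may differ in details such as warmup or bias correction.

```python
from typing import Dict


def ema_update(shadow: Dict[str, float], params: Dict[str, float], beta: float = 0.9999) -> None:
    """Update the shadow weights in place toward the current parameters."""
    for name, value in params.items():
        shadow[name] = beta * shadow[name] + (1.0 - beta) * value
```

A beta of 0.9999 means each step moves the shadow weights only 0.01% of the way toward the live weights, so the average spans roughly the last 10,000 iterations.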

DAMPConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Enable DAMP. | False |
| beta | float | DAMP beta coefficient. | 0.1 |
| mode | DAMPMode | DAMP mode. Valid options: "const", "dynamic". | "const" |

ValidationEvalConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| topk_classification | bool | Enable top-K hit rate classification metrics. | True |
| embedding_visualization | bool | Enable UMAP embedding visualization. | False |
| top_k_values | list[int] | List of K values for top-K hit rate computation. | [1, 3, 5, 10] |
| max_eval_samples | int | Maximum number of samples to use during evaluation. | 2000 |
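A top-K hit rate is commonly defined as the fraction of samples whose true label appears among the K highest-ranked predictions. The sketch below computes it for several K values at once. It assumes that definition applies here, and `topk_hit_rates` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def topk_hit_rates(ranked_labels: List[Sequence[int]], true_labels: Sequence[int],
                   top_k_values: Sequence[int] = (1, 3, 5, 10)) -> Dict[int, float]:
    """Fraction of samples whose true label appears among the first K ranked predictions."""
    n = len(true_labels)
    return {
        k: sum(target in ranked[:k] for ranked, target in zip(ranked_labels, true_labels)) / n
        for k in top_k_values
    }
```

Each inner sequence of `ranked_labels` lists predicted label IDs for one sample, best match first.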

QueryConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| input_videos | list[str] | List of video file paths to use as queries. | [] |
| input_texts | list[str] | List of text strings to use as queries. | [] |

VisualEncoderConfig Fields

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| type | VisualEncoderType | Visual encoder type. | "eva_vit_g" |
| img_size | int | Input image size for the visual encoder. | 224 |
| pretrained | bool | Load pretrained visual encoder weights from S3. | False |
| use_fp8 | bool | Use FP8 precision with Transformer Engine (requires transformer_engine=True). | False |
| transformer_engine | bool | Use Transformer Engine for optimized attention computation. | True |
| checkpoint_activations | bool | Use gradient checkpointing for activations to reduce memory usage. | False |
| checkpoint_attention | bool | Use gradient checkpointing for attention (requires transformer_engine=True). | False |
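Two of these flags depend on transformer_engine being enabled. A small validation helper makes the dependency explicit. This is a sketch of the stated requirements only: `visual_encoder_errors` is a hypothetical name, and the real tool may enforce the rules differently or not at all.

```python
from typing import List


def visual_encoder_errors(transformer_engine: bool = True, use_fp8: bool = False,
                          checkpoint_attention: bool = False) -> List[str]:
    """Return a message for each flag that needs Transformer Engine but lacks it."""
    errors: List[str] = []
    if not transformer_engine:
        if use_fp8:
            errors.append("use_fp8=True requires transformer_engine=True")
        if checkpoint_attention:
            errors.append("checkpoint_attention=True requires transformer_engine=True")
    return errors
```

An empty list means the combination satisfies the documented requirements. Note that turning transformer_engine off, as LoRA requires, rules out both use_fp8 and checkpoint_attention.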