ExperimentConfig Fields

Parameter	Type	Description	Default
`wandb`	WandbConfig	Weights and Biases logging configuration. Logging is disabled automatically if no API key is found.	Weights and Biases Logging Configuration
`model`	ModelConfig	Model configuration.	Model Configuration
`dataset`	DatasetConfig	Dataset configuration.	Dataset Configuration
`train`	TrainConfig	Training experiment configuration.	Training Configuration
`evaluate`	EvaluateConfig	Evaluation experiment configuration.	Evaluation Configuration
`inference`	InferenceConfig	Inference experiment configuration.	Inference Configuration
`export`	ExportConfig	ONNX export experiment configuration.	Export Configuration
`results_dir`	str	Directory to save results, checkpoints, and logs.	“/results”
`encryption_key`	Optional[str]	Encryption key for model export (TAO compatibility).	None
`model_name`	str	Model name identifier.	“cosmos_embed1”

WandbConfig Fields

Parameter	Type	Description	Default
`enable`	bool	Enable Weights and Biases logging.	False
`project`	str	Weights and Biases project name.	“cosmos_embed1”
`group`	str	Run group for organizing related runs in the dashboard.	“”
`name`	str	Run name. Empty string auto-generates a name.	“”
`tags`	list[str]	List of tags for filtering runs in the dashboard.	[]
`save_code`	bool	Save a copy of the training code to Weights and Biases.	False
`api_key`	str	API key. If empty, falls back to the WANDB_API_KEY env var.	“”
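
The interaction between `enable`, `api_key`, and the WANDB_API_KEY environment variable can be sketched as follows. This is an illustration only, not the library's code: `resolve_wandb` is a hypothetical helper, and it assumes that "auto-disables" means logging is switched off whenever neither source supplies a key.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_wandb(enable: bool, api_key: str = "",
                  env: Optional[Mapping[str, str]] = None) -> Tuple[bool, str]:
    """Hypothetical helper: return (logging_enabled, resolved_api_key)."""
    env = os.environ if env is None else env
    # An explicit api_key wins; otherwise fall back to WANDB_API_KEY.
    key = api_key or env.get("WANDB_API_KEY", "")
    # Assumption: logging is disabled when no key can be found.
    return (enable and bool(key)), key
```

Under these assumptions, `resolve_wandb(True, "", {})` leaves logging disabled even though `enable` is True.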

ModelConfig Fields

Parameter	Type	Description	Default
`network`	NetworkConfig	Network architecture configuration.	Network Configuration
`pretrained_model_path`	Optional[str]	Path to a pretrained checkpoint. Accepts a local file path (.pth, .safetensors) or a HuggingFace repo ID.	None
`pretrained_model_strict`	bool	Strict state_dict matching when loading pretrained weights. Missing or unexpected keys raise an error when True.	True
`precision`	Precision	Training precision. Valid options: “bf16”, “fp16”, “fp32”.	“bf16”
`input_hw`	list[int]	Data-loader input resolution [H, W]. Distinct from model.network.spatial_resolution.	[224, 224]
`fsdp`	FSDPConfig	Fully Sharded Data Parallel configuration for distributed training.	FSDP Configuration
`fsdp_shard_size`	int	Legacy FSDP shard size used by the model loader.	8
`lora`	LoRAConfig	LoRA configuration. When enabled, wraps the network with PEFT adapters. Requires transformer_engine=False.	LoRA Configuration
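
The effect of `pretrained_model_strict` can be illustrated with plain key sets. The sketch below mirrors the general behaviour of strict state_dict loading (missing or unexpected keys raise an error); `check_state_dict_keys` is a hypothetical helper, not part of the documented API.

```python
from typing import Iterable, List, Tuple


def check_state_dict_keys(model_keys: Iterable[str],
                          checkpoint_keys: Iterable[str],
                          strict: bool = True) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: compare model and checkpoint parameter names."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_set - ckpt_set)      # expected by the model, absent in the checkpoint
    unexpected = sorted(ckpt_set - model_set)   # present in the checkpoint, unknown to the model
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"state_dict mismatch: missing={missing}, unexpected={unexpected}"
        )
    # In non-strict mode the mismatches are only reported.
    return missing, unexpected
```

With `strict=False`, the mismatched keys are returned for inspection instead of aborting the load.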

DatasetConfig Fields

Parameter	Type	Description	Default
`train_dataset`	SingleDatasetConfig	Training dataset configuration.	Single Dataset Configuration
`val_dataset`	SingleDatasetConfig	Validation dataset configuration (used during training validation).	Single Dataset Configuration
`test_dataset`	SingleDatasetConfig	Test/evaluation dataset configuration (used by the evaluate action).	Single Dataset Configuration
`inference_dataset`	SingleDatasetConfig	Inference search database configuration (used by the inference action).	Single Dataset Configuration

TrainConfig Fields

Parameter	Type	Description	Default
`optim`	OptimConfig	Optimizer configuration.	Optimizer Configuration
`loss_weights`	LossWeightsConfig	Per-loss weight configuration.	Loss Weights Configuration
`seed`	int	Random seed for reproducibility.	1234
`max_iter`	int	Maximum number of training iterations.	50000
`num_nodes`	int	Number of nodes for distributed training.	1
`num_gpus`	int	Number of GPUs per node. Use -1 to auto-detect all available GPUs, 0 for CPU only.	1
`gpu_ids`	list[int]	List of GPU device IDs to use. Overrides num_gpus for device selection.	[0]
`validation_iter`	int	Frequency of validation runs, in iterations.	1000
`checkpoint_iter`	int	Frequency of checkpoint saves, in iterations.	1000
`clip_grad_norm`	float	Gradient clipping norm. Set to 0.0 to disable gradient clipping.	0.0
`precision`	Precision	Training precision. Valid options: “bf16”, “fp16”, “fp32”.	“bf16”
`resume_training_checkpoint_path`	Optional[str]	Path to a checkpoint to resume training from.	None
`callbacks`	dict[str, Any]	Dict mapping callback name to parameter overrides. Keys must match CALLBACK_REGISTRY.	{wandb, clamp_logit_scale, …}
`max_val_iter`	Optional[int]	Maximum number of validation batches per GPU. None runs the full validation set.	None
`freeze_visual_encoder`	bool	Freeze the visual encoder weights during training.	True
`use_captioning_loss`	bool	Enable the captioning loss during training.	True
`use_text_matching_loss`	bool	Enable the text matching loss during training.	False
`ema`	EMAConfig	Exponential Moving Average configuration.	EMA Configuration
`spectral_reparam`	bool	Enable spectral reparameterization.	False
`damp`	DAMPConfig	DAMP (Decoupled Attention and Momentum Path) training technique configuration.	DAMP Configuration
`load_training_state`	bool	Restore optimizer and scheduler state when resuming training.	False
`strict_resume`	bool	Strict state_dict matching when resuming from a checkpoint.	False
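
As a rough illustration of `clip_grad_norm`, the sketch below rescales a flat list of gradient values so that their L2 norm does not exceed the configured value, and treats 0.0 as "disabled". `clip_by_l2_norm` is a hypothetical helper operating on plain floats; the actual trainer works on tensors and may differ in details such as epsilon handling.

```python
import math
from typing import List


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Hypothetical helper: clip gradients to a maximum L2 norm."""
    # A value of 0.0 disables clipping, matching the table above.
    if max_norm <= 0.0:
        return list(grads)
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    # Rescale every gradient by the same factor so the direction is preserved.
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

For example, gradients `[3.0, 4.0]` have norm 5.0, so clipping to 1.0 scales both values by 0.2.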

EvaluateConfig Fields

Parameter	Type	Description	Default
`checkpoint`	Optional[str]	Path to the model checkpoint for evaluation.	None
`max_val_batches`	int	Maximum number of validation batches to run. -1 runs all batches.	-1
`num_gpus`	int	Number of GPUs for evaluation.	1
`callbacks`	ValidationEvalConfig	Validation evaluation callback configuration.	Validation Evaluation Callbacks Configuration
`load_dataset_pkl`	Optional[str]	Path to load pre-computed eval embeddings from. When set and the file exists, model inference is skipped.	None
`save_dataset_pkl`	Optional[str]	Path to save generated eval embeddings to. When set, embeddings are saved after generation (rank 0 only).	None
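
The caching behaviour described for `load_dataset_pkl` and `save_dataset_pkl` can be sketched as below. `get_embeddings`, `compute`, and `is_rank_zero` are hypothetical names, and the use of Python's `pickle` format is an assumption based on the field names.

```python
import os
import pickle
from typing import Any, Callable, Optional


def get_embeddings(compute: Callable[[], Any],
                   load_dataset_pkl: Optional[str] = None,
                   save_dataset_pkl: Optional[str] = None,
                   is_rank_zero: bool = True) -> Any:
    """Hypothetical helper: load cached embeddings or compute and cache them."""
    # If a cache path is set and the file exists, skip model inference.
    if load_dataset_pkl and os.path.exists(load_dataset_pkl):
        with open(load_dataset_pkl, "rb") as f:
            return pickle.load(f)
    embeddings = compute()
    # Only rank 0 writes the file, so distributed runs do not clobber each other.
    if save_dataset_pkl and is_rank_zero:
        with open(save_dataset_pkl, "wb") as f:
            pickle.dump(embeddings, f)
    return embeddings
```

Pointing both fields at the same path gives a compute-once workflow: the first run generates and saves the embeddings, and later runs load them.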

InferenceConfig Fields

Parameter	Type	Description	Default
`checkpoint`	Optional[str]	Path to the model checkpoint for inference.	None
`query`	QueryConfig	Query inputs (text and/or video) for similarity search.	Query Configuration
`num_gpus`	int	Number of GPUs for inference.	1
`k`	int	Number of nearest-neighbor results to return per query.	5
`load_dataset_pkl`	Optional[str]	Path to load pre-computed search database embeddings from. When set and the file exists, model inference is skipped.	None
`save_dataset_pkl`	Optional[str]	Path to save generated search database embeddings to. When set, embeddings are saved after generation.	None
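
To make the role of `k` concrete, here is a minimal nearest-neighbor search over a small embedding database. It assumes cosine similarity as the ranking metric, which the table does not state; `top_k_neighbors` is a hypothetical helper written in pure Python for clarity rather than speed.

```python
import math
from typing import List, Sequence, Tuple


def top_k_neighbors(query: Sequence[float],
                    database: Sequence[Sequence[float]],
                    k: int = 5) -> List[Tuple[int, float]]:
    """Hypothetical helper: return (index, cosine similarity) of the k best matches."""
    def normalize(vec: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
        return [x / norm for x in vec]

    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(row))), idx)
        for idx, row in enumerate(database)
    ]
    # Highest similarity first; ties are broken by the lower database index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(idx, score) for score, idx in scored[:k]]
```

Each query in `query.input_texts` or `query.input_videos` would be embedded and ranked this way against the search database, returning up to `k` results.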

ExportConfig Fields

Parameter	Type	Description	Default
`checkpoint`	Optional[str]	Path to the model checkpoint for export.	None
`onnx_file`	Optional[str]	Output ONNX file path. If None, the path is auto-derived from the checkpoint path and mode.	None
`mode`	ExportMode	Export mode. Valid options: “video”, “text”, “combined”, “huggingface”.	“video”
`opset_version`	int	ONNX opset version.	17
`batch_size`	int	Batch size for export. Set to -1 for a dynamic batch dimension.	1
`on_cpu`	bool	Run export on CPU instead of GPU.	False
`verbose`	bool	Print verbose ONNX export information.	False
`simplify`	bool	Apply onnxsim simplification after export.	False
`hf_output_dir`	Optional[str]	Output directory for HuggingFace export. If None, auto-derived from checkpoint path. Only used when mode=huggingface.	None
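
A `batch_size` of -1 requests a dynamic batch dimension. With PyTorch's ONNX exporter this is typically expressed through the `dynamic_axes` argument of `torch.onnx.export`. The sketch below only builds that mapping; `build_dynamic_axes` is a hypothetical helper and the tensor names are placeholders, not the names used by the actual export code.

```python
from typing import Dict, List


def build_dynamic_axes(batch_size: int,
                       tensor_names: List[str]) -> Dict[str, Dict[int, str]]:
    """Hypothetical helper: build a dynamic_axes mapping for torch.onnx.export."""
    # A fixed batch size needs no dynamic axes.
    if batch_size != -1:
        return {}
    # Mark axis 0 (the batch axis) as dynamic for every listed input/output tensor.
    return {name: {0: "batch"} for name in tensor_names}
```

For example, `build_dynamic_axes(-1, ["video", "embedding"])` marks the first axis of both placeholder tensors as dynamic, while `build_dynamic_axes(1, ["video"])` returns an empty mapping.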

NetworkConfig Fields

Parameter	Type	Description	Default
`visual_encoder`	VisualEncoderConfig	Visual encoder configuration.	Visual Encoder Configuration
`embed_dim`	int	Output embedding dimension for video-text alignment.	256
`num_query_tokens`	int	Number of learnable query tokens in the Q-Former.	32
`max_txt_len`	int	Maximum text token sequence length.	128
`num_video_frames`	int	Number of input video frames.	8
`spatial_resolution`	list[int]	Spatial resolution [H, W] for input video frames.	[224, 224]
`temporal_encoding_type`	TemporalEncodingType	Type of temporal encoding.	“neighboring_token_propagation”
`contrastive_type`	ContrastiveType	Contrastive loss type. Valid options: “clip”, “siglip”.	“clip”
`qformer_pretrain_ckpt`	Optional[str]	Path or HuggingFace repo ID for the Q-Former pretrained checkpoint.	None
`query_pooling_type`	QueryPoolingType	Query pooling method after the Q-Former. Valid options: “avg”, “attention”, “identity”.	“avg”
`pretrained_text_encoder`	bool	Load pretrained BERT weights for the text encoder.	False
`pretrained_visual_encoder`	bool	Load pretrained weights for the visual encoder from S3 or HuggingFace.	False
`num_heldout_frames`	int	Number of held-out frames for certain training strategies.	0

FSDPConfig Fields

Parameter	Type	Description	Default
`enabled`	bool	Enable Fully Sharded Data Parallel.	False
`shard_size`	Optional[int]	FSDP shard group size. None auto-selects one shard per node.	None
`replica_size`	Optional[int]	FSDP replica group size. None auto-selects.	None

SingleDatasetConfig Fields

Parameter	Type	Description	Default
`dataset_type`	DatasetType	Dataset class to use. Valid options: “mock”, “vad_r1”, “vad_r1_chunks”, “msrvtt”, “kinetics”, “http”.	“mock”
`metadata`	Optional[str]	Path to the metadata JSON or JSONL file.	None
`data_root`	Optional[str]	Root directory for video data.	None
`num_video_frames`	int	Number of video frames to sample from each video.	8
`resolution`	list[int]	Video frame resolution [H, W].	[224, 224]
`batch_size`	int	Batch size per GPU.	4
`workers`	int	Number of dataloader worker processes.	4
`drop_last`	bool	Drop the last incomplete batch when the dataset size is not divisible by batch_size.	True
`prefetch_factor`	int	Number of batches to prefetch per worker process.	2
`pin_memory`	bool	Pin memory buffers for faster GPU transfer.	True
`split`	Optional[str]	Split filter for VadR1 datasets, e.g., “train”, “test”. None means no filtering.	None
`random_caption`	bool	When caption_field is a list, randomly pick one of the listed fields for each sample instead of always using the first.	False
`path_prefix_mapping`	dict[str, str]	Remap video file paths, e.g., {“/old/path/”: “/new/path/”}.	{}
`skip_missing_files`	bool	Skip dataset entries whose video files are missing.	True
`caption_field`	Any	Metadata field(s) to use as captions. String or list of strings, e.g., “anomaly_type”.	“anomaly_type”
`mp4_urls`	Optional[str]	Glob pattern for video files used by MSRVTTDataset and KineticsDataset.	None
`caption_to_label`	dict[str, int]	Mapping from caption text to integer label ID.	{}
`chunk_size_sec`	float	Duration of each temporal chunk in seconds (VadR1ChunksDataset only).	5.0
`shared_normal_label`	bool	When True, all normal (non-anomaly) samples share a single label ID instead of per-class labels.	True
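
The `path_prefix_mapping` field remaps video paths stored in the metadata, which is useful when a dataset has moved. A minimal sketch of such a remapping is shown below; `remap_path` is a hypothetical helper, and trying the longest prefix first is an assumption about how overlapping prefixes would be resolved.

```python
from typing import Dict


def remap_path(path: str, path_prefix_mapping: Dict[str, str]) -> str:
    """Hypothetical helper: replace a matching path prefix."""
    # Assumption: longer prefixes take priority over shorter ones.
    for old_prefix in sorted(path_prefix_mapping, key=len, reverse=True):
        if path.startswith(old_prefix):
            return path_prefix_mapping[old_prefix] + path[len(old_prefix):]
    # Paths without a matching prefix are returned unchanged.
    return path
```

With the mapping from the table, `remap_path("/old/path/a.mp4", {"/old/path/": "/new/path/"})` yields `/new/path/a.mp4`.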

Dataset Format Reference

dataset_type	Metadata Format	Entry Schema	Required Config Fields
“mock”	None	No metadata file needed. Generates random frames using `resolution` and `num_video_frames`.
“vad_r1”	JSON or JSONL	Each entry: `path` (video file path), `anomaly_type` (caption). Optional: `split`, `start`, `end`, `total_frames`, `what`, `when`, `where`, `why`, `how`.	`metadata`, `data_root`
“vad_r1_chunks”	JSON or JSONL	Each entry: `video_path`, `anomaly_type`. Optional: `split`, `chunks` (list of chunk dicts with `start_time_sec`, `end_time_sec`, `is_anomaly`).	`metadata`, `data_root`
“msrvtt”	JSON with video/caption pairs	Each entry: `video_id`, `caption`. Video files located via `mp4_urls` glob pattern.	`mp4_urls`, `metadata`
“kinetics”	CSV with youtube_id and label	Each row: `youtube_id`, `label`. Video files located via `mp4_urls` glob pattern.	`mp4_urls`, `metadata`
“http”	JSON or JSONL	Each entry: `url` (HTTP/HTTPS video URL), `captions` (list of caption strings). Optional: `video_id`, `caption_to_label`.	`metadata`
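
As an illustration of the JSONL variants above, the following sketch builds two `vad_r1` entries and checks that each line carries the required keys. The file contents, caption values, and the `parse_jsonl` helper are hypothetical; only the field names come from the table.

```python
import json
from typing import Dict, List

# Required keys per JSONL-based dataset type, taken from the table above.
REQUIRED_FIELDS = {
    "vad_r1": ("path", "anomaly_type"),
    "vad_r1_chunks": ("video_path", "anomaly_type"),
    "http": ("url", "captions"),
}


def parse_jsonl(text: str, dataset_type: str) -> List[Dict]:
    """Hypothetical helper: parse JSONL metadata and validate required keys."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS[dataset_type] if key not in entry]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        entries.append(entry)
    return entries


# Made-up sample entries for illustration only.
sample = "\n".join([
    json.dumps({"path": "videos/a.mp4", "anomaly_type": "normal", "split": "train"}),
    json.dumps({"path": "videos/b.mp4", "anomaly_type": "fighting", "split": "test"}),
])
entries = parse_jsonl(sample, "vad_r1")
```

Running a check like this before training can surface malformed metadata early, independent of `skip_missing_files`, which concerns missing video files rather than missing keys.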

Training Callbacks

Callback	Default Parameters	Description
“wandb”	{}	Logs training metrics to Weights and Biases.
“clamp_logit_scale”	{}	Clamps the logit scale parameter to prevent instability.
“logit_parameters_monitor”	{}	Logs logit scale and bias parameters.
“iter_speed”	every_n: 50, save_s3: False	Logs iteration throughput (samples/sec) every N iterations.
“gradient_clip”	clip_norm: 3.0	Clips gradients to a maximum L2 norm.
“grad_norm_monitor”	every_n: 500, verbose: False	Logs gradient norms every N iterations.
“spectral_norm_monitor”	every_n: 1000, verbose: True	Logs spectral norms of weight matrices every N iterations.
“ema”	{}	Updates the Exponential Moving Average model shadow weights.
“log_losses”	every_n: 50, verbose: True	Logs all loss components every N iterations.
“text_frames_visualizer”	every_n: 500	Logs video frame and text caption pairs to Weights and Biases.
“pca_feature_map_visualizer”	every_n: 500	Logs PCA-projected feature map visualizations to Weights and Biases.
“validation_eval”	{}	Runs full evaluation metrics during training validation. Not included by default; add to enable.
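
The `train.callbacks` dict maps callback names to parameter overrides. One plausible reading, sketched below, is that each override is merged onto the defaults listed in the table and that unknown names are rejected. `merge_callback_params` and `DEFAULT_CALLBACK_PARAMS` are hypothetical stand-ins for the real CALLBACK_REGISTRY handling.

```python
from typing import Any, Dict, Optional

# Default parameters copied from the table above.
DEFAULT_CALLBACK_PARAMS: Dict[str, Dict[str, Any]] = {
    "wandb": {},
    "clamp_logit_scale": {},
    "logit_parameters_monitor": {},
    "iter_speed": {"every_n": 50, "save_s3": False},
    "gradient_clip": {"clip_norm": 3.0},
    "grad_norm_monitor": {"every_n": 500, "verbose": False},
    "spectral_norm_monitor": {"every_n": 1000, "verbose": True},
    "ema": {},
    "log_losses": {"every_n": 50, "verbose": True},
    "text_frames_visualizer": {"every_n": 500},
    "pca_feature_map_visualizer": {"every_n": 500},
    "validation_eval": {},
}


def merge_callback_params(
    overrides: Dict[str, Optional[Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: merge user overrides onto the default parameters."""
    merged = {}
    for name, params in overrides.items():
        if name not in DEFAULT_CALLBACK_PARAMS:
            raise KeyError(f"unknown callback: {name}")
        # Assumption: user-supplied values replace defaults key by key.
        merged[name] = {**DEFAULT_CALLBACK_PARAMS[name], **(params or {})}
    return merged
```

For instance, `merge_callback_params({"gradient_clip": {"clip_norm": 1.0}, "validation_eval": {}})` tightens gradient clipping and opts in to the `validation_eval` callback.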

OptimConfig Fields

Parameter	Type	Description	Default
`optim`	OptimizerType	Optimizer type. Valid options: “adamw”, “fused_adamw”, “adam”, “sgd”.	“adamw”
`lr`	float	Learning rate.	1e-05
`weight_decay`	float	Weight decay coefficient.	1e-05
`betas`	list[float]	Adam and AdamW beta coefficients.	[0.9, 0.98]
`warmup_steps`	int	Number of warmup steps for the learning rate scheduler.	1000
`policy`	LRPolicy	Learning rate schedule policy. Valid options: “cosine”, “linear”, “constant”.	“cosine”
`lr_decay_iters`	int	Number of iterations over which to decay the learning rate (cosine scheduler).	50000
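
To show how `lr`, `warmup_steps`, and `lr_decay_iters` might interact under the “cosine” policy, here is a common formulation: linear warmup followed by cosine decay to a minimum value. The exact schedule used by the trainer is not specified in the table, so `lr_at`, the `min_lr` argument, and the choice to start the decay after warmup are assumptions.

```python
import math


def lr_at(step: int, base_lr: float = 1e-5, warmup_steps: int = 1000,
          lr_decay_iters: int = 50000, min_lr: float = 0.0) -> float:
    """Hypothetical helper: learning rate at a given step (warmup + cosine decay)."""
    # Linear warmup from base_lr / warmup_steps up to base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_span = max(1, lr_decay_iters - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_span)
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults above, the rate reaches `base_lr` at the end of warmup and decays to `min_lr` by iteration 50000, which matches the default `max_iter`.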

LossWeightsConfig Fields

Parameter	Type	Description	Default
`contrastive_loss`	float	Weight for the contrastive loss term.	1.0
`captioning_loss`	float	Weight for the captioning loss term.	1.0
`matching_loss`	float	Weight for the text matching loss term.	1.0
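
A plausible way these weights combine with the `use_captioning_loss` and `use_text_matching_loss` flags from TrainConfig is a gated weighted sum, sketched below. `total_loss` is a hypothetical helper; the real training step may combine the terms differently.

```python
from typing import Dict


def total_loss(losses: Dict[str, float], weights: Dict[str, float],
               use_captioning_loss: bool = True,
               use_text_matching_loss: bool = False) -> float:
    """Hypothetical helper: weighted sum of the enabled loss terms."""
    # Assumption: the contrastive term is always active.
    total = weights["contrastive_loss"] * losses["contrastive_loss"]
    if use_captioning_loss:
        total += weights["captioning_loss"] * losses["captioning_loss"]
    if use_text_matching_loss:
        total += weights["matching_loss"] * losses["matching_loss"]
    return total
```

Under this reading, setting a weight has no effect unless the corresponding flag enables that loss term.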

LoRAConfig Fields

Parameter	Type	Description	Default
`enabled`	bool	Enable LoRA fine-tuning.	False
`lora_rank`	int	Rank of the low-rank adapter matrices. Higher rank means more trainable parameters.	8
`lora_alpha`	int	Alpha scaling factor for LoRA. Typically set to 2× lora_rank.	16
`lora_dropout`	float	Dropout probability applied to LoRA layers.	0.1
`bias`	LoraBias	Bias handling for LoRA. Valid options: “none”, “all”, “lora_only”.	“none”
`use_rslora`	bool	Use Rank-Stabilized LoRA for more stable training at higher ranks.	False
`use_dora`	bool	Use DoRA (Weight-Decomposed Low-Rank Adaptation).	False
`target_modules`	Optional[list[str]]	Module name patterns to apply LoRA to.	[“qkv”, “fc1”, “fc2”, “attn.proj”, “query”, “value”, “key”, “dense”, “vision_proj”, “text_proj”, “itm_proj”]
`modules_to_save`	Optional[list[str]]	Modules to keep fully trainable (bypassing LoRA).	[“temporal_encoding”, “query_pooling”]
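
The relationship between `lora_rank`, `lora_alpha`, and `use_rslora` comes down to the scaling factor applied to the adapter output. In standard LoRA the factor is alpha / rank, while Rank-Stabilized LoRA uses alpha / sqrt(rank). The helper name `lora_scaling` below is hypothetical; the formulas follow the published LoRA and rsLoRA definitions.

```python
import math


def lora_scaling(lora_rank: int, lora_alpha: int, use_rslora: bool = False) -> float:
    """Hypothetical helper: scaling factor applied to the low-rank update."""
    if use_rslora:
        # rsLoRA divides by sqrt(rank), so the factor shrinks more slowly as rank grows.
        return lora_alpha / math.sqrt(lora_rank)
    return lora_alpha / lora_rank
```

With the defaults (`lora_rank=8`, `lora_alpha=16`), the standard scaling factor is 2.0, consistent with the "2× lora_rank" rule of thumb for alpha.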

EMAConfig Fields

Parameter	Type	Description	Default
`enabled`	bool	Enable Exponential Moving Average weight tracking.	False
`beta`	float	EMA decay rate.	0.9999
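
The `beta` decay rate controls how slowly the shadow weights track the live weights. The standard EMA update is sketched below on plain lists of floats; `ema_update` is a hypothetical helper, and the real callback operates on model tensors.

```python
from typing import List


def ema_update(shadow: List[float], params: List[float], beta: float = 0.9999) -> List[float]:
    """Hypothetical helper: one EMA step, shadow = beta * shadow + (1 - beta) * param."""
    return [beta * s + (1.0 - beta) * p for s, p in zip(shadow, params)]
```

A `beta` close to 1.0, such as the default 0.9999, means each step moves the shadow weights only 0.01% of the way toward the current weights.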

DAMPConfig Fields

Parameter	Type	Description	Default
`enabled`	bool	Enable DAMP.	False
`beta`	float	DAMP beta coefficient.	0.1
`mode`	DAMPMode	DAMP mode. Valid options: “const”, “dynamic”.	“const”

ValidationEvalConfig Fields

Parameter	Type	Description	Default
`topk_classification`	bool	Enable top-K hit rate classification metrics.	True
`embedding_visualization`	bool	Enable UMAP embedding visualization.	False
`top_k_values`	list[int]	List of K values for top-K hit rate computation.	[1, 3, 5, 10]
`max_eval_samples`	int	Maximum number of samples to use during evaluation.	2000
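
The top-K hit rate reported for each value in `top_k_values` can be computed as the fraction of samples whose true label appears among the first K ranked predictions. The sketch below assumes that definition; `topk_hit_rates` is a hypothetical helper and the label representation is illustrative.

```python
from typing import Dict, Hashable, Sequence


def topk_hit_rates(ranked_labels: Sequence[Sequence[Hashable]],
                   true_labels: Sequence[Hashable],
                   top_k_values: Sequence[int] = (1, 3, 5, 10)) -> Dict[int, float]:
    """Hypothetical helper: hit rate per K, given predictions sorted best-first."""
    n = len(true_labels)
    if n == 0:
        return {k: 0.0 for k in top_k_values}
    return {
        # A sample counts as a hit if its true label is in the top K predictions.
        k: sum(truth in ranked[:k] for ranked, truth in zip(ranked_labels, true_labels)) / n
        for k in top_k_values
    }
```

By construction the hit rate is non-decreasing in K, so the value for K=10 is an upper bound on the value for K=1.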

QueryConfig Fields

Parameter	Type	Description	Default
`input_videos`	list[str]	List of video file paths to use as queries.	[]
`input_texts`	list[str]	List of text strings to use as queries.	[]

VisualEncoderConfig Fields

Parameter	Type	Description	Default
`type`	VisualEncoderType	Visual encoder type.	“eva_vit_g”
`img_size`	int	Input image size for the visual encoder.	224
`pretrained`	bool	Load pretrained visual encoder weights from S3.	False
`use_fp8`	bool	Use FP8 precision with Transformer Engine (requires transformer_engine=True).	False
`transformer_engine`	bool	Use Transformer Engine for optimized attention computation.	True
`checkpoint_activations`	bool	Use gradient checkpointing for activations to reduce memory usage.	False
`checkpoint_attention`	bool	Use gradient checkpointing for attention (requires transformer_engine=True).	False
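
Several flags in these tables constrain each other: `use_fp8` and `checkpoint_attention` require Transformer Engine, while LoRA requires it to be off. A pre-flight check along these lines could catch conflicts before a run starts; `validate_encoder_flags` is a hypothetical helper, not part of the documented API.

```python
from typing import List


def validate_encoder_flags(transformer_engine: bool = True,
                           use_fp8: bool = False,
                           checkpoint_attention: bool = False,
                           lora_enabled: bool = False) -> List[str]:
    """Hypothetical helper: return human-readable conflicts between flags."""
    errors = []
    if use_fp8 and not transformer_engine:
        errors.append("use_fp8 requires transformer_engine=True")
    if checkpoint_attention and not transformer_engine:
        errors.append("checkpoint_attention requires transformer_engine=True")
    if lora_enabled and transformer_engine:
        errors.append("lora.enabled requires transformer_engine=False")
    return errors
```

Because `transformer_engine` defaults to True, enabling LoRA with otherwise default settings reports a conflict under this check; the fix is to set `transformer_engine` to False as well.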