Cosmos-Reason#
Cosmos-Reason is a state-of-the-art video-language model included in TAO Toolkit. It supports the following tasks:
train
evaluate
inference
You can invoke these tasks from the FTMS client using the following convention:
tao-client cosmos-rl <sub_task> <args_per_subtask>
Where <args_per_subtask> are the command-line arguments required for a given subtask. Each
subtask is explained in detail in the following sections.
Note
Cosmos-RL is currently available only through the TAO Toolkit API and tao-client interfaces.
There is no launcher-based interface for VLM models.
Hardware Requirements#
Minimum requirements:
GPUs: 8x A100 GPUs with at least 80 GB GPU memory
Storage: Minimum 200 GB of free disk space (each Cosmos-RL checkpoint when written to disk is ~150 GB)
OS: Ubuntu 22.04+
Driver: NVIDIA Driver 570
CUDA: CUDA 12.8
Recommended configuration for optimal performance:
Multi-node training for large-scale datasets
High-bandwidth storage system for efficient video data access
Multiple CPU cores for parallel data preprocessing
Data Input for Cosmos-RL#
Cosmos-RL expects datasets in the LLaVA format with the following structure:
dataset_folder/
images.tar.gz or videos.tar.gz (Video frames or image sequences)
annotations.json (Text annotations in JSON format)
Data format specifications:
Dataset Type: vlm (Vision-Language Model)
Format: llava
Supported Intents: training, evaluation, testing
Annotation format:
The annotations should follow the LLaVA conversation format:
{
    "id": "d460df3a29cc7d208d4d588c63e83579",
    "images": [
        "images/001354.png",
        "som_images/001354.d460df3a29cc7d208d4d588c63e83579.png"
    ],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nThe first image is the original, and the second is an overlay. Bright numeric IDs are labeled at the center of certain visual objects in the second image.\nBased on pallet positions in Region [0] Region [1] Region [2] Region [3] Region [4] Region [5] Region [6] Region [7] Region [8] Region [9], which one should the transporter at Region [10] retrieve?\nPlease answer with only the integer number of the correct region the number should be one that is both shown in the image and mentioned in this question. Do not include any explanation or extra text."
        },
        {
            "from": "gpt",
            "value": "3"
        }
    ],
    "category": "mcq",
    "normalized_answer": "3"
}
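The structure above can be checked programmatically before training. Here is a minimal sketch of a per-record validator; the key names follow the example above, and `validate_record` is a hypothetical helper, not part of tao-client:

```python
import json

# Hypothetical validator for one LLaVA-format record; key names follow the
# example above, not an official schema.
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in a single annotation record."""
    problems = []
    for key in ("id", "conversations"):
        if key not in record:
            problems.append(f"missing key: {key}")
    turns = record.get("conversations", [])
    # Conversations alternate between a "human" prompt and a "gpt" answer.
    for i, turn in enumerate(turns):
        if turn.get("from") not in ("human", "gpt"):
            problems.append(f"turn {i}: unknown speaker {turn.get('from')!r}")
        if "value" not in turn:
            problems.append(f"turn {i}: missing 'value'")
    # Every <image> placeholder should be backed by an entry in "images".
    n_placeholders = sum(t.get("value", "").count("<image>") for t in turns)
    if n_placeholders > len(record.get("images", [])):
        problems.append("more <image> placeholders than images listed")
    return problems

record = json.loads("""
{"id": "abc", "images": ["images/001.png"],
 "conversations": [{"from": "human", "value": "<image>\\nWhat is shown?"},
                   {"from": "gpt", "value": "3"}]}
""")
print(validate_record(record))  # []
```

Running a check like this over every record in annotations.json before submitting a job catches format errors early, when they are cheap to fix.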
Creating a Training Specification File#
SPECS=$(tao-client cosmos-rl get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
The training specification file for Cosmos-RL includes train, validation, policy, logging, and custom parameters.
Here is an example specification file for training a Cosmos-RL model:
train:
  resume: false
  epoch: 10
  compile: false
  train_batch_per_replica: 1
  output_dir: "output"
  optm_lr: 1e-6
  optm_impl: "foreach"
  optm_weight_decay: 0.01
  optm_min_lr_factor: 0.0
  optm_grad_norm_clip: 1.0
  epsilon: 1e-8
  optm_name: "AdamW"
  optm_betas: [0.9, 0.999]
  optm_warmup_epochs: 0
  async_tp_enabled: false
  master_dtype: "float32"
  param_dtype: "bfloat16"
  fsdp_reduce_dtype: "float32"
  fsdp_offload: false
  fsdp_reshard_after_forward: "default"
  sync_weight_interval: 1
  ckpt:
    enable_checkpoint: true
    save_freq_in_epoch: 10
    save_mode: "sync"
    max_keep: 8
    export_safetensors: true
  train_policy:
    type: "sft"
    mini_batch: 4
    enable_dataset_cache: true
    dataloader_num_workers: 8
    dataloader_prefetch_factor: 8
    conversation_column_name: "conversations"
    dataset:
      name: "its"
      test_size: 1
  fp8:
    enable_fp8: false
    fp8_recipe: "dynamic_scaling"
    quant_recipe: "rowwise"
validation:
  enable: true
  freq_in_epoch: 10
policy:
  model_name_or_path: "nvidia/Cosmos-Reason1-7B"
  model_max_length: 4096
  model_gradient_checkpointing: true
  parallelism:
    n_init_replicas: 1
    tp_size: 1
    cp_size: 1
    dp_shard_size: 1
    dp_replicate_size: 1
    pp_size: 1
    cp_rotate_method: "allgather"
  lora:
    r: 8
    lora_alpha: 8
    lora_dropout: 0.0
    target_modules: ["q_proj", "v_proj"]
    use_rslora: false
    modules_to_save: []
    init_lora_weights: true
logging:
  logger: ["console", "tao"]
  project_name: "cosmos-rl"
  experiment_name: "cosmos-rl"
custom:
  dataset:
    annotation_path: "data/sft/annotations.json"
    media_path: "data/sft/train2017"
    system_prompt: ""
  vision:
    fps: 1
    total_pixels: 313600
redis: "12800"
results_dir: "/results"
Training Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL training, organized by configuration group.
ExperimentConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train | collection | Train config | | | | | FALSE |
| validation | collection | Validation config | | | | | FALSE |
| policy | collection | Policy config | | | | | FALSE |
| logging | collection | Logging config | | | | | FALSE |
| redis | string | Redis port for distributed training coordination and interprocess communication in multinode setups | 12800 | | | | |
| results_dir | string | Root folder for all training outputs including checkpoints, logs, and evaluation results | /results | | | | |
| custom | collection | Custom config | | | | | FALSE |
TrainConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| resume | bool | Resume training from the latest checkpoint in the output folder. | FALSE | | | | |
| epoch | int | Total number of training epochs (complete passes through the dataset). | 10 | 10 | 20 | | TRUE |
| compile | bool | Use PyTorch 2.0 compilation (torch.compile). | FALSE | | | | |
| train_batch_per_replica | int | Batch size per GPU replica. Global batch size = train_batch_per_replica × number of replicas. | 1 | 1 | inf | | |
| output_dir | string | Folder for saving checkpoints, logs, and training artifacts. | output | | | | |
| optm_lr | float | Peak learning rate for the optimizer. The actual LR follows a warmup and cosine decay schedule. | 1e-06 | 0 | inf | | TRUE |
| optm_impl | categorical | Optimizer implementation: fused, foreach, or for-loop. | foreach | | | fused,foreach,for-loop | |
| optm_weight_decay | float | L2 regularization coefficient (weight decay) to prevent overfitting. Applied to all parameters except biases and norms. | 0.01 | 0 | inf | | |
| optm_min_lr_factor | float | Minimum learning rate as a fraction of peak LR. For cosine annealing: min_lr = optm_min_lr_factor × optm_lr. | 0.0 | 0 | inf | | |
| optm_grad_norm_clip | float | Maximum gradient norm for clipping. Prevents exploding gradients. Set to 0 or negative to disable clipping. | 1.0 | 0 | inf | | |
| epsilon | float | Small constant added to the denominator for numerical stability in the Adam/AdamW optimizer. | 1e-08 | 0 | inf | | |
| optm_name | categorical | Optimizer algorithm: 'AdamW' (Adam with decoupled weight decay, recommended) or 'Adam' (original). | AdamW | | | AdamW,Adam | TRUE |
| optm_betas | list_2 | Beta coefficients for Adam/AdamW: [beta1, beta2] for exponential moving averages of the gradient and squared gradient. | [0.9, 0.999] | | | | TRUE |
| optm_warmup_epochs | union | Number of epochs for linear learning rate warmup from 0 to the peak learning rate. | 0 | 0 | inf | | TRUE |
| async_tp_enabled | bool | Enable asynchronous Tensor Parallel communication to overlap computation and communication for better throughput. | FALSE | | | | |
| master_dtype | categorical | Data type for master weights in optimizer states. Higher precision prevents accumulated rounding errors. | float32 | | | float32,float16,bfloat16 | |
| param_dtype | categorical | Data type for model parameters and activations during training. | bfloat16 | | | float32,float16,bfloat16 | |
| fsdp_reduce_dtype | categorical | Data type for gradient all-reduce in Fully Sharded Data Parallel. | float32 | | | float32,float16,bfloat16 | |
| fsdp_offload | bool | Offload FSDP parameters to CPU memory when not in use. Reduces GPU memory but increases overhead. | FALSE | | | | |
| fsdp_reshard_after_forward | categorical | Reshard parameters after the forward pass. | default | | | default,true,false | |
| sync_weight_interval | int | Interval in steps for synchronizing weights across data parallel replicas. Higher values reduce communication overhead. | 1 | 1 | inf | | |
| ckpt | collection | Train checkpoint config. | | | | | FALSE |
| train_policy | collection | Train policy config. | | | | | FALSE |
| fp8 | collection | Train FP8 config. | | | | | FALSE |
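Reading the batch-size fields together: one optimizer step consumes train_batch_per_replica samples on each replica, and the train_policy mini_batch setting controls how that per-replica batch is split for gradient accumulation. A rough sketch of the arithmetic under that interpretation (the exact accounting inside Cosmos-RL may differ):

```python
import math

# Hedged sketch of the batch bookkeeping implied by the fields above;
# the exact accounting inside Cosmos-RL may differ.
def batch_arithmetic(train_batch_per_replica: int, n_replicas: int, mini_batch: int):
    # One optimizer step consumes this many samples across all replicas.
    global_batch = train_batch_per_replica * n_replicas
    # Each replica works through its batch mini_batch samples at a time,
    # accumulating gradients between micro-steps.
    accumulation_steps = math.ceil(train_batch_per_replica / mini_batch)
    return global_batch, accumulation_steps

print(batch_arithmetic(8, 4, 4))  # (32, 2)
```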
ValidationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable | bool | Run validation during training to monitor model performance on held-out data. | TRUE | | | | |
| freq_in_epoch | int | Run validation every N epochs. Takes priority over the step-based frequency if set to a positive value. | 10 | 1 | inf | | |
PolicyConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_name_or_path | string | HuggingFace model identifier (e.g., nvidia/Cosmos-Reason1-7B) or local model path. | nvidia/Cosmos-Reason1-7B | | | | |
| model_max_length | int | Maximum sequence length in tokens. Sequences longer than this are truncated. Limited by the model's positional encoding. | 4096 | 1 | inf | | |
| model_gradient_checkpointing | bool | Trade compute for memory; recompute activations during the backward pass instead of storing them. Reduces memory requirements ~40% but increases training time ~20%. | TRUE | | | | |
| parallelism | collection | Policy parallelism config. | | | | | FALSE |
| lora | collection | LoRA config. | | | | | FALSE |
LoggingConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| logger | list | Logging backends to enable. Options: console, tao. | ['console', 'tao'] | | | ['console', 'tao'] | FALSE |
| project_name | string | Project name used to organize experiments in logging backends like Weights & Biases. | cosmos-rl | | | | |
| experiment_name | string | Unique name for this training run, used for tracking and organizing results across logging backends. | cosmos-rl | | | | |
CustomConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset | collection | Dataset config. | | | | | FALSE |
| vision | collection | Vision config. | | | | | FALSE |
TrainCheckpointConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable_checkpoint | bool | Indicates whether to save model checkpoints during training for resuming or model deployment. | TRUE | | | | |
| save_freq_in_epoch | int | Save a checkpoint every N epochs. Takes priority over the step-based frequency if set to a positive value. | 10 | 1 | inf | | |
| save_mode | categorical | Checkpoint saving mode: sync or async. | sync | | | async,sync | |
| max_keep | int | Maximum number of checkpoints to keep. Older checkpoints are automatically deleted. Set to -1 to keep all checkpoints. | 8 | -1 | inf | | |
| export_safetensors | bool | Export checkpoints in HuggingFace SafeTensors format for easy model sharing and deployment. | TRUE | | | | |
TrainPolicyConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Training policy type: sft (supervised fine-tuning). | sft | | | sft | |
| mini_batch | int | Mini-batch size for gradient accumulation. The global batch is split into mini-batches to reduce memory usage. | 4 | 1 | inf | | |
| enable_dataset_cache | bool | Cache preprocessed dataset samples to disk for faster data loading across training runs. | TRUE | | | | |
| dataloader_num_workers | int | Number of parallel worker processes for data loading and preprocessing. Set to 0 for single-threaded loading. | 8 | 0 | inf | | |
| dataloader_prefetch_factor | int | Number of batches loaded in advance per worker. Higher values improve throughput but increase memory usage. | 8 | 1 | inf | | |
| conversation_column_name | string | Name of the dataset column containing conversation data (list of messages with roles and content). | conversations | | | | |
| dataset | collection | Dataset config. | | | | | FALSE |
TrainFP8Config Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable_fp8 | bool | Enable FP8 (8-bit floating point) training for 2x memory reduction and faster training on supported GPUs (H100, H200). | FALSE | | | | |
| fp8_recipe | categorical | FP8 scaling strategy: dynamic_scaling or delayed_scaling. | dynamic_scaling | | | dynamic_scaling,delayed_scaling | |
| quant_recipe | categorical | Quantization granularity: rowwise or tensorwise. | rowwise | | | rowwise,tensorwise | |
PolicyParallelismConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| n_init_replicas | int | Number of model replicas to initialize. Used for advanced multi-model training setups. | 1 | 1 | inf | | |
| tp_size | int | Tensor Parallel size: splits each layer across N GPUs. Use for models too large for a single GPU. Must be a factor of the total number of GPUs. | 1 | 1 | inf | | |
| cp_size | int | Context Parallel size: splits long sequences across N GPUs. Enables training with sequences longer than single-GPU memory allows. | 1 | 1 | inf | | |
| dp_shard_size | int | Data Parallel Shard size (FSDP): shards model parameters across N GPUs. Reduces per-GPU memory. Must multiply with the other dimensions to equal the total number of GPUs. | 1 | 1 | inf | | |
| dp_replicate_size | int | Data Parallel Replicate size: replicates the full model across N GPU groups. Increases throughput by processing different batches in parallel. | 1 | 1 | inf | | |
| pp_size | int | Pipeline Parallel size: splits model layers across N GPUs. Enables training very deep models. Uses a 1F1B schedule for efficiency. | 1 | 1 | inf | | |
| cp_rotate_method | categorical | Context Parallel communication pattern: allgather or p2p. | allgather | | | allgather,p2p | |
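These dimensions compose multiplicatively: per the field descriptions above, the product of tp_size, cp_size, dp_shard_size, dp_replicate_size, and pp_size must equal the total number of GPUs in the job. A quick sanity check (hypothetical helper, not part of the tao-client API):

```python
# Hypothetical sanity check for the parallelism fields above; not part of
# the tao-client API.
def check_parallelism(total_gpus: int, tp: int = 1, cp: int = 1,
                      dp_shard: int = 1, dp_replicate: int = 1, pp: int = 1) -> bool:
    product = tp * cp * dp_shard * dp_replicate * pp
    if product != total_gpus:
        raise ValueError(
            f"parallelism product {product} does not match {total_gpus} GPUs")
    return True

# 16 GPUs (2 nodes x 8): tensor-parallel 2, FSDP shard 4, replicate 2
check_parallelism(16, tp=2, dp_shard=4, dp_replicate=2)
```

Running a check like this before submitting a job avoids a failed launch when the dimensions do not multiply out to the GPU count.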
LoraConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| r | int | LoRA rank: dimensionality of the low-rank adaptation matrices. Higher values increase model capacity but require more memory (must be a power of 2). | 8 | 1 | 256 | | TRUE |
| lora_alpha | int | LoRA scaling factor: controls the magnitude of LoRA updates. Typically set equal to rank r (must be a power of 2). | 8 | 1 | 1024 | | TRUE |
| lora_dropout | float | Dropout probability applied to LoRA layers for regularization. Set to 0.0 to disable dropout. | 0.0 | 0.0 | 0.1 | | TRUE |
| target_modules | subset_list | Transformer layers to apply LoRA adaptation to: q/k/v/o_proj (attention), up/gate/down_proj (MLP). | ["q_proj", "v_proj"] | | | q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj,down_proj | TRUE |
| use_rslora | bool | Use Rank-Stabilized LoRA with improved scaling (lora_alpha/sqrt(r) instead of lora_alpha/r). Provides better training stability and performance for higher ranks. | FALSE | | | | |
| modules_to_save | optional_list | Additional non-LoRA modules to fine-tune fully. Set to ["visual"] to also fully fine-tune the vision modules. | [] | | | visual | TRUE |
| init_lora_weights | union | Specifies how to initialize the weights of the adapter layers. Pass TRUE (the default) to use the reference initialization from Microsoft, with the LoRA B weight set to 0, so that without further training the adapter is a no-op. Pass FALSE to use random initialization of LoRA A and B, so the adapter is not a no-op before training; this setting is intended for debugging. Additional string options select alternative initialization schemes. | TRUE | | | TRUE, FALSE, … | |
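To get a feel for what the rank buys: each adapted weight matrix of shape (d_out, d_in) gains two trainable matrices, A (r × d_in) and B (d_out × r), so the trainable parameter count grows linearly with r. A back-of-the-envelope count for the default q_proj/v_proj targets (the layer shapes here are illustrative, not the actual Cosmos-Reason1-7B dimensions):

```python
# Back-of-the-envelope LoRA parameter count; shapes are illustrative only.
def lora_param_count(layer_shapes, r: int) -> int:
    """Trainable parameters added by LoRA: r*(d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in layer_shapes)

# Illustrative: 28 transformer layers, q_proj and v_proj both 3584x3584.
shapes = [(3584, 3584)] * (28 * 2)
print(lora_param_count(shapes, r=8))  # 3211264
```

A few million trainable parameters against a 7B-parameter base model is why LoRA fine-tuning fits in far less memory than full fine-tuning.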
DatasetConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| annotation_path | string | Path to the JSON file containing training annotations with conversations and media references. | data/sft/annotations.json | | | | |
| media_path | string | Folder containing image and video media files referenced in the annotation file. | data/sft/train2017 | | | | |
| system_prompt | string | System instruction that provides context for the model's behavior and role in conversations. | | | | | |
VisionConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| fps | int | Video sampling rate in frames per second for vision-language models. Higher FPS captures more temporal information but increases memory usage. | 1 | 1 | 3 | | TRUE |
| total_pixels | int | Target resolution for vision inputs (width × height). Images and videos are resized to this total pixel count while maintaining the aspect ratio. | 313600 | 1 | inf | | |
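The total_pixels budget is applied by scaling both sides of the input by the same factor, so the aspect ratio is preserved while the area comes in at or under the budget. A sketch of that arithmetic (the real preprocessor may also round dimensions to patch-size multiples):

```python
import math

# Sketch of aspect-preserving resizing under a pixel budget; the actual
# preprocessor may also round dimensions to patch-size multiples.
def fit_to_pixel_budget(width: int, height: int, total_pixels: int):
    """Scale (width, height) uniformly so that area <= total_pixels."""
    scale = math.sqrt(total_pixels / (width * height))
    if scale >= 1.0:            # already within budget, leave unchanged
        return width, height
    return int(width * scale), int(height * scale)

print(fit_to_pixel_budget(1920, 1080, 300000))  # (730, 410)
```

This is why raising total_pixels increases visual detail at the cost of more vision tokens and memory: the budget directly bounds the resized area.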
TrainPolicyDatasetConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| name | string | HuggingFace dataset name or local path to the training dataset. | its | | | | |
| test_size | union | Size of the test set. If a float, the ratio (between 0.0 and 1.0) of the dataset; if an int, the absolute number of samples. | None | 0.0 | inf | | |
Training the Model#
Use the following command to run Cosmos-RL training:
TRAIN_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
The required arguments are the experiment ID and specs:
--id: The experiment ID to run training on
--specs: The training specifications
Optional arguments:
You can set optional arguments to override the option values in the experiment specification file.
--action: The action to perform (train)
--parent_job_id: Parent job ID for chaining jobs
Multi-Node Training with FTMS#
Distributed training is supported through FTMS. For large models, multi-node clusters can significantly reduce training time.
Verify that your cluster has multiple GPU-enabled nodes available for training by running this command:
kubectl get nodes -o wide
The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to get more nodes added to your cluster.
To run a multi-node training job through FTMS, modify these fields in the training job specification:
{
"train": {
"num_gpus": 8, // Number of GPUs per node
"num_nodes": 2 // Number of nodes to use for training
}
}
If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
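Since $SPECS holds the JSON specification returned by get-spec, one way to set these fields is to patch the document before submitting the job. A hedged sketch in Python (the inline spec text here is a stand-in for the real $SPECS content):

```python
import json

spec_text = '{"train": {"epoch": 10}}'   # stands in for $SPECS from get-spec
specs = json.loads(spec_text)

# Request 8 GPUs on each of 2 nodes for this training job.
specs.setdefault("train", {}).update({"num_gpus": 8, "num_nodes": 2})

print(json.dumps(specs, sort_keys=True))
# {"train": {"epoch": 10, "num_gpus": 8, "num_nodes": 2}}
```

The patched JSON can then be passed back to experiment-run-action via --specs.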
The latest checkpoint is saved automatically based on the checkpoint configuration.
Training automatically resumes from the latest checkpoint if train.resume is set to true.
Evaluating the Model#
Creating an Evaluation Specification File#
EVAL_SPECS=$(tao-client cosmos-rl get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)
The evaluation experiment specification file for Cosmos-RL includes evaluate parameters for comprehensive model assessment.
Here is an example specification file for evaluating a Cosmos-RL model:
evaluate:
  dataset:
    annotation_path: "path/to/eval_annotations.json"
    media_dir: "path/to/eval_media/"
    system_prompt: "You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs."
  model:
    model_name: "nvidia/Cosmos-Reason1-7B"
    save_folder: "cr1_1_zero_shot"
    tokenizer_model_name: "qwen2.5-vl-7b"
    dtype: "bfloat16"
    tp_size: 1
    max_length: 128000
    enable_lora: false
    base_model_path: ""
  evaluation:
    answer_type: "freeform"
    num_processes: 40
    skip_saved: false
    seed: 1
    limit: -1
    total_shard: 1
    shard_id: 0
  vision:
    fps: 4
    total_pixels: 3136000
  generation:
    max_retries: 10
    max_tokens: 1024
    temperature: 0
    repetition_penalty: 1
    presence_penalty: 0
    frequency_penalty: 0
  results:
    save_individual_results: true
    save_confusion_matrix: true
    save_metrics_summary: true
results_dir: "/results"
Evaluation Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL evaluation.
ExperimentConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Root folder for saving all evaluation outputs, including predictions, metrics, and visualizations. | /results | | | | |
| evaluate | collection | Evaluation configuration. | | | | | FALSE |
EvaluateConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset | collection | Dataset configuration for evaluation. | | | | | FALSE |
| model | collection | Model configuration. | | | | | FALSE |
| evaluation | collection | Evaluation parameters. | | | | | FALSE |
| vision | collection | Vision processing configuration. | | | | | FALSE |
| generation | collection | Generation parameters. | | | | | FALSE |
| results | collection | Results and output configuration. | | | | | FALSE |
DatasetConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| annotation_path | string | Path to the JSON file with evaluation samples containing questions, ground-truth answers, and media references. | | | | | |
| media_dir | string | Optional folder containing image and video files. Leave empty if media paths in the annotations are absolute or relative to the current folder. | | | | | |
| system_prompt | string | System instruction prepended to all evaluation prompts to provide context about the task and expected behavior. | You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs. | | | | |
ModelConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_name | string | HuggingFace model ID, local model path, or path to a safetensors checkpoint folder for evaluation. | nvidia/Cosmos-Reason1-7B | | | | |
| save_folder | string | Subfolder name within the results folder where this model's outputs are saved. | cr1_1_zero_shot | | | | |
| tokenizer_model_name | string | Tokenizer to use for text processing. | qwen2.5-vl-7b | | | | |
| dtype | string | Precision for model weights during inference. | bfloat16 | | | | |
| tp_size | int | Number of GPUs for Tensor Parallelism. Splits each layer across GPUs for larger models. Set to 1 for single-GPU inference. | 1 | 1 | 8 | | |
| max_length | int | Maximum total sequence length (prompt + response) in tokens. Must not exceed the model's context window. | 128000 | 1024 | 1000000 | | |
| enable_lora | bool | Specifies whether to merge LoRA adapter weights into the base model before evaluation. Required when evaluating LoRA fine-tuned models. | False | | | | |
| base_model_path | string | Path to the base pretrained model. Required when enable_lora is true. | | | | | |
EvaluationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| answer_type | string | Expected answer format. | freeform | | | | |
| num_processes | int | Number of parallel worker processes for concurrent evaluation. Higher values speed up evaluation but increase memory usage. | 40 | 1 | 128 | | |
| skip_saved | bool | Skip re-evaluating samples that already have saved results. Useful for resuming interrupted evaluations. | FALSE | | | | |
| seed | int | Random seed for deterministic sampling and generation. Use the same seed for reproducible results. | 1 | 0 | 999999 | | |
| limit | int | Maximum number of samples to evaluate. Set to -1 for the full dataset or a positive integer for quick testing or debugging. | -1 | -1 | 999999 | | |
| total_shard | int | Split evaluation across N shards for distributed processing across multiple machines or jobs. | 1 | 1 | 64 | | |
| shard_id | int | Current shard identifier (0-indexed). Each shard processes a disjoint subset of the evaluation data. | 0 | 0 | 63 | | |
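With total_shard and shard_id, each job evaluates a disjoint slice of the dataset. A strided split is the simplest scheme consistent with the description above (the actual partitioning inside Cosmos-RL may differ):

```python
# Strided sharding sketch; the actual partitioning inside Cosmos-RL may differ.
def shard(samples, shard_id: int, total_shard: int):
    """Strided partition: shard k takes samples k, k+N, k+2N, ..."""
    return samples[shard_id::total_shard]

data = list(range(10))
parts = [shard(data, k, 4) for k in range(4)]
print(parts)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# Shards are disjoint and together cover the whole dataset:
assert sorted(sum(parts, [])) == data
```

Launching one evaluation job per shard_id in 0..total_shard-1 then covers every sample exactly once.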
VisionConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| fps | int | Downsample video to this frame rate for vision processing. Higher FPS provides more temporal detail but increases compute time. | 4 | 1 | 30 | | |
| total_pixels | int | Target resolution for vision inputs (width × height). Images and videos are resized to this pixel count while preserving the aspect ratio. | 3136000 | 100000 | 10000000 | | |
GenerationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| max_retries | int | Maximum retry attempts for failed generations due to errors or timeouts. Useful for handling transient failures. | 10 | 0 | 50 | | |
| max_tokens | int | Maximum number of new tokens to generate per response. Longer limits allow detailed answers but increase latency. | 1024 | 1 | 8192 | | |
| temperature | float | Sampling temperature: 0.0 for deterministic greedy decoding, higher values (0.7-1.0) for more creative, diverse outputs. | 0 | 0 | 2 | | |
| repetition_penalty | float | Penalty for repeating tokens. Values > 1.0 discourage repetition, 1.0 means no penalty, and values < 1.0 encourage repetition. | 1 | 0.1 | 2 | | |
| presence_penalty | float | Penalty for tokens that already appear in the sequence. Positive values promote diversity; negative values allow repetition. | 0 | -2 | 2 | | |
| frequency_penalty | float | Penalty proportional to token frequency in the sequence. Positive values reduce repetitive patterns; negative values allow them. | 0 | -2 | 2 | | |
ResultsConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| save_individual_results | bool | Indicates whether to save an individual JSON file for each sample with question, prediction, ground truth, and metadata for detailed analysis. | TRUE | | | | |
| save_confusion_matrix | bool | Indicates whether to generate and save a confusion-matrix visualization showing the prediction vs. ground-truth distribution (for classification tasks). | TRUE | | | | |
| save_metrics_summary | bool | Indicates whether to save an aggregated metrics-summary JSON with accuracy, F1, precision, recall, and other evaluation statistics. | TRUE | | | | |
Running Evaluation#
To run evaluation with a Cosmos-RL model, use this command:
EVAL_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$EVAL_SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
--id: Experiment ID to run evaluation on
--parent_job_id: Training job ID to use the trained model from
--specs: Evaluation specifications
Optional arguments:
--action: Action to perform (evaluate)
Running Inference with a Cosmos-RL Model#
Creating an Inference Specification File#
INFERENCE_SPECS=$(tao-client cosmos-rl get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)
The inference experiment specification file for Cosmos-RL includes inference parameters for generating responses to visual content.
Here is an example specification file for running inference with a Cosmos-RL model:
inference:
  media: "path/to/video.mp4"
  prompt: "Describe this video."
  fps: 4
  total_pixels: 6422528
  max_new_tokens: 4096
Inference Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL inference.
ExperimentConfig Fields (Inference)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| inference | collection | Inference config | | | | | FALSE |
InferenceConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| media | string | Path to the input image or video file for inference. Supports common formats (JPG, PNG, MP4, AVI, etc.). | | | | | |
| prompt | string | Text prompt or question to ask about the media. The model responds based on the visual content and this instruction. | Describe this video. | | | | |
| fps | int | Video frame sampling rate. Higher FPS provides more temporal information but increases memory and latency. | 4 | | | | |
| total_pixels | int | Target resolution for vision input (width × height). The image or video is resized to this pixel count while maintaining the aspect ratio. | 6422528 | | | | |
| max_new_tokens | int | Maximum number of tokens to generate in the response. Higher values allow longer, more detailed answers. | 4096 | | | | |
Running Inference#
The inference tool for Cosmos-RL models can be used to generate text responses based on video content.
INFERENCE_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$INFERENCE_SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
--id: Experiment ID to run inference on
--parent_job_id: Training job ID to use the trained model from
--specs: Inference specifications
Optional arguments:
--action: Action to perform (inference)
AutoML Support#
Cosmos-RL supports AutoML optimization for the following hyperparameters:
Learning Rate (optm_lr): Automatically optimized learning rate schedules
Training Epochs (epoch): Optimal number of training epochs
Optimizer Selection (optm_name): Choice between AdamW and Adam optimizers
Optimizer Betas (optm_betas): Beta coefficients for momentum
Warmup Epochs (optm_warmup_epochs): Learning rate warmup schedule
LoRA Configuration: Rank (r), alpha (lora_alpha), dropout (lora_dropout), target modules, and modules to save
Vision Processing: FPS sampling rate for video processing
To enable AutoML, configure the experiment with AutoML parameters:
automl_information = {
"automl_enabled": True,
"automl_algorithm": "bayesian",
"automl_max_recommendations": 2,
"automl_hyperparameters": automl_params
}
Performance Considerations#
Training performance:
Training time varies based on dataset size and hardware configuration.
Multi-GPU training significantly reduces training time.
AutoML experiments may require multiple training runs.
Use compile: true for PyTorch 2.0 optimization (increases initial compilation time).
Memory optimization:
Use FP8 precision (fp8.enable_fp8: true) for memory-efficient training on H100/H200 GPUs.
Enable gradient checkpointing (model_gradient_checkpointing: true) to trade compute for memory.
Adjust batch sizes (train_batch_per_replica) based on available GPU memory.
Use FSDP offloading (fsdp_offload: true) to reduce GPU memory usage.
Consider model sharding using parallelism configurations.
Storage requirements:
Video datasets require significant storage space.
Use compressed formats (tar.gz) for efficient storage.
Enable dataset caching (enable_dataset_cache: true) for faster data loading.
Consider cloud storage for large-scale datasets.
Troubleshooting#
Common issues:
Out of Memory: Reduce train_batch_per_replica, enable FP8 precision, or use gradient checkpointing.
Dataset Format Errors: Ensure annotations follow the LLaVA format exactly.
Training Convergence: Adjust learning rates, use warmup epochs, or enable AutoML optimization.
Inference Errors: Verify model checkpoints and input formats.
Slow Data Loading: Increase dataloader_num_workers and dataloader_prefetch_factor.
Best practices:
Start with smaller datasets for initial experimentation.
Use AutoML for optimal hyperparameter selection.
Monitor training metrics regularly through logging backends.
Validate model performance on held-out test sets.
Use appropriate parallelism configurations for your hardware setup.
Enable LoRA for parameter-efficient fine-tuning.
Use mixed precision training (param_dtype: bfloat16) for better performance.
For additional support and troubleshooting, refer to the TAO Toolkit troubleshooting guide.