Cosmos-Reason#

Cosmos-Reason is a state-of-the-art video-language model included in TAO Toolkit. It supports the following tasks:

  • train

  • evaluate

  • inference

You can invoke these tasks from the FTMS client using the following convention:

tao-client cosmos-rl <sub_task> <args_per_subtask>

Here, <args_per_subtask> denotes the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.

Note

Cosmos-RL is currently available only through the TAO Toolkit API and tao-client interfaces. There is no launcher-based interface for VLM models.

Hardware Requirements#

Minimum requirements:

  • GPUs: 8x A100 GPUs with at least 80 GB GPU memory

  • Storage: Minimum 200 GB of free disk space (each Cosmos-RL checkpoint when written to disk is ~150 GB)

  • OS: Ubuntu 22.04+

  • Driver: NVIDIA Driver 570

  • CUDA: CUDA 12.8

Recommended configuration for optimal performance:

  • Multi-node training for large-scale datasets

  • High-bandwidth storage system for efficient video data access

  • Multiple CPU cores for parallel data preprocessing

Data Input for Cosmos-RL#

Cosmos-RL expects datasets in the LLaVA format with the following structure:

dataset_folder/
   images.tar.gz or videos.tar.gz    (Video frames or image sequences)
   annotations.json                  (Text annotations in JSON format)
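One way to assemble this layout from local files is sketched below; every path and file name here is a placeholder, not something the toolkit mandates.

```shell
# Sketch: build the dataset_folder layout Cosmos-RL expects (placeholder files).
mkdir -p my_media/images dataset_folder
printf 'stub' > my_media/images/000001.png   # stand-in for a real frame
echo '[]' > annotations.json                 # stand-in for real annotations
tar -czf dataset_folder/images.tar.gz -C my_media images
cp annotations.json dataset_folder/annotations.json
```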

Data format specifications:

  • Dataset Type: vlm (Vision-Language Model)

  • Format: llava

  • Supported Intents: training, evaluation, testing

Annotation format:

The annotations should follow the LLaVA conversation format:

{
  "id": "d460df3a29cc7d208d4d588c63e83579",
  "images": [
    "images/001354.png",
    "som_images/001354.d460df3a29cc7d208d4d588c63e83579.png"
  ],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nThe first image is the original, and the second is an overlay. Bright numeric IDs are labeled at the center of certain visual objects in the second image.\nBased on pallet positions in Region [0] Region [1] Region [2] Region [3] Region [4] Region [5] Region [6] Region [7] Region [8] Region [9], which one should the transporter at Region [10] retrieve?\nPlease answer with only the integer number of the correct region the number should be one that is both shown in the image and mentioned in this question. Do not include any explanation or extra text."
    },
    {
      "from": "gpt",
      "value": "3"
    }
  ],
  "category": "mcq",
  "normalized_answer": "3"
}
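A quick sanity check on annotation entries can catch format errors before training. The sketch below derives its required keys from the example above, not from an official schema, so treat it as a starting point.

```python
# Minimal sketch of a LLaVA-format sanity check; required keys are inferred
# from the annotation example above, not from an official schema.
def check_llava_entry(entry: dict) -> bool:
    conversations = entry.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        return False
    return all(
        msg.get("from") in ("human", "gpt") and isinstance(msg.get("value"), str)
        for msg in conversations
    )

entry = {
    "id": "abc",
    "images": ["images/001354.png"],
    "conversations": [
        {"from": "human", "value": "<image>\nWhich region?"},
        {"from": "gpt", "value": "3"},
    ],
}
print(check_llava_entry(entry))  # True
```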

Creating a Training Specification File#

SPECS=$(tao-client cosmos-rl get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

The training specification file for Cosmos-RL includes train, validation, policy, logging, and custom parameters. Here is an example specification file for training a Cosmos-RL model:

train:
  resume: false
  epoch: 10
  compile: false
  train_batch_per_replica: 1
  output_dir: "output"
  optm_lr: 1e-6
  optm_impl: "foreach"
  optm_weight_decay: 0.01
  optm_min_lr_factor: 0.0
  optm_grad_norm_clip: 1.0
  epsilon: 1e-8
  optm_name: "AdamW"
  optm_betas: [0.9, 0.999]
  optm_warmup_epochs: 0
  async_tp_enabled: false
  master_dtype: "float32"
  param_dtype: "bfloat16"
  fsdp_reduce_dtype: "float32"
  fsdp_offload: false
  fsdp_reshard_after_forward: "default"
  sync_weight_interval: 1
  ckpt:
    enable_checkpoint: true
    save_freq_in_epoch: 10
    save_mode: "sync"
    max_keep: 8
    export_safetensors: true
  train_policy:
    type: "sft"
    mini_batch: 4
    enable_dataset_cache: true
    dataloader_num_workers: 8
    dataloader_prefetch_factor: 8
    conversation_column_name: "conversations"
    dataset:
      name: "its"
      test_size: 1
  fp8:
    enable_fp8: false
    fp8_recipe: "dynamic_scaling"
    quant_recipe: "rowwise"

validation:
  enable: true
  freq_in_epoch: 10

policy:
  model_name_or_path: "nvidia/Cosmos-Reason1-7B"
  model_max_length: 4096
  model_gradient_checkpointing: true
  parallelism:
    n_init_replicas: 1
    tp_size: 1
    cp_size: 1
    dp_shard_size: 1
    dp_replicate_size: 1
    pp_size: 1
    cp_rotate_method: "allgather"
  lora:
    r: 8
    lora_alpha: 8
    lora_dropout: 0.0
    target_modules: ["q_proj", "v_proj"]
    use_rslora: false
    modules_to_save: []
    init_lora_weights: true

logging:
  logger: ["console", "tao"]
  project_name: "cosmos-rl"
  experiment_name: "cosmos-rl"

custom:
  dataset:
    annotation_path: "data/sft/annotations.json"
    media_path: "data/sft/train2017"
    system_prompt: ""
  vision:
    fps: 1
    total_pixels: 313600

redis: "12800"
results_dir: "/results"

Training Configuration Parameters#

The following sections detail all available configuration parameters for Cosmos-RL training, organized by configuration group.

ExperimentConfig Fields#

  • train (collection): Train configuration. AutoML: disabled.

  • validation (collection): Validation configuration. AutoML: disabled.

  • policy (collection): Policy configuration. AutoML: disabled.

  • logging (collection): Logging configuration. AutoML: disabled.

  • redis (string): Redis port for distributed training coordination and interprocess communication in multi-node setups. Default: 12800.

  • results_dir (string): Root folder for all training outputs, including checkpoints, logs, and evaluation results. Default: /results.

  • custom (collection): Custom configuration. AutoML: disabled.

TrainConfig Fields#

  • resume (bool): Resume training from the latest checkpoint in output_dir, restoring model weights, optimizer state, and training progress. Default: FALSE.

  • epoch (int): Total number of training epochs (complete passes through the dataset). Default: 10. Valid range: 10 to 20. AutoML: enabled.

  • compile (bool): Use PyTorch 2.0 torch.compile() for optimized execution. Improves throughput but increases initial compilation time. Default: FALSE.

  • train_batch_per_replica (int): Batch size per GPU replica. Global batch size = train_batch_per_replica × num_gpus ÷ dp_replicate_size. Default: 1. Valid range: 1 to inf.

  • output_dir (string): Folder for saving checkpoints, logs, and training artifacts. Default: output.

  • optm_lr (float): Peak learning rate for the optimizer. The actual LR follows the warmup and cosine decay schedule. Default: 1e-06. Valid range: 0 to inf. AutoML: enabled.

  • optm_impl (categorical): Optimizer implementation: fused (fastest, CUDA only), foreach (vectorized), or for-loop (most compatible). See the PyTorch documentation for details. Default: foreach. Valid options: fused, foreach, for-loop.

  • optm_weight_decay (float): L2 regularization coefficient (weight decay) to prevent overfitting. Applied to all parameters except biases and norms. Default: 0.01. Valid range: 0 to inf.

  • optm_min_lr_factor (float): Minimum learning rate as a fraction of the peak LR. For cosine annealing: min_lr = optm_lr × optm_min_lr_factor. Default: 0.0. Valid range: 0 to inf.

  • optm_grad_norm_clip (float): Maximum gradient norm for clipping. Prevents exploding gradients. Set to 0 or a negative value to disable clipping. Default: 1.0. Valid range: 0 to inf.

  • epsilon (float): Small constant added to the denominator for numerical stability in the Adam/AdamW optimizer. Default: 1e-08. Valid range: 0 to inf.

  • optm_name (categorical): Optimizer algorithm: AdamW (Adam with decoupled weight decay, recommended) or Adam (original). Default: AdamW. Valid options: AdamW, Adam. AutoML: enabled.

  • optm_betas (list_2): Beta coefficients for Adam/AdamW: [beta1, beta2] for exponential moving averages of the gradient and squared gradient. Default: [0.9, 0.999]. AutoML: enabled.

  • optm_warmup_epochs (union): Number of epochs for linear learning rate warmup, from 0 to optm_lr. Helps stabilize training (recommended: epochs/10). Default: 0. Valid range: 0 to inf. AutoML: enabled.

  • async_tp_enabled (bool): Enable asynchronous Tensor Parallel communication to overlap computation and communication for better throughput. Default: FALSE.

  • master_dtype (categorical): Data type for master weights in optimizer states. Higher precision prevents accumulated rounding errors. Default: float32. Valid options: float32, float16, bfloat16.

  • param_dtype (categorical): Data type for model parameters and activations during training. bfloat16 is recommended for stability and memory efficiency. Default: bfloat16. Valid options: float32, float16, bfloat16.

  • fsdp_reduce_dtype (categorical): Data type for gradient all-reduce in Fully Sharded Data Parallel. float32 provides better numerical stability. Default: float32. Valid options: float32, float16, bfloat16.

  • fsdp_offload (bool): Offload FSDP parameters to CPU memory when not in use. Reduces GPU memory but increases overhead. Default: FALSE.

  • fsdp_reshard_after_forward (categorical): Reshard parameters after the forward pass: true saves memory, false improves speed, default uses the PyTorch heuristic. Default: default. Valid options: default, true, false.

  • sync_weight_interval (int): Interval in steps for synchronizing weights across data parallel replicas. Higher values reduce communication overhead. Default: 1. Valid range: 1 to inf.

  • ckpt (collection): Train checkpoint configuration. AutoML: disabled.

  • train_policy (collection): Train policy configuration. AutoML: disabled.

  • fp8 (collection): Train FP8 configuration. AutoML: disabled.
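As a quick arithmetic check of the global batch size formula given in the train_batch_per_replica description (all values here are illustrative):

```python
# Illustrative values: global batch size per the train_batch_per_replica formula.
train_batch_per_replica = 1
num_gpus = 8
dp_replicate_size = 1

global_batch = train_batch_per_replica * num_gpus // dp_replicate_size
print(global_batch)  # 8
```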

ValidationConfig Fields#

  • enable (bool): Run validation during training to monitor model performance on held-out data. Default: TRUE.

  • freq_in_epoch (int): Run validation every N epochs. Takes priority over the step-based freq setting if set to a positive value. Default: 10. Valid range: 1 to inf.

PolicyConfig Fields#

  • model_name_or_path (string): HuggingFace model identifier (e.g., nvidia/Cosmos-Reason1-7B) or local path to a pretrained model folder. Default: nvidia/Cosmos-Reason1-7B.

  • model_max_length (int): Maximum sequence length in tokens. Longer sequences are truncated. Limited by the model's positional encoding. Default: 4096. Valid range: 1 to inf.

  • model_gradient_checkpointing (bool): Trade compute for memory by recomputing activations during the backward pass instead of storing them. Reduces memory requirements by ~40% but increases training time by ~20%. Default: TRUE.

  • parallelism (collection): Policy parallelism configuration. AutoML: disabled.

  • lora (collection): LoRA configuration. AutoML: disabled.

LoggingConfig Fields#

  • logger (list): Logging backends to enable: console for stdout, tao for TAO Toolkit status logging, wandb for Weights & Biases. Default: ['console', 'tao']. Valid options: ['console', 'tao']. AutoML: disabled.

  • project_name (string): Project name used to organize experiments in logging backends such as Weights & Biases. Default: cosmos-rl.

  • experiment_name (string): Unique name for this training run, used for tracking and organizing results across logging backends. Default: cosmos-rl.

CustomConfig Fields#

  • dataset (collection): Dataset configuration. AutoML: disabled.

  • vision (collection): Vision configuration. AutoML: disabled.

TrainCheckpointConfig Fields#

  • enable_checkpoint (bool): Save model checkpoints during training for resuming or model deployment. Default: TRUE.

  • save_freq_in_epoch (int): Save a checkpoint every N epochs. Takes priority over the step-based frequency if set to a positive value. Default: 10. Valid range: 1 to inf.

  • save_mode (categorical): Checkpoint saving mode: sync blocks training until the save completes; async saves in the background. Default: sync. Valid options: async, sync.

  • max_keep (int): Maximum number of checkpoints to keep. Older checkpoints are automatically deleted. Set to -1 to keep all checkpoints. Default: 8. Valid range: -1 to inf.

  • export_safetensors (bool): Export checkpoints in HuggingFace SafeTensors format for easy model sharing and deployment. Default: TRUE.

TrainPolicyConfig Fields#

  • type (categorical): Training policy type: sft for Supervised Fine-Tuning or grpo for Group Relative Policy Optimization. Default: sft. Valid options: sft.

  • mini_batch (int): Mini-batch size for gradient accumulation. The global batch is split into mini-batches to reduce memory usage. Default: 4. Valid range: 1 to inf.

  • enable_dataset_cache (bool): Cache preprocessed dataset samples to disk for faster data loading across training runs. Default: TRUE.

  • dataloader_num_workers (int): Number of parallel worker processes for data loading and preprocessing. Set to 0 for single-threaded loading. Default: 8. Valid range: 0 to inf.

  • dataloader_prefetch_factor (int): Number of batches loaded in advance per worker. Higher values improve throughput but increase memory usage. Default: 8. Valid range: 1 to inf.

  • conversation_column_name (string): Name of the dataset column containing conversation data (a list of messages with roles and content). Default: conversations.

  • dataset (collection): Dataset configuration. AutoML: disabled.

TrainFP8Config Fields#

  • enable_fp8 (bool): Enable FP8 (8-bit floating point) training for 2x memory reduction and faster training on supported GPUs (H100, H200). Default: FALSE.

  • fp8_recipe (categorical): FP8 scaling strategy: dynamic_scaling adjusts scale factors per iteration; delayed_scaling updates them periodically for stability. Default: dynamic_scaling. Valid options: dynamic_scaling, delayed_scaling.

  • quant_recipe (categorical): Quantization granularity: rowwise computes a scale per row (better accuracy); tensorwise uses a single scale per tensor (faster). Default: rowwise. Valid options: rowwise, tensorwise.

PolicyParallelismConfig Fields#

  • n_init_replicas (int): Number of model replicas to initialize. Used for advanced multi-model training setups. Default: 1. Valid range: 1 to inf.

  • tp_size (int): Tensor Parallel size: splits each layer across N GPUs. Use for models too large for a single GPU. Must be a factor of the total number of GPUs. Default: 1. Valid range: 1 to inf.

  • cp_size (int): Context Parallel size: splits long sequences across N GPUs, enabling training with sequences longer than a single GPU's memory allows. Default: 1. Valid range: 1 to inf.

  • dp_shard_size (int): Data Parallel Shard size (FSDP): shards model parameters across N GPUs to reduce per-GPU memory. Must multiply with the other parallelism dimensions to equal the total number of GPUs. Default: 1. Valid range: 1 to inf.

  • dp_replicate_size (int): Data Parallel Replicate size: replicates the full model across N GPU groups, increasing throughput by processing different batches in parallel. Default: 1. Valid range: 1 to inf.

  • pp_size (int): Pipeline Parallel size: splits model layers across N GPUs, enabling training of very deep models. Uses the 1F1B schedule for efficiency. Default: 1. Valid range: 1 to inf.

  • cp_rotate_method (categorical): Context Parallel communication pattern: allgather (higher bandwidth) or p2p (point-to-point, lower memory). Default: allgather. Valid options: allgather, p2p.

LoraConfig Fields#

  • r (int): LoRA rank: dimensionality of the low-rank adaptation matrices. Higher values increase model capacity but require more memory. Must be a power of 2. Default: 8. Valid range: 1 to 256. AutoML: enabled.

  • lora_alpha (int): LoRA scaling factor: controls the magnitude of LoRA updates. Typically set equal to the rank r. Must be a power of 2. Default: 8. Valid range: 1 to 1024. AutoML: enabled.

  • lora_dropout (float): Dropout probability applied to LoRA layers for regularization. Set to 0.0 to disable dropout. Default: 0.0. Valid range: 0.0 to 0.1. AutoML: enabled.

  • target_modules (subset_list): Transformer layers to apply LoRA adaptation to: q/k/v/o_proj (attention) and up/gate/down_proj (MLP). Use all-linear for all linear layers. Cannot include attn.qkv or attn.proj if modules_to_save contains visual. Default: [q_proj, v_proj]. Valid options: q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj, attn.qkv, attn.proj, all-linear. AutoML: enabled.

  • use_rslora (bool): Use Rank-Stabilized LoRA, which scales updates by lora_alpha/sqrt(r) instead of lora_alpha/r. Provides better training stability and performance at higher ranks. Default: FALSE.

  • modules_to_save (optional_list): Additional non-LoRA modules to fine-tune fully. Set to ['visual'] to train the vision encoder for VLMs, or leave empty to freeze all non-LoRA parameters. Default: []. Valid options: visual. AutoML: enabled.

  • init_lora_weights (union): Specifies how to initialize the weights of the adapter layers. Default: TRUE. Valid options: TRUE, FALSE, gaussian, loftq, eva, olora, pissa, pissa_niter_[number of iters].

    Pass TRUE (the default) to use the default initialization from the Microsoft reference implementation, with the LoRA B weights set to 0. Without further training, the LoRA adapter is then a no-op.

    Pass FALSE to randomly initialize both LoRA A and B, so that LoRA is not a no-op before training; this setting is intended for debugging purposes.

    Pass gaussian to use Gaussian initialization scaled by the LoRA rank for linear layers.

    Pass loftq to use LoftQ initialization.

    Pass eva to use the data-driven initialization of Explained Variance Adaptation (EVA). EVA initializes LoRA based on the SVD of layer input activations and achieves SOTA performance due to its ability to adapt to the fine-tuning data.

    Pass olora to use OLoRA initialization.

    Pass pissa to use PiSSA initialization (see https://huggingface.co/papers/2404.02948).

    Pass pissa_niter_N (where N is an integer) to use PiSSA initialization with N iterations (e.g., pissa_niter_16 for 16 iterations). More iterations may improve initialization quality.
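The effect of use_rslora on the scaling factor can be illustrated with a quick calculation (r and lora_alpha values are illustrative):

```python
# Sketch: LoRA update scaling with and without rank stabilization (rsLoRA),
# per the use_rslora description above.
import math

r, lora_alpha = 8, 8
standard_scale = lora_alpha / r            # plain LoRA: alpha / r
rslora_scale = lora_alpha / math.sqrt(r)   # rsLoRA: alpha / sqrt(r)
print(standard_scale, round(rslora_scale, 2))
```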

DatasetConfig Fields#

  • annotation_path (string): Path to the JSON file containing training annotations with conversations and media references. Default: data/sft/annotations.json.

  • media_path (string): Folder containing the image and video media files referenced in the annotation file. Default: data/sft/train2017.

  • system_prompt (string): System instruction that provides context for the model's behavior and role in conversations.

VisionConfig Fields#

  • fps (int): Video sampling rate in frames per second for vision-language models. Higher FPS captures more temporal information but increases memory usage. Default: 1. Valid range: 1 to 3. AutoML: enabled.

  • total_pixels (int): Target resolution for vision inputs (width × height). Images and videos are resized to this total pixel count while maintaining the aspect ratio. Default: 313600. Valid range: 1 to inf.
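The total_pixels budget implies a resize that preserves aspect ratio. The toolkit's exact rounding rules are not documented here; this sketch only shows the principle with an illustrative source resolution:

```python
# Sketch: resize to a total-pixel budget while keeping aspect ratio
# (rounding behavior is an assumption, not the toolkit's exact rule).
import math

total_pixels = 313600           # budget from the vision config
w, h = 1920, 1080               # illustrative source resolution
scale = math.sqrt(total_pixels / (w * h))
new_w, new_h = round(w * scale), round(h * scale)
print(new_w, new_h)  # 747 420
```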

TrainPolicyDatasetConfig Fields#

  • name (string): HuggingFace dataset name or local path to the training dataset (e.g., its, gsm8k). Default: its.

  • test_size (union): Size of the test set. A float represents the ratio (between 0.0 and 1.0) of the dataset; an int represents the absolute number of samples. Default: None. Valid range: 0.0 to inf.

Training the Model#

Use the following command to run Cosmos-RL training:

TRAIN_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Required arguments:

The required arguments are the experiment ID and the specs:

  • --id: The experiment ID to run training on

  • --specs: The training specifications

Optional arguments:

You can set optional arguments to override the option values in the experiment specification file.

  • --action: The action to perform (train)

  • --parent_job_id: Parent job ID for chaining jobs

Multi-Node Training with FTMS

Distributed training is supported through FTMS. For large models, multi-node clusters can significantly speed up training.

Verify that your cluster has multiple GPU-enabled nodes available for training by running this command:

kubectl get nodes -o wide

The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to get more nodes added to your cluster.

To run a multi-node training job through FTMS, modify these fields in the training job specification:

{
    "train": {
        "num_gpus": 8, // Number of GPUs per node
        "num_nodes": 2 // Number of nodes to use for training
    }
}

If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
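If you are modifying the specs programmatically rather than by hand, adding the multi-node fields might look like the following sketch (the surrounding spec content is a stand-in, not a full training spec):

```python
# Sketch: patch a training specs document with multi-node fields.
import json

specs = {"train": {"epoch": 10}}   # stand-in for the spec returned by get-spec
specs["train"]["num_gpus"] = 8     # GPUs per node
specs["train"]["num_nodes"] = 2    # nodes to use for training
print(json.dumps(specs))
```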

Note

The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster. The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.

The latest checkpoint is saved automatically based on the checkpoint configuration. Training automatically resumes from the latest checkpoint if train.resume is set to true.

Evaluating the Model#

Creating an Evaluation Specification File#

EVAL_SPECS=$(tao-client cosmos-rl get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)

The evaluation experiment specification file for Cosmos-RL includes evaluate parameters for comprehensive model assessment. Here is an example specification file for evaluating a Cosmos-RL model:

evaluate:
  dataset:
    annotation_path: "path/to/eval_annotations.json"
    media_dir: "path/to/eval_media/"
    system_prompt: "You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs."
  model:
    model_name: "nvidia/Cosmos-Reason1-7B"
    save_folder: "cr1_1_zero_shot"
    tokenizer_model_name: "qwen2.5-vl-7b"
    dtype: "bfloat16"
    tp_size: 1
    max_length: 128000
    enable_lora: false
    base_model_path: ""
  evaluation:
    answer_type: "freeform"
    num_processes: 40
    skip_saved: false
    seed: 1
    limit: -1
    total_shard: 1
    shard_id: 0
  vision:
    fps: 4
    total_pixels: 3136000
  generation:
    max_retries: 10
    max_tokens: 1024
    temperature: 0
    repetition_penalty: 1
    presence_penalty: 0
    frequency_penalty: 0
  results:
    save_individual_results: true
    save_confusion_matrix: true
    save_metrics_summary: true

results_dir: "/results"

Evaluation Configuration Parameters#

The following sections detail all available configuration parameters for Cosmos-RL evaluation.

ExperimentConfig Fields (Evaluation)#

  • results_dir (string): Root folder for saving all evaluation outputs, including predictions, metrics, and visualizations. Default: /results.

  • evaluate (collection): Evaluation configuration. AutoML: disabled.

EvaluateConfig Fields#

  • dataset (collection): Dataset configuration for evaluation. AutoML: disabled.

  • model (collection): Model configuration. AutoML: disabled.

  • evaluation (collection): Evaluation parameters. AutoML: disabled.

  • vision (collection): Vision processing configuration. AutoML: disabled.

  • generation (collection): Generation parameters. AutoML: disabled.

  • results (collection): Results and output configuration. AutoML: disabled.

DatasetConfig Fields (Evaluation)#

  • annotation_path (string): Path to the JSON file with evaluation samples containing questions, ground-truth answers, and media references.

  • media_dir (string): Optional folder containing image and video files. Leave empty if the media paths in the annotations are absolute or relative to the current folder.

  • system_prompt (string): System instruction prepended to all evaluation prompts to provide context about the task and expected behavior. Default: "You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs."

ModelConfig Fields (Evaluation)#

  • model_name (string): HuggingFace model ID, local model path, or path to a safetensors checkpoint folder for evaluation. Default: nvidia/Cosmos-Reason1-7B.

  • save_folder (string): Subfolder name within results_dir for saving evaluation outputs, predictions, and metrics. Default: cr1_1_zero_shot.

  • tokenizer_model_name (string): Tokenizer to use for text processing. Options: qwen2.5-vl-7b, qwen2-vl-2b, qwen2.5-vl-32b, qwen2.5-vl-72b. Default: qwen2.5-vl-7b.

  • dtype (string): Precision for model weights during inference: bfloat16 (recommended for speed) or float16 (for wider compatibility). Default: bfloat16.

  • tp_size (int): Number of GPUs for Tensor Parallelism. Splits each layer across GPUs for larger models. Set to 1 for single-GPU inference. Default: 1. Valid range: 1 to 8.

  • max_length (int): Maximum total sequence length (prompt + response) in tokens. Must not exceed the model's context window. Default: 128000. Valid range: 1024 to 1000000.

  • enable_lora (bool): Merge LoRA adapter weights into the base model before evaluation. Required when evaluating LoRA fine-tuned models. Default: FALSE.

  • base_model_path (string): Path to the base pretrained model. Required when enable_lora is TRUE so that the LoRA weights from model_name can be merged.

EvaluationConfig Fields#

  • answer_type (string): Expected answer format: letter (A/B/C/D choices), reasoning (chain-of-thought), or freeform (open-ended text). Default: freeform.

  • num_processes (int): Number of parallel worker processes for concurrent evaluation. Higher values speed up evaluation but increase memory usage. Default: 40. Valid range: 1 to 128.

  • skip_saved (bool): Skip re-evaluating samples that already have saved results. Useful for resuming interrupted evaluations. Default: FALSE.

  • seed (int): Random seed for deterministic sampling and generation. Use the same seed for reproducible results. Default: 1. Valid range: 0 to 999999.

  • limit (int): Maximum number of samples to evaluate. Set to -1 for the full dataset, or a positive integer for quick testing or debugging. Default: -1. Valid range: -1 to 999999.

  • total_shard (int): Split evaluation across N shards for distributed processing across multiple machines or jobs. Default: 1. Valid range: 1 to 64.

  • shard_id (int): Current shard identifier (0-indexed). Each shard processes a disjoint subset of the evaluation data. Default: 0. Valid range: 0 to 63.
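To see how total_shard and shard_id split the work, the sketch below partitions a sample list round-robin. The toolkit's actual partitioning scheme is not specified here; only the disjoint-subset property is what matters.

```python
# Sketch: one plausible round-robin partition by total_shard/shard_id
# (the toolkit's actual scheme is an assumption).
samples = list(range(10))
total_shard, shard_id = 2, 0
subset = [s for i, s in enumerate(samples) if i % total_shard == shard_id]
print(subset)  # [0, 2, 4, 6, 8]
```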

VisionConfig Fields (Evaluation)#

  • fps (int): Downsample video to this frame rate for vision processing. Higher FPS provides more temporal detail but increases compute time. Default: 4. Valid range: 1 to 30.

  • total_pixels (int): Target resolution for vision inputs (width × height). Images and videos are resized to this pixel count while preserving the aspect ratio. Default: 3136000. Valid range: 100000 to 10000000.

GenerationConfig Fields#

  • max_retries (int): Maximum retry attempts for generations that fail due to errors or timeouts. Useful for handling transient failures. Default: 10. Valid range: 0 to 50.

  • max_tokens (int): Maximum number of new tokens to generate per response. Higher limits allow more detailed answers but increase latency. Default: 1024. Valid range: 1 to 8192.

  • temperature (float): Sampling temperature: 0.0 for deterministic greedy decoding; higher values (0.7-1.0) for more creative, diverse outputs. Default: 0. Valid range: 0 to 2.

  • repetition_penalty (float): Penalty for repeating tokens. Values greater than 1.0 discourage repetition, 1.0 applies no penalty, and values less than 1.0 encourage repetition. Default: 1. Valid range: 0.1 to 2.

  • presence_penalty (float): Penalty for tokens that already appear in the sequence. Positive values promote diversity; negative values allow repetition. Default: 0. Valid range: -2 to 2.

  • frequency_penalty (float): Penalty proportional to a token's frequency in the sequence. Positive values reduce repetitive patterns; negative values allow them. Default: 0. Valid range: -2 to 2.

ResultsConfig Fields#

  • save_individual_results (bool): Save an individual JSON file per sample with the question, prediction, ground truth, and metadata for detailed analysis. Default: TRUE.

  • save_confusion_matrix (bool): Generate and save a confusion matrix visualization showing the prediction vs. ground-truth distribution (for classification tasks). Default: TRUE.

  • save_metrics_summary (bool): Save an aggregated metrics summary JSON with accuracy, F1, precision, recall, and other evaluation statistics. Default: TRUE.

Running Evaluation#

To run evaluation with a Cosmos-RL model, use this command:

EVAL_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$EVAL_SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Required arguments:

  • --id: Experiment ID to run evaluation on

  • --parent_job_id: Training job ID to use the trained model from

  • --specs: Evaluation specifications

Optional arguments:

  • --action: Action to perform (evaluate)

Running Inference with a Cosmos-RL Model#

Creating an Inference Specification File#

INFERENCE_SPECS=$(tao-client cosmos-rl get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)

The inference experiment specification file for Cosmos-RL includes inference parameters for generating responses to visual content. Here is an example specification file for running inference with a Cosmos-RL model:

inference:
  media: "path/to/video.mp4"
  prompt: "Describe this video."
  fps: 4
  total_pixels: 6422528
  max_new_tokens: 4096

Inference Configuration Parameters#

The following sections detail all available configuration parameters for Cosmos-RL inference.

ExperimentConfig Fields (Inference)#

  • inference (collection): Inference configuration. AutoML: disabled.

InferenceConfig Fields#

  • media (string): Path to the input image or video file for inference. Supports common formats (JPG, PNG, MP4, AVI, etc.).

  • prompt (string): Text prompt or question about the media. The model responds based on the visual content and this instruction. Default: "Describe this video."

  • fps (int): Video frame sampling rate. Higher FPS provides more temporal information but increases memory usage and latency. Default: 4.

  • total_pixels (int): Target resolution for vision input (width × height). The image or video is resized to this pixel count while maintaining the aspect ratio. Default: 6422528.

  • max_new_tokens (int): Maximum number of tokens to generate in the response. Higher values allow longer, more detailed answers. Default: 4096.

Running Inference#

The inference tool for Cosmos-RL models can be used to generate text responses based on video content.

INFERENCE_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$INFERENCE_SPECS")

See also

For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

Required arguments:

  • --id: Experiment ID to run inference on

  • --parent_job_id: Training job ID to use the trained model from

  • --specs: Inference specifications

Optional arguments:

  • --action: Action to perform (inference)

AutoML Support#

Cosmos-RL supports AutoML optimization for the following hyperparameters:

  • Learning Rate (optm_lr): Automatically optimized learning rate schedules

  • Training Epochs (epoch): Optimal number of training epochs

  • Optimizer Selection (optm_name): Choice between AdamW and Adam optimizers

  • Optimizer Betas (optm_betas): Beta coefficients for momentum

  • Warmup Epochs (optm_warmup_epochs): Learning rate warmup schedule

  • LoRA Configuration: Rank (r), alpha (lora_alpha), dropout (lora_dropout), target modules, and modules to save

  • Vision Processing: FPS sampling rate for video processing

To enable AutoML, configure the experiment with AutoML parameters:

automl_information = {
    "automl_enabled": True,
    "automl_algorithm": "bayesian",
    "automl_max_recommendations": 2,
    "automl_hyperparameters": automl_params
}
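The automl_params variable in the snippet above is not defined there. A hypothetical example follows; the parameter names and their exact format expected by FTMS are assumptions for illustration:

```python
# Hypothetical automl_params (names assumed, drawn from the AutoML-enabled
# hyperparameters listed above).
automl_params = ["optm_lr", "epoch", "optm_name", "lora_alpha"]

automl_information = {
    "automl_enabled": True,
    "automl_algorithm": "bayesian",
    "automl_max_recommendations": 2,
    "automl_hyperparameters": automl_params,
}
print(automl_information["automl_algorithm"])
```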

Performance Considerations#

Training performance:

  • Training time varies based on dataset size and hardware configuration.

  • Multi-GPU training significantly reduces training time.

  • AutoML experiments may require multiple training runs.

  • Use compile: true for PyTorch 2.0 optimization (increases initial compilation time).

Memory optimization:

  • Use FP8 precision (fp8.enable_fp8: true) for memory-efficient training on H100/H200 GPUs.

  • Enable gradient checkpointing (model_gradient_checkpointing: true) to trade compute for memory.

  • Adjust batch sizes (train_batch_per_replica) based on available GPU memory.

  • Use FSDP offloading (fsdp_offload: true) to reduce GPU memory usage.

  • Consider model sharding using parallelism configurations.
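Taken together, a memory-constrained run might override the training spec along these lines (values are illustrative, and FP8 requires H100/H200-class GPUs):

```yaml
train:
  train_batch_per_replica: 1         # smallest per-replica batch
  fsdp_offload: true                 # offload FSDP parameters to CPU
  fp8:
    enable_fp8: true                 # H100/H200 only
policy:
  model_gradient_checkpointing: true # recompute activations in backward pass
```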

Storage requirements:

  • Video datasets require significant storage space.

  • Use compressed formats (tar.gz) for efficient storage.

  • Enable dataset caching (enable_dataset_cache: true) for faster data loading.

  • Consider cloud storage for large-scale datasets.

Troubleshooting#

Common issues:

  • Out of Memory: Reduce train_batch_per_replica, enable FP8 precision, or use gradient checkpointing.

  • Dataset Format Errors: Ensure annotations follow LLaVA format exactly.

  • Training Convergence: Adjust learning rates, use warmup epochs, or enable AutoML optimization.

  • Inference Errors: Verify model checkpoints and input formats.

  • Slow Data Loading: Increase dataloader_num_workers and dataloader_prefetch_factor.

Best practices:

  • Start with smaller datasets for initial experimentation.

  • Use AutoML for optimal hyperparameter selection.

  • Monitor training metrics regularly through logging backends.

  • Validate model performance on held-out test sets.

  • Use appropriate parallelism configurations for your hardware setup.

  • Enable LoRA for parameter-efficient fine-tuning.

  • Use mixed precision training (param_dtype: bfloat16) for better performance.

For additional support and troubleshooting, refer to the TAO Toolkit troubleshooting guide.