Cosmos-Reason#
Cosmos-Reason is a state-of-the-art video-language model included in TAO Toolkit. It supports the following tasks:
train
evaluate
inference
You can invoke these tasks from the FTMS client using the following convention:
tao-client cosmos-rl <sub_task> <args_per_subtask>
Where <args_per_subtask> are the command-line arguments required for a given subtask. Each
subtask is explained in detail in the following sections.
Note
Cosmos-RL is currently available only through the TAO Toolkit API and tao-client interfaces.
There is no launcher-based interface for VLM models.
Hardware Requirements#
Minimum requirements:
GPUs: 8x A100 GPUs with at least 80 GB GPU memory
Storage: Minimum 200 GB of free disk space (each Cosmos-RL checkpoint when written to disk is ~150 GB)
OS: Ubuntu 22.04+
Driver: NVIDIA Driver 570
CUDA: CUDA 12.8
Recommended configuration for optimal performance:
Multi-node training for large-scale datasets
High-bandwidth storage system for efficient video data access
Multiple CPU cores for parallel data preprocessing
Data Input for Cosmos-RL#
Cosmos-RL expects datasets in the LLaVA format with the following structure:
dataset_folder/
images.tar.gz or videos.tar.gz (Video frames or image sequences)
annotations.json (Text annotations in JSON format)
Data format specifications:
Dataset Type: vlm (Vision-Language Model)
Format: llava
Supported Intents: training, evaluation, testing
Annotation format:
The annotations should follow the LLaVA conversation format:
{
    "id": "d460df3a29cc7d208d4d588c63e83579",
    "images": [
        "images/001354.png",
        "som_images/001354.d460df3a29cc7d208d4d588c63e83579.png"
    ],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nThe first image is the original, and the second is an overlay. Bright numeric IDs are labeled at the center of certain visual objects in the second image.\nBased on pallet positions in Region [0] Region [1] Region [2] Region [3] Region [4] Region [5] Region [6] Region [7] Region [8] Region [9], which one should the transporter at Region [10] retrieve?\nPlease answer with only the integer number of the correct region the number should be one that is both shown in the image and mentioned in this question. Do not include any explanation or extra text."
        },
        {
            "from": "gpt",
            "value": "3"
        }
    ],
    "category": "mcq",
    "normalized_answer": "3"
}
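The structure above can be checked programmatically before training. Here is a minimal sketch of a per-record validator; the key names follow the example above, and `validate_record` is a hypothetical helper, not part of tao-client:

```python
import json

# Hypothetical validator for one LLaVA-format record; key names follow the
# example above, not an official schema.
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in a single annotation record."""
    problems = []
    for key in ("id", "conversations"):
        if key not in record:
            problems.append(f"missing key: {key}")
    turns = record.get("conversations", [])
    # Conversations alternate between a "human" prompt and a "gpt" answer.
    for i, turn in enumerate(turns):
        if turn.get("from") not in ("human", "gpt"):
            problems.append(f"turn {i}: unknown speaker {turn.get('from')!r}")
        if "value" not in turn:
            problems.append(f"turn {i}: missing 'value'")
    # Every <image> placeholder should be backed by an entry in "images".
    n_placeholders = sum(t.get("value", "").count("<image>") for t in turns)
    if n_placeholders > len(record.get("images", [])):
        problems.append("more <image> placeholders than images listed")
    return problems

record = json.loads("""
{"id": "abc", "images": ["images/001.png"],
 "conversations": [{"from": "human", "value": "<image>\\nWhat is shown?"},
                   {"from": "gpt", "value": "3"}]}
""")
print(validate_record(record))  # []
```

Running a check like this over every record in annotations.json before submitting a job catches format errors early, when they are cheap to fix.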
Creating a Training Specification File#
SPECS=$(tao-client cosmos-rl get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
The training specification file for Cosmos-RL includes train, validation, policy, logging, and custom parameters.
Here is an example specification file for training a Cosmos-RL model:
train:
  resume: false
  epoch: 10
  compile: false
  train_batch_per_replica: 1
  output_dir: "output"
  optm_lr: 1e-6
  optm_impl: "foreach"
  optm_weight_decay: 0.01
  optm_min_lr_factor: 0.0
  optm_grad_norm_clip: 1.0
  epsilon: 1e-8
  optm_name: "AdamW"
  optm_betas: [0.9, 0.999]
  optm_warmup_epochs: 0
  async_tp_enabled: false
  master_dtype: "float32"
  param_dtype: "bfloat16"
  fsdp_reduce_dtype: "float32"
  fsdp_offload: false
  fsdp_reshard_after_forward: "default"
  sync_weight_interval: 1
  ckpt:
    enable_checkpoint: true
    save_freq_in_epoch: 10
    save_mode: "sync"
    max_keep: 8
    export_safetensors: true
  train_policy:
    type: "sft"
    mini_batch: 4
    enable_dataset_cache: true
    dataloader_num_workers: 8
    dataloader_prefetch_factor: 8
    conversation_column_name: "conversations"
    dataset:
      name: "its"
      test_size: 1
  fp8:
    enable_fp8: false
    fp8_recipe: "dynamic_scaling"
    quant_recipe: "rowwise"
validation:
  enable: true
  freq_in_epoch: 10
policy:
  model_name_or_path: "nvidia/Cosmos-Reason1-7B"
  model_max_length: 4096
  model_gradient_checkpointing: true
  parallelism:
    n_init_replicas: 1
    tp_size: 1
    cp_size: 1
    dp_shard_size: 1
    dp_replicate_size: 1
    pp_size: 1
    cp_rotate_method: "allgather"
  lora:
    r: 8
    lora_alpha: 8
    lora_dropout: 0.0
    target_modules: ["q_proj", "v_proj"]
    use_rslora: false
    modules_to_save: []
    init_lora_weights: true
logging:
  logger: ["console", "tao"]
  project_name: "cosmos-rl"
  experiment_name: "cosmos-rl"
custom:
  dataset:
    annotation_path: "data/sft/annotations.json"
    media_path: "data/sft/train2017"
    system_prompt: ""
  vision:
    fps: 1
    total_pixels: 313600
redis: "12800"
results_dir: "/results"
Training Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL training, organized by configuration group.
ExperimentConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| train | collection | Train config | | | | | FALSE |
| validation | collection | Validation config | | | | | FALSE |
| policy | collection | Policy config | | | | | FALSE |
| logging | collection | Logging config | | | | | FALSE |
| redis | string | Redis port for distributed training coordination and interprocess communication in multinode setups | 12800 | | | | |
| results_dir | string | Root folder for all training outputs including checkpoints, logs, and evaluation results | /results | | | | |
| custom | collection | Custom config | | | | | FALSE |
TrainConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| resume | bool | Resume training from the latest checkpoint in the output folder. | FALSE | | | | |
| epoch | int | Total number of training epochs (complete passes through the dataset). | 10 | 10 | 20 | | TRUE |
| compile | bool | Use PyTorch 2.0 compilation (torch.compile). | FALSE | | | | |
| train_batch_per_replica | int | Batch size per GPU replica. Global batch size = train_batch_per_replica × number of replicas. | 1 | 1 | inf | | |
| output_dir | string | Folder for saving checkpoints, logs, and training artifacts. | output | | | | |
| optm_lr | float | Peak learning rate for the optimizer. The actual LR follows a warmup and cosine decay schedule. | 1e-06 | 0 | inf | | TRUE |
| optm_impl | categorical | Optimizer implementation: fused, foreach, or for-loop. | foreach | | | fused,foreach,for-loop | |
| optm_weight_decay | float | L2 regularization coefficient (weight decay) to prevent overfitting. Applied to all parameters except biases and norms. | 0.01 | 0 | inf | | |
| optm_min_lr_factor | float | Minimum learning rate as a fraction of peak LR. For cosine annealing: min_lr = optm_min_lr_factor × optm_lr. | 0.0 | 0 | inf | | |
| optm_grad_norm_clip | float | Maximum gradient norm for clipping. Prevents exploding gradients. Set to 0 or negative to disable clipping. | 1.0 | 0 | inf | | |
| epsilon | float | Small constant added to the denominator for numerical stability in the Adam/AdamW optimizer. | 1e-08 | 0 | inf | | |
| optm_name | categorical | Optimizer algorithm: 'AdamW' (Adam with decoupled weight decay, recommended) or 'Adam' (original). | AdamW | | | AdamW,Adam | TRUE |
| optm_betas | list_2 | Beta coefficients for Adam/AdamW: [beta1, beta2] for exponential moving averages of the gradient and squared gradient. | [0.9, 0.999] | | | | TRUE |
| optm_warmup_epochs | union | Number of epochs for linear learning rate warmup from 0 to the peak learning rate. | 0 | 0 | inf | | TRUE |
| async_tp_enabled | bool | Enable asynchronous Tensor Parallel communication to overlap computation and communication for better throughput. | FALSE | | | | |
| master_dtype | categorical | Data type for master weights in optimizer states. Higher precision prevents accumulated rounding errors. | float32 | | | float32,float16,bfloat16 | |
| param_dtype | categorical | Data type for model parameters and activations during training. | bfloat16 | | | float32,float16,bfloat16 | |
| fsdp_reduce_dtype | categorical | Data type for gradient all-reduce in Fully Sharded Data Parallel. | float32 | | | float32,float16,bfloat16 | |
| fsdp_offload | bool | Offload FSDP parameters to CPU memory when not in use. Reduces GPU memory but increases overhead. | FALSE | | | | |
| fsdp_reshard_after_forward | categorical | Reshard parameters after the forward pass. | default | | | default,true,false | |
| sync_weight_interval | int | Interval in steps for synchronizing weights across data parallel replicas. Higher values reduce communication overhead. | 1 | 1 | inf | | |
| ckpt | collection | Train checkpoint config. | | | | | FALSE |
| train_policy | collection | Train policy config. | | | | | FALSE |
| fp8 | collection | Train FP8 config. | | | | | FALSE |
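Reading the batch-size fields together: one optimizer step consumes train_batch_per_replica samples on each replica, and the train_policy mini_batch setting controls how that per-replica batch is split for gradient accumulation. A rough sketch of the arithmetic under that interpretation (the exact accounting inside Cosmos-RL may differ):

```python
import math

# Hedged sketch of the batch bookkeeping implied by the fields above;
# the exact accounting inside Cosmos-RL may differ.
def batch_arithmetic(train_batch_per_replica: int, n_replicas: int, mini_batch: int):
    # One optimizer step consumes this many samples across all replicas.
    global_batch = train_batch_per_replica * n_replicas
    # Each replica works through its batch mini_batch samples at a time,
    # accumulating gradients between micro-steps.
    accumulation_steps = math.ceil(train_batch_per_replica / mini_batch)
    return global_batch, accumulation_steps

print(batch_arithmetic(8, 4, 4))  # (32, 2)
```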
ValidationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable | bool | Run validation during training to monitor model performance on held-out data. | TRUE | | | | |
| freq_in_epoch | int | Run validation every N epochs. Takes priority over the step-based frequency if set to a positive value. | 10 | 1 | inf | | |
PolicyConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_name_or_path | string | HuggingFace model identifier (e.g., nvidia/Cosmos-Reason1-7B) or local model path. | nvidia/Cosmos-Reason1-7B | | | | |
| model_max_length | int | Maximum sequence length in tokens. Sequences longer than this are truncated. Limited by the model's positional encoding. | 4096 | 1 | inf | | |
| model_gradient_checkpointing | bool | Trade compute for memory; recompute activations during the backward pass instead of storing them. Reduces memory requirements ~40% but increases training time ~20%. | TRUE | | | | |
| parallelism | collection | Policy parallelism config. | | | | | FALSE |
| lora | collection | LoRA config. | | | | | FALSE |
LoggingConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| logger | list | Logging backends to enable. Options: console, tao. | ['console', 'tao'] | | | ['console', 'tao'] | FALSE |
| project_name | string | Project name used to organize experiments in logging backends like Weights & Biases. | cosmos-rl | | | | |
| experiment_name | string | Unique name for this training run, used for tracking and organizing results across logging backends. | cosmos-rl | | | | |
CustomConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset | collection | Dataset config. | | | | | FALSE |
| vision | collection | Vision config. | | | | | FALSE |
TrainCheckpointConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable_checkpoint | bool | Indicates whether to save model checkpoints during training for resuming or model deployment. | TRUE | | | | |
| save_freq_in_epoch | int | Save a checkpoint every N epochs. Takes priority over the step-based frequency if set to a positive value. | 10 | 1 | inf | | |
| save_mode | categorical | Checkpoint saving mode: sync or async. | sync | | | async,sync | |
| max_keep | int | Maximum number of checkpoints to keep. Older checkpoints are automatically deleted. Set to -1 to keep all checkpoints. | 8 | -1 | inf | | |
| export_safetensors | bool | Export checkpoints in HuggingFace SafeTensors format for easy model sharing and deployment. | TRUE | | | | |
TrainPolicyConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Training policy type: sft (supervised fine-tuning). | sft | | | sft | |
| mini_batch | int | Mini-batch size for gradient accumulation. The global batch is split into mini-batches to reduce memory usage. | 4 | 1 | inf | | |
| enable_dataset_cache | bool | Cache preprocessed dataset samples to disk for faster data loading across training runs. | TRUE | | | | |
| dataloader_num_workers | int | Number of parallel worker processes for data loading and preprocessing. Set to 0 for single-threaded loading. | 8 | 0 | inf | | |
| dataloader_prefetch_factor | int | Number of batches loaded in advance per worker. Higher values improve throughput but increase memory usage. | 8 | 1 | inf | | |
| conversation_column_name | string | Name of the dataset column containing conversation data (list of messages with roles and content). | conversations | | | | |
| dataset | collection | Dataset config. | | | | | FALSE |
TrainFP8Config Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| enable_fp8 | bool | Enable FP8 (8-bit floating point) training for 2x memory reduction and faster training on supported GPUs (H100, H200). | FALSE | | | | |
| fp8_recipe | categorical | FP8 scaling strategy: dynamic_scaling or delayed_scaling. | dynamic_scaling | | | dynamic_scaling,delayed_scaling | |
| quant_recipe | categorical | Quantization granularity: rowwise or tensorwise. | rowwise | | | rowwise,tensorwise | |
PolicyParallelismConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| n_init_replicas | int | Number of model replicas to initialize. Used for advanced multi-model training setups. | 1 | 1 | inf | | |
| tp_size | int | Tensor Parallel size: splits each layer across N GPUs. Use for models too large for a single GPU. Must be a factor of the total number of GPUs. | 1 | 1 | inf | | |
| cp_size | int | Context Parallel size: splits long sequences across N GPUs. Enables training with sequences longer than single-GPU memory allows. | 1 | 1 | inf | | |
| dp_shard_size | int | Data Parallel Shard size (FSDP): shards model parameters across N GPUs. Reduces per-GPU memory. Must multiply with the other dimensions to equal the total number of GPUs. | 1 | 1 | inf | | |
| dp_replicate_size | int | Data Parallel Replicate size: replicates the full model across N GPU groups. Increases throughput by processing different batches in parallel. | 1 | 1 | inf | | |
| pp_size | int | Pipeline Parallel size: splits model layers across N GPUs. Enables training very deep models. Uses a 1F1B schedule for efficiency. | 1 | 1 | inf | | |
| cp_rotate_method | categorical | Context Parallel communication pattern: allgather or p2p. | allgather | | | allgather,p2p | |
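These dimensions compose multiplicatively: per the field descriptions above, the product of tp_size, cp_size, dp_shard_size, dp_replicate_size, and pp_size must equal the total number of GPUs in the job. A quick sanity check (hypothetical helper, not part of the tao-client API):

```python
# Hypothetical sanity check for the parallelism fields above; not part of
# the tao-client API.
def check_parallelism(total_gpus: int, tp: int = 1, cp: int = 1,
                      dp_shard: int = 1, dp_replicate: int = 1, pp: int = 1) -> bool:
    product = tp * cp * dp_shard * dp_replicate * pp
    if product != total_gpus:
        raise ValueError(
            f"parallelism product {product} does not match {total_gpus} GPUs")
    return True

# 16 GPUs (2 nodes x 8): tensor-parallel 2, FSDP shard 4, replicate 2
check_parallelism(16, tp=2, dp_shard=4, dp_replicate=2)
```

Running a check like this before submitting a job avoids a failed launch when the dimensions do not multiply out to the GPU count.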
LoraConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| r | int | LoRA rank: dimensionality of the low-rank adaptation matrices. Higher values increase model capacity but require more memory (must be a power of 2). | 8 | 1 | 256 | | TRUE |
| lora_alpha | int | LoRA scaling factor: controls the magnitude of LoRA updates. Typically set equal to rank r (must be a power of 2). | 8 | 1 | 1024 | | TRUE |
| lora_dropout | float | Dropout probability applied to LoRA layers for regularization. Set to 0.0 to disable dropout. | 0.0 | 0.0 | 0.1 | | TRUE |
| target_modules | subset_list | Transformer layers to apply LoRA adaptation to: q/k/v/o_proj (attention), up/gate/down_proj (MLP). | ["q_proj", "v_proj"] | | | q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj,down_proj | TRUE |
| use_rslora | bool | Use Rank-Stabilized LoRA with improved scaling (lora_alpha/sqrt(r) instead of lora_alpha/r). Provides better training stability and performance for higher ranks. | FALSE | | | | |
| modules_to_save | optional_list | Additional non-LoRA modules to fine-tune fully. Set to ["visual"] to also fully fine-tune the vision modules. | [] | | | visual | TRUE |
| init_lora_weights | union | Specifies how to initialize the weights of the adapter layers. Pass TRUE (the default) to use the reference initialization from Microsoft, with the LoRA B weight set to 0, so that without further training the adapter is a no-op. Pass FALSE to use random initialization of LoRA A and B, so the adapter is not a no-op before training; this setting is intended for debugging. Additional string options select alternative initialization schemes. | TRUE | | | TRUE, FALSE, … | |
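To get a feel for what the rank buys: each adapted weight matrix of shape (d_out, d_in) gains two trainable matrices, A (r × d_in) and B (d_out × r), so the trainable parameter count grows linearly with r. A back-of-the-envelope count for the default q_proj/v_proj targets (the layer shapes here are illustrative, not the actual Cosmos-Reason1-7B dimensions):

```python
# Back-of-the-envelope LoRA parameter count; shapes are illustrative only.
def lora_param_count(layer_shapes, r: int) -> int:
    """Trainable parameters added by LoRA: r*(d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in layer_shapes)

# Illustrative: 28 transformer layers, q_proj and v_proj both 3584x3584.
shapes = [(3584, 3584)] * (28 * 2)
print(lora_param_count(shapes, r=8))  # 3211264
```

A few million trainable parameters against a 7B-parameter base model is why LoRA fine-tuning fits in far less memory than full fine-tuning.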
DatasetConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| annotation_path | string | Path to the JSON file containing training annotations with conversations and media references. | data/sft/annotations.json | | | | |
| media_path | string | Folder containing image and video media files referenced in the annotation file. | data/sft/train2017 | | | | |
| system_prompt | string | System instruction that provides context for the model's behavior and role in conversations. | | | | | |
VisionConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| fps | int | Video sampling rate in frames per second for vision-language models. Higher FPS captures more temporal information but increases memory usage. | 1 | 1 | 3 | | TRUE |
| total_pixels | int | Target resolution for vision inputs (width × height). Images and videos are resized to this total pixel count while maintaining the aspect ratio. | 313600 | 1 | inf | | |
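The total_pixels budget is applied by scaling both sides of the input by the same factor, so the aspect ratio is preserved while the area comes in at or under the budget. A sketch of that arithmetic (the real preprocessor may also round dimensions to patch-size multiples):

```python
import math

# Sketch of aspect-preserving resizing under a pixel budget; the actual
# preprocessor may also round dimensions to patch-size multiples.
def fit_to_pixel_budget(width: int, height: int, total_pixels: int):
    """Scale (width, height) uniformly so that area <= total_pixels."""
    scale = math.sqrt(total_pixels / (width * height))
    if scale >= 1.0:            # already within budget, leave unchanged
        return width, height
    return int(width * scale), int(height * scale)

print(fit_to_pixel_budget(1920, 1080, 300000))  # (730, 410)
```

This is why raising total_pixels increases visual detail at the cost of more vision tokens and memory: the budget directly bounds the resized area.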
TrainPolicyDatasetConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| name | string | HuggingFace dataset name or local path to the training dataset. | its | | | | |
| test_size | union | Size of the test set. If a float, the ratio (between 0.0 and 1.0) of the dataset; if an int, the absolute number of samples. | None | 0.0 | inf | | |
Training the Model#
Use the following command to run Cosmos-RL training:
TRAIN_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
The required arguments are the experiment ID and specs:
--id: The experiment ID to run training on
--specs: The training specifications
Optional arguments:
You can set optional arguments to override the option values in the experiment specification file.
--action: The action to perform (train)
--parent_job_id: Parent job ID for chaining jobs
Multi-Node Training with FTMS#
Distributed training is supported through FTMS. For large models, multi-node clusters can significantly reduce training time.
Verify that your cluster has multiple GPU-enabled nodes available for training by running this command:
kubectl get nodes -o wide
The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to get more nodes added to your cluster.
To run a multi-node training job through FTMS, modify these fields in the training job specification:
{
"train": {
"num_gpus": 8, // Number of GPUs per node
"num_nodes": 2 // Number of nodes to use for training
}
}
If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
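Since $SPECS holds the JSON specification returned by get-spec, one way to set these fields is to patch the document before submitting the job. A hedged sketch in Python (the inline spec text here is a stand-in for the real $SPECS content):

```python
import json

spec_text = '{"train": {"epoch": 10}}'   # stands in for $SPECS from get-spec
specs = json.loads(spec_text)

# Request 8 GPUs on each of 2 nodes for this training job.
specs.setdefault("train", {}).update({"num_gpus": 8, "num_nodes": 2})

print(json.dumps(specs, sort_keys=True))
# {"train": {"epoch": 10, "num_gpus": 8, "num_nodes": 2}}
```

The patched JSON can then be passed back to experiment-run-action via --specs.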
The latest checkpoint is saved automatically based on the checkpoint configuration.
Training automatically resumes from the latest checkpoint if train.resume is set to true.
Evaluating the Model#
Creating an Evaluation Specification File#
EVAL_SPECS=$(tao-client cosmos-rl get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)
The evaluation experiment specification file for Cosmos-RL includes evaluate parameters for comprehensive model assessment.
Here is an example specification file for evaluating a Cosmos-RL model:
evaluate:
  dataset:
    annotation_path: "path/to/eval_annotations.json"
    media_dir: "path/to/eval_media/"
    system_prompt: "You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs."
  model:
    model_name: "nvidia/Cosmos-Reason1-7B"
    save_folder: "cr1_1_zero_shot"
    tokenizer_model_name: "qwen2.5-vl-7b"
    dtype: "bfloat16"
    tp_size: 1
    max_length: 128000
    enable_lora: false
    base_model_path: ""
  evaluation:
    answer_type: "freeform"
    num_processes: 40
    skip_saved: false
    seed: 1
    limit: -1
    total_shard: 1
    shard_id: 0
  vision:
    fps: 4
    total_pixels: 3136000
  generation:
    max_retries: 10
    max_tokens: 1024
    temperature: 0
    repetition_penalty: 1
    presence_penalty: 0
    frequency_penalty: 0
  results:
    save_individual_results: true
    save_confusion_matrix: true
    save_metrics_summary: true
results_dir: "/results"
Evaluation Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL evaluation.
ExperimentConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Root folder for saving all evaluation outputs, including predictions, metrics, and visualizations. | /results | | | | |
| evaluate | collection | Evaluation configuration. | | | | | FALSE |
EvaluateConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset | collection | Dataset configuration for evaluation. | | | | | FALSE |
| model | collection | Model configuration. | | | | | FALSE |
| evaluation | collection | Evaluation parameters. | | | | | FALSE |
| vision | collection | Vision processing configuration. | | | | | FALSE |
| generation | collection | Generation parameters. | | | | | FALSE |
| results | collection | Results and output configuration. | | | | | FALSE |
DatasetConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| annotation_path | string | Path to the JSON file with evaluation samples containing questions, ground-truth answers, and media references. | | | | | |
| media_dir | string | Optional folder containing image and video files. Leave empty if media paths in the annotations are absolute or relative to the current folder. | | | | | |
| system_prompt | string | System instruction prepended to all evaluation prompts to provide context about the task and expected behavior. | You are a helpful assistant that can answer questions about a street-view CCTV footage. The vehicles that need attention are marked with bounding boxes and IDs. | | | | |
ModelConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_name | string | HuggingFace model ID, local model path, or path to a safetensors checkpoint folder for evaluation. | nvidia/Cosmos-Reason1-7B | | | | |
| save_folder | string | Subfolder name within the results folder where this model's outputs are saved. | cr1_1_zero_shot | | | | |
| tokenizer_model_name | string | Tokenizer to use for text processing. | qwen2.5-vl-7b | | | | |
| dtype | string | Precision for model weights during inference. | bfloat16 | | | | |
| tp_size | int | Number of GPUs for Tensor Parallelism. Splits each layer across GPUs for larger models. Set to 1 for single-GPU inference. | 1 | 1 | 8 | | |
| max_length | int | Maximum total sequence length (prompt + response) in tokens. Must not exceed the model's context window. | 128000 | 1024 | 1000000 | | |
| enable_lora | bool | Specifies whether to merge LoRA adapter weights into the base model before evaluation. Required when evaluating LoRA fine-tuned models. | False | | | | |
| base_model_path | string | Path to the base pretrained model. Required when enable_lora is true. | | | | | |
EvaluationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| answer_type | string | Expected answer format. | freeform | | | | |
| num_processes | int | Number of parallel worker processes for concurrent evaluation. Higher values speed up evaluation but increase memory usage. | 40 | 1 | 128 | | |
| skip_saved | bool | Skip re-evaluating samples that already have saved results. Useful for resuming interrupted evaluations. | FALSE | | | | |
| seed | int | Random seed for deterministic sampling and generation. Use the same seed for reproducible results. | 1 | 0 | 999999 | | |
| limit | int | Maximum number of samples to evaluate. Set to -1 for the full dataset or a positive integer for quick testing or debugging. | -1 | -1 | 999999 | | |
| total_shard | int | Split evaluation across N shards for distributed processing across multiple machines or jobs. | 1 | 1 | 64 | | |
| shard_id | int | Current shard identifier (0-indexed). Each shard processes a disjoint subset of the evaluation data. | 0 | 0 | 63 | | |
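With total_shard and shard_id, each job evaluates a disjoint slice of the dataset. A strided split is the simplest scheme consistent with the description above (the actual partitioning inside Cosmos-RL may differ):

```python
# Strided sharding sketch; the actual partitioning inside Cosmos-RL may differ.
def shard(samples, shard_id: int, total_shard: int):
    """Strided partition: shard k takes samples k, k+N, k+2N, ..."""
    return samples[shard_id::total_shard]

data = list(range(10))
parts = [shard(data, k, 4) for k in range(4)]
print(parts)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# Shards are disjoint and together cover the whole dataset:
assert sorted(sum(parts, [])) == data
```

Launching one evaluation job per shard_id in 0..total_shard-1 then covers every sample exactly once.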
VisionConfig Fields (Evaluation)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| fps | int | Downsample video to this frame rate for vision processing. Higher FPS provides more temporal detail but increases compute time. | 4 | 1 | 30 | | |
| total_pixels | int | Target resolution for vision inputs (width × height). Images and videos are resized to this pixel count while preserving the aspect ratio. | 3136000 | 100000 | 10000000 | | |
GenerationConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| max_retries | int | Maximum retry attempts for failed generations due to errors or timeouts. Useful for handling transient failures. | 10 | 0 | 50 | | |
| max_tokens | int | Maximum number of new tokens to generate per response. Longer limits allow detailed answers but increase latency. | 1024 | 1 | 8192 | | |
| temperature | float | Sampling temperature: 0.0 for deterministic greedy decoding, higher values (0.7-1.0) for more creative, diverse outputs. | 0 | 0 | 2 | | |
| repetition_penalty | float | Penalty for repeating tokens. Values > 1.0 discourage repetition, 1.0 means no penalty, and values < 1.0 encourage repetition. | 1 | 0.1 | 2 | | |
| presence_penalty | float | Penalty for tokens that already appear in the sequence. Positive values promote diversity; negative values allow repetition. | 0 | -2 | 2 | | |
| frequency_penalty | float | Penalty proportional to token frequency in the sequence. Positive values reduce repetitive patterns; negative values allow them. | 0 | -2 | 2 | | |
ResultsConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| save_individual_results | bool | Indicates whether to save an individual JSON file for each sample with question, prediction, ground truth, and metadata for detailed analysis. | TRUE | | | | |
| save_confusion_matrix | bool | Indicates whether to generate and save a confusion-matrix visualization showing the prediction vs. ground-truth distribution (for classification tasks). | TRUE | | | | |
| save_metrics_summary | bool | Indicates whether to save an aggregated metrics-summary JSON with accuracy, F1, precision, recall, and other evaluation statistics. | TRUE | | | | |
Running Evaluation#
To run evaluation with a Cosmos-RL model, use this command:
EVAL_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$EVAL_SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
--id: Experiment ID to run evaluation on
--parent_job_id: Training job ID to use the trained model from
--specs: Evaluation specifications
Optional arguments:
--action: Action to perform (evaluate)
Running Inference with a Cosmos-RL Model#
Creating an Inference Specification File#
INFERENCE_SPECS=$(tao-client cosmos-rl get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)
The inference experiment specification file for Cosmos-RL includes inference parameters for generating responses to visual content.
Here is an example specification file for running inference with a Cosmos-RL model:
inference:
  media: "path/to/video.mp4"
  prompt: "Describe this video."
  fps: 4
  total_pixels: 6422528
  max_new_tokens: 4096
Inference Configuration Parameters#
The following sections detail all available configuration parameters for Cosmos-RL inference.
ExperimentConfig Fields (Inference)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| inference | collection | Inference config | | | | | FALSE |
InferenceConfig Fields#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| media | string | Path to the input image or video file for inference. Supports common formats (JPG, PNG, MP4, AVI, etc.). | | | | | |
| prompt | string | Text prompt or question to ask about the media. The model responds based on the visual content and this instruction. | Describe this video. | | | | |
| fps | int | Video frame sampling rate. Higher FPS provides more temporal information but increases memory and latency. | 4 | | | | |
| total_pixels | int | Target resolution for vision input (width × height). The image or video is resized to this pixel count while maintaining the aspect ratio. | 6422528 | | | | |
| max_new_tokens | int | Maximum number of tokens to generate in the response. Higher values allow longer, more detailed answers. | 4096 | | | | |
Running Inference#
The inference tool for Cosmos-RL models can be used to generate text responses based on video content.
INFERENCE_JOB_ID=$(tao-client cosmos-rl experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$INFERENCE_SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Required arguments:
--id: Experiment ID to run inference on
--parent_job_id: Training job ID to use the trained model from
--specs: Inference specifications
Optional arguments:
--action: Action to perform (inference)
AutoML Support#
Cosmos-RL supports AutoML optimization for the following hyperparameters:
Learning Rate (optm_lr): Automatically optimized learning rate schedules
Training Epochs (epoch): Optimal number of training epochs
Optimizer Selection (optm_name): Choice between AdamW and Adam optimizers
Optimizer Betas (optm_betas): Beta coefficients for momentum
Warmup Epochs (optm_warmup_epochs): Learning rate warmup schedule
LoRA Configuration: Rank (r), alpha (lora_alpha), dropout (lora_dropout), target modules, and modules to save
Vision Processing: FPS sampling rate for video processing
To enable AutoML, configure the experiment with AutoML parameters:
automl_information = {
"automl_enabled": True,
"automl_algorithm": "bayesian",
"automl_max_recommendations": 2,
"automl_hyperparameters": automl_params
}
Performance Considerations#
Training performance:
Training time varies based on dataset size and hardware configuration.
Multi-GPU training significantly reduces training time.
AutoML experiments may require multiple training runs.
Use compile: true for PyTorch 2.0 optimization (increases initial compilation time).
Memory optimization:
Use FP8 precision (fp8.enable_fp8: true) for memory-efficient training on H100/H200 GPUs.
Enable gradient checkpointing (model_gradient_checkpointing: true) to trade compute for memory.
Adjust batch sizes (train_batch_per_replica) based on available GPU memory.
Use FSDP offloading (fsdp_offload: true) to reduce GPU memory usage.
Consider model sharding using parallelism configurations.
Storage requirements:
Video datasets require significant storage space.
Use compressed formats (tar.gz) for efficient storage.
Enable dataset caching (enable_dataset_cache: true) for faster data loading.
Consider cloud storage for large-scale datasets.
Troubleshooting#
Common issues:
Out of Memory: Reduce train_batch_per_replica, enable FP8 precision, or use gradient checkpointing.
Dataset Format Errors: Ensure annotations follow the LLaVA format exactly.
Training Convergence: Adjust learning rates, use warmup epochs, or enable AutoML optimization.
Inference Errors: Verify model checkpoints and input formats.
Slow Data Loading: Increase dataloader_num_workers and dataloader_prefetch_factor.
Best practices:
Start with smaller datasets for initial experimentation.
Use AutoML for optimal hyperparameter selection.
Monitor training metrics regularly through logging backends.
Validate model performance on held-out test sets.
Use appropriate parallelism configurations for your hardware setup.
Enable LoRA for parameter-efficient fine-tuning.
Use mixed precision training (param_dtype: bfloat16) for better performance.
For additional support and troubleshooting, refer to the TAO Toolkit troubleshooting guide.