Reason1 Training Guide#

This page outlines how to train the Cosmos-Reason1-7B model using sample training datasets provided on Hugging Face.

Note

The training steps outlined below will require you to download ~50GB of model and dataset files from Hugging Face. Ensure your ~/.cache directory has enough storage space. Alternatively, you can set the HF_HOME and COSMOS_CACHE environment variables to a directory with sufficient storage space.
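
For example, to point both caches at a larger volume before launching training (the path below is a placeholder; substitute a directory with enough free space):

export HF_HOME=/mnt/big-disk/hf_home
export COSMOS_CACHE=/mnt/big-disk/cosmos_cache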

Training Configuration#

Job recipes#

Cosmos-Reason1 supports Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training with a broad range of models and parallelisms, which can be configured using a config file. Several pre-defined config files are provided in the configs folder. The following table outlines the parallelism settings and required number of GPUs for two pre-defined config files:

| Config File | Policy TP | Policy FSDP | Policy PP | Rollout TP | Rollout PP | Num GPUs | Purpose |
|---|---|---|---|---|---|---|---|
| cosmos-reason1-7b-tp2-sft.toml | 2 | 1 | 1 | - | - | 2 | SFT |
| cosmos-reason1-7b-p-fsdp1-tp2-r-tp2-pp1-grpo.toml | 2 | 1 | 1 | 2 | 1 | 2 for policy, 2 for rollout | GRPO |

You can customize your own training config based on the above recipes. For example, change epoch and train_batch_per_replica to adjust the number of epochs and the batch size, and save_freq to adjust the checkpoint saving interval. To reduce storage usage, lower max_keep in the checkpoint config, which limits the number of saved checkpoints. You should also regularly delete intermediate checkpoints and .tar archives that are not needed for recovery.
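
For example, a sketch of these overrides in TOML form. The key names are the ones referenced above, but the section layout is illustrative only; check the shipped recipes in the configs folder for the exact structure and defaults before editing:

# Sketch only: section names are illustrative, key names are those referenced above.
[train]
epoch = 2                      # number of training epochs
train_batch_per_replica = 8    # per-replica batch size

[train.ckpt]
save_freq = 50                 # save a checkpoint every 50 steps
max_keep = 2                   # keep at most 2 checkpoints to limit storage usage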

GPU Requirements#

For training the Cosmos-Reason1-7B model, GPUs with at least 80GB of memory are required. SFT training requires two or more GPUs, while RL training requires four or more GPUs.

For evaluation or inference with the Cosmos-Reason1-7B model, a single GPU with at least 24GB memory is sufficient.

Supervised Fine-Tuning (SFT)#

SFT training can improve a model's capability on tasks whose distribution is similar to that of the training dataset. For example, training with the robovqa dataset can improve model performance on robotics-focused visual question-answering scenarios.

Note

The nvidia/Cosmos-Reason1-7B model, which has already been SFT-trained on nvidia/Cosmos-Reason1-SFT-Dataset, is set as the default base model for SFT. We recommend using your own dataset for SFT exploration.

The following command launches SFT training for nvidia/Cosmos-Reason1-7B with TP=2 on two GPUs:

python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-tp2-sft.toml

After training finishes, the DCP checkpoint is saved to $output_dir, along with a Hugging Face-style model export.

[rank1]:[cosmos] 2025-05-16 06:28:46,019 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank1]:[cosmos] 2025-05-16 06:28:46,020 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,998 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - Prepare to exporting safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final at rank 0
[rank0]:[cosmos] 2025-05-16 06:28:55,622 - cosmos - INFO - Saved chunk 0 to 00000.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:03,829 - cosmos - INFO - Saved chunk 1 to 00001.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:11,891 - cosmos - INFO - Saved chunk 2 to 00002.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:21,191 - cosmos - INFO - Saved chunk 3 to 00003.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:22,083 - cosmos - INFO -
[rank0]:
[rank0]:Exported safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final

You can find the SFT model checkpoint in outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final:

root@node:~/ws# ls ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final -la
total 16211328
drwxr-xr-x 2 root root       4096 May 16 06:29 .
drwxr-xr-x 5 root root       4096 May 16 06:28 ..
-rw-r--r-- 1 root root 4171800072 May 16 06:28 00000.safetensors
-rw-r--r-- 1 root root 4195052544 May 16 06:29 00001.safetensors
-rw-r--r-- 1 root root 4195052632 May 16 06:29 00002.safetensors
-rw-r--r-- 1 root root 4022509168 May 16 06:29 00003.safetensors
-rw-r--r-- 1 root root        605 May 16 06:29 added_tokens.json
-rw-r--r-- 1 root root       1049 May 16 06:29 chat_template.json
-rw-r--r-- 1 root root       1459 May 16 06:29 config.json
-rw-r--r-- 1 root root    1671853 May 16 06:29 merges.txt
-rw-r--r-- 1 root root      49611 May 16 06:29 model.safetensors.index.json
-rw-r--r-- 1 root root        575 May 16 06:29 preprocessor_config.json
-rw-r--r-- 1 root root        613 May 16 06:29 special_tokens_map.json
-rw-r--r-- 1 root root   11421896 May 16 06:29 tokenizer.json
-rw-r--r-- 1 root root       5776 May 16 06:29 tokenizer_config.json
-rw-r--r-- 1 root root    2776833 May 16 06:29 vocab.json
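
The exported directory is a standard Hugging Face-style checkpoint, so it can be opened directly with transformers. A minimal sanity-check sketch (the timestamped path is from this example run; substitute your own output directory):

from transformers import AutoConfig, AutoProcessor

ckpt = "./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final"

config = AutoConfig.from_pretrained(ckpt)        # reads config.json
processor = AutoProcessor.from_pretrained(ckpt)  # reads tokenizer/preprocessor files
print(config.model_type, config.architectures)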

To evaluate the improved performance of this SFT model, refer to the Evaluation section.

Reinforcement Learning (RL)#

Reinforcement Learning (RL) training can improve the model's reasoning capability on certain tasks by training on a reasoning-focused dataset.

The following command launches GRPO training for nvidia/Cosmos-Reason1-7B with TP=2 and FSDP=1, along with a rollout of TP=2, using a total of four GPUs:

python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-p-fsdp1-tp2-r-tp2-pp1-grpo.toml

After training completes, the Hugging Face checkpoint is saved to $output_dir. To evaluate the improved reasoning performance of the RL-trained model, refer to the Evaluation section.

Inference#

You can use the tools/eval/inference.py script to run inference with the Cosmos-Reason1 model.

python tools/eval/inference.py

This code snippet is adapted from the Qwen2.5-VL repository.
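
If you prefer to call the model directly, the sketch below follows the standard Qwen2.5-VL usage pattern rather than reproducing tools/eval/inference.py. The checkpoint path, video path, and prompt are placeholders, and it assumes a transformers version with Qwen2.5-VL support plus the qwen-vl-utils package:

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Placeholder checkpoint path: the base model or your SFT/RL export directory.
model_path = "nvidia/Cosmos-Reason1-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Placeholder prompt: a local video plus a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "What is the robot doing in this video?"},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])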

Next Steps#

Refer to the Evaluation Guide for more details on evaluating the performance of the trained model.