Reason1 Training Guide#

This page outlines how to train the Cosmos-Reason1-7B model using sample training datasets provided on Hugging Face.

Note

The training steps outlined below require downloading ~50GB of model and dataset files from Hugging Face. Ensure your ~/.cache directory has enough free space. Alternatively, set the HF_HOME and COSMOS_CACHE environment variables to point to a directory with sufficient storage.

Supervised Fine-Tuning (SFT)#

Supervised Fine-Tuning (SFT) can improve a model's capability on tasks whose distribution is similar to that of the training dataset. For example, training with the robovqa dataset can improve model performance on robotics-focused visual question-answering scenarios.

Note

The default base model for SFT is nvidia/Cosmos-Reason1-7B, which has already been SFT-trained on nvidia/Cosmos-Reason1-SFT-Dataset. We recommend using your own dataset for SFT exploration.

The following command launches SFT training for nvidia/Cosmos-Reason1-7B with TP=2 on two GPUs:

python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-tp2-sft.toml

After training finishes, the DCP checkpoint is saved to $output_dir, along with a Hugging Face-style safetensors export of the model:

[rank1]:[cosmos] 2025-05-16 06:28:46,019 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank1]:[cosmos] 2025-05-16 06:28:46,020 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,998 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - Prepare to exporting safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final at rank 0
[rank0]:[cosmos] 2025-05-16 06:28:55,622 - cosmos - INFO - Saved chunk 0 to 00000.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:03,829 - cosmos - INFO - Saved chunk 1 to 00001.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:11,891 - cosmos - INFO - Saved chunk 2 to 00002.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:21,191 - cosmos - INFO - Saved chunk 3 to 00003.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:22,083 - cosmos - INFO -
[rank0]:
[rank0]:Exported safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final

You can find the SFT model checkpoint in outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final:

root@node:~/ws# ls ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final -la
total 16211328
drwxr-xr-x 2 root root       4096 May 16 06:29 .
drwxr-xr-x 5 root root       4096 May 16 06:28 ..
-rw-r--r-- 1 root root 4171800072 May 16 06:28 00000.safetensors
-rw-r--r-- 1 root root 4195052544 May 16 06:29 00001.safetensors
-rw-r--r-- 1 root root 4195052632 May 16 06:29 00002.safetensors
-rw-r--r-- 1 root root 4022509168 May 16 06:29 00003.safetensors
-rw-r--r-- 1 root root        605 May 16 06:29 added_tokens.json
-rw-r--r-- 1 root root       1049 May 16 06:29 chat_template.json
-rw-r--r-- 1 root root       1459 May 16 06:29 config.json
-rw-r--r-- 1 root root    1671853 May 16 06:29 merges.txt
-rw-r--r-- 1 root root      49611 May 16 06:29 model.safetensors.index.json
-rw-r--r-- 1 root root        575 May 16 06:29 preprocessor_config.json
-rw-r--r-- 1 root root        613 May 16 06:29 special_tokens_map.json
-rw-r--r-- 1 root root   11421896 May 16 06:29 tokenizer.json
-rw-r--r-- 1 root root       5776 May 16 06:29 tokenizer_config.json
-rw-r--r-- 1 root root    2776833 May 16 06:29 vocab.json

To evaluate the improved performance of this SFT model, refer to the Evaluation section.
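If you want to quickly verify the export before running a full evaluation, the exported directory can be loaded directly with Hugging Face transformers. The snippet below is a minimal sketch, assuming a transformers version with Qwen2.5-VL support (Cosmos-Reason1-7B is built on the Qwen2.5-VL architecture); the timestamped path is the example from the run above and will differ for your run.

# verify_sft_export.py -- minimal sanity check (a sketch, not part of the repo)
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Example path from the run above; replace the timestamp with your own.
ckpt = "./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(ckpt)

print(model.config.model_type)                     # architecture name recorded in config.json
print(sum(p.numel() for p in model.parameters()))  # on the order of 8B parameters for the 7B model

If both the model and the processor load without errors, the export is complete and ready to be passed to the evaluation tooling.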

Reinforcement Learning (RL)#

Reinforcement Learning (RL) training can improve the model's reasoning capability on certain tasks when trained on a reasoning dataset.

The following command launches GRPO training for nvidia/Cosmos-Reason1-7B with TP=2 and FSDP=1, along with a rollout of TP=2, using a total of four GPUs:

python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-p-fsdp1-tp2-r-tp2-pp1-grpo.toml

After training completes, the Hugging Face checkpoint is saved to $output_dir. To evaluate the improved reasoning performance of the RL-trained model, refer to the Evaluation section.

Inference#

You can use the inference.py script to run inference with the Cosmos-Reason1 model:

python tools/eval/inference.py

This script is adapted from the Qwen2.5-VL repository.
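For reference, below is a minimal inference sketch in the style of the standard Qwen2.5-VL transformers example; the actual tools/eval/inference.py may differ in its details, and the model path, video path, and prompt here are placeholders.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper utility from the Qwen2.5-VL repo

MODEL_PATH = "nvidia/Cosmos-Reason1-7B"  # or a local SFT/RL checkpoint directory

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Placeholder video and question -- replace with your own clip and prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 4},
            {"type": "text", "text": "What is the robot arm doing? Explain your reasoning."},
        ],
    }
]

# Build model inputs from the chat template and the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])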

Next Steps#

Refer to the Evaluation Guide for more details on evaluating the performance of the trained model.