Reason1 Training Guide#
This page outlines how to train the Cosmos-Reason1-7B model using sample training datasets provided on Hugging Face.
Note
The training steps outlined below require you to download ~50GB of model and dataset files from Hugging Face. Ensure your ~/.cache directory has enough storage space. Alternatively, you can set the HF_HOME and COSMOS_CACHE environment variables to a directory with sufficient storage space.
Supervised Fine-Tuning (SFT)#
Supervised Fine-Tuning (SFT) can improve a model's capability on tasks whose distribution is similar to that of the training dataset. For example, training with the robovqa dataset can improve model performance on robotics-focused visual question-answering scenarios.
Note
The nvidia/Cosmos-Reason1-7B model, which is the default base model for SFT, has already been SFT-trained on nvidia/Cosmos-Reason1-SFT-Dataset. We recommend using your own dataset for SFT exploration.
The following command launches SFT training for nvidia/Cosmos-Reason1-7B with tensor parallelism (TP=2) across two GPUs:
python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-tp2-sft.toml
After training finishes, the DCP checkpoint is saved to $output_dir, along with a Hugging Face-style (safetensors) copy of the model.
[rank1]:[cosmos] 2025-05-16 06:28:46,019 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank1]:[cosmos] 2025-05-16 06:28:46,020 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,998 - cosmos - INFO - [Policy] Step: 95/95, [Policy] Loss: 0.87890625
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - [Policy] Training finished at step 95/95, saving final checkpoint in huggingface safetensors...
[rank0]:[cosmos] 2025-05-16 06:28:45,999 - cosmos - INFO - Prepare to exporting safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final at rank 0
[rank0]:[cosmos] 2025-05-16 06:28:55,622 - cosmos - INFO - Saved chunk 0 to 00000.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:03,829 - cosmos - INFO - Saved chunk 1 to 00001.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:11,891 - cosmos - INFO - Saved chunk 2 to 00002.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:21,191 - cosmos - INFO - Saved chunk 3 to 00003.safetensors
[rank0]:[cosmos] 2025-05-16 06:29:22,083 - cosmos - INFO -
[rank0]:
[rank0]:Exported safetensors to ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final
You can find the SFT model checkpoint in outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final:
root@node:~/ws# ls ./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final -la
total 16211328
drwxr-xr-x 2 root root 4096 May 16 06:29 .
drwxr-xr-x 5 root root 4096 May 16 06:28 ..
-rw-r--r-- 1 root root 4171800072 May 16 06:28 00000.safetensors
-rw-r--r-- 1 root root 4195052544 May 16 06:29 00001.safetensors
-rw-r--r-- 1 root root 4195052632 May 16 06:29 00002.safetensors
-rw-r--r-- 1 root root 4022509168 May 16 06:29 00003.safetensors
-rw-r--r-- 1 root root 605 May 16 06:29 added_tokens.json
-rw-r--r-- 1 root root 1049 May 16 06:29 chat_template.json
-rw-r--r-- 1 root root 1459 May 16 06:29 config.json
-rw-r--r-- 1 root root 1671853 May 16 06:29 merges.txt
-rw-r--r-- 1 root root 49611 May 16 06:29 model.safetensors.index.json
-rw-r--r-- 1 root root 575 May 16 06:29 preprocessor_config.json
-rw-r--r-- 1 root root 613 May 16 06:29 special_tokens_map.json
-rw-r--r-- 1 root root 11421896 May 16 06:29 tokenizer.json
-rw-r--r-- 1 root root 5776 May 16 06:29 tokenizer_config.json
-rw-r--r-- 1 root root 2776833 May 16 06:29 vocab.json
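As a quick sanity check before running evaluation, you can load the exported checkpoint directly with the transformers library. The following is a minimal sketch, assuming the checkpoint keeps the Qwen2.5-VL architecture that Cosmos-Reason1-7B is built on; adjust the timestamped directory to match your own run.
from transformers import AutoConfig, AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Adjust the timestamped directory to match your own training run.
ckpt_dir = "./outputs/cosmos-reason1-7b-tp2-sft/20250516061336/safetensors/final"

config = AutoConfig.from_pretrained(ckpt_dir)
processor = AutoProcessor.from_pretrained(ckpt_dir)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt_dir, torch_dtype="auto", device_map="auto"
)
print(config.model_type)  # expected: qwen2_5_vl
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")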
To evaluate the improved performance of this SFT model, refer to the Evaluation section.
Reinforcement Learning (RL)#
Reinforcement Learning (RL) training can improve the model's reasoning capability on certain tasks when trained on a reasoning dataset.
The following command launches GRPO training for nvidia/Cosmos-Reason1-7B with TP=2 and FSDP=1 for the policy, along with a rollout with TP=2, using a total of four GPUs:
python tools/launch_all.py --config configs/cosmos-reason1/cosmos-reason1-7b-p-fsdp1-tp2-r-tp2-pp1-grpo.toml
After training is done, the Hugging Face-style checkpoint is saved to $output_dir. To evaluate the improved reasoning performance of the RL-trained model, refer to the Evaluation section.
Inference#
You can use the inference.py code snippet to run inference with the Cosmos-Reason1 model:
python tools/eval/inference.py
This code snippet is adapted from the Qwen2.5-VL repo.
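Since the script follows the standard Qwen2.5-VL usage pattern, its core flow looks roughly like the sketch below. This is a minimal illustration rather than the exact script: it assumes the qwen_vl_utils package is installed, and the model path, video path, and prompt are placeholders you should replace with your own.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Point this at nvidia/Cosmos-Reason1-7B or at a checkpoint exported by training.
model_path = "nvidia/Cosmos-Reason1-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# Placeholder video and question; replace with your own inputs.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/sample_video.mp4"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]

# Build the chat prompt, extract the vision inputs, and generate a response.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])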
Next Steps#
Refer to the Evaluation Guide for more details on evaluating the performance of the trained model.