Post-Training Guide#
This guide provides instructions for post-training Cosmos-Reason1 on the SFT/RL datasets using cosmos-rl (refer to the cosmos-rl documentation for more details).
Setup#
Follow the Installation guide to install system dependencies and clone the Cosmos-Reason1 repository from GitHub.
Install the Redis server package, either through your system package manager or through conda:
sudo apt-get install redis-server   # Debian/Ubuntu example
conda install -c conda-forge redis-server
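To confirm that Redis is available on your PATH, you can check its version:
redis-server --version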
Set up the post-training environment:
cd examples/post_training
just install
source .venv/bin/activate
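As an optional sanity check, verify that the virtual environment's Python is active and list the recipes defined in the project's justfile:
which python   # should resolve to .venv/bin/python
just --list    # lists the recipes provided by the justfile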
Monitoring#
We recommend using wandb to monitor training.
Acquire your WANDB_API_KEY.
Log in to wandb:
uv tool install -U wandb
wandb login
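Alternatively, the login can be made non-interactive by exporting your API key first (the value below is a placeholder):
export WANDB_API_KEY=<your-api-key>
wandb login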
During training, the wandb run link appears in the log output:
wandb: 🚀 View run at https://wandb.ai/${WANDB_USER_NAME}/${config.logging.project_name}/runs/20250515101157
Training#
Note
Following the training steps below will trigger a download of around 200 GB of model and dataset files from Hugging Face. Ensure that your ~/.cache directory has enough storage space, or set the HF_HOME and COSMOS_CACHE environment variables to a directory with enough space.
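For example, to redirect both caches to a larger volume (the path below is illustrative):
export HF_HOME=/mnt/large-disk/huggingface
export COSMOS_CACHE=/mnt/large-disk/cosmos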
Supervised Fine-Tuning (SFT)#
SFT training can improve model capability on tasks whose distribution is similar to that of the training dataset. For example, training with the robovqa dataset can improve performance on robotics-focused visual question-answering scenarios.
Configuration#
Configure settings by editing the configs/sft.toml file. The default configuration uses 4 GPUs. Variants include the following:
8 GPUs
[policy.parallelism]
dp_shard_size = 8
Training#
Run training as follows:
cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py
After training completes, the final output checkpoint can be found in the log:
[rank0]:Exported safetensors to ./outputs/sft/20250516061336/safetensors/final
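To sanity-check the export, you can list the reported directory (the timestamped path below is the example from the log line above; substitute the path from your own run):
ls -lh ./outputs/sft/20250516061336/safetensors/final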
Reinforcement Learning (RL)#
RL training can improve the model's reasoning capability on certain tasks using the reasoning training dataset.
Configuration#
Configure settings by editing the configs/rl.toml file. The default configuration uses 4 GPUs. Variants include the following:
8 GPUs
[rollout.parallelism]
tp_size = 4

[policy.parallelism]
dp_shard_size = 4
Training#
Run training as follows:
cosmos-rl --config configs/rl.toml tools/dataset/cosmos_grpo.py
As with SFT training, the path to the final output checkpoint is printed in the log.
Next Steps#
To evaluate the post-trained model, run the Cosmos-Reason1 Benchmark.