Post-Training Guide#

This guide provides instructions for post-training Cosmos-Reason1 on the SFT/RL datasets using cosmos-rl (refer to the cosmos-rl documentation for more details).

Setup#

  1. Follow the Installation guide to install system dependencies and clone the Cosmos-Reason1 repository from GitHub.

  2. Install the redis package:

    pkgm install redis-server
    # or
    conda install -c conda-forge redis-server
    
  3. Set up the post-training environment:

    cd examples/post_training
    just install
    source .venv/bin/activate
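
As a quick sanity check, you can confirm that Redis and the cosmos-rl CLI are both available inside the activated virtual environment (a minimal sketch; the exact version output will differ on your machine):

# Verify that Redis is installed and on the PATH
redis-server --version
# Verify that the cosmos-rl entry point is available in the virtual environment
which cosmos-rl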
    

Monitoring#

We recommend using wandb to monitor training.

  1. Acquire your WANDB_API_KEY.

  2. Log in to wandb:

uv tool install -U wandb
wandb login
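
If you are working on a non-interactive machine, you can authenticate without the prompt: wandb reads the WANDB_API_KEY environment variable and also accepts the key as an argument to wandb login (the key value below is a placeholder):

export WANDB_API_KEY=<your-api-key>
wandb login "$WANDB_API_KEY"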

When training starts, the wandb run link will appear in the logs:

wandb: 🚀 View run at https://wandb.ai/${WANDB_USER_NAME}/${config.logging.project_name}/runs/20250515101157

Training#

Note

Following the training steps below will trigger a download of around 200GB of model and dataset files from Hugging Face. Ensure that your ~/.cache directory has enough storage space, or set the HF_HOME and COSMOS_CACHE environment variables to a directory with enough space.
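
For example, if your home directory is small, both caches can be redirected to a larger volume before launching training (the mount point below is only a placeholder):

# Hypothetical directory on a disk with enough free space
export HF_HOME=/mnt/large_disk/hf_cache
export COSMOS_CACHE=/mnt/large_disk/cosmos_cache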

Supervised Fine-Tuning (SFT)#

SFT training can improve model capability on tasks whose distribution is similar to that of the training dataset. For example, training on the robovqa dataset can improve performance on robotics-focused visual question-answering tasks.

Configuration#

Configure settings by editing the configs/sft.toml file. The default configuration uses 4 GPUs. Variants include the following:

  • 8 GPUs

    [policy.parallelism]
    dp_shard_size = 8
    

Training#

Run training as follows:

cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py
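
On a node with more GPUs than the configured data-parallel size, the run can be pinned to a specific set of devices with the standard CUDA_VISIBLE_DEVICES variable. A minimal sketch for the default 4-GPU configuration, assuming cosmos-rl respects this variable like other PyTorch-based launchers:

# Restrict the run to the first four GPUs to match the default 4-GPU config
CUDA_VISIBLE_DEVICES=0,1,2,3 cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py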

After training completes, the path to the final output checkpoint is reported in the log:

[rank0]:Exported safetensors to ./outputs/sft/20250516061336/safetensors/final
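
The timestamped directory name will differ for each run; one way to confirm the export is to list the final checkpoint directory (the glob below is illustrative):

ls -lh ./outputs/sft/*/safetensors/final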

Reinforcement Learning (RL)#

RL training with the reasoning training dataset can improve the model's reasoning capability on certain tasks.

Configuration#

Configure settings by editing the configs/rl.toml file. The default configuration uses 4 GPUs. Variants include the following:

  • 8 GPUs

    [rollout.parallelism]
    tp_size = 4
    
    [policy.parallelism]
    dp_shard_size = 4
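
Before switching to the 8-GPU variant, it can be worth confirming how many GPUs are actually visible on the node, for example:

# Count the GPUs visible on this machine
nvidia-smi -L | wc -l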
    

Training#

Run training as follows:

cosmos-rl --config configs/rl.toml tools/dataset/cosmos_grpo.py

As with SFT training, the path to the final output checkpoint is reported in the log.

Next Steps#

To evaluate the post-trained model, run the Cosmos-Reason1 Benchmark.