Post-Training Guide#

This guide provides instructions for post-training Cosmos-Reason1 on the SFT/RL datasets using cosmos-rl (refer to the cosmos-rl documentation for more details).

Setup#

  1. Follow the Installation guide to install system dependencies and clone the Cosmos-Reason1 repository from GitHub.

  2. Install the redis package:

    pkgm install redis-server
    # or
    conda install -c conda-forge redis-server
    
  3. Set up the post-training environment:

    cd examples/post_training
    just install
    source .venv/bin/activate
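
As a quick sanity check, you can confirm that Redis and the cosmos-rl CLI are both available inside the activated virtual environment (a minimal sketch; the exact version output will differ on your machine):

# Verify that Redis is installed and on the PATH
redis-server --version
# Verify that the cosmos-rl entry point is available in the virtual environment
which cosmos-rl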
    

Monitoring#

We recommend using wandb to monitor training.

  1. Acquire your WANDB_API_KEY.

  2. Log in to wandb:

uv tool install -U wandb
wandb login
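
If you are working on a non-interactive machine, you can authenticate without the prompt: wandb reads the WANDB_API_KEY environment variable and also accepts the key as an argument to wandb login (the key value below is a placeholder):

export WANDB_API_KEY=<your-api-key>
wandb login "$WANDB_API_KEY"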

When training starts, the wandb run link will appear in the logs:

wandb: 🚀 View run at https://wandb.ai/${WANDB_USER_NAME}/${config.logging.project_name}/runs/20250515101157

Training#

Note

Following the training steps below will trigger a download of around 200GB of model and dataset files from Hugging Face. Ensure that your ~/.cache directory has enough storage space, or set the HF_HOME and COSMOS_CACHE environment variables to a directory with enough space.
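
For example, if your home directory is small, both caches can be redirected to a larger volume before launching training (the mount point below is only a placeholder):

# Hypothetical directory on a disk with enough free space
export HF_HOME=/mnt/large_disk/hf_cache
export COSMOS_CACHE=/mnt/large_disk/cosmos_cache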

Supervised Fine-Tuning (SFT)#

SFT training can improve model capability on tasks whose distribution is similar to that of the training dataset. For example, training on the robovqa dataset can improve performance on robotics-focused visual question-answering tasks.

Configuration#

Configure settings by editing the configs/sft.toml file. The default configuration uses 4 GPUs. Variants include the following:

  • 8 GPUs

    [policy.parallelism]
    dp_shard_size = 8
    

Training#

Run training as follows:

cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py
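
On a node with more GPUs than the configured data-parallel size, the run can be pinned to a specific set of devices with the standard CUDA_VISIBLE_DEVICES variable. A minimal sketch for the default 4-GPU configuration, assuming cosmos-rl respects this variable like other PyTorch-based launchers:

# Restrict the run to the first four GPUs to match the default 4-GPU config
CUDA_VISIBLE_DEVICES=0,1,2,3 cosmos-rl --config configs/sft.toml ./tools/dataset/cosmos_sft.py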

After training completes, the path to the final output checkpoint is reported in the log:

[rank0]:Exported safetensors to ./outputs/sft/20250516061336/safetensors/final
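
The timestamped directory name will differ for each run; one way to confirm the export is to list the final checkpoint directory (the glob below is illustrative):

ls -lh ./outputs/sft/*/safetensors/final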

Reinforcement Learning (RL)#

RL training with the reasoning training dataset can improve the model's reasoning capability on certain tasks.

Configuration#

Configure settings by editing the configs/rl.toml file. The default configuration uses 4 GPUs. Variants include the following:

  • 8 GPUs

    [rollout.parallelism]
    tp_size = 4
    
    [policy.parallelism]
    dp_shard_size = 4
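
Before switching to the 8-GPU variant, it can be worth confirming how many GPUs are actually visible on the node, for example:

# Count the GPUs visible on this machine
nvidia-smi -L | wc -l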
    

Training#

Run training as follows:

cosmos-rl --config configs/rl.toml tools/dataset/cosmos_grpo.py

As with SFT training, the path to the final output checkpoint is reported in the log.

Next Steps#

To evaluate the post-trained model, run the Cosmos-Reason1 Benchmark.