Post-Training Guide#
This page provides a general guide for post-training the Cosmos-Predict2.5 base model.
Note
Ensure you have completed the steps in the Predict2.5 Installation Guide before running post-training.
Configure Hugging Face#
Model checkpoints are automatically downloaded during post-training if they are not present. Configure Hugging Face as follows:
# Login with your Hugging Face token (required for downloading models)
hf auth login
# Set custom cache directory for HF models
# Default: ~/.cache/huggingface
export HF_HOME=/path/to/your/hf/cache
Tip
Ensure you have sufficient disk space in HF_HOME.
Set the Training Output Directory#
Configure the output directory for training checkpoints and artifacts:
# Set output directory for training checkpoints and artifacts
# Default: /tmp/imaginaire4-output
export IMAGINAIRE_OUTPUT_ROOT=/path/to/your/output/directory
Tip
Ensure you have sufficient disk space in IMAGINAIRE_OUTPUT_ROOT.
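If you want to confirm there is enough free space before starting a run, the short Python sketch below checks both directories using only the standard library. It assumes the two environment variables above are already exported and point to existing directories; it is an optional convenience, not part of the Cosmos tooling.
import os
import shutil
# Quick free-space check for the Hugging Face cache and the training output root.
# Assumes HF_HOME and IMAGINAIRE_OUTPUT_ROOT are already exported and exist.
for var in ("HF_HOME", "IMAGINAIRE_OUTPUT_ROOT"):
    path = os.environ.get(var)
    if path and os.path.isdir(path):
        free_gib = shutil.disk_usage(path).free / 1024**3
        print(f"{var}={path}: {free_gib:.1f} GiB free")
    else:
        print(f"{var} is not set or does not point to an existing directory")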
Weights & Biases (wandb) Logging#
By default, training will attempt to log metrics to Weights & Biases. You can choose to either enable or disable wandb logging.
Option 1: Enable wandb Logging#
To enable full experiment tracking with wandb:
Create a free account at wandb.ai
Get your API key from https://wandb.ai/authorize
Set the environment variable:
export WANDB_API_KEY=your_api_key_here
Launch training with the following command:
EXP=your_experiment_name_here
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train \
--config=cosmos_predict2/_src/predict2/configs/video2world/config.py -- \
experiment=${EXP}
Option 2: Disable wandb Logging#
Add job.wandb_mode=disabled to your training command to disable wandb logging:
EXP=your_experiment_name_here
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train \
--config=cosmos_predict2/_src/predict2/configs/video2world/config.py -- \
experiment=${EXP} \
job.wandb_mode=disabled
Checkpointing#
Training uses two checkpoint formats, each optimized for different use cases:
1. Distributed Checkpoint (DCP) Format#
Primary format for training checkpoints.
Structure: Multi-file directory with sharded model weights
Used for: Saving checkpoints during training, resuming training
Advantages:
Efficient parallel I/O for multi-GPU training
Supports FSDP (Fully Sharded Data Parallel)
Optimized for distributed workloads
Example directory structure:
checkpoints/
├── iter_{NUMBER}/
│ ├── model/
│ │ ├── .metadata
│ │ └── __0_0.distcp
│ ├── optim/
│ ├── scheduler/
│ └── trainer/
└── latest_checkpoint.txt
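If you want to see which tensors a DCP checkpoint contains without loading the full weights, PyTorch's torch.distributed.checkpoint reader can list the stored keys from the .metadata file. The sketch below is a minimal example, not part of the Cosmos training scripts, and the iteration directory name is an assumed placeholder.
# Minimal sketch: list the tensor keys stored in a DCP "model/" directory.
# The iteration path below is an assumed example; substitute your own iter_{NUMBER} directory.
from torch.distributed.checkpoint import FileSystemReader

reader = FileSystemReader("checkpoints/iter_000001000/model")
metadata = reader.read_metadata()
for key in sorted(metadata.state_dict_metadata):
    print(key)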
2. Consolidated PyTorch (.pt) Format#
Single-file format for inference and distribution.
Structure: Single .pt file containing the complete model state
Used for: Inference, model sharing, initial post-training
Advantages:
Easy to distribute and version control
Standard PyTorch format
Simpler for single-GPU workflows
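To inspect a consolidated checkpoint before starting post-training, you can open it with plain torch.load. The sketch below uses the consolidated path shown in the Loading Checkpoints section; the exact layout of the loaded dictionary (for example, whether weights sit under a nested key) depends on the release, so treat the printed keys as a starting point rather than a guaranteed schema.
# Minimal sketch: inspect a consolidated .pt checkpoint on CPU.
import torch

state = torch.load(
    "checkpoints/nvidia/Cosmos-Predict2.5-2B/consolidated/model.pt",
    map_location="cpu",
    weights_only=False,  # newer PyTorch defaults to weights_only=True; relax this only for trusted files
)
print(type(state))
if isinstance(state, dict):
    # Print a handful of top-level keys to see how the state is organized.
    for key in list(state)[:10]:
        print(key)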
Loading Checkpoints#
The training system supports loading from both formats:
Load DCP checkpoint (for resuming training):
load_path="checkpoints/nvidia/Cosmos-Predict2.5-2B/dcp"
Load consolidated checkpoint (for starting post-training):
load_path="checkpoints/nvidia/Cosmos-Predict2.5-2B/consolidated/model.pt"
Note
When you download pretrained models from Hugging Face, they are typically in consolidated .pt format. The training system will automatically load this format and begin training.
Saving Checkpoints#
All checkpoints saved during training use DCP format. This ensures:
Consistent checkpoint structure across training runs
Optimal performance for distributed training
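Because training always saves DCP, you may want a single-file .pt afterwards for inference or sharing. Recent PyTorch releases (roughly 2.2 and later) ship a generic conversion helper; the sketch below uses that helper and is not a Cosmos-specific tool, so check the project's own export utilities first. The paths are assumed examples.
# Minimal sketch: consolidate a sharded DCP "model/" directory into a single .pt file.
# Uses PyTorch's generic converter; paths are assumed placeholders.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save(
    "checkpoints/iter_000001000/model",    # sharded DCP checkpoint directory
    "checkpoints/iter_000001000/model.pt"  # consolidated output file
)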