Post-Training Guide#
This page provides a general guide for post-training the Cosmos-Predict2.5 base model.
Note
Ensure you have completed the steps in the Predict2.5 Installation Guide before running post-training.
Configure Hugging Face#
Model checkpoints are automatically downloaded during post-training if they are not present. Configure Hugging Face as follows:
# Login with your Hugging Face token (required for downloading models)
hf auth login
# Set custom cache directory for HF models
# Default: ~/.cache/huggingface
export HF_HOME=/path/to/your/hf/cache
Tip
Ensure you have sufficient disk space in HF_HOME.
Set the Training Output Directory#
Configure the output directory for training checkpoints and artifacts:
# Set output directory for training checkpoints and artifacts
# Default: /tmp/imaginaire4-output
export IMAGINAIRE_OUTPUT_ROOT=/path/to/your/output/directory
Tip
Ensure you have sufficient disk space in IMAGINAIRE_OUTPUT_ROOT.
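If you want to confirm there is enough free space before starting a run, the short Python sketch below checks both directories using only the standard library. It assumes the two environment variables above are already exported and point to existing directories; it is an optional convenience, not part of the Cosmos tooling.
import os
import shutil
# Quick free-space check for the Hugging Face cache and the training output root.
# Assumes HF_HOME and IMAGINAIRE_OUTPUT_ROOT are already exported and exist.
for var in ("HF_HOME", "IMAGINAIRE_OUTPUT_ROOT"):
    path = os.environ.get(var)
    if path and os.path.isdir(path):
        free_gib = shutil.disk_usage(path).free / 1024**3
        print(f"{var}={path}: {free_gib:.1f} GiB free")
    else:
        print(f"{var} is not set or does not point to an existing directory")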
Weights & Biases (wandb) Logging#
By default, training will attempt to log metrics to Weights & Biases. You can choose to either enable or disable wandb logging.
Option 1: Enable wandb Logging#
To enable full experiment tracking with wandb:
Create a free account at wandb.ai
Get your API key from https://wandb.ai/authorize
Set the environment variable:
export WANDB_API_KEY=your_api_key_here
Launch training with the following command:
EXP=your_experiment_name_here
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train \
--config=cosmos_predict2/_src/predict2/configs/video2world/config.py -- \
experiment=${EXP}
Option 2: Disable wandb Logging#
Add job.wandb_mode=disabled to your training command to disable wandb logging:
EXP=your_experiment_name_here
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train \
--config=cosmos_predict2/_src/predict2/configs/video2world/config.py -- \
experiment=${EXP} \
job.wandb_mode=disabled
Checkpointing#
Training uses two checkpoint formats, each optimized for different use cases:
1. Distributed Checkpoint (DCP) Format#
Primary format for training checkpoints.
Structure: Multi-file directory with sharded model weights
Used for: Saving checkpoints during training, resuming training
Advantages:
Efficient parallel I/O for multi-GPU training
Supports FSDP (Fully Sharded Data Parallel)
Optimized for distributed workloads
Example directory structure:
checkpoints/
├── iter_{NUMBER}/
│ ├── model/
│ │ ├── .metadata
│ │ └── __0_0.distcp
│ ├── optim/
│ ├── scheduler/
│ └── trainer/
└── latest_checkpoint.txt
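If you want to see which tensors a DCP checkpoint contains without loading the full weights, PyTorch's torch.distributed.checkpoint reader can list the stored keys from the .metadata file. The sketch below is a minimal example, not part of the Cosmos training scripts, and the iteration directory name is an assumed placeholder.
# Minimal sketch: list the tensor keys stored in a DCP "model/" directory.
# The iteration path below is an assumed example; substitute your own iter_{NUMBER} directory.
from torch.distributed.checkpoint import FileSystemReader

reader = FileSystemReader("checkpoints/iter_000001000/model")
metadata = reader.read_metadata()
for key in sorted(metadata.state_dict_metadata):
    print(key)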
2. Consolidated PyTorch (.pt) Format#
Single-file format for inference and distribution.
Structure: Single .pt file containing the complete model state
Used for: Inference, model sharing, initial post-training
Advantages:
Easy to distribute and version control
Standard PyTorch format
Simpler for single-GPU workflows
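To inspect a consolidated checkpoint before starting post-training, you can open it with plain torch.load. The sketch below uses the consolidated path shown in the Loading Checkpoints section; the exact layout of the loaded dictionary (for example, whether weights sit under a nested key) depends on the release, so treat the printed keys as a starting point rather than a guaranteed schema.
# Minimal sketch: inspect a consolidated .pt checkpoint on CPU.
import torch

state = torch.load(
    "checkpoints/nvidia/Cosmos-Predict2.5-2B/consolidated/model.pt",
    map_location="cpu",
    weights_only=False,  # newer PyTorch defaults to weights_only=True; relax this only for trusted files
)
print(type(state))
if isinstance(state, dict):
    # Print a handful of top-level keys to see how the state is organized.
    for key in list(state)[:10]:
        print(key)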
Loading Checkpoints#
The training system supports loading from both formats:
Load DCP checkpoint (for resuming training):
load_path="checkpoints/nvidia/Cosmos-Predict2.5-2B/dcp"
Load consolidated checkpoint (for starting post-training):
load_path="checkpoints/nvidia/Cosmos-Predict2.5-2B/consolidated/model.pt"
Note
When you download pretrained models from Hugging Face, they are typically in consolidated .pt format. The training system will automatically load this format and begin training.
Saving Checkpoints#
All checkpoints saved during training use DCP format. This ensures:
Consistent checkpoint structure across training runs
Optimal performance for distributed training
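Because training always saves DCP, you may want a single-file .pt afterwards for inference or sharing. Recent PyTorch releases (roughly 2.2 and later) ship a generic conversion helper; the sketch below uses that helper and is not a Cosmos-specific tool, so check the project's own export utilities first. The paths are assumed examples.
# Minimal sketch: consolidate a sharded DCP "model/" directory into a single .pt file.
# Uses PyTorch's generic converter; paths are assumed placeholders.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save(
    "checkpoints/iter_000001000/model",    # sharded DCP checkpoint directory
    "checkpoints/iter_000001000/model.pt"  # consolidated output file
)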