Autoregressive Post-Training Guide#

Post-training a Cosmos Autoregressive-based WFM allows you to train the model to generate videos that are more specific to your Physical AI use case. For example, if you want to generate action sequences for a specific robot, you can post-train the model to generate videos that are more aligned with typical actions/outcomes for that robot.

This page walks you through post-training the Cosmos Autoregressive base model using the Cosmos-Predict1 repository.

Supported Models#

Currently, post-training is supported for the Cosmos-Predict1-4B and Cosmos-Predict1-12B base models. The Video2World variants are not yet supported.

Setup#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict1 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Download the Cosmos-Predict1-4B model weights from Hugging Face:

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_autoregressive_checkpoints.py --model_sizes 4B
    

Preparing a Dataset#

The first step is to prepare a dataset. Post-training a Cosmos-Predict1 model allows you to generate videos of a specific subject in new environments using a collection of input videos of that same subject as reference material.

You must provide a folder containing a collection of MP4 videos with RGB color, preferably at 720p resolution. Each video should keep the subject in frame for its full duration so that every video chunk contains the subject.
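
Before training, it can be worth sanity-checking the input folder. The following is a minimal sketch, not part of the Cosmos-Predict1 repo, that flags unreadable or low-resolution files; it assumes opencv-python is installed, and the folder path is a placeholder:

# Sanity-check a folder of MP4s before post-training.
# Illustrative sketch only; assumes `pip install opencv-python`.
from pathlib import Path

import cv2

def check_videos(folder: str, min_height: int = 720) -> None:
    for path in sorted(Path(folder).glob("*.mp4")):
        cap = cv2.VideoCapture(str(path))
        if not cap.isOpened():
            print(f"UNREADABLE: {path.name}")
            continue
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
        status = "OK" if height >= min_height else "LOW-RES"
        print(f"{status}: {path.name} ({width}x{height}, {frames} frames)")

check_videos("path/to/your/videos")  # placeholder path; point at your folder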

The Cosmos-Predict1-4B example below uses a sample dataset for post-training.

Post-Training#

Download the Sample Dataset#

This example uses the nvidia/Cosmos-NeMo-Assets sample dataset for post-training. Use the following command to download this dataset:

mkdir -p datasets/cosmos_nemo_assets/

# Download the sample videos used for post-training
huggingface-cli download nvidia/Cosmos-NeMo-Assets --repo-type dataset --local-dir datasets/cosmos_nemo_assets/ --include "*.mp4*"

mv datasets/cosmos_nemo_assets/nemo_diffusion_example_data datasets/cosmos_nemo_assets/videos

Post-Train the Model#

You can post-train the Cosmos-Predict1-4B model using one or multiple GPUs. The steps for each option are described below.

Post-Train with a Single GPU (TP=1)#

The videos in the nvidia/Cosmos-NeMo-Assets sample dataset will be scaled to a lower resolution by the dataloader to fit on a single GPU.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=1 -m cosmos_predict1.autoregressive.train --config=cosmos_predict1/autoregressive/configs/config.py -- experiment=base_4b_example_tealrobotsmall_tp1

To learn more about how the dataloader works and is registered, refer to VideoDataset defined in cosmos_predict1/autoregressive/datasets/video_dataset.py and register_training_data defined in cosmos_predict1/autoregressive/configs/registry.py.
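
For orientation, here is a stripped-down sketch of what a dataset along those lines looks like. The class below is illustrative only, not the actual VideoDataset implementation, and it assumes torchvision is available for video decoding:

# Illustrative toy stand-in for VideoDataset; not the cosmos_predict1 implementation.
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_video

class ToyVideoDataset(Dataset):
    def __init__(self, video_dir: str, num_frames: int = 33):
        self.paths = sorted(Path(video_dir).glob("*.mp4"))
        self.num_frames = num_frames  # length of the training chunk (assumed value)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Decode the clip and keep a fixed-length chunk of frames (T, H, W, C).
        frames, _, _ = read_video(str(self.paths[idx]), pts_unit="sec", output_format="THWC")
        return frames[: self.num_frames]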

The job config determines where the checkpoints will be saved.

base_4b_example_tealrobotsmall_tp1 = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="autoregressive_base",
            name="base_4b_example_tealrobotsmall_tp1",
        ),
        ...
    )
)

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. In this example, the file structure is generated as follows:

checkpoints/posttraining/autoregressive_base/base_4b_example_tealrobotsmall_tp1/checkpoints/
├── iter_{NUMBER}.pt
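
Equivalently, you can derive the save directory from the job config fields; a small illustration:

import os

# How the checkpoint directory maps onto the job config fields.
output_root = os.environ.get("OUTPUT_ROOT", "checkpoints")
job = dict(project="posttraining", group="autoregressive_base",
           name="base_4b_example_tealrobotsmall_tp1")
print(os.path.join(output_root, job["project"], job["group"], job["name"], "checkpoints"))
# checkpoints/posttraining/autoregressive_base/base_4b_example_tealrobotsmall_tp1/checkpoints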

Post-Train with 4 GPUs (TP=4)#

The model can also be post-trained with multiple GPUs using tensor parallelism. Run the following command to execute a post-training job with the nvidia/Cosmos-NeMo-Assets sample dataset at a higher resolution.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=4 -m cosmos_predict1.autoregressive.train --config=cosmos_predict1/autoregressive/configs/config.py -- experiment=base_4b_example_tealrobot_tp4

The job config determines where the checkpoints will be saved.

base_4b_example_tealrobot_tp4 = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="autoregressive_base",
            name="base_4b_example_tealrobotsmall_tp4",
        ),
        ...
    )
)

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. With TP=4, each saved iteration produces one model shard per tensor-parallel rank. In this example, the file structure is generated as follows:

checkpoints/posttraining/autoregressive_base/base_4b_example_tealrobot_tp4/checkpoints/
├── iter_{NUMBER}.pt
├── iter_{NUMBER}_model_mp_0.pt
├── iter_{NUMBER}_model_mp_1.pt
├── iter_{NUMBER}_model_mp_2.pt
├── iter_{NUMBER}_model_mp_3.pt

Test Inference#

Follow these steps to perform inference with the post-trained model.

  1. Copy the post-trained checkpoint to checkpoints/Cosmos-Predict1-4B-Base_post-trained/model.pt.

    The following commands apply to TP=1 post-training with 1000 iterations:

    # copy checkpoint to the designated location
    mkdir -p checkpoints/Cosmos-Predict1-4B-Base_post-trained/
    cp checkpoints/posttraining/autoregressive_base/base_4b_example_tealrobotsmall_tp1/checkpoints/iter_000001000.pt checkpoints/Cosmos-Predict1-4B-Base_post-trained/model.pt
    

    For TP=4 post-training, the checkpoints are sharded across tensor-parallel ranks and must first be merged into a single checkpoint for inference (a quick way to sanity-check the resulting file is sketched after these steps):

    # merge tensor parallel model shards
    mkdir -p checkpoints/Cosmos-Predict1-4B-Base_post-trained/
    python scripts/merge_autoregressive_tp_checkpoints.py --checkpoint_path checkpoints/posttraining/autoregressive_base/base_4b_example_tealrobot_tp4/checkpoints/iter_000001000.pt --output_path checkpoints/Cosmos-Predict1-4B-Base_post-trained/model.pt --model_size 4b --tensor_parallel_size 4
    
  2. Run inference with the post-trained Cosmos-Predict1-4B model using the --ar_model_dir argument.

    NUM_GPUS=<NUM_GPUS>
    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/autoregressive/inference/base.py \
       --num_gpus ${NUM_GPUS} \
       --checkpoint_dir checkpoints \
       --ar_model_dir Cosmos-Predict1-4B-Base_post-trained \
       --input_type video \
       --input_image_or_video_path datasets/cosmos_nemo_assets/videos/output_oige_render_view_sub.mp4  \
       --top_p 0.8 \
       --temperature 1.0 \
       --offload_diffusion_decoder \
       --offload_tokenizer \
       --video_save_name autoregressive-4b-post-train
    
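Before running inference, you can optionally confirm that the copied or merged checkpoint loads cleanly. A minimal sketch, assuming the .pt file holds a state dict (possibly wrapped under a "model" key; the exact layout may differ):

import torch

ckpt = torch.load("checkpoints/Cosmos-Predict1-4B-Base_post-trained/model.pt", map_location="cpu")
# Unwrap a "model" key if present; the exact checkpoint layout may differ.
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
num_params = sum(v.numel() for v in state.values() if isinstance(v, torch.Tensor))
print(f"{len(state)} entries, ~{num_params / 1e9:.2f}B tensor parameters")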

You can also try other inference options with the post-trained model using the same --ar_model_dir argument. Refer to the Autoregressive Model Reference page for more examples.