Diffusion Model Post-Training Guide#

The post-training process allows you to train a Cosmos Diffusion model to generate videos that are specific to your Physical AI use case. For example, if you want to generate action sequences for a specific robot, you can post-train the model to generate videos that are more aligned with typical actions/outcomes for that robot.

This page will walk you through the post-training process for the following Cosmos Diffusion models:

  • Cosmos-Predict1-7B-Text2World

  • Cosmos-Predict1-7B-Video2World

  • Cosmos-Predict1-7B-Text2World-Multiview

  • Cosmos-Predict1-7B-Video2World-Multiview

Setup#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict1 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Download the weights from Hugging Face for the Cosmos model you want to post-train:

    • Cosmos-Predict1-7B-Text2World

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Text2World
    
    • Cosmos-Predict1-7B-Video2World

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Video2World
    
    • Cosmos-Predict1-7B-Text2World-Multiview

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Text2World-Multiview
    
    • Cosmos-Predict1-7B-Video2World-Multiview

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Video2World-Multiview
    

Preparing a Dataset#

The first step is to prepare a dataset. Post-training a Cosmos-Predict1 model allows you to generate videos of a specific subject in new environments using a collection of input videos of that same subject as reference material.

You must provide a folder containing a collection of videos in MP4 format with RGB color, preferably at 720p resolution. The subject should remain in frame for the entire duration of each video so that every extracted clip contains the subject.
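
If you are unsure whether a clip meets these requirements, a quick resolution and frame-count check can save a failed training run. The following is a minimal sketch, assuming OpenCV is installed (pip install opencv-python); the dataset path is a placeholder for your own folder:

# sanity_check_videos.py -- minimal sketch; dataset path is a placeholder
from pathlib import Path

import cv2

VIDEO_DIR = Path("datasets/my_dataset/videos")  # change to your folder

for mp4 in sorted(VIDEO_DIR.glob("*.mp4")):
    cap = cv2.VideoCapture(str(mp4))
    if not cap.isOpened():
        print(f"UNREADABLE: {mp4.name}")
        continue
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    # The example training configs below sample 121 frames per clip.
    status = "OK" if (height >= 720 and frames >= 121) else "CHECK"
    print(f"{status}: {mp4.name} {width}x{height}, {frames} frames")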

The following examples use sample datasets for post-training.

Post-Training#

Text2World#

Download the Sample Dataset#

This example uses the nvidia/Cosmos-NeMo-Assets sample dataset for post-training. Use the following command to download this dataset:

mkdir -p datasets/cosmos_nemo_assets/

# This command will download the videos for physical AI
huggingface-cli download nvidia/Cosmos-NeMo-Assets --repo-type dataset --local-dir datasets/cosmos_nemo_assets/ --include "*.mp4*"

mv datasets/cosmos_nemo_assets/nemo_diffusion_example_data datasets/cosmos_nemo_assets/videos

Preprocess the Data#

The second step is to preprocess the input videos.

Run the following command to pre-compute the T5-XXL embeddings for the video captions used for post-training:

# The script will use the provided prompt and save the T5-XXL embeddings in pickle format.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/get_t5_embeddings_from_cosmos_nemo_assets.py --dataset_path datasets/cosmos_nemo_assets --prompt "A video of sks teal robot."

This command generates preprocessed data in the following directory structure:

datasets/cosmos_nemo_assets/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle
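
Each .pickle file holds the T5-XXL embedding for the corresponding caption. If you want to spot-check one, the following is a minimal sketch; the exact layout of the pickled object (for example, a list containing a single token-by-channel array) is an assumption, so the script just prints whatever structure it finds:

# inspect_t5_embedding.py -- minimal sketch; the pickle layout is an assumption
import pickle
from pathlib import Path

pickle_files = sorted(Path("datasets/cosmos_nemo_assets/t5_xxl").glob("*.pickle"))
with open(pickle_files[0], "rb") as f:
    embedding = pickle.load(f)

# Print the structure without assuming too much about it.
print(type(embedding))
if isinstance(embedding, (list, tuple)):
    for item in embedding:
        print(getattr(item, "shape", item))
else:
    print(getattr(embedding, "shape", embedding))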

Post-Train the Model#

Run the following commands to post-train the Text2World model using the cosmos_nemo_assets dataset.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=8 -m cosmos_predict1.diffusion.training.train --config=cosmos_predict1/diffusion/training/config/config.py -- experiment=text2world_7b_example_cosmos_nemo_assets

The parameters for the dataloader are specified in the cosmos_predict1/diffusion/training/config/text2world/experiment.py file:

num_frames = 121
example_video_dataset_cosmos_nemo_assets = L(Dataset)(
    dataset_dir="datasets/cosmos_nemo_assets",
    sequence_interval=1,
    num_frames=num_frames,
    video_size=(720, 1280),
    start_frame_interval=1,
)

dataloader_train_cosmos_nemo_assets = L(DataLoader)(
    dataset=example_video_dataset_cosmos_nemo_assets,
    sampler=L(get_sampler)(dataset=example_video_dataset_cosmos_nemo_assets),
    batch_size=1,
    drop_last=True,
)
...

text2world_7b_example_cosmos_nemo_assets = LazyDict(
    dict(
        ...
        dataloader_train=dataloader_train_cosmos_nemo_assets,
        ...
    )
)
...

This file also contains the job configuration:

text2world_7b_example_cosmos_nemo_assets = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="diffusion_text2world",
            name="text2world_7b_example_cosmos_nemo_assets",
        ),
        ...
    )
)
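
To post-train on your own data instead of cosmos_nemo_assets, you can follow the same pattern and add another experiment entry to this file. The following is a sketch only: it reuses the names shown above (L, Dataset, DataLoader, get_sampler, LazyDict), the dataset path and experiment name are placeholders, and any fields elided with ... above must be copied from the existing text2world_7b_example_cosmos_nemo_assets entry:

# Hypothetical variation on the example above; copy the full existing entry
# and change only the dataset directory, dataloader, and job name.
my_video_dataset = L(Dataset)(
    dataset_dir="datasets/my_robot_videos",  # placeholder path
    sequence_interval=1,
    num_frames=num_frames,
    video_size=(720, 1280),
    start_frame_interval=1,
)

dataloader_train_my_videos = L(DataLoader)(
    dataset=my_video_dataset,
    sampler=L(get_sampler)(dataset=my_video_dataset),
    batch_size=1,
    drop_last=True,
)

text2world_7b_my_videos = LazyDict(
    dict(
        # ... remaining fields copied from text2world_7b_example_cosmos_nemo_assets ...
        dataloader_train=dataloader_train_my_videos,
        job=dict(
            project="posttraining",
            group="diffusion_text2world",
            name="text2world_7b_my_videos",
        ),
    )
)

You would then select such an entry with experiment=text2world_7b_my_videos in the training command above, assuming it is registered the same way as the existing experiments in this file.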

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. In this example, the file structure is generated as follows:

checkpoints/posttraining/diffusion_text2world/text2world_7b_example_cosmos_nemo_assets/checkpoints/
├── iter_{NUMBER}_reg_model.pt
├── iter_{NUMBER}_ema_model.pt
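
Checkpoints are written periodically, so several iter_*_reg_model.pt and iter_*_ema_model.pt files may accumulate in this folder. The following is a minimal sketch for locating the most recent EMA checkpoint under the directory layout shown above:

# latest_ema_checkpoint.py -- minimal sketch for the layout shown above
from pathlib import Path

ckpt_dir = Path(
    "checkpoints/posttraining/diffusion_text2world/"
    "text2world_7b_example_cosmos_nemo_assets/checkpoints"
)

# Iteration numbers are zero-padded (e.g. iter_000001000_ema_model.pt),
# so lexicographic order matches iteration order.
ema_checkpoints = sorted(ckpt_dir.glob("iter_*_ema_model.pt"))
print("Latest EMA checkpoint:", ema_checkpoints[-1])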

Test Inference#

Follow these steps to perform inference with the post-trained model.

  1. Copy the checkpoint file to a new location (in this example, checkpoints/Cosmos-Predict1-7B-Text2World_post-trained/model.pt) so it can be referenced by the inference script.

    mkdir checkpoints/Cosmos-Predict1-7B-Text2World_post-trained/
    cp checkpoints/posttraining/diffusion_text2world/text2world_7b_example_cosmos_nemo_assets/checkpoints/iter_000001000_ema_model.pt checkpoints/Cosmos-Predict1-7B-Text2World_post-trained/model.pt
    

    Note: The above checkpoint file is named iter_000001000_ema_model.pt because it is an EMA (Exponential Moving Average) checkpoint saved after 1,000 training iterations; the filename will change if post-training is configured differently.

  2. Run inference with the post-trained Text2World model using the --diffusion_transformer_dir argument.

    # Run the video generation command with a single gpu
    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
       --diffusion_transformer_dir Cosmos-Predict1-7B-Text2World_post-trained \
       --prompt "${PROMPT}" \
       --video_save_name diffusion-text2world-7b-post-train \
       --offload_prompt_upsampler
    

You can also test out other Text2World inference options with the --diffusion_transformer_dir argument. Refer to the Model Reference page for more examples.

Video2World#

Download the Sample Dataset#

This example uses a subset of the HD-VILA-100M dataset for post-training. Follow these steps to download the subset:

  1. Download the metadata containing the video URLs and captions:

    mkdir -p datasets/hdvila
    cd datasets/hdvila
    wget https://huggingface.co/datasets/TempoFunk/hdvila-100M/resolve/main/hdvila-100M.jsonl
    
  2. Install the prerequisites required for downloading the sample videos:

    pip install pytubefix ffmpeg
    
  3. Use the download_diffusion_example_data.py script to download the videos and save the corresponding clips, captions, and metadata.

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_example_data.py --dataset_path datasets/hdvila --N_videos 128 --do_download --do_clip
    

Preprocess the Data#

The second step is to preprocess the input videos.

Run the following command to pre-compute the T5-XXL embeddings for the video captions used for post-training:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/get_t5_embeddings.py --dataset_path datasets/hdvila

This command generates preprocessed data in the following directory structure:

datasets/hdvila/
├── metas/
│   ├── *.json
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle
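
Before launching training, it is worth confirming that every video has a matching caption and embedding file. The following is a minimal sketch that simply compares file stems across the three folders shown above, assuming (as the preprocessing output suggests) that captions and embeddings share the video filename stem:

# check_dataset_pairs.py -- minimal sketch; compares file stems across folders
from pathlib import Path

root = Path("datasets/hdvila")
videos = {p.stem for p in (root / "videos").glob("*.mp4")}
captions = {p.stem for p in (root / "metas").glob("*.txt")}
embeddings = {p.stem for p in (root / "t5_xxl").glob("*.pickle")}

print(f"{len(videos)} videos, {len(captions)} captions, {len(embeddings)} embeddings")
print("Videos missing a caption:", sorted(videos - captions))
print("Videos missing an embedding:", sorted(videos - embeddings))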

Post-Train the Model#

Run the following commands to post-train the Video2World model using the hdvila dataset.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=8 -m cosmos_predict1.diffusion.training.train --config=cosmos_predict1/diffusion/training/config/config.py -- experiment=video2world_7b_example_hdvila

The parameters for the dataloader are specified in the cosmos_predict1/diffusion/training/config/video2world/experiment.py file:

num_frames = 121
example_video_dataset = L(Dataset)(
    dataset_dir="datasets/hdvila",
    sequence_interval=1,
    num_frames=num_frames,
    video_size=(720, 1280),
    start_frame_interval=1,
)

dataloader_train = L(DataLoader)(
    dataset=example_video_dataset,
    sampler=L(get_sampler)(dataset=example_video_dataset),
    batch_size=1,
    drop_last=True,
)
...

video2world_7b_example_hdvila = LazyDict(
    dict(
        ...
        dataloader_train=dataloader_train,
        ...
    )
)
...

This file also contains the job configuration:

video2world_7b_example_hdvila = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="diffusion_video2world",
            name="video2world_7b_example_hdvila",
        ),
        ...
    )
)

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. In this example, the file structure is generated as follows:

checkpoints/posttraining/diffusion_video2world/video2world_7b_example_hdvila/checkpoints/
├── iter_{NUMBER}_reg_model.pt
├── iter_{NUMBER}_ema_model.pt

Test Inference#

Follow these steps to perform inference with the post-trained model.

  1. Copy the checkpoint file to a new location (in this example, checkpoints/Cosmos-Predict1-7B-Video2World_post-trained/model.pt) so it can be referenced by the inference script.

    mkdir checkpoints/Cosmos-Predict1-7B-Video2World_post-trained/
    cp checkpoints/posttraining/diffusion_video2world/video2world_7b_example_hdvila/checkpoints/iter_000001000_ema_model.pt checkpoints/Cosmos-Predict1-7B-Video2World_post-trained/model.pt
    

    Note: The above checkpoint file is named iter_000001000_ema_model.pt because it is an EMA (Exponential Moving Average) checkpoint saved after 1,000 training iterations; the filename will change if post-training is configured differently.

  2. Run inference with the post-trained Video2World model using the --diffusion_transformer_dir argument.

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
       --checkpoint_dir checkpoints \
       --diffusion_transformer_dir Cosmos-Predict1-7B-Video2World_post-trained \
       --input_image_or_video_path assets/diffusion/video2world_input0.jpg \
       --num_input_frames 1 \
       --offload_prompt_upsampler \
       --video_save_name diffusion-video2world-7b-post-trained
    

You can also test out other Video2World inference options with the --diffusion_transformer_dir argument. Refer to the Model Reference page for more examples.

Text2World-Multiview#

Post-Train the Model#

Run the following command to execute an example post-training job with the mock data.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=8 -m cosmos_predict1.diffusion.training.train --config=cosmos_predict1/diffusion/training/config/config_multiview.py -- experiment=text2world_multiview_7b_example

The dataloader parameters and job configuration are specified in the cosmos_predict1/diffusion/training/config/text2world_multiview/experiment.py file. The job configuration is shown below:

text2world_multiview_7b_example = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="diffusion_text2world",
            name="text2world_multiview_7b_example",
        ),
        ...
    )
)

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. In this example, the file structure is generated as follows:

checkpoints/posttraining/diffusion_text2world/text2world_multiview_7b_example/checkpoints/
├── iter_{NUMBER}_reg_model.pt
├── iter_{NUMBER}_ema_model.pt

Test Inference#

Follow these steps to perform inference with the post-trained model.

  1. Copy the checkpoint file to a new location (in this example, checkpoints/Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview_post-trained/model.pt) so it can be referenced by the inference script.

    mkdir checkpoints/Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview_post-trained/
    cp checkpoints/posttraining/diffusion_text2world/text2world_multiview_7b_example/checkpoints/iter_000001000_ema_model.pt checkpoints/Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview_post-trained/model.pt
    

    Note: The above checkpoint file is named iter_000001000_ema_model.pt because it is an EMA (Exponential Moving Average) checkpoint saved after 1,000 training iterations; the filename will change if post-training is configured differently.

  2. Run inference with the post-trained Text2World-Multiview model using the --diffusion_transformer_dir argument.

    # Run the video generation command with a single gpu
    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
       --checkpoint_dir checkpoints \
       --diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview_post-trained \
       --prompt "${PROMPT}" \
       --prompt_left "${PROMPT_LEFT}" \
       --prompt_right "${PROMPT_RIGHT}" \
       --prompt_back "${PROMPT_BACK}" \
       --prompt_back_left "${PROMPT_BACK_LEFT}" \
       --prompt_back_right "${PROMPT_BACK_RIGHT}" \
       --video_save_name diffusion-text2world-multiview-7b-post-train
    

You can also test out other Text2World-Multiview inference options with the --diffusion_transformer_dir argument. Refer to the Model Reference page for more examples.

Video2World-Multiview#

Post-Train the Model#

Run the following command to execute an example post-training job with the mock data.

export OUTPUT_ROOT=checkpoints # default value
torchrun --nproc_per_node=8 -m cosmos_predict1.diffusion.training.train --config=cosmos_predict1/diffusion/training/config/config_multiview.py -- experiment=video2world_multiview_7b_example

The dataloader parameters and job configuration are specified in the cosmos_predict1/diffusion/training/config/video2world_multiview/experiment.py file. The job configuration is shown below:

video2world_multiview_7b_example = LazyDict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="diffusion_video2world",
            name="video2world_multiview_7b_example",
        ),
        ...
    )
)

The checkpoints will be saved to ${OUTPUT_ROOT}/PROJECT/GROUP/NAME. In this example, the file structure is generated as follows:

checkpoints/posttraining/diffusion_video2world/video2world_multiview_7b_example/checkpoints/
├── iter_{NUMBER}_reg_model.pt
├── iter_{NUMBER}_ema_model.pt

Test Inference#

Follow these steps to perform inference with the post-trained model.

  1. Copy the checkpoint file to a new location (in this example, checkpoints/Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview_post-trained/model.pt) so it can be referenced by the inference script.

    # copy checkpoint to the designated location
    mkdir checkpoints/Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview_post-trained/
    cp checkpoints/posttraining/diffusion_video2world/video2world_multiview_7b_example/checkpoints/iter_000001000_ema_model.pt checkpoints/Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview_post-trained/model.pt
    

    Note: The above checkpoint file is named iter_000001000_ema_model.pt because it is an EMA (Exponential Moving Average) checkpoint saved after 1,000 training iterations; the filename will change if post-training is configured differently.

  2. Run inference with the post-trained Video2World-Multiview model using the --diffusion_transformer_dir argument.

    # Run the video generation command with a single gpu
    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
       --checkpoint_dir checkpoints \
       --diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview_post-trained \
       --input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
       --num_input_frames 1 \
       --prompt "${PROMPT}" \
       --prompt_left "${PROMPT_LEFT}" \
       --prompt_right "${PROMPT_RIGHT}" \
       --prompt_back "${PROMPT_BACK}" \
       --prompt_back_left "${PROMPT_BACK_LEFT}" \
       --prompt_back_right "${PROMPT_BACK_RIGHT}" \
       --video_save_name diffusion-video2world-multiview-7b-post-train
    

You can also test out other Video2World-Multiview inference options with the --diffusion_transformer_dir argument. Refer to the Diffusion Model Reference page for more examples.