Predict2 Post-Training Guide#

This page provides instructions for post-training with Cosmos-Predict2-Video2World models. These models can transform a still image or video clip into a longer, animated sequence guided by a text description.

It covers three post-training scenarios:

  • Post-training with a custom dataset

  • Post-training from an example pre-trained Video2World checkpoint

  • Post-training with Cosmos-NeMo-Assets

Set up the Video2World Model#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Review and accept the Llama-Guard-3-8B terms.

Post-training with a Custom Dataset#

1. Preparing Data#

The post-training data is expected to contain paired prompt and video files. For example, a custom dataset can be saved in the following structure.

Dataset folder format:

datasets/benchmark_train/custom_dataset/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4

The metas folder contains .txt files with prompts describing the video content, and the videos folder contains the corresponding .mp4 video files.
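
As a quick sanity check before computing embeddings, you can verify that the prompt and video files are correctly paired. The snippet below is a minimal sketch, assuming each prompt .txt shares its basename with the corresponding .mp4.

# Minimal sketch: check that prompts and videos are paired one-to-one.
from pathlib import Path

dataset_root = Path("datasets/benchmark_train/custom_dataset")
video_ids = {p.stem for p in (dataset_root / "videos").glob("*.mp4")}
prompt_ids = {p.stem for p in (dataset_root / "metas").glob("*.txt")}

print("videos without prompts:", sorted(video_ids - prompt_ids))
print("prompts without videos:", sorted(prompt_ids - video_ids))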

After preparing the metas and videos folders, run the following command to pre-compute the T5-XXL embeddings.

python -m scripts.get_t5_embeddings --dataset_path datasets/benchmark_train/custom_dataset/

This script creates a t5_xxl folder under the dataset root, where the T5-XXL embeddings are saved as .pickle files.

datasets/benchmark_train/custom_dataset/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle
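
To confirm the embeddings were written, you can load one of the .pickle files and inspect it. This is a minimal sketch; the exact structure of the pickled object depends on the script, so the snippet only prints what it finds.

# Minimal sketch: inspect one pre-computed T5-XXL embedding file.
import pickle
from pathlib import Path

emb_path = next(Path("datasets/benchmark_train/custom_dataset/t5_xxl").glob("*.pickle"))
with open(emb_path, "rb") as f:
    emb = pickle.load(f)

print(type(emb))
if hasattr(emb, "shape"):
    print(emb.shape)
elif isinstance(emb, (list, tuple)):
    print([getattr(e, "shape", type(e)) for e in emb])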

2. Creating Configs for Training#

Define a dataloader for the prepared dataset. For example:

# custom dataset example
example_video_dataset = L(Dataset)(
    dataset_dir="datasets/benchmark_train/custom_dataset",
    num_frames=93,
    video_size=(704, 1280),
)

dataloader_train = L(DataLoader)(
    dataset=example_video_dataset,
    sampler=L(get_sampler)(dataset=example_video_dataset),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)

With dataloader_train defined, create a config for the training job. Here's a post-training example for the Video2World 2B model.

predict2_video2world_training_2b_custom_data = dict(
    defaults=[
        {"override /model": "predict2_video2world_fsdp_2b"},
        {"override /optimizer": "fusedadamw"},
        {"override /ckpt_type": "standard"},
        {"override /data_val": "mock"},
        "_self_",
    ],
    job=dict(
        project="posttraining",
        group="video2world",
        name="2b_custom_data",
    ),
    model=dict(
        config=dict(
            fsdp_shard_size=8,              # FSDP size
            pipe_config=dict(
                ema=dict(enabled=True),     # enable EMA during training
                guardrail_config=dict(enabled=False),   # disable guardrail during training
            ),
        )
    ),
    model_parallel=dict(
        context_parallel_size=2,            # context parallelism size
    ),
    dataloader_train=dataloader_train,
    trainer=dict(
        distributed_parallelism="fsdp",
        callbacks=dict(
            iter_speed=dict(hit_thres=10),
        ),
        max_iter=1000,                      # maximum number of iterations
    ),
    checkpoint=dict(
        save_iter=200,                      # checkpoints will be saved every 200 iterations.
    ),
)

The config should be registered with the ConfigStore.

for _item in [
    # 2b, custom data
    predict2_video2world_training_2b_custom_data,
]:
    # Get the experiment name from the global variable.
    experiment_name = [name.lower() for name, value in globals().items() if value is _item][0]

    cs.store(
        group="experiment",
        package="_global_",
        name=experiment_name,
        node=_item,
    )

2.1. Config System#

The config example above starts by overriding from the registered default config groups.

    {"override /model": "predict2_video2world_fsdp_2b"},
    {"override /optimizer": "fusedadamw"},
    {"override /ckpt_type": "standard"},
    {"override /data_val": "mock"},

The configuration system is organized as follows:

cosmos_predict2/configs/base/
├── config.py                   # Main configuration class definition
├── defaults/                   # Default configuration groups
│   ├── callbacks.py            # Training callbacks configurations
│   ├── checkpoint.py           # Checkpoint saving/loading configurations
│   ├── data.py                 # Dataset and dataloader configurations
│   ├── ema.py                  # Exponential Moving Average configurations
│   ├── model.py                # Model architecture configurations
│   ├── optimizer.py            # Optimizer configurations
│   └── scheduler.py            # Learning rate scheduler configurations
└── experiment/                 # Experiment-specific configurations
    ├── cosmos_nemo_assets.py   # Experiments with cosmos_nemo_assets
    └── utils.py                # Utility functions for experiments

The system provides several pre-defined configuration groups that can be mixed and matched:

Model Configurations (defaults/model.py)#

  • predict2_video2world_fsdp_2b: 2B parameter Video2World model with FSDP

  • predict2_video2world_fsdp_14b: 14B parameter Video2World model with FSDP

Optimizer Configurations (defaults/optimizer.py)#

  • fusedadamw: FusedAdamW optimizer with standard settings

  • Custom optimizer configurations for different training scenarios

Scheduler Configurations (defaults/scheduler.py)#

  • constant: Constant learning rate

  • Various learning rate scheduling strategies

Data Configurations (defaults/data.py)#

  • Training and validation dataset configurations

Checkpoint Configurations (defaults/checkpoint.py)#

  • standard: Standard local checkpoint handling

Callback Configurations (defaults/callbacks.py)#

  • basic: Essential training callbacks

  • Performance monitoring and logging callbacks

In addition to the overridden values, the rest of the config setup overwrites or adds the other config details.

3. Run a Training Job#

Run the following command to execute an example post-training job with the custom data.

EXP=predict2_video2world_training_2b_custom_data
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The above command will train the entire model. If you are interested in training with LoRA, append model.config.train_architecture=lora to the training command.
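
For example, the LoRA variant of the command appends the override after the experiment setting:

EXP=predict2_video2world_training_2b_custom_data
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP} model.config.train_architecture=lora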

The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, and NAME is 2b_custom_data.

checkpoints/posttraining/video2world/2b_custom_data/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
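
Inference takes the path of a specific iter_{NUMBER}.pt file. Below is a minimal sketch for locating the most recent one under the layout above:

# Minimal sketch: find the latest saved checkpoint in the model/ folder.
from pathlib import Path

ckpt_dir = Path("checkpoints/posttraining/video2world/2b_custom_data/checkpoints/model")
latest = max(ckpt_dir.glob("iter_*.pt"), key=lambda p: int(p.stem.split("_")[-1]))
print(latest)  # pass this path to --dit_path during inference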

4. Run Inference on Post-trained Checkpoints#

Cosmos-Predict2-2B-Video2World#

For example, to use a post-trained checkpoint at 1000 iterations, run the following command. Use the --dit_path argument to specify the path to the post-trained checkpoint.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/video2world.py \
  --model_size 2B \
  --dit_path "checkpoints/posttraining/video2world/predict2_video2world_training_2b_custom_data/checkpoints/model/iter_000001000.pt" \
  --prompt "A descriptive prompt for physical AI." \
  --input_path "assets/video2world_cosmos_nemo_assets/output_Digit_Lift_movie.jpg" \
  --save_path results/cosmos_nemo_assets/generated_video_from_post-training.mp4

Refer to the Video2World Model Reference for inference run details.

Cosmos-Predict2-14B-Video2World#

The 14B model can be run similarly by changing the --model_size and --dit_path arguments.

Post-Training from an Example Pre-trained Video2World Checkpoint#

1. Preparing Data#

1.1 Download Bridge training dataset#

We use the train/validation splits of the Bridge dataset from IRASim for action-conditioned post-training. To download and prepare the dataset, run the following commands under the cosmos-predict2/ directory:

wget https://lf-robot-opensource.bytetos.com/obj/lab-robot-public/opensource_IRASim_v1/bridge_train_data.tar.gz
mv bridge_train_data.tar.gz datasets/
cd datasets
tar -xvzf bridge_train_data.tar.gz -C .
mv opensource_robotdata/bridge ./

Your dataset directory structure should look like this:

datasets/bridge/
├── annotations/
│   ├── *.json
├── videos/
│   ├── *.mp4

Each JSON file in the annotations/ folder contains the end-effector pose and gripper width of the robot arm for each frame in the corresponding video. Specifically, each file includes:

  • state: The end-effector pose of the robot arm at each timestep, represented as [x, y, z, roll, pitch, yaw].

    • (x, y, z) denotes the gripper’s position in world coordinates.

    • (roll, pitch, yaw) describes its orientation in Euler angles.

  • continuous_gripper_state: The width of the gripper at each timestep, indicating whether it is open or closed. A value of 0 means the gripper is open, and 1 means it is closed.

  • action: The gripper’s displacement at each timestep.

    • The first six dimensions represent displacement in (x, y, z, roll, pitch, yaw) within the gripper coordinate frame.

    • The last (seventh) dimension is a binary value indicating whether the gripper should open (1) or close (0).

We use this information as conditioning input for video generation.
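
To get a feel for the annotation format, you can load one file and print the per-frame fields described above. A minimal sketch, assuming each key maps to a per-frame list (the file name below is a placeholder):

# Minimal sketch: inspect one Bridge annotation file.
import json

ann_path = "datasets/bridge/annotations/example.json"  # placeholder; point this at a real annotation file
with open(ann_path) as f:
    ann = json.load(f)

print(sorted(ann.keys()))
for key in ("state", "continuous_gripper_state", "action"):
    if key in ann:
        print(key, "-> number of frames:", len(ann[key]))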

2. Post-training#

2.1. Cosmos-Predict2-2B-Video2World#

Run the following command to launch an example post-training job using the Bridge dataset:

torchrun --nproc_per_node=2 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment="action_conditioned_predict2_video2world_2b_training"

See cosmos_predict2/configs/action_conditioned/defaults/data.py to understand how the dataloader is defined. To add action as an additional condition, a new conditioner supporting action is created in cosmos_predict2/configs/action_conditioned/defaults/conditioner.py.

Checkpoint Output Structure#

Checkpoints are saved to the following path:

checkpoints/PROJECT/GROUP/NAME

For the example command above:

  • PROJECT: posttraining

  • GROUP: video2world

  • NAME: action_conditioned_predict2_video2world_2b_training_${now:%Y-%m-%d}_${now:%H-%M-%S}

Configuration Snippet#

Below is a configuration snippet defining the experiment setup:

action_conditioned_predict2_video2world_2b_training = dict(
    defaults=[
        {"override /model": "action_conditioned_predict2_v2w_2b_fsdp"},
        {"override /optimizer": "fusedadamw"},
        {"override /ckpt_type": "standard"},
        {"override /data_train": "bridge_train"},
        "_self_",
    ],
    model=dict(
        config=dict(
            fsdp_shard_size=-1,
        )
    ),
    job=dict(group="debug", name="action_conditioned_predict2_video2world_2b_training_${now:%Y-%m-%d}_${now:%H-%M-%S}"),
)

3. Inference for Bridge#

3.1. Cosmos-Predict2-2B-Video2World#

To run inference using a post-trained checkpoint (e.g., at 1000 iterations), use the command below. Specify the path to the checkpoint using the --dit_path argument:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/action_video2world.py \
  --model_size 2B \
  --dit_path "checkpoints/posttraining/video2world/action_conditioned_predict2_video2world_2b_training_${now:%Y-%m-%d}_${now:%H-%M-%S}/checkpoints/model/iter_000001000.pt" \
  --input_video datasets/bridge/videos/test/13/rgb.mp4 \
  --input_annotation datasets/bridge/annotation/test/13.json \
  --num_conditional_frames 1 \
  --save_path output/generated_video.mp4 \
  --guidance 0 \
  --seed 0 \
  --disable_guardrail \
  --disable_prompt_refiner 

Post-Training with Cosmos-NeMo-Assets#

1. Preparing Data#

1.1 Downloading Cosmos-NeMo-Assets#

The first step is downloading a dataset with videos.

You must provide a folder containing a collection of videos in MP4 format, preferably 720p. Each video should keep the subject in view throughout, so that every video chunk contains the subject.

You can use nvidia/Cosmos-NeMo-Assets for post-training.

mkdir -p datasets/benchmark_train/cosmos_nemo_assets/

# This command will download the videos for physical AI
huggingface-cli download nvidia/Cosmos-NeMo-Assets --repo-type dataset --local-dir datasets/benchmark_train/cosmos_nemo_assets/ --include "*.mp4*"

mv datasets/benchmark_train/cosmos_nemo_assets/nemo_diffusion_example_data datasets/benchmark_train/cosmos_nemo_assets/videos
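
To verify that the downloaded videos (or your own) match the format described above, you can read each file's metadata. A minimal sketch, assuming OpenCV (opencv-python) is installed:

# Minimal sketch: report resolution, FPS, and frame count for each MP4 in the videos folder.
from pathlib import Path

import cv2  # requires opencv-python

for mp4 in sorted(Path("datasets/benchmark_train/cosmos_nemo_assets/videos").glob("*.mp4")):
    cap = cv2.VideoCapture(str(mp4))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    print(f"{mp4.name}: {width}x{height}, {fps:.1f} fps, {frames} frames")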

1.2 Preprocessing the Data#

Cosmos-NeMo-Assets comes with a single caption for 4 long videos. Run the following command to pre-compute T5-XXL embeddings for the video caption used for post-training:

# The script will use the provided prompt and save the T5-XXL embeddings in pickle format.
PYTHONPATH=$(pwd) python scripts/get_t5_embeddings_from_cosmos_nemo_assets.py --dataset_path datasets/benchmark_train/cosmos_nemo_assets --prompt "A video of sks teal robot."

Dataset folder format:

datasets/benchmark_train/cosmos_nemo_assets/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle

2. Post-training#

2.1. Post-training on Cosmos-NeMo-Assets dataset#

Cosmos-Predict2-2B-Video2World#

Run the following command to execute an example post-training job with cosmos_nemo_assets data.

EXP=predict2_video2world_training_2b_cosmos_nemo_assets
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The model will be post-trained using the cosmos_nemo_assets dataset. See the config predict2_video2world_training_2b_cosmos_nemo_assets defined in cosmos_predict2/configs/base/experiment/cosmos_nemo_assets.py to understand how the dataloader is defined.

# Cosmos-NeMo-Assets example
example_video_dataset_cosmos_nemo_assets = L(Dataset)(
    dataset_dir="datasets/benchmark_train/cosmos_nemo_assets",
    num_frames=93,
    video_size=(704, 1280),
)

dataloader_train_cosmos_nemo_assets = L(DataLoader)(
    dataset=example_video_dataset_cosmos_nemo_assets,
    sampler=L(get_sampler)(dataset=example_video_dataset_cosmos_nemo_assets),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)

The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, and NAME is 2b_cosmos_nemo_assets.

See the job config to understand how they are determined.

predict2_video2world_training_2b_cosmos_nemo_assets = dict(
    ...
    job=dict(
        project="posttraining",
        group="video2world",
        name="2b_cosmos_nemo_assets",
    ),
    ...
)

The checkpoints will be saved in the following structure.

checkpoints/posttraining/video2world/2b_cosmos_nemo_assets/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt

Cosmos-Predict2-14B-Video2World#

Run the following command to execute an example post-training job with the cosmos_nemo_assets data on 4 nodes with 8 GPUs each. Set MASTER_ADDR to the address of the rank-0 node before launching.

EXP=predict2_video2world_training_14b_cosmos_nemo_assets
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The above command will train the entire model. If you are interested in training with LoRA, append model.config.train_architecture=lora to the training command.

The checkpoints will be saved in the following structure.

checkpoints/posttraining/video2world/14b_cosmos_nemo_assets/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt

3. Inference with the Post-trained checkpoint#

3.1 Inference#

Cosmos-Predict2-2B-Video2World#

For example, to use a post-trained checkpoint at 1000 iterations, run the following command. Use the --dit_path argument to specify the path to the post-trained checkpoint.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/video2world.py \
  --model_size 2B \
  --dit_path "checkpoints/posttraining/video2world/predict2_video2world_training_2b_cosmos_nemo_assets/checkpoints/model/iter_000001000.pt" \
  --prompt "A video of sks teal robot." \
  --input_path "assets/video2world_cosmos_nemo_assets/output_Digit_Lift_movie.jpg" \
  --save_path results/cosmos_nemo_assets/generated_video_teal_robot.mp4

Refer to the Video2World Model Reference for inference run details.

Cosmos-Predict2-14B-Video2World#

The 14B model can be run similarly by changing the --model_size and --dit_path arguments.
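
For example, a 14B run on the same input might look like the following; the exact --dit_path depends on your post-training job name:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/video2world.py \
  --model_size 14B \
  --dit_path "checkpoints/posttraining/video2world/predict2_video2world_training_14b_cosmos_nemo_assets/checkpoints/model/iter_000001000.pt" \
  --prompt "A video of sks teal robot." \
  --input_path "assets/video2world_cosmos_nemo_assets/output_Digit_Lift_movie.jpg" \
  --save_path results/cosmos_nemo_assets/generated_video_teal_robot_14b.mp4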