Post-training with Cosmos-NeMo-Assets#
This section provides instructions for post-training Predict2 Video2World models with the Cosmos-NeMo-Assets dataset.
Set up the Video2World Model#
Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.
Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.
Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).
Log in to Hugging Face with the access token:
huggingface-cli login
Review and accept the Llama-Guard-3-8B terms.
Download the model weights for Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World from Hugging Face:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B
Tip
Change the --model_sizes parameter as needed if you only need one of the 2B/14B models. Furthermore, the download command defaults to the 720P, 16FPS version of the model checkpoints. Refer to the Reference page for customizing which variants to download.
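For instance, if you only plan to post-train the 2B model, the following command downloads just that checkpoint:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B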
Prepare the Dataset#
Download Cosmos-NeMo-Assets#
The first step is to download a dataset of videos.
You must provide a folder containing a collection of MP4 videos, preferably at 720p resolution. Each video should keep the subject in frame for its entire duration so that every video chunk contains the subject.
You can use nvidia/Cosmos-NeMo-Assets for post-training.
mkdir -p datasets/cosmos_nemo_assets/
# This command will download the videos for physical AI
huggingface-cli download nvidia/Cosmos-NeMo-Assets --repo-type dataset --local-dir datasets/cosmos_nemo_assets/ --include "*.mp4*"
mv datasets/cosmos_nemo_assets/nemo_diffusion_example_data datasets/cosmos_nemo_assets/videos
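If you are bringing your own videos instead of Cosmos-NeMo-Assets, it can help to confirm their codec and resolution before preprocessing. The following is a minimal sketch that assumes ffprobe (part of FFmpeg) is installed; it is not part of the Cosmos-Predict2 tooling:
# Print codec, width, and height for each video (requires ffprobe from FFmpeg)
for f in datasets/cosmos_nemo_assets/videos/*.mp4; do
  echo "$f: $(ffprobe -v error -select_streams v:0 -show_entries stream=codec_name,width,height -of csv=p=0 "$f")"
done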
Preprocess the Data#
Cosmos-NeMo-Assets comes with a single caption for 4 long videos. Run the following command to pre-compute T5-XXL embeddings for the video caption used for post-training:
# The script will use the provided prompt and save the T5-XXL embeddings in pickle format.
PYTHONPATH=$(pwd) python scripts/get_t5_embeddings_from_cosmos_nemo_assets.py --dataset_path datasets/cosmos_nemo_assets --prompt "A video of sks teal robot."
Dataset folder format:
datasets/cosmos_nemo_assets/
├── metas/
│ ├── *.txt
├── videos/
│ ├── *.mp4
├── t5_xxl/
│ ├── *.pickle
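After preprocessing, you can sanity-check that the expected files are in place. The sketch below assumes the caption and embedding files share each video's basename, which may differ in your setup:
# Verify that every video has a matching caption and T5-XXL embedding
for f in datasets/cosmos_nemo_assets/videos/*.mp4; do
  name=$(basename "$f" .mp4)
  [ -f "datasets/cosmos_nemo_assets/metas/${name}.txt" ] || echo "missing caption: ${name}"
  [ -f "datasets/cosmos_nemo_assets/t5_xxl/${name}.pickle" ] || echo "missing embedding: ${name}"
done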
Post-train the Video2World Model#
Cosmos-Predict2-2B-Video2World#
Run the following command to execute an example post-training job with cosmos_nemo_assets data.
EXP=predict2_video2world_training_2b_cosmos_nemo_assets
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
The model will be post-trained using the cosmos_nemo_assets dataset.
See the config predict2_video2world_training_2b_cosmos_nemo_assets defined in cosmos_predict2/configs/base/experiment/cosmos_nemo_assets.py to understand how the dataloader is defined.
# Cosmos-NeMo-Assets example
example_video_dataset_cosmos_nemo_assets = L(Dataset)(
    dataset_dir="datasets/cosmos_nemo_assets",
    num_frames=93,
    video_size=(704, 1280),
)

dataloader_train_cosmos_nemo_assets = L(DataLoader)(
    dataset=example_video_dataset_cosmos_nemo_assets,
    sampler=L(get_sampler)(dataset=example_video_dataset_cosmos_nemo_assets),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)
The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, and NAME is 2b_cosmos_nemo_assets. See the job config to understand how they are determined.
predict2_video2world_training_2b_cosmos_nemo_assets = dict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="video2world",
            name="2b_cosmos_nemo_assets",
        ),
        ...
    )
)
The checkpoints will be saved in the following structure.
checkpoints/posttraining/video2world/2b_cosmos_nemo_assets/checkpoints/
├── model/
│ ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
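To see which checkpoints have been written so far, you can inspect this directory. The sketch below assumes latest_checkpoint.txt records the name of the most recent checkpoint file:
CKPT_DIR=checkpoints/posttraining/video2world/2b_cosmos_nemo_assets/checkpoints
ls ${CKPT_DIR}/model/                  # lists iter_{NUMBER}.pt files
cat ${CKPT_DIR}/latest_checkpoint.txt  # name of the most recent checkpoint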
Cosmos-Predict2-14B-Video2World#
Run the following command to execute an example post-training job with cosmos_nemo_assets data on 4 nodes with 8 GPUs each.
EXP=predict2_video2world_training_14b_cosmos_nemo_assets
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
The above command will train the entire model. If you are interested in training with LoRA, append model.config.train_architecture=lora to the training command.
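For example, the same 14B job with LoRA enabled looks like the following; the only change is the appended override:
EXP=predict2_video2world_training_14b_cosmos_nemo_assets
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP} model.config.train_architecture=lora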
The checkpoints will be saved in the following structure.
checkpoints/posttraining/video2world/14b_cosmos_nemo_assets/checkpoints/
├── model/
│ ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
Run Inference on Post-Trained Checkpoints#
Cosmos-Predict2-2B-Video2World#
For example, to run inference with a post-trained checkpoint at 1000 iterations, run the following command. Use the --dit_path argument to specify the path to the post-trained checkpoint.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/video2world.py \
--model_size 2B \
--dit_path "checkpoints/posttraining/video2world/2b_cosmos_nemo_assets/checkpoints/model/iter_000001000.pt" \
--prompt "A video of sks teal robot." \
--input_path "assets/video2world_cosmos_nemo_assets/output_Digit_Lift_movie.jpg" \
--save_path output/cosmos_nemo_assets/generated_video_teal_robot.mp4
Refer to the Video2World Model Reference for inference run details.
Cosmos-Predict2-14B-Video2World#
The 14B model can be run similarly by changing the --model_size and --dit_path arguments.
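For instance, a 14B checkpoint post-trained for 1000 iterations on cosmos_nemo_assets could be run as follows (the output filename is illustrative):
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python examples/video2world.py \
--model_size 14B \
--dit_path "checkpoints/posttraining/video2world/14b_cosmos_nemo_assets/checkpoints/model/iter_000001000.pt" \
--prompt "A video of sks teal robot." \
--input_path "assets/video2world_cosmos_nemo_assets/output_Digit_Lift_movie.jpg" \
--save_path output/cosmos_nemo_assets/generated_video_teal_robot_14b.mp4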