
Post-training with GR00T-Dreams-GR1 and GR00T-Dreams-DROID Datasets#

This section provides instructions for post-training Predict2 Video2World models with the GR00T-Dreams-GR1 and GR00T-Dreams-DROID datasets.

Set up the Video2World Model#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.

  3. Generate a Hugging Face access token. Set the access token type to ‘Read’ (the default type is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Review and accept the Llama-Guard-3-8B terms.

  6. Download the model weights for Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World from Hugging Face:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B
    

    Tip

    Change the --model_sizes parameter as needed if you only need one of the 2B/14B models. Furthermore, the model download command defaults to the 720P, 16FPS version of the model checkpoints. Refer to the Reference page for customizing which variants to download.
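
    For example, to download only the 2B checkpoint, pass a single value to --model_sizes:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B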

If you want to run the commands in https://github.com/nvidia/GR00T-dreams, install the extra dependencies after setting up the environment:

# If you use Docker and see "ERROR: Cannot install httpcore==1.0.7 because these package versions have conflicting dependencies",
# the following commands may help resolve the package version conflict:
# grep -v "^h11==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
# grep -v "^httpcore==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
pip install openai tyro numpydantic albumentations tianshou git+https://github.com/facebookresearch/pytorch3d.git
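
As an optional sanity check (not part of the GR00T-Dreams instructions), you can confirm that the extra packages import correctly:

# Optional: verify that the extra dependencies are importable in the current environment
python -c "import openai, tyro, numpydantic, albumentations, tianshou, pytorch3d; print('extra dependencies OK')"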

Prepare the Dataset#

Download the DreamGen Bench Training Dataset#

To train on the robotic training datasets from the DreamGen paper, download the GR1 training dataset from https://huggingface.co/datasets/nvidia/GR1-100.

From the cosmos-predict2/ folder, run:

# This command will download the videos for physical AI

huggingface-cli download nvidia/GR1-100 --repo-type dataset --local-dir datasets/benchmark_train/hf_gr1/ && \
mkdir -p datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/gr1/*mp4 datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/metadata.csv datasets/benchmark_train/gr1/
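
After the move, the gr1 folder should contain a metadata.csv file and a videos/ subfolder; the following optional commands give a quick look at what was downloaded:

ls datasets/benchmark_train/gr1/
head -n 3 datasets/benchmark_train/gr1/metadata.csv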

Preprocess the Data#

Run the following command to pre-compute T5-XXL embeddings for the video captions used for post-training:

# The script uses the provided prompts and saves the T5-XXL embeddings in pickle format.
python -m scripts.get_t5_embeddings_from_groot_dataset --dataset_path datasets/benchmark_train/gr1

Dataset folder format:

datasets/benchmark_train/gr1/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle
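
Assuming a one-to-one mapping between videos, captions, and embeddings (as the layout above suggests), the three file counts below should match after preprocessing; this check is optional:

ls datasets/benchmark_train/gr1/videos/*.mp4 | wc -l
ls datasets/benchmark_train/gr1/metas/*.txt | wc -l
ls datasets/benchmark_train/gr1/t5_xxl/*.pickle | wc -l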

Post-train the Video2World Model#

Cosmos-Predict2-2B-Video2World#

Run the following command to execute an example post-training job with GR1 data.

EXP=predict2_video2world_training_2b_groot_gr1_480
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The model will be post-trained using the GR1 dataset. See the config predict2_video2world_training_2b_groot_gr1_480 defined in cosmos_predict2/configs/base/experiment/groot.py to understand how the dataloader is defined.

# GROOT example
example_video_dataset_gr1 = L(Dataset)(
    dataset_dir="datasets/benchmark_train/gr1",  # dataset folder prepared in the previous section
    num_frames=93,  # number of frames per training sample
    video_size=(432, 768),  # training video resolution
)

dataloader_train_gr1 = L(DataLoader)(
    dataset=example_video_dataset_gr1,
    sampler=L(get_sampler)(dataset=example_video_dataset_gr1),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)

The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, NAME is 2b_groot_gr1_480.

See the job config to understand how they are determined.

predict2_video2world_training_2b_groot_gr1_480 = dict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="video2world",
            name="2b_groot_gr1_480",
        ),
        ...
    )
)

The checkpoints will be saved in the below structure.

checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
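
Assuming latest_checkpoint.txt records the filename of the most recently saved iteration (as the layout above suggests), you can locate the newest checkpoint with:

cat checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/latest_checkpoint.txt
ls checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/model/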

Cosmos-Predict2-14B-Video2World#

Run the following command to execute an example post-training job with GR1 data on 4 nodes with 8 GPUs each.

EXP=predict2_video2world_training_14b_groot_gr1_480
NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
  • Optionally, you can load the Cosmos-Predict2-14B-Video2World-GR00T-Dreams-GR1 checkpoint for initialization by appending model.config.model_manager_config.dit_path=checkpoints/nvidia/Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1/model-480p-16fps.pt to the above command, as shown below.
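
For example, the multi-node command with this optional initialization appended looks like the following:

EXP=predict2_video2world_training_14b_groot_gr1_480
NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP} \
model.config.model_manager_config.dit_path=checkpoints/nvidia/Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1/model-480p-16fps.pt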

Perform Inference for GR00T Dreams Checkpoints#

Tip

For more examples of how to use the inference script, refer to the Video2World Model Reference.

Inference with GR1 Checkpoint#

torchrun --nproc_per_node=8 --master_port=12341 \
  -m examples.video2world_gr00t \
  --model_size 14B \
  --gr00t_variant gr1 \
  --prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
  --input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
  --num_gpus 8 \
  --prompt_prefix "" \
  --save_path output/generated_video_gr1.mp4

Inference with DROID Checkpoint#

torchrun --nproc_per_node=8 --master_port=12341 \
  -m examples.video2world_gr00t \
  --model_size 14B \
  --gr00t_variant droid \
  --prompt "A multi-view video shows that a robot pick the lid and put it on the pot The video is split into four views: The top-left view shows the robotic arm from the left side, the top-right view shows it from the right side, the bottom-left view shows a first-person perspective from the robot's end-effector (gripper), and the bottom-right view is a black screen (inactive view). The robot pick the lid and put it on the pot" \
  --input_path assets/sample_gr00t_dreams_droid/episode_000408.png \
  --prompt_prefix "" \
  --num_gpus 8 \
  --save_path output/generated_video_droid.mp4

Perform Inference with DreamGen Benchmark#

Download the DreamGen Benchmark Dataset#

The following command downloads the DreamGen Benchmark dataset from https://huggingface.co/datasets/nvidia/EVAL-175:

huggingface-cli download nvidia/EVAL-175 --repo-type dataset --local-dir dream_gen_benchmark

Prepare the batch input JSON file:

python -m scripts.prepare_batch_input_json \
  --dataset_path dream_gen_benchmark/gr1_object/ \
  --save_path output/dream_gen_benchmark/cosmos_predict2_14b_gr1_object/ \
  --output_path dream_gen_benchmark/gr1_object/batch_input.json

Perform inference:

python -m examples.video2world_gr00t \
  --model_size 14B \
  --gr00t_variant gr1 \
  --batch_input_json dream_gen_benchmark/gr1_object/batch_input.json \
  --disable_guardrail
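
The two commands above cover a single task folder (gr1_object). A hypothetical sketch for sweeping every GR1 task folder in the benchmark is shown below; it assumes the GR1 task folders share the gr1_ prefix under dream_gen_benchmark/ and simply repeats the two documented commands for each folder:

for task_dir in dream_gen_benchmark/gr1_*/ ; do
  task=$(basename "${task_dir}")
  # Build the batch input JSON for this task folder
  python -m scripts.prepare_batch_input_json \
    --dataset_path "${task_dir}" \
    --save_path "output/dream_gen_benchmark/cosmos_predict2_14b_${task}/" \
    --output_path "${task_dir}batch_input.json"
  # Generate videos for every entry in the batch input JSON
  python -m examples.video2world_gr00t \
    --model_size 14B \
    --gr00t_variant gr1 \
    --batch_input_json "${task_dir}batch_input.json" \
    --disable_guardrail
done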

Note

To ensure all benchmark videos are generated for a full evaluation, turn off the guardrail checks by adding --disable_guardrail to the command.

Tip

For more examples on how to improve video quality using the Cosmos-Reason1 video critic capability, refer to the Cosmos-Reason1 video critic instructions and Video2World Model Reference.

Perform Inference with Cosmos-Reason1 Rejection Sampling#

Use the following command to perform inference with the GR1 checkpoint and rejection sampling:

torchrun --nproc_per_node=8 --master_port=12341 \
  -m examples.video2world_bestofn \
  --model_size 14B \
  --gr00t_variant gr1 \
  --prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
  --input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
  --num_gpus 8 \
  --num_generations 4 \
  --prompt_prefix "" \
  --disable_guardrail \
  --save_path output/best-of-n-gr00t-gr1