Post-training with GR00T-Dreams-GR1 and GR00T-Dreams-DROID Datasets#
This section provides instructions for post-training Predict2 Video2World models with the GR00T-Dreams-GR1 and GR00T-Dreams-DROID datasets.
Set up the Video2World Model#
Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.
Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.
Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).
Log in to Hugging Face with the access token:
huggingface-cli login
Review and accept the Llama-Guard-3-8B terms.
Download the model weights for Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World from Hugging Face:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B
Tip
Change the --model_sizes parameter as needed if you only need one of the 2B/14B models. Furthermore, the download command defaults to the 720P, 16FPS version of the model checkpoints; refer to the Reference page for customizing which variants to download.
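To confirm the weights downloaded correctly, you can list the checkpoint directory. The checkpoints/nvidia/ layout shown here is an assumption based on the checkpoint paths referenced later in this guide:
# List the downloaded model folders (directory layout assumed)
ls checkpoints/nvidia/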
If you want to run the commands in https://github.com/nvidia/GR00T-dreams, install the following extra dependencies after setting up the environment:
# If using Docker and you see "ERROR: Cannot install httpcore==1.0.7 because these package versions have conflicting dependencies",
# the following commands may help resolve the package version conflict:
# grep -v "^h11==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
# grep -v "^httpcore==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
pip install openai tyro numpydantic albumentations tianshou git+https://github.com/facebookresearch/pytorch3d.git
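As an optional sanity check, verify that the extra packages import cleanly:
# Optional check that the extra dependencies are importable
python -c "import openai, tyro, numpydantic, albumentations, tianshou, pytorch3d; print('extra dependencies OK')"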
Prepare the Dataset#
Download the DreamGen Bench Training Dataset#
To train on the robotic datasets from the DreamGen paper, download the GR1 training dataset from https://huggingface.co/datasets/nvidia/GR1-100. From the cosmos-predict2/ folder, run:
# This command will download the videos for physical AI
huggingface-cli download nvidia/GR1-100 --repo-type dataset --local-dir datasets/benchmark_train/hf_gr1/ && \
mkdir -p datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/gr1/*mp4 datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/metadata.csv datasets/benchmark_train/gr1/
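After the download finishes, a quick way to confirm that the files were moved into place is to check for metadata.csv and count the videos:
# Confirm the metadata and videos landed in the expected locations
ls datasets/benchmark_train/gr1/metadata.csv
ls datasets/benchmark_train/gr1/videos/*.mp4 | wc -l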
Preprocess the Data#
Run the following command to pre-compute T5-XXL embeddings for the video captions used for post-training:
# The script uses the provided prompts and saves the T5-XXL embeddings in pickle format.
python -m scripts.get_t5_embeddings_from_groot_dataset --dataset_path datasets/benchmark_train/gr1
Dataset folder format:
datasets/benchmark_train/gr1/
├── metas/
│ ├── *.txt
├── videos/
│ ├── *.mp4
├── t5_xxl/
│ ├── *.pickle
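Assuming one caption and one embedding are produced per video, the file counts in the three subfolders should match; a quick count verifies that preprocessing completed:
# The three counts should match (one caption and one embedding per video assumed)
ls datasets/benchmark_train/gr1/videos/*.mp4 | wc -l
ls datasets/benchmark_train/gr1/metas/*.txt | wc -l
ls datasets/benchmark_train/gr1/t5_xxl/*.pickle | wc -l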
Post-train the Video2World Model#
Cosmos-Predict2-2B-Video2World#
Run the following command to execute an example post-training job with GR1 data.
EXP=predict2_video2world_training_2b_groot_gr1_480
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
The model will be post-trained using the GR1 dataset. See the config predict2_video2world_training_2b_groot_gr1_480 defined in cosmos_predict2/configs/base/experiment/groot.py to understand how the dataloader is defined.
# GROOT example
example_video_dataset_gr1 = L(Dataset)(
    dataset_dir="datasets/benchmark_train/gr1",
    num_frames=93,
    video_size=(432, 768),
)
dataloader_train_gr1 = L(DataLoader)(
    dataset=example_video_dataset_gr1,
    sampler=L(get_sampler)(dataset=example_video_dataset_gr1),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)
The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, and NAME is 2b_groot_gr1_480. See the job config to understand how they are determined.
predict2_video2world_training_2b_groot_gr1_480 = dict(
    dict(
        ...
        job=dict(
            project="posttraining",
            group="video2world",
            name="2b_groot_gr1_480",
        ),
        ...
    )
)
The checkpoints will be saved with the following directory structure:
checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/
├── model/
│ ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
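The latest_checkpoint.txt file tracks the most recent checkpoint; assuming it simply records the checkpoint name, you can use it together with a directory listing to find the iteration to evaluate or resume from:
# Show the most recent checkpoint name and the saved model iterations (file contents assumed)
cat checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/latest_checkpoint.txt
ls checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/model/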
Cosmos-Predict2-14B-Video2World#
Run the following command to execute an example post-training job with GR1 data on 4 nodes with 8 GPUs each.
EXP=predict2_video2world_training_14b_groot_gr1_480
NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
Optionally, you can load the Cosmos-Predict2-14B-Video2World-GR00T-Dreams-GR1 checkpoint for initialization by appending model.config.model_manager_config.dit_path=checkpoints/nvidia/Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1/model-480p-16fps.pt to the above command.
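Put together, a sketch of the multi-node command with the optional initialization checkpoint appended (this only combines the command and override shown above):
EXP=predict2_video2world_training_14b_groot_gr1_480
NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP} \
model.config.model_manager_config.dit_path=checkpoints/nvidia/Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1/model-480p-16fps.pt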
Perform Inference for GR00T Dreams Checkpoints#
Tip
For more examples of how to use the inference script, refer to the Video2World Model Reference.
Inference with GR1 Checkpoint#
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant gr1 \
--prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
--input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
--num_gpus 8 \
--prompt_prefix "" \
--save_path output/generated_video_gr1.mp4
Inference with DROID Checkpoint#
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant droid \
--prompt "A multi-view video shows that a robot pick the lid and put it on the pot The video is split into four views: The top-left view shows the robotic arm from the left side, the top-right view shows it from the right side, the bottom-left view shows a first-person perspective from the robot's end-effector (gripper), and the bottom-right view is a black screen (inactive view). The robot pick the lid and put it on the pot" \
--input_path assets/sample_gr00t_dreams_droid/episode_000408.png \
--prompt_prefix "" \
--num_gpus 8 \
--save_path output/generated_video_droid.mp4
Perform Inference with DreamGen Benchmark#
Download the DreamGen Benchmark Dataset#
The following command will download the DreamGen Benchmark dataset from https://huggingface.co/datasets/nvidia/EVAL-175:
huggingface-cli download nvidia/EVAL-175 --repo-type dataset --local-dir dream_gen_benchmark
Prepare the batch input JSON file:
python -m scripts.prepare_batch_input_json \
--dataset_path dream_gen_benchmark/gr1_object/ \
--save_path output/dream_gen_benchmark/cosmos_predict2_14b_gr1_object/ \
--output_path dream_gen_benchmark/gr1_object/batch_input.json
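Before launching inference, you can inspect the generated batch input file; assuming it is a single JSON document, python -m json.tool pretty-prints it:
# Pretty-print the beginning of the generated batch input file
python -m json.tool dream_gen_benchmark/gr1_object/batch_input.json | head -n 20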
Perform inference:
python -m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant gr1 \
--batch_input_json dream_gen_benchmark/gr1_object/batch_input.json \
--disable_guardrail
Note
For full evaluation without missing videos, it's better to turn off the guardrail checks (add --disable_guardrail to the command) to make sure all the videos are generated.
Tip
For more examples on how to improve video quality using the Cosmos-Reason1 video critic capability, refer to the Cosmos-Reason1 video critic instructions and Video2World Model Reference.
Perform Inference with Cosmos-Reason1 Rejection Sampling#
Use the following command to perform inference with the GR1 checkpoint and rejection sampling:
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_bestofn \
--model_size 14B \
--gr00t_variant gr1 \
--prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
--input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
--num_gpus 8 \
--num_generations 4 \
--prompt_prefix "" \
--disable_guardrail \
--save_path output/best-of-n-gr00t-gr1
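Because --num_generations 4 produces multiple candidate videos, the save path above is treated here as an output directory; listing it afterwards lets you review the candidates (exact file naming is determined by the script):
# Review the generated candidates (file names are produced by the script)
ls output/best-of-n-gr00t-gr1/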