Post-training with AgiBotWorld-Alpha#

This section provides instructions for post-training Predict2 Video2World models with the AgiBotWorld-Alpha dataset.

Set up the Video2World Model#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Review and accept the Llama-Guard-3-8B terms.

  6. Download the model weights for Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World from Hugging Face:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B
    

    Tip

    If you only need one of the 2B/14B models, adjust the --model_sizes parameter accordingly. The download command defaults to the 720P, 16FPS version of the model checkpoints; refer to the Reference page for customizing which variants to download.
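
    For example, to download only the 2B checkpoint:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B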

Prepare the Dataset#

Download and Pre-Process the AgiBotWorld-Alpha Dataset#

We take a subset of AgiBotWorld-Alpha to provide an example post-training job.

  1. Get a Hugging Face access token with Read permission.

  2. Log in: huggingface-cli login

  3. Accept the AgiBot World COMMUNITY LICENSE AGREEMENT; it must be submitted before AgiBotWorld-Alpha can be downloaded.

  4. Download task 327 from the AgiBotWorld-Alpha dataset:

# Download, extract, and clean (default behavior)
python scripts/prepare_agibot_fisheye_data.py

# Clean existing data
python scripts/prepare_agibot_fisheye_data.py --delete-only

# Split videos into 5-second windows
# In this example, we use task_id 327, episode_id 685393 as validation data
python scripts/prepare_agibot_fisheye_data.py --split-only --val_episode_ids 685393

# (Optional) Remove the source data
rm -rf datasets/agibot

Expect the data preparation steps to use roughly 100 GB of storage. After processing is complete, about 2 GB of data remains in the datasets/agibot_head_center_fisheye_color folder.

Dataset folder format:

datasets/agibot_head_center_fisheye_color/
├── train/
│   ├── metas/
│      ├── *.txt
│   ├── videos/
│      ├── *.mp4
├── val/
│   ├── metas/
│      ├── *.txt
│   ├── videos/
│      ├── *.mp4
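
A quick way to verify the split is to confirm that every video has a matching caption file. Below is a minimal sketch, assuming captions and videos share the same basename:

import glob
import os

for split in ("train", "val"):
    root = f"datasets/agibot_head_center_fisheye_color/{split}"
    # Collect basenames of videos and captions, then report any videos that lack a caption.
    videos = {os.path.splitext(os.path.basename(p))[0] for p in glob.glob(f"{root}/videos/*.mp4")}
    metas = {os.path.splitext(os.path.basename(p))[0] for p in glob.glob(f"{root}/metas/*.txt")}
    print(f"{split}: {len(videos)} videos; videos missing captions: {sorted(videos - metas)}")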

Preprocess the Data#

Run the following commands to pre-compute T5-XXL embeddings for the video captions used for post-training:

# The script uses the captions provided with the dataset and saves the T5-XXL embeddings in pickle format.
python scripts/get_t5_embeddings.py --dataset_path datasets/agibot_head_center_fisheye_color/train
python scripts/get_t5_embeddings.py --dataset_path datasets/agibot_head_center_fisheye_color/val

Dataset folder format:

datasets/agibot_head_center_fisheye_color/
├── train/
│   ├── metas/
│      ├── *.txt
│   ├── videos/
│      ├── *.mp4
│   ├── t5_xxl/
│      ├── *.pickle
├── val/
│   ├── metas/
│      ├── *.txt
│   ├── videos/
│      ├── *.mp4
│   ├── t5_xxl/
│      ├── *.pickle
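
To sanity-check the pre-computed embeddings, load one of the generated files. This is a minimal sketch, assuming each .pickle stores the T5-XXL embedding as a NumPy array (possibly wrapped in a list):

import glob
import pickle

import numpy as np

# Pick one of the generated embedding files (path follows the layout shown above).
path = sorted(glob.glob("datasets/agibot_head_center_fisheye_color/train/t5_xxl/*.pickle"))[0]
with open(path, "rb") as f:
    emb = pickle.load(f)

# Handle both a bare array and a list containing one array.
arr = np.asarray(emb[0] if isinstance(emb, list) else emb)
print(path, arr.shape, arr.dtype)  # expect (num_tokens, embedding_dim)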

Post-train the Video2World Model#

Cosmos-Predict2-2B-Video2World#

Run the following command to execute an example post-training job with agibot_head_center_fisheye_color data.

EXP=predict2_video2world_training_2b_agibot_head_center_fisheye_color
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The model will be post-trained using the agibot_head_center_fisheye_color dataset. See the config predict2_video2world_training_2b_agibot_head_center_fisheye_color defined in cosmos_predict2/configs/base/experiment/agibot_head_center_fisheye_color.py to understand how the dataloader is defined.

# agibot_head_center_fisheye_color example
example_video_dataset_agibot_head_center_fisheye_color = L(Dataset)(
    dataset_dir="datasets/agibot_head_center_fisheye_color",
    num_frames=93,
    video_size=(704, 1280),
)

dataloader_train_agibot_head_center_fisheye_color = L(DataLoader)(
    dataset=example_video_dataset_agibot_head_center_fisheye_color,
    sampler=L(get_sampler)(dataset=example_video_dataset_agibot_head_center_fisheye_color),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)

The checkpoints will be saved to checkpoints/PROJECT/GROUP/NAME. In the above example, PROJECT is posttraining, GROUP is video2world, and NAME is 2b_agibot_head_center_fisheye_color.

See the job config below to understand how these values are determined.

predict2_video2world_training_2b_agibot_head_center_fisheye_color = dict(
    ...
    job=dict(
        project="posttraining",
        group="video2world",
        name="2b_agibot_head_center_fisheye_color",
    ),
    ...
)

The checkpoints will be saved with the following structure.

checkpoints/posttraining/video2world/2b_agibot_head_center_fisheye_color/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
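
To see which checkpoint the trainer wrote most recently, you can inspect latest_checkpoint.txt (this assumes the file simply records the name of the newest checkpoint):

cat checkpoints/posttraining/video2world/2b_agibot_head_center_fisheye_color/checkpoints/latest_checkpoint.txt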

Resolution and FPS Variants#

Post-training can be done from the provided checkpoints at either resolution (480p or 720p) and either frame rate (10 FPS or 16 FPS). The corresponding config names have _{RESOLUTION}p_{FPS}fps appended to the default config name; the default config without a postfix uses 720p at 16 FPS.

# 480p, 10fps
EXP=predict2_video2world_training_2b_agibot_head_center_fisheye_color_480p_10fps
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 480p, 16fps
EXP=predict2_video2world_training_2b_agibot_head_center_fisheye_color_480p_16fps
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 720p, 10fps
EXP=predict2_video2world_training_2b_agibot_head_center_fisheye_color_720p_10fps
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 720p, 16fps
EXP=predict2_video2world_training_2b_agibot_head_center_fisheye_color_720p_16fps
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

Cosmos-Predict2-14B-Video2World#

Run the following command to execute an example post-training job with agibot_head_center_fisheye_color data on 4 nodes with 8 GPUs each.

EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

The above command will train the entire model. If you are interested in training with LoRA, append model.config.train_architecture=lora to the training command.
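
For example, the LoRA variant of the 14B command above would be:

EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP} model.config.train_architecture=lora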

The checkpoints will be saved with the following structure.

checkpoints/posttraining/video2world/14b_agibot_head_center_fisheye_color/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt

Resolution and FPS Variants#

As with the 2B model, post-training can be done from the provided checkpoints at either resolution (480p or 720p) and either frame rate (10 FPS or 16 FPS). The corresponding config names have _{RESOLUTION}p_{FPS}fps appended to the default config name; the default config without a postfix uses 720p at 16 FPS.

# 480p, 10fps
EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color_480p_10fps
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 480p, 16fps
EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color_480p_16fps
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 720p, 10fps
EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color_720p_10fps
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

# 720p, 16fps
EXP=predict2_video2world_training_14b_agibot_head_center_fisheye_color_720p_16fps
torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}

Post-training Performance#

The following table shows the expected iteration speed for 2B and 14B Video2World model training on different GPUs. Note that the 2B model uses 8 GPUs, while the 14B model uses 32 GPUs. The 14B model also has a 4x lower global batch size, as it uses context parallelism of size 8, while the 2B model uses context parallelism of size 2.

GPU Hardware       2B-Video2World    14B-Video2World
NVIDIA B200        6.05 sec          6.27 sec
NVIDIA H100 NVL    10.07 sec         8.72 sec
NVIDIA A100        22.5 sec          22.14 sec

Perform inference with the Post-trained Checkpoint#

Cosmos-Predict2-2B-Video2World#

For example, to run inference with a post-trained checkpoint at 1000 iterations, use the following command. The --dit_path argument specifies the path to the post-trained checkpoint.

PROMPT_="The video captures a humanoid robot positioned in front of a fruit stand in a supermarket environment. The robot's right arm extends downward, reaching for a shiitake mushroom on the shelf. The arm carefully grasps the mushroom, lifting it towards the robot's body. The surrounding environment includes a shopping cart with a clear plastic bag and a red handle, as well as various fruits and vegetables displayed on the shelves. The robot's task is to retrieve items from the supermarket shelves, and this frame shows the initial step of picking up a shiitake mushroom."

python examples/video2world.py \
  --model_size 2B \
  --dit_path "checkpoints/posttraining/video2world/2b_agibot_head_center_fisheye_color/checkpoints/model/iter_000001000.pt" \
  --prompt "${PROMPT_}" \
  --input_path "datasets/agibot_head_center_fisheye_color/val/videos/task_327_episode_685393_window_0_frame_0-149.mp4" \
  --num_conditional_frames 1 \
  --save_path output/generated_video_2b_agibot_fisheye.mp4

To load EMA weights from the post-trained checkpoint, add the --load_ema argument.

python examples/video2world.py \
  --model_size 2B \
  --dit_path "checkpoints/posttraining/video2world/2b_agibot_head_center_fisheye_color/checkpoints/model/iter_000001000.pt" \
  --prompt "${PROMPT_}" \
  --input_path "datasets/agibot_head_center_fisheye_color/val/videos/task_327_episode_685393_window_0_frame_0-149.mp4" \
  --num_conditional_frames 1 \
  --load_ema \
  --save_path output/generated_video_2b_agibot_fisheye_ema.mp4

Tip

For inference run details, refer to the Video2World Model Reference.

Cosmos-Predict2-14B-Video2World#

The 14B model can be run similarly by changing the --model_size and --dit_path arguments.
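
For example, assuming a post-trained 14B checkpoint at 1000 iterations in the default location shown above:

python examples/video2world.py \
  --model_size 14B \
  --dit_path "checkpoints/posttraining/video2world/14b_agibot_head_center_fisheye_color/checkpoints/model/iter_000001000.pt" \
  --prompt "${PROMPT_}" \
  --input_path "datasets/agibot_head_center_fisheye_color/val/videos/task_327_episode_685393_window_0_frame_0-149.mp4" \
  --num_conditional_frames 1 \
  --save_path output/generated_video_14b_agibot_fisheye.mp4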