Quickstart Guide#

This guide provides instructions on running inference with the Cosmos-Transfer2.5/general model.

Note

Ensure you have completed the steps in the Transfer2.5 Installation Guide before running inference.

Hardware Requirements#

The following table shows the GPU memory requirements for different Cosmos-Transfer2.5 models for single-GPU inference:

| Model | Required GPU VRAM |
| --- | --- |
| Cosmos-Transfer2.5-2B | 65.4 GB |
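Before launching a run, it can help to confirm that a GPU meets the memory requirement above. The following is a minimal sketch, not part of the Cosmos-Transfer2.5 tooling: the helper names are hypothetical, the query relies on the standard `nvidia-smi` memory query, and treating the listed "GB" figure as GiB is a deliberately conservative assumption.

```python
import subprocess

# From the table above. Interpreting "GB" as GiB is an assumption
# (it is the stricter of the two readings).
REQUIRED_GIB = 65.4


def parse_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_gpu_memory_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_memory_mib(result.stdout)


def fits_on_gpu(total_mib: int, required_gib: float = REQUIRED_GIB) -> bool:
    """Return True if a GPU with total_mib MiB of memory meets the requirement."""
    return total_mib >= required_gib * 1024
```

On a machine with NVIDIA drivers installed, `[fits_on_gpu(m) for m in query_gpu_memory_mib()]` gives one result per GPU. Note that this checks total memory, not memory currently free.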

Inference Performance#

Segmentation#

The table below shows generation times (*) for single-GPU inference across different NVIDIA GPUs:

| GPU Hardware | Cosmos-Transfer2.5-2B 93-frame generation time (sec) | Cosmos-Transfer2.5-2B E2E time (sec) (**) |
| --- | --- | --- |
| NVIDIA B200 | 92.25 | 186.92 |
| NVIDIA H100 NVL | 445.52 | 895.33 |
| NVIDIA H100 PCIe | 264.13 | 533.58 |
| NVIDIA H20 | 683.65 | 1370.39 |

(*) Generation times are for 720p video at 16 FPS with segmentation control input and guardrails disabled.

(**) E2E time is measured for an input video with 121 frames, which results in two 93-frame “chunk” generations.
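The relationship between input length, chunk count, and end-to-end time can be sketched with a little arithmetic. This is a rough estimate only: the function names are hypothetical, and the sketch assumes that chunks do not overlap, which may not match how the actual pipeline splits longer videos.

```python
import math

CHUNK_FRAMES = 93  # frames generated per chunk, per the footnote above


def estimate_chunks(num_input_frames: int, chunk_frames: int = CHUNK_FRAMES) -> int:
    """Estimate how many chunk generations an input video needs.

    Assumption: chunks do not overlap. If the real pipeline overlaps
    consecutive chunks, longer videos may need more chunks than this.
    """
    return max(1, math.ceil(num_input_frames / chunk_frames))


def estimate_diffusion_seconds(num_input_frames: int, seconds_per_chunk: float) -> float:
    """Estimate total generation time as chunk count times per-chunk time."""
    return estimate_chunks(num_input_frames) * seconds_per_chunk
```

For the 121-frame benchmark input this gives two chunks. On the NVIDIA B200, two chunks at 92.25 sec each come to 184.5 sec, which is close to the measured E2E time of 186.92 sec; the remainder is presumably overhead outside generation.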

Edge#

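The table below compares base and distilled Transfer2.5 Edge inference performance across GPU architectures. The GPUs column gives the number of GPUs used for the run:

| Metric | GPUs | RTX PRO 6000 Blackwell SE | H20 | H100 NVL | H200 NVL | B200 | B300 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. Distilled Model Diffusion Time (s) | 1 | 78.5 | 176.4 | 64.5 | 49.8 | 24.2 | 53.2 |
| Avg. Distilled Model Diffusion Time (s) | 4 | 33.7 | 62.7 | 27.4 | 20.4 | 12.6 | 25.6 |
| Avg. Distilled Model Diffusion Time (s) | 8 | 25.0 | 44.0 | 20.4 | 16.9 | 11.1 | 19.9 |
| Avg. Base Diffusion Time (s) | 1 | 605.7 | 1374.6 | 502.6 | 374.4 | 179.7 | 415.5 |
| Avg. Base Diffusion Time (s) | 4 | 196.1 | 373.4 | 154.5 | 117.0 | 62.3 | 127.7 |
| Avg. Base Diffusion Time (s) | 8 | 118.8 | 201.5 | 92.5 | 82.4 | 41.8 | 76.1 |
| Avg. Performance Improvement | 1 | 7.7x | 7.8x | 7.8x | 7.5x | 7.4x | 7.8x |
| Avg. Performance Improvement | 4 | 5.8x | 6.0x | 5.6x | 5.7x | 5.0x | 5.0x |
| Avg. Performance Improvement | 8 | 4.7x | 4.6x | 4.5x | 4.9x | 3.8x | 3.8x |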
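The performance improvement figures are consistent with dividing the base diffusion time by the distilled diffusion time. The sketch below reproduces the single-GPU row from the listed values; the variable and function names are illustrative, and small differences in other rows may come from rounding of the published times.

```python
# Single-GPU average diffusion times in seconds, copied from the table above.
BASE_1GPU = {
    "RTX PRO 6000 Blackwell SE": 605.7,
    "H20": 1374.6,
    "H100 NVL": 502.6,
    "H200 NVL": 374.4,
    "B200": 179.7,
    "B300": 415.5,
}
DISTILLED_1GPU = {
    "RTX PRO 6000 Blackwell SE": 78.5,
    "H20": 176.4,
    "H100 NVL": 64.5,
    "H200 NVL": 49.8,
    "B200": 24.2,
    "B300": 53.2,
}


def speedup(base_seconds: float, distilled_seconds: float) -> float:
    """Return how many times faster the distilled model is than the base model."""
    return base_seconds / distilled_seconds


# Round to one decimal place to match the table's precision.
SPEEDUPS_1GPU = {
    gpu: round(speedup(BASE_1GPU[gpu], DISTILLED_1GPU[gpu]), 1) for gpu in BASE_1GPU
}
```

`SPEEDUPS_1GPU` matches the 1-GPU "Avg. Performance Improvement" row: 7.7, 7.8, 7.8, 7.5, 7.4, and 7.8.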

Example Inference Command#

Individual control variants can be run on a single GPU:

python examples/inference.py -i assets/robot_example/depth/robot_depth_spec.json -o outputs/depth

For multi-GPU inference on a single control, or to run multiple control variants, use torchrun:

torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py -i assets/robot_example/depth/robot_depth_spec.json -o outputs/depth

For an explanation of all available parameters, run:

python examples/inference.py --help

python examples/inference.py control:edge --help  # for information specific to edge control
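When inference is launched from a script, the single-GPU and multi-GPU commands above can be assembled programmatically. This is a minimal sketch: `build_inference_command` is a hypothetical helper, and it uses only the flags shown in this guide.

```python
def build_inference_command(
    spec_path: str,
    output_dir: str,
    num_gpus: int = 1,
    master_port: int = 12341,
) -> list[str]:
    """Build the argument list for a single-GPU or multi-GPU inference run."""
    script_args = ["examples/inference.py", "-i", spec_path, "-o", output_dir]
    if num_gpus == 1:
        # Single GPU: run the script directly with python.
        return ["python", *script_args]
    # Multiple GPUs: launch one process per GPU with torchrun.
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        f"--master_port={master_port}",
        *script_args,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.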

Example Parameter Files#

An example parameter file for each individual control variant is provided, along with a multi-control variant:

| Variant | Parameter File |
| --- | --- |
| Depth | assets/robot_example/depth/robot_depth_spec.json |
| Edge | assets/robot_example/edge/robot_edge_spec.json |
| Segmentation | assets/robot_example/seg/robot_seg_spec.json |
| Blur | assets/robot_example/vis/robot_vis_spec.json |
| Multi-control | assets/robot_example/multicontrol/robot_multicontrol_spec.json |
| Distilled/Edge | assets/robot_example/distilled/edge/robot_edge_spec.json |

Parameters can be specified as follows:

{
    // Path to the prompt file; use "prompt" instead to specify the prompt text directly
    "prompt_path": "assets/robot_example/robot_prompt.json",

    // Directory to save the generated video
    "output_dir": "outputs/robot_multicontrol",

    // Path to the input video
    "video_path": "assets/robot_example/robot_input.mp4",

    // Inference settings
    "guidance": 3,

    // Depth control settings
    "depth": {
        // Path to the control video
        // If a control is not provided, it will be computed on the fly.
        "control_path": "assets/robot_example/depth/robot_depth.mp4",

        // Control weight for the depth control
        "control_weight": 0.5
    },

    // Edge control settings
    "edge": {
        // Path to the control video
        "control_path": "assets/robot_example/edge/robot_edge.mp4"
        // No control_weight is given, so the default of 1.0 is used for edge control
    },

    // Seg control settings
    "seg": {
        // Path to the control video
        "control_path": "assets/robot_example/seg/robot_seg.mp4",

        // Control weight for the seg control
        "control_weight": 1.0
    },

    // Blur control settings
    "vis": {
        // Control video computed on the fly
        "control_weight": 0.5
    }
}
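The annotated example above can also be generated from Python, which produces strict JSON with no comments or trailing commas. This is a sketch under stated assumptions: the function names are hypothetical, and the keys and values simply mirror the example above.

```python
import json
from pathlib import Path


def build_multicontrol_spec() -> dict:
    """Return the multi-control parameters from the example above as a dict."""
    return {
        "prompt_path": "assets/robot_example/robot_prompt.json",
        "output_dir": "outputs/robot_multicontrol",
        "video_path": "assets/robot_example/robot_input.mp4",
        "guidance": 3,
        "depth": {
            "control_path": "assets/robot_example/depth/robot_depth.mp4",
            "control_weight": 0.5,
        },
        # No control_weight here, so the default described above applies.
        "edge": {"control_path": "assets/robot_example/edge/robot_edge.mp4"},
        "seg": {
            "control_path": "assets/robot_example/seg/robot_seg.mp4",
            "control_weight": 1.0,
        },
        # No control_path here, so the blur control is computed on the fly.
        "vis": {"control_weight": 0.5},
    }


def write_spec(spec: dict, path: str) -> None:
    """Write a spec dict to disk as strict JSON, creating parent directories."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(spec, indent=4) + "\n")
```

Calling `write_spec(build_multicontrol_spec(), "my_specs/robot_multicontrol_spec.json")` writes a file that can be passed to `examples/inference.py` with `-i`; the output path here is only an example.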

Mask Support#

Binary spatiotemporal masks can limit control inputs to specific spatial regions. White pixels indicate where the control is applied; black pixels suppress it. Specify the mask with mask_path in the control settings:

{
    "depth": {
        "control_path": "assets/robot_example/depth/robot_depth.mp4",
        "mask_path": "/path/to/depth/mask.mp4",
        "control_weight": 0.5
    }
}
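A mask can be attached to an existing spec without editing the JSON by hand. The helper below is a hypothetical sketch; it only sets the `mask_path` key described above and leaves the original spec unchanged.

```python
import copy


def with_mask(spec: dict, control: str, mask_path: str) -> dict:
    """Return a copy of spec with mask_path set on the named control section."""
    if not isinstance(spec.get(control), dict):
        raise KeyError(f"spec has no control settings named {control!r}")
    updated = copy.deepcopy(spec)  # avoid mutating the caller's spec
    updated[control]["mask_path"] = mask_path
    return updated
```

For example, `with_mask(spec, "depth", "/path/to/depth/mask.mp4")` yields the depth configuration shown above when `spec` already contains a `depth` section.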

Distilled Model#

The distilled Transfer2.5 Edge model runs diffusion roughly 3.8x to 7.8x faster than the base model in the Edge benchmarks above, depending on the GPU and the number of GPUs. To use it, set num_steps to 4 in the JSON configuration and pass --model=edge/distilled on the command line.

Note

The distilled model is intended for short videos (strictly 93 sampled frames).

Example JSON configuration for the distilled Edge model:

{
    "name": "robot_edge",
    "prompt_path": "/path/to/prompt/robot_prompt.txt",
    "video_path": "/path/to/input/robot_input.mp4",
    "guidance": 3,
    "num_steps": 4,
    "edge": {
        "control_path": "/path/to/edge/robot_edge.mp4",
        "control_weight": 1.0
    }
}
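An existing Edge spec can be adapted for the distilled model by setting the step count described above. This is a minimal sketch with a hypothetical helper name; it changes only `num_steps` and assumes the rest of the spec is already valid for Edge control.

```python
import copy

DISTILLED_NUM_STEPS = 4  # value required by the distilled model, per the text above


def to_distilled_edge_spec(spec: dict) -> dict:
    """Return a copy of an Edge spec with num_steps set for the distilled model."""
    if "edge" not in spec:
        raise ValueError("the distilled model in this guide is an Edge model; "
                         "the spec needs an 'edge' section")
    distilled = copy.deepcopy(spec)  # leave the original spec untouched
    distilled["num_steps"] = DISTILLED_NUM_STEPS
    return distilled
```

The resulting spec still needs `--model=edge/distilled` on the command line when running inference.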

Run inference with the distilled model:

# 8 GPUs
torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py \
    -i assets/robot_example/distilled/edge/robot_edge_spec.json \
    -o outputs/distilled/edge \
    --model=edge/distilled

# 1 GPU
python examples/inference.py \
    -i assets/robot_example/distilled/edge/robot_edge_spec.json \
    -o outputs/distilled/edge \
    --model=edge/distilled

Example Output#

The following video shows output from the multi-control variant:

Next Steps#

Refer to the Transfer2.5 Model Reference page for more information on running inference with the Auto Multiview model. If you’re ready to start post-training, refer to the Transfer2.5 Post-Training Guides page.