Quickstart Guide#
This guide provides instructions on running inference with the Cosmos-Transfer2.5/general model.
Note
Ensure you have completed the steps in the Transfer2.5 Installation Guide before running inference.
Hardware Requirements#
The following table shows the GPU memory requirements for different Cosmos-Transfer2.5 models for single-GPU inference:
| Model | Required GPU VRAM |
|---|---|
| Cosmos-Transfer2.5-2B | 65.4 GB |
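Before launching a run, it can help to confirm that a GPU meets the requirement above. The following is a minimal sketch, not part of the Cosmos tooling: the helper name `gpus_with_enough_vram` is hypothetical, it assumes the table's "GB" means 10^9 bytes, and it compares against total (not free) memory as reported in MiB by `nvidia-smi`.

```python
import subprocess

REQUIRED_GB = 65.4  # Cosmos-Transfer2.5-2B, from the table above


def gpus_with_enough_vram(smi_output: str, required_gb: float = REQUIRED_GB) -> list[int]:
    """Return indices of GPUs whose total memory meets the requirement.

    smi_output holds one integer per line: total memory in MiB.
    Assumption: 1 GB = 10**9 bytes.
    """
    required_mib = required_gb * 1e9 / 2**20
    totals = [int(token) for token in smi_output.split()]
    return [index for index, mib in enumerate(totals) if mib >= required_mib]


if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        output = ""  # no NVIDIA driver or GPU visible
    print("GPUs meeting the requirement:", gpus_with_enough_vram(output))
```

Total memory is only an upper bound: other processes on the same GPU reduce what is actually available.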
Inference Performance#
Segmentation#
The table below shows generation times (*) across different NVIDIA GPU hardware for single-GPU inference:

| GPU Hardware | Cosmos-Transfer2.5-2B 93-frame generation time | Cosmos-Transfer2.5-2B E2E time (**) |
|---|---|---|
| NVIDIA B200 | 92.25 sec | 186.92 sec |
| NVIDIA H100 NVL | 445.52 sec | 895.33 sec |
| NVIDIA H100 PCIe | 264.13 sec | 533.58 sec |
| NVIDIA H20 | 683.65 sec | 1370.39 sec |
* Generation times are for 720p video at 16 FPS with segmentation control input and guardrails disabled.
** E2E time is measured for an input video with 121 frames, which results in two 93-frame “chunk” generations.
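The chunk count in the footnote can be reproduced with simple arithmetic. The sketch below is illustrative only: the function `num_chunks` and its `overlap` parameter (frames shared between consecutive chunks) are assumptions, since the source does not state how chunks overlap. For a 121-frame input, the result is two chunks whether the overlap is 0 or 1.

```python
import math


def num_chunks(total_frames: int, chunk_frames: int = 93, overlap: int = 1) -> int:
    """Number of fixed-size chunks needed to cover total_frames.

    overlap is a hypothetical number of frames each chunk shares with the
    previous one; the source does not specify this value.
    """
    if total_frames <= chunk_frames:
        return 1
    stride = chunk_frames - overlap  # new frames contributed by each later chunk
    return 1 + math.ceil((total_frames - chunk_frames) / stride)


# A 121-frame input needs two 93-frame chunks, matching the footnote.
print(num_chunks(121))

# On the B200, E2E time is roughly twice the single-chunk generation time.
ratio = 186.92 / 92.25
print(f"E2E / single-chunk time on B200: {ratio:.2f}")
```

This is consistent with the table: each E2E time is roughly double the corresponding 93-frame generation time.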
Edge#
The table below compares base vs. distilled Transfer2.5 Edge inference performance across GPU architectures:
| Metric | GPUs | RTX PRO 6000 Blackwell SE | H20 | H100 NVL | H200 NVL | B200 | B300 |
|---|---|---|---|---|---|---|---|
| Avg. Distilled Model Diffusion Time (s) | 1 | 78.5 | 176.4 | 64.5 | 49.8 | 24.2 | 53.2 |
| | 4 | 33.7 | 62.7 | 27.4 | 20.4 | 12.6 | 25.6 |
| | 8 | 25.0 | 44.0 | 20.4 | 16.9 | 11.1 | 19.9 |
| Avg. Base Diffusion Time (s) | 1 | 605.7 | 1374.6 | 502.6 | 374.4 | 179.7 | 415.5 |
| | 4 | 196.1 | 373.4 | 154.5 | 117.0 | 62.3 | 127.7 |
| | 8 | 118.8 | 201.5 | 92.5 | 82.4 | 41.8 | 76.1 |
| Avg. Performance Improvement | 1 | 7.7x | 7.8x | 7.8x | 7.5x | 7.4x | 7.8x |
| | 4 | 5.8x | 6.0x | 5.6x | 5.7x | 5.0x | 5.0x |
| | 8 | 4.7x | 4.6x | 4.5x | 4.9x | 3.8x | 3.8x |
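The "Avg. Performance Improvement" rows appear to be the base diffusion time divided by the distilled diffusion time. As a quick check under that assumption, the following sketch recomputes the single-GPU row from the values in the table; the variable and function names are illustrative.

```python
# Column order follows the table header.
GPUS = ["RTX PRO 6000 Blackwell SE", "H20", "H100 NVL", "H200 NVL", "B200", "B300"]

# Single-GPU rows from the table above, in seconds.
BASE_1GPU = [605.7, 1374.6, 502.6, 374.4, 179.7, 415.5]
DISTILLED_1GPU = [78.5, 176.4, 64.5, 49.8, 24.2, 53.2]


def speedups(base: list[float], distilled: list[float]) -> list[float]:
    """Speedup of the distilled model over the base model, rounded to one decimal."""
    return [round(b / d, 1) for b, d in zip(base, distilled)]


for gpu, factor in zip(GPUS, speedups(BASE_1GPU, DISTILLED_1GPU)):
    print(f"{gpu}: {factor}x")
```

The recomputed single-GPU values match the table. The speedup shrinks as the GPU count grows, from roughly 7.4x to 7.8x on one GPU down to 3.8x to 4.9x on eight.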
Example Inference Command#
Individual control variants can be run on a single GPU:
python examples/inference.py -i assets/robot_example/depth/robot_depth_spec.json -o outputs/depth
For multi-GPU inference on a single control, or to run multiple control variants, use torchrun:
torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py -i assets/robot_example/depth/robot_depth_spec.json -o outputs/depth
For an explanation of all available parameters, run:
python examples/inference.py --help
python examples/inference.py control:edge --help # for information specific to edge control
Example Parameter Files#
An example parameter file for each individual control variant is provided, along with a multi-control variant:
| Variant | Parameter File |
|---|---|
| Depth | |
| Edge | |
| Segmentation | |
| Blur | |
| Multi-control | |
| Distilled/Edge | |
Parameters can be specified as follows:
{
    // Path to the prompt file; use "prompt" instead to specify the prompt text directly
    "prompt_path": "assets/robot_example/robot_prompt.json",
    // Directory to save the generated video
    "output_dir": "outputs/robot_multicontrol",
    // Path to the input video
    "video_path": "assets/robot_example/robot_input.mp4",
    // Inference settings
    "guidance": 3,
    // Depth control settings
    "depth": {
        // Path to the control video
        // If a control is not provided, it will be computed on the fly.
        "control_path": "assets/robot_example/depth/robot_depth.mp4",
        // Control weight for the depth control
        "control_weight": 0.5
    },
    // Edge control settings
    "edge": {
        // Path to the control video
        "control_path": "assets/robot_example/edge/robot_edge.mp4"
        // control_weight is omitted, so the default of 1.0 applies
    },
    // Seg control settings
    "seg": {
        // Path to the control video
        "control_path": "assets/robot_example/seg/robot_seg.mp4",
        // Control weight for the seg control
        "control_weight": 1.0
    },
    // Blur control settings
    "vis": {
        // Control video computed on the fly
        "control_weight": 0.5
    }
}
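The annotated example above uses comments, which strict JSON does not allow. One way to avoid hand-editing is to build the parameter file programmatically. The sketch below writes the same multi-control settings as plain JSON using only the Python standard library; the helper `write_spec` and the output file name are hypothetical, and it assumes the inference script accepts comment-free JSON.

```python
import json
import tempfile
from pathlib import Path

# Same settings as the annotated example above, expressed as a Python dict.
spec = {
    "prompt_path": "assets/robot_example/robot_prompt.json",
    "output_dir": "outputs/robot_multicontrol",
    "video_path": "assets/robot_example/robot_input.mp4",
    "guidance": 3,
    "depth": {
        "control_path": "assets/robot_example/depth/robot_depth.mp4",
        "control_weight": 0.5,
    },
    # No control_weight: the default of 1.0 applies.
    "edge": {"control_path": "assets/robot_example/edge/robot_edge.mp4"},
    "seg": {
        "control_path": "assets/robot_example/seg/robot_seg.mp4",
        "control_weight": 1.0,
    },
    # No control_path: the blur control is computed on the fly.
    "vis": {"control_weight": 0.5},
}


def write_spec(spec: dict, path: Path) -> Path:
    """Serialize a parameter dict to a strict-JSON file and return its path."""
    path.write_text(json.dumps(spec, indent=2))
    return path


# Hypothetical output location; pass the resulting file to inference.py with -i.
spec_file = write_spec(spec, Path(tempfile.mkdtemp()) / "robot_multicontrol_spec.json")
print(spec_file)
```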
Mask Support#
Binary spatiotemporal masks can limit control inputs to specific spatial regions. White pixels indicate where the control is applied; black pixels suppress it. Specify the mask with mask_path in the control settings:
{
    "depth": {
        "control_path": "assets/robot_example/depth/robot_depth.mp4",
        "mask_path": "/path/to/depth/mask.mp4",
        "control_weight": 0.5
    }
}
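Because each control section can reference a control video and a mask, a mistyped path is easy to miss until a run fails. The following is a minimal pre-flight sketch, not part of the Cosmos tooling: the helper `missing_paths` is hypothetical, and it checks only the keys that appear in the examples on this page.

```python
from pathlib import Path

# Control sections used in the examples on this page.
CONTROL_KEYS = ("depth", "edge", "seg", "vis")


def missing_paths(spec: dict) -> list[str]:
    """Return every file path referenced by the spec that does not exist on disk."""
    candidates = [spec.get("prompt_path"), spec.get("video_path")]
    for key in CONTROL_KEYS:
        control = spec.get(key, {})
        candidates += [control.get("control_path"), control.get("mask_path")]
    return [path for path in candidates if path is not None and not Path(path).exists()]


example = {
    "depth": {
        "control_path": "/nonexistent/robot_depth.mp4",
        "mask_path": "/nonexistent/mask.mp4",
        "control_weight": 0.5,
    }
}
print(missing_paths(example))
```

Relative paths are resolved against the current working directory, so run the check from the directory where inference will be launched.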
Distilled Model#
The distilled Transfer2.5 Edge model provides significantly faster inference. To use it, set num_steps to 4 in the JSON configuration and pass --model=edge/distilled on the command line.
Note
The distilled model is intended for short videos (strictly 93 sampled frames).
Example JSON configuration for the distilled Edge model:
{
    "name": "robot_edge",
    "prompt_path": "/path/to/prompt/robot_prompt.txt",
    "video_path": "/path/to/input/robot_input.mp4",
    "guidance": 3,
    "num_steps": 4,
    "edge": {
        "control_path": "/path/to/edge/robot_edge.mp4",
        "control_weight": 1.0
    }
}
Run inference with the distilled model:
# 8 GPUs
torchrun --nproc_per_node=8 --master_port=12341 examples/inference.py \
    -i assets/robot_example/distilled/edge/robot_edge_spec.json \
    -o outputs/distilled/edge \
    --model=edge/distilled

# 1 GPU
python examples/inference.py \
    -i assets/robot_example/distilled/edge/robot_edge_spec.json \
    -o outputs/distilled/edge \
    --model=edge/distilled
Example Output#
The following video shows output from the multi-control variant:
Next Steps#
Refer to the Transfer2.5 Model Reference page for more information on running inference with the Auto Multiview model. If you’re ready to start post-training, refer to the Transfer2.5 Post-Training Guides page.