Quickstart Guide#
This guide provides instructions on running inference with the Cosmos-Predict2.5/base model.
Note
Ensure you have completed the steps in the Predict2.5 Installation Guide before running inference.
Example Inference Command#
Run inference with the example robotics asset:
python examples/inference.py -i assets/base/robot_pouring.json -o outputs/base_video2world --inference-type=video2world
Tip
To enable multi-GPU inference with 8 GPUs, use torchrun:
torchrun --nproc_per_node=8 examples/inference.py -i assets/base/robot_pouring.json -o outputs/base_video2world --inference-type=video2world
Inference Options#
You can use the --inference-type argument to specify the type of inference to perform with the Cosmos-Predict2.5/base model:
Text2World:
-o outputs/base_text2world --inference-type=text2world
Image2World:
-o outputs/base_image2world --inference-type=image2world
Video2World:
-o outputs/base_video2world --inference-type=video2world
Use the following command to run all example assets:
torchrun --nproc_per_node=8 examples/inference.py -i assets/base/*.json -o outputs/base
You can use the --model argument to specify the model to use:
2B/post-trained:
--model 2B/post-trained
Use the following command to list all available options:
python examples/inference.py --help
Input Parameters#
Input parameters are specified as a JSON file. An annotated example is shown below (the // comments are for illustration only; standard JSON does not support comments, so remove them from an actual parameter file):
{
// Inference type: text2world, image2world, video2world
"inference_type": "video2world",
// Sample name
"name": "robot_pouring",
// Input prompt
"prompt": "A robotic arm, primarily white with black joints and cables...",
// Path to the input image/video file (not needed for text2world)
"input_path": "robot_pouring.mp4"
}
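Because standard JSON forbids comments, a real parameter file must omit the annotations shown above. A minimal sketch of building and validating such a file with Python's json module (field names are taken from the example above; the output filename is hypothetical):

```python
import json

# Parameter fields mirror the example above; the annotations live here
# as Python comments because JSON itself has no comment syntax.
params = {
    "inference_type": "video2world",  # text2world, image2world, or video2world
    "name": "robot_pouring",          # sample name
    "prompt": "A robotic arm, primarily white with black joints and cables...",
    "input_path": "robot_pouring.mp4",  # omit for text2world
}

# Serialize and round-trip to confirm the structure is valid JSON.
text = json.dumps(params, indent=2)
loaded = json.loads(text)
assert loaded == params
# Write `text` to e.g. assets/base/my_sample.json and pass it via -i.
```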
Example Outputs#
text2world/snowy_stop_light#
image2world/robot_welding#
video2world/sand_mining#
Tips#
Multi-GPU#
Context parallelism distributes inference across multiple GPUs, with each GPU generating a subset of the video frames. Keep the following in mind:
The number of GPUs should ideally be a divisor of the number of frames in the generated video.
All GPUs should have the same model capacity and memory.
Context parallelism works best with the 14B model, where memory constraints are significant.
It requires NCCL support and a proper GPU interconnect for efficient communication.
It provides a significant speedup for video generation while maintaining the same quality.
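As a rough illustration of the divisor guideline above, one could pick the GPU count as the largest available divisor of the frame count (a sketch with assumed numbers; the helper function and example frame count are not part of the Cosmos tooling):

```python
def pick_gpu_count(num_frames: int, available_gpus: int) -> int:
    """Largest GPU count <= available_gpus that evenly divides num_frames."""
    for n in range(available_gpus, 0, -1):
        if num_frames % n == 0:
            return n
    return 1

# Hypothetical example: a 120-frame video on a node with 8 GPUs.
print(pick_gpu_count(120, 8))  # -> 8, since 120 % 8 == 0
```

The result would then be passed to torchrun as --nproc_per_node.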
Prompt Engineering#
For best results with Cosmos models, create detailed prompts that emphasize physical realism, natural laws, and real-world behaviors. Describe specific objects, materials, lighting conditions, and spatial relationships while maintaining logical consistency throughout the scene.
Incorporate photography terminology like composition, lighting setups, and camera settings. Use concrete terms like “natural lighting” or “wide-angle lens” rather than abstract descriptions, unless intentionally aiming for surrealism. Include negative prompts to explicitly specify undesired elements.
The more grounded a prompt is in real-world physics and natural phenomena, the more physically plausible and realistic the generated video will be.
Next Steps#
Refer to the :ref:`Predict2.5 Model Reference <predict2.5-model-reference>` page for more information on running inference with the Auto Multiview or Robot Action Conditioned models. If you're ready to start post-training, refer to the :ref:`Predict2.5 Post-Training Guides <predict2.5-post-training-guides>` page.