Predict2 Quickstart Guide#

This page walks you through setting up and running inference with the pre-trained Cosmos-Predict2-2B-Video2World model. You can use this model to generate a video from a still image or short video clip, guided by a text description.

Set up the Video2World Model#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.

  3. Generate a Hugging Face access token. Set the access token type to ‘Read’ (the default token type is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token (for a non-interactive alternative, see the sketch after this list):

    huggingface-cli login
    
  5. Review and accept the Llama-Guard-3-8B terms.

  6. Download the model weights for Cosmos-Predict2-2B-Video2World from Hugging Face:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B
    

    Note

    This command downloads the 720p, 16 FPS version of the model checkpoint by default. Refer to the Reference for downloading different checkpoint versions.
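
If you need to log in non-interactively (for example, inside a container or CI job), the Hugging Face CLI also accepts the token as a flag. A minimal sketch, assuming your token is stored in an HF_TOKEN environment variable:

huggingface-cli login --token "${HF_TOKEN}"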

Generate a Video from Text and Image Input#

Generate a video from text and image input using the Cosmos-Predict2-2B-Video2World model. To do so, create a text prompt and pass it, along with the input image, to the video2world.py script:

Note

The sample image input used below is provided as the input0.jpg file in the assets/video2world directory.

PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement.
At first, multiple double-decker buses are parked under the glow of overhead lights, with
a central bus labeled '87D' facing forward and stationary. As the video progresses, the
bus in the middle moves ahead slowly, its headlights brightening the surrounding area
and casting reflections onto adjacent vehicles. The motion creates space in the lineup,
signaling activity within the otherwise quiet station. It then comes to a smooth stop,
resuming its position in line. Overhead signage in Chinese characters remains illuminated,
enhancing the vibrant, urban night scene."

python -m examples.video2world \
    --model_size 2B \
    --input_path assets/video2world/input0.jpg \
    --num_conditional_frames 1 \
    --prompt "${PROMPT}" \
    --save_path output/video2world_2b.mp4

The inference output will be saved as output/video2world_2b.mp4.
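
To quickly verify the result, you can probe the generated file; a minimal check, assuming the ffmpeg tools are installed:

ffprobe -hide_banner output/video2world_2b.mp4

As noted at the top of this page, Video2World can also condition on a short video clip instead of a single image. A hedged sketch of the same command with video input; the input path here is a placeholder, and the supported --num_conditional_frames values are documented in the Model Reference:

python -m examples.video2world \
    --model_size 2B \
    --input_path path/to/input_clip.mp4 \
    --num_conditional_frames 5 \
    --prompt "${PROMPT}" \
    --save_path output/video2world_2b_from_video.mp4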

Next Steps#

Follow the Post-Training Guide to post-train a Predict2 model for your physical AI use case or explore all Predict model input/output options in the Model Reference.