Predict2 Quickstart Guide#

This page walks you through setting up and running inference with the pre-trained Cosmos-Predict2-2B-Video2World model. You can use this model to generate a video from a still image or short video clip, guided by a text description.

Set up the Video2World Model#

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the environment.

  3. Generate a Hugging Face access token. Set the access token type to ‘Read’ (the default token type is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token (for a non-interactive alternative, see the sketch after this list):

    huggingface-cli login
    
  5. Review and accept the Llama-Guard-3-8B terms.

  6. Download the model weights for Cosmos-Predict2-2B-Video2World from Hugging Face:

    python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B
    

    Note

    This command downloads the 720p, 16 FPS version of the model checkpoint by default. Refer to the Reference for downloading different checkpoint versions.
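
If you need to log in non-interactively (for example, inside a container or CI job), the Hugging Face CLI also accepts the token as a flag. A minimal sketch, assuming your token is stored in an HF_TOKEN environment variable:

huggingface-cli login --token "${HF_TOKEN}"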

Generate a Video from Text and Image Input#

Generate a video from text and image input using the Cosmos-Predict2-2B-Video2World model. To do so, create a text prompt and pass it, along with the input image, to the video2world.py script:

Note

The sample image input used below is provided as the input0.jpg file in the assets/video2world directory.

PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement.
At first, multiple double-decker buses are parked under the glow of overhead lights, with
a central bus labeled '87D' facing forward and stationary. As the video progresses, the
bus in the middle moves ahead slowly, its headlights brightening the surrounding area
and casting reflections onto adjacent vehicles. The motion creates space in the lineup,
signaling activity within the otherwise quiet station. It then comes to a smooth stop,
resuming its position in line. Overhead signage in Chinese characters remains illuminated,
enhancing the vibrant, urban night scene."

python -m examples.video2world \
    --model_size 2B \
    --input_path assets/video2world/input0.jpg \
    --num_conditional_frames 1 \
    --prompt "${PROMPT}" \
    --save_path output/video2world_2b.mp4

The inference output will be saved as output/video2world_2b.mp4.
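
To quickly verify the result, you can probe the generated file; a minimal check, assuming the ffmpeg tools are installed:

ffprobe -hide_banner output/video2world_2b.mp4

As noted at the top of this page, Video2World can also condition on a short video clip instead of a single image. A hedged sketch of the same command with video input; the input path here is a placeholder, and the supported --num_conditional_frames values are documented in the Model Reference:

python -m examples.video2world \
    --model_size 2B \
    --input_path path/to/input_clip.mp4 \
    --num_conditional_frames 5 \
    --prompt "${PROMPT}" \
    --save_path output/video2world_2b_from_video.mp4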

Next Steps#

Follow the Post-Training Guide to post-train a Predict2 model for your physical AI use case or explore all Predict model input/output options in the Model Reference.