Predict2 Quickstart Guide

This page will walk you through setting up and running inference with the pre-trained Cosmos-Predict2-2B-Video2World model.

Set up the Video2World Model

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict2 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token (a non-interactive alternative is shown after this list):

    huggingface-cli login
    
  5. Review and accept the LlamaGuard-7b terms.

  6. Download the model weights for Cosmos-Predict2-2B-Video2World from Hugging Face:

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B --model_types Video2World --checkpoint_dir checkpoints
    
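If you prefer not to paste the token interactively in step 4, the Hugging Face CLI can also take it as an argument; a minimal sketch, assuming your token is exported in an HF_TOKEN environment variable (the variable name here is just an example):

    huggingface-cli login --token "$HF_TOKEN"

Once the download in step 6 finishes, you can verify that the weights are in place by listing the directory passed via --checkpoint_dir:

    ls checkpoints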

Generate a Video from Text and Image Input

Generate a video from text and image input using the Cosmos-Predict2-2B-Video2World model. To do so, create a text prompt and pass it, along with the image, to the video2world.py script:

Note

The sample text and image inputs used below are provided as the input0.txt and input0.jpg files in the assets/video2world directory.

PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple \
double-decker buses are parked under the glow of overhead lights, with a central bus labeled “87D” facing \
forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights \
brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in \
the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its \
position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban \
night scene."

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --input_image_or_video_path assets/video2world/input0.jpg \
    --num_input_frames 1 \
    --diffusion_transformer_dir Cosmos-Predict2-2B-Video2World \
    --disable_prompt_upsampler \
    --prompt "${PROMPT}" \
    --height 432 --width 768 --num_video_frames 81 \
    --num_steps 35 \
    --video_save_name video2world_2b

The inference output will be saved as outputs/video2world_2b.mp4, along with the corresponding prompt at outputs/video2world_2b.txt.
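
To spot-check the output without opening a video player, you can read the stream metadata; a quick check, assuming ffmpeg (which provides ffprobe) is installed:

    ffprobe -v error -select_streams v:0 \
        -show_entries stream=width,height,nb_frames -of csv=p=0 \
        outputs/video2world_2b.mp4

With the settings used above, this should report a width of 768, a height of 432, and 81 frames.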

Next Steps

Follow the Post-Training Guide to post-train a Predict2 model for your physical AI use case, or explore all Predict2 model input/output options in the Predict2 Model Reference.