Diffusion Quickstart Guide

This page walks you through setting up the pre-trained Cosmos-Predict1-7B-Text2World diffusion model and running inference with it.

Set up the Diffusion Model

  1. Ensure you have the necessary hardware and software, as outlined on the Prerequisites page.

  2. Follow the Installation guide to download the Cosmos-Predict1 repo and set up the conda environment.

  3. Generate a Hugging Face access token. Set the access token permission to ‘Read’ (the default permission is ‘Fine-grained’).

  4. Log in to Hugging Face with the access token:

    huggingface-cli login
    
  5. Download the model weights for Cosmos-Predict1-7B-Text2World from Hugging Face:

    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Text2World
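
Steps 4 and 5 can also be scripted, for example on a remote machine where the interactive login prompt is inconvenient. The following is a minimal sketch and not part of the Cosmos-Predict1 repo: both function names are hypothetical, `HF_TOKEN` is assumed to hold the token generated in step 3, and the `checkpoints/Cosmos-Predict1-7B-Text2World` location is an assumption based on the `--checkpoint_dir` and `--diffusion_transformer_dir` values used by the inference command.

```shell
# Hypothetical helper: log in without the interactive prompt.
# Assumes the token from step 3 has been exported as HF_TOKEN.
hf_login_from_env() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before logging in" >&2
        return 1
    fi
    huggingface-cli login --token "${HF_TOKEN}"
}

# Hypothetical helper: succeed only if <root>/<model> exists and is non-empty.
# $1: checkpoint root directory, $2: model directory name.
checkpoint_present() {
    [ -d "$1/$2" ] && [ -n "$(ls -A "$1/$2" 2>/dev/null)" ]
}

# Example calls, run from the repo root after the download finishes:
#   hf_login_from_env
#   checkpoint_present checkpoints Cosmos-Predict1-7B-Text2World && echo "weights found"
```

The directory check only confirms that something was downloaded. It does not validate the files, so a failed or partial download may still need the download script to be re-run.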
    

Generate a Video from Text Input

To generate a video from a text prompt with the Cosmos-Predict1-7B-Text2World model, define the prompt and pass it to the text2world.py script:

PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
    --prompt "${PROMPT}" \
    --offload_prompt_upsampler \
    --video_save_name diffusion-text2world-7b
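
Long prompts can be easier to maintain in a text file than inline in the shell. The sketch below shows one way to do that. `load_prompt` and `run_text2world` are hypothetical helper names, and the prompt file path is whatever you choose; the flags passed to text2world.py are exactly those used in the command above.

```shell
# Hypothetical helper: read a prompt file, join its lines with single spaces,
# and trim trailing whitespace so the prompt is passed as one line.
load_prompt() {
    tr '\n' ' ' < "$1" | sed 's/  */ /g; s/ *$//'
}

# Hypothetical wrapper around the inference command shown above.
# $1: path to the prompt file, $2: value for --video_save_name.
run_text2world() {
    CUDA_HOME="${CONDA_PREFIX}" PYTHONPATH="$(pwd)" \
    python cosmos_predict1/diffusion/inference/text2world.py \
        --checkpoint_dir checkpoints \
        --diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
        --prompt "$(load_prompt "$1")" \
        --offload_prompt_upsampler \
        --video_save_name "$2"
}

# Example call, run from the repo root:
#   run_text2world prompt.txt diffusion-text2world-7b
```

Keeping the prompt in a file avoids the backslash line continuations and makes it simpler to version prompts alongside the generated videos.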

Note

You can also generate worlds from text combined with image or video input using the Cosmos Video2World diffusion models. Batch generation is also available. Refer to the Diffusion Model Reference for model variants and options.

Next Steps

To adapt a diffusion model to your use case, see the Diffusion Model Post-Training Guide. To explore all diffusion model input and output options, see the Diffusion Model Reference.