Diffusion Model Reference#
This page details the options available when using Cosmos-Predict1 diffusion models.
Model Types#
There are four model types available for diffusion world generation:

- Text2World: World generation from text input
  - Models: Cosmos-Predict1-Diffusion-7B-Text2World, Cosmos-Predict1-Diffusion-14B-Text2World
  - Inference script: text2world.py (cosmos_predict1/diffusion/inference/text2world.py)
- Video2World: World generation from text and image/video input
  - Models: Cosmos-Predict1-Diffusion-7B-Video2World, Cosmos-Predict1-Diffusion-14B-Video2World
  - Inference script: video2world.py (cosmos_predict1/diffusion/inference/video2world.py)
- Text2World-Multiview: World generation with multiple views (e.g. different cameras on an autonomous vehicle) from text input
  - Model: Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview
  - Inference script: text2world_multiview.py (cosmos_predict1/diffusion/inference/text2world_multiview.py)
- Video2World-Multiview: World generation with multiple views (e.g. different cameras on an autonomous vehicle) from video input
  - Model: Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview
  - Inference script: video2world_multiview.py (cosmos_predict1/diffusion/inference/video2world_multiview.py)
Sample Commands#
This section contains sample commands for each model type, including single generation, single generation with model offloading, single generation with multi-GPU inference, and batch generation.
Text2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Text2World and Cosmos-Predict1-Diffusion-14B-Text2World model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B 14B --model_types Text2World
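The inference commands below expect the weights under the directory given by --checkpoint_dir, in a subdirectory named after the model passed to --diffusion_transformer_dir. A quick, optional sanity check (this assumes the default download layout; exact contents may differ):
# Optional check -- the inference commands below look for the weights in these directories
ls checkpoints/Cosmos-Predict1-7B-Text2World
ls checkpoints/Cosmos-Predict1-14B-Text2World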
Single Generation#
This command runs inference with the 7B model to generate a single video using a text prompt.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-7b
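The command assumes the PROMPT variable has already been set in the shell. For example, using a shortened version of the sample prompt from the Prompt Example section below:
# Example prompt (shortened from the Prompt Example section)
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints."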
Single Generation with Model Offloading#
This command runs inference with the 14B model using various offloading flags. Offloading should be used with low-memory GPUs or when running inference with the 14B model to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-14B-Text2World \
--prompt "${PROMPT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--video_save_name diffusion-text2world-14b
Single Generation with Multi-GPU Inference#
This command generates a single video using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/text2world.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos, one for each text prompt provided. This argument specifies the path to a JSONL file, which contains one prompt per line in the following format:
{"prompt": "prompt1"}
{"prompt": "prompt2"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--batch_input_path assets/diffusion/batch_inputs/text2world.jsonl \
--offload_prompt_upsampler \
--video_save_folder diffusion-text2world-7b-batch
Video2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Video2World and Cosmos-Predict1-Diffusion-14B-Video2World model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B 14B --model_types Video2World
Single Generation#
This command runs inference with the 7B model to generate a single video using the video2world_input0.jpg image. No text prompt is used.
Note
Since the inference input is an image file, the num_input_frames value is 1.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b
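To condition on a short video clip instead of a single image, pass a video file and set --num_input_frames to 9. A sketch with a placeholder input path and an arbitrary save name:
# Sketch only -- replace the placeholder path with a real video file
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path path/to/input_video.mp4 \
--num_input_frames 9 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b-from-video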
Single Generation with Model Offloading#
This command runs inference with the 14B model using various offloading flags. Offloading should be used with low-memory GPUs or when running inference with the 14B model to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-14B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--video_save_name diffusion-video2world-14b
Single Generation with Multi-GPU Inference#
This command generates a single video using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/video2world.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos, one for each image or video provided. This argument specifies the path to a JSONL file, which contains one video/image input per line in the following format:
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
Inference is performed as follows (with the number of input frames set to 9):
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--batch_input_path assets/diffusion/batch_inputs/video2world_ps.jsonl \
--num_input_frames 9 \
--offload_prompt_upsampler \
--video_save_folder diffusion-video2world-7b-batch
Batch Generation without Prompt Upsampling#
If prompt upsampling is disabled using the --disable_prompt_upsampler argument, then the JSONL file for batch generation must also include a text prompt for each image/video:
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
Inference is performed as follows (with the number of input frames set to 9):
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--batch_input_path assets/diffusion/batch_inputs/video2world_wo_ps.jsonl \
--num_input_frames 9 \
--disable_prompt_upsampler \
--video_save_folder diffusion-video2world-7b-batch-wo-ps
Text2World-Multiview#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Text2World-Sample-AV-Multiview
Multiview Prompts#
The Text2World-Multiview model requires the --prompt input argument, along with one or more of the following: --prompt_left, --prompt_right, --prompt_back, --prompt_back_left, --prompt_back_right.
The following example demonstrates these text prompts being set in the shell.
PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward. \
The video is taken from the perspective of a vehicle's dashboard camera, showing a straight road flanked by snow-covered trees and a clear sky. \
The road is mostly empty, with no visible traffic or pedestrians. \
The sun is setting, casting a warm glow on the horizon and creating long shadows on the snow. \
The trees are tall and leafless, with some coniferous trees interspersed among the bare deciduous trees. \
The snow on the ground appears undisturbed, suggesting a quiet and peaceful setting."
PROMPT_LEFT="The video is captured from a camera mounted on a car. The camera is facing to the left. \
The video captures a series of images from a moving vehicle, showcasing a winter scene with snow-covered ground and trees. \
The sky is a gradient of blue and orange hues, indicating either sunrise or sunset. \
The trees are tall and predominantly coniferous, with some deciduous trees as well. \
The snow appears undisturbed, suggesting a quiet, possibly early morning setting. \
There are no visible people or animals, and the road is clear of traffic. \
The video has a fisheye lens effect, which gives a wide-angle view of the surroundings."
PROMPT_RIGHT="The video is captured from a camera mounted on a car. The camera is facing to the right. \
The video captures a series of images taken from a moving vehicle, showcasing a winter scene with snow-covered ground and trees. \
The sky is a gradient of blue hues, indicating either dawn or dusk. \
The trees are predominantly coniferous, with some bare deciduous trees. \
The snow appears fresh and undisturbed, suggesting recent snowfall. \
There are no visible people or animals, and the environment is serene and untouched. \
The perspective changes as the vehicle moves, providing different angles of the same landscape."
PROMPT_BACK="The video is captured from a camera mounted on a car. The camera is facing backwards. \
The video captures a sequence of frames showing a road covered in snow, with tire tracks visible on the surface. \
The road is flanked by tall, leafless trees, and the sky is a gradient of pink and blue hues, indicating either sunrise or sunset. \
The lighting conditions suggest it is either early morning or late evening. \
There are no visible signs of people or animals, and the road appears to be in a rural or less populated area. \
The vehicles in the video are moving at a steady pace, and there are no visible traffic signs or markings that stand out."
PROMPT_BACK_LEFT="The video is captured from a camera mounted on a car. The camera is facing the rear left side."
PROMPT_BACK_RIGHT="The video is captured from a camera mounted on a car. The camera is facing the rear right side."
Single Generation#
This command generates a single multiview output using text prompts.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--video_save_name diffusion-text2world-multiview-7b
Single Generation with Model Offloading#
This command runs inference to generate a single multiview output using various offloading flags. Offloading should be used with low-memory GPUs or to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--video_save_name diffusion-text2world-multiview-7b
Single Generation with Multi-GPU Inference#
This command generates a single multiview output using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/text2world_multiview.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-multiview-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple multiview outputs, one for each set of text prompts provided. This argument specifies the path to a JSONL file, which contains one set of prompts per line in the following format:
{"prompt": "prompt1", "prompt_left": "prompt1_left", "prompt_right": "prompt1_right", "prompt_back": "prompt1_back", "prompt_back_left": "prompt1_back_left", "prompt_back_right": "prompt1_back_right"}
{"prompt": "prompt2", "prompt_left": "prompt2_left", "prompt_right": "prompt2_right", "prompt_back": "prompt2_back", "prompt_back_left": "prompt2_back_left", "prompt_back_right": "prompt2_back_right"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--batch_input_path assets/diffusion/batch_inputs/text2world_multiview.jsonl \
--offload_prompt_upsampler \
--video_save_folder diffusion-text2world-multiview-7b-batch
Video2World-Multiview#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Video2World-Sample-AV-Multiview model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Video2World-Multiview
Single Generation#
This command generates a single multiview output using a single video input.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--video_save_name diffusion-video2world-multiview-7b
Single Generation with Model Offloading#
This command runs inference to generate a single multiview output using various offloading flags. Offloading should be used with low-memory GPUs to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--video_save_name diffusion-video2world-multiview-7b
Single Generation with Multi-GPU Inference#
This command generates a single multiview output using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/video2world_multiview.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-multiview-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple multiview outputs, one for each visual input provided. This argument specifies the path to a JSONL file, which contains one visual_input value and an optional set of prompts per line in the following format:
{"prompt": "prompt1", "prompt_left": "prompt1_left", "prompt_right": "prompt1_right", "prompt_back": "prompt1_back", "prompt_back_left": "prompt1_back_left", "prompt_back_right": "prompt1_back_right", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "prompt_left": "prompt2_left", "prompt_right": "prompt2_right", "prompt_back": "prompt2_back", "prompt_back_left": "prompt2_back_left", "prompt_back_right": "prompt2_back_right", "visual_input": "path/to/video2.mp4"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--batch_input_path assets/diffusion/batch_inputs/video2world_multiview.jsonl \
--num_input_frames 9 \
--video_save_folder diffusion-video2world-multiview-7b-batch
Arguments#
This section describes the available arguments for diffusion inference scripts.
Common Parameters#
The following table lists parameters that are available for all diffusion inference scripts.
| Parameter | Description | Default |
|---|---|---|
| --checkpoint_dir | Directory containing model weights | “checkpoints” |
| --video_save_name | Output video filename for single video generation | “output” |
| --video_save_folder | Output directory for batch video generation | “outputs/” |
| --prompt | Text prompt for single video generation | None |
| --batch_input_path | Path to JSONL file for batch video generation. Required for batch video generation. | None |
| --disable_prompt_upsampler | Disable automatic prompt enhancement. If this argument is used with Video2World models, a text prompt is required. | False |
| --offload_diffusion_transformer | Offload DiT model after inference; used for low-memory GPUs | False |
| --offload_tokenizer | Offload VAE model after inference; used for low-memory GPUs | False |
| --offload_text_encoder_model | Offload text encoder after inference; used for low-memory GPUs | False |
| --offload_prompt_upsampler | Offload prompt upsampler after inference; used for low-memory GPUs | False |
| --offload_guardrail_models | Offload guardrail models after inference; used for low-memory GPUs | False |
| --diffusion_transformer_dir | Directory containing DiT weights | N/A (the default directory varies by inference script) |
Video2World-Specific Parameters#
| Parameter | Description | Default |
|---|---|---|
| --input_image_or_video_path | Input video/image path for single video generation. Required for single video generation. | None |
| --num_input_frames | Number of input video frames (1 or 9) | 1 |
Prompting Guidelines#
The input text prompt is the most important parameter under the user’s control when interacting with the model. Rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some guidelines for creating effective text prompts:
- Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.
- Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.
- Prompt upsampler limitations: The current version of the prompt upsampler may sometimes deviate from the original intent of your prompt, adding unwanted details. If this happens, you can disable the upsampler with the --disable_prompt_upsampler flag and edit your prompt manually. We recommend using prompts of around 120 words for optimal quality.
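A quick way to check that a prompt is close to the recommended length:
# Rough word count for the current prompt (the target is roughly 120 words)
echo "${PROMPT}" | wc -w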
Prompt Example#
The following example demonstrates an effective text prompt for physical AI video generation.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
Cosmos-Predict1-Prompt-Upsampler#
The prompt upsampler automatically expands brief prompts into more detailed descriptions (Text2World) or generates detailed prompts based on input images (Video2World).
Text2World#
When enabled (default), the upsampler will:
1. Ingest your input prompt.
2. Process it through a finetuned Mistral model to generate a more detailed description.
3. Use the expanded description for video generation.
This can generate better quality videos by providing more detailed context to the video generation model. To disable this feature, use the --disable_prompt_upsampler flag.
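If you want the model to see your prompt verbatim, a minimal sketch of a Text2World run with the upsampler disabled looks like this (only flags documented above are used; the save name is arbitrary):
# Sketch only -- run Text2World with the prompt upsampler disabled
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--disable_prompt_upsampler \
--video_save_name diffusion-text2world-7b-no-upsampler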
Video2World#
When enabled (default), the upsampler will:
1. Ingest your input image or video.
2. Process it through a Pixtral model to generate a detailed description.
3. Use the generated description for video generation.
Note
The Video2World prompt upsampler does not consider any user-provided text prompt. To disable this feature, use the --disable_prompt_upsampler flag.
Safety Features#
The Cosmos Predict1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces that appear in generated output are blurred by the guardrail.
For more information, refer to the Cosmos Guardrail page.