Diffusion Model Reference#
This page details the options available when using Cosmos-Predict1 diffusion models.
Model Types#
There are four model types available for diffusion world generation:

- Text2World: World generation from text input
  - Models: Cosmos-Predict1-Diffusion-7B-Text2World, Cosmos-Predict1-Diffusion-14B-Text2World
  - Inference script: text2world.py (cosmos_predict1/diffusion/inference/text2world.py)
- Video2World: World generation from text and image/video input
  - Models: Cosmos-Predict1-Diffusion-7B-Video2World, Cosmos-Predict1-Diffusion-14B-Video2World
  - Inference script: video2world.py (cosmos_predict1/diffusion/inference/video2world.py)
- Text2World-Multiview: World generation with multiple views (e.g. different cameras on an autonomous vehicle) from text input
  - Model: Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview
  - Inference script: text2world_multiview.py (cosmos_predict1/diffusion/inference/text2world_multiview.py)
- Video2World-Multiview: World generation with multiple views (e.g. different cameras on an autonomous vehicle) from video input
  - Model: Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview
  - Inference script: video2world_multiview.py (cosmos_predict1/diffusion/inference/video2world_multiview.py)
Sample Commands#
This section contains sample commands for each model type, including single generation, single generation with model offloading, single generation with multi-GPU inference, and batch generation.
Text2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Text2World and Cosmos-Predict1-Diffusion-14B-Text2World model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B 14B --model_types Text2World
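The inference commands below expect the weights under the directory given by --checkpoint_dir, in a subdirectory named after the model passed to --diffusion_transformer_dir. A quick, optional sanity check (this assumes the default download layout; exact contents may differ):
# Optional check -- the inference commands below look for the weights in these directories
ls checkpoints/Cosmos-Predict1-7B-Text2World
ls checkpoints/Cosmos-Predict1-14B-Text2World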
Single Generation#
This command runs inference with the 7B model to generate a single video using a text prompt.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-7b
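The command assumes the PROMPT variable has already been set in the shell. For example, using a shortened version of the sample prompt from the Prompt Example section below:
# Example prompt (shortened from the Prompt Example section)
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints."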
Single Generation with Model Offloading#
This command runs inference with the 14B model using various offloading flags. Offloading should be used with low-memory GPUs or when running inference with the 14B model to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-14B-Text2World \
--prompt "${PROMPT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--video_save_name diffusion-text2world-14b
Single Generation with Multi-GPU Inference#
This command generates a single video using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/text2world.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos, one for each text prompt provided. This argument specifies the path to a JSONL file, which contains one prompt per line in the following format:
{"prompt": "prompt1"}
{"prompt": "prompt2"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--batch_input_path assets/diffusion/batch_inputs/text2world.jsonl \
--offload_prompt_upsampler \
--video_save_folder diffusion-text2world-7b-batch
Video2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Video2World and Cosmos-Predict1-Diffusion-14B-Video2World model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B 14B --model_types Video2World
Single Generation#
This command runs inference with the 7B model to generate a single video using the video2world_input0.jpg image. No text prompt is used.
Note
Since the inference input is an image file, the num_input_frames value is 1.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b
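To condition on a short video clip instead of a single image, pass a video file and set --num_input_frames to 9. A sketch with a placeholder input path and an arbitrary save name:
# Sketch only -- replace the placeholder path with a real video file
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path path/to/input_video.mp4 \
--num_input_frames 9 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b-from-video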
Single Generation with Model Offloading#
This command runs inference with the 14B model using various offloading flags. Offloading should be used with low-memory GPUs or when running inference with the 14B model to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-14B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--video_save_name diffusion-video2world-14b
Single Generation with Multi-GPU Inference#
This command generates a single video using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/video2world.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--input_image_or_video_path assets/diffusion/video2world_input0.jpg \
--num_input_frames 1 \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-7b
Batch Generation#
The --batch_input_path argument allows you to generate multiple videos, one for each image or video provided. This argument specifies the path to a JSONL file, which contains one video/image input per line in the following format:
{"visual_input": "path/to/video1.mp4"}
{"visual_input": "path/to/video2.mp4"}
Inference is performed as follows (with the number of input frames set to 9):
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--batch_input_path assets/diffusion/batch_inputs/video2world_ps.jsonl \
--num_input_frames 9 \
--offload_prompt_upsampler \
--video_save_folder diffusion-video2world-7b-batch
Batch Generation without Prompt Upsampling#
If prompt upsampling is disabled using the --disable_prompt_upsampler argument, then the JSONL file for batch generation must also include a text prompt for each image/video:
{"prompt": "prompt1", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "visual_input": "path/to/video2.mp4"}
Inference is performed as follows (with the number of input frames set to 9):
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World \
--batch_input_path assets/diffusion/batch_inputs/video2world_wo_ps.jsonl \
--num_input_frames 9 \
--disable_prompt_upsampler \
--video_save_folder diffusion-video2world-7b-batch-wo-ps
Text2World-Multiview#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Text2World-Sample-AV-Multiview
Multiview Prompts#
The Text2World-Multiview model requires the --prompt input argument, along with one or more of the following: --prompt_left, --prompt_right, --prompt_back, --prompt_back_left, --prompt_back_right.
The following example demonstrates these text prompts being set in the shell.
PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward. \
The video is taken from the perspective of a vehicle's dashboard camera, showing a straight road flanked by snow-covered trees and a clear sky. \
The road is mostly empty, with no visible traffic or pedestrians. \
The sun is setting, casting a warm glow on the horizon and creating long shadows on the snow. \
The trees are tall and leafless, with some coniferous trees interspersed among the bare deciduous trees. \
The snow on the ground appears undisturbed, suggesting a quiet and peaceful setting."
PROMPT_LEFT="The video is captured from a camera mounted on a car. The camera is facing to the left. \
The video captures a series of images from a moving vehicle, showcasing a winter scene with snow-covered ground and trees. \
The sky is a gradient of blue and orange hues, indicating either sunrise or sunset. \
The trees are tall and predominantly coniferous, with some deciduous trees as well. \
The snow appears undisturbed, suggesting a quiet, possibly early morning setting. \
There are no visible people or animals, and the road is clear of traffic. \
The video has a fisheye lens effect, which gives a wide-angle view of the surroundings."
PROMPT_RIGHT="The video is captured from a camera mounted on a car. The camera is facing to the right. \
The video captures a series of images taken from a moving vehicle, showcasing a winter scene with snow-covered ground and trees. \
The sky is a gradient of blue hues, indicating either dawn or dusk. \
The trees are predominantly coniferous, with some bare deciduous trees. \
The snow appears fresh and undisturbed, suggesting recent snowfall. \
There are no visible people or animals, and the environment is serene and untouched. \
The perspective changes as the vehicle moves, providing different angles of the same landscape."
PROMPT_BACK="The video is captured from a camera mounted on a car. The camera is facing backwards. \
The video captures a sequence of frames showing a road covered in snow, with tire tracks visible on the surface. \
The road is flanked by tall, leafless trees, and the sky is a gradient of pink and blue hues, indicating either sunrise or sunset. \
The lighting conditions suggest it is either early morning or late evening. \
There are no visible signs of people or animals, and the road appears to be in a rural or less populated area. \
The vehicles in the video are moving at a steady pace, and there are no visible traffic signs or markings that stand out."
PROMPT_BACK_LEFT="The video is captured from a camera mounted on a car. The camera is facing the rear left side."
PROMPT_BACK_RIGHT="The video is captured from a camera mounted on a car. The camera is facing the rear right side."
Single Generation#
This command generates a single multiview output using text prompts.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--video_save_name diffusion-text2world-multiview-7b
Single Generation with Model Offloading#
This command runs inference to generate a single multiview output using various offloading flags. Offloading should be used with low-memory GPUs or to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--video_save_name diffusion-text2world-multiview-7b
Single Generation with Multi-GPU Inference#
This command generates a single multiview output using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/text2world_multiview.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-multiview-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple multiview outputs, one for each set of text prompts provided. This argument specifies the path to a JSONL file, which contains one set of prompts per line in the following format:
{"prompt": "prompt1", "prompt_left": "prompt1_left", "prompt_right": "prompt1_right", "prompt_back": "prompt1_back", "prompt_back_left": "prompt1_back_left", "prompt_back_right": "prompt1_back_right"}
{"prompt": "prompt2", "prompt_left": "prompt2_left", "prompt_right": "prompt2_right", "prompt_back": "prompt2_back", "prompt_back_left": "prompt2_back_left", "prompt_back_right": "prompt2_back_right"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World-Sample-AV-Multiview \
--batch_input_path assets/diffusion/batch_inputs/text2world_multiview.jsonl \
--offload_prompt_upsampler \
--video_save_folder diffusion-text2world-multiview-7b-batch
Video2World-Multiview#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict1-Diffusion-7B-Video2World-Sample-AV-Multiview model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 7B --model_types Video2World-Multiview
Single Generation#
This command generates a single multiview output using a single video input.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--video_save_name diffusion-video2world-multiview-7b
Single Generation with Model Offloading#
This command runs inference to generate a single multiview output using various offloading flags. Offloading should be used with low-memory GPUs to prevent out-of-memory issues.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_tokenizer \
--offload_diffusion_transformer \
--offload_text_encoder_model \
--video_save_name diffusion-video2world-multiview-7b
Single Generation with Multi-GPU Inference#
This command generates a single multiview output using 8 GPUs.
NUM_GPUS=8
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=${NUM_GPUS} cosmos_predict1/diffusion/inference/video2world_multiview.py \
--num_gpus ${NUM_GPUS} \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--input_image_or_video_path assets/diffusion/video2world_multiview_input1.mp4 \
--num_input_frames 1 \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-video2world-multiview-7b-8gpu
Batch Generation#
The --batch_input_path argument allows you to generate multiple multiview outputs, one for each visual input provided. This argument specifies the path to a JSONL file, which contains one visual_input value and an optional set of prompts per line in the following format:
{"prompt": "prompt1", "prompt_left": "prompt1_left", "prompt_right": "prompt1_right", "prompt_back": "prompt1_back", "prompt_back_left": "prompt1_back_left", "prompt_back_right": "prompt1_back_right", "visual_input": "path/to/video1.mp4"}
{"prompt": "prompt2", "prompt_left": "prompt2_left", "prompt_right": "prompt2_right", "prompt_back": "prompt2_back", "prompt_back_left": "prompt2_back_left", "prompt_back_right": "prompt2_back_right", "visual_input": "path/to/video2.mp4"}
Inference is performed as follows:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Multiview \
--batch_input_path assets/diffusion/batch_inputs/video2world_multiview.jsonl \
--num_input_frames 9 \
--video_save_folder diffusion-video2world-multiview-7b-batch
Arguments#
This section describes the available arguments for diffusion inference scripts.
Common Parameters#
The following table lists parameters that are available for all diffusion inference scripts.
| Parameter | Description | Default |
|---|---|---|
| --checkpoint_dir | Directory containing model weights | “checkpoints” |
| --video_save_name | Output video filename for single video generation | “output” |
| --video_save_folder | Output directory for batch video generation | “outputs/” |
| --prompt | Text prompt for single video generation | None |
| --batch_input_path | Path to JSONL file for batch video generation. Required for batch video generation. | None |
| --disable_prompt_upsampler | Disable automatic prompt enhancement. If this argument is used with Video2World models, a text prompt is required. | False |
| --offload_diffusion_transformer | Offload DiT model after inference; used for low-memory GPUs | False |
| --offload_tokenizer | Offload VAE model after inference; used for low-memory GPUs | False |
| --offload_text_encoder_model | Offload text encoder after inference; used for low-memory GPUs | False |
| --offload_prompt_upsampler | Offload prompt upsampler after inference; used for low-memory GPUs | False |
| --offload_guardrail_models | Offload guardrail models after inference; used for low-memory GPUs | False |
| --diffusion_transformer_dir | Directory containing DiT weights | N/A (the default directory varies by inference script) |
Video2World-Specific Parameters#
| Parameter | Description | Default |
|---|---|---|
| --input_image_or_video_path | Input video/image path for single video generation. Required for single video generation. | None |
| --num_input_frames | Number of input video frames (1 or 9) | 1 |
Prompting Guidelines#
The input text prompt is the most important parameter under the user’s control when interacting with the model. Rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some guidelines for creating effective text prompts:
- Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.
- Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.
- Prompt upsampler limitations: The current version of the prompt upsampler may sometimes deviate from the original intent of your prompt, adding unwanted details. If this happens, you can disable the upsampler with the --disable_prompt_upsampler flag and edit your prompt manually. We recommend using prompts of around 120 words for optimal quality.
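A quick way to check that a prompt is close to the recommended length:
# Rough word count for the current prompt (the target is roughly 120 words)
echo "${PROMPT}" | wc -w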
Prompt Example#
The following example demonstrates an effective text prompt for physical AI video generation.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
Cosmos-Predict1-Prompt-Upsampler#
The prompt upsampler automatically expands brief prompts into more detailed descriptions (Text2World) or generates detailed prompts based on input images (Video2World).
Text2World#
When enabled (default), the upsampler will:
1. Ingest your input prompt.
2. Process it through a finetuned Mistral model to generate a more detailed description.
3. Use the expanded description for video generation.
This can generate better quality videos by providing more detailed context to the video generation model. To disable this feature, use the --disable_prompt_upsampler flag.
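If you want the model to see your prompt verbatim, a minimal sketch of a Text2World run with the upsampler disabled looks like this (only flags documented above are used; the save name is arbitrary):
# Sketch only -- run Text2World with the prompt upsampler disabled
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--disable_prompt_upsampler \
--video_save_name diffusion-text2world-7b-no-upsampler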
Video2World#
When enabled (default), the upsampler will:
1. Ingest your input image or video.
2. Process it through a Pixtral model to generate a detailed description.
3. Use the generated description for video generation.
Note
The Video2World prompt upsampler does not consider any user-provided text prompt. To disable this feature, use the --disable_prompt_upsampler flag.
Safety Features#
The Cosmos Predict1 models use a built-in safety guardrail system that cannot be disabled. Generating human faces is not allowed; any faces that appear in generated output are blurred by the guardrail.
For more information, refer to the Cosmos Guardrail page.