Predict2 Model Reference#
This page details the options available when using Cosmos-Predict2 diffusion models.
Model Types#
There are two model types available for world generation:

- **Text2Image**: World generation of images from text input
  - Models: `Cosmos-Predict2-2B-Text2Image`, `Cosmos-Predict2-14B-Text2Image`
  - Inference script: `text2image.py` (`/cosmos_predict2/diffusion/inference/text2image.py`)
- **Video2World**: World generation of videos from text and image/video input
  - Models: `Cosmos-Predict2-2B-Video2World`, `Cosmos-Predict2-14B-Video2World`
  - Inference script: `video2world.py` (`/cosmos_predict2/diffusion/inference/video2world.py`)
Sample Commands#
This section contains sample commands for each model type, including single generation and single generation with model offloading.
Text2Image#
The Text2Image inference script is located at `cosmos_predict2/diffusion/inference/text2image.py`. It requires the text input argument `--prompt`.
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict2-2B-Text2Image
and Cosmos-Predict2-14B-Text2Image
model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Text2Image --checkpoint_dir checkpoints
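After the download finishes, a quick shell check can confirm that both model directories are in place. The `checkpoints/<model-name>` layout is an assumption based on the `--checkpoint_dir` and `--diffusion_transformer_dir` arguments used throughout this page:

```shell
# check_weights: verify that each expected model directory exists under a
# checkpoint root. Returns non-zero if any directory is missing.
check_weights() {
  root="$1"; shift
  missing=0
  for model in "$@"; do
    if [ ! -d "${root}/${model}" ]; then
      echo "missing: ${root}/${model}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

check_weights checkpoints Cosmos-Predict2-2B-Text2Image Cosmos-Predict2-14B-Text2Image \
  && echo "all Text2Image weights present" \
  || echo "some weights missing; re-run the download command"
```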
Setting Prompts#
The Text2Image examples below utilize both `$PROMPT` and `$NEGATIVE_PROMPT` inputs. Here is an example of each:
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple \
double-decker buses are parked under the glow of overhead lights, with a central bus labeled \"87D\" facing \
forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights \
brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in \
the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its \
position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban \
night scene."
NEGATIVE_PROMPT="The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
Refer to the Prompting Guidelines for more information on writing an effective text prompt for Predict2 model inference.
Single Generation with 2B Model#
This command runs inference with the 2B model to generate a single image using both a text prompt and a negative text prompt.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/text2image.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict2-2B-Text2Image \
--offload_prompt_upsampler \
--disable_prompt_upsampler \
--disable_guardrail \
--prompt "${PROMPT}" \
--negative_prompt "${NEGATIVE_PROMPT}" \
--image_save_name text2image_2b
The output is saved as `outputs/text2image_2b.jpg`, along with the corresponding prompt at `outputs/text2image_2b.txt`.
Single Generation with 14B Model#
This command runs inference with the 14B model to generate a single image using both a text prompt and a negative text prompt.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/text2image.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict2-14B-Text2Image \
--offload_prompt_upsampler \
--disable_prompt_upsampler \
--disable_guardrail \
--prompt "${PROMPT}" \
--negative_prompt "${NEGATIVE_PROMPT}" \
--image_save_name text2image_14b
The output is saved as `outputs/text2image_14b.jpg`, along with the corresponding prompt at `outputs/text2image_14b.txt`.
Video2World#
Downloading the Model Weights#
Use the following command to download the Cosmos-Predict2-2B-Video2World
and Cosmos-Predict2-14B-Video2World
model weights from Hugging Face:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Video2World --checkpoint_dir checkpoints
Single Generation with 2B Model#
This command runs inference with the 2B model to generate a single video from a text prompt and an input image or video. Supply your own input path via `--input_image_or_video_path`.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict2-2B-Video2World \
--input_image_or_video_path path/to/input_image_or_video.jpg \
--offload_prompt_upsampler \
--disable_prompt_upsampler \
--disable_guardrail \
--prompt "${PROMPT}" \
--negative_prompt "${NEGATIVE_PROMPT}" \
--video_save_name video2world_2b
Single Generation with 14B Model#
This command runs inference with the 14B model to generate a single video from a text prompt and an input image or video. Supply your own input path via `--input_image_or_video_path`.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict2-14B-Video2World \
--input_image_or_video_path path/to/input_image_or_video.jpg \
--offload_prompt_upsampler \
--disable_prompt_upsampler \
--disable_guardrail \
--prompt "${PROMPT}" \
--negative_prompt "${NEGATIVE_PROMPT}" \
--video_save_name video2world_14b
Arguments#
This section describes the available arguments for diffusion inference scripts.
Common Parameters#
The following table lists parameters that are available for all diffusion inference scripts.
| Parameter | Description | Default |
|---|---|---|
| `--checkpoint_dir` | Directory containing model weights | "checkpoints" |
| `--video_save_name` | Output video filename for single video generation | "output" |
| `--video_save_folder` | Output directory for batch video generation | "outputs/" |
| `--prompt` | Text prompt for single video generation | None |
| `--batch_input_path` | Path to JSONL file for batch video generation. Required for batch video generation. | None |
| `--disable_prompt_upsampler` | Disable automatic prompt enhancement. If this argument is used with Video2World models, a text prompt is required. | False |
| `--offload_diffusion_transformer` | Offload DiT model after inference; used for low-memory GPUs | False |
| `--offload_tokenizer` | Offload VAE model after inference; used for low-memory GPUs | False |
| `--offload_text_encoder_model` | Offload text encoder after inference; used for low-memory GPUs | False |
| `--offload_prompt_upsampler` | Offload prompt upsampler after inference; used for low-memory GPUs | False |
| `--offload_guardrail_models` | Offload guardrail models after inference; used for low-memory GPUs | False |
| `--diffusion_transformer_dir` | Directory containing DiT weights | N/A (the default directory varies by inference script) |
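As a sketch of how a batch input file for `--batch_input_path` might look: the JSONL schema below (one JSON object per line with a `prompt` key) is an assumption, not a documented format, so verify it against the inference script before relying on it.

```shell
# Write a hypothetical batch file: one JSON object per line, one generation
# request per line.
cat > batch_prompts.jsonl <<'EOF'
{"prompt": "A robotic arm sorts packages on a conveyor belt in a bright warehouse."}
{"prompt": "An autonomous forklift navigates between rows of stacked pallets."}
EOF

# Count the requests in the batch.
wc -l < batch_prompts.jsonl
```

The resulting path would then be passed as `--batch_input_path batch_prompts.jsonl` in place of a single `--prompt`.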
Video2World-Specific Parameters#
| Parameter | Description | Default |
|---|---|---|
| `--input_image_or_video_path` | Input video/image path for single video generation. Required for single video generation. | None |
| `--num_input_frames` | Number of input video frames (1 or 9) | 1 |
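Since a still image can only supply a single conditioning frame, one way to choose `--num_input_frames` is by input type. The extension-based heuristic below is illustrative only:

```shell
# pick_num_input_frames: crude heuristic mapping the input file type to the
# allowed --num_input_frames values (1 for a still image, 9 for a video).
pick_num_input_frames() {
  case "${1##*.}" in
    jpg|jpeg|png) echo 1 ;;
    *)            echo 9 ;;
  esac
}

pick_num_input_frames robot.jpg    # still image -> 1
pick_num_input_frames factory.mp4  # video -> 9
```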
Prompting Guidelines#
The input text prompt is the most important parameter under the user’s control when interacting with the model. Rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some guidelines for creating effective text prompts:
Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.
Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.
Prompt upsampler limitations: The current version of the prompt upsampler may sometimes deviate from the original intent of your prompt, adding unwanted details. If this happens, you can disable the upsampler with the `--disable_prompt_upsampler` flag and edit your prompt manually. We recommend using prompts of around 120 words for optimal quality.
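To stay near the ~120-word recommendation when hand-editing prompts, a quick word count helps. This helper is just a convenience sketch, not part of the inference tooling:

```shell
# prompt_words: count whitespace-separated words in a prompt string.
prompt_words() {
  printf '%s' "$1" | wc -w | tr -d ' '
}

PROMPT="A sleek humanoid robot stands in a vast warehouse."
echo "prompt length: $(prompt_words "$PROMPT") words"
```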
Prompt Example#
The following example demonstrates an effective text prompt for physical AI video generation.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
Prompt Upsampler#
The prompt upsampler automatically expands brief prompts into more detailed descriptions (Text2Image) or generates detailed prompts based on input images (Video2World).
Text2Image#
When enabled (default), the upsampler will:
1. Ingest your input prompt.
2. Process it through a finetuned Mistral model to generate a more detailed description.
3. Use the expanded description for generation.
This can produce better-quality output by providing more detailed context to the generation model. To disable this feature, use the `--disable_prompt_upsampler` flag.
Video2World#
When enabled (default), the upsampler will:
1. Ingest your input image or video.
2. Process it through a Pixtral model to generate a detailed description.
3. Use the generated description for video generation.
Note
The Video2World prompt upsampler does not consider any user-provided text prompt. To disable this feature, use the `--disable_prompt_upsampler` flag.
Safety Features#
The Cosmos-Predict2 models use a built-in safety guardrail system. Generating human faces is not permitted; any faces that appear in the output are blurred by the guardrail.
For more information, refer to the Cosmos Guardrail page.