Predict2 Model Reference#

This page details the options available when using Cosmos-Predict2 diffusion models.

Model Types#

There are two model types available for world generation:

  • Text2Image: World generation of images from text input

    • Models: Cosmos-Predict2-2B-Text2Image, Cosmos-Predict2-14B-Text2Image

    • Inference script: text2image.py (/cosmos_predict2/diffusion/inference/text2image.py)

  • Video2World: World generation of videos from text and image/video input

    • Models: Cosmos-Predict2-2B-Video2World, Cosmos-Predict2-14B-Video2World

    • Inference script: video2world.py (/cosmos_predict2/diffusion/inference/video2world.py)

Sample Commands#

This section contains sample commands for each model type, including single generation and single generation with model offloading.

Text2Image#

The Text2Image inference script is located at cosmos_predict2/diffusion/inference/text2image.py. It requires the text input argument --prompt.

Downloading the Model Weights#

Use the following command to download the Cosmos-Predict2-2B-Text2Image and Cosmos-Predict2-14B-Text2Image model weights from Hugging Face:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Text2Image --checkpoint_dir checkpoints

Setting Prompts#

The Text2Image examples below utilize both $PROMPT and $NEGATIVE_PROMPT inputs. Here is an example of each:

PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple \
double-decker buses are parked under the glow of overhead lights, with a central bus labeled \"87D\" facing \
forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights \
brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in \
the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its \
position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban \
night scene."

NEGATIVE_PROMPT="The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

Refer to the Prompting Guidelines for more information on writing an effective text prompt for Predict2 model inference.

Single Generation with 2B Model#

This command runs inference with the 2B model to generate a single image using both a text prompt and a negative text prompt.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/text2image.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict2-2B-Text2Image \
    --offload_prompt_upsampler \
    --disable_prompt_upsampler \
    --disable_guardrail \
    --prompt "${PROMPT}" \
    --negative_prompt "${NEGATIVE_PROMPT}" \
    --image_save_name text2image_2b

The output is saved as outputs/text2image_2b.jpg, along with the corresponding prompt at outputs/text2image_2b.txt.

Single Generation with 14B Model#

This command runs inference with the 14B model to generate a single image using both a text prompt and a negative text prompt.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/text2image.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict2-14B-Text2Image \
    --offload_prompt_upsampler \
    --disable_prompt_upsampler \
    --disable_guardrail \
    --prompt "${PROMPT}" \
    --negative_prompt "${NEGATIVE_PROMPT}" \
    --image_save_name text2image_14b

The output is saved as outputs/text2image_14b.jpg, along with the corresponding prompt at outputs/text2image_14b.txt.

Video2World#

Downloading the Model Weights#

Use the following command to download the Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World model weights from Hugging Face:

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python scripts/download_diffusion_checkpoints.py --model_sizes 2B 14B --model_types Video2World --checkpoint_dir checkpoints

Single Generation with 2B Model#

This command runs inference with the 2B model to generate a single video from an input image or video (replace the input path with your own file). No text prompt is used; the prompt upsampler generates a description from the input.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict2-2B-Video2World \
    --input_image_or_video_path path/to/input.jpg \
    --num_input_frames 1 \
    --offload_prompt_upsampler \
    --disable_guardrail \
    --video_save_name video2world_2b

Single Generation with 14B Model#

This command runs inference with the 14B model to generate a single video from an input image or video (replace the input path with your own file). No text prompt is used; the prompt upsampler generates a description from the input.

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict2-14B-Video2World \
    --input_image_or_video_path path/to/input.jpg \
    --num_input_frames 1 \
    --offload_prompt_upsampler \
    --disable_guardrail \
    --video_save_name video2world_14b

Arguments#

This section describes the available arguments for diffusion inference scripts.

Common Parameters#

The following table lists parameters that are available for all diffusion inference scripts.

| Parameter | Description | Default |
| --- | --- | --- |
| --checkpoint_dir | Directory containing model weights | "checkpoints" |
| --video_save_name | Output video filename for single video generation | "output" |
| --video_save_folder | Output directory for batch video generation | "outputs/" |
| --prompt | Text prompt for single video generation | None |
| --batch_input_path | Path to a JSONL file for batch video generation. Required for batch video generation. | None |
| --disable_prompt_upsampler | Disable automatic prompt enhancement. If this argument is used with Video2World models, a text prompt is required. | False |
| --offload_diffusion_transformer | Offload the DiT model after inference; used for low-memory GPUs | False |
| --offload_tokenizer | Offload the VAE model after inference; used for low-memory GPUs | False |
| --offload_text_encoder_model | Offload the text encoder after inference; used for low-memory GPUs | False |
| --offload_prompt_upsampler | Offload the prompt upsampler after inference; used for low-memory GPUs | False |
| --offload_guardrail_models | Offload guardrail models after inference; used for low-memory GPUs | False |
| --diffusion_transformer_dir | Directory containing DiT weights | N/A (the default varies by inference script) |
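The --batch_input_path parameter above takes a JSONL file with one generation request per line. Below is a minimal sketch of creating and sanity-checking such a file; the per-line schema (a single "prompt" field) is an assumption based on common batch formats, not confirmed by this page, so check the inference script for the exact fields it expects.

```shell
# Hypothetical batch input file for --batch_input_path.
# Schema (one JSON object with a "prompt" field per line) is an assumption.
cat > batch_prompts.jsonl <<'EOF'
{"prompt": "A robot arm sorts packages on a conveyor belt in a bright warehouse."}
{"prompt": "A forklift reverses slowly between tall shelving racks at night."}
EOF

# Sanity-check: every line must be valid JSON and contain a prompt field.
python3 - <<'EOF'
import json
with open("batch_prompts.jsonl") as f:
    for line in f:
        record = json.loads(line)
        assert "prompt" in record
print("batch file ok")
EOF
```

The resulting file would then be passed via --batch_input_path batch_prompts.jsonl, with --video_save_folder controlling where the batch outputs are written.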

Video2World-Specific Parameters#

| Parameter | Description | Default |
| --- | --- | --- |
| --input_image_or_video_path | Input image/video path for single video generation. Required for single video generation. | None |
| --num_input_frames | Number of input video frames to condition on (1 or 9) | 1 |
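The two Video2World parameters above can be combined as in the following sketch, which mirrors the other sample commands on this page. The input path is a placeholder, and the command assumes the Video2World checkpoints have already been downloaded; it is an illustration of the flags, not a verified invocation.

```shell
# Hypothetical invocation: condition the 2B Video2World model on 9 frames
# of an input video (the path below is a placeholder).
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict2/diffusion/inference/video2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict2-2B-Video2World \
    --input_image_or_video_path path/to/input_video.mp4 \
    --num_input_frames 9 \
    --offload_prompt_upsampler \
    --video_save_name video2world_2b_9frames
```

Conditioning on 9 frames gives the model more motion context than a single frame, at the cost of requiring a video (rather than a still image) as input.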

Prompting Guidelines#

The input text prompt is the most important parameter under the user’s control when interacting with the model. Rich and descriptive prompts can positively impact the output quality of the model, whereas short and poorly detailed prompts can lead to subpar video generation. Here are some guidelines for creating effective text prompts:

  1. Describe a single, captivating scene: Focus on a single scene to prevent the model from generating videos with unnecessary shot changes.

  2. Limit camera control instructions: The model doesn’t handle prompts involving camera control well, as this feature is still under development.

  3. Prompt upsampler limitations: The current version of the prompt upsampler may sometimes deviate from the original intent of your prompt, adding unwanted details. If this happens, you can disable the upsampler with the --disable_prompt_upsampler flag and edit your prompt manually. We recommend using prompts of around 120 words for optimal quality.

Prompt Example#

The following example demonstrates an effective text prompt for physical AI video generation.

PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

Prompt Upsampler#

The prompt upsampler automatically expands brief prompts into more detailed descriptions (Text2Image) or generates detailed prompts based on input images or videos (Video2World).

Text2Image#

When enabled (default), the upsampler will:

  1. Ingest your input prompt.

  2. Process it through a finetuned Mistral model to generate a more detailed description.

  3. Use the expanded description for image generation.

This can produce better quality images by providing more detailed context to the generation model. To disable this feature, use the --disable_prompt_upsampler flag.

Video2World#

When enabled (default), the upsampler will:

  1. Ingest your input image or video.

  2. Process it through a Pixtral model to generate a detailed description.

  3. Use the generated description for video generation.

Note

The Video2World prompt upsampler does not consider any user-provided text prompt. To disable this feature, use the --disable_prompt_upsampler flag.

Safety Features#

The Cosmos-Predict2 models include a built-in safety guardrail system, enabled by default. Generating human faces is not allowed; any faces in the output are blurred by the guardrail.

For more information, refer to the Cosmos Guardrail page.