Predict2 Model Reference#
This page details the options available when using Cosmos-Predict2 diffusion models.
Model Types#
There are three options available for world generation:
Text2Image: World generation of images from text input
Video2World: World generation of videos from text and image/video input
Text2World: Combines the Text2Image and Video2World models to generate videos directly from text prompts.
Text2Image#
Cosmos-Predict2 provides two models for text-to-image generation: Cosmos-Predict2-2B-Text2Image and Cosmos-Predict2-14B-Text2Image. These models can transform natural language descriptions into high-quality images through progressive diffusion guided by the text prompt.
The inference script is examples/text2image.py. It requires the input argument --prompt (text input).
To see the complete list of available arguments, run:
python -m examples.text2image --help
Download Commands#
Cosmos-Predict2-2B-Text2Image:
python -m scripts.download_checkpoints --model_types text2image --model_sizes 2B
Cosmos-Predict2-14B-Text2Image:
python -m scripts.download_checkpoints --model_types text2image --model_sizes 14B
Examples#
Single Image Generation#
This is a basic example for running inference on the 2B model with a single prompt.
The output is saved to outputs/text2image_2b.jpg.
# Set the input prompt
PROMPT="A well-worn broom sweeps across a dusty wooden floor, its bristles gathering crumbs and flecks of debris in swift, rhythmic strokes. Dust motes dance in the sunbeams filtering through the window, glowing momentarily before settling. The quiet swish of straw brushing wood is interrupted only by the occasional creak of old floorboards. With each pass, the floor grows cleaner, restoring a sense of quiet order to the humble room."
# Run text2image generation
python -m examples.text2image \
--prompt "${PROMPT}" \
--model_size 2B \
--save_path outputs/text2image_2b.jpg
The 14B model can be run similarly by changing the model size parameter.
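For example, the following sketch runs the same prompt through the 14B checkpoint; the output path outputs/text2image_14b.jpg is an arbitrary choice and assumes the 14B checkpoint has been downloaded:
# Run text2image generation with the 14B model
python -m examples.text2image \
--prompt "${PROMPT}" \
--model_size 14B \
--save_path outputs/text2image_14b.jpg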
Batch Image Generation#
For generating multiple images with different prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
prompt: The text prompt describing the desired image (required)
output_image: The path where the generated image should be saved (required)
An example can be found in assets/text2image/batch_example.json:
[
{
"prompt": "A well-worn broom sweeps across a dusty wooden floor, its bristles gathering crumbs and flecks of debris in swift, rhythmic strokes. Dust motes dance in the sunbeams filtering through the window, glowing momentarily before settling. The quiet swish of straw brushing wood is interrupted only by the occasional creak of old floorboards. With each pass, the floor grows cleaner, restoring a sense of quiet order to the humble room.",
"output_image": "output/sweeping-broom-sunlit-floor.jpg"
},
{
"prompt": "A laundry machine whirs to life, tumbling colorful clothes behind the foggy glass door. Suds begin to form in a frothy dance, clinging to fabric as the drum spins. The gentle thud of shifting clothes creates a steady rhythm, like a heartbeat of the home. Outside the machine, a quiet calm fills the room, anticipation building for the softness and warmth of freshly laundered garments.",
"output_image": "output/laundry-machine-spinning-clothes.jpg"
},
{
"prompt": "A robotic arm tightens a bolt beneath the hood of a car, its tool head rotating with practiced torque. The metal-on-metal sound clicks into place, and the arm pauses briefly before retracting with a soft hydraulic hiss. Overhead lights reflect off the glossy vehicle surface, while scattered tools and screens blink in the background—a garage scene reimagined through the lens of precision engineering.",
"output_image": "output/robotic-arm-car-assembly.jpg"
}
]
Specify the input via the --batch_input_json argument:
# Run batch text2image generation
python -m examples.text2image \
--model_size 2B \
--batch_input_json assets/text2image/batch_example.json
This will generate three separate images according to the prompts specified in the JSON file, with each output saved to its corresponding path.
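To author your own batch file, any JSON following the schema above works. The sketch below writes a minimal single-entry batch to a hypothetical my_batch.json and runs it; the prompt and output path are placeholders:
# Write a custom batch file following the 'prompt'/'output_image' schema
cat > my_batch.json << 'EOF'
[
  {
    "prompt": "A red bicycle leaning against a brick wall in soft morning light.",
    "output_image": "output/red-bicycle.jpg"
  }
]
EOF
# Run batch text2image generation on the custom file
python -m examples.text2image \
--model_size 2B \
--batch_input_json my_batch.json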
API Documentation#
The text2image.py script supports the following command-line arguments:
Input and output parameters:
--prompt: Text prompt describing the image to generate (default: predefined example prompt)
--negative_prompt: Text describing what to avoid in the generated image (default: empty)
--save_path: Path to save the generated image (default: "generated_image.jpg")
--batch_input_json: Path to JSON file containing batch inputs, where each entry should have 'prompt' and 'output_image' fields
Model selection:
--model_size: Size of the model to use (choices: "2B", "14B", default: "2B")
Performance optimization:
--seed: Random seed for reproducible results (default: 0)
--use_cuda_graphs: Use CUDA Graphs for inference acceleration
Content safety:
--disable_guardrail: Disable guardrail checks on prompts (by default, guardrails are enabled to filter harmful content)
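These arguments can be combined in a single invocation. The following is a hedged sketch that adds a negative prompt, fixes the seed for reproducibility, and enables CUDA Graphs; the flag values are illustrative, not recommendations:
# Text2Image with a negative prompt, a fixed seed, and CUDA Graphs enabled
python -m examples.text2image \
--model_size 2B \
--prompt "${PROMPT}" \
--negative_prompt "blurry, distorted proportions, text artifacts" \
--seed 42 \
--use_cuda_graphs \
--save_path outputs/text2image_2b_seed42.jpg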
Prompt Engineering Tips#
For best results with Cosmos models, create detailed prompts that emphasize physical realism, natural laws, and real-world behaviors. Describe specific objects, materials, lighting conditions, and spatial relationships while maintaining logical consistency throughout the scene.
Incorporate photography terminology like composition, lighting setups, and camera settings. Use concrete terms like “natural lighting” or “wide-angle lens” rather than abstract descriptions, unless intentionally aiming for surrealism. Include negative prompts to explicitly specify undesired elements.
The more grounded a prompt is in real-world physics and natural phenomena, the more physically plausible and realistic the generated image will be.
Video2World#
Cosmos-Predict2 provides two models for generating videos from a combination of text and visual inputs: Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World. These models can transform a still image or video clip into a longer, animated sequence guided by the text description.
The inference script is located at examples/video2world.py. It requires the following input arguments:
--input_path: input image or video
--prompt: text prompt
By default, the checkpoints downloaded in the Quick Start Guide section are for 720P and 16FPS. If you instead want 480P and 10FPS, for example, you need to download the corresponding checkpoint and pass --fps 10 --resolution 480.
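For example, assuming the 480P/10FPS Video2World checkpoint has already been downloaded, the inference command looks like this (the output path is an arbitrary choice):
# Run video2world generation at 480P and 10FPS
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--fps 10 \
--resolution 480 \
--save_path output/video2world_2b_480p_10fps.mp4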
For a complete list of available arguments and options:
python -m examples.video2world --help
Examples#
Single Video Generation#
This is a basic example for running inference on the 2B model with a single image.
The output is saved to output/video2world_2b.mp4.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene."
# Run video2world generation
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b.mp4
The 14B model can be run similarly by changing the model size parameter. For GPUs with lower memory limits, it may also make sense to offload the guardrail and prompt refiner models.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene."
# Run video2world generation
python -m examples.video2world \
--model_size 14B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_14b.mp4 \
--offload_guardrail \
--offload_prompt_refiner
Batch Video Generation#
For generating multiple videos with different inputs and prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
input_video: The path to the input image or video (required)
prompt: The text prompt describing the desired video (required)
output_video: The path where the generated video should be saved (required)
An example can be found in assets/video2world/batch_example.json:
[
{
"input_video": "assets/video2world/input0.jpg",
"prompt": "A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene.",
"output_video": "output/bus-terminal-night-movement.mp4"
},
{
"input_video": "assets/video2world/input1.jpg",
"prompt": "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow advance of traffic through the frosty city corridor.",
"output_video": "output/snowy-intersection-traffic.mp4"
},
{
"input_video": "assets/video2world/input2.jpg",
"prompt": "In the later moments of the video, the female worker in the front, dressed in a white coat and hairnet, performs a repetitive yet precise task. She scoops golden granular material from a wide jar and steadily pours it into the next empty glass bottle on the conveyor belt. Her hand moves with practiced control as she aligns the scoop over each container, ensuring an even fill. The sequence highlights her focused attention and consistent motion, capturing the shift from preparation to active material handling as the production line advances bottle by bottle.",
"output_video": "output/factory-worker-bottle-filling.mp4"
}
]
Specify the input via the --batch_input_json argument:
# Run batch video2world generation
python -m examples.video2world \
--model_size 2B \
--batch_input_json assets/video2world/batch_example.json
This will generate three separate videos according to the inputs and prompts specified in the JSON file, with each output saved to its corresponding path.
Multi-Frame Video Conditioning#
Video2World models support two types of conditioning on visual input:
Single-frame conditioning (default): Uses 1 frame from an image or video for conditioning
Multi-frame conditioning: Uses the last 5 consecutive frames from a video for enhanced temporal consistency
Using multiple frames as conditioning input can provide better temporal coherence in the generated video by giving the model more context about the motion present in the original sequence.
Multi-frame conditioning is particularly effective when:
Preservation of specific motion patterns from the input video is desired
The input contains complex or distinctive movements that should be maintained
Stronger visual coherence between the input and output videos is needed
Extending or transforming an existing video clip is the goal
For 5-frame conditioning, the input must be a video file, not a still image. Specify the number of conditional frames with the --num_conditional_frames 5 argument:
# Set the input prompt
PROMPT="A point-of-view video shot from inside a vehicle, capturing a quiet suburban street bathed in bright sunlight. The road is lined with parked cars on both sides, and buildings, likely residential or small businesses, are visible across the street. A STOP sign is prominently displayed near the center of the intersection. The sky is clear and blue, with the sun shining brightly overhead, casting long shadows on the pavement. On the left side of the street, several vehicles are parked, including a van with some text on its side. Across the street, a white van is parked near two trash bins, and a red SUV is parked further down. The buildings on either side have a mix of architectural styles, with some featuring flat roofs and others with sloped roofs. Overhead, numerous power lines stretch across the street, and a few trees are visible in the background, partially obscuring the view of the buildings. As the video progresses, a white car truck makes a right turn into the adjacent opposite lane. The ego vehicle slows down and comes to a stop, waiting until the car fully enters the opposite lane before proceeding. The pedestrian keeps walking on the street. The other vehicles remain stationary, parked along the curb. The scene remains static otherwise, with no significant changes in the environment or additional objects entering the frame. By the end of the video, the white car truck has moved out of the camera view, the rest of the scene remains largely unchanged, maintaining the same composition and lighting conditions as the beginning."
# Run video2world generation with 5-frame conditioning
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input3.mp4 \
--num_conditional_frames 5 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_5frames.mp4
Note that when using multi-frame conditioning in batch mode, all input files must be videos, not images.
Notes on multi-frame conditioning:
Multi-frame conditioning requires video inputs with at least 5 frames
The model will extract the last 5 frames from the input video
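To confirm that an input clip meets the 5-frame requirement before launching a run, one option is to count its frames with ffprobe. This is a minimal sketch assuming FFmpeg is installed; it is not part of the Cosmos-Predict2 tooling:
# Count decoded frames in the input video (must be at least 5 for multi-frame conditioning)
ffprobe -v error -select_streams v:0 -count_frames \
-show_entries stream=nb_read_frames -of default=nokey=1:noprint_wrappers=1 \
assets/video2world/input3.mp4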
Using the Prompt Refiner#
The Cosmos-Predict2 models include a prompt refiner model using Cosmos-Reason1-7B that automatically enhances short prompts with additional details. This is particularly useful when:
Brief prompts need to be expanded into more detailed videos
Additional descriptive elements would improve video quality
Detailed prompt writing expertise is limited
The following example uses a short prompt that will be automatically expanded by the prompt refiner:
# Set the input short prompt
PROMPT="A nighttime city bus terminal."
# Run video2world generation
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_with_prompt_refiner.mp4
The prompt refiner is enabled by default. To disable it, use the --disable_prompt_refiner flag:
# Run video2world generation without prompt refinement
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--disable_prompt_refiner \
--save_path output/video2world_2b_without_prompt_refiner.mp4
This behavior is controlled by the prompt refiner settings in the model configuration:
prompt_refiner_config=CosmosReason1Config(
checkpoint_dir="checkpoints/nvidia/Cosmos-Reason1-7B",
offload_model_to_cpu=True,
enabled=True, # Controls whether the refiner is used
)
Multi-GPU Inference#
For faster inference on high-resolution videos, Video2World supports context parallelism, which distributes the video frames across multiple GPUs. This can significantly reduce the inference time, especially for the larger 14B model.
To enable multi-GPU inference, set the NUM_GPUS environment variable and use torchrun to launch the script. Both --nproc_per_node and --num_gpus should be set to the same value:
# Set the number of GPUs to use
export NUM_GPUS=8
# Run video2world generation with context parallelism using torchrun
torchrun --nproc_per_node=${NUM_GPUS} examples/video2world.py \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS}
This distributes the computation across multiple GPUs, with each GPU processing a subset of the video frames. The final video is automatically combined from the results of all GPUs.
Note: Both parameters are required: --nproc_per_node tells PyTorch how many processes to launch, while --num_gpus tells the model how to distribute the workload. Using the same environment variable for both ensures they are synchronized.
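As a convenience, the GPU count can be derived automatically so both values always match. This is a minimal sketch assuming nvidia-smi is available on the host:
# Derive the GPU count from nvidia-smi and reuse it for both settings
export NUM_GPUS=$(nvidia-smi -L | wc -l)
torchrun --nproc_per_node=${NUM_GPUS} examples/video2world.py \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS}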
Important considerations for multi-GPU inference:
The number of GPUs should ideally be a divisor of the number of frames in the video
All GPUs should have the same model capacity and memory
For best results, use context parallelism with the 14B model where memory constraints are significant
Context parallelism works with both single-frame and multi-frame conditioning
Requires NCCL support and proper GPU interconnect for efficient communication
Rejection Sampling for Quality Improvement#
Video quality can be further improved by generating multiple variations and selecting the best one based on automatic quality assessment using Cosmos-Reason1-7B as the critic model. This approach, known as rejection sampling, can significantly enhance the visual quality of the generated videos.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. Multiple double-decker buses are parked under overhead lights, with a central bus labeled '87D' facing forward."
# Run video2world generation with rejection sampling
python -m examples.video2world_bestofn \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--num_generations 5 \
--num_critic_trials 3 \
--save_path output/rejection_sampling_demo
This command:
Generates 5 different videos from the same input and prompt
Evaluates each video 3 times using the Cosmos-Reason1 critic model
Saves all videos with quality scores in their filenames (from 000 to 100)
Creates HTML reports with detailed analysis for each video
The highest-scored video represents the best generation from the batch. For batch processing with existing videos:
# Run critic on existing videos without generation
python -m examples.video2world_bestofn \
--skip_generation \
--save_path output/my_existing_videos
API Documentation#
The video2world.py script supports the following command-line arguments:
Model selection:
--model_size: Size of the model to use (choices: "2B", "14B", default: "2B")
--dit_path: Custom path to the DiT model checkpoint for post-trained models (default: uses the standard checkpoint path based on model_size)
--fps: FPS of the model to use for video-to-world generation (choices: 10, 16, default: 16)
--resolution: Resolution of the model to use for video-to-world generation (choices: 480, 720, default: 720)
By default, a 720P, 16FPS model is used for the selected model_size. If you want to use another configuration, download the corresponding checkpoint and pass --fps, --resolution, or both.
Input parameters:
--prompt: Text prompt describing the video to generate (default: empty string)
--negative_prompt: Text describing what to avoid in the generated video (default: predefined negative prompt)
--input_path: Path to input image or video for conditioning (default: "assets/video2world/input0.jpg")
--num_conditional_frames: Number of frames to condition on (choices: 1, 5, default: 1)
Output parameters:
--save_path: Path to save the generated video (default: "output/generated_video.mp4")
Generation parameters:
--guidance: Classifier-free guidance scale (default: 7.0)
--seed: Random seed for reproducibility (default: 0)
--num_gpus: Number of GPUs to use for context parallel inference in the video generation phase (default: 1)
Multi-GPU inference:
For multi-GPU inference, use torchrun --nproc_per_node=$NUM_GPUS examples/video2world.py ...
Both --nproc_per_node (for torchrun) and --num_gpus (for the script) must be set to the same value
Setting the NUM_GPUS environment variable and using it for both parameters ensures they stay synchronized
Batch processing:
--batch_input_json: Path to JSON file containing batch inputs, where each entry should have 'input_video', 'prompt', and 'output_video' fields
Content safety and controls:
--disable_guardrail: Disable guardrail checks on prompts (by default, guardrails are enabled to filter harmful content)
--disable_prompt_refiner: Disable the prompt refiner that enhances short prompts (by default, the prompt refiner is enabled)
GPU memory controls:
--offload_guardrail: Offload the guardrail to CPU to save GPU memory
--offload_prompt_refiner: Offload the prompt refiner to CPU to save GPU memory
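As a hedged example of combining the arguments above, the following invocation sets an explicit guidance scale and seed, adds a negative prompt, and offloads the auxiliary models to CPU; the values shown are illustrative, not recommendations:
# Video2World with explicit generation and memory-control arguments
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--negative_prompt "jittery motion, flickering, unrealistic physics" \
--guidance 7.0 \
--seed 42 \
--offload_guardrail \
--offload_prompt_refiner \
--save_path output/video2world_2b_tuned.mp4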
Specialized Scripts#
In addition to the main video2world.py script, there are specialized variants for specific use cases:
Rejection Sampling (video2world_bestofn.py)#
The video2world_bestofn.py script extends the standard Video2World capabilities with rejection sampling to improve video quality. It supports all the standard Video2World parameters plus:
--num_generations: Number of different videos to generate from the same input (default: 5)
--num_critic_trials: Number of times to evaluate each video with the critic model (default: 3)
--skip_generation: Flag to run the critic only on existing videos without generation
--save_path: Directory to save the generated videos and HTML reports (default: "output/best-of-n")
For more details, see the Rejection Sampling for Quality Improvement section.
Prompt Engineering Tips#
For best results with Video2World models, create detailed prompts that emphasize:
Physical realism: Describe how objects interact with the environment following natural laws of physics
Motion details: Specify how elements in the scene should move over time
Visual consistency: Maintain logical relationships between objects throughout the video
Cinematography terminology: Use terms like “tracking shot,” “pan,” or “zoom” to guide camera movement
Temporal progression: Describe how the scene evolves (e.g., “gradually,” “suddenly,” “transitions to”)
Include negative prompts to explicitly specify undesired elements, such as jittery motion, visual artifacts, or unrealistic physics.
The more grounded a prompt is in real-world physics and natural temporal progression, the more physically plausible and realistic the generated video will be.
Example of a good prompt:
A tranquil lakeside at sunset. Golden light reflects off the calm water surface, gradually rippling as a gentle breeze passes through. Tall pine trees along the shore sway slightly, their shadows lengthening across the water. A small wooden dock extends into the lake, where a rowboat gently bobs with the subtle movements of the water.
This prompt includes both static scene elements and suggestions for motion that the Video2World model can interpret and animate.
Text2World#
The Text2World pipeline combines the Text2Image and Video2World models to generate videos directly from text prompts in a two-phase process:
Text2Image generation: The text prompt is processed by the Text2Image model to create a single still image that serves as the first frame.
Video2World generation: This image (single-frame conditioning) is then fed into the Video2World model along with the original text prompt to animate it into a dynamic video.
The temporary image created in the first phase is automatically cleaned up after the process completes successfully.
The inference script is located at examples/text2world.py. It requires the input argument --prompt (text input).
For a complete list of available arguments and options:
python -m examples.text2world --help
Examples#
Single Video Generation#
This is a basic example for running inference on the 2B model with a text prompt.
The output is saved to output/text2world_2b.mp4.
# Set the input prompt
PROMPT="An autonomous welding robot arm operating inside a modern automotive factory, sparks flying as it welds a car frame with precision under bright overhead lights."
# Run text2world generation
python -m examples.text2world \
--model_size 2B \
--prompt "${PROMPT}" \
--save_path output/text2world_2b.mp4
The 14B model can be run similarly by changing the model size parameter.
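For reference, a minimal sketch of the same generation with the 14B model, assuming the corresponding checkpoints have been downloaded; the output path is an arbitrary choice:
# Run text2world generation with the 14B model
python -m examples.text2world \
--model_size 14B \
--prompt "${PROMPT}" \
--save_path output/text2world_14b.mp4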
Batch Video Generation#
For generating multiple videos with different prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
prompt: The text prompt describing the desired video (required)
output_video: The path where the generated video should be saved (required)
An example can be found in assets/text2world/batch_example.json:
[
{
"prompt": "An autonomous welding robot arm operating inside a modern automotive factory, sparks flying as it welds a car frame with precision under bright overhead lights.",
"output_video": "output/welding-robot-factory.mp4"
},
{
"prompt": "A wooden sailboat gently moves across a tranquil lake, its sail billowing slightly with the breeze. The water ripples around the hull as the boat glides forward. Mountains are visible in the background under a clear blue sky with scattered clouds.",
"output_video": "output/sailboat-mountain-lake.mp4"
},
{
"prompt": "A modern kitchen scene where a stand mixer is actively blending cake batter in a glass bowl. The beater rotates steadily, incorporating ingredients as the mixture swirls. A light dusting of flour is visible on the countertop, and sunshine streams in through a nearby window.",
"output_video": "output/kitchen-mixer-baking.mp4"
}
]
Specify the input via the --batch_input_json argument:
# Run batch text2world generation
python -m examples.text2world \
--model_size 2B \
--batch_input_json assets/text2world/batch_example.json
This will generate three separate videos according to the prompts specified in the JSON file, with each output saved to its corresponding path.
Multi-GPU Inference#
For faster inference, particularly with the 14B model, Text2World supports context parallelism for the video generation phase. This distributes the video frames across multiple GPUs, significantly reducing the time required for video generation.
To enable multi-GPU inference, set the NUM_GPUS environment variable and use torchrun to launch the script:
# Set the number of GPUs to use
export NUM_GPUS=8
# Run text2world generation with context parallelism using torchrun
torchrun --nproc_per_node=${NUM_GPUS} examples/text2world.py \
--model_size 2B \
--prompt "${PROMPT}" \
--save_path output/text2world_2b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS}
This distributes the computation across multiple GPUs for the video generation phase (Video2World), with each GPU processing a subset of the video frames. The image generation phase (Text2Image) still runs on a single GPU.
Note: Both parameters are required: --nproc_per_node tells PyTorch how many processes to launch, while --num_gpus tells the model how to distribute the workload. Using the same environment variable for both ensures they are synchronized.
Important considerations for multi-GPU inference:
The number of GPUs should ideally be a divisor of the number of frames in the generated video
All GPUs should have the same model capacity and memory
For best results, use context parallelism with the 14B model where memory constraints are significant
Requires NCCL support and proper GPU interconnect for efficient communication
This feature is particularly useful for generating higher quality videos with the 14B model, which would otherwise require significant processing time on a single GPU.
API Documentation#
The text2world.py script supports the following command-line arguments:
Model selection:
--model_size: Size of the model to use (choices: "2B", "14B", default: "2B")
Input parameters:
--prompt: Text prompt describing the video to generate (default: predefined example prompt)
--negative_prompt: Text describing what to avoid in the generated video (default: predefined negative prompt)
Output parameters:
--save_path: Path to save the generated video (default: "output/generated_video.mp4")
Generation parameters:
--guidance: Classifier-free guidance scale (default: 7.0)
--seed: Random seed for reproducibility (default: 0)
--use_cuda_graphs: Use CUDA Graphs for inference acceleration (for the Text2Image phase)
--num_gpus: Number of GPUs to use for context parallel inference in the video generation phase (default: 1)
Multi-GPU inference:
For multi-GPU inference, use torchrun --nproc_per_node=$NUM_GPUS examples/text2world.py ...
Both --nproc_per_node (for torchrun) and --num_gpus (for the script) must be set to the same value
Setting the NUM_GPUS environment variable and using it for both parameters ensures they stay synchronized
Batch processing:
--batch_input_json: Path to JSON file containing batch inputs, where each entry should have 'prompt' and 'output_video' fields
Content safety and controls:
--disable_guardrail: Disable guardrail checks on prompts (by default, guardrails are enabled to filter harmful content)
--disable_prompt_refiner: Disable the prompt refiner that enhances short prompts (by default, the prompt refiner is enabled)
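As a hedged example of combining these arguments, the sketch below fixes the seed and sets an explicit guidance scale alongside a negative prompt; the values are illustrative, not recommendations:
# Text2World with a fixed seed, explicit guidance, and a negative prompt
python -m examples.text2world \
--model_size 2B \
--prompt "${PROMPT}" \
--negative_prompt "jittery motion, visual artifacts" \
--guidance 7.0 \
--seed 42 \
--save_path output/text2world_2b_seed42.mp4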
Prompt Engineering Tips#
For best results with Text2World models, prompts should describe both what should appear in the scene and how things should move or change over time:
Scene description: Include details about objects, lighting, materials, and spatial relationships
Motion description: Describe how elements should move, interact, or change during the video
Temporal progression: Use words like “gradually,” “suddenly,” or “transitions to” to guide how the scene evolves
Physical dynamics: Describe physical effects like “water splashing,” “leaves rustling,” or “smoke billowing”
Cinematography terms: Include camera movements like “panning across,” “zooming in,” or “tracking shot”
Example of a good prompt:
A tranquil lakeside at sunset. Golden light reflects off the calm water surface, gradually rippling as a gentle breeze passes through. Tall pine trees along the shore sway slightly, their shadows lengthening across the water. A small wooden dock extends into the lake, where a rowboat gently bobs with the subtle movements of the water.
This prompt includes both static scene elements and suggestions for motion that the Video2World model can interpret and animate.