Predict2 Model Reference#
This page details the options available when using Cosmos-Predict2 diffusion models.
Model Types#
There are three model types available for world generation:
Text2Image: World generation of images from text input
Video2World: World generation of videos from text and image/video input
Text2World: Combines the Text2Image and Video2World models to generate videos directly from text prompts.
Text2Image runs on a single GPU and does not support multi-GPU inference. For multi-GPU video generation, use the Text2World or Video2World pipelines.
Text2Image#
Cosmos-Predict2 provides two models for text-to-image generation: Cosmos-Predict2-2B-Text2Image and Cosmos-Predict2-14B-Text2Image. These models can transform natural language descriptions into high-quality images through progressive diffusion guided by the text prompt.
The inference script is examples/text2image.py. It requires the input argument --prompt (text input). For a complete list of available arguments, run the following:
python -m examples.text2image --help
Download Commands#
Cosmos-Predict2-2B-Text2Image:
python -m scripts.download_checkpoints --model_types text2image --model_sizes 2B
Cosmos-Predict2-14B-Text2Image:
python -m scripts.download_checkpoints --model_types text2image --model_sizes 14B
Examples#
Single Image Generation#
This is a basic example for running inference on the 2B model with a single prompt.
The output is saved to output/text2image_2b.jpg.
# Set the input prompt
PROMPT="A well-worn broom sweeps across a dusty wooden floor, its bristles gathering crumbs and flecks of debris in swift, rhythmic strokes. Dust motes dance in the sunbeams filtering through the window, glowing momentarily before settling. The quiet swish of straw brushing wood is interrupted only by the occasional creak of old floorboards. With each pass, the floor grows cleaner, restoring a sense of quiet order to the humble room."
# Run text2image generation
python -m examples.text2image \
--prompt "${PROMPT}" \
--model_size 2B \
--save_path output/text2image_2b.jpg
To run the 14B model, change the model_size parameter to 14B.
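For example, the 14B run mirrors the 2B command above (assuming the 14B checkpoint has been downloaded as shown in the Download Commands section):
# Run text2image generation with the 14B model
python -m examples.text2image \
--prompt "${PROMPT}" \
--model_size 14B \
--save_path output/text2image_14b.jpg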
Batch Image Generation#
For generating multiple images with different prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
prompt: The text prompt describing the desired image (required)
output_image: The path where the generated image should be saved (required)
An example can be found in assets/text2image/batch_example.json:
[
{
"prompt": "A well-worn broom sweeps across a dusty wooden floor, its bristles gathering crumbs and flecks of debris in swift, rhythmic strokes. Dust motes dance in the sunbeams filtering through the window, glowing momentarily before settling. The quiet swish of straw brushing wood is interrupted only by the occasional creak of old floorboards. With each pass, the floor grows cleaner, restoring a sense of quiet order to the humble room.",
"output_image": "output/sweeping-broom-sunlit-floor.jpg"
},
{
"prompt": "A laundry machine whirs to life, tumbling colorful clothes behind the foggy glass door. Suds begin to form in a frothy dance, clinging to fabric as the drum spins. The gentle thud of shifting clothes creates a steady rhythm, like a heartbeat of the home. Outside the machine, a quiet calm fills the room, anticipation building for the softness and warmth of freshly laundered garments.",
"output_image": "output/laundry-machine-spinning-clothes.jpg"
},
{
"prompt": "A robotic arm tightens a bolt beneath the hood of a car, its tool head rotating with practiced torque. The metal-on-metal sound clicks into place, and the arm pauses briefly before retracting with a soft hydraulic hiss. Overhead lights reflect off the glossy vehicle surface, while scattered tools and screens blink in the background—a garage scene reimagined through the lens of precision engineering.",
"output_image": "output/robotic-arm-car-assembly.jpg"
}
]
Specify the input via the --batch_input_json argument:
# Run batch text2image generation
python -m examples.text2image \
--model_size 2B \
--batch_input_json assets/text2image/batch_example.json
This will generate three separate images according to the prompts specified in the JSON file, with each output saved to its corresponding path.
API Documentation#
The text2image.py script supports the following command-line arguments:
Input and output parameters:
--prompt: Text prompt describing the image to generate (default: predefined example prompt)
--negative_prompt: Text describing what to avoid in the generated image (default: empty)
--aspect_ratio: Aspect ratio of the generated output, including “1:1”, “4:3”, “3:4”, “16:9”, “9:16” (default: “16:9”)
--save_path: Path to save the generated image (default: “output/generated_image.jpg”)
--batch_input_json: Path to the JSON file containing batch inputs. In the file, each entry should have ‘prompt’ and ‘output_image’ fields.
Model selection:
--model_size: Size of the model to use (choices: “2B”, “14B”, default: “2B”)
--dit_path: Custom path to the DiT model checkpoint for post-trained models (default: uses standard checkpoint path based on model_size)
--load_ema: Use EMA weights from the post-trained DiT model checkpoint for generation.
Generation parameters:
--seed: Random seed for reproducible results (default: 0)
Performance optimization:
--use_cuda_graphs: Use CUDA Graphs for inference acceleration.
--natten: Use sparse attention variants built with NATTEN.
--benchmark: Run in benchmark mode to measure average generation time.
Content safety:
--disable_guardrail: Disable guardrail checks on prompts (by default, guardrails are enabled to filter harmful content).
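As an illustration, several of these arguments can be combined in a single run. The following is a sketch using only the flags documented above; the prompt, seed, and output path are placeholders:
python -m examples.text2image \
--prompt "A forklift moving pallets through a brightly lit warehouse aisle" \
--negative_prompt "blurry, distorted, low quality" \
--aspect_ratio "4:3" \
--seed 42 \
--model_size 2B \
--save_path output/warehouse_forklift.jpg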
Prompt Engineering Tips#
For best results with Cosmos models, create detailed prompts that emphasize physical realism, natural laws, and real-world behaviors. Describe specific objects, materials, lighting conditions, and spatial relationships while maintaining logical consistency throughout the scene.
Incorporate photography terminology like composition, lighting setups, and camera settings. Use concrete terms like “natural lighting” or “wide-angle lens” rather than abstract descriptions, unless intentionally aiming for surrealism. Include negative prompts to explicitly specify undesired elements.
The more grounded a prompt is in real-world physics and natural phenomena, the more physically plausible and realistic the generated image will be.
Video2World#
Cosmos-Predict2 provides two models for generating videos from a combination of text and visual inputs: Cosmos-Predict2-2B-Video2World and Cosmos-Predict2-14B-Video2World. These models can transform a still image or video clip into a longer, animated sequence guided by the text description.
The inference script is located at examples/video2world.py.
It requires input arguments:
--input_path: input image or video
--prompt: text prompt
For a complete list of available arguments and options, run the following:
python -m examples.video2world --help
Download Commands#
Cosmos-Predict2-2B-Video2World:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B
Cosmos-Predict2-14B-Video2World:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 14B
Notes
By default, the checkpoints downloaded using the Quick Start Guide are for 720P and 16FPS. If you instead want to change the behavior to 480P and 10FPS, for example, you need to download the corresponding checkpoint and pass --fps 10 --resolution 480.
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B --resolution 480 --fps 10
The model_types, model_sizes, fps, and resolution parameters support multiple values. For example, to download {2,14}B Video2World models with {10,16} FPS at {480,720}P (i.e. eight models in total), use the following command:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B --resolution 480 720 --fps 10 16
Pass --checkpoint_dir <path_to_checkpoint> to control where to store the checkpoints.
Add the --verify_md5 flag to verify the MD5 checksums of downloaded files. If the checksums don’t match, models will be automatically re-downloaded. (A combined example of these two options follows these notes.)
To download models with sparse attention, run the download script with the --natten option:
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B 14B --resolution 720 --fps 10 16 --natten
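For reference, the storage and verification options above can be combined in a single download command. This is a sketch using only the documented flags; the checkpoint directory is a placeholder:
# Download the 2B Video2World checkpoint to a custom directory and verify checksums
python -m scripts.download_checkpoints --model_types video2world --model_sizes 2B --checkpoint_dir /path/to/checkpoints --verify_md5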
Examples#
Single Video Generation#
This is a basic example for running inference on the 2B model with a single image.
The output is saved to output/video2world_2b.mp4.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene."
# Run video2world generation
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b.mp4
Change the model_size parameter to 14B to run the 14B model. The 14B model requires significant GPU memory, so you may want to offload the guardrail and prompt refiner models using the --offload_guardrail and --offload_prompt_refiner flags if you are using a GPU with limited memory.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene."
# Run video2world generation
python -m examples.video2world \
--model_size 14B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_14b.mp4 \
--offload_guardrail \
--offload_prompt_refiner
Batch Video Generation#
For generating multiple videos with different inputs and prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
input_video: The path to the input image or video (required)
prompt: The text prompt describing the desired video (required)
output_video: The path where the generated video should be saved (required)
An example can be found in assets/video2world/batch_example.json:
[
{
"input_video": "assets/video2world/input0.jpg",
"prompt": "A nighttime city bus terminal gradually shifts from stillness to subtle movement. At first, multiple double-decker buses are parked under the glow of overhead lights, with a central bus labeled '87D' facing forward and stationary. As the video progresses, the bus in the middle moves ahead slowly, its headlights brightening the surrounding area and casting reflections onto adjacent vehicles. The motion creates space in the lineup, signaling activity within the otherwise quiet station. It then comes to a smooth stop, resuming its position in line. Overhead signage in Chinese characters remains illuminated, enhancing the vibrant, urban night scene.",
"output_video": "output/bus-terminal-night-movement.mp4"
},
{
"input_video": "assets/video2world/input1.jpg",
"prompt": "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow advance of traffic through the frosty city corridor.",
"output_video": "output/snowy-intersection-traffic.mp4"
},
{
"input_video": "assets/video2world/input2.jpg",
"prompt": "In the later moments of the video, the female worker in the front, dressed in a white coat and hairnet, performs a repetitive yet precise task. She scoops golden granular material from a wide jar and steadily pours it into the next empty glass bottle on the conveyor belt. Her hand moves with practiced control as she aligns the scoop over each container, ensuring an even fill. The sequence highlights her focused attention and consistent motion, capturing the shift from preparation to active material handling as the production line advances bottle by bottle.",
"output_video": "output/factory-worker-bottle-filling.mp4"
}
]
Specify the input via the --batch_input_json argument:
# Run batch video2world generation
python -m examples.video2world \
--model_size 2B \
--batch_input_json assets/video2world/batch_example.json
This will generate three separate videos according to the inputs and prompts specified in the JSON file, with each output saved to its corresponding path.
Multi-Frame Video Conditioning#
Video2World models support two types of conditioning on visual input:
Single-frame conditioning (default): Uses 1 frame from an image or video for conditioning
Multi-frame conditioning: Uses the last 5 consecutive frames from a video for enhanced temporal consistency
Using multiple frames as conditioning input can provide better temporal coherence in the generated video by giving the model more context about the motion present in the original sequence.
Multi-frame conditioning is particularly effective when:
Preservation of specific motion patterns from the input video is desired
The input contains complex or distinctive movements that should be maintained
Stronger visual coherence between the input and output videos is needed
Extending or transforming an existing video clip is the goal
For 5-frame conditioning, the input must be a video file, not a still image. Specify the number of conditional frames with the --num_conditional_frames 5 argument:
# Set the input prompt
PROMPT="A point-of-view video shot from inside a vehicle, capturing a quiet suburban street bathed in bright sunlight. The road is lined with parked cars on both sides, and buildings, likely residential or small businesses, are visible across the street. A STOP sign is prominently displayed near the center of the intersection. The sky is clear and blue, with the sun shining brightly overhead, casting long shadows on the pavement. On the left side of the street, several vehicles are parked, including a van with some text on its side. Across the street, a white van is parked near two trash bins, and a red SUV is parked further down. The buildings on either side have a mix of architectural styles, with some featuring flat roofs and others with sloped roofs. Overhead, numerous power lines stretch across the street, and a few trees are visible in the background, partially obscuring the view of the buildings. As the video progresses, a white car truck makes a right turn into the adjacent opposite lane. The ego vehicle slows down and comes to a stop, waiting until the car fully enters the opposite lane before proceeding. The pedestrian keeps walking on the street. The other vehicles remain stationary, parked along the curb. The scene remains static otherwise, with no significant changes in the environment or additional objects entering the frame. By the end of the video, the white car truck has moved out of the camera view, the rest of the scene remains largely unchanged, maintaining the same composition and lighting conditions as the beginning."
# Run video2world generation with 5-frame conditioning
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input3.mp4 \
--num_conditional_frames 5 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_5frames.mp4
Note that when using multi-frame conditioning in batch mode, all input files must be videos, not images.
Notes on multi-frame conditioning:
Multi-frame conditioning requires video inputs with at least 5 frames (see the frame-count check after these notes)
The model will extract the last 5 frames from the input video
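If you are unsure whether an input clip meets the 5-frame requirement, you can count its frames with ffprobe. This assumes FFmpeg is installed on your system and is not part of the Cosmos-Predict2 scripts:
# Count decoded video frames in the input (should be at least 5 for multi-frame conditioning)
ffprobe -v error -count_frames -select_streams v:0 -show_entries stream=nb_read_frames -of csv=p=0 assets/video2world/input3.mp4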
Using the Prompt Refiner#
The Cosmos-Predict2 models include a prompt refiner model using Cosmos-Reason1-7B that automatically enhances short prompts with additional details. This is particularly useful when:
Brief prompts need to be expanded into more detailed videos
Additional descriptive elements would improve video quality
Detailed prompt writing expertise is limited
The following example uses a short prompt that will be automatically expanded by the prompt refiner:
# Set the input short prompt
PROMPT="A nighttime city bus terminal."
# Run video2world generation
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_with_prompt_refiner.mp4
The prompt refiner is enabled by default. To disable it, use the --disable_prompt_refiner flag:
# Run video2world generation without prompt refinement
python -m examples.video2world \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--disable_prompt_refiner \
--save_path output/video2world_2b_without_prompt_refiner.mp4
This behavior is controlled by the prompt_refiner_config entry in the model’s configuration:
prompt_refiner_config=CosmosReason1Config(
checkpoint_dir="checkpoints/nvidia/Cosmos-Reason1-7B",
offload_model_to_cpu=True,
enabled=True, # Controls whether the refiner is used
)
Multi-GPU Inference#
For faster inference on high-resolution videos, Video2World supports context parallelism, which distributes the video frames across multiple GPUs. This can significantly reduce the inference time, especially for the larger 14B model.
To enable multi-GPU inference, set the NUM_GPUS environment variable and use torchrun to launch the script. Both --nproc_per_node and --num_gpus should be set to the same value:
# Set the number of GPUs to use
export NUM_GPUS=8
# Run video2world generation with context parallelism using torchrun
torchrun --nproc_per_node=${NUM_GPUS} examples/video2world.py \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--save_path output/video2world_2b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS}
This distributes the computation across multiple GPUs, with each GPU processing a subset of the video frames. The final video is automatically combined from the results of all GPUs.
Note
The 14B model requires significant GPU memory, so you may want to offload the guardrail and prompt refiner models using the --offload_guardrail and --offload_prompt_refiner flags if you are using a GPU with limited memory.
Note
Both parameters are required: --nproc_per_node tells PyTorch how many processes to launch, while --num_gpus tells the model how to distribute the workload. Using the same environment variable for both ensures they are synchronized.
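Putting the two notes above together, a 14B multi-GPU run with both offload flags might look like the following sketch, which reuses the prompt and input image from the earlier single-GPU example:
# Set the number of GPUs to use
export NUM_GPUS=8
# Run the 14B model with context parallelism and memory offloading
torchrun --nproc_per_node=${NUM_GPUS} examples/video2world.py \
--model_size 14B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--save_path output/video2world_14b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS} \
--offload_guardrail \
--offload_prompt_refiner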
Important considerations for multi-GPU inference:
The number of GPUs should ideally be a divisor of the number of frames in the video
All GPUs should have the same model capacity and memory
For best results, use context parallelism with the 14B model where memory constraints are significant
Context parallelism works with both single-frame and multi-frame conditioning
Requires NCCL support and proper GPU interconnect for efficient communication
Rejection Sampling for Quality Improvement#
Video quality can be further improved by generating multiple variations and selecting the best one based on automatic quality assessment using Cosmos-Reason1-7B as the critic model. This approach, known as rejection sampling, can significantly enhance the visual quality of the generated videos.
# Set the input prompt
PROMPT="A nighttime city bus terminal gradually shifts from stillness to subtle movement. Multiple double-decker buses are parked under overhead lights, with a central bus labeled '87D' facing forward."
# Run video2world generation with rejection sampling
python -m examples.video2world_bestofn \
--model_size 2B \
--input_path assets/video2world/input0.jpg \
--prompt "${PROMPT}" \
--num_generations 5 \
--num_critic_trials 3 \
--save_path output/rejection_sampling_demo
This command performs the following steps:
Generates 5 different videos from the same input and prompt
Evaluates each video 3 times using the Cosmos-Reason1 critic model
Saves all videos with quality scores in their filenames (from 000 to 100)
Creates HTML reports with detailed analysis for each video
The highest-scored video represents the best generation from the batch. For batch processing with existing videos:
# Run critic on existing videos without generation
python -m examples.video2world_bestofn \
--skip_generation \
--save_path output/my_existing_videos
Long Video Generation#
Each single forward pass of the Video2World model only generates one chunk of video. The Video2World model also supports long-video generation using auto-regressive inference: it generates the first chunk, then iteratively takes the last num_conditional_frames frames of the previous chunk as the input condition for the next chunk.
Since long-video generation calls the entire denoising process of the Video2World model num_chunks times, it is much slower than single-chunk video generation. Therefore, multi-GPU inference is highly recommended to boost computation speed.
# Set the input prompt
PROMPT="The video opens with a view inside a well-lit warehouse or retail store aisle, characterized by high ceilings and industrial shelving units stocked with various products. The shelves are neatly organized with items such as canned goods, packaged foods, and cleaning supplies, all displayed in bright packaging that catches the eye. The surrounding environment includes additional shelving units filled with similar products. The scene concludes with the forklift still in motion, ensuring the pallet is securely placed on the shelf."
# Set the number of GPUs to use
export NUM_GPUS=8
# Run video2world long video generation of 6 chunks
PYTHONPATH=. torchrun --nproc_per_node=${NUM_GPUS} examples/video2world_lvg.py \
--model_size 14B \
--num_chunks 6 \
--input_path assets/video2world_lvg/example_input.jpg \
--prompt "${PROMPT}" \
--save_path output/video2world_14b_lvg_example1.mp4 \
--num_gpus ${NUM_GPUS} \
--disable_guardrail \
--disable_prompt_refiner
Example output is located at assets/video2world_lvg/example_output.mp4.
Note
The 14B model requires significant GPU memory, so you may want to offload the guardrail and prompt refiner models using the --offload_guardrail and --offload_prompt_refiner flags if you are using a GPU with limited memory.
Faster Inference with Sparse Attention#
If you’re targeting 720P generation, you can use sparse attention variants built with NATTEN. This can increase inference speed by 1.7X to 2.6X over the base model, depending on the variant, frame rate, and hardware. Refer to the Model Matrix page for more details.
python -m examples.video2world \
--model_size 2B \
--input_path $INPUT_PATH \
--prompt "${PROMPT}" \
--natten \
--save_path output/video2world_2b_with_natten.mp4
API Documentation#
The video2world.py script supports the following command-line arguments:
Model selection:
--model_size: Size of the model to use (choices: “2B”, “14B”, default: “2B”)
--dit_path: Custom path to the DiT model checkpoint for post-trained models (default: uses standard checkpoint path based on model_size)
--load_ema: Use EMA weights from the post-trained DiT model checkpoint for generation.
--fps: FPS of the model to use for video-to-world generation (choices: 10, 16, default: 16)
--resolution: Resolution of the model to use for video-to-world generation (choices: 480, 720, default: 720)
Note
By default, a 720P, 16FPS checkpoint is used for the selected model_size. If you want to use another configuration, download the corresponding checkpoint and pass --fps, --resolution, or both.
Input parameters:
--prompt: Text prompt describing the video to generate (default: empty string)
--negative_prompt: Text describing what to avoid in the generated video (default: predefined negative prompt)
--aspect_ratio: Aspect ratio of the generated output, including “1:1”, “4:3”, “3:4”, “16:9”, “9:16” (default: “16:9”)
--input_path: Path to input image or video for conditioning (default: “assets/video2world/input0.jpg”)
--num_conditional_frames: Number of frames to condition on (choices: 1, 5, default: 1)
Note
If the resolution of the input image/video does not match the specified aspect_ratio for the output, the Video2World pipeline will perform the following steps:
Resize the input to equal or larger lengths in height/width dimensions.
Center-crop the input to match the predefined resolution for the corresponding aspect ratio.
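As a rough illustration only (the pipeline performs this preprocessing internally, so no manual step is required), the two steps above can be approximated with ffmpeg for a 16:9, 720P target, assuming 1280x720 output dimensions:
# Scale so both dimensions cover 1280x720 while keeping aspect ratio, then center-crop
ffmpeg -i input.mp4 -vf "scale=1280:720:force_original_aspect_ratio=increase,crop=1280:720" resized.mp4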
Output parameters:
--save_path: Path to save the generated video (default: “output/generated_video.mp4”)
Generation parameters:
--guidance: Classifier-free guidance scale (default: 7.0)
--seed: Random seed for reproducibility (default: 0)
--num_gpus: Number of GPUs to use for context parallel inference in the video generation phase (default: 1)
Performance parameters:
--use_cuda_graphs: Use CUDA Graphs to accelerate DiT inference.
--natten: Use sparse attention variants built with NATTEN.
--benchmark: Run in benchmark mode to measure average generation time.
Multi-GPU inference:
For multi-GPU inference, use torchrun --nproc_per_node=$NUM_GPUS examples/video2world.py ...
Both --nproc_per_node (for torchrun) and --num_gpus (for the script) must be set to the same value
Setting the NUM_GPUS environment variable and using it for both parameters ensures they stay synchronized
Batch processing:
--batch_input_json: Path to JSON file containing batch inputs, where each entry should have ‘input_video’, ‘prompt’, and ‘output_video’ fields
Content safety and controls:
--disable_guardrail: Disable guardrail checks on prompts (by default, guardrails are enabled to filter harmful content)
--disable_prompt_refiner: Disable prompt refiner that enhances short prompts (by default, the prompt refiner is enabled)
GPU memory controls:
--offload_guardrail: Offload guardrail to CPU to save GPU memory
--offload_prompt_refiner: Offload prompt refiner to CPU to save GPU memory
Specialized Scripts#
The video2world_bestofn.py script extends the standard Video2World capabilities with rejection sampling to improve video quality. It supports all the standard Video2World parameters, in addition to the following:
--num_generations: Number of different videos to generate from the same input (default: 2)
--num_critic_trials: Number of times to evaluate each video with the critic model (default: 5)
--skip_generation: Run critic only on existing videos without generation
--save_path: Directory to save the generated videos and HTML reports (default: “output/best-of-n”)
For more details, refer to the Rejection Sampling for Quality Improvement section.
Prompt Engineering Tips#
For best results with Video2World models, create detailed prompts that emphasize:
Physical realism: Describe how objects interact with the environment following natural laws of physics
Motion details: Specify how elements in the scene should move over time
Visual consistency: Maintain logical relationships between objects throughout the video
Cinematography terminology: Use camera-movement terms like “tracking shot,” “panning across,” or “zooming in” to guide framing and motion
Temporal progression: Describe how the scene evolves (e.g., “gradually,” “suddenly,” “transitions to”)
Include negative prompts to explicitly specify undesired elements, such as jittery motion, visual artifacts, or unrealistic physics.
The more grounded a prompt is in real-world physics and natural temporal progression, the more physically plausible and realistic the generated video will be.
Example of a good prompt:
A tranquil lakeside at sunset. Golden light reflects off the calm water surface, gradually rippling as a gentle breeze passes through. Tall pine trees along the shore sway slightly, their shadows lengthening across the water. A small wooden dock extends into the lake, where a rowboat gently bobs with the subtle movements of the water.
This prompt includes both static scene elements and suggestions for motion that the Video2World model can interpret and animate.
Text2World#
The Text2World pipeline combines the Text2Image and Video2World models to generate videos directly from text prompts in a two-phase process:
Text2Image generation: The text prompt is processed by the Text2Image model to create a single still image that serves as the first frame.
Video2World generation: This image (single-frame conditioning) is then fed into the Video2World model along with the original text prompt to animate it into a dynamic video.
The temporary image created in the first phase is automatically cleaned up after the process completes successfully.
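Conceptually, this is similar to the following manual two-step run, where ${PROMPT} holds the text prompt. This is a sketch only; the actual text2world.py script manages both phases internally, including cleanup of the intermediate image:
# Step 1 (sketch): generate the first frame with Text2Image
python -m examples.text2image \
--prompt "${PROMPT}" \
--model_size 2B \
--save_path output/first_frame.jpg
# Step 2 (sketch): animate that frame with Video2World using the same prompt
python -m examples.video2world \
--model_size 2B \
--input_path output/first_frame.jpg \
--num_conditional_frames 1 \
--prompt "${PROMPT}" \
--save_path output/text2world_manual.mp4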
The inference script is located at examples/text2world.py.
It requires the input argument --prompt (text input).
For a complete list of available arguments and options:
python -m examples.text2world --help
Examples#
Single Video Generation#
This is a basic example for running inference on the 2B model with a text prompt.
The output is saved to output/text2world_2b.mp4.
# Set the input prompt
PROMPT="An autonomous welding robot arm operating inside a modern automotive factory, sparks flying as it welds a car frame with precision under bright overhead lights."
# Run text2world generation
python -m examples.text2world \
--model_size 2B \
--prompt "${PROMPT}" \
--save_path output/text2world_2b.mp4
The 14B model can be run similarly by changing the model size parameter.
Batch Video Generation#
For generating multiple videos with different prompts, you can use a JSON file with batch inputs. The JSON file should contain an array of objects, where each object has:
prompt: The text prompt describing the desired video (required)
output_video: The path where the generated video should be saved (required)
An example can be found in assets/text2world/batch_example.json:
[
{
"prompt": "An autonomous welding robot arm operating inside a modern automotive factory, sparks flying as it welds a car frame with precision under bright overhead lights.",
"output_video": "output/welding-robot-factory.mp4"
},
{
"prompt": "A wooden sailboat gently moves across a tranquil lake, its sail billowing slightly with the breeze. The water ripples around the hull as the boat glides forward. Mountains are visible in the background under a clear blue sky with scattered clouds.",
"output_video": "output/sailboat-mountain-lake.mp4"
},
{
"prompt": "A modern kitchen scene where a stand mixer is actively blending cake batter in a glass bowl. The beater rotates steadily, incorporating ingredients as the mixture swirls. A light dusting of flour is visible on the countertop, and sunshine streams in through a nearby window.",
"output_video": "output/kitchen-mixer-baking.mp4"
}
]
Specify the input via the --batch_input_json argument:
# Run batch text2world generation
python -m examples.text2world \
--model_size 2B \
--batch_input_json assets/text2world/batch_example.json
This will generate three separate videos according to the prompts specified in the JSON file, with each output saved to its corresponding path.
Multi-GPU Inference#
Text2World supports multi-GPU inference to significantly accelerate video generation with no noticeable reduction in quality, especially for the 14B model. The pipeline uses an optimized two-stage approach:
Text2Image Stage: Runs on GPU rank 0 to generate the first frame(s).
Video2World Stage: Uses context parallelism across all GPUs to generate the final video.
To enable multi-GPU inference, use torchrun to launch the script:
# Set the number of GPUs to use
export NUM_GPUS=8
# Run text2world generation with multi-GPU acceleration
torchrun --nproc_per_node=${NUM_GPUS} -m examples.text2world \
--model_size 2B \
--prompt "${PROMPT}" \
--save_path output/text2world_2b_${NUM_GPUS}gpu.mp4 \
--num_gpus ${NUM_GPUS} \
--disable_guardrail \
--disable_prompt_refiner
This distributes the computation across multiple GPUs for the video generation phase (Video2World), with each GPU processing a subset of the video frames. The image generation phase (Text2Image) still runs on a single GPU.
Note
Both the --nproc_per_node and --num_gpus parameters are required and must be set to the same value. --nproc_per_node specifies how many processes PyTorch will launch, while --num_gpus determines how the model will distribute the workload.
Consider the following when using multi-GPU inference:
The number of GPUs should ideally be a divisor of the number of frames in the generated video.
All GPUs should have the same model capacity and memory.
Context parallelism works best with the 14B model because memory constraints are significant.
NCCL support and proper GPU interconnects are required for efficient communication.
API Documentation#
The text2world.py script supports the following command-line arguments:
Model selection:
--model_size: Size of the model to use (choices: “2B”, “14B”, default: “2B”)
--load_ema: Use EMA weights from the post-trained DiT model checkpoint for generation.
Input parameters:
--prompt: Text prompt describing the video to generate (default: predefined example prompt)
--negative_prompt: Text describing what to avoid in the generated video (default: predefined negative prompt)
Output parameters:
--save_path: Path to save the generated video (default: “output/generated_video.mp4”)
Generation parameters:
--guidance: Classifier-free guidance scale for video generation (default: 7.0)
--seed: Random seed for reproducibility (default: 0)
Performance optimization:
--use_cuda_graphs: Use CUDA Graphs for Text2Image inference acceleration.
--natten: Use sparse attention variants built with NATTEN.
--benchmark: Run in benchmark mode to measure average generation time.
Text2Image phase parameters:
--dit_path_text2image: Custom path to the DiT model checkpoint for post-trained Text2Image models
Video2World phase parameters:
--dit_path_video2world: Custom path to the DiT model checkpoint for post-trained Video2World models
--resolution: Resolution for Text2Image generation. Supported values include “480” and “720” (default: “720”)
--fps: FPS for Video2World generation. Supported values include 10 and 16 (default: 16)
Multi-GPU inference:
--num_gpus: Number of GPUs to use for context parallel inference (default: 1)
--nproc_per_node: Number of processes PyTorch will launch
Note
Both the --nproc_per_node and --num_gpus parameters are required for multi-GPU inference and must be set to the same value. Setting the NUM_GPUS environment variable and using it for both parameters ensures they stay synchronized.
Note
In multi-GPU inference, the Text2Image phase runs only on rank 0, while Video2World uses context parallelism across all GPUs.
Batch processing:
--batch_input_json: Path to JSON file containing batch inputs, where each entry should have ‘prompt’ and ‘output_video’ fields
Content safety and controls:
--disable_guardrail: Disable guardrail checks on prompts. By default, guardrails are enabled to filter harmful content.
--disable_prompt_refiner: Disable the prompt refiner, which enhances short prompts. By default, the prompt refiner is enabled.
--offload_guardrail: Offload the guardrail to the CPU to save GPU memory.
--offload_prompt_refiner: Offload the prompt refiner to the CPU to save GPU memory.
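As an illustration, a lower-resolution, lower-FPS Text2World run with both offload flags could look like the following. This is a sketch combining documented arguments; it assumes the matching 480P/10FPS Video2World checkpoint has been downloaded and that ${PROMPT} holds the text prompt:
python -m examples.text2world \
--model_size 2B \
--prompt "${PROMPT}" \
--resolution 480 \
--fps 10 \
--offload_guardrail \
--offload_prompt_refiner \
--save_path output/text2world_2b_480p_10fps.mp4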
Prompt Engineering Tips#
For best results with Text2World models, prompts should describe both what should appear in the scene and how things should move or change over time:
Scene description: Include details about objects, lighting, materials, and spatial relationships
Motion description: Describe how elements should move, interact, or change during the video
Temporal progression: Use words like “gradually,” “suddenly,” or “transitions to” to guide how the scene evolves
Physical dynamics: Describe physical effects like “water splashing,” “leaves rustling,” or “smoke billowing”
Cinematography terms: Include camera movements like “panning across,” “zooming in,” or “tracking shot”
Example of a good prompt:
A tranquil lakeside at sunset. Golden light reflects off the calm water surface, gradually rippling as a gentle breeze passes through. Tall pine trees along the shore sway slightly, their shadows lengthening across the water. A small wooden dock extends into the lake, where a rowboat gently bobs with the subtle movements of the water.
This prompt includes both static scene elements and suggestions for motion that the Video2World model can interpret and animate.