Model Matrix#
Refer to the table below for world foundation models (WFMs) available in Cosmos Predict2.
| Model | Description | Required GPU VRAM | Post-Training Supported |
|---|---|---|---|
| 2B-Text2Image | Diffusion-based text-to-image generation (2 billion parameters) | 26.02 GB | No |
| 14B-Text2Image | Diffusion-based text-to-image generation (14 billion parameters) | 48.93 GB | No |
| 2B-Video2World | Diffusion-based video and text to visual world generation (2 billion parameters) | 32.54 GB | Yes |
| 14B-Video2World | Diffusion-based video and text to visual world generation (14 billion parameters) | 56.38 GB | Yes |
| | Video- and action-based future visual world generation; post-trained on the Bridge dataset | N/A | Yes |
| | Video- and text-based future visual world generation; post-trained on the GR00T Dreams GR1 dataset | N/A | Yes |
| | Video- and text-based future visual world generation; post-trained on the GR00T Dreams DROID dataset | N/A | Yes |
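When scripting against these checkpoints, the VRAM figures above can serve as a preflight check before loading a model. A minimal sketch in plain Python (the `runnable_models` helper and the model keys are illustrative, not part of the Cosmos API; the requirements are copied from the table):

```python
# Approximate VRAM requirements (GB) from the model matrix above.
VRAM_REQUIRED_GB = {
    "2B-Text2Image": 26.02,
    "14B-Text2Image": 48.93,
    "2B-Video2World": 32.54,
    "14B-Video2World": 56.38,
}

def runnable_models(total_vram_gb: float) -> list[str]:
    """Return the models whose stated VRAM requirement fits the given budget."""
    return [name for name, need in VRAM_REQUIRED_GB.items() if need <= total_vram_gb]

# An 80 GB GPU fits every model in the matrix; a 48 GB GPU fits only
# the 2B checkpoints.
print(runnable_models(80.0))
print(runnable_models(48.0))
```

On a live system, the available budget can be queried with `torch.cuda.get_device_properties(0).total_memory` (in bytes) rather than hard-coded.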
Performance Benchmarks#
The following table shows generation times across different NVIDIA GPU hardware:
| GPU Hardware | 2B-Text2Image | 14B-Text2Image | 2B-Video2World | 14B-Video2World |
|---|---|---|---|---|
| NVIDIA GB200 | 3.39 sec | 8.5 sec | 25.61 sec | 85.26 sec |
| NVIDIA B200 | 3.24 sec | 8.68 sec | 30.7 sec | 92.59 sec |
| NVIDIA RTX PRO 6000 | 5.59 sec | 24.16 sec | 82.43 sec | 321.9 sec |
| NVIDIA DGX Spark | 24.87 sec | 138.94 sec | 344.64 sec | 1902.26 sec |
| NVIDIA H200 SXM | 9.02 sec | 15.96 sec | 50.2 sec | 176.19 sec |
| NVIDIA H200 NVL | 6.34 sec | 16.95 sec | 54.01 sec | 203.56 sec |
| NVIDIA H100 PCIe | 11.12 sec | 23.83 sec | 79.87 sec | 286.46 sec |
| NVIDIA H100 NVL | 5.05 sec | 23.97 sec | 87.32 sec | 377.67 sec |
| NVIDIA H20 | 11.47 sec | 59.59 sec | 179.69 sec | 852.64 sec |
| NVIDIA L40S | 8.9 sec | (OOM) | 127.49 sec | 1036.24 sec |
| NVIDIA RTX 6000 Ada | 11.94 sec | 167.86 sec | 180.99 sec | 876.68 sec |
Note: (OOM) indicates "Out of Memory"; the model is too large to run on that GPU.

Note: Video2World inference was performed at 480p resolution with 16 FPS.
Video2World Sparse Attention Variants#
The Video2World models include variants trained with sparse attention using NATTEN, which can accelerate inference by up to 2.5x with comparable quality on the Hopper and Blackwell architectures. This feature is only available for 720p inference and only on NVIDIA GPUs with compute capability 9.0 or 10.0. Refer to the Predict2 Model Reference for downloading the NATTEN-enabled Video2World models.
The baseline models already run with state-of-the-art attention kernels: Flash Attention V3 on Hopper and cuDNN Attention on Blackwell. This contrasts with other sparse-attention methods for video generation, which often report performance numbers against a Flash Attention V2 baseline.
The NATTEN Hopper and Blackwell FNA kernels can deliver speedups proportional to the reduction in FLOPs relative to Flash Attention V3 and cuDNN Blackwell FMHA.
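The compute-capability constraint above can be checked programmatically. A sketch in plain Python (the helper name is illustrative; on a CUDA system the capability tuple comes from `torch.cuda.get_device_capability()`):

```python
def supports_sparse_variants(compute_capability: tuple[int, int]) -> bool:
    """Per the constraint above, the NATTEN-enabled Video2World variants
    require compute capability 9.0 (Hopper) or 10.0 (Blackwell)."""
    return compute_capability in ((9, 0), (10, 0))

# H100/H200 (Hopper, 9.0) and B200 (Blackwell, 10.0) qualify;
# Ada GPUs such as the L40S (8.9) do not.
print(supports_sparse_variants((9, 0)))
print(supports_sparse_variants((8, 9)))
```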
The following table shows generation times (720p, 16 FPS) with and without sparsity across supported NVIDIA GPUs:
| GPU Hardware | 2B-Video2World | 2B-Video2World + NATTEN | 14B-Video2World | 14B-Video2World + NATTEN |
|---|---|---|---|---|
| NVIDIA B200 | 123.9 sec | 54.0 sec (2.3X) | 439.4 sec | 223.1 sec (2.0X) |
| NVIDIA H200 SXM | 221.7 sec | 89.4 sec (2.5X) | 836.9 sec | 412.9 sec (2.0X) |
| NVIDIA H200 NVL | 267.2 sec | 104.3 sec (2.6X) | 1006.7 sec | 489.5 sec (2.1X) |
| NVIDIA H100 PCIe | 378.5 sec | 149.6 sec (2.5X) | 1425.4 sec | 706.9 sec (2.0X) |
| NVIDIA H100 NVL | 355.7 sec | 138.7 sec (2.6X) | 1348.6 sec | 677.0 sec (2.0X) |
| NVIDIA H100 SXM | 228.8 sec | 94.2 sec (2.4X) | 856.9 sec | 426.0 sec (2.0X) |
The following table shows generation times (720p, 10 FPS) with and without sparsity across supported NVIDIA GPUs:
| GPU Hardware | 2B-Video2World | 2B-Video2World + NATTEN | 14B-Video2World | 14B-Video2World + NATTEN |
|---|---|---|---|---|
| NVIDIA B200 | 62.4 sec | 32.6 sec (1.9X) | 230.0 sec | 136.5 sec (1.7X) |
| NVIDIA H200 SXM | 111.1 sec | 52.9 sec (2.1X) | 436.7 sec | 252.1 sec (1.7X) |
| NVIDIA H200 NVL | 133.1 sec | 60.7 sec (2.2X) | 519.3 sec | 296.6 sec (1.8X) |
| NVIDIA H100 PCIe | 187.9 sec | 87.4 sec (2.1X) | 749.2 sec | 439.3 sec (1.7X) |
| NVIDIA H100 NVL | 175.5 sec | 79.0 sec (2.2X) | 711.5 sec | 418.0 sec (1.7X) |
Model Selection Guide#
The 2B models offer a good balance of quality and performance for many practical applications, while being more resource-efficient. The 14B models generally produce higher fidelity results with better coherence and detail, but with increased computational costs.
For most development and testing scenarios, starting with the 2B models is recommended. You can then scale up to 14B models when higher quality is needed and hardware resources permit.
We recommend the 2B models for the following use cases:

- Faster inference times and lower latency
- Limited GPU memory (~26-33 GB VRAM required)
- Simpler scenes and compositions
- Rapid prototyping or testing
- Processing large batches of images/videos efficiently
We recommend the 14B models for the following use cases:

- Higher quality and more detailed outputs
- Sufficient GPU resources (~49-57 GB VRAM required)
- Complex scenes with intricate detail
- Quality prioritized over generation speed
- Final production assets
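The guidance above can be condensed into a simple selection policy. A sketch (the function name and thresholds are illustrative, derived from the VRAM figures in the model matrix, not an official API):

```python
def pick_video2world_model(free_vram_gb: float, prioritize_quality: bool = False) -> str:
    """Default to the 2B checkpoint; step up to 14B only when quality is
    prioritized and VRAM permits (~57 GB per the model matrix)."""
    if prioritize_quality and free_vram_gb >= 57.0:
        return "14B-Video2World"
    if free_vram_gb >= 33.0:
        return "2B-Video2World"
    raise ValueError("Insufficient VRAM for any Video2World model (~33 GB minimum)")

# Quality-first run on an 80 GB GPU vs. the default resource-efficient choice.
print(pick_video2world_model(80.0, prioritize_quality=True))
print(pick_video2world_model(40.0))
```

Note that the default branch prefers 2B even on large GPUs, matching the recommendation to start with 2B and scale up only when quality demands it.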