Model Matrix#
Refer to the table below for world foundation models (WFMs) available in Cosmos Predict2.
| Model | Description | Required GPU VRAM | Post-Training Supported |
|---|---|---|---|
| 2B-Text2Image | Diffusion-based text-to-image generation (2 billion parameters) | 26.02 GB | No |
| 14B-Text2Image | Diffusion-based text-to-image generation (14 billion parameters) | 48.93 GB | No |
| 2B-Video2World | Diffusion-based video and text to visual world generation (2 billion parameters) | 32.54 GB | Yes |
| 14B-Video2World | Diffusion-based video and text to visual world generation (14 billion parameters) | 56.38 GB | Yes |
| | Video- and action-based future visual world generation; post-trained on the Bridge dataset | N/A | Yes |
| | Video- and text-based future visual world generation; post-trained on the GR00T Dreams GR1 dataset | N/A | Yes |
| | Video- and text-based future visual world generation; post-trained on the GR00T Dreams DROID dataset | N/A | Yes |
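When scripting against these checkpoints, the VRAM figures above can serve as a preflight check before loading a model. A minimal sketch in plain Python (the `runnable_models` helper and the model keys are illustrative, not part of the Cosmos API; the requirements are copied from the table):

```python
# Approximate VRAM requirements (GB) from the model matrix above.
VRAM_REQUIRED_GB = {
    "2B-Text2Image": 26.02,
    "14B-Text2Image": 48.93,
    "2B-Video2World": 32.54,
    "14B-Video2World": 56.38,
}

def runnable_models(total_vram_gb: float) -> list[str]:
    """Return the models whose stated VRAM requirement fits the given budget."""
    return [name for name, need in VRAM_REQUIRED_GB.items() if need <= total_vram_gb]

# An 80 GB GPU fits every model in the matrix; a 48 GB GPU fits only
# the 2B checkpoints.
print(runnable_models(80.0))
print(runnable_models(48.0))
```

On a live system, the available budget can be queried with `torch.cuda.get_device_properties(0).total_memory` (in bytes) rather than hard-coded.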
Performance Benchmarks#
The following table shows generation times across different NVIDIA GPU hardware:
| GPU Hardware | 2B-Text2Image | 14B-Text2Image | 2B-Video2World | 14B-Video2World |
|---|---|---|---|---|
| NVIDIA GB200 | 3.39 sec | 8.5 sec | 25.61 sec | 85.26 sec |
| NVIDIA B200 | 3.24 sec | 8.68 sec | 30.7 sec | 92.59 sec |
| NVIDIA RTX PRO 6000 | 5.59 sec | 24.16 sec | 82.43 sec | 321.9 sec |
| NVIDIA DGX Spark | 24.87 sec | 138.94 sec | 344.64 sec | 1902.26 sec |
| NVIDIA H200 SXM | 9.02 sec | 15.96 sec | 50.2 sec | 176.19 sec |
| NVIDIA H200 NVL | 6.34 sec | 16.95 sec | 54.01 sec | 203.56 sec |
| NVIDIA H100 PCIe | 11.12 sec | 23.83 sec | 79.87 sec | 286.46 sec |
| NVIDIA H100 NVL | 5.05 sec | 23.97 sec | 87.32 sec | 377.67 sec |
| NVIDIA H20 | 11.47 sec | 59.59 sec | 179.69 sec | 852.64 sec |
| NVIDIA L40S | 8.9 sec | (OOM) | 127.49 sec | 1036.24 sec |
| NVIDIA RTX 6000 Ada | 11.94 sec | 167.86 sec | 180.99 sec | 876.68 sec |
Note: (OOM) indicates "Out of Memory"; the model is too large to run on that GPU.

Note: Video2World inference was performed at 480p resolution with 16 FPS.
Video2World Sparse Attention Variants#
The Video2World models include variants trained with sparse attention using NATTEN, which can accelerate inference by up to 2.5x with comparable quality on the Hopper and Blackwell architectures. This feature is only available for 720p inference and only on NVIDIA GPUs with compute capability 9.0 or 10.0. Refer to the Predict2 Model Reference for downloading the NATTEN-enabled Video2World models.
The baseline models already run with state-of-the-art attention kernels: Flash Attention V3 on Hopper and cuDNN Attention on Blackwell. This contrasts with other sparse-attention methods for video generation, which often report performance numbers against a Flash Attention V2 baseline.
The NATTEN Hopper and Blackwell FNA kernels can deliver speedups proportional to the reduction in FLOPs relative to Flash Attention V3 and cuDNN Blackwell FMHA.
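The compute-capability constraint above can be checked programmatically. A sketch in plain Python (the helper name is illustrative; on a CUDA system the capability tuple comes from `torch.cuda.get_device_capability()`):

```python
def supports_sparse_variants(compute_capability: tuple[int, int]) -> bool:
    """Per the constraint above, the NATTEN-enabled Video2World variants
    require compute capability 9.0 (Hopper) or 10.0 (Blackwell)."""
    return compute_capability in ((9, 0), (10, 0))

# H100/H200 (Hopper, 9.0) and B200 (Blackwell, 10.0) qualify;
# Ada GPUs such as the L40S (8.9) do not.
print(supports_sparse_variants((9, 0)))
print(supports_sparse_variants((8, 9)))
```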
The following table shows generation times (720p, 16 FPS) with and without sparsity across supported NVIDIA GPUs:
| GPU Hardware | 2B-Video2World | 2B-Video2World + NATTEN | 14B-Video2World | 14B-Video2World + NATTEN |
|---|---|---|---|---|
| NVIDIA B200 | 123.9 sec | 54.0 sec (2.3X) | 439.4 sec | 223.1 sec (2.0X) |
| NVIDIA H200 SXM | 221.7 sec | 89.4 sec (2.5X) | 836.9 sec | 412.9 sec (2.0X) |
| NVIDIA H200 NVL | 267.2 sec | 104.3 sec (2.6X) | 1006.7 sec | 489.5 sec (2.1X) |
| NVIDIA H100 PCIe | 378.5 sec | 149.6 sec (2.5X) | 1425.4 sec | 706.9 sec (2.0X) |
| NVIDIA H100 NVL | 355.7 sec | 138.7 sec (2.6X) | 1348.6 sec | 677.0 sec (2.0X) |
| NVIDIA H100 SXM | 228.8 sec | 94.2 sec (2.4X) | 856.9 sec | 426.0 sec (2.0X) |
The following table shows generation times (720p, 10 FPS) with and without sparsity across supported NVIDIA GPUs:
| GPU Hardware | 2B-Video2World | 2B-Video2World + NATTEN | 14B-Video2World | 14B-Video2World + NATTEN |
|---|---|---|---|---|
| NVIDIA B200 | 62.4 sec | 32.6 sec (1.9X) | 230.0 sec | 136.5 sec (1.7X) |
| NVIDIA H200 SXM | 111.1 sec | 52.9 sec (2.1X) | 436.7 sec | 252.1 sec (1.7X) |
| NVIDIA H200 NVL | 133.1 sec | 60.7 sec (2.2X) | 519.3 sec | 296.6 sec (1.8X) |
| NVIDIA H100 PCIe | 187.9 sec | 87.4 sec (2.1X) | 749.2 sec | 439.3 sec (1.7X) |
| NVIDIA H100 NVL | 175.5 sec | 79.0 sec (2.2X) | 711.5 sec | 418.0 sec (1.7X) |
Model Selection Guide#
The 2B models offer a good balance of quality and performance for many practical applications, while being more resource-efficient. The 14B models generally produce higher fidelity results with better coherence and detail, but with increased computational costs.
For most development and testing scenarios, starting with the 2B models is recommended. You can then scale up to 14B models when higher quality is needed and hardware resources permit.
We recommend the 2B models for the following use cases:

- Faster inference times and lower latency
- Limited GPU memory (~26-33 GB VRAM required)
- Simpler scenes and compositions
- Rapid prototyping or testing
- Processing large batches of images/videos efficiently
We recommend the 14B models for the following use cases:

- Higher quality and more detailed outputs
- Sufficient GPU resources (~49-57 GB VRAM required)
- Complex scenes with intricate detail
- Quality prioritized over generation speed
- Final production assets
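The guidance above can be condensed into a simple selection policy. A sketch (the function name and thresholds are illustrative, derived from the VRAM figures in the model matrix, not an official API):

```python
def pick_video2world_model(free_vram_gb: float, prioritize_quality: bool = False) -> str:
    """Default to the 2B checkpoint; step up to 14B only when quality is
    prioritized and VRAM permits (~57 GB per the model matrix)."""
    if prioritize_quality and free_vram_gb >= 57.0:
        return "14B-Video2World"
    if free_vram_gb >= 33.0:
        return "2B-Video2World"
    raise ValueError("Insufficient VRAM for any Video2World model (~33 GB minimum)")

# Quality-first run on an 80 GB GPU vs. the default resource-efficient choice.
print(pick_video2world_model(80.0, prioritize_quality=True))
print(pick_video2world_model(40.0))
```

Note that the default branch prefers 2B even on large GPUs, matching the recommendation to start with 2B and scale up only when quality demands it.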