Model Matrix
World Foundation Models
Refer to the table below for world foundation models (WFMs) available in Cosmos Predict1, along with their supported workflows and compute requirements.
Note
We recommend using NVIDIA H100-80GB or A100-80GB GPUs for inference and post-training.
| Model | Description | Inference: Compute Requirements | Inference: Multi-GPU Supported | Post-Training: Supported | Post-Training: Compute Requirements |
|---|---|---|---|---|---|
| | Diffusion-based text to visual world generation (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| | Diffusion-based text to visual world generation (14 billion parameters) | 1 GPU | No | No | – |
| | Diffusion-based video and text to visual world generation (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| | Diffusion-based video and text to visual world generation (14 billion parameters) | 1 GPU | No | No | – |
| | Diffusion-based text to visual world generation with multiple views (e.g. different cameras on an autonomous vehicle) from text input (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| | World generation with multiple views (e.g. different cameras on an autonomous vehicle) from video input (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| | Autoregressive-based video or image to visual world generation (4 billion parameters) | 1 GPU | No | Yes | 2 GPUs |
| | Autoregressive-based video or image to visual world generation (12 billion parameters) | 1 GPU | No | Yes | 8 GPUs |
| | Autoregressive-based video and text to visual world generation (5 billion parameters) | 1 GPU | No | No | – |
| | Autoregressive-based video and text to visual world generation (13 billion parameters) | 1 GPU | No | No | – |
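As a quick way to query the matrix above, the rows can be encoded as plain data and filtered by the number of GPUs on hand. This is only an illustrative sketch: the `WfmRow` class, the `post_trainable` helper, and the shortened row labels are hypothetical names introduced here, and treating the listed post-training GPU count as a minimum is an assumption rather than something the table states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WfmRow:
    """One row of the WFM table (hypothetical structure for illustration)."""
    label: str                         # shortened form of the Description column
    inference_gpus: int                # Inference: Compute Requirements
    multi_gpu_inference: bool          # Inference: Multi-GPU Supported
    post_training_gpus: Optional[int]  # None where post-training is not supported


# Values transcribed from the table; labels are abbreviations of the descriptions.
ROWS: List[WfmRow] = [
    WfmRow("Diffusion, text to world, 7B", 1, True, 8),
    WfmRow("Diffusion, text to world, 14B", 1, False, None),
    WfmRow("Diffusion, video and text to world, 7B", 1, True, 8),
    WfmRow("Diffusion, video and text to world, 14B", 1, False, None),
    WfmRow("Diffusion, multi-view from text, 7B", 1, True, 8),
    WfmRow("Multi-view from video, 7B", 1, True, 8),
    WfmRow("Autoregressive, video or image to world, 4B", 1, False, 2),
    WfmRow("Autoregressive, video or image to world, 12B", 1, False, 8),
    WfmRow("Autoregressive, video and text to world, 5B", 1, False, None),
    WfmRow("Autoregressive, video and text to world, 13B", 1, False, None),
]


def post_trainable(rows: List[WfmRow], available_gpus: int) -> List[WfmRow]:
    """Return rows whose post-training requirement fits the available GPUs.

    Assumption: the GPU count in the table is read as a minimum.
    """
    return [
        row for row in rows
        if row.post_training_gpus is not None
        and row.post_training_gpus <= available_gpus
    ]


if __name__ == "__main__":
    for row in post_trainable(ROWS, available_gpus=2):
        print(row.label)
```

With two GPUs available, only the 4-billion-parameter autoregressive model passes the filter; with eight, all six models marked as supporting post-training do.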
Tokenizers
Refer to the table below for tokenizers available in Cosmos Predict1, along with their post-training support and compute requirements.
| Model | Description | Post-Training Supported | Post-Training Compute Requirements |
|---|---|---|---|
| | Continuous Video Tokenizer with 8x8x8 spatio-temporal compression with 121 frames of context | Yes | 8 GPUs |
| | Discrete Video Tokenizer with 8x16x16 spatio-temporal compression and 49 frames of context | Yes | 8 GPUs |
| | Continuous Image Tokenizer with 8x8 spatial compression and low-resolution support | No | – |
| | Continuous Image Tokenizer with 16x16 spatial compression and low-resolution support | No | – |
| | Continuous Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support | Yes | 8 GPUs |
| | Discrete Image Tokenizer with 8x8 spatial compression and low-resolution support | No | – |
| Cosmos-Tokenize1-DI16x16-360p | Discrete Image Tokenizer with 16x16 spatial compression and low-resolution support | No | – |
| Cosmos-Tokenize1-DV4×8×8-360p | Discrete Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support | Yes | 8 GPUs |
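To make the compression factors in the table concrete, the sketch below estimates the latent grid a video tokenizer would produce for a given input. It rests on several assumptions that the table does not state: that the first factor is temporal and the other two are spatial (height, width), that the tokenizer is causal so the first frame is encoded on its own and the remaining frames in groups of the temporal factor, and that the input is 1280x720, which is chosen purely for illustration. The `latent_shape` function is a hypothetical helper, not part of any Cosmos API.

```python
from typing import Tuple


def latent_shape(
    frames: int,
    height: int,
    width: int,
    compression: Tuple[int, int, int],
    causal: bool = True,
) -> Tuple[int, int, int]:
    """Estimate (latent_frames, latent_height, latent_width).

    `compression` is read as (temporal, height, width) factors (assumption).
    """
    t, h, w = compression
    if height % h or width % w:
        raise ValueError("height and width must be divisible by the spatial factors")
    if causal:
        # Assumed causal convention: 1 latent for the first frame,
        # then 1 latent per group of `t` subsequent frames.
        if (frames - 1) % t:
            raise ValueError("frames - 1 must be divisible by the temporal factor")
        latent_frames = 1 + (frames - 1) // t
    else:
        if frames % t:
            raise ValueError("frames must be divisible by the temporal factor")
        latent_frames = frames // t
    return latent_frames, height // h, width // w


if __name__ == "__main__":
    # 8x8x8 compression with 121 frames of context (illustrative 1280x720 input)
    print(latent_shape(121, 720, 1280, (8, 8, 8)))
    # 8x16x16 compression with 49 frames of context
    print(latent_shape(49, 720, 1280, (8, 16, 16)))
```

Under these assumptions, the context lengths of 121 and 49 frames both divide cleanly, as 1 + 8k frames, which is consistent with the causal reading but does not confirm it.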