Model Matrix

World Foundation Models

Refer to the table below for world foundation models (WFMs) available in Cosmos Predict1, along with their supported workflows and compute requirements.

Note: We recommend using NVIDIA H100-80GB or A100-80GB GPUs for inference and post-training.

| Model | Description | Inference: Compute Requirements | Inference: Multi-GPU Supported | Post-Training: Supported | Post-Training: Compute Requirements |
| --- | --- | --- | --- | --- | --- |
| Cosmos-Predict1-7B-Text2World | Diffusion-based text to visual world generation (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| Cosmos-Predict1-14B-Text2World | Diffusion-based text to visual world generation (14 billion parameters) | 1 GPU | No | No | N/A |
| Cosmos-Predict1-7B-Video2World | Diffusion-based video and text to visual world generation (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| Cosmos-Predict1-14B-Video2World | Diffusion-based video and text to visual world generation (14 billion parameters) | 1 GPU | No | No | N/A |
| Cosmos-Predict1-7B-Text2World-Multiview | Diffusion-based text to visual world generation with multiple views (e.g., different cameras on an autonomous vehicle) from text input (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| Cosmos-Predict1-7B-Video2World-Multiview | World generation with multiple views (e.g., different cameras on an autonomous vehicle) from video input (7 billion parameters) | 1 GPU | Yes | Yes | 8 GPUs |
| Cosmos-Predict1-4B | Autoregressive-based video or image to visual world generation (4 billion parameters) | 1 GPU | No | Yes | 2 GPUs |
| Cosmos-Predict1-12B | Autoregressive-based video or image to visual world generation (12 billion parameters) | 1 GPU | No | Yes | 8 GPUs |
| Cosmos-Predict1-5B-Video2World | Autoregressive-based video and text to visual world generation (5 billion parameters) | 1 GPU | No | No | N/A |
| Cosmos-Predict1-13B-Video2World | Autoregressive-based video and text to visual world generation (13 billion parameters) | 1 GPU | No | No | N/A |
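When planning a post-training run, it can help to have the WFM matrix in a machine-readable form. The following is a minimal sketch, not part of Cosmos Predict1: the `WorldModel` class, the `MODELS` list, and the `post_trainable` helper are hypothetical names introduced for illustration, and the sketch assumes the GPU counts listed above are the minimum needed for post-training.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorldModel:
    """One row of the WFM matrix (hypothetical helper type)."""
    name: str
    inference_gpus: int
    multi_gpu_inference: bool
    # None means post-training is not supported for this model.
    post_training_gpus: Optional[int]


# Values transcribed from the WFM table.
MODELS = [
    WorldModel("Cosmos-Predict1-7B-Text2World", 1, True, 8),
    WorldModel("Cosmos-Predict1-14B-Text2World", 1, False, None),
    WorldModel("Cosmos-Predict1-7B-Video2World", 1, True, 8),
    WorldModel("Cosmos-Predict1-14B-Video2World", 1, False, None),
    WorldModel("Cosmos-Predict1-7B-Text2World-Multiview", 1, True, 8),
    WorldModel("Cosmos-Predict1-7B-Video2World-Multiview", 1, True, 8),
    WorldModel("Cosmos-Predict1-4B", 1, False, 2),
    WorldModel("Cosmos-Predict1-12B", 1, False, 8),
    WorldModel("Cosmos-Predict1-5B-Video2World", 1, False, None),
    WorldModel("Cosmos-Predict1-13B-Video2World", 1, False, None),
]


def post_trainable(available_gpus: int) -> List[str]:
    """Return names of models that can be post-trained on the given GPU count.

    Assumption: the listed GPU count is treated as a minimum requirement.
    """
    return [
        m.name
        for m in MODELS
        if m.post_training_gpus is not None
        and m.post_training_gpus <= available_gpus
    ]
```

For example, `post_trainable(2)` returns only `Cosmos-Predict1-4B`, since every other post-trainable model in the table lists 8 GPUs.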

Tokenizers

Refer to the table below for tokenizers available in Cosmos Predict1, along with their post-training support and compute requirements.

| Model | Description | Post-Training: Supported | Post-Training: Compute Requirements |
| --- | --- | --- | --- |
| Cosmos-Tokenize1-CV8×8×8-720p | Continuous Video Tokenizer with 8x8x8 spatio-temporal compression with 121 frames of context | Yes | 8 GPUs |
| Cosmos-Tokenize1-DV8×16×16-720p | Discrete Video Tokenizer with 8x16x16 spatio-temporal compression and 49 frames of context | Yes | 8 GPUs |
| Cosmos-Tokenize1-CI8×8-360p | Continuous Image Tokenizer with 8x8 spatial compression and low-resolution support | No | N/A |
| Cosmos-Tokenize1-CI16x16-360p | Continuous Image Tokenizer with 16x16 spatial compression and low-resolution support | No | N/A |
| Cosmos-Tokenize1-CV4×8×8-360p | Continuous Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support | Yes | 8 GPUs |
| Cosmos-Tokenize1-DI8×8-360p | Discrete Image Tokenizer with 8x8 spatial compression and low-resolution support | No | N/A |
| Cosmos-Tokenize1-DI16x16-360p | Discrete Image Tokenizer with 16x16 spatial compression and low-resolution support | No | N/A |
| Cosmos-Tokenize1-DV4×8×8-360p | Discrete Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support | Yes | 8 GPUs |
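The tokenizer names follow a pattern that matches their descriptions: `C` or `D` for continuous or discrete, `V` or `I` for video or image, then the compression factors and the resolution. The sketch below decodes a name into those parts. The `TokenizerSpec` type and `parse_tokenizer_name` function are hypothetical helpers, not part of Cosmos Predict1, and the naming pattern is inferred from the table rather than from an official specification. Both `x` and `×` are accepted as separators because the names above use both.

```python
import re
from typing import NamedTuple, Tuple


class TokenizerSpec(NamedTuple):
    """Fields decoded from a tokenizer name (hypothetical helper type)."""
    latent: str                   # "continuous" or "discrete"
    modality: str                 # "video" or "image"
    compression: Tuple[int, ...]  # factors in the order they appear in the name
    resolution: str               # e.g. "720p"


# Pattern inferred from the names in the table; accepts "x" or "×".
_PATTERN = re.compile(
    r"^Cosmos-Tokenize1-([CD])([VI])(\d+(?:[x×]\d+)+)-(\d+p)$"
)


def parse_tokenizer_name(name: str) -> TokenizerSpec:
    """Split a Cosmos-Tokenize1 model name into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized tokenizer name: {name!r}")
    latent_code, modality_code, factors, resolution = match.groups()
    compression = tuple(int(f) for f in re.split(r"[x×]", factors))
    return TokenizerSpec(
        latent="continuous" if latent_code == "C" else "discrete",
        modality="video" if modality_code == "V" else "image",
        compression=compression,
        resolution=resolution,
    )
```

For instance, `parse_tokenizer_name("Cosmos-Tokenize1-DV8×16×16-720p")` yields a discrete video tokenizer with compression factors `(8, 16, 16)` at `720p`, which agrees with the description in the table.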