Supported Models#

Overview#

Cosmos-Predict1-7B models are pre-trained, generative models designed for Physical AI. They are available in two variants:

Cosmos-Predict1-7B-Text2World: A text-to-video model that generates video content from textual descriptions.
Cosmos-Predict1-7B-Video2World: A model that generates video from images or other videos, with optional text input.

Cosmos-Transfer2.5-2B is a video-to-video model that supports edge, depth, visual, and segmentation control modalities for style transfer and content transformation.

Cosmos3-Generator is a generative world foundation model (WFM) that supports text-to-video and image-to-video modalities.

Learn more about Cosmos through these resources:

NIM for Cosmos WFM models are available for commercial use under the NVIDIA Open Model license agreement. For more details, refer to the EULA.

Note

NIM for Cosmos WFM models leverage a sophisticated optimization stack that includes PyTorch, NVIDIA NeMo, NVIDIA TensorRT, and NVIDIA TensorRT-LLM for hardware-accelerated inference and deployment.

Configurations#

`Cosmos-Predict1-7B`#

Cosmos-Predict1-7B requires NVIDIA GPUs with Ampere architecture or later. The following configurations are optimized using NVIDIA TensorRT (TRT) and NVIDIA TensorRT-LLM (TRT-LLM) for specific GPU models:

Cosmos-Predict1-7B-Text2World

GPU	GPU Memory (GB)	Precision	Number of GPUs
H200	141	FP8	1, 2, 4, 8
H100 SXM	80	FP8	1, 2, 4, 8
H100 NVL	94	FP8	1, 2, 4, 8
H100 PCIe	80	FP8	1, 2, 4, 8

Cosmos-Predict1-7B-Video2World

GPU	GPU Memory (GB)	Precision	Number of GPUs
H200	141	FP8	1, 2, 4, 8
H100 SXM	80	FP8	1, 2, 4, 8
H100 NVL	94	FP8	1, 2, 4, 8
H100 PCIe	80	FP8	1, 2, 4, 8

Warning

Deploying on configurations not listed above may result in suboptimal performance.

Fallback Configurations#

Other GPU models in the Ampere and Ada generations are supported only through the fallback method (latency profile), subject to these requirements:

Combined VRAM across all GPUs must exceed 100 GB
Each GPU must have at least 48 GB of VRAM
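
The two fallback requirements can be checked together before deployment. The following is a minimal sketch, not part of NIM: the function name `meets_fallback_requirements` and its input (a list of per-GPU VRAM sizes in GB) are assumptions made for illustration.

```python
def meets_fallback_requirements(vram_gb_per_gpu):
    """Return True if the GPUs satisfy both fallback VRAM requirements.

    vram_gb_per_gpu is a hypothetical input: one VRAM size (in GB) per GPU.
    """
    if not vram_gb_per_gpu:
        return False
    combined_ok = sum(vram_gb_per_gpu) > 100   # combined VRAM must exceed 100 GB
    per_gpu_ok = min(vram_gb_per_gpu) >= 48    # the smallest GPU needs at least 48 GB
    return combined_ok and per_gpu_ok


print(meets_fallback_requirements([48, 48, 48]))  # True: 144 GB combined, 48 GB each
print(meets_fallback_requirements([80]))          # False: combined VRAM is only 80 GB
```

Note that a single 80 GB GPU fails the combined-VRAM requirement, so fallback deployments on such GPUs need at least two devices.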

Deployment Profiles#

Profile Selection#

NIM automatically selects the optimal profile based on the number of GPUs exposed to the container (via the --gpus flag). You can override this selection using the NIM_MODEL_PROFILE environment variable.

Two profile types are available:

Latency Profiles#

Latency profiles use a parallelized diffusion component with context-parallel (CP) distribution across exposed GPUs, reducing inference time nearly linearly with GPU count.

Four configurations are available:

CP=8: Used when 8 GPUs are available
CP=4: Used when 4-7 GPUs are available
CP=2: Used when 2-3 GPUs are available
CP=1: Used when 1 GPU is available

Performance example: On H100 SXM GPUs, generating 121 frames takes approximately 5 minutes with CP=1, but approximately 1 minute with CP=8.

Tip

For optimal resource utilization, expose exactly 8, 4, 2, or 1 GPUs so that the count matches a profile. For example, with 7 GPUs, only 4 are used (CP=4 profile), leaving 3 GPUs idle.
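
The mapping from GPU count to CP degree described above can be sketched as follows. This is an illustration of the documented rule rather than NIM's actual selector; the names `CP_DEGREES` and `select_cp` are hypothetical, and the sketch only covers 1 to 8 GPUs because the text does not describe larger counts.

```python
CP_DEGREES = (8, 4, 2, 1)  # the four latency configurations, largest first


def select_cp(n_gpus):
    """Return (cp_degree, idle_gpus) for a given number of exposed GPUs."""
    if not 1 <= n_gpus <= 8:
        raise ValueError("this sketch only covers 1 to 8 GPUs")
    # Pick the largest CP degree that does not exceed the GPU count.
    cp = next(degree for degree in CP_DEGREES if degree <= n_gpus)
    return cp, n_gpus - cp


for n in (8, 7, 3, 1):
    cp, idle = select_cp(n)
    print(f"{n} GPUs -> CP={cp}, idle GPUs: {idle}")
```

With 7 GPUs the sketch returns CP=4 and 3 idle GPUs, matching the example in the tip.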

Throughput Profiles#

Throughput profiles use quantized and TRT-accelerated diffusion components replicated across GPUs, enabling concurrent requests and increasing overall system throughput nearly linearly with GPU count.

Note

Throughput profiles are only available on H200, H100 SXM, H100 PCIe, and H100 NVL GPUs.

Tip

TRT-accelerated profiles run diffusion at lower precision. On 8x H100 SXM, throughput can reach up to twice that of the latency profile, but the output may contain visual artifacts that the latency profile does not produce.

By default, NIM prioritizes latency profiles. To change this behavior, set NIM_MODEL_PROFILE=throughput or NIM_MODEL_PROFILE=latency, or specify a particular profile ID.
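
As an illustration of the override, the sketch below assembles a `docker run` command line that sets `NIM_MODEL_PROFILE`. The helper `docker_run_command` and the `<nim-image>` placeholder are hypothetical, and the command omits other options a real deployment needs (credentials, ports, cache mounts); only the `--gpus` flag and the `NIM_MODEL_PROFILE` variable come from the text above.

```python
import shlex


def docker_run_command(image, profile=None, gpus="all"):
    """Build a docker run argument list that optionally overrides the NIM profile."""
    cmd = ["docker", "run", "--rm", "--gpus", gpus]
    if profile is not None:
        # "throughput", "latency", or a specific profile ID
        cmd += ["-e", f"NIM_MODEL_PROFILE={profile}"]
    cmd.append(image)
    return cmd


# "<nim-image>" is a placeholder, not a real image name.
print(shlex.join(docker_run_command("<nim-image>", profile="throughput")))
```

Leaving `profile` unset keeps the default behavior, in which NIM prioritizes latency profiles.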

For detailed configuration options, refer to the Configuring a NIM page.

`Cosmos-Transfer2.5-2B`#

Cosmos-Transfer2.5-2B is lighter and more efficient than Transfer1-7B. It supports two profile types:

Latency profiles: Use context parallelism (CP) to distribute work across GPUs, processing requests sequentially with reduced inference time.
Throughput profiles: Replicate the model across GPUs, enabling parallel request processing for higher overall throughput.

Supported GPUs#

Transfer2.5 requires NVIDIA GPUs with Hopper architecture or later and at least 80 GB of VRAM.

The following table shows the supported GPU configurations for throughput profiles:

GPU	Memory (GB)	Precision	Number of GPUs
B300	288	BF16, FP8	1, 2, 4
GB200	192	BF16, FP8	1, 2, 4
B200	192	BF16, FP8	1, 2, 4, 8
H200	141	BF16, FP8	1, 2, 4, 8
H200 NVL	141	BF16, FP8	1, 2, 4, 8
H100 80GB	80	BF16, FP8	1, 2, 4, 8
H100 NVL	94	BF16, FP8	1, 2, 4, 8
H100 PCIe	80	BF16, FP8	1
H20	96	BF16, FP8	1, 2, 4

The following table shows the supported GPU configurations for latency profiles:

GPU	Memory (GB)	Precision	Number of GPUs
B300	288	BF16, FP8	1, 2, 4
GB200	192	BF16, FP8	1, 2
B200	192	BF16, FP8	1, 2
H200	141	BF16, FP8	1, 2, 4
H200 NVL	141	BF16, FP8	1, 2
H100 80GB	80	BF16, FP8	1, 2, 4
H100 NVL	94	BF16, FP8	1, 2, 4
H100 PCIe	80	BF16, FP8	1
H20	96	BF16, FP8	1, 2, 4

Note

Latency profiles also exist for 4 or 8 GPUs, but their performance may be the same as or lower than that of the open-source version.

Note

Transfer2.5-2B requires at least one control modality (edge, depth, vis, or seg) to be provided in the request.

Note

Transfer2.5-2B supports fallback configurations. Some GPU configurations not listed above are supported, but performance may be degraded.

`Cosmos3-Generator`#

Cosmos3-Generator packages two model sizes — 8B (nano, default) and 32B (super) — into a single container. Pick the size with -e NIM_MODEL_SIZE=nano|super and the precision with -e NIM_PRECISION=bf16|fp8|nvfp4 (default: fp8). Cosmos3-Generator runs on any NVIDIA GPU with Hopper architecture or later (CC ≥ 9.0); nvfp4 additionally requires Blackwell (CC ≥ 10.0).
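
The size, precision, and compute-capability rules in this paragraph can be expressed as a small pre-flight check. This is a sketch under stated assumptions: `check_generator_config` is a hypothetical helper, not a NIM API, and the compute capability is passed in as a `(major, minor)` tuple rather than probed from the hardware.

```python
VALID_SIZES = {"nano", "super"}  # 8B and 32B
# Minimum compute capability per precision, taken from the paragraph above.
MIN_COMPUTE_CAPABILITY = {"bf16": (9, 0), "fp8": (9, 0), "nvfp4": (10, 0)}


def check_generator_config(compute_capability, size="nano", precision="fp8"):
    """Validate a (size, precision) choice against a GPU's compute capability."""
    if size not in VALID_SIZES:
        raise ValueError(f"unknown NIM_MODEL_SIZE: {size!r}")
    if precision not in MIN_COMPUTE_CAPABILITY:
        raise ValueError(f"unknown NIM_PRECISION: {precision!r}")
    required = MIN_COMPUTE_CAPABILITY[precision]
    if tuple(compute_capability) < required:
        raise ValueError(
            f"{precision} needs compute capability >= {required[0]}.{required[1]}"
        )
    return size, precision


print(check_generator_config((9, 0)))  # defaults: nano and fp8 on a Hopper-class GPU
```

Calling the helper with `precision="nvfp4"` and a compute capability of `(9, 0)` raises `ValueError`, mirroring the Blackwell requirement.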

Per-device VRAM requirements#

The minimum per-device VRAM you need depends on the size and precision you pick. For the 32B (super) size, the NIM automatically falls back to a tensor-parallel (TP) profile on hosts where the single-device layout does not fit. There is no setting to configure: the selector picks the smallest viable layout for your hardware.

Size	Precision	Single-device (`nim_tp=1`)	2-GPU TP fallback (`nim_tp=2`)	4-GPU TP fallback (`nim_tp=4`)
8B (nano)	any (`bf16` / `fp8` / `nvfp4`)	≥ 79 GiB	n/a (always fits in 80 GB-class GPUs)	n/a
32B (super)	`fp8`	≥ 121 GiB	≥ 79 GiB	n/a
32B (super)	`bf16`	≥ 150 GiB	≥ 92 GiB	≥ 65 GiB
32B (super)	`nvfp4`	≥ 131 GiB	n/a (TP fallback not yet emitted)	n/a

Note

Thresholds are binary GiB (1 GiB = 1024³ bytes, matching NVIDIA spec sheets and the NVML probe the selector uses). On a nominally “80 GB” H100 the NVML-reported total is ~79.6 GiB, which clears the 79 GiB floor.
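
The threshold table can be read as a lookup for the smallest tensor-parallel degree that fits. The sketch below encodes the table and applies that rule; it is an illustration of the documented behavior, not the selector's real code, and the names `VRAM_FLOORS_GIB` and `smallest_tp` are hypothetical.

```python
# Per-device VRAM floors in GiB, keyed by (size, precision) and then by nim_tp.
# Layouts marked n/a in the table are omitted.
VRAM_FLOORS_GIB = {
    ("nano", "bf16"): {1: 79},
    ("nano", "fp8"): {1: 79},
    ("nano", "nvfp4"): {1: 79},
    ("super", "fp8"): {1: 121, 2: 79},
    ("super", "bf16"): {1: 150, 2: 92, 4: 65},
    ("super", "nvfp4"): {1: 131},
}


def smallest_tp(size, precision, per_device_gib, n_gpus):
    """Return the smallest nim_tp that fits, or None if no layout fits."""
    floors = VRAM_FLOORS_GIB[(size, precision)]
    for tp in sorted(floors):
        # A layout is viable if enough GPUs exist and each one clears the floor.
        if tp <= n_gpus and per_device_gib >= floors[tp]:
            return tp
    return None


# An "80 GB" H100 reports roughly 79.6 GiB through NVML (see the note above).
print(smallest_tp("super", "fp8", 79.6, n_gpus=2))   # 2
print(smallest_tp("super", "bf16", 79.6, n_gpus=2))  # None: bf16 needs 4 such GPUs
```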

Validated SKUs#

These GPUs are explicitly tested and benchmarked at launch. Other GPUs that clear the VRAM and compute-capability thresholds above are also selectable, but per-step performance is not validated on them.

GPU	Memory (GB)	Precisions	Number of GPUs
NVIDIA-B200	192	`bf16`, `fp8`, `nvfp4`	1, 2, 4, 8
NVIDIA-H200	141	`bf16`, `fp8`	1, 2, 4, 8
NVIDIA-H100-80GB-HBM3	80	`bf16`, `fp8`	1 (nano only), 2, 4, 8
NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition	96	`bf16`, `fp8`, `nvfp4`	1 (nano only), 2, 4, 8

Warning

nvfp4 requires native Blackwell FP4 tensor cores (CC ≥ 10.0); the selector rejects nvfp4 at boot on Hopper SKUs (H100, H200) with a clear error.

Parallelism and profile selection#

At startup the NIM reads the GPUs visible to the container — count and per-device VRAM from NVML — and picks the manifest row whose n_gpus and per-precision VRAM floor fit the host. The user only sets NIM_MODEL_SIZE / NIM_PRECISION / NIM_PERF_PROFILE; the parallelism axes are derived automatically:

Throughput profiles drive aggregate throughput by replicating the model: nim_dp = n_gpus.
Latency profiles lower per-request latency by sharding a single request. With n_gpus ≥ 2 they pair CFG-parallel (nim_gp = 2, halves per-step latency for any request with guidance_scale > 1) with Ulysses-parallel over the rest (nim_up = n_gpus / 2).
Super on tight VRAM uses the smallest nim_tp (2 or 4) that fits per-device VRAM — see the threshold table above. When the host has more GPUs than nim_tp requires, throughput layers data-parallel on top (nim_dp = n_gpus / nim_tp); latency keeps nim_gp = 2 and grows nim_tp further (nim_tp = n_gpus / 2).
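
The derivation of the parallelism axes can be sketched as below. This is one reading of the three rules above, not the NIM implementation: `derive_axes` is a hypothetical helper, `min_tp` stands for the smallest `nim_tp` that fits per-device VRAM, the sketch assumes GPU counts of 1, 2, 4, or 8, and the case where a latency profile has exactly `min_tp` GPUs is an assumption because the text does not describe it.

```python
def derive_axes(profile, n_gpus, min_tp=1):
    """Derive nim_dp / nim_gp / nim_up / nim_tp from the profile and GPU count."""
    if n_gpus not in (1, 2, 4, 8) or min_tp not in (1, 2, 4) or n_gpus < min_tp:
        raise ValueError("unsupported GPU count or TP degree for this sketch")
    axes = {"nim_dp": 1, "nim_gp": 1, "nim_up": 1, "nim_tp": 1}
    if profile == "throughput":
        # Replicate the model (or each TP group) across the GPUs.
        axes["nim_tp"] = min_tp
        axes["nim_dp"] = n_gpus // min_tp
    elif profile == "latency":
        if min_tp == 1:
            if n_gpus >= 2:
                axes["nim_gp"] = 2                # CFG-parallel
                axes["nim_up"] = n_gpus // 2      # Ulysses-parallel over the rest
        elif n_gpus > min_tp:
            axes["nim_gp"] = 2
            axes["nim_tp"] = n_gpus // 2          # TP grows with the GPU count
        else:
            axes["nim_tp"] = min_tp               # assumption: no GPUs left for CFG-parallel
    else:
        raise ValueError(f"unknown profile: {profile!r}")
    return axes


print(derive_axes("latency", 8))
print(derive_axes("throughput", 8, min_tp=2))
```

In every branch the product of the four axes equals `n_gpus`, which is a quick sanity check on the layout.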

To pin a specific layout, use NIM_TAGS_SELECTOR or NIM_MODEL_PROFILE. See Configuring a NIM.

Supported Codecs#

Output video#

Output videos returned in the b64_video output field always use the .mp4 container and the VP9 codec.

Input video#

For Cosmos-Predict1-7B-Video2World, any video container supported by ffmpeg native demuxers is supported.

Depending on GPU type, supported codecs will be a subset of the following:

VP9
VP8
VC1
MPEG-1
MPEG-2
H.264
H.265 (HEVC)
AV1

Refer to the Video Decode GPU Support Matrix (NVDEC) for details concerning your platform.
