Supported Models#
Overview#
Cosmos-Predict1-7B models are pre-trained, generative models designed for Physical AI. They are available in two variants:
Cosmos-Predict1-7B-Text2World: A text-to-video model that generates video content from textual descriptions.
Cosmos-Predict1-7B-Video2World: A model that generates video from images or other videos, with optional text input.
Cosmos-Transfer2.5-2B is a video-to-video model that supports edge, depth, visual, and segmentation control modalities for style transfer and content transformation.
Cosmos3-Generator is a generative world foundation model (WFM) supporting text-to-video and image-to-video modalities.
Learn more about Cosmos through these resources:
NIM for Cosmos WFM models are available for commercial use under the NVIDIA Open Model license agreement. For more details, refer to the EULA.
Note
NIM for Cosmos WFM models leverage a sophisticated optimization stack that includes PyTorch, NVIDIA NeMo, NVIDIA TensorRT, and NVIDIA TensorRT-LLM for hardware-accelerated inference and deployment.
Configurations#
Cosmos-Predict1-7B#
Predict 1 requires NVIDIA GPUs with Ampere architecture or later. The following configurations are optimized using NVIDIA TensorRT (TRT) and NVIDIA TensorRT-LLM (TRT-LLM) for specific GPU models:
| GPU | GPU Memory (GB) | Precision | Number of GPUs |
|---|---|---|---|
| H200 | 141 | FP8 | 1, 2, 4, 8 |
| H100 SXM | 80 | FP8 | 1, 2, 4, 8 |
| H100 NVL | 94 | FP8 | 1, 2, 4, 8 |
| H100 PCIe | 80 | FP8 | 1, 2, 4, 8 |
Warning
Deploying on configurations not listed above may result in suboptimal performance.
Fallback Configurations#
Other GPU models in the Ampere and Ada generations are supported via the fallback method (latency profile) only, with these requirements:
Combined VRAM across all GPUs must exceed 100 GB
Minimum single-GPU VRAM must be at least 48 GB
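The two VRAM requirements can be checked together before attempting a fallback deployment. The following is a minimal sketch, not part of the NIM itself: the function name `meets_fallback_requirements` is hypothetical, and the caller is assumed to supply per-GPU VRAM in GB. It does not check the GPU architecture generation.

```python
def meets_fallback_requirements(vram_gb_per_gpu):
    """Return True if a set of GPUs satisfies both fallback VRAM requirements.

    vram_gb_per_gpu: list of per-GPU VRAM sizes in GB (supplied by the caller).
    """
    if not vram_gb_per_gpu:
        return False
    combined_ok = sum(vram_gb_per_gpu) > 100  # combined VRAM must exceed 100 GB
    minimum_ok = min(vram_gb_per_gpu) >= 48   # each GPU needs at least 48 GB
    return combined_ok and minimum_ok


# Two 48 GB GPUs total only 96 GB, so a third is needed.
print(meets_fallback_requirements([48, 48]))      # False
print(meets_fallback_requirements([48, 48, 48]))  # True
```

Note that both conditions must hold: one large GPU paired with a small one fails the per-GPU minimum even when the combined total is high enough.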
Deployment Profiles#
Profile Selection#
NIM automatically selects the optimal profile based on the number of GPUs exposed to the container (via the --gpus flag). You can override this selection using the NIM_MODEL_PROFILE environment variable.
Two profile types are available:
Latency Profiles#
Latency profiles use a parallelized diffusion component with context-parallel (CP) distribution across exposed GPUs, reducing inference time nearly linearly with GPU count.
Four configurations are available:
CP=8: Used when 8 GPUs are available
CP=4: Used when 4-7 GPUs are available
CP=2: Used when 2-3 GPUs are available
CP=1: Used when 1 GPU is available
Performance example: With an H100 SXM GPU, generating 121 frames takes approximately 5 minutes using CP1, but approximately 1 minute with CP8.
Tip
For optimal resource utilization, match the number of GPUs exactly to profile requirements (8, 4, 2, or 1). For example, with 7 GPUs, only 4 will be utilized (CP4 profile), leaving 3 GPUs idle.
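The mapping from exposed GPU count to CP degree, and the number of GPUs left idle, can be sketched as follows. This is an illustration of the rule described above, not the NIM's actual selector; the function names are hypothetical, and treating more than 8 GPUs as CP=8 is an assumption the text does not cover.

```python
def select_cp_degree(n_gpus):
    """Pick the largest CP degree in (8, 4, 2, 1) that fits the exposed GPUs."""
    if n_gpus < 1:
        raise ValueError("at least one GPU is required")
    for cp in (8, 4, 2, 1):
        if n_gpus >= cp:
            return cp  # assumption: counts above 8 are capped at CP=8


def idle_gpus(n_gpus):
    """Number of exposed GPUs that the selected latency profile leaves unused."""
    return n_gpus - select_cp_degree(n_gpus)


for n in (1, 3, 7, 8):
    print(n, select_cp_degree(n), idle_gpus(n))
```

With 7 GPUs this selects CP=4 and reports 3 idle GPUs, which matches the example in the tip.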
Throughput Profiles#
Throughput profiles use quantized and TRT-accelerated diffusion components replicated across GPUs, enabling concurrent requests and increasing overall system throughput nearly linearly with GPU count.
Note
Throughput profiles are only available on H200, H100 SXM, H100 PCIe, and H100 NVL GPUs.
Tip
TRT-accelerated profiles run diffusion at lower precision. Throughput can reach up to twice that of the latency profile (with 8x H100 SXM), but the output may show visual artifacts that are not observed with the latency profile.
By default, NIM prioritizes latency profiles. To change this behavior, set NIM_MODEL_PROFILE=throughput or NIM_MODEL_PROFILE=latency, or specify a particular profile ID.
For detailed configuration options, refer to the Configuring a NIM page.
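As an illustration of how the `--gpus` flag and the `NIM_MODEL_PROFILE` override fit together, the sketch below assembles a `docker run` argument list. The image name is a placeholder, the helper `build_docker_args` is hypothetical, and a real deployment needs further options (credentials, ports, cache mounts) that are intentionally left out here.

```python
import shlex


def build_docker_args(image, gpus="all", profile=None):
    """Assemble a docker run argument list for a NIM container.

    image:   container image reference (placeholder in this sketch)
    gpus:    value passed to docker's --gpus flag
    profile: optional NIM_MODEL_PROFILE override ("throughput", "latency",
             or a specific profile ID); None keeps automatic selection
    """
    args = ["docker", "run", "--rm", "--gpus", gpus]
    if profile is not None:
        args += ["-e", f"NIM_MODEL_PROFILE={profile}"]
    args.append(image)
    return args


# Hypothetical image name; substitute the real one for your deployment.
cmd = build_docker_args("example.invalid/cosmos-nim:latest", profile="throughput")
print(shlex.join(cmd))
```

Leaving `profile` unset keeps the default behavior, in which NIM prioritizes latency profiles.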
Cosmos-Transfer2.5-2B#
Cosmos-Transfer2.5-2B is a lighter and more efficient model compared to Transfer1-7B. It supports two profile types:
Latency profiles: Use context parallelism (CP) to distribute work across GPUs, processing requests sequentially with reduced inference time.
Throughput profiles: Replicate the model across GPUs, enabling parallel request processing for higher overall throughput.
Supported GPUs#
Transfer2.5 requires NVIDIA GPUs with Hopper architecture or later and at least 80 GB of VRAM.
The following table shows supported GPU configurations for the throughput profile:
| GPU | Memory (GB) | Precision | Number of GPUs |
|---|---|---|---|
| B300 | 288 | BF16, FP8 | 1, 2, 4 |
| GB200 | 192 | BF16, FP8 | 1, 2, 4 |
| B200 | 192 | BF16, FP8 | 1, 2, 4, 8 |
| H200 | 141 | BF16, FP8 | 1, 2, 4, 8 |
| H200 NVL | 141 | BF16, FP8 | 1, 2, 4, 8 |
| H100 80GB | 80 | BF16, FP8 | 1, 2, 4, 8 |
| H100 NVL | 94 | BF16, FP8 | 1, 2, 4, 8 |
| H100 PCIe | 80 | BF16, FP8 | 1 |
| H20 | 96 | BF16, FP8 | 1, 2, 4 |
The following table shows supported GPU configurations for the latency profile:
| GPU | Memory (GB) | Precision | Number of GPUs |
|---|---|---|---|
| B300 | 288 | BF16, FP8 | 1, 2, 4 |
| GB200 | 192 | BF16, FP8 | 1, 2 |
| B200 | 192 | BF16, FP8 | 1, 2 |
| H200 | 141 | BF16, FP8 | 1, 2, 4 |
| H200 NVL | 141 | BF16, FP8 | 1, 2 |
| H100 80GB | 80 | BF16, FP8 | 1, 2, 4 |
| H100 NVL | 94 | BF16, FP8 | 1, 2, 4 |
| H100 PCIe | 80 | BF16, FP8 | 1 |
| H20 | 96 | BF16, FP8 | 1, 2, 4 |
Note
Latency profiles also exist for 4 or 8 GPUs, but their performance may be the same as or lower than that of the open-source version.
Note
Transfer2.5-2B requires at least one control modality (edge, depth, vis, or seg) to be provided in the request.
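A client can enforce this requirement before sending a request. The sketch below assumes, purely for illustration, that the request is a dictionary with one optional top-level key per control modality; the actual request schema is not described here, and the helper names are hypothetical.

```python
CONTROL_MODALITIES = ("edge", "depth", "vis", "seg")


def provided_controls(request):
    """List the control modalities present in the request.

    Assumption: each modality appears as a top-level key of the request dict.
    """
    return [name for name in CONTROL_MODALITIES if request.get(name) is not None]


def validate_request(request):
    """Raise ValueError if no control modality is supplied."""
    controls = provided_controls(request)
    if not controls:
        raise ValueError(
            "at least one control modality is required: "
            + ", ".join(CONTROL_MODALITIES)
        )
    return controls


print(validate_request({"prompt": "a rainy street", "depth": {}}))
```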
Note
Transfer2.5-2B supports fallback configurations. Some GPU configurations not listed above are supported, but performance may be degraded.
Cosmos3-Generator#
Cosmos3-Generator packages two model sizes, 8B (nano, default) and 32B (super), into a single container. Pick the size with -e NIM_MODEL_SIZE=nano|super and the precision with -e NIM_PRECISION=bf16|fp8|nvfp4 (default: fp8). Cosmos3-Generator runs on any NVIDIA GPU with Hopper architecture or later (CC ≥ 9.0); nvfp4 additionally requires Blackwell (CC ≥ 10.0).
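The size, precision, and compute-capability rules above can be checked ahead of launch. The following is a sketch under stated assumptions: `check_cosmos3_settings` is a hypothetical helper, not the NIM's own selector, and the compute capability is passed in as a `(major, minor)` tuple obtained by the caller.

```python
VALID_SIZES = {"nano", "super"}
VALID_PRECISIONS = {"bf16", "fp8", "nvfp4"}


def check_cosmos3_settings(compute_capability, size="nano", precision="fp8"):
    """Return a list of problems; an empty list means the settings pass.

    compute_capability: (major, minor) tuple, for example (9, 0) for Hopper.
    Defaults mirror the documented ones: nano size and fp8 precision.
    """
    problems = []
    if size not in VALID_SIZES:
        problems.append(f"unknown NIM_MODEL_SIZE: {size!r}")
    if precision not in VALID_PRECISIONS:
        problems.append(f"unknown NIM_PRECISION: {precision!r}")
    if compute_capability < (9, 0):
        problems.append("Hopper or later (CC >= 9.0) is required")
    elif precision == "nvfp4" and compute_capability < (10, 0):
        problems.append("nvfp4 requires Blackwell (CC >= 10.0)")
    return problems


print(check_cosmos3_settings((9, 0), size="super", precision="nvfp4"))
```

This covers only the compute-capability and option checks; the per-device VRAM thresholds are a separate constraint.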
Per-device VRAM requirements#
The minimum per-device VRAM you need depends on the size and precision you pick. For the 32B (super) size, the NIM automatically falls back to a tensor-parallel (TP) profile on hosts where the single-device layout does not fit. There is no knob to set: the selector picks the smallest viable layout for your hardware.
| Size | Precision | Single-device ( | 2-GPU TP fallback ( | 4-GPU TP fallback ( |
|---|---|---|---|---|
| 8B (nano) | any ( | ≥ 79 GiB | n/a (always fits in 80 GB-class GPUs) | n/a |
| 32B (super) | | ≥ 121 GiB | ≥ 79 GiB | n/a |
| 32B (super) | | ≥ 150 GiB | ≥ 92 GiB | ≥ 65 GiB |
| 32B (super) | | ≥ 131 GiB | n/a (TP fallback not yet emitted) | n/a |
Note
Thresholds are binary GiB (1 GiB = 1024³ bytes, matching NVIDIA spec sheets and the NVML probe the selector uses). On a nominally “80 GB” H100 the NVML-reported total is ~79.6 GiB, which clears the 79 GiB floor.
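To make the unit convention and the "smallest viable layout" rule concrete, the sketch below converts a byte count to GiB and walks a set of per-device floors from the narrowest layout to the widest. It is an illustration only: the function names are hypothetical, the real selector's logic is not shown in this document, and the example floors are copied from the 32B (super) row that lists 150, 92, and 65 GiB.

```python
GIB = 1024 ** 3  # binary gibibyte, the unit used for the thresholds


def to_gib(total_bytes):
    """Convert a byte count (such as the NVML-reported total) to GiB."""
    return total_bytes / GIB


def smallest_layout(per_device_gib, n_gpus, floors):
    """Return the narrowest layout (GPUs per model copy) that fits, or None.

    floors maps layout width (1, 2, 4) to the minimum per-device VRAM in GiB.
    """
    for width in sorted(floors):
        if width <= n_gpus and per_device_gib >= floors[width]:
            return width
    return None


# Floors from the 32B (super) row with all three layouts listed.
SUPER_FLOORS = {1: 150, 2: 92, 4: 65}

h100_gib = to_gib(int(79.6 * GIB))  # roughly what NVML reports on an "80 GB" H100
print(smallest_layout(h100_gib, 4, SUPER_FLOORS))  # 4
print(smallest_layout(h100_gib, 2, SUPER_FLOORS))  # None
```

The distinction between units matters: 80 × 10⁹ bytes is only about 74.5 GiB, which would not clear a 79 GiB floor, whereas the NVML-reported total on an H100 does.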
Validated SKUs#
These GPUs are explicitly tested and benchmarked at launch. Other GPUs that clear the VRAM and compute-capability thresholds above are also selectable, but per-step performance is not validated on them.
| GPU | Memory (GB) | Precisions | Number of GPUs |
|---|---|---|---|
| NVIDIA-B200 | 192 | | 1, 2, 4, 8 |
| NVIDIA-H200 | 141 | | 1, 2, 4, 8 |
| NVIDIA-H100-80GB-HBM3 | 80 | | 1 (nano only), 2, 4, 8 |
| NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition | 96 | | 1 (nano only), 2, 4, 8 |
Warning
nvfp4 requires native Blackwell FP4 tensor cores (CC ≥ 10.0); the selector rejects
nvfp4 at boot on Hopper SKUs (H100, H200) with a clear error.
Parallelism and profile selection#
At startup the NIM reads the GPUs visible to the container — count and per-device VRAM
from NVML — and picks the manifest row whose n_gpus and per-precision VRAM floor fit
the host. The user only sets NIM_MODEL_SIZE / NIM_PRECISION / NIM_PERF_PROFILE;
the parallelism axes are derived automatically:
Throughput profiles drive aggregate throughput by replicating the model: nim_dp = n_gpus.
Latency profiles lower per-request latency by sharding a single request. With n_gpus ≥ 2 they pair CFG-parallel (nim_gp = 2, halves per-step latency for any request with guidance_scale > 1) with Ulysses-parallel over the rest (nim_up = n_gpus / 2).
Super on tight VRAM uses the smallest nim_tp (2 or 4) that fits per-device VRAM (see the threshold table above). When the host has more GPUs than nim_tp requires, throughput layers data-parallel on top (nim_dp = n_gpus / nim_tp); latency keeps nim_gp = 2 and grows nim_tp further (nim_tp = n_gpus / 2).
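The derivation rules above can be written out as a small function. This is a sketch of the stated formulas, not the NIM's implementation. The name `derive_axes` and the `min_tp` parameter (the smallest tensor-parallel width that fits per-device VRAM, 1 when no TP is needed) are hypothetical. Two cases the text does not spell out are handled by assumption: a single-GPU latency profile uses no parallelism, and a host with exactly `min_tp` GPUs uses tensor parallelism only.

```python
def derive_axes(n_gpus, profile, min_tp=1):
    """Derive nim_dp, nim_gp, nim_up, and nim_tp for a host.

    profile: "throughput" or "latency"
    min_tp:  smallest TP width that fits per-device VRAM (1, 2, or 4)
    """
    if profile not in ("throughput", "latency"):
        raise ValueError(f"unknown profile: {profile!r}")
    if n_gpus < 1 or min_tp < 1 or n_gpus % min_tp != 0:
        raise ValueError("n_gpus must be a positive multiple of min_tp")
    if profile == "latency" and n_gpus >= 2 and n_gpus % 2 != 0:
        # Assumption: the halving formulas only apply to even GPU counts.
        raise ValueError("latency profiles in this sketch need an even GPU count")

    axes = {"nim_dp": 1, "nim_gp": 1, "nim_up": 1, "nim_tp": 1}
    if min_tp > 1:
        if n_gpus == min_tp:
            axes["nim_tp"] = min_tp           # assumption: TP only, no spare GPUs
        elif profile == "throughput":
            axes["nim_tp"] = min_tp
            axes["nim_dp"] = n_gpus // min_tp  # data-parallel layered on top
        else:
            axes["nim_gp"] = 2
            axes["nim_tp"] = n_gpus // 2       # TP grows to fill the host
    elif profile == "throughput":
        axes["nim_dp"] = n_gpus                # one model replica per GPU
    elif n_gpus >= 2:
        axes["nim_gp"] = 2                     # CFG-parallel pair
        axes["nim_up"] = n_gpus // 2           # Ulysses-parallel over the rest
    return axes


print(derive_axes(8, "latency"))
print(derive_axes(8, "throughput", min_tp=2))
```

In every branch the product of the four axes equals `n_gpus`, which is a quick sanity check on any layout.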
To pin a specific layout, use NIM_TAGS_SELECTOR or NIM_MODEL_PROFILE. See
Configuring a NIM.
Supported Codecs#
Output video#
Output videos encoded in the b64_video output field always use the .mp4 container and VP9 codec.
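On the client side, the `b64_video` field can be decoded and written to disk as follows. This is a minimal sketch that assumes the response has already been parsed into a dictionary with `b64_video` as a top-level key; the helper name `save_b64_video` is hypothetical.

```python
import base64
from pathlib import Path


def save_b64_video(response, out_path):
    """Decode the b64_video field and write it to an .mp4 file.

    Assumption: `response` is a dict with a top-level "b64_video" string.
    Returns the path that was written.
    """
    data = base64.b64decode(response["b64_video"])
    path = Path(out_path)
    if path.suffix.lower() != ".mp4":
        path = path.with_suffix(".mp4")  # output always uses the .mp4 container
    path.write_bytes(data)
    return path
```

A call such as `save_b64_video(response, "result.mp4")` writes the decoded bytes unchanged; because the stream is VP9 inside MP4, the player used to view it needs VP9 support.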
Input video#
For Cosmos-Predict1-7B-Video2World, any video container supported by ffmpeg native demuxers is supported.
Depending on GPU type, supported codecs will be a subset of the following:
VP9
VP8
VC1
MPEG-1
MPEG-2
H.264
H.265 (HEVC)
AV1
Refer to the Video Decode GPU Support Matrix (NVDEC) for details concerning your platform.
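Before submitting an input video, its codec can be compared against the list above. The sketch below is an assumption-laden convenience, not part of the NIM: it uses the codec names that ffprobe reports for the listed codecs, and the helper names are hypothetical. Passing this check only means the codec is on the candidate list; whether a particular GPU can decode it still depends on the support matrix.

```python
import subprocess

# ffprobe codec names corresponding to the codecs listed above.
CANDIDATE_CODECS = {
    "vp9", "vp8", "vc1", "mpeg1video", "mpeg2video", "h264", "hevc", "av1",
}


def is_candidate_codec(codec_name):
    """Return True if the codec is in the list of potentially supported codecs."""
    return codec_name.strip().lower() in CANDIDATE_CODECS


def probe_video_codec(path):
    """Ask ffprobe for the codec name of the first video stream in a file.

    Requires ffprobe on PATH; raises CalledProcessError if probing fails.
    """
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


print(is_candidate_codec("h264"))
```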