Supported Models#
These models are optimized using PyTorch, NeMo, NVIDIA TensorRT (TRT), and NVIDIA TensorRT-LLM (TRT-LLM).
Cosmos-Predict1-7B-Text2World#
Overview#
Cosmos-Predict1-7B-Text2World is a pre-trained, generative text-to-video model. Cosmos models are ready for commercial use under the NVIDIA Open Model License Agreement.
The 7B model is recommended for users who want to prioritize response speed and have a moderate compute budget.
We recommend at least 100GB of disk space for the container and model.
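Before pulling the container, you can confirm that enough space is free at Docker's storage location; the path below assumes the default Docker data root.

```bash
# Check free space where Docker stores images and containers
# (assumes the default data root of /var/lib/docker; adjust if yours differs).
df -h /var/lib/docker
```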
Throughput-optimized Configurations#
NVIDIA NIM for Cosmos requires NVIDIA GPUs with the Ampere architecture or later. Some profiles are further optimized using TRT and TRT-LLM for specific GPU models.
| GPU | GPU Memory (GB) | Precision | Number of GPUs |
|---|---|---|---|
| H200 | 141 | BF16 | [1, 2, 4, 8] |
| H100 SXM | 80 | BF16 | [1, 2, 4, 8] |
| H100 NVL | 94 | BF16 | [1, 2, 4, 8] |
| H100 PCIe | 80 | BF16 | [1, 2, 4, 8] |
Attempting to deploy on configurations other than those listed above might result in suboptimal performance.
Fallback Configurations#
Other GPU models of the Ampere and Ada generations are supported only via the fallback method (latency profile), provided that the total VRAM across all available GPUs is larger than 100GB and that each individual GPU has at least 48GB of VRAM.
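A quick way to verify these requirements is to list each GPU's memory with `nvidia-smi` and check the per-GPU and combined totals:

```bash
# List each GPU's name and total memory; every GPU should report
# at least 48GB, and the sum across GPUs should exceed 100GB.
nvidia-smi --query-gpu=name,memory.total --format=csv
```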
Profile Selection#
Depending on how many GPUs are exposed to the container (i.e., with the `--gpus` flag), NIM will by default auto-select a profile that best fits that configuration.
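As an illustration, the following `docker run` sketch exposes exactly four GPUs, so NIM would auto-select the CP=4 latency configuration described below. The image name, tag, and port are placeholders rather than documented values, and the `NGC_API_KEY` pass-through assumes the key is already set in your shell.

```bash
# Expose exactly 4 GPUs (devices 0-3) to the container; NIM then
# auto-selects the profile that best fits this configuration.
# The image name and port are placeholders; substitute your actual values.
docker run --rm \
  --gpus '"device=0,1,2,3"' \
  -e NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:latest
```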
Two types of profiles are available:
Latency Profiles: These profiles use a parallelized diffusion component, with context parallelism (CP) across exposed GPUs, which lowers inference latency. The latency reduction scales close to linearly with the number of GPUs.
When using the latency profiles, four configurations are available: `CP=1`, `CP=2`, `CP=4`, and `CP=8`:

- `CP=8` configuration: Used when 8 GPUs are available.
- `CP=4` configuration: Used when 4-7 GPUs are available.
- `CP=2` configuration: Used when 2-3 GPUs are available.
- `CP=1` configuration: Used when 1 GPU is available.
NIM will select the configuration with the largest CP that fits the number of GPUs exposed to Docker.
When using the latency profile with an H100 SXM GPU, inference with default parameters can take around 5 minutes to generate 121 frames using the `CP=1` configuration, and less than 1 minute using the `CP=8` configuration.
For optimal resource utilization, we recommend matching the number of exposed GPUs exactly to a profile configuration (8, 4, 2, or 1). For example, if you expose 7 GPUs, the model will use the `CP=4` configuration, effectively utilizing only 4 GPUs and leaving 3 idle.
Throughput Profiles: These profiles use a quantized, TRT-accelerated diffusion component, replicated across exposed GPUs. They serve concurrent requests to increase overall system throughput, which scales close to linearly with the number of GPUs.
Throughput profiles are only available on H200, H100 SXM, H100 PCIe, and H100 NVL GPUs.
Note
TRT-accelerated profiles run diffusion at lower precision. Throughput can reach up to double that of the latency profile (with 8x H100 SXM), but these profiles may produce visual artifacts that are not observed with the latency profile.
Default profile selection prioritizes latency profiles. This behavior can be changed with the `NIM_MODEL_PROFILE` environment variable, which can be set to a specific profile ID, or to a general profile type by setting `NIM_MODEL_PROFILE=throughput` or `NIM_MODEL_PROFILE=latency`.
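For example, to force throughput profiles at launch (the image name is again a placeholder):

```bash
# Override default profile selection with the NIM_MODEL_PROFILE variable
# described above; the image name and port are placeholders.
docker run --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=throughput \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-predict1-7b-text2world:latest
```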
Refer to the Configuring a NIM page for details on how to filter and choose a specific profile.