Troubleshooting#
Installation and Verification#
If you encounter issues with installing and configuring prerequisites, follow these steps:
Ensure the NVIDIA Container Toolkit is properly installed and configured.
Verify that your GPU drivers are compatible with the CUDA version.
Check that your user has the necessary permissions to run Docker commands.
Consult the Configuring a NIM section for additional configuration options.
Running NIMs#
If you encounter issues running NIM for Cosmos WFM, follow these steps:
Check that your hardware meets the Prerequisites.
Verify the NIM container is running properly with docker ps.
Ensure your request parameters are within supported ranges.
Check server logs for detailed error messages.
Metrics Collection#
If you encounter issues with metrics collection or visualization, follow these steps:
Ensure the NIM container is running and accessible at the expected address.
Verify that Prometheus can reach the NIM metrics endpoint.
Check Prometheus logs for any scraping errors.
Confirm that the metrics are available by directly accessing the metrics endpoint.
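The last check above can be scripted. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format; the URL in METRICS_URL and the helper names are illustrative assumptions, not part of the NIM itself.

```python
import urllib.request

# Assumption: illustrative address only; point this at your NIM's metrics endpoint.
METRICS_URL = "http://localhost:8000/v1/metrics"


def fetch_metrics_text(url=METRICS_URL, timeout=10):
    """Fetch the raw exposition text from the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="v"} 1.0  or  name 1.0
        head = line.split("{", 1)[0]
        names.add(head.split()[0])
    return names
```

Calling metric_names(fetch_metrics_text()) from the Prometheus host should return a non-empty set. An exception or an empty set suggests the problem is reachability or the endpoint itself rather than the Prometheus scrape configuration.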
Generic operational gotchas#
GPU exclusive-process mode. Multi-process NIM workers will fail to attach if the GPU compute mode is set to EXCLUSIVE_PROCESS. Switch to DEFAULT mode with sudo nvidia-smi -c DEFAULT (per device with -i <gpu_id>) before launching the container.
Air-gapped hosts (broken symlinks). When pre-populating $LOCAL_NIM_CACHE from another host, copy with cp -RL (or rsync -aL) to materialize symlinks rather than carrying them across. Stale or dangling symlinks inside the cache cause cryptic file-not-found errors during model load.
Long requests timing out at the client. Long-form video generation can take several minutes per request. Set a generous client-side timeout (for example, curl --max-time 1800 or requests.post(..., timeout=1800)) so the client does not abort while the server is still rendering.
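To check a copied cache for the dangling symlinks described above before starting the container, a short scan along these lines may help. This is a sketch only; the function name is hypothetical and it assumes the cache directory is readable by the current user.

```python
import os


def find_dangling_symlinks(cache_root):
    """Return sorted paths under cache_root that are symlinks with missing targets."""
    dangling = []
    # os.walk does not follow symlinks by default, so links are reported, not traversed.
    for dirpath, dirnames, filenames in os.walk(cache_root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

Running find_dangling_symlinks(os.environ["LOCAL_NIM_CACHE"]) on the target host should return an empty list. Any path it reports points to a file that was not carried across and needs to be re-copied.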
Troubleshooting Cosmos3-Generator#
Common issues with Cosmos3-Generator and their fixes, grouped by failure phase.
Profile selection#
Selector raises ``compute_capability>=10.0 (from precision=nvfp4)`` at boot. The host GPU is pre-Blackwell (e.g. H100, H200) and NIM_PRECISION=nvfp4 was requested. Pick a different precision: -e NIM_PRECISION=fp8 or -e NIM_PRECISION=bf16.
Selector raises ``compute_capability>=9.0`` at boot. The host GPU is below the Hopper architecture floor. Use a Hopper or newer SKU (see Cosmos3-Generator).
Selector raises ``No SKU-matched profiles for local hardware``. The host SKU is not in the manifest’s tested-SKU list and per-device VRAM is below the minimum floor for the chosen precision/model. Use a SKU that meets the supportability gates listed in the support matrix, or pin a smaller n_gpus value via -e NIM_TAGS_SELECTOR='n_gpus=2' if your host has multiple GPUs that fit a smaller layout.
``Conflicting selectors: NIM_PRECISION=fp8 vs NIM_TAGS_SELECTOR=precision=bf16``. Both shorthands are set and they disagree. Set only one of NIM_PRECISION / NIM_TAGS_SELECTOR.precision (same for NIM_PERF_PROFILE / NIM_TAGS_SELECTOR.profile and NIM_MODEL_SIZE / NIM_TAGS_SELECTOR.model_size).
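The conflicting-selectors case can be caught before the container starts. The sketch below is a hypothetical preflight helper, not the NIM's own selector logic. It assumes NIM_TAGS_SELECTOR uses the comma-separated key=value form shown in the examples on this page, and it only compares the three shorthand pairs named above.

```python
# Shorthand variable -> key inside NIM_TAGS_SELECTOR, as paired in the text above.
SHORTHAND_TO_TAG = {
    "NIM_PRECISION": "precision",
    "NIM_PERF_PROFILE": "profile",
    "NIM_MODEL_SIZE": "model_size",
}


def parse_tags_selector(value):
    """Parse 'k1=v1,k2=v2' into a dict (assumed format, see lead-in)."""
    tags = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"malformed selector entry: {item!r}")
        tags[key.strip()] = val.strip()
    return tags


def find_selector_conflicts(env):
    """Return (shorthand, shorthand_value, tag_value) tuples that disagree."""
    tags = parse_tags_selector(env.get("NIM_TAGS_SELECTOR", ""))
    conflicts = []
    for shorthand, tag in SHORTHAND_TO_TAG.items():
        short_val = env.get(shorthand)
        if short_val is not None and tag in tags and tags[tag] != short_val:
            conflicts.append((shorthand, short_val, tags[tag]))
    return conflicts
```

Pass it the same mapping you intend to hand to docker run via -e (or os.environ). An empty list means none of the three shorthand pairs disagree.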
Boot and download#
NGC download failure on first boot. NGC_API_KEY is missing or wrong, or the host has no network access to NGC. Set NGC_API_KEY to a valid token, or pre-populate $LOCAL_NIM_CACHE from a host that does have NGC access (using cp -RL to materialize symlinks; see Air-gapped hosts above).
Readiness probe never returns 200. First-run engine compilation (especially for the 32B (super) size or under Bring your own checkpoint for Cosmos3-Generator) can take several minutes on a cold cache. Wait for GET /v1/health/ready to return 200 before sending inference traffic; on Kubernetes, increase initialDelaySeconds on the readiness probe accordingly.
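Outside Kubernetes, the same wait can be done with a small polling loop. This is a sketch under stated assumptions: the host and port in READY_URL, the default deadline, and the polling interval are illustrative choices, and only the /v1/health/ready path comes from the text above.

```python
import time
import urllib.error
import urllib.request

# Assumption: host and port are illustrative; adjust to your deployment.
READY_URL = "http://localhost:8000/v1/health/ready"


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, socket timeout, ...


def wait_until_ready(url=READY_URL, deadline_s=1800, interval_s=10,
                     probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll until probe(url) returns 200; return True on success, False on timeout."""
    start = clock()
    while True:
        if probe(url) == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

The probe, sleep, and clock parameters exist so the loop can be exercised without a running server. In normal use, wait_until_ready() with the defaults is enough, followed by inference traffic only if it returns True.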
BYOC validation#
Boot fails with a BYOC cross-check error. When NIM_FT_CHECKPOINT is set, the NIM cross-checks the auto-discovered model size and precision against the active profile and raises a clear expected vs. received error if they disagree. Update either the BYOC checkpoint or the matching NIM_MODEL_SIZE / NIM_PRECISION shorthand so they align, then restart.
``config.json`` not found inside the BYOC directory. Cosmos3-Generator expects the layout described in Bring your own checkpoint for Cosmos3-Generator (transformer/config.json, vae/, scheduler/, model_index.json). Verify the bind-mount points at the directory that contains those four entries, not at one level above or below.
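A quick way to confirm that a host directory is the right bind-mount source is to check it for the four entries named above. The helper below is a hypothetical sketch that tests only for their presence; it does not validate file contents and is not the NIM's own validation.

```python
import os

# The four entries named in the layout above; True marks a directory, False a file.
EXPECTED_ENTRIES = {
    os.path.join("transformer", "config.json"): False,
    "vae": True,
    "scheduler": True,
    "model_index.json": False,
}


def missing_byoc_entries(checkpoint_dir):
    """Return the expected entries that are absent from checkpoint_dir."""
    missing = []
    for rel_path, is_dir in EXPECTED_ENTRIES.items():
        path = os.path.join(checkpoint_dir, rel_path)
        present = os.path.isdir(path) if is_dir else os.path.isfile(path)
        if not present:
            missing.append(rel_path)
    return missing
```

If every entry is reported missing, the mount most likely points one level above or below the checkpoint root. A single missing entry suggests an incomplete checkpoint instead.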
Runtime#
``Out of memory`` at engine load on a multi-GPU latency profile. The chosen profile’s per-device VRAM requirement still exceeds what the host has. For the 32B (super) size on 80 GB-class GPUs, pin a tensor-parallel fallback explicitly with -e NIM_TAGS_SELECTOR='model_size=super,nim_tp=2' (or nim_tp=4); otherwise use a more memory-efficient precision (e.g. fp8 instead of bf16) or fewer GPUs.
Output playback#
The output ``.mp4`` does not play in a browser or stock player. VP9-in-MP4 is not universally supported. Use mpv, ffplay, or IINA; to produce a more portable file, re-encode externally with ffmpeg -i video.mp4 -c:v libx264 -pix_fmt yuv420p out.mp4.