Troubleshooting#

Installation and Verification#

If you encounter issues installing or configuring the prerequisites, follow these steps:

  1. Ensure the NVIDIA Container Toolkit is properly installed and configured.

  2. Verify that your GPU driver version is compatible with the CUDA version the NIM requires.

  3. Check that your user has the necessary permissions to run Docker commands.

  4. Consult the Configuring a NIM section for additional configuration options.
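
The first three checks above can be scripted. The following is a minimal sketch, not an official diagnostic. It assumes that docker and nvidia-smi are expected on PATH and that the Container Toolkit registered a runtime named nvidia with Docker. That last check is only a heuristic, because CDI-based setups may not register a named runtime. The function name check_prerequisites is hypothetical.

```python
import shutil
import subprocess


def check_prerequisites(which=shutil.which, run=subprocess.run):
    """Return a list of problems found; an empty list means the basic checks passed.

    `which` and `run` are injectable so the function can be exercised without Docker.
    """
    problems = [f"{tool} not found on PATH"
                for tool in ("docker", "nvidia-smi") if which(tool) is None]
    if problems:
        return problems

    # `docker info` fails when the daemon is unreachable or the user lacks
    # permission to talk to it (for example, is not in the `docker` group).
    info = run(["docker", "info", "--format", "{{json .Runtimes}}"],
               capture_output=True, text=True)
    if info.returncode != 0:
        problems.append("docker info failed: " + info.stderr.strip())
    elif "nvidia" not in info.stdout:
        # Heuristic (assumption): a toolkit-configured host lists an "nvidia" runtime.
        problems.append("no 'nvidia' runtime registered with Docker")

    # Ask the driver to report itself; a failure here usually means a driver problem.
    smi = run(["nvidia-smi", "--query-gpu=name,driver_version",
               "--format=csv,noheader"], capture_output=True, text=True)
    if smi.returncode != 0:
        problems.append("nvidia-smi failed: " + (smi.stderr or smi.stdout).strip())
    return problems
```

Calling check_prerequisites() with no arguments runs the real commands and returns the list of problems. Compare the reported driver version against the support matrix by hand, since the sketch does not encode any version requirement.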

Running NIMs#

If you encounter issues running NIM for Cosmos WFM, follow these steps:

  1. Check that your hardware meets the Prerequisites.

  2. Verify the NIM container is running properly with docker ps.

  3. Ensure your request parameters are within supported ranges.

  4. Check server logs for detailed error messages.
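
Step 2 can be automated when several containers share a host. The sketch below lists containers and flags those that are not up. As a deliberate variation on plain docker ps, it passes -a so that a NIM container that has already exited still appears in the output. The helper names are hypothetical.

```python
import subprocess

# Tab-separated Go template understood by `docker ps --format`.
PS_FORMAT = "{{.Names}}\t{{.Image}}\t{{.Status}}"


def parse_docker_ps(output):
    """Turn 'name<TAB>image<TAB>status' lines into a list of dicts."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append({"name": parts[0], "image": parts[1], "status": parts[2]})
    return rows


def is_up(row):
    """Docker reports running containers with a status that starts with 'Up'."""
    return row["status"].startswith("Up")


def list_containers():
    """Run `docker ps -a` and return the parsed rows (requires Docker access)."""
    result = subprocess.run(["docker", "ps", "-a", "--format", PS_FORMAT],
                            capture_output=True, text=True, check=True)
    return parse_docker_ps(result.stdout)
```

For a container that shows an Exited status, docker logs with the container name from the name field is the usual next step for step 4.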

Metrics Collection#

If you encounter issues with metrics collection or visualization, follow these steps:

  1. Ensure the NIM container is running and accessible at the expected address.

  2. Verify that Prometheus can reach the NIM metrics endpoint.

  3. Check Prometheus logs for any scraping errors.

  4. Confirm that the metrics are available by directly accessing the metrics endpoint.
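
Step 4 can be done from a script as well as from a browser. The sketch below fetches a metrics URL and lists the metric names it exposes, assuming the endpoint serves the standard Prometheus text exposition format. The URL is a parameter because the host, port, and path depend on your deployment; nothing here asserts what they are for this NIM.

```python
import urllib.request


def fetch_metrics(url, timeout=10):
    """Download the raw metrics text; raises on connection or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(text):
    """Return the sorted, de-duplicated metric names found in exposition text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)
```

If fetch_metrics succeeds from the Prometheus host but the target still shows as down, the scrape configuration (job target or path) is the more likely culprit than the NIM itself.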

Generic operational gotchas#

  • GPU exclusive-process mode. Multi-process NIM workers will fail to attach if the GPU compute mode is set to EXCLUSIVE_PROCESS. Switch to DEFAULT mode with sudo nvidia-smi -c DEFAULT (per device with -i <gpu_id>) before launching the container.

  • Air-gapped hosts (broken symlinks). When pre-populating $LOCAL_NIM_CACHE from another host, copy with cp -RL (or rsync -aL) to materialize symlinks rather than carrying them across. Stale or dangling symlinks inside the cache cause cryptic file-not-found errors during model load.

  • Long requests timing out at the client. Long-form video generation can take several minutes per request. Set a generous client-side timeout (for example, curl --max-time 1800 or requests.post(..., timeout=1800)) so the client does not abort while the server is still rendering.
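
For the air-gapped case above, dangling symlinks can be found before the container ever starts. This is a minimal sketch using only the standard library; the function name is hypothetical and the cache directory is whatever you pass in (for example, the value of $LOCAL_NIM_CACHE).

```python
import os


def find_dangling_symlinks(root):
    """Return sorted paths under `root` that are symlinks with a missing target."""
    dangling = []
    # os.walk does not follow symlinks by default, so each link is visited once.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

An empty result means every symlink in the tree resolves on this host. Any path returned indicates the cache was copied without -L and should be copied again.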

Troubleshooting Cosmos3-Generator#

Common issues with Cosmos3-Generator and their fixes, grouped by failure phase.

Profile selection#

  • Selector raises ``compute_capability>=10.0 (from precision=nvfp4)`` at boot — the host GPU is pre-Blackwell (e.g. H100, H200) and NIM_PRECISION=nvfp4 was requested. Pick a different precision: -e NIM_PRECISION=fp8 or -e NIM_PRECISION=bf16.

  • Selector raises ``compute_capability>=9.0`` at boot — the host GPU is below the Hopper architecture floor. Use a Hopper or newer SKU (see Cosmos3-Generator).

  • Selector raises ``No SKU-matched profiles for local hardware`` — the host SKU is not in the manifest’s tested-SKU list and per-device VRAM is below the minimum floor for the chosen precision/model. Use a SKU that meets the supportability gates listed in the support matrix, or pin a smaller n_gpus value via -e NIM_TAGS_SELECTOR='n_gpus=2' if your host has multiple GPUs that fit a smaller layout.

  • ``Conflicting selectors: NIM_PRECISION=fp8 vs NIM_TAGS_SELECTOR=precision=bf16``. A shorthand variable and the matching NIM_TAGS_SELECTOR entry are both set, and they disagree. Set only one of NIM_PRECISION / NIM_TAGS_SELECTOR.precision (the same applies to NIM_PERF_PROFILE / NIM_TAGS_SELECTOR.profile and NIM_MODEL_SIZE / NIM_TAGS_SELECTOR.model_size).
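
The selector rules above can be checked before launching the container. The sketch below is a pre-flight mirror of those rules, not the NIM's own selector logic. It assumes NIM_TAGS_SELECTOR is a comma-separated list of key=value pairs, as in the examples on this page, and it encodes only the two compute-capability floors quoted above (10.0 for nvfp4, 9.0 otherwise). All function names are hypothetical.

```python
# Shorthand environment variable -> matching NIM_TAGS_SELECTOR key (from the text above).
SHORTHANDS = {
    "NIM_PRECISION": "precision",
    "NIM_PERF_PROFILE": "profile",
    "NIM_MODEL_SIZE": "model_size",
}


def parse_tags_selector(value):
    """Parse 'k1=v1,k2=v2' into a dict (assumed syntax)."""
    tags = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        tags[key.strip()] = val.strip()
    return tags


def find_conflicts(env):
    """Return a description of each shorthand that disagrees with NIM_TAGS_SELECTOR."""
    tags = parse_tags_selector(env.get("NIM_TAGS_SELECTOR", ""))
    return [
        f"{var}={env[var]} vs NIM_TAGS_SELECTOR={key}={tags[key]}"
        for var, key in SHORTHANDS.items()
        if var in env and key in tags and env[var] != tags[key]
    ]


def precision_supported(precision, compute_capability):
    """Apply the floors quoted above: nvfp4 needs >= 10.0, everything else >= 9.0."""
    floor = 10.0 if precision == "nvfp4" else 9.0
    return compute_capability >= floor
```

Passing a dict of the -e values you intend to use, or os.environ, to find_conflicts surfaces a disagreement without waiting for the container to boot.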

Boot and download#

  • NGC download failure on first boot. NGC_API_KEY is missing or wrong, or the host has no network access to NGC. Set NGC_API_KEY to a valid token, or pre-populate $LOCAL_NIM_CACHE from a host that does have NGC access (using cp -RL to materialize symlinks; see Air-gapped hosts above).

  • Readiness probe never returns 200 — first-run engine compilation (especially for the 32B (super) size or under Bring your own checkpoint for Cosmos3-Generator) can take several minutes on a cold cache. Wait for GET /v1/health/ready to return 200 before sending inference traffic; on Kubernetes, increase initialDelaySeconds on the readiness probe accordingly.
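
Outside Kubernetes, the same wait can be done with a small polling loop. The sketch below polls GET /v1/health/ready until it returns 200 or a deadline passes. The base URL is an assumption you must supply for your deployment, and the 1800-second default deadline is an arbitrary illustrative value, not a documented bound.

```python
import time
import urllib.error
import urllib.request


def probe_ready(base_url, timeout=5):
    """Return the HTTP status of GET /v1/health/ready, or None if unreachable."""
    url = base_url.rstrip("/") + "/v1/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but not with a success status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, reset, or timed out


def wait_until_ready(probe, deadline_s=1800, interval_s=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns 200 (True) or the deadline passes (False)."""
    start = clock()
    while True:
        if probe() == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A typical call wraps the probe in a lambda, such as wait_until_ready(lambda: probe_ready(base_url)), and sends inference traffic only when it returns True.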

BYOC validation#

  • Boot fails with a BYOC cross-check error. When NIM_FT_CHECKPOINT is set, the NIM cross-checks the auto-discovered model size and precision against the active profile and raises a clear expected vs. received error if they disagree. Update either the BYOC checkpoint or the matching NIM_MODEL_SIZE / NIM_PRECISION shorthand so they align, then restart.

  • ``config.json`` not found inside the BYOC directory. Cosmos3-Generator expects the layout described in Bring your own checkpoint for Cosmos3-Generator (transformer/config.json, vae/, scheduler/, model_index.json). Verify that the bind mount points at the directory that contains those four entries, not one level above or below it.
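
The layout check can be run on the host before mounting the directory. The sketch below tests only for the four entries named above and says nothing about their contents; the function name is hypothetical.

```python
import os

# The four entries named in the bullet above, with the kind of path each should be.
REQUIRED_ENTRIES = [
    ("transformer/config.json", os.path.isfile),
    ("vae", os.path.isdir),
    ("scheduler", os.path.isdir),
    ("model_index.json", os.path.isfile),
]


def missing_byoc_entries(checkpoint_dir):
    """Return the required entries that are absent from `checkpoint_dir`."""
    missing = []
    for rel_path, exists_as in REQUIRED_ENTRIES:
        full_path = os.path.join(checkpoint_dir, *rel_path.split("/"))
        if not exists_as(full_path):
            missing.append(rel_path)
    return missing
```

If all four entries are reported missing, the path is probably one level too high or too low. Running the check on the parent or on a child directory usually shows which.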

Runtime#

  • ``Out of memory`` at engine load on a multi-GPU latency profile — the chosen profile’s per-device VRAM still exceeds what the host has. For the 32B (super) size on 80 GB-class GPUs, pin a tensor-parallel fallback explicitly with -e NIM_TAGS_SELECTOR='model_size=super,nim_tp=2' (or nim_tp=4); otherwise use a more memory-efficient precision (e.g. fp8 instead of bf16) or fewer GPUs.

Output playback#

  • The output ``.mp4`` does not play in a browser or stock player — VP9-in-MP4 is not universally supported. Use mpv, ffplay, or IINA; to produce a more portable file, re-encode externally with ffmpeg -i video.mp4 -c:v libx264 -pix_fmt yuv420p out.mp4.
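
When many outputs need converting, the re-encode can be driven from a script. The sketch below builds the same ffmpeg command shown above and adds an optional ffprobe query for the video codec name, so that only VP9 files are re-encoded. It assumes ffmpeg and ffprobe are installed and on PATH; the helper names are hypothetical.

```python
import subprocess


def build_codec_probe_cmd(path):
    """ffprobe command that prints only the first video stream's codec name."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=codec_name",
            "-of", "default=noprint_wrappers=1:nokey=1", path]


def build_reencode_cmd(src, dst):
    """The H.264 / yuv420p re-encode command from the bullet above."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-pix_fmt", "yuv420p", dst]


def is_vp9(probe_output):
    """ffprobe reports VP9 video streams with the codec name 'vp9'."""
    return probe_output.strip().lower() == "vp9"


def reencode_if_vp9(src, dst):
    """Re-encode `src` to `dst` only when its video stream is VP9; return True if done."""
    probe = subprocess.run(build_codec_probe_cmd(src),
                           capture_output=True, text=True, check=True)
    if not is_vp9(probe.stdout):
        return False
    subprocess.run(build_reencode_cmd(src, dst), check=True)
    return True
```

Passing the commands as argument lists rather than a shell string avoids quoting problems with file names that contain spaces.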