Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.2#

Highlights#

This release upgrades the inference backend to vLLM 0.18.0, improves air-gapped redeployments and mirroring workflows, expands the OpenAPI specification with NIM endpoints, and adds the nemotron-3-super-120b-a12b model-specific NIM.

vLLM 0.18.0#

The inference backend is updated from vLLM 0.17.1 to vLLM 0.18.0. This upgrade resolves the double-serialization issue in tool calling with anyOf and oneOf schemas. Refer to vLLM PR #36032 for details. It also brings general performance and stability improvements.

Runtime Manifest Caching for Air-Gapped Redeployments#

Model-Free NIM now caches the manifest generated at runtime locally. On subsequent launches in air-gapped environments, the cached manifest is reused. This eliminates the need for network access after the initial deployment. Refer to Model-Free NIM for details.

GCS Support for the Mirror Command#

The mirror CLI command now supports Google Cloud Storage (GCS) as a destination, in addition to existing backends. Refer to the CLI Reference for details.

NIM Endpoints in the OpenAPI Specification#

NIM endpoints are now injected into vLLM’s openapi.json. This gives API consumers a complete specification that includes both standard vLLM routes and NIM extensions.
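
As a rough illustration, the merged specification can be inspected from the command line. The sketch below assumes the server is reachable at http://localhost:8000 and serves the document at /openapi.json (neither detail is stated in these notes), and that jq is installed. The names list_openapi_paths and NIM_URL are hypothetical and introduced only for this example.

```shell
# Print every route path declared in an OpenAPI document read from stdin,
# one per line, sorted alphabetically by jq's `keys`.
list_openapi_paths() {
  jq -r '.paths | keys[]'
}

# Example usage against a running container.
# NIM_URL is a hypothetical variable; the host, port, and path are assumptions.
#   NIM_URL="${NIM_URL:-http://localhost:8000}"
#   curl -s "${NIM_URL}/openapi.json" | list_openapi_paths
```

Comparing this output between 2.0.2 and an earlier release is one way to see which routes were added to the specification.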

Nemotron-3-Super-120B-A12B Model-Specific NIM#

This release adds the nemotron-3-super-120b-a12b model-specific NIM with curated weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.

Model-Specific NIM Updates#

This release includes updated 2.0.2 versions of the following model-specific NIMs:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.

Model-Free NIM Update#

This release includes an updated 2.0.2 version of Model-Free NIM. Refer to Model-Free NIM for details.

Security Fixes#

This release includes the following security fixes:

  • PyJWT — CVE-2026-32597 / GHSA-752w-5fwx-jx9f: Fixed a JWT crit header parameter validation issue (CVSS 7.5 HIGH).

  • gRPC-Go — CVE-2026-33186 / GHSA-p77j-4mvh-x3m3: Fixed a server :path authorization bypass.

  • pyasn1 — CVE-2026-30922 / GHSA-jr27-m4p2-rc6r: Fixed a recursion depth vulnerability.

  • nginx — Updated from 1.29.5 to 1.29.7 to address CVE-2026-1642, CVE-2026-27651, CVE-2026-27654, CVE-2026-32647, and CVE-2026-27784.

  • cbor2 — CVE-2026-26209 / GHSA-3c37-wwvx-h642: Fixed a recursion depth denial-of-service vulnerability.

Bug Fixes#

This release includes the following bug fixes:

  • Restored the Ray dependency that was removed in vLLM 0.18.0.

  • Enabled nginx access logging by default to improve observability.

Known Issues#

This release includes the following known issues and limitations:

  • nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue in NIM 2.0.2.

  • MoE models with MXFP4 profiles (such as gpt-oss-120b) fail to start on Blackwell GPUs in air-gapped environments. The container requires an external network connection to download kernel dependencies at runtime, which is not available in air-gapped deployments. Non-MoE models and non-MXFP4 profiles are not affected. This issue is fixed in vLLM PR #38391, and the fix will be included in an upcoming release. There is no workaround for this issue in NIM 2.0.2.

  • An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.

    Example Error Message

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • On UMA GPUs, such as GB200, GH200, and GB10 (DGX Spark), PyTorch counts cached and buffered OS memory as used GPU memory. This might cause vLLM to underestimate available GPU memory at startup and fail with ValueError: Free memory on device cuda:N ... is less than desired GPU memory utilization. Refer to upstream vLLM PR #35356 for progress on a permanent fix.

    Workaround: Drop the OS page cache by running the following command before starting the container:

    sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
    
  • Llama 3.3 Nemotron Super 49B v1.5 might fail to deploy, returning a RuntimeError during engine core initialization. This issue can occur on all GPUs and profiles.

    Workaround: Set NIM_PASSTHROUGH_ARGS to disable custom all-reduce optimizations:

    export NIM_PASSTHROUGH_ARGS="--disable-custom-all-reduce --compilation-config '{\"pass_config\": {\"fuse_allreduce_rms\": false}}'"
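
For the UMA workaround in the list above, it can help to check how much memory the OS is holding in page cache and buffers before dropping it. The following is a minimal sketch that reads the standard Linux /proc/meminfo file. The function name reclaimable_cache_mib and its optional file argument are hypothetical, introduced here only for illustration and testing.

```shell
# Print the memory held as page cache plus buffers, in MiB.
# $1 is an optional path to a meminfo-style file (hypothetical override for
# testing); by default the function reads /proc/meminfo.
reclaimable_cache_mib() {
  # /proc/meminfo reports sizes in kB. Match the field names exactly so that
  # "SwapCached:" is not counted along with "Cached:".
  awk '$1 == "Cached:" || $1 == "Buffers:" { kb += $2 }
       END { printf "%d\n", kb / 1024 }' "${1:-/proc/meminfo}"
}

# Report the current value when running on a Linux host.
if [ -r /proc/meminfo ]; then
  echo "Cached + Buffers: $(reclaimable_cache_mib) MiB"
fi
```

Running the function before and after the drop_caches command gives a rough idea of how much memory the workaround released. How that maps to the free GPU memory vLLM sees on a given UMA system is not specified in these notes.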
    

For information about past updates and older versions, refer to the previous release notes.