Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.2#

Highlights#

This release upgrades the inference backend to vLLM 0.18.0, improves air-gapped redeployments and mirroring workflows, expands the OpenAPI specification with NIM endpoints, and adds the nemotron-3-super-120b-a12b model-specific NIM.

vLLM 0.18.0#

The inference backend is updated from vLLM 0.17.1 to vLLM 0.18.0. This upgrade resolves the double-serialization issue in tool calling with anyOf and oneOf schemas. Refer to vLLM PR #36032 for details. It also brings general performance and stability improvements.

Runtime Manifest Caching for Air-Gapped Redeployments#

Model-Free NIM now caches the manifest generated at runtime locally. On subsequent launches in air-gapped environments, the cached manifest is reused. This eliminates the need for network access after the initial deployment. Refer to Model-Free NIM for details.
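As a sketch of the resulting workflow (assuming a $NIM_IMAGE variable pointing at the container image and the cache mount path used in the examples later on this page):

```shell
# First launch with network access: the runtime manifest is generated
# and stored in the mounted cache alongside the model files.
docker run --rm --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e NGC_API_KEY=$NGC_API_KEY \
  ${NIM_IMAGE}

# Later launches in the air-gapped environment reuse the cached manifest
# from the same mount, so no network access is needed.
docker run --rm --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  ${NIM_IMAGE}
```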

GCS Support for the Mirror Command#

The mirror CLI command now supports Google Cloud Storage (GCS) as a destination, in addition to existing backends. Refer to the CLI Reference for details.
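A hypothetical invocation might look like the following. The destination argument form is an assumption for illustration, so confirm the exact flags against the CLI Reference; GOOGLE_APPLICATION_CREDENTIALS is the standard Google Cloud credential mechanism.

```shell
# Hypothetical sketch: mirror cached model artifacts to a GCS bucket.
# The gs:// destination form shown here is illustrative.
docker run --rm \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcs-key.json \
  -v "$HOME/gcs-key.json:/secrets/gcs-key.json:ro" \
  ${NIM_IMAGE} mirror gs://my-bucket/nim-models
```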

NIM Endpoints in the OpenAPI Specification#

NIM endpoints are now injected into vLLM’s openapi.json. This gives API consumers a complete specification that includes both standard vLLM routes and NIM extensions.
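For example, assuming the server is listening on the common default of localhost:8000, the merged specification can be inspected with curl and jq:

```shell
# List every route in the merged specification, including NIM extensions.
curl -s http://localhost:8000/openapi.json | jq -r '.paths | keys[]'
```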

Nemotron-3-Super-120B-A12B Model-Specific NIM#

This release adds the nemotron-3-super-120b-a12b model-specific NIM with curated weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.

Security Fixes#

This release includes the following security fixes:

  • PyJWT — CVE-2026-32597 / GHSA-752w-5fwx-jx9f: Fixed a JWT crit header parameter validation issue (CVSS 7.5 HIGH).

  • gRPC-Go — CVE-2026-33186 / GHSA-p77j-4mvh-x3m3: Fixed a server :path authorization bypass.

  • pyasn1 — CVE-2026-30922 / GHSA-jr27-m4p2-rc6r: Fixed a recursion depth vulnerability.

  • nginx — Updated from 1.29.5 to 1.29.7 to address CVE-2026-1642, CVE-2026-27651, CVE-2026-27654, CVE-2026-32647, and CVE-2026-27784.

  • cbor2 — CVE-2026-26209 / GHSA-3c37-wwvx-h642: Fixed a recursion depth denial-of-service vulnerability.

Bug Fixes#

This release includes the following bug fixes:

  • Restored the Ray dependency removed in vLLM 0.18.0.

  • Enabled nginx access logging by default to improve observability.

Known Issues#

This release includes the following known issues and limitations:

  • A vLLM issue that is not yet tracked upstream might cause the vLLM worker to fail with an EngineDeadError on four or eight H100 NVL GPUs.

    Example Error Message

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • On UMA GPUs, such as GB200, GH200, and GB10 (DGX Spark), PyTorch counts cached and buffered OS memory as used GPU memory. This might cause vLLM to underestimate available GPU memory at startup and fail with ValueError: Free memory on device cuda:N ... is less than desired GPU memory utilization. Refer to upstream vLLM PR #35356 for progress on a permanent fix.

    Workaround: Drop the OS page cache before starting the container.

    Run the following command before starting the container:

    sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
    

Release 2.0.1#

Highlights#

This is the first release of the newly architected NIM LLM. Version 2.0 is a ground-up redesign that adopts a one-container, one-backend philosophy, replacing the multi-backend 1.x architecture. This release ships with vLLM 0.17.1 as the sole inference backend. Refer to the 1.x to 2.0 Migration Guide for details on upgrading.

Transparent vLLM Configuration#

NIM LLM exposes vllm serve as a first-class interface. Any argument that vllm serve accepts can be passed directly to the container, allowing backend tuning identical to standalone vLLM. Use --dry-run to preview the fully resolved configuration without launching the server. Refer to Advanced Configuration for details.
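For example, with $NIM_IMAGE standing in for the container image (the specific vllm serve flags shown are illustrative):

```shell
# Preview the fully resolved configuration without launching the server.
docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY \
  ${NIM_IMAGE} --dry-run

# Pass vllm serve arguments directly to tune the backend.
docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY -p 8000:8000 \
  ${NIM_IMAGE} --max-model-len 8192 --gpu-memory-utilization 0.85
```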

Container-as-Binary CLI#

The container now behaves like a self-documenting command-line tool. Pass -h to any action to see its usage. Refer to the CLI Reference for details.
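For example:

```shell
# Top-level usage and the list of available actions.
docker run --rm ${NIM_IMAGE} -h

# Usage for a specific action, such as the profile lister.
docker run --rm ${NIM_IMAGE} list-model-profiles -h
```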

Broad GPU SKU Coverage#

This release is verified across 16 GPU SKUs spanning five NVIDIA architectures, from Ampere through Blackwell. Tested hardware includes A100, A10G, L40S, H100, H100 NVL, H200, H200 NVL, GH200, GB200, GB10, B200, B300, and RTX PRO Blackwell Server Edition cards.

Eight Model-Specific NIMs#

This release includes eight model-specific NIMs:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.

Model-Free NIM#

Model-Free NIM is a single container image that can serve any supported model from Hugging Face, NGC, Amazon S3, Google Cloud Storage, ModelScope, or a local directory. It generates a manifest at startup and enables day-zero support for newly released model architectures without waiting for a model-specific container. Refer to Model-Free NIM for details.
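A minimal sketch of serving a Hugging Face model with the model-free image follows. The NIM_MODEL_NAME variable and the hf:// URI form are assumptions here; confirm both against the Model-Free NIM documentation.

```shell
# Serve a Hugging Face model with the model-free image.
# NIM_MODEL_NAME and the hf:// URI form are assumed for illustration.
docker run --rm --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
  -p 8000:8000 \
  ${NIM_IMAGE}
```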

LoRA Adapter Serving#

NIM LLM supports both static LoRA (adapters discovered at startup) and dynamic LoRA (adapters loaded and unloaded at runtime via directory watching or the /v1/load_lora_adapter API). Refer to Fine-Tuning with LoRA for details.
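As a sketch, the runtime API follows upstream vLLM's dynamic LoRA endpoints; the adapter name and path below are placeholders, and the path must be visible inside the container:

```shell
# Load an adapter at runtime via the dynamic LoRA API.
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/opt/nim/loras/my-adapter"}'

# Unload it when no longer needed.
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'
```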

Advanced Inference Features#

This release adds support for custom logits processing via pluggable processors mounted into the container, and for prompt embeddings, which enable privacy-preserving inference workflows where sensitive data is converted to embeddings before it reaches the server.

Known Issues#

This release includes the following known issues and limitations:

  • In vLLM 0.17.1, tool calls with anyOf or oneOf parameter schemas return double-serialized JSON arguments when the qwen3_coder tool call parser is used. This issue is fixed in upstream vLLM (PR #36032) and will be included in the next release that uses vLLM 0.18.0+.

  • Multi-node deployments (PP > 1) crash with a RayChannelTimeoutError after processing a request that includes a non-null logprobs parameter. Refer to the upstream vLLM issue for details.

  • Llama 3.1 70B Instruct and Llama 3.3 70B Instruct might run out of memory (OOM) with the default maximum sequence length. Workaround: Set NIM_MAX_MODEL_LEN=100000.

  • FP8 MoE models (for example, Nemotron Nano) have degraded throughput on Blackwell GPUs because native FP8 Grouped GEMM kernels are not yet available. The flashinfer autotuner falls back to older CUTLASS kernels. Workaround: Use NVFP4 quantization (Blackwell-native TMA kernels).

  • NVFP4 MoE models (for example, Nemotron Nano) might crash on GB10 with CUDA error: misaligned address during CUDA graph capture. Dense NVFP4 models are unaffected. Workaround: Set VLLM_USE_FLASHINFER_MOE_FP4=0.

  • Nemotron 3 Nano and Llama 3.3 Nemotron Super 49B v1.5 might fail to deploy on all GPUs and profiles, returning a RuntimeError during engine core initialization.

    Workaround: Set NIM_PASSTHROUGH_ARGS to disable custom all-reduce optimizations.

    Run the following command before starting the container:

    export NIM_PASSTHROUGH_ARGS="--disable-custom-all-reduce --compilation-config '{\"pass_config\": {\"fuse_allreduce_rms\": false}}'"
    
  • On A10G and A100 GPUs, automatic profile selection might pick an FP8 profile that requires compute capability 8.9+ (Ada Lovelace or newer), causing a startup crash.

    Workaround: Set NIM_MODEL_PROFILE to a BF16 profile.

    Startup crash message:

    pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
      Value error, The quantization method modelopt is not supported for the current GPU.
      Minimum capability: 89. Current capability: 86.
    

    List available profiles to find the correct ID:

    docker run --rm -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
    

    Then start the server with that profile:

    docker run ... -e NIM_MODEL_PROFILE=${BF16_PROFILE_ID} ...
    
  • download-to-cache --lora fails with Error: None because the profile selector does not resolve a LoRA profile.

    Workaround: Pass the explicit LoRA profile ID with --profiles.

    List profiles to find the one with feat_lora in its name:

    docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
    

    Then download it explicitly:

    docker run --rm --gpus all \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -e NGC_API_KEY=$NGC_API_KEY \
      ${NIM_IMAGE} download-to-cache --profiles ${LORA_PROFILE_ID}