Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.6#

Highlights#

This release upgrades the inference backend to vLLM 0.21.0, adds first-class OpenShift Security Context Constraints (SCC) awareness to the Helm chart, and exposes additional vLLM inference endpoints through the NIM proxy.

vLLM 0.21.0#

The inference backend is updated from vLLM 0.20.2 to vLLM 0.21.0, bringing improvements to functionality, performance, and stability.

OpenShift Security Context Constraints (SCC) Support#

The Helm chart now ships with first-class OpenShift SCC awareness. Two SCC profiles are supported in this release: nonroot-v2 and restricted-v2. Refer to Configure the Security Context Constraint for details.
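
As a small illustration, the following shell sketch checks whether an SCC name is one of the two profiles listed above. The helper name is_supported_scc is hypothetical and is not part of the Helm chart.

```shell
# Hypothetical helper (not part of the Helm chart): succeed only for the
# SCC profiles that these notes list as supported in this release.
is_supported_scc() {
  case "$1" in
    nonroot-v2|restricted-v2) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a cluster, the SCC that admitted a pod is typically recorded in the pod's openshift.io/scc annotation, so the helper could be fed the output of a command such as oc get pod <pod-name> -o jsonpath='{.metadata.annotations.openshift\.io/scc}', where <pod-name> is a placeholder.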

Additional vLLM Inference Endpoints Exposed Through Proxy#

The NIM proxy now exposes the vLLM-native /inference/v1/generate and /generative_scoring endpoints in addition to the OpenAI-compatible routes. These endpoints provide direct access to vLLM’s native generation and scoring interfaces for advanced use cases. Refer to the API Reference for details.

Model-Specific NIM Updates#

This release includes updated 2.0.6 versions of the following model-specific Certified NIM containers:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • nemotron-3-super-120b-a12b

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for Certified NIMs for supported profiles and verified GPUs.

Model-Free NIM Update#

This release includes an updated 2.0.6 version of the model-free NIM container. Refer to Model-Free NIM for details.

NIM Day 0 Offering#

This release introduces the NIM Day 0 offering of the following model-specific NIM:

  • nemotron-3-ultra-550b-a55b

Refer to Release Notes for NIM Day 0 for details and get-started guides.

Security Fixes#

This release includes the following security fixes:

  • nginx — Updated from 1.29.7 to 1.30.1 to address CVE-2026-42945 (“NGINX Rift” RCE in ngx_http_rewrite_module, Critical), CVE-2026-42926, CVE-2026-42934, CVE-2026-42946, CVE-2026-40460, and CVE-2026-40701 (Medium).

  • PyNvVideoCodec — Updated from 2.0.3 to 2.0.4 to patch vulnerabilities in the bundled ffmpeg dependency.

  • log4j-core (CVE-2026-34478 and CVE-2026-34480): Removed ray_dist.jar from the container image. Ray’s Java fat-jar shaded log4j-core 2.25.3, which scanners flag for these vulnerabilities (fixed upstream in log4j-core 2.25.4). Because NIM LLM uses only Ray’s Python runtime, the jar is unused and safe to remove.

Bug Fixes#

This release includes the following bug fixes:

  • The Helm chart no longer overflows the 63-character Kubernetes DNS label limit on multi-node deployments. The truncation length for the base fullname was reduced to leave room for the LeaderWorkerSet (LWS) group suffix, which prevents rendering failures with long release names.

  • The container now resolves the FlashInfer mnnvl.py patch path dynamically at runtime. Previously, the path was hardcoded to a specific Python site-packages location, which could fail when the FlashInfer version or Python layout changed.
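
The name-length fix in the first item above can be illustrated with a minimal shell sketch. It assumes a hypothetical group suffix such as -0 and a hypothetical helper named fit_dns_label; the chart's actual template logic and suffix format are not shown in these notes.

```shell
# Hypothetical helper: truncate a base name so that "<base><suffix>" fits
# within the 63-character Kubernetes DNS label limit.
# The body runs in a subshell so its variables stay local to the call.
fit_dns_label() (
  base=$1
  suffix=$2
  limit=63
  # Number of characters left for the base name once the suffix is reserved.
  room=$(( limit - ${#suffix} ))
  base=$(printf '%s' "$base" | cut -c "1-$room")
  # A DNS label must not end with a hyphen, so trim any left by truncation.
  while [ "${base%-}" != "$base" ]; do
    base=${base%-}
  done
  printf '%s%s\n' "$base" "$suffix"
)
```

For example, fit_dns_label my-release -0 prints my-release-0, while a 70-character base name is cut to 61 characters so that the result is exactly 63 characters long.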

Known Issues#

This release includes the following known issues and limitations:

  • NIM LLM container images can fail to pull on Docker Engine 29.5.x when the Docker containerd image store is enabled. The pull fails before the container starts, with the message error from registry: Incorrect Repository Format. Other Docker Engine versions and the default image store are not affected.

    Diagnostic: Check whether your Docker daemon uses the containerd image store.

    Run the following command:

    docker info --format '{{json .DriverStatus}}'
    

    If the output includes io.containerd.snapshotter.v1 and docker version reports Docker Engine 29.5.x, use one of the workarounds below.

    Workaround 1: Use Docker Engine 29.4.3.

    Downgrade to an earlier Docker Engine version (such as 29.4.3), restart Docker, and retry the pull.

    Workaround 2: Pull by exact linux/amd64 manifest digest.

    On linux/amd64 systems, you can keep Docker Engine 29.5.x and pull the image by its exact platform manifest digest. Do not use the top-level image index digest.

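    # Image repository and tag to pull.
    IMAGE=nvcr.io/nim/meta/llama-3.1-8b-instruct
    TAG=2.0.6
    # Look up the digest of the linux/amd64 platform manifest in the image index.
    AMD64_DIGEST=$(docker buildx imagetools inspect "${IMAGE}:${TAG}" --format '{{json .}}' \
      | jq -r '.manifest.manifests[] | select(.platform.os == "linux" and .platform.architecture == "amd64") | .digest')
    # Pull by that digest, then re-create the tag locally.
    docker pull "${IMAGE}@${AMD64_DIGEST}"
    docker tag "${IMAGE}@${AMD64_DIGEST}" "${IMAGE}:${TAG}"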
    
  • nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue.

  • An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.

    Example Error Message

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • An upstream vLLM issue might corrupt MoE base model responses when the same NIM deployment loads LoRA adapters. This issue affects all MoE models served with LoRA adapters loaded on the server. Refer to the upstream vLLM issue for details. To work around this issue, deploy the base model without loading any LoRA adapters.

  • An upstream vLLM issue might cause MoE LoRA responses to produce corrupted output when the base model is an FP8-quantized checkpoint. Refer to the upstream vLLM issue for details.

  • nemotron-3-nano NVFP4 LoRA profiles with TP values greater than one fail to deploy.

  • NVFP4 MoE models (for example, nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.

    Workaround: Disable FlashInfer NVFP4 MoE kernels.

    Set the following environment variable before starting the container:

    export VLLM_USE_FLASHINFER_MOE_FP4=0
    
  • A stale FlashInfer compilation cache might cause a deployment crash on Blackwell GPUs with a Ninja build error referencing a missing source file.

    Example Error Message

    You might encounter the following error message:

    ninja: error: '/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fp4_gemm_cutlass_sm103.cu' ... missing and no known rule to make it
    
    Workaround: Clear the FlashInfer cache.

    Remove stale cached FlashInfer files on the host before starting the container. These files are nested under the flashinfer directory in the NIM cache mounted to the container.
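
The cache cleanup described in the last item can be sketched in shell as follows. The sketch assumes that the host directory mounted as the NIM cache is passed as the first argument, and that every directory named flashinfer beneath it holds only disposable cached files. The helper name clear_flashinfer_cache and the LOCAL_NIM_CACHE variable in the usage note are assumptions, not names taken from these notes.

```shell
# Hypothetical helper: remove every directory named "flashinfer" under the
# host NIM cache directory given as the first argument.
clear_flashinfer_cache() {
  # Refuse to run on an empty or missing path to avoid deleting elsewhere.
  if [ -z "$1" ] || [ ! -d "$1" ]; then
    echo "cache directory not found: $1" >&2
    return 1
  fi
  # -prune stops find from descending into a directory it is about to delete.
  find "$1" -type d -name flashinfer -prune -exec rm -rf {} +
}
```

A call such as clear_flashinfer_cache "$LOCAL_NIM_CACHE" would be run on the host while the container is stopped; other cached files, such as model weights, are left in place.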

For information about past updates and older versions, refer to the previous release notes.