Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.4#

Highlights#

This release upgrades the inference backend to vLLM 0.20.0, enhances air-gapped deployment with a startup cache probe and actionable error messages, adds Helm chart improvements for Kubernetes security contexts and pod scheduling, and expands verified OpenShift platforms.

vLLM 0.20.0#

The inference backend is updated from vLLM 0.19.0 to vLLM 0.20.0, bringing improvements to functionality, performance, and stability. This update also resolves the UMA GPU memory reporting issue from the previous release.

Air-Gapped Deployment Enhancements#

Air-gapped deployment workflows now include a startup cache reachability probe controlled by the NIM_CACHE_PROBE_TIMEOUT environment variable (default: 60 seconds). When NIM_CACHE_PATH points to an unreachable NFS, CIFS, or FUSE mount, the container exits with an actionable error message instead of hanging indefinitely. Model download failures now also surface the specific cause, such as an authentication failure, an authorization failure, a network timeout, or insufficient disk space. Refer to the Air-Gap Deployment guide for details.
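To catch a stale mount before the container starts, a host-side pre-flight check can mirror the same idea. The following is a minimal sketch, not the probe that NIM runs internally: probe_cache_path is a hypothetical helper, /mnt/nim-cache is an example path, and the check assumes the coreutils timeout and stat commands are available on the host.

```shell
#!/bin/sh
# Hypothetical helper (not part of NIM): succeed only if the path answers a
# stat call within the time limit. SIGKILL is used because a process blocked
# on a dead network mount may ignore gentler signals.
probe_cache_path() {
    path="$1"
    limit="${2:-60}"   # seconds; mirrors the documented probe default
    if timeout -s KILL "$limit" stat "$path" >/dev/null 2>&1; then
        echo "reachable: $path"
        return 0
    fi
    echo "unreachable within ${limit}s: $path" >&2
    return 1
}

# Example values only; adjust both to your environment.
export NIM_CACHE_PATH="${NIM_CACHE_PATH:-/mnt/nim-cache}"
export NIM_CACHE_PROBE_TIMEOUT="${NIM_CACHE_PROBE_TIMEOUT:-60}"

probe_cache_path "$NIM_CACHE_PATH" "$NIM_CACHE_PROBE_TIMEOUT" \
    || echo "fix the cache mount before starting the container"
```

A missing path and a hung mount both produce a failure here, so the message on stderr is only a starting point for diagnosis.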

Improved Outbound TLS Configuration#

The TLS documentation and environment variable reference now clearly distinguish between inbound TLS (client connections to the NIM API) and outbound TLS (model downloads from NGC, Hugging Face, or corporate registries). New guidance for creating combined CA bundles ensures that custom corporate CAs do not inadvertently remove trust for public CAs. The REQUESTS_CA_BUNDLE environment variable is now documented for environments that require corporate CA trust for all download paths. Refer to SSL and TLS for details.
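A combined bundle can be built by concatenating the public CA bundle and the corporate CA certificate, both in PEM format. The sketch below assumes a Debian or Ubuntu style public bundle path and placeholder corporate paths. build_ca_bundle is a hypothetical helper, and the SSL and TLS guide remains the authoritative procedure.

```shell
#!/bin/sh
# Hypothetical helper: write <public bundle> + <corporate CA> to <combined>.
# The extra newline guards against a first file with no trailing newline,
# which would otherwise fuse two PEM blocks together.
build_ca_bundle() {
    public_bundle="$1"
    corporate_ca="$2"
    combined="$3"
    [ -r "$public_bundle" ] || { echo "missing public bundle: $public_bundle" >&2; return 1; }
    [ -r "$corporate_ca" ] || { echo "missing corporate CA: $corporate_ca" >&2; return 1; }
    { cat "$public_bundle"; echo; cat "$corporate_ca"; } > "$combined"
}

# Example usage with assumed paths (uncomment and adjust):
# build_ca_bundle /etc/ssl/certs/ca-certificates.crt \
#     /path/to/corporate-ca.pem /path/to/combined-ca.pem
# export REQUESTS_CA_BUNDLE=/path/to/combined-ca.pem
```

Pointing REQUESTS_CA_BUNDLE at the combined file, not at the corporate certificate alone, is what keeps public CAs trusted.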

Helm Chart: Container Security Context for Multi-Node Deployments#

The Helm chart now applies containerSecurityContext to multi-node leader and worker containers. This enables setting Linux capabilities such as SYS_PTRACE and IPC_LOCK, which are required for IMEX channel operations on NVIDIA GB200 and GB300 NVL72 systems. Refer to the Helm and Kubernetes deployment guide for details.
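As an illustration, a values override that adds both capabilities might look like the following. This is a sketch under two assumptions: that containerSecurityContext sits at the top level of the chart values, and that it accepts a standard Kubernetes container securityContext. The file name, release name, and chart reference are placeholders.

```shell
#!/bin/sh
# Write a Helm values override that requests the two capabilities.
cat > nim-security-values.yaml <<'EOF'
# Assumption: top-level key that takes a standard Kubernetes
# container securityContext. Check the chart's values reference.
containerSecurityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - IPC_LOCK
EOF

# Apply with your own release name and chart reference (placeholders):
# helm upgrade --install <release> <chart> -f nim-security-values.yaml
```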

Helm Chart: Pod Priority Class Support#

The Helm chart now supports the priorityClassName value for assigning a Kubernetes PriorityClass to NIM pods in both single-node and multi-node deployments. Assigning a suitably high priority class reduces the chance that NIM inference pods are preempted under resource pressure and lets the scheduler evict lower-priority workloads to make room for them. Refer to the Helm and Kubernetes deployment guide for details.
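A PriorityClass must exist in the cluster before pods can reference it. The sketch below writes an example PriorityClass manifest and a matching values override. The class name nim-inference-high, the priority value 100000, and the file names are hypothetical, and priorityClassName is assumed to be a top-level chart value.

```shell
#!/bin/sh
# Example PriorityClass manifest (name and value are illustrative).
cat > nim-priorityclass.yaml <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: nim-inference-high
value: 100000
globalDefault: false
description: "Priority for NIM inference pods"
EOF

# Values override that points the chart at the class above.
cat > nim-priority-values.yaml <<'EOF'
priorityClassName: nim-inference-high
EOF

# Apply with your own release name and chart reference (placeholders):
# kubectl apply -f nim-priorityclass.yaml
# helm upgrade --install <release> <chart> -f nim-priority-values.yaml
```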

Expanded OpenShift Platform Verification#

The OpenShift deployment guide now includes verified deployment instructions for Red Hat OpenShift Service on AWS (ROSA) and OpenShift Dedicated on Google Cloud Platform (GCP). Refer to the OpenShift deployment guide for details.

Added NIM Offerings#

NVIDIA now publishes inference microservices (NIMs) under distinct NIM offerings so you can choose the right balance of speed to publication, peak performance, and enterprise lifecycle guarantees. For more information, refer to NIM Offerings.

Model-Specific NIM Updates#

This release includes updated 2.0.4 versions of the following model-specific Certified NIMs:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • nemotron-3-super-120b-a12b

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for Certified NIMs for supported profiles and verified GPUs.

Model-Free NIM Update#

This release includes an updated 2.0.4 version of Model-Free NIM. Refer to Model-Free NIM for details.

Bug Fixes#

This release includes the following bug fixes:

  • Ray is upgraded to 2.55.1 to patch CVE-2026-41486 (CVSS 8.9), a remote code execution vulnerability that crafted Parquet files can trigger during deserialization.

  • On UMA GPUs (such as GB200, GH200, and GB10), PyTorch no longer counts cached and buffered OS memory as used GPU memory at startup. This resolves the ValueError: Free memory on device cuda:N ... is less than desired GPU memory utilization failure reported in the previous release. The workaround of dropping the OS page cache before starting the container is no longer required.

Known Issues#

This release includes the following known issues and limitations:

Note

vLLM 0.20.0 introduces a stricter KV cache budget check. This change raises the default gpu_memory_utilization value from 0.9 to 0.92. As a result, profile and SKU combinations with understated memory requirements might fail at startup with a ValueError instead of emitting a soft warning. Refer to Troubleshooting GPU Memory Out-of-Memory Errors for guidance.
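As rough arithmetic only (this is not the check that vLLM performs), the requested memory budget scales as total device memory multiplied by gpu_memory_utilization. The sketch below uses an 80 GiB device as an example. memory_budget_gib is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: total device memory (GiB) x utilization fraction.
memory_budget_gib() {
    awk -v total="$1" -v frac="$2" 'BEGIN { printf "%.1f\n", total * frac }'
}

memory_budget_gib 80 0.90   # prints 72.0
memory_budget_gib 80 0.92   # prints 73.6
```

On this example device the new default requests about 1.6 GiB more, which is enough to expose a profile whose stated memory requirement was already marginal.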

  • nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue in NIM 2.0.4.

  • An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • An upstream vLLM issue might corrupt MoE base model responses when the same NIM deployment loads LoRA adapters. This issue affects all MoE models served with LoRA adapters loaded on the server. Refer to the upstream vLLM issue for details. To work around this issue, deploy the base model without loading any LoRA adapters.

  • An upstream vLLM issue might corrupt MoE LoRA responses when the base model is an FP8-quantized checkpoint. Refer to the upstream vLLM issue for details.

  • nemotron-3-nano NVFP4 LoRA profiles with a tensor-parallel (TP) size greater than 1 fail to deploy.

  • NVFP4 MoE models (for example, nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.

    Workaround: Disable FlashInfer NVFP4 MoE kernels.

    Set the following environment variable before starting the container:

    export VLLM_USE_FLASHINFER_MOE_FP4=0
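An export in the host shell does not reach the container unless the variable is forwarded. For a Docker launch, the sketch below shows one way to do that. The image reference is a placeholder, the flags other than -e are illustrative, and the leading echo makes this a dry run that only prints the command.

```shell
#!/bin/sh
export VLLM_USE_FLASHINFER_MOE_FP4=0

# NIM_IMAGE is a placeholder; substitute the image reference you deploy.
NIM_IMAGE="${NIM_IMAGE:-<nim-image>}"

# "-e NAME" with no value forwards the host's current value of NAME.
# Remove the leading "echo" to run the container for real.
echo docker run --rm --gpus all -e VLLM_USE_FLASHINFER_MOE_FP4 "$NIM_IMAGE"
```

For Helm or other orchestrated deployments, the equivalent step is to set the variable in the container's environment through that tool's own configuration.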
    

For information about past updates and older versions, refer to the previous release notes.