Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.5#

Highlights#

This release upgrades the inference backend to vLLM 0.20.2, introduces custom parser and chat template support with automatic path resolution, adds new NIM Turbo offerings, and improves documentation with an interactive GPU selector in the support matrix.

vLLM 0.20.2#

The inference backend is updated from vLLM 0.20.0 to vLLM 0.20.2, bringing improvements to functionality, performance, and stability.

Ubuntu 24.04 LTS Base Image#

The container image is now based on Ubuntu 24.04 LTS (previously Ubuntu 22.04 LTS), bringing updated system libraries, longer upstream support, and improved security.

Custom Parsers and Chat Templates#

NIM LLM now supports vLLM CLI arguments that take file paths, including --reasoning-parser-plugin, --tool-parser-plugin, and --chat-template, with automatic path resolution. You can reference plugin files by bare filename without knowing the container’s internal directory layout. The documentation also lists all built-in reasoning parsers and tool-call parsers that do not require a plugin file. Refer to Custom Parsers and Chat Templates for details.

NIM Turbo Offering#

This release introduces the NIM Turbo offering with initial 1.0.0 versions of the following model-specific Turbo NIMs:

  • kimi-k2.5

  • nemotron-3-super-120b-a12b-turbo

Refer to Release Notes for NIM Turbo for details and getting-started guides.

Interactive GPU Selector in Certified NIM Support Matrix#

The support matrix documentation now includes an interactive GPU selector for finding compatible GPU configurations. Refer to the Support Matrix for Certified NIMs for details.

Updated Oracle Cloud Deployment Documentation#

The Oracle Cloud Infrastructure (OCI) and Oracle Kubernetes Engine (OKE) deployment guide is updated with revised instructions. Refer to the Oracle deployment guide for details.

Model-Specific NIM Updates#

This release includes updated 2.0.5 versions of the following model-specific Certified NIMs:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • nemotron-3-super-120b-a12b

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for Certified NIMs for supported profiles and verified GPUs.

Model-Free NIM Update#

This release includes an updated 2.0.5 version of Model-Free NIM. Refer to Model-Free NIM for details.

Bug Fixes#

This release includes the following bug fixes:

  • NIM_LOG_LEVEL is now properly propagated to nim-sdk’s Rust env_logger. Previously, NIM LLM normalized log level names to Python-style values such as WARNING and CRITICAL, which env_logger silently rejected. As a result, all Rust download and hub log lines were suppressed for the default and CRITICAL configurations.

  • Go is bumped from 1.25.8 to 1.25.9 in the Mooncake wheel build to patch CVE-2026-27143 in libetcd_wrapper.so.
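For context on the NIM_LOG_LEVEL fix above: env_logger builds on the Rust log crate, whose level names are error, warn, info, debug, and trace. The following sketch shows one way Python-style level names could be translated into that set. It is an illustration only, not the nim-sdk implementation. The function name to_env_logger_level and the choice to map CRITICAL to error are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NIM LLM or nim-sdk): translate a
# Python-style log level name into a level name that env_logger accepts.
to_env_logger_level() {
    # Uppercase the input so that "warning" and "WARNING" behave the same.
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        CRITICAL|FATAL|ERROR) echo error ;;  # assumption: CRITICAL maps to error
        WARNING|WARN)         echo warn ;;
        INFO)                 echo info ;;
        DEBUG)                echo debug ;;
        TRACE)                echo trace ;;
        *)
            echo "unrecognized log level: $1" >&2
            return 1
            ;;
    esac
}
```

For example, `to_env_logger_level WARNING` prints `warn`, and an unknown name returns a non-zero status instead of being silently dropped.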

Known Issues#

This release includes the following known issues and limitations:

  • nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue.

  • An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.

    Example Error Message

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • An upstream vLLM issue might corrupt MoE base model responses when the same NIM deployment loads LoRA adapters. This issue affects all MoE models served with LoRA adapters loaded on the server. Refer to the upstream vLLM issue for details. To work around this issue, deploy the base model without loading any LoRA adapters.

  • An upstream vLLM issue might corrupt MoE LoRA responses when the base model is an FP8-quantized checkpoint. Refer to the upstream vLLM issue for details.

  • nemotron-3-nano NVFP4 LoRA profiles fail to deploy when the tensor parallelism (TP) value is greater than 1.

  • NVFP4 MoE models (for example, nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup with a "CUDA error: misaligned address" message during full CUDA graph capture. Dense NVFP4 models are unaffected.

    Workaround: Disable FlashInfer NVFP4 MoE kernels.

    Set the following environment variable before starting the container:

    export VLLM_USE_FLASHINFER_MOE_FP4=0
    
  • A stale FlashInfer compilation cache might cause a deployment crash on Blackwell GPUs with a Ninja build error referencing a missing source file.

    Example Error Message

    You might encounter the following error message:

    ninja: error: '/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fp4_gemm_cutlass_sm103.cu' ... missing and no known rule to make it
    
    Workaround: Clear the FlashInfer cache.

    Remove stale cached FlashInfer files on the host before starting the container. These files are nested under the flashinfer directory in the NIM cache mounted to the container.
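The FlashInfer cache workaround above can be sketched as a small shell function that deletes every directory named flashinfer under the host-side NIM cache. This is a minimal sketch, not an official procedure: the function name clear_flashinfer_cache is hypothetical, and the location of the NIM cache on your host is an assumption that you must supply yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove stale FlashInfer cache directories from the
# host directory that is mounted into the container as the NIM cache.
# The cache location is NOT taken from the release notes; pass your own path.
clear_flashinfer_cache() {
    local cache_root="$1"
    if [ -z "$cache_root" ]; then
        echo "usage: clear_flashinfer_cache <host NIM cache directory>" >&2
        return 2
    fi
    if [ ! -d "$cache_root" ]; then
        echo "cache directory not found: $cache_root" >&2
        return 1
    fi
    # Match directories named "flashinfer" at any depth. -prune keeps find
    # from descending into a directory that is about to be deleted, and
    # -print lists each removed path so that the deletion is auditable.
    find "$cache_root" -type d -name flashinfer -prune -print -exec rm -rf {} +
}
```

Run the function with the container stopped, for example `clear_flashinfer_cache /path/to/nim/cache`, then start the container so that FlashInfer rebuilds its cache. Other cached content, such as model weights, is left in place.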

For information about past updates and older versions, refer to the previous release notes.