Release Notes#
This page lists changes, fixes, and known issues for each NIM LLM release.
Release 2.0.5#
Highlights#
This release upgrades the inference backend to vLLM 0.20.2, introduces custom parser and chat template support with automatic path resolution, adds new NIM Turbo offerings, and improves documentation with an interactive GPU selector in the support matrix.
vLLM 0.20.2#
The inference backend is updated from vLLM 0.20.0 to vLLM 0.20.2, bringing improvements to functionality, performance, and stability.
Ubuntu 24.04 LTS Base Image#
The container image is now based on Ubuntu 24.04 LTS (previously Ubuntu 22.04 LTS), bringing updated system libraries, longer upstream support, and improved security.
Custom Parsers and Chat Templates#
NIM LLM now supports vLLM CLI arguments that take file paths, including --reasoning-parser-plugin, --tool-parser-plugin, and --chat-template, with automatic path resolution. You can reference plugin files by bare filename without knowing the container’s internal directory layout. The documentation also lists all built-in reasoning parsers and tool-call parsers that do not require a plugin file. Refer to Custom Parsers and Chat Templates for details.
NIM Turbo Offering#
This release introduces the NIM Turbo offering with initial 1.0.0 versions of the following model-specific Turbo NIMs:
kimi-k2.5
nemotron-3-super-120b-a12b-turbo
Refer to Release Notes for NIM Turbo for details and get-started guides.
Interactive GPU Selector in Certified NIM Support Matrix#
The support matrix documentation now includes an interactive GPU selector for finding compatible GPU configurations. Refer to the Support Matrix for Certified NIMs for details.
Updated Oracle Cloud Deployment Documentation#
The Oracle Cloud Infrastructure (OCI) and Oracle Kubernetes Engine (OKE) deployment guide is updated with revised instructions. Refer to the Oracle deployment guide for details.
Model-Specific NIM Updates#
This release includes updated 2.0.5 versions of the following model-specific Certified NIMs:
gpt-oss-120b
gpt-oss-20b
llama-3.1-70b-instruct
llama-3.1-8b-instruct
llama-3.3-70b-instruct
llama-3.3-nemotron-super-49b-v1.5
nemotron-3-nano
nemotron-3-super-120b-a12b
starcoder2-7b
Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for Certified NIMs for supported profiles and verified GPUs.
Model-Free NIM Update#
This release includes an updated 2.0.5 version of Model-Free NIM. Refer to Model-Free NIM for details.
Bug Fixes#
This release includes the following bug fixes:
NIM_LOG_LEVEL is now properly propagated to nim-sdk’s Rust env_logger. Previously, NIM LLM normalized log level names to Python-style values such as WARNING and CRITICAL, which env_logger silently rejected. This caused all Rust download and hub log lines to be suppressed for the default and CRITICAL configurations.
Go is bumped from 1.25.8 to 1.25.9 in the Mooncake wheel build to patch CVE-2026-27143 in libetcd_wrapper.so.
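The log-level mismatch described in the first fix can be illustrated with a small shell sketch. It assumes only that the Rust side accepts the level names used by the `log` crate (`trace`, `debug`, `info`, `warn`, `error`). The function name `to_rust_log_level` and the choice to map CRITICAL to `error` are hypothetical; these notes do not show the mapping NIM LLM actually uses.

```shell
# Map a Python-style log level name to a level name accepted by Rust's
# `log` crate. Illustrative only: the real NIM LLM mapping is not shown
# in these release notes.
to_rust_log_level() {
    # Upper-case the input so the match is case-insensitive.
    level=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$level" in
        TRACE)          echo trace ;;
        DEBUG)          echo debug ;;
        INFO)           echo info ;;
        WARN|WARNING)   echo warn ;;
        ERROR)          echo error ;;
        # Assumption: the `log` crate has no level above `error`,
        # so the Python-only names are folded into it.
        CRITICAL|FATAL) echo error ;;
        *)
            echo "unrecognized log level: $1" >&2
            return 1
            ;;
    esac
}

to_rust_log_level WARNING   # prints: warn
```

Rejecting unknown names with a non-zero status, rather than passing them through, makes a bad value visible instead of letting the logger drop it quietly.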
Known Issues#
This release includes the following known issues and limitations:
nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue.
An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.
Example Error Message
You might encounter the following error message:
Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
An upstream vLLM issue might corrupt MoE base model responses when the same NIM deployment loads LoRA adapters. This issue affects all MoE models served with LoRA adapters loaded on the server. Refer to the upstream vLLM issue for details. To work around this issue, deploy the base model without loading any LoRA adapters.
An upstream vLLM issue might cause MoE LoRA responses to produce corrupted output when the base model is an FP8-quantized checkpoint. Refer to the upstream vLLM issue for details.
nemotron-3-nano NVFP4 LoRA profiles with TP values greater than one fail to deploy.
NVFP4 MoE models (for example, nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.
Workaround: Disable FlashInfer NVFP4 MoE kernels.
Set the following environment variable before starting the container:
export VLLM_USE_FLASHINFER_MOE_FP4=0
A stale FlashInfer compilation cache might cause a deployment crash on Blackwell GPUs with a Ninja build error referencing a missing source file.
Example Error Message
You might encounter the following error message:
ninja: error: '/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fp4_gemm_cutlass_sm103.cu' ... missing and no known rule to make it
Workaround: Clear the FlashInfer cache.
Remove stale cached FlashInfer files on the host before starting the container. These files are nested under the flashinfer directory in the NIM cache mounted to the container.
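The cache-clearing workaround can be sketched as a small shell function. This is a minimal sketch, not an official procedure. It assumes the argument is the host directory mounted into the container as the NIM cache. Because the exact nesting depth of the flashinfer directory is not stated here, it searches recursively. The function name `clear_flashinfer_cache` and the example path are hypothetical.

```shell
# Remove every directory named `flashinfer` beneath the given NIM cache root.
# Assumption: the argument is the host directory that is mounted into the
# container as the NIM cache. Run this while the container is stopped.
clear_flashinfer_cache() {
    cache_root=$1
    if [ -z "$cache_root" ] || [ ! -d "$cache_root" ]; then
        echo "usage: clear_flashinfer_cache <existing-cache-dir>" >&2
        return 1
    fi
    # -prune keeps find from descending into a directory it is about to delete.
    find "$cache_root" -type d -name flashinfer -prune -exec rm -rf -- {} +
}

# Example call with a hypothetical host path (uncomment and adjust):
# clear_flashinfer_cache "$HOME/.cache/nim"
```

Only directories named flashinfer are removed, so downloaded model files elsewhere in the cache are left in place. Presumably FlashInfer rebuilds the removed files on the next start, which may lengthen that startup.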
For information about past updates and older versions, refer to the previous release notes.