Release Notes#
This page lists changes, fixes, and known issues for each NIM LLM release.
Release 2.0.7#
Highlights#
This release upgrades the inference backend to vLLM 0.22.1 and refreshes the model-specific and model-free NIM containers to 2.0.7.
vLLM 0.22.1#
The inference backend is updated from vLLM 0.21.0 to vLLM 0.22.1, bringing improvements to functionality, performance, and stability.
Model-Specific NIM Updates#
This release includes updated 2.0.7 versions of the following model-specific Certified NIM containers:
gpt-oss-120bgpt-oss-20bllama-3.1-70b-instructllama-3.1-8b-instructllama-3.3-70b-instructllama-3.3-nemotron-super-49b-v1.5nemotron-3-nanonemotron-3-super-120b-a12bstarcoder2-7b
Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for Certified NIMs for supported profiles and verified GPUs.
Model-Free NIM Update#
This release includes an updated 2.0.7 version of the model-free NIM container. Refer to Model-Free NIM for details.
Security Fixes#
This release includes the following security fixes:
nginx — Updated from 1.29.7 to 1.30.1 to address CVE-2026-42945 (“NGINX Rift” RCE in
ngx_http_rewrite_module, Critical), CVE-2026-42926, CVE-2026-42934, CVE-2026-42946, CVE-2026-40460, and CVE-2026-40701 (Medium).PyNvVideoCodec — Updated from 2.0.3 to 2.0.4 to patch vulnerabilities in the bundled ffmpeg dependency.
Known Issues#
This release includes the following known issues and limitations:
nemotron-3-nanoLoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue.
An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an
EngineDeadError.Example Error Message
You might encounter the following error message:
Worker proc VllmWorker-2 died unexpectedly, shutting down executor. RuntimeError: cancelled (shm_broadcast.py:677 acquire_read) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
An upstream vLLM issue might corrupt MoE base model responses when the same NIM deployment loads LoRA adapters. This issue affects all MoE models served with LoRA adapters loaded on the server. Refer to the upstream vLLM issue for details. To work around this issue, deploy the base model without loading any LoRA adapters.
An upstream vLLM issue might cause MoE LoRA responses to produce corrupted output when the base model is an FP8-quantized checkpoint. Refer to the upstream vLLM issue for details.
nemotron-3-nanoNVFP4 LoRA profiles with TP values greater than one fail to deploy.
NVFP4 MoE models (for example,
nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup withCUDA error: misaligned addressduring full CUDA graph capture. Dense NVFP4 models are unaffected.Workaround: Disable FlashInfer NVFP4 MoE kernels.
Set the following environment variable before starting the container:
export VLLM_USE_FLASHINFER_MOE_FP4=0
A stale FlashInfer compilation cache might cause a deployment crash on Blackwell GPUs with a Ninja build error referencing a missing source file.
Example Error Message
You might encounter the following error message:
ninja: error: '/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/fp4_gemm_cutlass_sm103.cu' ... missing and no known rule to make it
Workaround: Clear the FlashInfer cache.
Remove stale cached FlashInfer files on the host before starting the container. These files are nested under the
flashinferdirectory in the NIM cache mounted to the container.
ModelScope manifest downloads can fail with
error decoding response bodyon slow or unstable networks. The NIM SDK defaults to zero retries (NIM_MANIFEST_DOWNLOAD_MAX_RETRY_COUNT=0), so a single transient read failure aborts the download. Other model sources (HuggingFace, NGC, S3, GCS, local) are not affected.Workaround: Increase the retry count.
Set the following environment variable before starting the container to allow the download to retry on failure:
export NIM_MANIFEST_DOWNLOAD_MAX_RETRY_COUNT=5
An upstream vLLM 0.22.1 issue can cause chat and completion requests that enable
logprobsortop_logprobsto fail with HTTP 400 and the errorOut of range float values are not JSON compliant: nan. A non-finite (NaN) log-probability value reaches the JSON serializer when the response is built. This issue affects the following 2.0.7 NVFP4 profiles:llama-3.3-nemotron-super-49b-v1.5on all NVFP4 base profiles (vllm-nvfp4-tp1-pp1,vllm-nvfp4-tp2-pp1,vllm-nvfp4-tp4-pp1, andvllm-nvfp4-tp8-pp1) and the NVFP4 LoRA profilevllm-nvfp4-tp2-pp1-lora.llama-3.1-70b-instructon the NVFP4 base profilesvllm-nvfp4-tp2-pp1,vllm-nvfp4-tp4-pp1, andvllm-nvfp4-tp8-pp1.llama-3.3-70b-instructon the NVFP4 base profilesvllm-nvfp4-tp2-pp1,vllm-nvfp4-tp4-pp1, andvllm-nvfp4-tp8-pp1.
These profiles remain supported and listed in the support matrix; the NIM starts and serves normally, and only requests that opt into
logprobsortop_logprobsare affected. To work around this issue, omitlogprobsandtop_logprobsfrom the request.
An upstream vLLM 0.22.1 issue prevents FP8 + LoRA profiles from starting for
nemotron-3-super-120b-a12bandnemotron-3-nano, so these profiles are not supported in 2.0.7 and have been removed from the support matrix. During engine initialization, the vLLM LoRA Triton kernel rejects the FP8 activation dtype withUnsupported lhs dtype fp8e4nv, the engine core fails, and the container exits. This affects every FP8 + LoRA profile (vllm-fp8-tp1/tp2/tp4/tp8-pp1-lora) for both models, across all GPUs and tensor-parallel sizes. To work around this issue, use a non-FP8 LoRA profile (BF16 or NVFP4 base), or deploy an FP8 base model profile without LoRA adapters.
FP8 profiles for
llama-3.3-nemotron-super-49b-v1.5fail to start on NVIDIA RTX PRO 4500 and RTX PRO 6000 Blackwell Server Edition GPUs, so the affected configurations are not supported in 2.0.7. On RTX PRO 4500 Blackwell Server Edition, the TP2 FP8 profiles (base and LoRA) are affected; on RTX PRO 6000 Blackwell Server Edition, the TP4 and TP8 FP8 profiles (base and LoRA) are affected and have been removed from the support matrix. During CUDA graph capture, the FlashInfer FP8 GEMM (invoked through vLLM 0.22.1) raisesRuntimeError: Plan index 10 is invalidin the cuDNN backend, and the server never becomes ready. There is no workaround other than selecting a non-FP8 (BF16 or NVFP4) profile on these GPUs.
An upstream vLLM 0.22.1 issue limits the maximum LoRA rank to 128 for the fused MoE LoRA path used by
nemotron-3-nano. Starting the NIM with--max-lora-rank 256fails during engine initialization with the errorfused_moe_lora_one_shot supports max_lora_rank<=128; got rank=256, and the container exits before the server becomes ready. The same configuration started on 2.0.6 (vLLM 0.21.0). This is a known limitation rather than a deployment failure: thenemotron-3-nanoLoRA profiles remain supported and listed in the support matrix and start normally with a LoRA rank of 128 or lower; only an explicit--max-lora-rankvalue above 128 triggers the limit. To work around this issue, set--max-lora-rankto 128 or lower.
For information about past updates and older versions, refer to the previous release notes.