Release Notes#
This page lists changes, fixes, and known issues for each NIM LLM release.
Release 2.0.1#
Highlights#
This is the first release of the newly architected NIM LLM. Version 2.0 is a ground-up redesign that adopts a one-container, one-backend philosophy, replacing the multi-backend 1.x architecture. This release ships with vLLM 0.17.1 as the sole inference backend. Refer to the 1.x to 2.0 Migration Guide for details on upgrading.
Transparent vLLM Configuration#
NIM LLM exposes vllm serve as a first-class interface. Any argument that vllm serve accepts can be passed directly to the container, allowing backend tuning identical to standalone vLLM. Use --dry-run to preview the fully resolved configuration without launching the server. Refer to Advanced Configuration for details.
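As an illustrative sketch (the image name and tuning values below are placeholders, not recommendations), backend arguments are appended after the image name exactly as they would follow vllm serve:

```shell
# Placeholder values; any argument accepted by vllm serve can be appended here.
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  ${NIM_IMAGE} \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64 \
  --dry-run
```

With --dry-run, the container prints the fully resolved configuration and exits instead of starting the server, which is useful for verifying how your flags combine with the container defaults.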
Container-as-Binary CLI#
The container now behaves like a self-documenting command-line tool. Pass -h to any action to see its usage. Refer to the CLI Reference for details.
Broad GPU SKU Coverage#
This release is verified across 16 GPU SKUs spanning five NVIDIA architectures, from Ampere through Blackwell. Tested hardware includes A100, A10G, L40S, H100, H100 NVL, H200, H200 NVL, GH200, GB200, GB10, B200, B300, and RTX PRO Blackwell Server Edition cards.
Eight Model-Specific NIMs#
This release includes eight model-specific NIMs:
gpt-oss-120b
gpt-oss-20b
llama-3.1-70b-instruct
llama-3.1-8b-instruct
llama-3.3-70b-instruct
llama-3.3-nemotron-super-49b-v1.5
nemotron-3-nano
starcoder2-7b
Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.
Model-Free NIM#
Model-free NIM is a single container image that can serve any supported model from Hugging Face, NGC, Amazon S3, Google Cloud Storage, ModelScope, or a local directory. It generates a manifest at startup and enables day-zero support for newly released model architectures without waiting for a model-specific container. Refer to Model-Free NIM for details.
LoRA Adapter Serving#
Support for both static LoRA (adapters discovered at startup) and dynamic LoRA (adapters loaded and unloaded at runtime via directory watching or the /v1/load_lora_adapter API). Refer to Fine-Tuning with LoRA for details.
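As a sketch of the dynamic path (host, port, and adapter names are placeholders; the endpoint names follow vLLM's dynamic LoRA API, which this feature exposes):

```shell
# Load an adapter into a running server (placeholder name and path):
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/opt/nim/loras/my-adapter"}'

# Unload it when it is no longer needed:
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'
```

Once loaded, the adapter is addressed by passing its name in the model field of a standard completions or chat request.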
Advanced Inference Features#
Support for custom logits processing via pluggable processors mounted into the container, and prompt embeddings for privacy-preserving inference workflows where sensitive data is converted to embeddings before reaching the server.
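The actual plugin interface is defined by the container and documented separately; as a conceptual illustration only (all names here are hypothetical), a logits processor is a hook that rewrites the model's raw next-token scores before sampling:

```python
import math

def ban_tokens_processor(banned_ids):
    """Return a hypothetical logits processor that masks out banned token IDs.

    A logits processor receives the raw scores for the next token and returns
    a modified copy; setting a score to -inf makes that token unsampleable.
    """
    banned = set(banned_ids)

    def process(token_ids, logits):
        # token_ids: tokens generated so far (unused in this simple sketch)
        # logits: raw scores indexed by token ID
        return [-math.inf if i in banned else score
                for i, score in enumerate(logits)]

    return process

processor = ban_tokens_processor([1, 3])
masked = processor([], [0.5, 2.0, -1.0, 4.0])
# Token IDs 1 and 3 can no longer be sampled.
```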
Known Issues#
The following known issues and limitations are present in this release.
Llama 3.1 70B Instruct and Llama 3.3 70B Instruct May OOM with Default Sequence Length#
Llama 3.1 70B Instruct and Llama 3.3 70B Instruct may run out of memory (OOM) when using the default maximum sequence length.
Workaround: Set the NIM_MAX_MODEL_LEN environment variable to 100000 to reduce the sequence length.
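For example (all launch flags other than NIM_MAX_MODEL_LEN are placeholders drawn from a typical docker run invocation):

```shell
docker run --rm --gpus all \
  -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=100000 \
  ${NIM_IMAGE}
```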
Nemotron 3 Nano May Fail to Deploy on B300 TP8 NVFP4 and GB200 TP4 NVFP4#
Nemotron 3 Nano may fail to deploy when using the NVFP4 profile with TP8 on B300 GPU or TP4 on GB200 GPU. The deployment fails with a RuntimeError due to a failure to initialize the engine core.
Tool Calling Produces Double-Serialized JSON Arguments with anyOf Schemas#
In vLLM 0.17.1, tool calls with anyOf or oneOf parameter schemas return double-serialized JSON arguments if qwen3_coder is used as the tool call parser. This is fixed in upstream vLLM (PR #36032) and will ship in the next release that contains vLLM 0.18.0+.
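Until the fix ships, clients can unwrap the arguments defensively. A sketch of one approach (the helper name is ours; it assumes the arguments field arrives as a JSON string that may itself encode another JSON string):

```python
import json

def parse_tool_arguments(raw):
    """Parse tool-call arguments that may be double-serialized JSON.

    Normally `raw` is a JSON object string; under this bug it can be a JSON
    string whose contents are themselves a serialized JSON object.
    """
    value = json.loads(raw)
    # If decoding yielded another string, the payload was serialized twice.
    if isinstance(value, str):
        value = json.loads(value)
    return value

normal = parse_tool_arguments('{"path": "/tmp/a"}')
# Double-serialized arguments, as produced by the bug:
doubled = parse_tool_arguments(json.dumps(json.dumps({"path": "/tmp/a"})))
# Both calls yield the dict {'path': '/tmp/a'}.
```

This is forward-compatible: once the upstream fix lands, the second json.loads is simply never reached.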
download-to-cache --lora Fails to Download the Default LoRA Profile#
Running download-to-cache --lora fails with Error: None instead of downloading the default LoRA-capable profile. The profile selector does not resolve a LoRA profile, which causes the command to exit without downloading any artifacts.
Workaround: Use the --profiles flag with the explicit LoRA profile ID instead. To find the LoRA profile ID, run the following command:
docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
Identify the profile with feat_lora in its name, then download it explicitly:
docker run --rm --gpus all \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-e NGC_API_KEY=$NGC_API_KEY \
${NIM_IMAGE} download-to-cache --profiles ${LORA_PROFILE_ID}
Multi-Node Deployments Crash When logprobs Is Non-Null in the Request Body#
Multi-node inference using pipeline parallelism (PP > 1) crashes after processing a completions request that includes a non-null logprobs parameter. The initial request completes successfully, but the engine encounters a fatal RayChannelTimeoutError shortly thereafter. Refer to the upstream vLLM issue for more details.
FP8 Profile Selected on A10G and A100 GPUs Causes Startup Crash#
On A10G and A100 GPUs, when NIM_MODEL_PROFILE is not set, the FP8 profile can be selected even though FP8 quantization requires a minimum GPU compute capability of 8.9 (Ada Lovelace; Hopper and later also qualify). A10G (compute capability 8.6) and A100 (8.0) fall below this threshold, so the server crashes at startup with the following error:
pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
Value error, The quantization method modelopt is not supported for the current GPU.
Minimum capability: 89. Current capability: 86.
Workaround: Explicitly set NIM_MODEL_PROFILE to a BF16 profile to bypass automatic profile selection. To list available profiles and find the correct profile ID for your model, run the following command:
docker run --rm -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
Then start the server with the BF16 profile:
docker run ... -e NIM_MODEL_PROFILE=${BF16_PROFILE_ID} ...
FP8 MoE Models Have Degraded Performance on Blackwell GPUs#
Pure FP8 Mixture-of-Experts (MoE) models (for example, Nemotron Nano with FP8) experience degraded MoE throughput on Blackwell GPUs. Native Blackwell FP8 Grouped GEMM kernels are not yet available, so the FlashInfer autotuner falls back to older-generation CUTLASS kernels during startup. The server starts successfully, but MoE throughput is reduced.
Workaround: Use NVFP4 quantization instead of FP8 for MoE models on Blackwell. NVFP4 uses Blackwell-native TMA kernels and is unaffected by this issue.
NVFP4 MoE Models Fail to Start on GB10#
NVFP4 MoE models (for example, Nemotron Nano) on GB10 GPUs may crash during startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.
Workaround: Set VLLM_USE_FLASHINFER_MOE_FP4=0.