Release Notes#
This page lists changes, fixes, and known issues for each NIM LLM release.
Release 2.0.1#
Highlights#
This is the first release of the newly architected NIM LLM. Version 2.0 is a ground-up redesign that adopts a one-container, one-backend philosophy, replacing the multi-backend 1.x architecture. This release ships with vLLM 0.17.1 as the sole inference backend. Refer to the 1.x to 2.0 Migration Guide for details on upgrading.
Transparent vLLM Configuration#
NIM LLM exposes vllm serve as a first-class interface. Any argument that vllm serve accepts can be passed directly to the container, allowing backend tuning identical to standalone vLLM. Use --dry-run to preview the fully resolved configuration without launching the server. Refer to Advanced Configuration for details.
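As an illustrative sketch (the image name and tuning values below are placeholders, not recommendations), backend arguments are appended after the image name exactly as they would follow vllm serve:

```shell
# Placeholder values; any argument accepted by vllm serve can be appended here.
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  ${NIM_IMAGE} \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64 \
  --dry-run
```

With --dry-run, the container prints the fully resolved configuration and exits instead of starting the server, which is useful for verifying how your flags combine with the container defaults.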
Container-as-Binary CLI#
The container now behaves like a self-documenting command-line tool. Pass -h to any action to see its usage. Refer to the CLI Reference for details.
Broad GPU SKU Coverage#
This release is verified across 16 GPU SKUs spanning five NVIDIA architectures, from Ampere through Blackwell. Tested hardware includes A100, A10G, L40S, H100, H100 NVL, H200, H200 NVL, GH200, GB200, GB10, B200, B300, and RTX PRO Blackwell Server Edition cards.
Eight Model-Specific NIMs#
This release includes eight model-specific NIMs:
gpt-oss-120b
gpt-oss-20b
llama-3.1-70b-instruct
llama-3.1-8b-instruct
llama-3.3-70b-instruct
llama-3.3-nemotron-super-49b-v1.5
nemotron-3-nano
starcoder2-7b
Each container ships with curated model weights, validated quantization profiles, and optimal runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.
Model-Free NIM#
Model-free NIM is a single container image that can serve any supported model from Hugging Face, NGC, Amazon S3, Google Cloud Storage, ModelScope, or a local directory. It generates a manifest at startup and enables day-zero support for newly released model architectures without waiting for a model-specific container. Refer to Model-Free NIM for details.
LoRA Adapter Serving#
Support for both static LoRA (adapters discovered at startup) and dynamic LoRA (adapters loaded and unloaded at runtime via directory watching or the /v1/load_lora_adapter API). Refer to Fine-Tuning with LoRA for details.
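As a sketch of the dynamic path (host, port, and adapter names are placeholders; the endpoint names follow vLLM's dynamic LoRA API, which this feature exposes):

```shell
# Load an adapter into a running server (placeholder name and path):
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/opt/nim/loras/my-adapter"}'

# Unload it when it is no longer needed:
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'
```

Once loaded, the adapter is addressed by passing its name in the model field of a standard completions or chat request.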
Advanced Inference Features#
Support for custom logits processing via pluggable processors mounted into the container, and prompt embeddings for privacy-preserving inference workflows where sensitive data is converted to embeddings before reaching the server.
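The actual plugin interface is defined by the container and documented separately; as a conceptual illustration only (all names here are hypothetical), a logits processor is a hook that rewrites the model's raw next-token scores before sampling:

```python
import math

def ban_tokens_processor(banned_ids):
    """Return a hypothetical logits processor that masks out banned token IDs.

    A logits processor receives the raw scores for the next token and returns
    a modified copy; setting a score to -inf makes that token unsampleable.
    """
    banned = set(banned_ids)

    def process(token_ids, logits):
        # token_ids: tokens generated so far (unused in this simple sketch)
        # logits: raw scores indexed by token ID
        return [-math.inf if i in banned else score
                for i, score in enumerate(logits)]

    return process

processor = ban_tokens_processor([1, 3])
masked = processor([], [0.5, 2.0, -1.0, 4.0])
# Token IDs 1 and 3 can no longer be sampled.
```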
Known Issues#
The following known issues and limitations are present in this release.
Llama 3.1 70B Instruct and Llama 3.3 70B Instruct May OOM with Default Sequence Length#
Llama 3.1 70B Instruct and Llama 3.3 70B Instruct may run out of memory (OOM) when using the default maximum sequence length.
Workaround: Set the NIM_MAX_MODEL_LEN environment variable to 100000 to reduce the sequence length.
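For example (all launch flags other than NIM_MAX_MODEL_LEN are placeholders drawn from a typical docker run invocation):

```shell
docker run --rm --gpus all \
  -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=100000 \
  ${NIM_IMAGE}
```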
Nemotron 3 Nano May Fail to Deploy on B300 TP8 NVFP4 and GB200 TP4 NVFP4#
Nemotron 3 Nano may fail to deploy when using the NVFP4 profile with TP8 on B300 GPU or TP4 on GB200 GPU. The deployment fails with a RuntimeError due to a failure to initialize the engine core.
Tool Calling Produces Double-Serialized JSON Arguments with anyOf Schemas#
In vLLM 0.17.1, tool calls with anyOf or oneOf parameter schemas return double-serialized JSON arguments if qwen3_coder is used as the tool call parser. This is fixed in upstream vLLM (PR #36032) and will ship in the next release that contains vLLM 0.18.0+.
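Until the fix ships, clients can unwrap the arguments defensively. A sketch of one approach (the helper name is ours; it assumes the arguments field arrives as a JSON string that may itself encode another JSON string):

```python
import json

def parse_tool_arguments(raw):
    """Parse tool-call arguments that may be double-serialized JSON.

    Normally `raw` is a JSON object string; under this bug it can be a JSON
    string whose contents are themselves a serialized JSON object.
    """
    value = json.loads(raw)
    # If decoding yielded another string, the payload was serialized twice.
    if isinstance(value, str):
        value = json.loads(value)
    return value

normal = parse_tool_arguments('{"path": "/tmp/a"}')
# Double-serialized arguments, as produced by the bug:
doubled = parse_tool_arguments(json.dumps(json.dumps({"path": "/tmp/a"})))
# Both calls yield the dict {'path': '/tmp/a'}.
```

This is forward-compatible: once the upstream fix lands, the second json.loads is simply never reached.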
download-to-cache --lora Fails to Download the Default LoRA Profile#
Running download-to-cache --lora fails with Error: None instead of downloading the default LoRA-capable profile. The profile selector does not resolve a LoRA profile, which causes the command to exit without downloading any artifacts.
Workaround: Use the --profiles flag with the explicit LoRA profile ID instead. To find the LoRA profile ID, run the following command:
docker run --rm --gpus all -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
Identify the profile with feat_lora in its name, then download it explicitly:
docker run --rm --gpus all \
-v $LOCAL_NIM_CACHE:/opt/nim/.cache \
-e NGC_API_KEY=$NGC_API_KEY \
${NIM_IMAGE} download-to-cache --profiles ${LORA_PROFILE_ID}
Multi-Node Deployments Crash When logprobs Is Non-Null in the Request Body#
Multi-node inference using pipeline parallelism (PP > 1) crashes after processing a completions request that includes a non-null logprobs parameter. The initial request completes successfully, but the engine encounters a fatal RayChannelTimeoutError shortly thereafter. Refer to the upstream vLLM issue for more details.
FP8 Profile Selected on A10G and A100 GPUs Causes Startup Crash#
On A10G and A100 GPUs, when NIM_MODEL_PROFILE is not set, the FP8 profile can be selected even though FP8 quantization requires a minimum GPU compute capability of 8.9 (Ada Lovelace; Hopper and later also qualify). A10G (compute capability 8.6) and A100 (8.0) fall below this threshold, so the server crashes at startup with the following error:
pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
Value error, The quantization method modelopt is not supported for the current GPU.
Minimum capability: 89. Current capability: 86.
Workaround: Explicitly set NIM_MODEL_PROFILE to a BF16 profile to bypass automatic profile selection. To list available profiles and find the correct profile ID for your model, run the following command:
docker run --rm -e NGC_API_KEY=$NGC_API_KEY ${NIM_IMAGE} list-model-profiles
Then start the server with the BF16 profile:
docker run ... -e NIM_MODEL_PROFILE=${BF16_PROFILE_ID} ...
FP8 MoE Models Have Degraded Performance on Blackwell GPUs#
Pure FP8 Mixture-of-Experts (MoE) models (for example, Nemotron Nano with FP8) experience degraded MoE throughput on Blackwell GPUs. Native Blackwell FP8 Grouped GEMM kernels are not yet available, so the FlashInfer autotuner falls back to older-generation CUTLASS kernels during startup. The server starts successfully, but MoE throughput is reduced.
Workaround: Use NVFP4 quantization instead of FP8 for MoE models on Blackwell. NVFP4 uses Blackwell-native TMA kernels and is unaffected by this issue.
NVFP4 MoE Models Fail to Start on GB10#
NVFP4 MoE models (for example, Nemotron Nano) on GB10 GPUs may crash during startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.
Workaround: Set VLLM_USE_FLASHINFER_MOE_FP4=0.