Release Notes#

This page lists changes, fixes, and known issues for each NIM LLM release.

Release 2.0.3#

Highlights#

This release upgrades the inference backend to vLLM 0.19.0, adds official support for the Anthropic Messages API, improves AWS SageMaker compatibility, and makes GPU memory estimation more accurate.

vLLM 0.19.0#

The inference backend is updated from vLLM 0.18.0 to vLLM 0.19.0, bringing improvements to functionality, performance, and stability.

Anthropic Messages API Support#

NIM LLM officially supports the /v1/messages endpoint, enabling integration with Anthropic client SDKs and tools such as Claude Code. Refer to the Messages (Anthropic-compatible) API reference and the Use Claude Code with NIM integration guide for details.
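For example, a minimal request to a locally running NIM might look like the following. The port (8000) and model name are assumptions; adjust them for your deployment.

    curl -X POST http://localhost:8000/v1/messages \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "max_tokens": 256,
        "messages": [
          {"role": "user", "content": "Write a haiku about GPUs."}
        ]
      }'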

Improved AWS SageMaker Compatibility#

NIM LLM now natively implements the AWS SageMaker BYOC (Bring Your Own Container) protocol. When SageMaker mode is active, NIM listens on port 8080, responds to GET /ping for health checks, and accepts inference requests at POST /invocations. SageMaker mode is detected automatically from the SageMaker host agent’s environment variables and can be explicitly controlled with NIM_SAGEMAKER_MODE. Refer to the SageMaker Deployment guide for details.
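For example, with SageMaker mode active, the container can be exercised as shown below. The request body assumes the OpenAI-compatible chat completions schema is accepted at /invocations, and the model name is a placeholder; refer to the SageMaker Deployment guide for the exact request format.

    # Health check polled by SageMaker
    curl http://localhost:8080/ping

    # Inference request
    curl -X POST http://localhost:8080/invocations \
      -H "Content-Type: application/json" \
      -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'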

Arbitrary UID with GID 0 for Container User Configuration#

NIM LLM containers now support running as an arbitrary user ID as long as the group ID is 0. This enables deployment on platforms that assign random UIDs to containers (such as OpenShift) without requiring additional privileges. Refer to Configuration for details.
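As a minimal sketch, the following command starts the container as UID 10001 in group 0. The UID is arbitrary, and the image name and tag are placeholders; substitute the model-specific NIM image you deploy.

    docker run --rm --gpus all \
      --user 10001:0 \
      -e NGC_API_KEY \
      -p 8000:8000 \
      nvcr.io/nim/meta/llama-3.1-8b-instruct:2.0.3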

Improved GPU Memory Estimation#

GPU memory estimation now uses a data-driven overhead regression fit to 408 golden-model measurements, with separate coefficients for CUDA graph modes (on and off). The update also corrects weight estimation for mxfp4 and nvfp4 quantization and adds ECC and initialization overhead constants tuned for NVLink and PCIe topologies. The result is more accurate memory sizing and fewer spurious out-of-memory failures across supported GPUs.

Model-Specific NIM Updates#

This release includes updated 2.0.3 versions of the following model-specific NIMs:

  • gpt-oss-120b

  • gpt-oss-20b

  • llama-3.1-70b-instruct

  • llama-3.1-8b-instruct

  • llama-3.3-70b-instruct

  • llama-3.3-nemotron-super-49b-v1.5

  • nemotron-3-nano

  • nemotron-3-super-120b-a12b

  • starcoder2-7b

Each container ships with curated model weights, validated quantization profiles, and optimized runtime configurations. Refer to the Support Matrix for supported profiles and verified GPUs.

Model-Free NIM Update#

This release includes an updated 2.0.3 version of Model-Free NIM. Refer to Model-Free NIM for details.

Bug Fixes#

This release includes the following bug fixes:

  • The Swagger UI and OpenAPI specification now display the correct NIM release version. Previously, the reported version did not match the installed NIM release.

  • On UMA GPUs (such as GB200, GH200, and GB10), the busy GPU check no longer overrides the UMA safety clamp when the free-memory ratio falls in the narrow range just above the UMA safety factor. This ensures GPU memory utilization is not set below the UMA safety threshold.

  • Proxy timeouts have been adjusted for long-running inference. The inference request timeout is increased from 300 seconds to 14,400 seconds (four hours) to prevent premature termination of long requests.

  • The PyTorch Inductor compilation cache is now written to /tmp/torchinductor, avoiding writes to the default NIM cache directory.

  • vLLM /health, /version, and /metrics endpoints are now exposed through the NIM proxy, in addition to their /v1/-prefixed counterparts (see the example after this list).

  • On Blackwell GPUs (B200, B300, GB200), FlashInfer can now write runtime-downloaded cubins when the container runs as a non-root user.
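The following commands illustrate the endpoint exposure described above; port 8000 is an assumption based on the default NIM port.

    # Unprefixed vLLM endpoints, now exposed through the NIM proxy
    curl http://localhost:8000/health
    curl http://localhost:8000/version
    curl http://localhost:8000/metrics

    # /v1/-prefixed counterparts continue to work
    curl http://localhost:8000/v1/health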

Known Issues#

This release includes the following known issues and limitations:

  • nemotron-3-nano LoRA profiles crash on H100, H200, GH200, and Blackwell GPUs with a CUDA illegal memory access error during inference. The vLLM engine terminates, and the server returns HTTP 500 errors for all subsequent requests. Non-LoRA profiles and older GPUs (A100, L40S) are not affected. There is no workaround for this issue in NIM 2.0.3.

  • An unreported vLLM issue might cause the vLLM worker to fail on four or eight H100-NVL GPUs with an EngineDeadError.

    You might encounter the following error message:

    Worker proc VllmWorker-2 died unexpectedly, shutting down executor.
    RuntimeError: cancelled (shm_broadcast.py:677 acquire_read)
    vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
    
  • On UMA GPUs, such as GB200, GH200, and GB10 (DGX Spark), PyTorch counts cached and buffered OS memory as used GPU memory. This might cause vLLM to underestimate available GPU memory at startup and fail with ValueError: Free memory on device cuda:N ... is less than desired GPU memory utilization. This is fixed in vLLM v0.19.1 and will be resolved in the next NIM LLM release.

    Workaround: Drop the OS page cache before starting the container.

    Run the following command:

    sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
    
  • NVFP4 MoE models (for example, nemotron-3-nano) on GB10 and RTX PRO 4500 Blackwell Server Edition GPUs might crash at startup with CUDA error: misaligned address during full CUDA graph capture. Dense NVFP4 models are unaffected.

    Workaround: Disable FlashInfer NVFP4 MoE kernels.

    Set the following environment variable before starting the container:

    export VLLM_USE_FLASHINFER_MOE_FP4=0
    

For information about past updates and older versions, refer to the previous release notes.