Release Notes for NIM Day 0#

This page lists updates, fixes, and known issues for NIM Day 0 releases.

Release 2.0.5#

Highlights#

The following LLM NIM is now available:

This release is built on the vLLM 0.20.2 inference backend.

Performance#

With Multi-Token Prediction (MTP) speculative decoding enabled (refer to Speculative Decoding with Multi-Token Prediction), Nemotron 3 Ultra 550B-A55B delivers approximately 17% higher throughput on chat workloads and approximately 26% higher throughput on software engineering (SWE) workloads compared to open-source vLLM. One observed limitation: with the maximum reasoning budget enabled, GPQA accuracy decreases by approximately 5 percentage points.

Known Issues#

This release includes the following known issues and limitations:

  • Nemotron 3 Ultra 550B-A55B

    The following items are inherited from the vLLM backend or the model checkpoint configuration. You can use these items to configure and validate deployments intentionally.

    • Reasoning parser: Start the container with --reasoning-parser nemotron_v3 when you want vLLM to parse Nemotron 3 reasoning output. Without a reasoning parser, non-JSON responses can include <think> content in the assistant content field, the reasoning field can be null, and reasoning-token accounting can be reported as regular output tokens.

    • Thinking budget: This model supports thinking_token_budget when thinking is enabled through chat_template_kwargs: {"enable_thinking": true}. thinking_token_budget limits the thinking portion of generation, while max_tokens still caps total generated tokens for the request. If a response ends before producing a final answer, increase max_tokens, increase thinking_token_budget, or simplify the prompt.

    • Responses token caps: For /v1/responses, max_output_tokens covers the total generated output budget. If the request spends the budget before producing final visible text, the response can return status: "incomplete". Increase max_output_tokens for prompts that need more reasoning or longer final answers.

    • JSON mode: response_format: {"type": "json_object"} can suppress visible reasoning output and is recommended for structured responses that you validate with a schema.

    • Tool calling: Requests that use "tool_choice": "auto" require the container to start with --enable-auto-tool-choice and --tool-call-parser qwen3_coder. Pass these using NIM_PASSTHROUGH_ARGS in Docker, Kubernetes, or other orchestrated environments.

    • Generation defaults: The model checkpoint can provide generation_config.json defaults, including top_p=0.95. Override sampling parameters per request when you need a different value. To ignore model-provided generation defaults and use vLLM defaults, add --generation-config vllm to NIM_PASSTHROUGH_ARGS.

    • Profile verification: Use the support matrix as the source of truth for supported GPU, precision, TP, and PP combinations. After startup, verify the selected profile and served model with /v1/metadata, /v1/models, and the startup logs.

    • Multi-node Ray executor: For multi-node profiles, such as the H100 BF16 TP=8, PP=2 profile, add VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1. This uses the vLLM Ray V2 executor backend to avoid an upstream Ray Compiled Graph deadlock tracked by Ray issue #58426 while Ray PR #58866 remains open.

    • Backend validation errors: Some malformed requests, such as empty or missing messages, are validated by the backend and can return backend error wording. Send a non-empty messages array for Chat Completions and Anthropic Messages requests.

    • Scope: Nemotron 3 Ultra 550B-A55B is a text-only model. Use it for text reasoning, coding, tool calling, and agentic workflows. It is not a multimodal image, audio, or video model.