Release Notes for NIM Day 0#
This page lists updates, fixes, and known issues for NIM Day 0 releases.
Release 2.0.5#
Highlights#
The following LLM NIM is now available:
This release is built on the vLLM 0.20.2 inference backend.
Performance#
With Multi-Token Prediction (MTP) speculative decoding enabled (refer to Speculative Decoding with Multi-Token Prediction), Nemotron 3 Ultra 550B-A55B delivers approximately 17% higher throughput on chat workloads and approximately 26% higher throughput on software engineering (SWE) workloads compared to open-source vLLM. One observed limitation: with the maximum reasoning budget enabled, GPQA accuracy decreases by approximately 5 percentage points.
Known Issues#
This release includes the following known issues and limitations:
-
The following items are inherited from the vLLM backend or the model checkpoint configuration. You can use these items to configure and validate deployments intentionally.
Reasoning parser: Start the container with
--reasoning-parser nemotron_v3when you want vLLM to parse Nemotron 3 reasoning output. Without a reasoning parser, non-JSON responses can include<think>content in the assistantcontentfield, thereasoningfield can benull, and reasoning-token accounting can be reported as regular output tokens.Thinking budget: This model supports
thinking_token_budgetwhen thinking is enabled throughchat_template_kwargs: {"enable_thinking": true}.thinking_token_budgetlimits the thinking portion of generation, whilemax_tokensstill caps total generated tokens for the request. If a response ends before producing a final answer, increasemax_tokens, increasethinking_token_budget, or simplify the prompt.Responses token caps: For
/v1/responses,max_output_tokenscovers the total generated output budget. If the request spends the budget before producing final visible text, the response can returnstatus: "incomplete". Increasemax_output_tokensfor prompts that need more reasoning or longer final answers.JSON mode:
response_format: {"type": "json_object"}can suppress visible reasoning output and is recommended for structured responses that you validate with a schema.Tool calling: Requests that use
"tool_choice": "auto"require the container to start with--enable-auto-tool-choiceand--tool-call-parser qwen3_coder. Pass these usingNIM_PASSTHROUGH_ARGSin Docker, Kubernetes, or other orchestrated environments.Generation defaults: The model checkpoint can provide
generation_config.jsondefaults, includingtop_p=0.95. Override sampling parameters per request when you need a different value. To ignore model-provided generation defaults and use vLLM defaults, add--generation-config vllmtoNIM_PASSTHROUGH_ARGS.Profile verification: Use the support matrix as the source of truth for supported GPU, precision, TP, and PP combinations. After startup, verify the selected profile and served model with
/v1/metadata,/v1/models, and the startup logs.Multi-node Ray executor: For multi-node profiles, such as the H100 BF16 TP=8, PP=2 profile, add
VLLM_USE_RAY_V2_EXECUTOR_BACKEND=1. This uses the vLLM Ray V2 executor backend to avoid an upstream Ray Compiled Graph deadlock tracked by Ray issue #58426 while Ray PR #58866 remains open.Backend validation errors: Some malformed requests, such as empty or missing
messages, are validated by the backend and can return backend error wording. Send a non-emptymessagesarray for Chat Completions and Anthropic Messages requests.Scope: Nemotron 3 Ultra 550B-A55B is a text-only model. Use it for text reasoning, coding, tool calling, and agentic workflows. It is not a multimodal image, audio, or video model.