Is this page helpful?

Release Notes#

Release 2.0.4-variant#

This release contains model updates outlined in the following sections.

Qwen3.5-397B-A17B#

This is the initial release of Qwen3.5-397B-A17B. For more information on this model, refer to the model card.

For GPU support, refer to the support matrix for Qwen3.5-397B-A17B.

Note the following limitations:

Qwen3.5-397B-A17B
- max_num_seqs is set to 512 by default in NIM. You can override this value by specifying the --max-num-seqs command-line option when starting the NIM server. Increasing max_num_seqs may increase GPU memory usage and can result in out-of-memory (OOM) errors for some deployment profiles.

Qwen3.5-122B-A10B#

This is an updated release of Qwen3.5-122B-A10B. For more information on this model, refer to the model card.

For GPU support, refer to the support matrix for Qwen3.5-122B-A10B.

Note the following limitations:

Qwen3.5-122B-A10B
- max_num_seqs is set to 512 by default in NIM. You can override this value by specifying the --max-num-seqs command-line option when starting the NIM server. Increasing max_num_seqs may increase GPU memory usage and can result in out-of-memory (OOM) errors for some deployment profiles.
- Include the --gpu-memory-utilization 0.9 flag when launching the model on NVIDIA DGX Spark (GB10).

Kimi-K2.6#

This is an updated release of Kimi-K2.6. For more information on this model, refer to the model card.

For GPU support, refer to the support matrix for Kimi-K2.6.

Note the following limitations:

Kimi-K2.6
- Only the INT4 precision profile is supported. BF16 and FP8 are not provided.
- First-time container start downloads a ~554 GB NGC artifact; allow ~60-90 min for cold start on a fast NVMe cache. Subsequent restarts reuse the local cache.
- Requests to the /v1/completions endpoint with a blank prompt return an empty text field. Use the /v1/chat/completions endpoint instead.
- A request to the /v1/chat/completions endpoint that includes an empty structured_outputs.json schema (“”) may crash the underlying vLLM engine and terminate the container.
- list-model-profiles may classify a profile as runnable on SKUs not listed in the support matrix; deployment may fail with a CUDA out-of-memory error.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning#

This is the initial release of Nemotron-3-Nano-Omni-30B-A3B-Reasoning. This NIM is part of the NIM Certified offering. For more information on this model, refer to the model card.

For GPU support, refer to the support matrix for Nemotron-3-Nano-Omni-30B-A3B-Reasoning.

Note the following limitations:

Nemotron-3-Nano-Omni-30B-A3B-Reasoning
- A request to the /v1/chat/completions endpoint that includes an empty structured_outputs.json schema (“”) may crash the underlying vLLM engine and terminate the container.
- Requests to the /v1/completions endpoint with a blank prompt return an empty text field. Use the /v1/chat/completions endpoint instead.
- Requests containing large videos (typically > ~150 MB, > ~2 minutes at 1080p, or with resolution beyond 1080p) may return HTTP 500 due to a CUDA out-of-memory error and may affect server stability.

For information about past updates and older versions, refer to the previous release notes.