Release Notes#

Release 1.7.0#

Summary#

This is the initial release of Qwen3.5-397B-A17B. For more information on this model, refer to the model card on build.nvidia.com.

This is the initial release of Kimi-K2.5. For more information on this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

Qwen3.5-397B-A17B
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- The ARM architecture is not supported.
- Video input is not enabled by default. To enable, set NIM_MAX_VIDEOS_PER_PROMPT=1.
- Enabling video input can degrade performance under loads with long input sequence lengths (ISL) and high concurrency when the KV cache is saturated.
Kimi-K2.5
- Guided decoding is not supported.
- Only text and image inputs are supported.
- You can only deploy this NIM using Docker. Helm is not supported.
- Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- This model does not support the media_io_kwargs and mm_processor_kwargs parameters.
- The following metrics are not currently reported in the metrics endpoint:
  - vision_encoder_latency_seconds
  - request_image_count
  - image_size_pixels
  - gpu_cache_usage_perc
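
The Unicode filtering recommended in the Kimi-K2.5 limitations above could be done on the client before a prompt is submitted. The following is a minimal sketch, assuming the range 0x0e0020 to 0x0e007f refers to code points U+E0020 through U+E007F inclusive. The helper name `strip_tag_characters` is hypothetical and is not part of the NIM API.

```python
# Hypothetical client-side helper; not part of the NIM API.
# Assumption: the documented range is inclusive at both ends.
RANGE_START = 0xE0020
RANGE_END = 0xE007F


def strip_tag_characters(prompt: str) -> str:
    """Return the prompt with code points U+E0020..U+E007F removed."""
    return "".join(
        ch for ch in prompt
        if not (RANGE_START <= ord(ch) <= RANGE_END)
    )
```

Applying the function to every user-supplied string before building the request keeps the rest of the prompt, including other non-ASCII text, unchanged.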

Release 1.6.0#

Summary#

This is the initial release of Mistral Large 3 675B Instruct 2512. For more information on this model, refer to the model card on build.nvidia.com.

This is the initial release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

For more information on this model, refer to the model cards on Hugging Face.

This is an updated release of Nemotron Nano 12B v2 VL. For more information on this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Changes from 1.5.0#

Nemotron Nano 12B v2 VL
- Added FP4 quantized profiles
- Enabled prefix caching

Limitations#

All models
- Setting NIM_TOKENIZER_MODE=slow is not supported.
- When passing an invalid image or video URL, the error code is 500 instead of 4xx.
- Video models have a minimum frame resolution of 128x128.
- Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.
Mistral Large 3 675B Instruct 2512
- For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.
- For Mistral models, you cannot override chat_template at the request level using the chat_template or chat_template_kwargs parameter.
- For Mistral models, the function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.
- To ensure optimal response quality, avoid using stop words in tool calling requests.
- For Mistral visual language models, the mm_processor_kwargs parameter is not supported at the request level.
Cosmos Reason2
- Speculative decoding on Blackwell chips is not supported.
- Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.
- Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.
Nemotron Nano 12B v2 VL
- If you set the min_tokens sampling parameter in a request, you should also set the max_tokens sampling parameter. Setting min_tokens alone causes the model to generate repetitive content.
- For L40S GPU deployments, disable KV cache reuse by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE=0 to prevent out-of-memory errors.
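
The min_tokens guidance for Nemotron Nano 12B v2 VL above can be enforced on the client before a request is sent. The following is a minimal sketch. The helper `build_sampling_params` is hypothetical, and the check that max_tokens is at least min_tokens is an added assumption rather than a documented requirement.

```python
from typing import Optional


def build_sampling_params(min_tokens: Optional[int] = None,
                          max_tokens: Optional[int] = None) -> dict:
    """Build sampling parameters, rejecting min_tokens without max_tokens."""
    if min_tokens is not None and max_tokens is None:
        raise ValueError("Set max_tokens whenever min_tokens is set.")
    # Assumption: a max_tokens below min_tokens is never intended.
    if min_tokens is not None and max_tokens < min_tokens:
        raise ValueError("max_tokens must be at least min_tokens.")
    params = {}
    if min_tokens is not None:
        params["min_tokens"] = min_tokens
    if max_tokens is not None:
        params["max_tokens"] = max_tokens
    return params
```

The returned dictionary can be merged into the body of a request, so that a request with min_tokens alone is never sent.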

Release 1.5.0#

Summary#

This is the initial release of Nemotron Nano 12B v2 VL. For more information on this model, refer to the model card on build.nvidia.com.

This is an updated version of nemoretriever-parse, which was originally released in version 1.2.0. The updated model is now known as Nemotron Parse. For more information, refer to the Nemotron Parse Overview.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

All models
- Setting NIM_TOKENIZER_MODE=slow is not supported.
- When passing an invalid image or video URL, the error code is 500 instead of 4xx.
Nemotron Nano 12B v2 VL
- KV cache reuse between requests is not supported.
Nemotron Parse
- Only one image per request is supported.
- Text input is not supported.
- System messages are not supported.
- Output streaming is not supported.
- Video input is not supported.
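
Because an invalid image or video URL currently produces a 500 response instead of a 4xx response, a client might reject obviously malformed URLs before sending a request. The following is a minimal sketch. It assumes that media is referenced by http, https, or base64 data URLs, and the helper name `looks_like_valid_media_url` is hypothetical. It checks only the shape of the URL, not whether the resource exists.

```python
from urllib.parse import urlsplit


def looks_like_valid_media_url(url: str) -> bool:
    """Return True if the URL has a plausible shape for an image or video."""
    # Assumption: data URLs carry the payload after the first comma.
    if url.startswith("data:"):
        return "," in url
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed hosts.
        return False
    # Assumption: only http and https are accepted for remote media.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

A client that runs this check first can report a clear validation error locally instead of relying on the status code returned by the service.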

Release 1.4.1#

Summary#

This is an updated release for Cosmos Reason1 7B. For more information on this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following model:

Cosmos Reason1 7B

Changes from 1.4.0#

Cosmos Reason1 7B
- Fixed memory profiling for frames-to-tokens encoding.
- Fixed an error that caused the wrong decoding backend to be used for H.265 videos.
- Fixed memory leaks in video decoding.
- Added memory profiling for video decoding.
- The number of input video frames calculated from the fps parameter can now be limited with the num_frames parameter.
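
The interaction between the fps and num_frames parameters can be illustrated with a short calculation. This is a sketch only. It assumes that frames are sampled at fps over the full video duration and that num_frames acts as an upper bound. The function name `sampled_frame_count` is hypothetical, and the actual sampling logic in the NIM may differ.

```python
import math
from typing import Optional


def sampled_frame_count(duration_s: float, fps: float,
                        num_frames: Optional[int] = None) -> int:
    """Estimate how many frames are sampled from a video."""
    # Assumption: at least one frame is always sampled.
    frames = max(1, math.floor(duration_s * fps))
    if num_frames is not None:
        # Assumption: num_frames caps the fps-derived count.
        frames = min(frames, num_frames)
    return frames
```

Under these assumptions, a 60-second video sampled at 4 frames per second yields 240 frames, and passing num_frames=32 caps that count at 32.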

Limitations#

Cosmos Reason1 7B
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- When passing an invalid image URL or parameters, the HTTP error code can be incorrect.
- Changing the guided decoding backend at runtime is not supported.
- The best_of parameter is not supported.
- FP8 profiles are not available on B200 and GB200.
- TP2 profiles can lead to higher TTFT than TP1 profiles in some cases.

Release 1.4.0#

Summary#

This is the initial release of Cosmos Reason1 7B. For more information on this model, refer to the model card on build.nvidia.com.

This is the initial release of Llama 4 Maverick 17B 128E Instruct. For more information on this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

Cosmos Reason1 7B
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- When passing an invalid image URL or parameters, the HTTP error code can be incorrect.
- Changing the guided decoding backend at runtime is not supported.
- The best_of parameter is not supported.
- Sending long videos (multiple minutes) can lead to timeouts.
- FP8 profiles are not available on B200 and GB200.
- TP2 profiles can lead to higher TTFT than TP1 profiles in some cases.
Llama 4 Maverick 17B 128E Instruct
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- Time to first token (TTFT) degrades at concurrency >= 16. End-to-end latency is not affected.

Release 1.3.2#

Summary#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following model:

Llama 4 Scout 17B 16E Instruct

Changes from 1.3.1#

Llama 4 Scout 17B 16E Instruct
- Tool calling is now supported.

Limitations#

Llama 4 Scout 17B 16E Instruct
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- The default maximum sequence length is 131k.
- Following Meta’s guidance, each request supports up to 5 images by default.
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper.
- When passing an invalid image URL, the error code is 500 instead of 4xx.

Release 1.3.1#

Summary#

This is the initial release of Mistral Small 3.2. For more information on this model, refer to the model card on build.nvidia.com.

This is an updated release of Llama Nemotron Nano VL. For more information on this model, refer to the model card on build.nvidia.com.

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following models:

Changes from 1.3.0#

Llama Nemotron Nano VL
- Fixed performance issues with SFT and FP8.
- Improved air-gap deployment of SFT checkpoints.
Llama 4 Scout 17B 16E Instruct
- Introduced generic TP2 profiles for deployment on H200 NVL.

Limitations#

Mistral Small 3.2 24B Instruct 2506
- The LlamaStack API is not supported.
- Structured generation is not supported.
Llama Nemotron Nano VL
- The LlamaStack API is not supported.
- PEFT is not supported.
Llama 4 Scout 17B 16E Instruct
- The LlamaStack API is not supported.
- PEFT is not supported.
- Tool calling is not supported.
- The default maximum sequence length is 131k.
- Following Meta’s guidance, each request supports up to 5 images by default.
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper.
- When passing an invalid image URL, the error code is 500 instead of 4xx.

Release 1.3.0#

Summary#

This is the initial release of Llama Nemotron Nano VL. For more information on this model, refer to the model card on build.nvidia.com.

This is the initial release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

Llama Nemotron Nano VL
- The LlamaStack API is not supported.
- PEFT is not supported.
Llama 4 Scout 17B 16E Instruct
- The LlamaStack API is not supported.
- PEFT is not supported.
- Tool calling is not supported.
- The default maximum sequence length is 131k.
- Following Meta’s guidance, each request supports up to 5 images by default.
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper.
- When passing an invalid image URL, the error code is 500 instead of 4xx.

Release 1.2.0#

Summary#

This is the initial release of nemoretriever-parse. For more information, refer to the nemoretriever-parse Overview.

Visual Language Models#

Refer to the support matrix for the following model:

nemoretriever-parse

Limitations#

- Only one image per request is supported.
- Text input is not allowed.
- System messages are not allowed.
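
The three limitations above can be checked on the client before a request is sent to nemoretriever-parse. The following is a minimal sketch. It assumes OpenAI-style chat messages whose content is a list of typed parts ("image_url" or "text"), and it treats plain string content as text input. The helper name `check_parse_request` is hypothetical.

```python
def check_parse_request(messages: list) -> list:
    """Return a list of problems; an empty list means the request passes."""
    problems = []
    image_count = 0
    for message in messages:
        if message.get("role") == "system":
            problems.append("System messages are not allowed.")
            continue
        content = message.get("content")
        # Assumption: plain string content counts as text input.
        if isinstance(content, str):
            problems.append("Text input is not allowed.")
            continue
        for part in content or []:
            if part.get("type") == "image_url":
                image_count += 1
            elif part.get("type") == "text":
                problems.append("Text input is not allowed.")
    if image_count != 1:
        problems.append("Exactly one image per request is required.")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.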

Release 1.1.1#

Summary#

This patch release fixes CUDA runtime errors seen on AWS and Azure instances.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

- PEFT is not supported.
- Following Meta’s guidance, function calling is not supported.
- Following Meta’s guidance, only one image per request is supported.
- Following Meta’s guidance, system messages are not allowed with images.
- Following the official vLLM implementation, images are always added to the front of user messages.
- Maximum concurrency can be low when using the vLLM backend.
- Image and vision encoder Prometheus metrics are not available with the vLLM backend.
- With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

Release 1.1.0#

Summary#

This is the 1.1.0 release of NIM for VLMs.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

- PEFT is not supported.
- Following Meta’s guidance, function calling is not supported.
- Following Meta’s guidance, only one image per request is supported.
- Following Meta’s guidance, system messages are not allowed with images.
- Following the official vLLM implementation, images are always added to the front of user messages.
- Maximum concurrency can be low when using the vLLM backend.
- Image and vision encoder Prometheus metrics are not available with the vLLM backend.
- With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.
- When deploying an optimized profile on AWS A10G, you might encounter the error [TensorRT-LLM][ERROR] ICudaEngine::createExecutionContextWithoutDeviceMemory: Error Code 1: Cuda Runtime (an illegal memory access was encountered). Use the vLLM backend instead, as described in Profile Selection.