Release Notes#

Release 1.7.0#

Summary#

DiffusionGemma 26B A4B IT#

This is the initial release of DiffusionGemma 26B A4B IT. DiffusionGemma 26B A4B IT is a 26B parameter mixed of experts (MoE) Diffusion VLM. For more information about this model, refer to the model card.

Gemma 4#

This is the initial release of Gemma-4-26B-A4B-IT and an updated release of Gemma 4 31B Instruct. Refer to the following model cards for more information on each size:

Cosmos 3 Reasoner#

This is the initial release of Cosmos 3 Reasoner. Nano (8B) and Super (32B) sizes of this model are available from a single download.

Key features of Cosmos 3 Reasoner include the following:

Efficient Video Sampling (EVS): Configurable frame pruning for performance optimization
Pre-decoded Video Frames: New video_frames content type lets you send pre-decoded JPEG frames as video input, skipping server-side video decode for faster TTFT

For more information on this model, refer to the following model cards:

Step 3.7 Flash#

This is the initial release of Step 3.7 Flash. For more information on this model, see the model card.

Mistral Medium 3.5#

This is the initial release of Mistral Medium 3.5. For more information on this model, see the model card.

NVIDIA Nemotron 3 Nano Omni#

This is the initial release of NVIDIA Nemotron 3 Nano Omni. For more information on this model, see the model card.

Qwen3.6#

This is the initial release of two sizes of Qwen3.6. Refer to the following model cards for more information on each size:

Kimi-K2.6#

This is the initial release of Kimi-K2.6. For more information on this model, see the model card.

Cosmos Reason2#

This is an updated release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

Key features of Cosmos Reason2 include the following:

Speculative Decoding: Built-in EAGLE support for faster inference
BYOC EAGLE Support: Train and use custom speculative decoding heads
Efficient Video Sampling (EVS): Configurable frame pruning for performance optimization
Pre-decoded Video Frames: New video_frames content type allows sending pre-decoded JPEG frames as video input, skipping server-side video decode for faster TTFT

For more information about this model, refer to the model cards:

Gemma 4 31B Instruct#

This is the initial release of Gemma 4 31B Instruct. For more information on this model, see the model card.

Mistral-Small-4-119B-2603#

This is the initial release of Mistral-Small-4-119B-2603. For more information on this model, see the model card.

Qwen3.5#

This is the initial release of several sizes of Qwen3.5. Refer to the following model cards for more information on each size:

Nemotron-Parse-v1.2#

This is an updated release of Nemotron Parse, now known as Nemotron-Parse-v1.2. The API has changed since the release of Nemotron Parse in version 1.5.0. For more information on Nemotron-Parse-v1.2, refer to the model card on Hugging Face.

Ministral 3 14B Instruct 2512#

This is the initial release of Ministral 3 14B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

Kimi-K2.5#

This is the initial release of Kimi-K2.5. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Limitations#

All models
- Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.
- When environment variable NIM_DISABLE_LOG_REQUESTS=0, the prompt_token_ids and prompt fields are empty in Completions API responses; prompt_token_ids is also empty in Chat Completions API responses.
DiffusionGemma 26B A4B IT
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
Gemma-4-26B-A4B-IT
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- Set environment variable NIM_ENABLE_MTP=1 to enable speculative decoding.
Gemma 4 31B Instruct
- GPUs built on the ARM architecture are not supported.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.1-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- With NVFP4 and thinking mode enabled, math problem outputs can have unstable or variable formatting (for example, inconsistent answer markers or extra latex symbols). This formatting may break formula evaluation. You should implement a formatting post-process step to normalize the final answer format and ensure correct scoring and results.
Cosmos 3 Reasoner
- Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.
- BF16 generic profiles for the Super (32B) model are not supported on L40S and RTX PRO 6000 GPUs due to memory constraints.
- BF16 profiles on the Super (32B) model require TP=2 on Hopper GPUs (H100, H100 PCIe, H100 NVL, H200, H200 NVL, GH200, H20). Only B200, GB200, and B300 support BF16 profiles on the Super (32B) model at TP=1.
- The L40S GPU does not support the Super (32B) model.
- The vLLM BF16 profile for the Nano (8B) model is not supported on L40S GPUs due to memory constraints. Use the FP8 or NVFP4 quantized profiles on L40S.
Step 3.7 Flash
- This model does not support video workloads.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the
Mistral Medium 3.5
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- Video input is not supported.
- Structured output is not supported.
NVIDIA Nemotron 3 Nano Omni
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- Requests containing large videos (> 1GB) may fail with a NanoNemotronVLProcessor error.
Qwen3.6-27B
- Video input is not supported.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the
Kimi-K2.6
- This model uses the SGLang backend.
- This model does not support video workloads.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
Qwen3.6-35B-A3B
- This model uses the SGLang backend.
- Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the
Cosmos Reason2
- Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.
- Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.
- BF16 generic profiles are not supported for L40, RTX 4500, and RTX PRO 4500 GPUs due to memory constraints.
- Tensor parallelism (TP=2) is not supported. All deployments use TP=1.
- Efficient Video Sampling (EVS) is only supported for the 8B model. The 2B model does not support EVS.
Mistral-Small-4-119B-2603
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- The ARM architecture is not supported.
- Only text and image inputs are supported.
Qwen3.5-35B-A3B
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- The /v1/responses endpoint is currently not supported.
- Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels.
- When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.
Qwen3.5-122B-A10B
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- The /v1/responses endpoint is currently not supported.
- Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels for video workloads.
- On NVIDIA H100 SXM, the NIM_MAX_BATCH_SIZE parameter is limited to 128. Increasing this value may result in out-of-memory errors in the Mamba layer state cache.
- Video input support is enabled by default. However, under high-concurrency and long ISL workloads, the model may experience performance degradation due to KV cache saturation. If video input support is not required, disable video input by setting NIM_MAX_VIDEOS_PER_PROMPT=0 for optimal performance.
- When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.
- When running on H100 GPUs, manually set the environment variable NIM_MANIFEST_PROFILE.
Qwen3.5-397B-A17B
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- The ARM architecture is not supported.
- Video input is not enabled by default. To enable, set NIM_MAX_VIDEOS_PER_PROMPT=1.
- Enabling video input can cause performance degradation for high ISL and high concurrency loads when KV cache is saturated.
Nemotron-Parse-v1.2
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- Only one image per request is supported.
- Text input is not supported.
- System messages are not supported.
- Video input is not supported.
- Guided decoding and tool calling are not supported.
- Multilingual support is not available for the model.
- Only x86_64 based GPUs are supported.
Ministral 3 14B Instruct 2512
- For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.
- The following parameters do not work as expected: chat_template, chat_template_kwargs, and media_io_kwargs.
- Sending an invalid value for the mm_processor_kwargs parameter results in an HTTP 500 error code.
- The function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.
- On L40S GPUs, maximum context length is 100k due to limited GPU memory. Other GPUs support up to 262,144 tokens.
- To ensure optimal response quality, do not use stop words in tool calling requests.
- The model may still attempt to make tool calls when tool_choice=none and tool definitions are present in the request.
- Sending an image in text-only prompts may result in an HTTP 500 error code. Use image_url instead.
Kimi-K2.5
- Guided decoding is not supported.
- Only text and image inputs are supported.
- You can only deploy this NIM using Docker. Helm is not supported.
- Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.
- This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.
- This model does not support the media_io_kwargs and mm_processor_kwargs parameters.
- The following metrics are not currently reported in the metrics endpoint:
  - vision_encoder_latency_seconds
  - request_image_count
  - image_size_pixels
  - gpu_cache_usage_perc

Release 1.6.0#

Summary#

Mistral Large 3 675B Instruct 2512#

This is the initial release of Mistral Large 3 675B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

Cosmos Reason2#

This is the initial release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

For more information about this model, refer to the model cards on Hugging Face:

Nemotron Nano 12B v2 VL#

This is an updated release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Changes from 1.5.0#

Nemotron Nano 12B v2 VL
- Added FP4 quantized profiles
- Enabled prefix caching

Limitations#

All models
- Setting NIM_TOKENIZER_MODE=slow is not supported.
- When passing an invalid image or video URL, the error code is 500 instead of 4xx.
- Video models have a minimum frame resolution of 128x128.
- Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.
Mistral Large 3 675B Instruct 2512
- For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.
- For Mistral models, you cannot override chat_template at the request level using the chat_template or chat_template_kwargs parameter.
- For Mistral models, the function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.
- To ensure optimal response quality, avoid using stop words in tool calling requests.
- For Mistral visual language models, the mm_processor_kwargs parameter is not supported at the request level.
Cosmos Reason2
- Speculative decoding on Blackwell chips is not supported.
- Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.
- Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.
Nemotron Nano 12B v2 VL
- If you set the min_tokens sampling parameter in a request, you should also set the max_tokens sampling parameter. Setting min_tokens alone causes the model to generate repetitive content.
- For L40S GPU deployments, disable KV cache reuse by setting environment variable NIM_ENABLE_KV_CACHE_REUSE=0 to prevent out of memory errors.

Release 1.5.0#

Summary#

Nemotron Nano 12B v2 VL#

This is the initial release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

Nemotron Parse#

This is an updated version of nemoretriever-parse, which was originally released in version 1.2.0. The updated model is now known as Nemotron Parse. For more information, refer to the Nemotron Parse Overview.

Supported Hardware#

Refer to the support matrix for the following models:

Limitations#

All models
- Setting NIM_TOKENIZER_MODE=slow is not supported.
- When passing an invalid image or video URL, the error code will be 500 instead of 4xx.
Nemotron Nano 12B v2 VL
- KV cache reuse between requests is not supported.
Nemotron Parse
- Only one image per request is supported.
- Text input is not supported.
- System messages are not supported.
- Output streaming is not supported.
- Video input is not supported.

Release 1.4.1#

Summary#

Cosmos Reason1 7B#

This is an updated release for Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following model:

Cosmos Reason1 7B

Changes from 1.4.0#

Cosmos Reason1 7B
- Fixed memory profiling for frames-to-tokens encoding.
- Fixed error that resulted in wrong decoding backend being used for h265 videos.
- Fixed memory leaks in video decoding.
- Added memory profiling for video decoding.
- Number of input video frames calculated with fps parameter can now be limited by the num_frames parameter.

Limitations#

Cosmos Reason1 7B
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- When passing an invalid image URL or parameters, the HTTP error code can be incorrect.
- Changing the guided decoding backend at runtime is not supported.
- The best_of parameter is not supported.
- FP8 profiles are not available on B200 and GB200.
- TP2 profiles can lead to higher TTFT than TP1 profiles in some cases.

Release 1.4.0#

Summary#

Cosmos Reason1 7B#

This is the initial release of Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Maverick 17B 128E Instruct#

This is the initial release of Llama 4 Maverick 17B 128E Instruct. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

Cosmos Reason1 7B
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- When passing an invalid image URL or parameters, the HTTP error code can be incorrect.
- Changing the guided decoding backend at runtime is not supported
- The best_of parameter is not supported
- Sending long videos (multiple minutes) can lead to timeouts
- FP8 profiles are not available on B200 and GB200
- TP2 profiles can lead to higher TTFT than TP1 profiles in some cases
Llama 4 Maverick 17B 128E Instruct
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- There’s a time to first token (TTFT) degradation observed with concurrency >= 16. End-to-end latency isn’t affected.

Release 1.3.2#

Summary#

Llama 4 Scout 17B 16E Instruct#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following model:

Llama 4 Scout 17B 16E Instruct

Note

Llama 4 models are not available in the EU.

Changes from 1.3.1#

Llama 4 Scout 17B 16E Instruct
- Tool calling is now supported.

Limitations#

Llama 4 Scout 17B 16E Instruct
- If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.
- The LlamaStack API is not supported.
- PEFT is not supported.
- The default maximum sequence length is 131k
- Following Meta’s guidance, each request supports up to 5 images by default
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper
- When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.1#

Summary#

Mistral Small 3.2#

This is the initial release of Mistral Small 3.2. For more information about this model, refer to the model card on build.nvidia.com.

Llama Nemotron Nano VL#

This is an updated release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Scout 17B 16E Instruct#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Changes from 1.3.0#

Llama Nemotron Nano VL
- Fixed performance issues with SFT and FP8
- Improved air-gap deployment of SFT checkpoints
Llama 4 Scout 17B 16E Instruct
- Introduced generic TP2 profiles for deployment on H200 NVL

Limitations#

Mistral Small 3.2 24B Instruct 2506
- The LlamaStack API is not supported.
- Structured generation is not supported.
Llama Nemotron Nano VL
- The LlamaStack API is not supported.
- PEFT is not supported.
Llama 4 Scout 17B 16E Instruct
- The LlamaStack API is not supported.
- PEFT is not supported.
- Tool calling is not supported.
- The default maximum sequence length is 131k
- Following Meta’s guidance, each request supports up to 5 images by default
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper
- When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.0#

Summary#

Llama Nemotron Nano VL#

This is the initial release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Scout 17B 16E Instruct#

This is the initial release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

Llama Nemotron Nano VL
- The LlamaStack API is not supported.
- PEFT is not supported.
Llama 4 Scout 17B 16E Instruct
- The LlamaStack API is not supported.
- PEFT is not supported.
- Tool calling is not supported.
- The default maximum sequence length is 131k
- Following Meta’s guidance, each request supports up to 5 images by default
- Accuracy of text-only requests can be lower on FP8 profiles on Hopper
- When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.2.0#

Summary#

nemoretriever-parse#

This is the initial release of nemoretriever-parse. For more information, refer to the nemoretriever-parse Overview.

Supported Hardware#

Refer to the support matrix for the following model:

nemoretriever-parse

Limitations#

Only one image per request is supported.
Text input is not allowed.
System messages are not allowed.

Release 1.1.1#

Summary#

This patch release fixes CUDA runtime errors seen on AWS and Azure instances.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

PEFT is not supported.
Following Meta’s guidance, function calling is not supported.
Following Meta’s guidance, only one image per request is supported.
Following Meta’s guidance, system messages are not allowed with images.
Following the official vLLM implementation, images are always added to the front of user messages.
Maximum concurrency can be low when using the vLLM backend.
Image and vision encoder Prometheus metrics are not available with the vLLM backend.
With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

Release 1.1.0#

Summary#

This is the 1.1.0 release of NIM for VLMs.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

PEFT is not supported.
Following Meta’s guidance, function calling is not supported.
Following Meta’s guidance, only one image per request is supported.
Following Meta’s guidance, system messages are not allowed with images.
Following the official vLLM implementation, images are always added to the front of user messages.
Maximum concurrency can be low when using the vLLM backend.
Image and vision encoder Prometheus metrics are not available with the vLLM backend.
With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.
When deploying an optimized profile on AWS A10G, you might encounter the error [TensorRT-LLM][ERROR] ICudaEngine::createExecutionContextWithoutDeviceMemory: Error Code 1: Cuda Runtime (an illegal memory access was encountered). Use the vLLM backend instead as described in Profile Selection.