Release Notes#

Release 1.7.0#

Summary#

Mistral Medium 3.5#

This is the initial release of Mistral Medium 3.5. For more information on this model, see the model card.

NVIDIA Nemotron 3 Nano Omni#

This is the initial release of NVIDIA Nemotron 3 Nano Omni. For more information on this model, see the model card.

Qwen3.6#

This is the initial release of two versions of Qwen3.6. Refer to the following model cards for more information on each version:

Kimi-K2.6#

This is the initial release of Kimi-K2.6. For more information on this model, see the model card.

Cosmos Reason2#

This is an updated release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

Key features of Cosmos Reason2 include the following:

  • Speculative Decoding: Built-in EAGLE support for faster inference

  • BYOC EAGLE Support: Train and use custom speculative decoding heads

  • Efficient Video Sampling (EVS): Configurable frame pruning for performance optimization

  • Pre-decoded Video Frames: New video_frames content type allows sending pre-decoded JPEG frames as video input, skipping server-side video decode for faster TTFT

For more information about this model, refer to the model cards:

Gemma 4 31B Instruct#

This is the initial release of Gemma 4 31B Instruct. For more information on this model, see the model card.

Mistral-Small-4-119B-2603#

This is the initial release of Mistral-Small-4-119B-2603. For more information on this model, see the model card.

Qwen3.5#

This is the initial release of several versions of Qwen3.5. Refer to the following model cards for more information on each version:

Nemotron-Parse-v1.2#

This is an updated release of Nemotron Parse, now known as Nemotron-Parse-v1.2. The API has changed since the release of Nemotron Parse in version 1.5.0. For more information on Nemotron-Parse-v1.2, refer to the model card on Hugging Face.

Ministral 3 14B Instruct 2512#

This is the initial release of Ministral 3 14B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

Kimi-K2.5#

This is the initial release of Kimi-K2.5. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Limitations#

  • All models

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

    • When environment variable NIM_DISABLE_LOG_REQUESTS=0, the prompt_token_ids and prompt fields are empty in Completions API responses; prompt_token_ids is also empty in Chat Completions API responses.

  • Mistral Medium 3.5

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • Video input is not supported.

    • Structured output is not supported.

  • NVIDIA Nemotron 3 Nano Omni

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • Requests containing large videos (> 1GB) may fail with a NanoNemotronVLProcessor error.

  • Qwen3.6-27B

    • Video input is not supported.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the

  • Kimi-K2.6

    • This model uses the SGLang backend.

    • This model does not support video workloads.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

  • Qwen3.6-35B-A3B

    • This model uses the SGLang backend.

    • Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the

  • Cosmos Reason2

    • Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.

    • Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.

    • BF16 generic profiles are not supported for L40, RTX 4500, and RTX PRO 4500 GPUs due to memory constraints.

    • Tensor parallelism (TP=2) is not supported. All deployments use TP=1.

    • Efficient Video Sampling (EVS) is only supported for the 8B model. The 2B model does not support EVS.

  • Gemma 4 31B Instruct

    • GPUs built on the ARM architecture are not supported.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

  • Mistral-Small-4-119B-2603

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The ARM architecture is not supported.

    • Only text and image inputs are supported.

  • Qwen3.5-35B-A3B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The /v1/responses endpoint is currently not supported.

    • Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels.

    • When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.

  • Qwen3.5-122B-A10B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The /v1/responses endpoint is currently not supported.

    • Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels for video workloads.

    • On NVIDIA H100 SXM, the NIM_MAX_BATCH_SIZE parameter is limited to 128. Increasing this value may result in out-of-memory errors in the Mamba layer state cache.

    • Video input support is enabled by default. However, under high-concurrency and long ISL workloads, the model may experience performance degradation due to KV cache saturation. If video input support is not required, disable video input by setting NIM_MAX_VIDEOS_PER_PROMPT=0 for optimal performance.

    • When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.

    • When running on H100 GPUs, manually set the environment variable NIM_MANIFEST_PROFILE.

  • Qwen3.5-397B-A17B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The ARM architecture is not supported.

    • Video input is not enabled by default. To enable, set NIM_MAX_VIDEOS_PER_PROMPT=1.

    • Enabling video input can cause performance degradation for high ISL and high concurrency loads when KV cache is saturated.

  • Nemotron-Parse-v1.2

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • Only one image per request is supported.

    • Text input is not supported.

    • System messages are not supported.

    • Video input is not supported.

    • Guided decoding and tool calling are not supported.

    • Multilingual support is not available for the model.

    • Only x86_64 based GPUs are supported.

  • Ministral 3 14B Instruct 2512

    • For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.

    • The following parameters do not work as expected: chat_template, chat_template_kwargs, and media_io_kwargs.

    • Sending an invalid value for the mm_processor_kwargs parameter results in an HTTP 500 error code.

    • The function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.

    • On L40S GPUs, maximum context length is 100k due to limited GPU memory. Other GPUs support up to 262,144 tokens.

    • To ensure optimal response quality, do not use stop words in tool calling requests.

    • The model may still attempt to make tool calls when tool_choice=none and tool definitions are present in the request.

    • Sending an image in text-only prompts may result in an HTTP 500 error code. Use image_url instead.

  • Kimi-K2.5

    • Guided decoding is not supported.

    • Only text and image inputs are supported.

    • You can only deploy this NIM using Docker. Helm is not supported.

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • This model does not support the media_io_kwargs and mm_processor_kwargs parameters.

    • The following metrics are not currently reported in the metrics endpoint:

      • vision_encoder_latency_seconds

      • request_image_count

      • image_size_pixels

      • gpu_cache_usage_perc

Release 1.6.0#

Summary#

Mistral Large 3 675B Instruct 2512#

This is the initial release of Mistral Large 3 675B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

Cosmos Reason2#

This is the initial release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

For more information about this model, refer to the model cards on Hugging Face:

Nemotron Nano 12B v2 VL#

This is an updated release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Changes from 1.5.0#

  • Nemotron Nano 12B v2 VL

    • Added FP4 quantized profiles

    • Enabled prefix caching

Limitations#

  • All models

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

    • When passing an invalid image or video URL, the error code is 500 instead of 4xx.

    • Video models have a minimum frame resolution of 128x128.

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

  • Mistral Large 3 675B Instruct 2512

    • For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.

    • For Mistral models, you cannot override chat_template at the request level using the chat_template or chat_template_kwargs parameter.

    • For Mistral models, the function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.

    • To ensure optimal response quality, avoid using stop words in tool calling requests.

    • For Mistral visual language models, the mm_processor_kwargs parameter is not supported at the request level.

  • Cosmos Reason2

    • Speculative decoding on Blackwell chips is not supported.

    • Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.

    • Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.

  • Nemotron Nano 12B v2 VL

    • If you set the min_tokens sampling parameter in a request, you should also set the max_tokens sampling parameter. Setting min_tokens alone causes the model to generate repetitive content.

    • For L40S GPU deployments, disable KV cache reuse by setting environment variable NIM_ENABLE_KV_CACHE_REUSE=0 to prevent out of memory errors.

Release 1.5.0#

Summary#

Nemotron Nano 12B v2 VL#

This is the initial release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

Nemotron Parse#

This is an updated version of nemoretriever-parse, which was originally released in version 1.2.0. The updated model is now known as Nemotron Parse. For more information, refer to the Nemotron Parse Overview.

Supported Hardware#

Refer to the support matrix for the following models:

Limitations#

  • All models

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

    • When passing an invalid image or video URL, the error code will be 500 instead of 4xx.

  • Nemotron Nano 12B v2 VL

    • KV cache reuse between requests is not supported.

  • Nemotron Parse

    • Only one image per request is supported.

    • Text input is not supported.

    • System messages are not supported.

    • Output streaming is not supported.

    • Video input is not supported.

Release 1.4.1#

Summary#

Cosmos Reason1 7B#

This is an updated release for Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following model:

Changes from 1.4.0#

  • Cosmos Reason1 7B

    • Fixed memory profiling for frames-to-tokens encoding.

    • Fixed error that resulted in wrong decoding backend being used for h265 videos.

    • Fixed memory leaks in video decoding.

    • Added memory profiling for video decoding.

    • Number of input video frames calculated with fps parameter can now be limited by the num_frames parameter.

Limitations#

  • Cosmos Reason1 7B

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • When passing an invalid image URL or parameters, the HTTP error code can be incorrect.

    • Changing the guided decoding backend at runtime is not supported.

    • The best_of parameter is not supported.

    • FP8 profiles are not available on B200 and GB200.

    • TP2 profiles can lead to higher TTFT than TP1 profiles in some cases.

Release 1.4.0#

Summary#

Cosmos Reason1 7B#

This is the initial release of Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Maverick 17B 128E Instruct#

This is the initial release of Llama 4 Maverick 17B 128E Instruct. For more information about this model, refer to the model card on build.nvidia.com.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

  • Cosmos Reason1 7B

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • When passing an invalid image URL or parameters, the HTTP error code can be incorrect.

    • Changing the guided decoding backend at runtime is not supported

    • The best_of parameter is not supported

    • Sending long videos (multiple minutes) can lead to timeouts

    • FP8 profiles are not available on B200 and GB200

    • TP2 profiles can lead to higher TTFT than TP1 profiles in some cases

  • Llama 4 Maverick 17B 128E Instruct

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • There’s a time to first token (TTFT) degradation observed with concurrency >= 16. End-to-end latency isn’t affected.

Release 1.3.2#

Summary#

Llama 4 Scout 17B 16E Instruct#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following model:

Note

Llama 4 models are not available in the EU.

Changes from 1.3.1#

  • Llama 4 Scout 17B 16E Instruct

    • Tool calling is now supported.

Limitations#

  • Llama 4 Scout 17B 16E Instruct

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.1#

Summary#

Mistral Small 3.2#

This is the initial release of Mistral Small 3.2. For more information about this model, refer to the model card on build.nvidia.com.

Llama Nemotron Nano VL#

This is an updated release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Scout 17B 16E Instruct#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Changes from 1.3.0#

  • Llama Nemotron Nano VL

    • Fixed performance issues with SFT and FP8

    • Improved air-gap deployment of SFT checkpoints

  • Llama 4 Scout 17B 16E Instruct

    • Introduced generic TP2 profiles for deployment on H200 NVL

Limitations#

  • Mistral Small 3.2 24B Instruct 2506

    • The LlamaStack API is not supported.

    • Structured generation is not supported.

  • Llama Nemotron Nano VL

    • The LlamaStack API is not supported.

    • PEFT is not supported.

  • Llama 4 Scout 17B 16E Instruct

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • Tool calling is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.0#

Summary#

Llama Nemotron Nano VL#

This is the initial release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

Llama 4 Scout 17B 16E Instruct#

This is the initial release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

  • Llama Nemotron Nano VL

    • The LlamaStack API is not supported.

    • PEFT is not supported.

  • Llama 4 Scout 17B 16E Instruct

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • Tool calling is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.2.0#

Summary#

nemoretriever-parse#

This is the initial release of nemoretriever-parse. For more information, refer to the nemoretriever-parse Overview.

Supported Hardware#

Refer to the support matrix for the following model:

Limitations#

  • Only one image per request is supported.

  • Text input is not allowed.

  • System messages are not allowed.

Release 1.1.1#

Summary#

This patch release fixes CUDA runtime errors seen on AWS and Azure instances.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages.

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

Release 1.1.0#

Summary#

This is the 1.1.0 release of NIM for VLMs.

Supported Hardware#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages.

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

  • When deploying an optimized profile on AWS A10G, you might encounter the error [TensorRT-LLM][ERROR] ICudaEngine::createExecutionContextWithoutDeviceMemory: Error Code 1: Cuda Runtime (an illegal memory access was encountered). Use the vLLM backend instead as described in Profile Selection.