Release Notes#

Release 1.7.0#

Summary#

This is the initial release of Gemma 4 31B Instruct. For more information on this model, see the model card.

This is the initial release of Mistral-Small-4-119B-2603. For more information on this model, see the model card.

This is the initial release of several versions of Qwen3.5. Refer to the following model cards for more information on each version:

This is an updated release of Nemotron Parse, now known as Nemotron-Parse-v1.2. The API has changed since the release of Nemotron Parse in version 1.5.0. For more information on Nemotron-Parse-v1.2, refer to the model card on Hugging Face.

This is the initial release of Ministral 3 14B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

This is the initial release of Kimi-K2.5. For more information about this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

  • All models

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

    • When environment variable NIM_DISABLE_LOG_REQUESTS=0, the prompt_token_ids and prompt fields are empty in Completions API responses; prompt_token_ids is also empty in Chat Completions API responses.

  • Gemma 4 31B Instruct

    • GPUs built on the ARM architecture are not supported.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

  • Mistral-Small-4-119B-2603

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The ARM architecture is not supported.

    • Only text and image inputs are supported.

  • Qwen3.5-35B-A3B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The /v1/responses endpoint is currently not supported.

    • Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels.

    • When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.

  • Qwen3.5-122B-A10B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The /v1/responses endpoint is currently not supported.

    • Under high-concurrency video workloads, the model may experience video decoding resource saturation, which can result in performance degradation at elevated concurrency levels for video workloads.

    • On NVIDIA H100 SXM, the NIM_MAX_BATCH_SIZE parameter is limited to 128. Increasing this value may result in out-of-memory errors in the Mamba layer state cache.

    • Video input support is enabled by default. However, under high-concurrency and long ISL workloads, the model may experience performance degradation due to KV cache saturation. If video input support is not required, disable video input by setting NIM_MAX_VIDEOS_PER_PROMPT=0 for optimal performance.

    • When mounting a cache directory from the host system, add -u $(id -u), -e HOME=/tmp, and -e USER=$(id -un) to the docker run command to avoid potential permission issues.

    • When running on H100 GPUs, manually set the environment variable NIM_MANIFEST_PROFILE.

  • Qwen3.5-397B-A17B

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • The ARM architecture is not supported.

    • Video input is not enabled by default. To enable, set NIM_MAX_VIDEOS_PER_PROMPT=1.

    • Enabling video input can cause performance degradation for high ISL and high concurrency loads when KV cache is saturated.

  • Nemotron-Parse-v1.2

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • Only one image per request is supported.

    • Text input is not supported.

    • System messages are not supported.

    • Video input is not supported.

    • Guided decoding and tool calling are not supported.

    • Multilingual support is not available for the model.

    • Only x86_64 based GPUs are supported.

  • Ministral 3 14B Instruct 2512

    • For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.

    • The following parameters do not work as expected: chat_template, chat_template_kwargs, and media_io_kwargs.

    • Sending an invalid value for the mm_processor_kwargs parameter results in an HTTP 500 error code.

    • The function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.

    • On L40S GPUs, maximum context length is 100k due to limited GPU memory. Other GPUs support up to 262,144 tokens.

    • To ensure optimal response quality, do not use stop words in tool calling requests.

    • The model may still attempt to make tool calls when tool_choice=none and tool definitions are present in the request.

    • Sending an image in text-only prompts may result in an HTTP 500 error code. Use image_url instead.

  • Kimi-K2.5

    • Guided decoding is not supported.

    • Only text and image inputs are supported.

    • You can only deploy this NIM using Docker. Helm is not supported.

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

    • This model is built with a specialized base container and is subject to limitations. It uses the release tag 1.7.0-variant for the container image. Refer to Notes on NIM Container Variants for more information.

    • This model does not support the media_io_kwargs and mm_processor_kwargs parameters.

    • The following metrics are not currently reported in the metrics endpoint:

      • vision_encoder_latency_seconds

      • request_image_count

      • image_size_pixels

      • gpu_cache_usage_perc

Release 1.6.0#

Summary#

This is the initial release of Mistral Large 3 675B Instruct 2512. For more information about this model, refer to the model card on build.nvidia.com.

This is the initial release of Cosmos Reason2. This model is available in two sizes: Cosmos Reason2 2B and Cosmos Reason2 8B.

For more information about this model, refer to the model cards on Hugging Face:

This is an updated release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Changes from 1.5.0#

  • Nemotron Nano 12B v2 VL

    • Added FP4 quantized profiles

    • Enabled prefix caching

Limitations#

  • All models

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

    • When passing an invalid image or video URL, the error code is 500 instead of 4xx.

    • Video models have a minimum frame resolution of 128x128.

    • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. You should filter these characters out of a prompt before submitting the prompt to the VLM.

  • Mistral Large 3 675B Instruct 2512

    • For Mistral models, you cannot change the guided decoding backend. The environment variable NIM_GUIDED_DECODING_BACKEND is set to guidance by default and is not configurable.

    • For Mistral models, you cannot override chat_template at the request level using the chat_template or chat_template_kwargs parameter.

    • For Mistral models, the function role is not supported in the messages field. The supported roles are system, user, assistant, and tool.

    • To ensure optimal response quality, avoid using stop words in tool calling requests.

    • For Mistral visual language models, the mm_processor_kwargs parameter is not supported at the request level.

  • Cosmos Reason2

    • Speculative decoding on Blackwell chips is not supported.

    • Warnings about pynvml deprecation and transformers version 4.57.1 incompatibility are shown during startup.

    • Inference with very long videos (minutes to hours) and default sampling (4 FPS) may hang.

  • Nemotron Nano 12B v2 VL

    • If you set the min_tokens sampling parameter in a request, you should also set the max_tokens sampling parameter. Setting min_tokens alone causes the model to generate repetitive content.

    • For L40S GPU deployments, disable KV cache reuse by setting environment variable NIM_ENABLE_KV_CACHE_REUSE=0 to prevent out of memory errors.

Release 1.5.0#

Summary#

This is the initial release of Nemotron Nano 12B v2 VL. For more information about this model, refer to the model card on build.nvidia.com.

This is an updated version of nemoretriever-parse, which was originally released in version 1.2.0. The updated model is now known as Nemotron Parse. For more information, refer to the Nemotron Parse Overview.

Visual Language Models#

Refer to the support matrix for the following models:

Limitations#

  • All models

    • Setting NIM_TOKENIZER_MODE=slow is not supported.

    • When passing an invalid image or video URL, the error code will be 500 instead of 4xx.

  • Nemotron Nano 12B v2 VL

    • KV cache reuse between requests is not supported.

  • Nemotron Parse

    • Only one image per request is supported.

    • Text input is not supported.

    • System messages are not supported.

    • Output streaming is not supported.

    • Video input is not supported.

Release 1.4.1#

Summary#

This is an updated release for Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following model:

Changes from 1.4.0#

  • Cosmos Reason1 7B

    • Fixed memory profiling for frames-to-tokens encoding.

    • Fixed error that resulted in wrong decoding backend being used for h265 videos.

    • Fixed memory leaks in video decoding.

    • Added memory profiling for video decoding.

    • Number of input video frames calculated with fps parameter can now be limited by the num_frames parameter.

Limitations#

  • Cosmos Reason1 7B

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • When passing an invalid image URL or parameters, the HTTP error code can be incorrect.

    • Changing the guided decoding backend at runtime is not supported.

    • The best_of parameter is not supported.

    • FP8 profiles are not available on B200 and GB200.

    • TP2 profiles can lead to higher TTFT than TP1 profiles in some cases.

Release 1.4.0#

Summary#

This is the initial release of Cosmos Reason1 7B. For more information about this model, refer to the model card on build.nvidia.com.

This is the initial release of Llama 4 Maverick 17B 128E Instruct. For more information about this model, refer to the model card on build.nvidia.com.

Visual Language Models#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

  • Cosmos Reason1 7B

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • When passing an invalid image URL or parameters, the HTTP error code can be incorrect.

    • Changing the guided decoding backend at runtime is not supported

    • The best_of parameter is not supported

    • Sending long videos (multiple minutes) can lead to timeouts

    • FP8 profiles are not available on B200 and GB200

    • TP2 profiles can lead to higher TTFT than TP1 profiles in some cases

  • Llama 4 Maverick 17B 128E Instruct

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • There’s a time to first token (TTFT) degradation observed with concurrency >= 16. End-to-end latency isn’t affected.

Release 1.3.2#

Summary#

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following model:

Note

Llama 4 models are not available in the EU.

Changes from 1.3.1#

  • Llama 4 Scout 17B 16E Instruct

    • Tool calling is now supported.

Limitations#

  • Llama 4 Scout 17B 16E Instruct

    • If you set the environment variable NIM_TOKENIZER_MODE=slow, the deployment fails.

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.1#

Summary#

This is the initial release of Mistral Small 3.2. For more information about this model, refer to the model card on build.nvidia.com.

This is an updated release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

This is an updated release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Changes from 1.3.0#

  • Llama Nemotron Nano VL

    • Fixed performance issues with SFT and FP8

    • Improved air-gap deployment of SFT checkpoints

  • Llama 4 Scout 17B 16E Instruct

    • Introduced generic TP2 profiles for deployment on H200 NVL

Limitations#

  • Mistral Small 3.2 24B Instruct 2506

    • The LlamaStack API is not supported.

    • Structured generation is not supported.

  • Llama Nemotron Nano VL

    • The LlamaStack API is not supported.

    • PEFT is not supported.

  • Llama 4 Scout 17B 16E Instruct

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • Tool calling is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.3.0#

Summary#

This is the initial release of Llama Nemotron Nano VL. For more information about this model, refer to the model card on build.nvidia.com.

This is the initial release of Llama 4 Scout 17B 16E Instruct. For more information, refer to the model card on GitHub.

Visual Language Models#

Refer to the support matrix for the following models:

Note

Llama 4 models are not available in the EU.

Limitations#

  • Llama Nemotron Nano VL

    • The LlamaStack API is not supported.

    • PEFT is not supported.

  • Llama 4 Scout 17B 16E Instruct

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • Tool calling is not supported.

    • The default maximum sequence length is 131k

    • Following Meta’s guidance, each request supports up to 5 images by default

    • Accuracy of text-only requests can be lower on FP8 profiles on Hopper

    • When passing an invalid image URL, the error code will be 500 instead of 4xx

Release 1.2.0#

Summary#

This is the initial release of nemoretriever-parse. For more information, refer to the nemoretriever-parse Overview.

Visual Language Models#

Refer to the support matrix for the following model:

Limitations#

  • Only one image per request is supported.

  • Text input is not allowed.

  • System messages are not allowed.

Release 1.1.1#

Summary#

This patch release fixes CUDA runtime errors seen on AWS and Azure instances.

Visual Language Models#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages.

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

Release 1.1.0#

Summary#

This is the 1.1.0 release of NIM for VLMs.

Visual Language Models#

Refer to the support matrix for the following models:

Note

Llama 3 models are not available in the EU.

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages.

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With context length larger than 32k, the accuracy of Llama-3.2-90B-Vision-Instruct can be degraded.

  • When deploying an optimized profile on AWS A10G, you might encounter the error [TensorRT-LLM][ERROR] ICudaEngine::createExecutionContextWithoutDeviceMemory: Error Code 1: Cuda Runtime (an illegal memory access was encountered). Use the vLLM backend instead as described in Profile Selection.