Release Notes#

Release 1.3.0#

Summary#

This is the initial release of Llama Nemotron Nano VL. For more information on this model, see the model card on build.nvidia.com.

This release also marks the initial release of Llama 4. For more information, see the model card on GitHub.

Visual Language Models#

Limitations#

  • Llama Nemotron Nano VL

    • The LlamaStack API is not supported.

    • PEFT is not supported.

  • Llama 4

    • The LlamaStack API is not supported.

    • PEFT is not supported.

    • Tool calling is not supported.

    • The default maximum sequence length is 131K tokens.

    • Following Meta’s guidance, each request supports up to 5 images by default (see the request sketch after this list).

    • Accuracy of text-only requests can be lower with FP8 profiles on Hopper GPUs.

    • When an invalid image URL is passed, the returned HTTP error code is 500 instead of 4xx.

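As a minimal sketch of what a multi-image Llama 4 request looks like within these limits, the snippet below sends several image URLs in one OpenAI-compatible chat completion call and checks the status code, because an invalid image URL currently surfaces as HTTP 500 rather than a 4xx error. The base URL, model identifier, and image URLs are placeholders for illustration, not values taken from this documentation.

```python
import requests

# Placeholder endpoint and model id; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta/llama-4-scout-17b-16e-instruct"  # assumed model id

# Up to 5 images are supported per request by default.
image_urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
]

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these images."},
                *[
                    {"type": "image_url", "image_url": {"url": url}}
                    for url in image_urls
                ],
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(BASE_URL, json=payload, timeout=120)

# An invalid image URL currently returns HTTP 500 rather than 4xx,
# so treat 500 as a possible input error as well as a server error.
if response.status_code == 500:
    print("Server error (check that every image URL is valid):", response.text)
else:
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
```
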
Release 1.2.0#

Summary#

This is the initial release of nemoretriever-parse. For more information, see nemoretriever-parse Overview.

Visual Language Models#

Limitations#

  • Only one image per request is supported (see the request sketch after this list).

  • Text input is not allowed.

  • System messages are not allowed.

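A minimal request sketch that respects all three constraints (exactly one image, no text content items, no system message) might look like the following. The endpoint and model identifier are assumptions about a typical OpenAI-compatible deployment, not values taken from this documentation; replace them with the values for your environment.

```python
import base64
import requests

# Placeholder endpoint and model id; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/nemoretriever-parse"  # assumed model id

# Exactly one image per request; encode it inline as a data URL.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": MODEL,
    # No system message and no text content items -- only a single image.
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                }
            ],
        }
    ],
}

response = requests.post(BASE_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
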
Release 1.1.1#

Summary#

This patch release fixes CUDA runtime errors seen on AWS and Azure instances.

Visual Language Models#

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages (a request sketch that respects these constraints follows this list).

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With a context length larger than 32K tokens, the accuracy of Llama-3.2-90B-Vision-Instruct can degrade.

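The sketch below illustrates a request that stays within these limits: one image, no system message, and the image effectively placed at the front of the user message regardless of where it appears in the content list. It assumes the OpenAI Python client pointed at a local NIM endpoint and a placeholder model identifier; adjust both for your deployment.

```python
from openai import OpenAI

# Placeholder base URL and model id; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta/llama-3.2-90b-vision-instruct"  # assumed model id

completion = client.chat.completions.create(
    model=MODEL,
    # No system message when an image is present, and only one image.
    # The backend prepends the image to the user message even if it is
    # listed after the text content item.
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=256,
)

print(completion.choices[0].message.content)
```
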
Release 1.1.0#

Summary#

This is the 1.1.0 release of NIM for VLMs.

Visual Language Models#

Limitations#

  • PEFT is not supported.

  • Following Meta’s guidance, function calling is not supported.

  • Following Meta’s guidance, only one image per request is supported.

  • Following Meta’s guidance, system messages are not allowed with images.

  • Following the official vLLM implementation, images are always added to the front of user messages.

  • Maximum concurrency can be low when using the vLLM backend.

  • Image and vision encoder Prometheus metrics are not available with the vLLM backend.

  • With a context length larger than 32K tokens, the accuracy of Llama-3.2-90B-Vision-Instruct can degrade.

  • When deploying an optimized profile on AWS A10G, you might see the error “[TensorRT-LLM][ERROR] ICudaEngine::createExecutionContextWithoutDeviceMemory: Error Code 1: Cuda Runtime (an illegal memory access was encountered)”. Use the vLLM backend instead as described here.