PB Release Notes for NVIDIA NIM for LLMs#

This page contains the release notes for Production Branch (PB) releases of NVIDIA NIM for LLMs.

Important

PB releases (indicated by the -pb tag) provide stable, secure AI frameworks and SDKs for mission-critical applications. Refer to NVIDIA AI Enterprise Release Branches for more details.

Standard release notes generally apply, with PB-specific updates noted here.

Release 1.14.0-pb5.1#

New Language Models#

New Features#

The following are the new features introduced in 1.14.0-pb5.1:

  • The following container images are now available:

  • Added support for custom chat templates for models deployed with the vLLM and TRT-LLM backends. Clients can override the model’s default chat template by providing a custom template using the chat_template field in Chat Completions API requests.

    • This is a security-sensitive feature and is disabled by default.

    • To enable this feature, set the environment variable NIM_TRUST_REQUEST_CHAT_TEMPLATE to 1 or use the --trust-request-chat-template parameter in your deployment configuration.

    • Usage examples: the following docker run commands deploy a NIM with the feature enabled (the multi-LLM llm-nim container and the Llama-3.3 Nemotron Super 49B v1.5 PB 25h2 NIM), and the curl commands send Chat Completions requests that override the default chat template through the chat_template field:

      docker run -it --rm --name=llm-nim \
        --gpus all \
        --shm-size=16GB \
        -e NGC_API_KEY=$NGC_API_KEY \
        -e HF_TOKEN=$HF_TOKEN \
        -e NIM_TENSOR_PARALLEL_SIZE=2 \
        -e NIM_FORCE_TRUST_REMOTE_CODE=1 \
        -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
        -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
        -e NIM_SERVED_MODEL_NAME="meta/llama-3.1-8b-instruct" \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8000:8000 \
        nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.1
      
      docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
        --gpus all \
        --shm-size=16GB \
        -e NGC_API_KEY=$NGC_API_KEY \
        -e NIM_TENSOR_PARALLEL_SIZE=2 \
        -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8000:8000 \
        nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1
      
      curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "nvidia/llm-nim",
          "messages": [
            {
              "role": "user",
              "content": "Hello! who are you?"
            }
          ],
          "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
          "max_tokens": 50,
          "temperature": 0
        }'
      
      curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2",
          "messages": [
            {
              "role": "user",
              "content": "Hello! who are you?"
            }
          ],
          "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
          "max_tokens": 50,
          "temperature": 0
        }'
      

New Issues#

The following are the new known issues discovered in 1.14.0-pb5.1:

  • Llama-3.1-70B-Instruct PB 25h2

    • On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN.

    • Profiles vllm-bf16-tp2-pp1-lora and vllm-bf16-tp2-pp1 are not supported on H100 GPUs (the sketch after this list shows how to check which profiles are available on your system).

  • Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2

    • Tool calling may fail because the response contains invalid JSON or incorrectly nested fields (for example, parameters appearing outside the arguments object).

    • For profile tensorrt_llm-b200-fp8-2-latency, a 60% throughput degradation was observed compared to OSS vLLM at concurrency 1 with an input/output sequence length (ISL/OSL) of 5k/500.

    • For profile tensorrt_llm-gb200-fp8-2-latency, a 27% throughput degradation was observed compared to OSS vLLM at concurrency 1 with an ISL/OSL of 1k/1k.

    • For profile tensorrt_llm-h200_nvl-fp8-2-latency, a 30% throughput degradation was observed compared to OSS vLLM.

    • For profile vllm-gb200-bf16-2, up to a 38% performance degradation was observed compared to OSS vLLM at concurrency 1.

    • On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN (see the deployment sketch after this list).
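
The following is a minimal deployment sketch for the Blackwell and profile notes above, reusing the Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2 image from the usage examples. The list-model-profiles command and the NIM_MODEL_PROFILE environment variable are assumed here as the usual NIM for LLMs mechanisms for inspecting and pinning a profile; the profile ID shown is a placeholder, so verify the profiles reported for your GPUs before choosing one.

  # Sketch only: list the profiles this image exposes on your GPUs
  # (useful for avoiding the unsupported H100 vLLM profiles noted above).
  docker run --rm --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1 \
    list-model-profiles

  # Sketch only: launch on Blackwell GPUs with the vLLM FLASH_ATTN attention backend.
  # NIM_MODEL_PROFILE is optional; <profile-id> is a placeholder taken from the listing above.
  docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_TENSOR_PARALLEL_SIZE=2 \
    -e NIM_MODEL_PROFILE=<profile-id> \
    -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1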

Release 1.14.0-pb5.0#

Summary#

NVIDIA NIM for LLMs 1.14.0-pb5.0 introduces support for the multi-LLM compatible NIM container and updates the CUDA version.

New Features#

The following are the new features introduced in 1.14.0-pb5.0:

  • Added support for the multi-LLM compatible NIM container (a minimal deployment sketch appears after this list). The following container images are now available:

  • Updated the CUDA version from 12.9 to 13.
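
A minimal deployment sketch for the multi-LLM compatible NIM container follows. It mirrors the llm-nim example in the 1.14.0-pb5.1 section above; the 1.14.0-pb5.0 image tag and the Hugging Face model shown are assumptions, so substitute the image and model appropriate for your deployment.

  # Sketch only: deploy a Hugging Face model with the multi-LLM compatible NIM container.
  # The image tag and the model are assumptions; substitute your own values.
  docker run -it --rm --name=llm-nim \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_FORCE_TRUST_REMOTE_CODE=1 \
    -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
    -e NIM_SERVED_MODEL_NAME="meta/llama-3.1-8b-instruct" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.0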

New Issues#

The following are the new known issues discovered in 1.14.0-pb5.0:

  • SGLang is not supported.