PB Release Notes for NVIDIA NIM for LLMs#

This page contains the release notes for Production Branch (PB) releases of NVIDIA NIM for LLMs.

Important

PB releases (indicated by the -pb tag) provide stable, secure AI frameworks and SDKs for mission-critical applications. Refer to NVIDIA AI Enterprise Release Branches for more details.

Standard release notes generally apply, with PB-specific updates noted here.

Release 1.14.0-pb5.1#

New Language Models#

New Features#

The following are the new features introduced in 1.14.0-pb5.1:

  • The following container images are now available:

  • Added support for custom chat templates for models deployed with the vLLM and TRT-LLM backends. Clients can override the model’s default chat template by providing a custom template using the chat_template field in Chat Completions API requests.

    • This is a security-sensitive feature and is disabled by default.

    • To enable this feature, set the environment variable NIM_TRUST_REQUEST_CHAT_TEMPLATE to 1 or use the --trust-request-chat-template parameter in your deployment configuration.

    • Usage examples: the following docker run commands deploy a NIM with the feature enabled (the multi-LLM llm-nim container and the Llama-3.3 Nemotron Super 49B v1.5 PB 25h2 NIM), and the curl commands send Chat Completions requests that override the default chat template through the chat_template field:

      docker run -it --rm --name=llm-nim \
        --gpus all \
        --shm-size=16GB \
        -e NGC_API_KEY=$NGC_API_KEY \
        -e HF_TOKEN=$HF_TOKEN \
        -e NIM_TENSOR_PARALLEL_SIZE=2 \
        -e NIM_FORCE_TRUST_REMOTE_CODE=1 \
        -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
        -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
        -e NIM_SERVED_MODEL_NAME="meta/llama-3.1-8b-instruct" \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8000:8000 \
        nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.1
      
      docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
        --gpus all \
        --shm-size=16GB \
        -e NGC_API_KEY=$NGC_API_KEY \
        -e NIM_TENSOR_PARALLEL_SIZE=2 \
        -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8000:8000 \
        nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1
      
      curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "nvidia/llm-nim",
          "messages": [
            {
              "role": "user",
              "content": "Hello! who are you?"
            }
          ],
          "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
          "max_tokens": 50,
          "temperature": 0
        }'
      
      curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
        --header 'Content-Type: application/json' \
        --data '{
          "model": "nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2",
          "messages": [
            {
              "role": "user",
              "content": "Hello! who are you?"
            }
          ],
          "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
          "max_tokens": 50,
          "temperature": 0
        }'
      

New Issues#

The following are the new known issues discovered in 1.14.0-pb5.1:

  • Llama-3.1-70B-Instruct PB 25h2

    • On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN.

    • Profiles vllm-bf16-tp2-pp1-lora and vllm-bf16-tp2-pp1 are not supported on H100 GPUs (the sketch after this list shows how to check which profiles are available on your system).

  • Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2

    • Tool calling may fail because the response contains invalid JSON or incorrectly nested fields (for example, parameters appearing outside the arguments object).

    • For profile tensorrt_llm-b200-fp8-2-latency, a 60% throughput degradation was observed compared to OSS vLLM at concurrency 1 with an input/output sequence length (ISL/OSL) of 5k/500.

    • For profile tensorrt_llm-gb200-fp8-2-latency, a 27% throughput degradation was observed compared to OSS vLLM at concurrency 1 with an ISL/OSL of 1k/1k.

    • For profile tensorrt_llm-h200_nvl-fp8-2-latency, a 30% throughput degradation was observed compared to OSS vLLM.

    • For profile vllm-gb200-bf16-2, up to a 38% performance degradation was observed compared to OSS vLLM at concurrency 1.

    • On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN (see the deployment sketch after this list).
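
The following is a minimal deployment sketch for the Blackwell and profile notes above, reusing the Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2 image from the usage examples. The list-model-profiles command and the NIM_MODEL_PROFILE environment variable are assumed here as the usual NIM for LLMs mechanisms for inspecting and pinning a profile; the profile ID shown is a placeholder, so verify the profiles reported for your GPUs before choosing one.

  # Sketch only: list the profiles this image exposes on your GPUs
  # (useful for avoiding the unsupported H100 vLLM profiles noted above).
  docker run --rm --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1 \
    list-model-profiles

  # Sketch only: launch on Blackwell GPUs with the vLLM FLASH_ATTN attention backend.
  # NIM_MODEL_PROFILE is optional; <profile-id> is a placeholder taken from the listing above.
  docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_TENSOR_PARALLEL_SIZE=2 \
    -e NIM_MODEL_PROFILE=<profile-id> \
    -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1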

Release 1.14.0-pb5.0#

Summary#

NVIDIA NIM for LLMs 1.14.0-pb5.0 introduces support for the multi-LLM compatible NIM container and updates the CUDA version.

New Features#

The following are the new features introduced in 1.14.0-pb5.0:

  • Added support for the multi-LLM compatible NIM container (a minimal deployment sketch appears after this list). The following container images are now available:

  • Updated the CUDA version from 12.9 to 13.
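
A minimal deployment sketch for the multi-LLM compatible NIM container follows. It mirrors the llm-nim example in the 1.14.0-pb5.1 section above; the 1.14.0-pb5.0 image tag and the Hugging Face model shown are assumptions, so substitute the image and model appropriate for your deployment.

  # Sketch only: deploy a Hugging Face model with the multi-LLM compatible NIM container.
  # The image tag and the model are assumptions; substitute your own values.
  docker run -it --rm --name=llm-nim \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_FORCE_TRUST_REMOTE_CODE=1 \
    -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
    -e NIM_SERVED_MODEL_NAME="meta/llama-3.1-8b-instruct" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.0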

New Issues#

The following are the new known issues discovered in 1.14.0-pb5.0:

  • SGLang is not supported.