PB Release Notes for NVIDIA NIM for LLMs#
This page contains the release notes for Production Branch (PB) releases of NVIDIA NIM for LLMs.
Important
PB releases (indicated by the -pb tag) provide stable, secure AI frameworks and SDKs for mission-critical applications. Refer to NVIDIA AI Enterprise Release Branches for more details.
Standard release notes generally apply, with PB-specific updates noted here.
Release 1.14.0-pb5.1#
New Language Models#
The following language model NIM microservices are new in this release:
- Llama-3.1-70B-Instruct PB 25h2
- Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2
New Features#
The following are the new features introduced in 1.14.0-pb5.1:
The following container images are now available:
- AMD64 / ARM64 (the correct image is pulled based on the machine architecture): 1.14.0-pb5.1
- Hardened AMD64 (government-ready): 1.14.0-pb5.1-stig-fips-x86-64
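For example, the images can be pulled as follows (a minimal sketch; the nvcr.io/nim/nvidia/llm-nim repository path matches the usage examples below):

# Multi-arch tag; Docker pulls the image matching the host architecture.
docker pull nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.1

# Hardened, government-ready AMD64 image.
docker pull nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.1-stig-fips-x86-64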
Added support for custom chat templates for models deployed with the vLLM and TRT-LLM backends. Clients can override the model's default chat template by providing a custom template in the chat_template field of Chat Completions API requests.

This is a security-sensitive feature and is disabled by default. To enable it, set the environment variable NIM_TRUST_REQUEST_CHAT_TEMPLATE to 1 or use the --trust-request-chat-template parameter in your deployment configuration.

Usage Examples:
Launch the multi-LLM NIM container with custom chat templates enabled:

docker run -it --rm --name=llm-nim \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e HF_TOKEN=$HF_TOKEN \
    -e NIM_TENSOR_PARALLEL_SIZE=2 \
    -e NIM_FORCE_TRUST_REMOTE_CODE=1 \
    -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
    -e NIM_MODEL_NAME="hf://meta-llama/Llama-3.1-8B-Instruct" \
    -e NIM_SERVED_MODEL_NAME="meta/llama-3.1-8b-instruct" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.1
Launch the Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2 NIM container with custom chat templates enabled:

docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_TENSOR_PARALLEL_SIZE=2 \
    -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1
Send a request with a custom chat template to the multi-LLM NIM:

curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "nvidia/llm-nim",
        "messages": [
            {
                "role": "user",
                "content": "Hello! Who are you?"
            }
        ],
        "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
        "max_tokens": 50,
        "temperature": 0
    }'
Send a request with a custom chat template to the Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2 NIM:

curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2",
        "messages": [
            {
                "role": "user",
                "content": "Hello! Who are you?"
            }
        ],
        "chat_template": "{% for message in messages %}{{ message[\"content\"] }}{% endfor %}",
        "max_tokens": 50,
        "temperature": 0
    }'
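After the container starts, you can confirm the served model name to use in the model field by querying the standard OpenAI-compatible models endpoint (a quick check, assuming the default port mapping from the launch commands above):

# List the models served by the running NIM container.
curl -s 'http://0.0.0.0:8000/v1/models'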
New Issues#
The following are the new known issues discovered in 1.14.0-pb5.1:
Llama-3.1-70B-Instruct PB 25h2

- On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN.
- Profiles vllm-bf16-tp2-pp1-lora and vllm-bf16-tp2-pp1 are not supported on H100 GPUs.
Llama-3.3-Nemotron-Super-49B-v1.5 PB 25h2

- Tool calling may fail due to invalid JSON or incorrect nesting (for example, parameters appear outside the arguments object) in the response.
- For profile tensorrt_llm-b200-fp8-2-latency, a 60% throughput degradation compared to OSS vLLM was observed at conc=1 for ISL/OSL (input/output sequence length) 5k/500.
- For profile tensorrt_llm-gb200-fp8-2-latency, a 27% throughput degradation compared to OSS vLLM was observed at conc=1 for ISL/OSL 1k/1k.
- For profile tensorrt_llm-h200_nvl-fp8-2-latency, a 30% throughput degradation compared to OSS vLLM was observed.
- For profile vllm-gb200-bf16-2, up to a 38% performance degradation compared to OSS vLLM was observed at conc=1.
- On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN, as shown in the sketch after this list.
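A minimal sketch of the Blackwell workaround, reusing the Nemotron launch command from the usage examples above with only the attention-backend override added; all other parameters are unchanged:

# Force vLLM's FLASH_ATTN attention backend when running on Blackwell GPUs.
docker run -it --rm --name=llama-3.3-nemotron-super-49b-v1.5-pb25h2 \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_TENSOR_PARALLEL_SIZE=2 \
    -e NIM_TRUST_REQUEST_CHAT_TEMPLATE=1 \
    -e VLLM_ATTENTION_BACKEND=FLASH_ATTN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5-pb25h2:1.14.0-pb5.1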
Release 1.14.0-pb5.0#
Summary#
NVIDIA NIM for LLMs 1.14.0-pb5.0 introduces support for the multi-LLM compatible NIM container and updates the CUDA version.
New Features#
The following are the new features introduced in 1.14.0-pb5.0:
Added support for the multi-LLM compatible NIM container. The following container images are now available:
- AMD64 / ARM64 (the correct image is pulled based on the machine architecture): 1.14.0-pb5.0
- Hardened AMD64 (government-ready): 1.14.0-pb5.0-stig-fips-x86-64
Updated the CUDA version from 12.9 to 13.
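If you want to verify which CUDA version ships in an image, one option is the sketch below. It assumes the image follows the NVIDIA CUDA base-image convention of exporting a CUDA_VERSION environment variable; if it does not, the command prints nothing:

# Print the CUDA version environment variable baked into the image, if present.
docker run --rm --entrypoint env nvcr.io/nim/nvidia/llm-nim:1.14.0-pb5.0 | grep CUDA_VERSION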
New Issues#
The following are the new known issues discovered in 1.14.0-pb5.0:
SGLang is not supported.