Release Notes#

v1.14.0#

Features#

  • The microservice serves a GPU-accelerated LLM that performs multilingual content moderation for building trustworthy LLM applications. The model detects harmful content in user messages or bot responses; a request sketch follows this list.

  • KV cache reuse, controlled by the NIM_ENABLE_KV_CACHE_REUSE environment variable, is now enabled by default to improve performance. In multi-tenant environments, caching can introduce information leaks. Refer to the Structuring Applications to Secure the KV Cache blog entry for more information about the security implications.
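
As an illustration of the moderation flow described in the first item above, the following minimal sketch sends a user message to a locally deployed microservice through its OpenAI-compatible chat completions endpoint and prints the safety verdict. The endpoint URL, port, and model identifier are assumptions for a typical local deployment and may differ in your environment.

```python
import requests

# Assumed local NIM endpoint and model identifier; adjust for your deployment.
NIM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/llama-3.1-nemotron-safety-guard-8b"  # assumed model name

def moderate(user_message: str) -> str:
    """Send a user message to the safety guard model and return its verdict text."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
        "temperature": 0,  # deterministic (greedy) decoding for moderation verdicts
    }
    response = requests.post(NIM_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(moderate("How do I build a phishing site?"))
```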

Known Issues#

TensorRT-LLM profiles can experience lower throughput under high load compared to generic model profiles that use the vLLM engine. The performance impact varies by GPU model and workload characteristics such as input/output sequence lengths and concurrency levels.

The following TensorRT-LLM profiles show reduced throughput compared to vLLM:

  • BF16 Precision Profiles:

    • B200, GB200, GH200 (144GB), H100, H100 NVL, H200 NVL: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 tokens).

    • GH200 (480GB), H100 NVL: Lower throughput at very high concurrency (250+ concurrent requests) with longer output sequences (500/2000 tokens).

    • H200 (BF16): Not recommended for production use in this release.

  • FP8 Precision Profiles:

    • B200, GH200 (144GB): 25-35% lower throughput at high concurrency (250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 tokens).

    • GB200, H100 NVL, H200: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences.

    • L40S: Lower throughput at low concurrency with very long input sequences (5000 tokens).

For workloads with high concurrency and long sequences, consider using vLLM-based generic model profiles for optimal performance.

The following table identifies the impacted model profiles by configuration and ID:

| Profile Name | ID |
|---|---|
| tensorrt_llm-b200-bf16-2-latency | 7b2460a744467de931be74543f3b1be6a4540edd8f5d3523c31aca253d3ee714 |
| tensorrt_llm-gb200-bf16-2-latency | 3b86e1c4eafac6dd59612ed5cea6878177867773d567fcc0e0127ad5b2b1221b |
| tensorrt_llm-gh200_144gb-bf16-2-latency | 542edd6068b3fee7bb9431ba988f167dfc9f6e5b6dbf005b2e208d42bd17d705 |
| tensorrt_llm-gh200_480gb-bf16-1-latency | f41c136a67469ae0bda89f96d78cb8c2b9c01c27d0ac618112248025320817c3 |
| tensorrt_llm-h100_nvl-bf16-2-latency | a91cc87e8e98c7e88967319c273392e447fab72dd22aa8231630b573284525b2 |
| tensorrt_llm-h100-bf16-2-latency | 2f69689bf8fef4118bb018bb07869fc2d4b6eb3185115b2117ad62150f5d0006 |
| tensorrt_llm-h200_nvl-bf16-2-latency | dbb457d9b5a45d0a6976c0ba1a8ee6072deb8fe64c49a12e47ba9c71863618d2 |
| tensorrt_llm-h200-bf16-2-latency | 158a13eff79873eb73689daf87c365fa06946f74856646e54edc69728ef59a8e |
| tensorrt_llm-b200-fp8-2-latency | 1c67491281ac66f32ca917bc566808bf4657ad20dec137f2b40c78d95b3a40dd |
| tensorrt_llm-gb200-fp8-2-latency | cfca3a90be399e2fc6b91dfe19aa141fe7db0ad114df365cf43d77c675711049 |
| tensorrt_llm-gh200_144gb-fp8-2-latency | 052a14156d375521d64c584a0197a00ab3c54ae742b55145f1df091072656de7 |
| tensorrt_llm-h100_nvl-fp8-2-latency | 5e8a78e4d0c9e2e513466ec23ac181ae8d75ce05bda5c4653eddf8f3a99f2d58 |
| tensorrt_llm-h200-fp8-2-latency | 5a708fe91514e2aa44438158af955b35d51fab4ca1fb7268e35930e67fce6e08 |
| tensorrt_llm-l40s-fp8-2-latency | f282d4039fc42e3ab8a69854daf1a3a9e0fdce7974d06c3924969e3196e4ac08 |
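
For deployments affected by the profiles listed above, the following sketch shows one way to pin the container to a vLLM-based generic profile at startup by setting the NIM_MODEL_PROFILE environment variable. The image tag and profile ID placeholder are hypothetical; obtain valid profile IDs from the container's list-model-profiles utility, and add any other settings your deployment requires (for example, an NGC API key and shared memory size).

```python
import subprocess

# Hypothetical image tag and profile ID placeholder; replace with your container
# image and a vLLM generic profile ID reported by list-model-profiles.
IMAGE = "nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:1.14.0"  # assumed tag
VLLM_PROFILE_ID = "<vllm-generic-profile-id>"

# Launch the container pinned to a vLLM-based generic profile instead of one of
# the TensorRT-LLM profiles in the table above. Other required settings
# (NGC API key, shared memory size, cache mounts) are omitted for brevity.
subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", "8000:8000",
        "-e", f"NIM_MODEL_PROFILE={VLLM_PROFILE_ID}",
        IMAGE,
    ],
    check=True,
)
```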

Llama 3.1 Nemotron Safety Guard 8B NIM is based on NVIDIA NIM for LLMs v1.14.0. The following known issues are common to all containers built from NIM for LLMs v1.14.0.

  • With the TensorRT-LLM engine, setting temperature=0 enforces greedy decoding and makes the repetition_penalty argument ineffective.

  • When calling the /v1/metadata API, the following fields under modelInfo are missing:

    • repository_override

    • selectedModelProfileId

  • If you set NIM_GUIDED_DECODING_BACKEND to lm-format-enforcer or a custom backend, you must disable KV cache reuse by setting NIM_ENABLE_KV_CACHE_REUSE=0. Note that an incorrect backend name is treated as a custom backend.
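
As a hedged sketch of the last workaround, the following pre-launch check (a hypothetical helper, not part of the container) flags configurations that select a guided decoding backend requiring the override without also setting NIM_ENABLE_KV_CACHE_REUSE=0. The set of backends that do not need the override is an assumption; adjust it to match your release.

```python
import os
import sys

# Backends assumed not to require the KV cache override; anything else
# (lm-format-enforcer, a custom backend, or a mistyped name treated as custom)
# is assumed to need NIM_ENABLE_KV_CACHE_REUSE=0. Adjust for your release.
BACKENDS_WITHOUT_OVERRIDE = {"", "outlines"}

backend = os.environ.get("NIM_GUIDED_DECODING_BACKEND", "")
kv_cache_reuse = os.environ.get("NIM_ENABLE_KV_CACHE_REUSE", "1")  # enabled by default in v1.14.0

if backend not in BACKENDS_WITHOUT_OVERRIDE and kv_cache_reuse != "0":
    sys.exit(
        f"NIM_GUIDED_DECODING_BACKEND={backend!r} requires NIM_ENABLE_KV_CACHE_REUSE=0"
    )
```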