Release Notes#

v1.14.0#

Features#

  • The microservice serves a GPU-accelerated LLM that performs multilingual content moderation for building trustworthy LLM applications. The model detects harmful content in user messages or bot responses; a request sketch follows this list.

  • KV cache reuse, controlled by the NIM_ENABLE_KV_CACHE_REUSE environment variable, is now enabled by default to improve performance. In multi-tenant environments, caching can introduce information leaks. Refer to the Structuring Applications to Secure the KV Cache blog entry for more information about the security implications.
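
As an illustration of the moderation flow described in the first item above, the following minimal sketch sends a user message to a locally deployed microservice through its OpenAI-compatible chat completions endpoint and prints the safety verdict. The endpoint URL, port, and model identifier are assumptions for a typical local deployment and may differ in your environment.

```python
import requests

# Assumed local NIM endpoint and model identifier; adjust for your deployment.
NIM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/llama-3.1-nemotron-safety-guard-8b"  # assumed model name

def moderate(user_message: str) -> str:
    """Send a user message to the safety guard model and return its verdict text."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 128,
        "temperature": 0,  # deterministic (greedy) decoding for moderation verdicts
    }
    response = requests.post(NIM_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(moderate("How do I build a phishing site?"))
```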

Known Issues#

TensorRT-LLM profiles can experience lower throughput under high load compared to generic model profiles that use the vLLM engine. The performance impact varies by GPU model and workload characteristics such as input/output sequence lengths and concurrency levels.

The following TensorRT-LLM profiles show reduced throughput compared to vLLM:

  • BF16 Precision Profiles:

    • B200, GB200, GH200 (144GB), H100, H100 NVL, H200 NVL: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 tokens).

    • GH200 (480GB), H100 NVL: Lower throughput at very high concurrency (250+ concurrent requests) with longer output sequences (500/2000 tokens).

    • H200 (BF16): Not recommended for production use in this release.

  • FP8 Precision Profiles:

    • B200, GH200 (144GB): 25-35% lower throughput at high concurrency (250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 tokens).

    • GB200, H100 NVL, H200: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences.

    • L40S: Lower throughput at low concurrency with very long input sequences (5000 tokens).

For workloads with high concurrency and long sequences, consider using vLLM-based generic model profiles for optimal performance.

The following table identifies the impacted model profiles by configuration and ID:

| Profile Name | ID |
|---|---|
| tensorrt_llm-b200-bf16-2-latency | 7b2460a744467de931be74543f3b1be6a4540edd8f5d3523c31aca253d3ee714 |
| tensorrt_llm-gb200-bf16-2-latency | 3b86e1c4eafac6dd59612ed5cea6878177867773d567fcc0e0127ad5b2b1221b |
| tensorrt_llm-gh200_144gb-bf16-2-latency | 542edd6068b3fee7bb9431ba988f167dfc9f6e5b6dbf005b2e208d42bd17d705 |
| tensorrt_llm-gh200_480gb-bf16-1-latency | f41c136a67469ae0bda89f96d78cb8c2b9c01c27d0ac618112248025320817c3 |
| tensorrt_llm-h100_nvl-bf16-2-latency | a91cc87e8e98c7e88967319c273392e447fab72dd22aa8231630b573284525b2 |
| tensorrt_llm-h100-bf16-2-latency | 2f69689bf8fef4118bb018bb07869fc2d4b6eb3185115b2117ad62150f5d0006 |
| tensorrt_llm-h200_nvl-bf16-2-latency | dbb457d9b5a45d0a6976c0ba1a8ee6072deb8fe64c49a12e47ba9c71863618d2 |
| tensorrt_llm-h200-bf16-2-latency | 158a13eff79873eb73689daf87c365fa06946f74856646e54edc69728ef59a8e |
| tensorrt_llm-b200-fp8-2-latency | 1c67491281ac66f32ca917bc566808bf4657ad20dec137f2b40c78d95b3a40dd |
| tensorrt_llm-gb200-fp8-2-latency | cfca3a90be399e2fc6b91dfe19aa141fe7db0ad114df365cf43d77c675711049 |
| tensorrt_llm-gh200_144gb-fp8-2-latency | 052a14156d375521d64c584a0197a00ab3c54ae742b55145f1df091072656de7 |
| tensorrt_llm-h100_nvl-fp8-2-latency | 5e8a78e4d0c9e2e513466ec23ac181ae8d75ce05bda5c4653eddf8f3a99f2d58 |
| tensorrt_llm-h200-fp8-2-latency | 5a708fe91514e2aa44438158af955b35d51fab4ca1fb7268e35930e67fce6e08 |
| tensorrt_llm-l40s-fp8-2-latency | f282d4039fc42e3ab8a69854daf1a3a9e0fdce7974d06c3924969e3196e4ac08 |
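
For deployments affected by the profiles listed above, the following sketch shows one way to pin the container to a vLLM-based generic profile at startup by setting the NIM_MODEL_PROFILE environment variable. The image tag and profile ID placeholder are hypothetical; obtain valid profile IDs from the container's list-model-profiles utility, and add any other settings your deployment requires (for example, an NGC API key and shared memory size).

```python
import subprocess

# Hypothetical image tag and profile ID placeholder; replace with your container
# image and a vLLM generic profile ID reported by list-model-profiles.
IMAGE = "nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:1.14.0"  # assumed tag
VLLM_PROFILE_ID = "<vllm-generic-profile-id>"

# Launch the container pinned to a vLLM-based generic profile instead of one of
# the TensorRT-LLM profiles in the table above. Other required settings
# (NGC API key, shared memory size, cache mounts) are omitted for brevity.
subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", "8000:8000",
        "-e", f"NIM_MODEL_PROFILE={VLLM_PROFILE_ID}",
        IMAGE,
    ],
    check=True,
)
```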

Llama 3.1 Nemotron Safety Guard 8B NIM is based on NVIDIA NIM for LLMs v1.14.0. The following known issues are common to all containers built from NIM for LLMs v1.14.0.

  • With the TensorRT-LLM engine, setting temperature=0 enforces greedy decoding and makes the repetition_penalty argument ineffective.

  • When calling the /v1/metadata API, the following fields under modelInfo are missing:

    • repository_override

    • selectedModelProfileId

  • If you set NIM_GUIDED_DECODING_BACKEND to lm-format-enforcer or a custom backend, you must disable KV cache reuse by setting NIM_ENABLE_KV_CACHE_REUSE=0. Note that an incorrect backend name is treated as a custom backend.
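
As a hedged sketch of the last workaround, the following pre-launch check (a hypothetical helper, not part of the container) flags configurations that select a guided decoding backend requiring the override without also setting NIM_ENABLE_KV_CACHE_REUSE=0. The set of backends that do not need the override is an assumption; adjust it to match your release.

```python
import os
import sys

# Backends assumed not to require the KV cache override; anything else
# (lm-format-enforcer, a custom backend, or a mistyped name treated as custom)
# is assumed to need NIM_ENABLE_KV_CACHE_REUSE=0. Adjust for your release.
BACKENDS_WITHOUT_OVERRIDE = {"", "outlines"}

backend = os.environ.get("NIM_GUIDED_DECODING_BACKEND", "")
kv_cache_reuse = os.environ.get("NIM_ENABLE_KV_CACHE_REUSE", "1")  # enabled by default in v1.14.0

if backend not in BACKENDS_WITHOUT_OVERRIDE and kv_cache_reuse != "0":
    sys.exit(
        f"NIM_GUIDED_DECODING_BACKEND={backend!r} requires NIM_ENABLE_KV_CACHE_REUSE=0"
    )
```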