# Release Notes

## v1.14.0

### Features
- The microservice serves a GPU-accelerated LLM that performs multilingual content moderation for building trustworthy LLM applications. The model detects harmful content in user messages or bot responses.
- KV cache reuse, controlled by the environment variable `NIM_ENABLE_KV_CACHE_REUSE`, is now enabled by default to improve performance. In multi-tenant environments, caching can introduce information leaks. Refer to the Structuring Applications to Secure the KV Cache blog entry for more information about the security implications.
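For multi-tenant deployments that want to opt out, the cache reuse flag can be turned off at container startup. The following is a minimal sketch; the image name and tag are illustrative placeholders, and it assumes `NGC_API_KEY` is already exported in the host shell.

```bash
# Launch the NIM container with KV cache reuse disabled.
# The image name/tag below are illustrative; substitute the image for your deployment.
# Assumes NGC_API_KEY is exported in the host environment.
docker run -d --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_KV_CACHE_REUSE=0 \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:latest
```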
### Known Issues
TensorRT-LLM profiles can experience lower throughput under high load compared to generic model profiles that use the vLLM engine. The performance impact varies by GPU model and by workload characteristics such as input/output sequence lengths and concurrency levels.

The following TensorRT-LLM profiles show reduced throughput compared to vLLM:

BF16 Precision Profiles:

- B200, GB200, GH200 (144GB), H100, H100 NVL, H200 NVL: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 input/output tokens).
- GH200 (480GB), H100 NVL: Lower throughput at very high concurrency (250+ concurrent requests) with longer output sequences (500/2000 input/output tokens).
- H200: Not recommended for production use in this release.

FP8 Precision Profiles:

- B200, GH200 (144GB): 25-35% lower throughput at high concurrency (250+ concurrent requests) with longer sequences (1000/1000 and 500/2000 input/output tokens).
- GB200, H100 NVL, H200: Lower throughput at high concurrency (100-250+ concurrent requests) with longer sequences.
- L40S: Lower throughput at low concurrency with very long input sequences (5000 tokens).

For workloads with high concurrency and long sequences, consider using vLLM-based generic model profiles for optimal performance (see the example after the table).
The following table identifies the impacted model profiles by configuration and ID:

| Profile Name | ID |
|---|---|
| tensorrt_llm-b200-bf16-2-latency | |
| tensorrt_llm-gb200-bf16-2-latency | |
| tensorrt_llm-gh200_144gb-bf16-2-latency | |
| tensorrt_llm-gh200_480gb-bf16-1-latency | |
| tensorrt_llm-h100_nvl-bf16-2-latency | |
| tensorrt_llm-h100-bf16-2-latency | |
| tensorrt_llm-h200_nvl-bf16-2-latency | |
| tensorrt_llm-h200-bf16-2-latency | |
| tensorrt_llm-b200-fp8-2-latency | |
| tensorrt_llm-gb200-fp8-2-latency | |
| tensorrt_llm-gh200_144gb-fp8-2-latency | |
| tensorrt_llm-h100_nvl-fp8-2-latency | |
| tensorrt_llm-h200-fp8-2-latency | |
| tensorrt_llm-l40s-fp8-2-latency | |
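If an affected deployment needs to move to a vLLM-based generic profile, the sketch below shows one way to list the available profiles and pin one at startup. It assumes the standard NIM for LLMs `list-model-profiles` utility and `NIM_MODEL_PROFILE` environment variable; confirm both against the NIM for LLMs documentation for your release, and substitute a profile ID reported by your container. The image name and tag are illustrative.

```bash
# List the profiles this container can run on the current hardware.
# Image name/tag are illustrative; use the image from your deployment.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:latest \
  list-model-profiles

# Start the service pinned to a vLLM-based generic profile.
# Replace <vllm-profile-id> with an ID reported by the command above.
docker run -d --gpus all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<vllm-profile-id> \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:latest
```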
Llama 3.1 Nemotron Safety Guard 8B NIM is based on NVIDIA NIM for LLMs v1.14.0. The following known issues are common to all containers built from NIM for LLMs v1.14.0.
- With the TensorRT-LLM engine, setting `temperature=0` enforces greedy decoding and makes the `repetition_penalty` argument ineffective (see the request sketch after this list).
- When calling the `/v1/metadata` API, the following fields under `modelInfo` are missing: `repository_override` and `selectedModelProfileId`.
- If you set `NIM_GUIDED_DECODING_BACKEND` to `lm-format-enforcer` or a custom backend, you must disable KV cache reuse by setting `NIM_ENABLE_KV_CACHE_REUSE=0` (see the configuration sketch after this list). Otherwise, an incorrect backend name is treated as a custom backend.
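If you rely on `repetition_penalty`, keep `temperature` above zero so sampling stays enabled. The following request is a minimal sketch against the OpenAI-compatible `/v1/chat/completions` endpoint; the model name and the placement of `repetition_penalty` as a top-level extension parameter are assumptions, so verify both against your deployment's `/v1/models` output and API reference.

```bash
# Minimal sketch: non-zero temperature so repetition_penalty can take effect.
# Model name and the repetition_penalty extension field are assumptions;
# check /v1/models and the API reference for your deployment.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.1-nemotron-safety-guard-8b",
        "messages": [{"role": "user", "content": "Is this message safe?"}],
        "temperature": 0.2,
        "repetition_penalty": 1.1,
        "max_tokens": 128
      }'
```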
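When enabling a guided decoding backend, pair it with `NIM_ENABLE_KV_CACHE_REUSE=0` at container startup. A minimal sketch, using the same illustrative image name as the earlier examples:

```bash
# Guided decoding with lm-format-enforcer requires KV cache reuse to be off.
# Image name/tag are illustrative; use the image from your deployment.
docker run -d --gpus all \
  -e NGC_API_KEY \
  -e NIM_GUIDED_DECODING_BACKEND=lm-format-enforcer \
  -e NIM_ENABLE_KV_CACHE_REUSE=0 \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llama-3.1-nemotron-safety-guard-8b:latest
```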