Release Notes for NVIDIA NIM for LLMs#
This documentation contains the release notes for NVIDIA NIM for large language models (LLMs).
Release 1.10.0#
New Features in 1.10.0#
The following are the new features in 1.10.0:
Added host-based KV cache offloading support to improve memory efficiency when KV cache reuse is enabled (available only with the TensorRT-LLM backend). This feature increases the likelihood of KV cache reuse by copying reusable blocks to a buffer in host memory instead of evicting them.
On GH200 and GB200 systems, this feature leverages the unified memory architecture for more efficient memory management between the host and device.
Configuration options (see Configuration for more details, and the deployment sketch after this feature list):
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD: Set to 1 to enable host-based KV cache offloading.
NIM_KV_CACHE_HOST_MEM_FRACTION: Controls the fraction of free host memory to use (default: 0.1).
NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS: The maximum number of parallel download requests when downloading models (default: 1).
Added support for running NIM behind an SSL forward proxy. See Getting Started for Docker deployment and Deploy with Helm for Helm deployment.
Added SGLang backend for high-performance inference serving.
Added reward model support with custom reward string and logits range.
Support for NVIDIA Blackwell GPUs with NVFP4 quantization for improved performance and efficiency.
Support for deterministic generation.
The scheduler policy for the TRTLLM backend can now be set using NIM_SCHEDULER_POLICY. See Configuration for more details.
Deprecated the NIM option parallel_tool_calls in favor of supporting parallel tool calling functionality on a per-model basis.
Deprecated on-the-fly quantization of checkpoints, eliminating the need to mount external datasets.
Added support for deploying quantized tensorrt_llm checkpoints. See Fine-Tuned Model Support in NVIDIA NIM for LLMs for more details.
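The following sketch shows how the new configuration options might be passed to a Docker deployment. It is illustrative only: the container image, cache path, and the guaranteed_no_evict scheduler policy value are assumptions rather than values confirmed by this release; refer to Configuration for the options supported by your model.

```bash
# Illustrative sketch, not a verified command: the image name, cache mount, and
# scheduler policy value are assumptions. NIM_ENABLE_KV_CACHE_HOST_OFFLOAD=1
# turns on host-based KV cache offloading, NIM_KV_CACHE_HOST_MEM_FRACTION raises
# the host-memory fraction above the 0.1 default, and
# NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS increases parallel model downloads.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_KV_CACHE_HOST_OFFLOAD=1 \
  -e NIM_KV_CACHE_HOST_MEM_FRACTION=0.2 \
  -e NIM_SCHEDULER_POLICY=guaranteed_no_evict \
  -e NIM_SDK_MAX_PARALLEL_DOWNLOAD_REQUESTS=4 \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```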
Known Issues Fixed in 1.10.0#
The following are the previous known issues that were fixed in 1.10.0:
(FIXED) Response Delay in Tool Calling (Incomplete Type Information): Tool calls might take over 30 seconds if descriptions for array types lack items specifications, or if descriptions for object types lack properties specifications. To prevent delays, ensure these details (items for array, properties for object) are included in tool descriptions.
(FIXED) Response Freezing in Tool Calling (Too Many Parameters): Tool calls will freeze the NIM if a tool description includes a function with more than 8 parameters. To avoid this, ensure that functions defined in tool descriptions use 8 or fewer parameters. If freezing occurs, the NIM must be restarted.
New Known Issues in 1.10.0#
The following are the new known issues discovered in 1.10.0:
Tip
For related information, see Troubleshoot NVIDIA NIM for LLMs.
The following are the known issues with function calling:
Format enforcement is not guaranteed by default. The tool_choice parameter no longer supports required as a value, despite its presence in the OpenAPI spec. This might impact the accuracy of tool calling for some models.
Function calling no longer uses guided decoding, resulting in lower accuracy for smaller models like Llama 3.2 1B/3B Instruct.
The following are the known issues with the custom guided decoding backend:
The fast_outlines backend is deprecated.
Guided decoding now defaults to xgrammar instead of outlines. For more information, refer to Structured Generation with NVIDIA NIM for LLMs.
The guided decoding backend cannot be accessed without using constraint fields. Set guided_regex to ".*" to act as a minimal trigger for the guided decoding backend (see the request sketch after this list).
The outlines guided decoding backend is not supported for sglang profiles.
Custom guided decoding requires backends that implement the set_custom_guided_decoding_parameters method, as defined in the backend file.
Guided decoding does not work for TP > 1 for sglang profiles.
Deepseek R1 may produce less accurate results when using guided decoding.
You can’t deploy fp8 quantized engines on H100-NVL GPUs with deterministic generation mode on. For more information, refer to Deterministic Generation Mode in NVIDIA NIM for LLMs.
INT4/INT8 quantized profiles are not supported for Blackwell GPUs.
When using the Native TLS Stack to download the model, you should set --ulimit nofile=1048576 in the docker run command. If a Helm deployment is run behind the proxy, the limit must be increased on host nodes or a custom command must be provided. See Deploying Behind a TLS Proxy for details.
Air Gap Deployments of a model like Llama 3.3 Nemotron Super 49B by using the model directory option might not work if the model directory is in the HuggingFace format. Switch to using NIM_FT_MODEL in those cases. For more information, refer to Air Gap Deployment.
Llama-3.1-Nemotron-Ultra-253B-v1 does not work on H100s and A100s. Use H200s and B200s to deploy successfully.
DeepSeek models do not support tool calling.
LoRA does not work for mistral-nemo-12b-instruct.
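The following request is a sketch of the guided decoding minimal trigger noted above: a permissive guided_regex attached to an otherwise ordinary chat completion. The model name, port, and the nvext extension field are assumptions based on a typical OpenAI-compatible NIM deployment; refer to Structured Generation with NVIDIA NIM for LLMs for the exact request schema.

```bash
# Sketch only: forces the guided decoding backend to engage by passing a
# permissive regular expression. Model name, port, and the nvext field layout
# are assumptions; adjust them to match your deployment.
curl -s http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Name one prime number."}],
        "max_tokens": 16,
        "nvext": {"guided_regex": ".*"}
      }'
```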
Previous Releases#
The following are links to the previous release notes.
All Current Known Issues#
The following are the current (unfixed) known issues from all previous versions:
The vLLM backend is not supported on Llama Nemotron models.
All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.
Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.
Empty metrics values on multi-GPU TensorRT-LLM models. The metrics items gpu_cache_usage_perc, num_request_max, num_requests_running, num_requests_waiting, and prompt_tokens_total won’t be reported for multi-GPU TensorRT-LLM models, because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.
Filenames should not contain spaces if a custom fine-tuned model directory is provided.
Function calling and structured generation are not supported for pipeline parallelism greater than 1.
The GET v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).
A GH200 NVIDIA driver earlier than 560.35.03 can cause a segmentation fault or hanging during deployment. This is fixed in GPU driver 560.35.03.
Locally-built fine-tuned models are not supported with FP8 profiles.
Logarithmic probabilities (logprobs) support with echo:
The TRTLLM engine needs to be built explicitly with --gather_generation_logits. Enabling this may impact model throughput and inter-token latency.
NIM_MODEL_NAME must be set to the generated model repository.
logit_bias is not available for any model using the TRT-LLM backend.
logprobs=2 is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models.
NIM does not support Multi-Instance GPU (MIG) mode.
NIM with the vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (vllm-project/vllm#5060). You must restart the NIM in this case.
No tokenizer found error when running PEFT. This warning can be safely ignored.
The out-of-the-box (OOB) maximum sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.
Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.
Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.
P-Tuning isn’t supported.
Some stop words might not work as expected and might appear in the output.
The container may crash when building local TensorRT-LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1 (see the sketch after this list).
The model quantization is fp8, but the logs incorrectly display it as bf16.
The maximum supported context length may decrease based on memory availability.
Structured generation with regular expressions may produce unexpected responses. We recommend that you provide a strict answer format, such as \boxed{}, to get the correct response.
vGPU-related issues:
trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed with the NIM_LOW_MEMORY_MODE=1 flag.
When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error, for example, client_loop: send disconnect: Broken pipe.
vLLM for A100 and H200 is not supported.
vLLM + LoRA profiles for long-context models (model_max_len > 65528) will not load, resulting in “ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled.” As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.
When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.
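Several of the workarounds above amount to exporting an environment variable at container start. The following is a sketch of how NIM_MAX_MODEL_LEN and NIM_LOW_MEMORY_MODE might be combined for a memory-constrained local engine build; the image name, cache path, and the 32768 context length are placeholders rather than recommended values.

```bash
# Sketch only: cap the context length and enable low-memory mode when building
# a local TRT-LLM engine on a memory-constrained host. The image name, cache
# path, and 32768 value are placeholders; pick a length your model supports.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=32768 \
  -e NIM_LOW_MEMORY_MODE=1 \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
```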
All Current Known Issues for Specific Models#
The following are the current (unfixed) known issues from all previous versions that are specific to a model:
Tip
For related information, see Troubleshoot NVIDIA NIM for LLMs.
Code Llama
FP8 profiles are not released due to accuracy degradations.
LoRA is not supported.
Deepseek
The min_p sampling parameter is not compatible with Deepseek and will be set to 0.0.
The following are not supported for DeepSeek models:
LoRA
Guided Decoding
FT (fine-tuning)
DeepSeek models require setting --trust-remote-code. This is handled automatically in DeepSeek NIMs.
Only profiles matching the following hardware topologies are supported for the DeepSeek R1 model:
2 nodes of 8xH100
1 node of 8xH200
DeepSeek-R1 profiles disable DP attention by default to avoid crashes at higher concurrency. To turn on DP attention, you can set NIM_ENABLE_DP_ATTENTION.
-
This model does not include pre-built engines for TP8, A10G, and H100.
To deploy, set -e NIM_MAX_MODEL_LEN=131072.
-
BF16 profiles require at least 64GB of GPU memory to launch. For example, the vllm-bf16-tp1-pp1 profile does not launch successfully on a single L20 or other supported GPUs with less than 80GB of GPU memory.
Structured generation has unexpected behavior due to CoT output. Despite this, the guided_json parameter exhibits normal functionality when used with a JSON schema prompt.
When running the vLLM engine on a GPU with smaller memory, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 or less when using a vLLM profile.
Using a trtllm_buildable profile with a fine-tuned model can crash on H100.
At least 80GB of CPU memory is recommended.
-
When running the vLLM engine with less than 48GB of GPU memory, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.
-
When running the vLLM engine with A10G, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.
kv_cache_reuse is not supported.
The suffix parameter is not supported in API calls.
-
LoRA not supported
-
Does not support the System role in a chat or completions API call.
Llama 3.3 Nemotron Super 49B V1
The model might occasionally bypass its typical thinking patterns for certain queries, especially in multi-turn conversations (for example, \n\n).
You cannot deploy this model using KServe.
Listing the profiles for this model when the local cache is enabled can result in log warnings, which do not impact NIM functionality.
Logs for this model can contain spurious warnings. You can safely ignore them.
Avoid using the logit_bias parameter with this model because the results are unpredictable.
-
At least 400GB of CPU memory is required.
Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.
Accuracy was noted to be lower than the expected range with profiles vllm-bf16-tp4-pp1-lora and vllm-bf16-tp8-pp1.
The suffix parameter isn’t supported in API calls.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.
-
Performance degradation observed (compared to open-source vLLM) for the following TRT-LLM LoRA profiles: tensorrt_llm-b200-fp8-tp1-pp1-throughput-lora and tensorrt_llm-b200-bf16-tp1-pp1-throughput-lora.
Performance degradation observed (compared to open-source vLLM) for the following vLLM profiles: vllm-b200-bf16-1 and vllm-a100_sxm4_40gb-bf16-1.
-
Parallel tool calling is not supported.
Performance degradation observed for profile tensorrt_llm-h100-fp8-1-throughput.
Currently, TRT-LLM profiles with LoRA enabled show performance degradation compared to vLLM-LoRA profiles at low concurrencies (1 and 5).
When making requests that consume the maximum sequence length generation (such as using ignore_eos: True), generation time might be significantly longer and can exhaust the available KV cache, causing future requests to stall. In this scenario, we recommend that you reduce concurrency.
gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.
-
Currently, LoRA is not supported for this model.
Currently, tool calling is not supported.
Accuracy degradation observed for the following profiles: vllm-a100-bf16-1 and vllm-h200-bf16-2.
Llama 3.1 Swallow 8B Instruct v0.1
LoRA not supported
-
TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.
LoRA is not supported.
Throughput optimized profiles are not supported on A100 FP16 and H100 FP16.
vLLM profiles are not supported.
-
Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.
vLLM profiles are not supported.
Accuracy was noted to be lower than the expected range with the following profiles: vllm-l40s-bf16-8, vllm-l40s-bf16-4, vllm-h200-bf16-8, vllm-h200-bf16-2, vllm-h100-bf16-8, vllm-h100-bf16-2, vllm-h100_nvl-bf16-8, and vllm-h100_nvl-bf16-4.
The suffix parameter isn’t supported in API calls.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
LoRA A10G TP8 for both vLLM and TRTLLM not supported due to insufficient memory.
The performance of vLLM LoRA on L40s TP8 is significantly suboptimal.
Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.
Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to insufficient host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.
-
vLLM profiles are not supported.
-
Creating a chat completion with a non-existent model returns a 500 when it should return a 404.
-
vLLM profiles are not supported.
LoRA is not supported on L40S with TRT-LLM.
H100 and L40s LoRA profiles can hang with high (>2000) ISL values.
For the LoRA-enabled profiles, TTFT can be worse with the pre-built engines than with the vLLM fallback, while throughput is better. If TTFT is critical, consider using the vLLM fallback.
For requests that consume the maximum sequence length generation (for example, requests that use ignore_eos: True), generation time can be very long and the request can consume the available KV cache, causing future requests to stall. You should reduce concurrency under these conditions.
Llama 3.1 models
vLLM profiles fail with ValueError: Unknown RoPE scaling type extended.
Llama 3.1 FP8
Requires NVIDIA driver version >= 550.
-
LoRA isn’t supported on 8 x GPU configuration
-
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other GPUs might encounter a “CUDA out of memory” issue.
Mistral NeMo Minitron 8B 8K Instruct
Tool calling is not supported.
LoRA is not supported.
vLLM TP4 or TP8 profiles are not available.
-
Optimized TRT-LLM profiles have lower performance compared to open-source vLLM.
-
Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.
LoRA is not supported with the TRTLLM backend for MoE models.
vLLM LoRA profiles return an internal server error (500). Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM (see the sketch after this list).
vLLM profiles do not support function calling and structured generation. See vLLM #9433.
If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.
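The following is a sketch of the NIM_MAX_LORA_RANK workaround mentioned above for a vLLM LoRA profile. The image name, adapter directory, and the NIM_PEFT_SOURCE mount are placeholders drawn from a typical LoRA-enabled NIM deployment rather than from this model’s documentation.

```bash
# Sketch only: raise the maximum supported LoRA rank so vLLM LoRA profiles load
# instead of returning a 500. The image name, adapter path, and NIM_PEFT_SOURCE
# mount are placeholders; substitute the values for your deployment.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_LORA_RANK=256 \
  -e NIM_PEFT_SOURCE=/opt/nim/loras \
  -v ~/loras:/opt/nim/loras \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/org/model:latest
```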
Nemotron4 models
Require the use of ‘slow’ tokenizers; ‘fast’ tokenizers cause accuracy degradation.
-
LoRA is not supported
Tool calling is not supported
Phind Codellama 34B V2 Instruct
LoRA is not supported
Tool calling is not supported
-
Setting NIM_TOKENIZER_MODE=slow is not supported.
-
The alternative option to use vLLM is not supported due to poor performance. For the GPUs that have no optimized version, use the trtllm_buildable feature to build the TRT-LLM engine on the fly.
For all pre-built engines, gather_context_logits is not enabled. If you require logits output, specify it in your own TRT-LLM configuration when you use the trtllm_buildable feature.
The tool_choice parameter is not supported.
Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang. Use WARNING, DEBUG, or INFO instead.
-
A pre-built TRT-LLM engine for L20 is available, but it is not fully optimized for different use cases.
LoRA is not supported.
The tool_choice parameter is not supported.
Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang.
There may be performance issues in specific use cases when using the vLLM backend on L20.
-
Tool calling is not supported.
The suffix parameter is not supported in API calls.
The stream_options parameter is not supported in API calls.
The logprobs parameter is not supported when stream=true in API calls.
This model requires at least 48GB of VRAM but cannot be launched on a single 48GB GPU such as L40S. Single-GPU deployment is only supported on GPUs with 80GB or more of VRAM (for example, A100 80GB or H100 80GB).
-
This model is optimized for Arabic language contexts. While the model does process input in other languages, you may experience inconsistencies or reduced accuracy in content generated for non-Arabic languages.
The suffix parameter isn’t supported in API calls.
-
Does not support the chat endpoint.
-
Deployment fails on H100 with vLLM (TP1, PP1) at 250 concurrent requests.