Release Notes for NVIDIA NIM for LLMs#
This page contains the release notes for NVIDIA NIM for Large Language Models (LLMs).
Release 1.15.0#
Summary#
NVIDIA NIM for LLMs 1.15.0 introduces reliability and performance improvements for the TRT-LLM PyTorch backend, which is now stable and enabled by default. This release also expands DGX Spark compatibility, prioritizes hardware-specific profiles for automatic selection, offers opt-in NIM telemetry (experimental), and implements prompt embeddings for privacy-preserving inference (experimental).
New Features#
The following are the new features in 1.15.0:
Added reliability and performance improvements for the TRT-LLM PyTorch backend, which is now stable and enabled by default. If you need to use the legacy TRT‑LLM backend for compatibility, refer to TRT-LLM Backend (Legacy/PyTorch).
Added support for hardware-specific profiles in automatic profile selection. The profile selection logic now gives higher priority to profiles that contain a gpu tag matching the system's hardware over profiles that do not, to ensure better performance and reduce the risk of OOM errors.
Expanded support for DGX Spark hardware to include the multi-LLM compatible NIM container.
Added support for prompt embeds, which let you use secure, pre-computed embeddings instead of raw text prompts. For configuration and usage instructions, refer to Prompt Embeddings with NVIDIA NIM for LLMs.
Updated the CUDA version from 12.9 to 13.0.
Added optional NIM Telemetry to collect minimal, anonymous system and NIM metadata to help improve performance, reliability, and compatibility across deployments.
To enable baseline collection, set NIM_TELEMETRY_MODE=1. Refer to NIM Telemetry for more configuration options.
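A minimal launch sketch for enabling baseline telemetry, assuming the usual single-node docker run pattern with NGC_API_KEY exported in the shell; the image name and tag below are placeholders:
# Opt in to baseline NIM telemetry collection at container startup.
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_TELEMETRY_MODE=1 \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest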
Fixed Issues#
The following are the previous known issues that were fixed in 1.15.0:
(FIXED) logit_bias is not available for any model using the TRT-LLM backend.
New Issues#
The following are the new issues discovered in 1.15.0:
Performance degradation observed for larger models (≥ 49 billion parameters). We recommend using the vLLM backend for optimal performance.
FP8 quantized profiles using the vLLM backend are not supported on RTX 6000 Pro Blackwell GPUs.
INT4_AWQ quantization is not supported by the TRT-LLM PyTorch backend.
The n parameter is not supported by the TRT-LLM PyTorch backend.
The v1/metrics endpoint does not report gpu_cache_usage_perc for the TRT-LLM backend.
NIM_TELEMETRY_ENABLE_LOGGING is not supported.
When NIM_DISABLE_LOG_REQUESTS=0, the prompt_token_ids and prompt fields are empty in Completions API responses; prompt_token_ids is also empty in Chat Completions responses.
-
For multi-LLM NIM deployments:
These models can only be deployed using the vLLM backend.
On Blackwell GPUs, set -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 and remove -u $(id -u) from the launch command.
Tool calling is supported sequentially using the Chat Completions API.
For tool calling, set NIM_ENABLE_AUTO_TOOL_CHOICE=1 and NIM_TOOL_CALL_PARSER=openai (see the launch sketch after this list).
For L40S GPUs, set NIM_KVCACHE_PERCENT=0.8 to resolve out of memory (OOM) errors.
For H100 GPUs, set NIM_MAX_MODEL_LEN=100000 to resolve out of memory (OOM) errors.
The Responses API is not supported.
Custom guided decoding is not supported.
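A minimal, hypothetical launch sketch that combines the flags above for a multi-LLM NIM deployment; the image name is an assumption, model-selection variables are omitted, and the hardware-specific flags should only be added where the list above calls for them:
# Multi-LLM NIM launch with the vLLM backend and sequential tool calling.
# On Blackwell GPUs, also add -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1
# and remove -u $(id -u); on L40S add -e NIM_KVCACHE_PERCENT=0.8; on H100 add
# -e NIM_MAX_MODEL_LEN=100000.
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_ENABLE_AUTO_TOOL_CHOICE=1 \
  -e NIM_TOOL_CALL_PARSER=openai \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llm-nim:latest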
Llama 3.3 Nemotron Super 49B V1.5
For multi-LLM NIM deployments:
Set -e NIM_FORCE_TRUST_REMOTE_CODE=1.
LoRA does not work with the TRT-LLM backend.
-
For multi-LLM NIM deployments:
Set -e NIM_FORCE_TRUST_REMOTE_CODE=1.
On Blackwell GPUs, additionally set -e VLLM_ATTENTION_BACKEND=FLASH_ATTN.
Previous Releases#
The following are links to the previous release notes.
1.14 | 1.13 | 1.12 | 1.11 | 1.10 | 1.8 | 1.7 | 1.6 | 1.5 | 1.4 | 1.3 | 1.2 | 1.1 | 1.0
All Current Known Issues#
The following are the current (unfixed) known issues from all previous versions:
Tip
For related information, see Troubleshoot NVIDIA NIM for LLMs.
General#
Security Advisory: A Server-Side Request Forgery vulnerability exists in the vLLM library’s multimodal image processing that could allow HTTP redirects to bypass domain restrictions. To mitigate this issue, set the environment variable VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 in your deployment. Refer to vLLM Environment Variables for more details.
The top_logprobs parameter is not supported.
All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.
Filenames should not contain spaces if a custom fine-tuned model directory is provided.
Some stop words might not work as expected and might appear in the output.
The maximum supported context length may decrease based on memory availability.
Structured generation with regular expressions may produce unexpected responses. We recommend that you provide a strict answer format, such as \boxed{}, to get the correct response.
The model quantization is fp8, but the logs incorrectly display it as bf16.
Some top-level parameters can trigger log warnings.
Deployment and Environment#
Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.
On GH200 systems, NVIDIA driver versions earlier than 560.35.03 can cause a segmentation fault or hang during deployment. This is fixed in GPU driver 560.35.03.
Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.
Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.
The container may crash when building local TensorRT-LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1.
Out-of-Bounds (OOB) sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.
vGPU related issues:
trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed by setting the NIM_LOW_MEMORY_MODE=1 flag.
When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error, for example, client_loop: send disconnect: Broken pipe.
vLLM for A100 and H200 is not supported.
NIM with vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (vllm-project/vllm#5060). You must restart the NIM in this case.
You can’t deploy fp8 quantized engines on H100-NVL GPUs with deterministic generation mode on. For more information, refer to Deterministic Generation Mode in NVIDIA NIM for LLMs.
INT4/INT8 quantized profiles are not supported for Blackwell GPUs.
When using the Native TLS Stack to download the model, you should set --ulimit nofile=1048576 in the docker run command (see the launch sketch after this list). If a Helm deployment is run behind the proxy, the limit must be increased on host nodes or a custom command must be provided. See Deploying Behind a TLS Proxy for details.
Air Gap Deployments of a model like Llama 3.3 Nemotron Super 49B that use the model directory option might not work if the model directory is in the HuggingFace format. Switch to using NIM_FT_MODEL in those cases. For more information, refer to Air Gap Deployment.
Llama-3.1-Nemotron-Ultra-253B-v1 does not work on H100s and A100s. Use H200s and B200s to deploy successfully.
Models with 8 billion parameters require NIM_KVCACHE_PERCENT=0.8 for tp=1 profiles.
NIM_ENABLE_PROMPT_LOGPROBS=1 is not supported for the TRT-LLM backend.
LoRA deployments on TRT-LLM and vLLM backends can experience significant latency and throughput degradation. The performance impact can vary by model size and configuration.
PCI Access Control Services (ACS) must be disabled for NCCL to function properly in multi-GPU deployments (tp > 1). For instructions on how to disable PCI ACS, refer to the NCCL troubleshooting guide.
When deploying models, you may get Download Error - Too many open files. To resolve the error, do the following:
For Docker deployments, add the --ulimit nofile=65536:65536 option.
For Kubernetes deployments, run the following commands:
sudo mkdir -p /etc/systemd/system/containerd.service.d
echo "[Service]" | sudo tee /etc/systemd/system/containerd.service.d/override.conf
echo "LimitNOFILE=65536" | sudo tee -a /etc/systemd/system/containerd.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl restart kubelet
SGLang profiles may run out of memory (OOM) under high load. Set NIM_KVCACHE_PERCENT=0.7 to help mitigate the issue.
For all version 1.14 NIMs, you must disable KV cache reuse by setting NIM_ENABLE_KV_CACHE_REUSE=0 when NIM_GUIDED_DECODING_BACKEND is set to lm-format-enforcer or a custom backend. An incorrect backend name is treated as a custom backend.
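A minimal launch sketch for the --ulimit workarounds mentioned above, assuming the usual single-node docker run pattern; the image name is a placeholder:
# Raise the container's open-file limit to avoid "Too many open files" and to
# support model downloads through the Native TLS Stack or a TLS proxy.
docker run -it --rm --gpus all \
  --ulimit nofile=1048576 \
  -e NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest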
Model Support and Functionality#
Many models return a 500 error when using structured generation with context-free grammar.
Function calling and structured generation are not supported for pipeline parallelism greater than 1.
Locally-built fine tuned models are not supported with FP8 profiles.
P-Tuning isn’t supported.
A "No tokenizer found" error appears when running PEFT. This warning can be safely ignored.
vLLM + LoRA profiles for long context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.
When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.
The following are the known issues with function calling:
Format enforcement is not guaranteed by default. The tool_choice parameter no longer supports required as a value, despite its presence in the OpenAPI spec. This might impact the accuracy of tool calling for some models.
Function calling no longer uses guided decoding, resulting in lower accuracy for smaller models like Llama 3.2 1B/3B Instruct.
Smaller parameter models (<= 8 billion parameters) can have tool calling enabled but are highly inaccurate due to their limited parameter count. We don’t recommend using such models for tool calling use cases.
The following are the known issues with the custom guided decoding backend:
The fast_outlines backend is deprecated.
Guided decoding now defaults to xgrammar instead of outlines. For more information, refer to Structured Generation with NVIDIA NIM for LLMs.
The outlines guided decoding backend is not supported for sglang profiles.
Custom guided decoding requires backends that implement the set_custom_guided_decoding_parameters method, as defined in the backend file.
Guided decoding does not work for TP > 1 for sglang profiles.
Deepseek R1 may produce less accurate results when using guided decoding.
DeepSeek models do not support tool calling.
LoRA does not work for mistral-nemo-12b-instruct.
The vLLM backend is not supported on Llama Nemotron models.
On the TRT-LLM backend, setting temperature=0 enforces greedy decoding, making repetition_penalty ineffective.
API and Metrics#
The GET v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).
Logarithmic Probabilities (logprobs) support with echo:
The TRT-LLM engine needs to be built explicitly with --gather_generation_logits.
Enabling this may impact model throughput and inter-token latency.
NIM_MODEL_NAME must be set to the generated model repository.
logprobs=2 is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models.
Empty metrics values on multi-GPU TensorRT-LLM models. Metrics items gpu_cache_usage_perc, num_request_max, num_requests_running, num_requests_waiting, and prompt_tokens_total won’t be reported for multi-GPU TensorRT-LLM models, because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.
If ignore_eos=true, the model ignores EOS tokens and keeps generating until a custom stop token is encountered or the max token limit is reached (if not set, the default context window size is 128 tokens). For vLLM and simple queries, we recommend using ignore_eos=false (default).
When calling the v1/metadata API, the following fields under modelInfo are missing: repository_override and selectedModelProfileId (see the example requests after this list).
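For quick checks of the endpoints above, a minimal sketch assuming the NIM is reachable at localhost:8000 in place of HOST-IP:
# Runtime metrics (some items are not reported for multi-GPU TensorRT-LLM
# models, as noted above).
curl -s http://localhost:8000/v1/metrics

# Deployment metadata (repository_override and selectedModelProfileId are
# currently missing from modelInfo).
curl -s http://localhost:8000/v1/metadata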
All Current Known Issues for Specific Models#
The following are the current (unfixed) known issues from all previous versions, that are specific to a model:
Tip
For related information, see Troubleshoot NVIDIA NIM for LLMs.
Code Llama
FP8 profiles are not released due to accuracy degradations.
LoRA is not supported.
Deepseek
The min_p sampling parameter is not compatible with Deepseek and will be set to 0.0.
The following are not supported for DeepSeek models:
LoRA
Guided Decoding
FT (fine-tuning)
DeepSeek models require setting --trust-remote-code. This is handled automatically in DeepSeek NIMs.
Only profiles matching the following hardware topologies are supported for the DeepSeek R1 model:
2 nodes of 8xH100
1 node of 8xH200
DeepSeek-R1 profiles disable DP attention by default to avoid crashes at higher concurrency. To turn on DP attention, you can set NIM_ENABLE_DP_ATTENTION.
DeepSeek Coder V2 Lite Instruct does not support kv_cache_reuse for vLLM.
-
This model does not include pre-built engines for TP8, A10G, and H100.
To deploy, set -e NIM_MAX_MODEL_LEN=131072.
-
BF16 profiles require at least 64GB of GPU memory to launch. For example, the vllm-bf16-tp1-pp1 profile does not launch successfully on a single L20 or other supported GPUs with less than 80GB of GPU memory.
Structured generation has unexpected behavior due to CoT output. Despite this, the guided_json parameter works normally when used with a JSON schema prompt.
When running the vLLM engine on a GPU with smaller memory, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 or less when using a vLLM profile.
Using a trtllm_buildable profile with a fine-tuned model can crash on H100.
At least 80GB of CPU memory is recommended.
-
When running the vLLM engine with GPU memory less than 48GB, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.
-
When running the vLLM engine with A10G, you may run into a ValueError stating that the model max sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.
kv_cache_reuse is not supported.
The suffix parameter is not supported in API calls.
-
LoRA not supported
-
Does not support the System role in a chat or completions API call.
Gemma2 9B CPT Sahabat-AI v1 Instruct
gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.
Logs for this model can contain spurious Python errors. You can safely ignore them.
-
These models can only be deployed using the vLLM backend.
Custom decoding is not supported.
Tool calling is done sequentially (not in parallel) using the Chat Completions API.
For tool calling, tool_choice is always set to auto.
For L40S GPUs, set NIM_KVCACHE_PERCENT=0.8 to resolve out of memory (OOM) errors.
The Responses API is experimental.
The /v1/responses (POST) endpoint immediately returns the complete response.
The /v1/responses/<response-id> (GET) endpoint retrieves stored responses, and the /v1/responses/<response-id>/cancel (POST) endpoint cancels ongoing background requests. For both of these operations, you must set the environment variable VLLM_ENABLE_RESPONSES_API_STORE=1 to enable response storage and management (see the example requests after this list).
Important
Enabling response storage causes a memory leak because responses are not automatically cleaned up. Responses remain in memory until server restart. This memory leak can result in out of memory (OOM) errors in production. Stored responses are not persisted to disk, so all stored data is lost on server restart. The cancel endpoint can only be used for background requests (when "background": true is set); immediate responses can’t be canceled.
When passing the payload using the Responses API, background fill is disabled.
A Harmony parser error occurs intermittently during streaming with tool calling. The model occasionally generates unexpected channel transition tokens (for example, commentary, analysis, or final) during tool invocation, causing the parser to fail when it receives token 200002 instead of expected token 200006. As a workaround, use non-streaming mode (stream: false) for tool calling scenarios with the Chat Completions API. Refer to the related GitHub issue.
OOM may occur when running GPT-OSS-120B on the H100 SXM 80GB GPU (TP1). Set -e NIM_RELAX_MEM_CONSTRAINTS=1 to relax memory constraints and allow the model to run even when GPU memory utilization is near capacity.
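A minimal sketch of the stored-response workflow above, assuming the NIM is reachable at localhost:8000, was launched with -e VLLM_ENABLE_RESPONSES_API_STORE=1, and exposes the OpenAI-compatible /v1/models listing; the input text and jq usage are illustrative only:
# Discover the served model name, then create a stored background response.
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
RESPONSE_ID=$(curl -s http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"${MODEL}\", \"input\": \"Say hello.\", \"background\": true}" \
  | jq -r '.id')

# Retrieve the stored response by ID.
curl -s http://localhost:8000/v1/responses/${RESPONSE_ID}

# Cancel the request while it is still running in the background.
curl -s -X POST http://localhost:8000/v1/responses/${RESPONSE_ID}/cancel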
-
Thinking is not supported.
Tool calling is not supported.
-
Setting NIM_TOKENIZER_MODE=slow is not supported.
The server returns a 500 status code (or a 200 status code and a BadRequest error) when logprobs is set to 0 in the request.
Llama 3.3 Nemotron Super 49B V1
If you set NIM_MANIFEST_ALLOW_UNSAFE to 1, deployment fails.
Throughput and latency degradation observed for BF16 profiles in the 5–10% range compared to previous NIM releases, and slight degradation compared to OS vLLM specifically for ISL/OSL=5k/500 at concurrencies > 100. You should set NIM_DISABLE_CUDA_GRAPH=1 when running BF16 profiles.
Caching engines built for supervised fine-tuning (SFT) models don’t work.
The model might occasionally bypass its typical thinking patterns for certain queries, especially in multi-turn conversations (for example, <think> \n\n </think>).
Listing the profiles for this model when the local cache is enabled can result in log warnings, which do not impact NIM functionality.
Logs for this model can contain spurious warnings. You can safely ignore them.
Avoid using the logit_bias parameter with this model because the results are unpredictable.
If you send more than 15 concurrent requests with detailed thinking on, the container may crash.
Llama 3.3 Nemotron Super 49B V1.5
The log indicates errors when listing profiles with the NIM cache disabled. This doesn’t impact NIM functionality.
By default, the model responds in reasoning ON mode. Set /no_think in the system prompt to enable reasoning OFF mode.
Performance degradation observed (compared to OS vLLM) for the profile vllm-gb200-bf16-2, especially at concurrency=1 and all ISL/OSL combinations.
Latency and throughput degradation observed for the vllm-gb200-bf16-2 and vllm-gh200_144gb-bf16-2 profiles.
This model is not supported with vLLM TP1 on GH200 96GB.
Performance degradation observed for the following profiles: tensorrt_llm-gh200_144gb-fp8-2-latency, tensorrt_llm-h200_nvl-fp8-2-latency, tensorrt_llm-h200-fp8-2-latency, tensorrt_llm-gb200-fp8-2-latency, and tensorrt_llm-gb200-nvfp4-2-latency.
You do not need to deploy profiles tensorrt_llm-a10g-bf16-tp4-pp2-latency and tensorrt_llm-a10g-bf16-tp4-pp2-throughput on multiple nodes; a single A10G x 8 node is sufficient.
Accuracy degradation of 3.5% observed for profile tensorrt_llm-rtx6000_blackwell_sv-nvfp4-2-latency.
Performance degradation observed for profile vllm-b200-bf16-8.
-
At least 400GB of CPU memory is required.
Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.
Accuracy was noted to be lower than the expected range with profiles vllm-bf16-tp4-pp1-lora and vllm-bf16-tp8-pp1.
The suffix parameter isn’t supported in API calls.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.
Performance degradation observed for the following profiles: tensorrt_llm-h100-bf16-8, tensorrt_llm-h100_nvl-bf16-8, and tensorrt_llm-h100-bf16-8-latency.
Tool calling may have accuracy issues.
Performance degradation observed at higher concurrencies >= 50 with TRT-LLM engines compared to the previous release.
Performance degradation observed (compared to OS vLLM) for vLLM profiles vllm-h200_nvl-bf16-4 and vllm-h200-bf16-4.
Profile vllm-bf16-tp2-pp1-lora-32 is not supported on H100 GPUs due to memory constraints.
Profile vllm-bf16-tp8-pp1-lora-32 is not supported on A10G GPUs due to memory constraints.
For profile tensorrt_llm-gb200-fp8-1-latency, observed 25–30% performance degradation compared to OSS vLLM at conc=1.
For profile tensorrt_llm-h100-fp8-4-latency, observed 10% degradation at conc=1 and 40% degradation at conc=250 for ISL/OSL 500/2k compared to the previous release.
To use profile tensorrt_llm-a10g-bf16-tp8-pp1-throughput-lora-lora-2237, set the following environment variables: -e NIM_MAX_LORA_RANK=16 -e NIM_RELAX_MEM_CONSTRAINTS=1 -e NIM_MAX_GPU_LORAS=1
-
LoRA is not supported for vLLM and TRT-LLM buildable.
Accuracy degradation observed for profiles tensorrt_llm-h200-fp8-2-latency and tensorrt_llm-l40s-fp8-tp1-pp1-throughput-lora.
Performance degradation observed on the following profiles: tensorrt_llm-b200-fp8-2-latency, tensorrt_llm-a100-bf16-tp1-pp1-throughput-lora, tensorrt_llm-h100-fp8-tp1-pp1-throughput-lora, and on all non-LoRA vLLM profiles (vllm-a100-bf16-2, vllm-a10g-bf16-2, vllm-b200-bf16-2, vllm-h100_nvl-bf16-2, vllm-h100-bf16-2, vllm-h200-bf16-2, vllm-l40s-bf16-2, vllm-gh200_480gb-bf16-1, and vllm-rtx4090-bf16-1). Set NIM_DISABLE_CUDA_GRAPHS to check for improved performance.
If you provide an invalid value for chat_template in a chat API call, the server returns a 200 status code rather than a 400 status code.
Degradation on the following vLLM profiles: vllm-a100_sxm4_40gb-bf16-2, vllm-a100-bf16-2, vllm-a10g-bf16-2, vllm-b200-bf16-2, vllm-gb200-bf16-2, vllm-gh200_144gb-bf16-2, vllm-h100-bf16-2, vllm-l40s-bf16-2, and vllm-rtx6000_blackwell_sv-bf16-2.
Performance degradation observed at higher concurrencies >= 50 with TRT-LLM engines compared to the previous release.
-
Parallel tool calling is not supported.
Performance degradation observed for profiles tensorrt_llm-h100-fp8-1-throughput and vllm-gh200_480gb-bf16-1.
Currently, TRT-LLM profiles with LoRA enabled show performance degradation compared to vLLM-LoRA profiles at low concurrencies (1 and 5).
When making requests that consume the maximum sequence length generation (such as using ignore_eos: True), generation time might be significantly longer and can exhaust the available KV cache, causing future requests to stall. In this scenario, we recommend that you reduce concurrency.
gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
This NIM doesn’t include support for TRT-LLM buildable profiles.
Deploying a fine-tuned model fails for some TRT-LLM profiles when TP is greater than 1.
Llama 3.1 Nemotron Nano 4B V1.1
Accuracy degradation observed for profile tensorrt_llm-trtllm_buildable-bf16-tp2-pp1-lora-A100.
LoRA is not supported for vLLM profiles.
Performance degradation observed for the following vLLM profiles: vllm-h200-bf16-2, vllm-gh200_480gb-bf16-1, vllm-a10g-bf16-2, and vllm-l40s-bf16-2.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
-
Accuracy degradation observed for profile vllm-a10g-bf16-4.
Performance degradation observed for the following profiles: vllm-l40s-bf16-4, vllm-bf16-tp2-pp1-lora, and tensorrt_llm-trtllm_buildable-bf16-tp2-pp1-lora.
Llama 3.1 Nemotron Ultra 253B V1
Accuracy degradation observed on the B200 GPU.
Accuracy degradation observed for the following prebuilt profiles: tensorrt_llm-h100-fp8-8-throughput, tensorrt_llm-h200-fp8-8-throughput, and tensorrt_llm-h100_nvl-fp8-8-throughput.
Accuracy degradation observed for the following buildable profile: tensorrt_llm-h200-bf16-8.
Performance degradation observed (compared to OS vLLM) for the following profiles when ISL>OSL and concurrency is >= 50: tensorrt_llm-b200-fp8-8-throughput and tensorrt_llm-b200-bf16-8.
TRT-LLM BF16 TP8 buildable profile cannot be deployed on A100 or H100.
Fine-tuned models with input vLLM checkpoints cannot be deployed on H100 GPUs due to out-of-memory (OOM) issues.
Tool calling is not supported if you set the nvext extension in the request.
Logs for this model can contain spurious warnings. You can safely ignore them.
The suffix parameter isn’t supported in API calls.
Observed degradation for vllm-h200-bf16-8 in throughput compared to OSS vLLM, specifically when ISL>OSL and concurrency is greater than 100.
Observed degradation for vllm-h100-bf16-8. You should use the trtllm-h100-bf16-8-latency profile instead.
Llama 3.1 Swallow 8B Instruct v0.1
LoRA not supported
Llama 3.1 Typhoon 2 8B Instruct
Performance degradation observed on TRT-LLM profiles when ISL>OSL and concurrency is 100 or 250 for the following GPUs: H200, A100, and L40S.
The /v1/health and /v1/metrics API endpoints return incorrect response values and empty response schemas instead of the expected health status and metrics data.
-
TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.
LoRA is not supported.
Throughput optimized profiles are not supported on A100 FP16 and H100 FP16.
vLLM profiles are not supported.
-
Performance degradation observed for vLLM profiles on the following GPUs:
B200
H200
H200 NVL
H100
H100 NVL
A100
A100 40GB
L20
Performance degradation observed for TRT-LLM profiles on the following GPUs:
H200
H200 NVL
H100
Accuracy degradation observed for the following profiles:
H200 TRT-LLM
B200 FP8, TP2, LoRA
Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.
The suffix parameter isn’t supported in API calls.
Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.
LoRA A10G TP8 for both vLLM and TRT-LLM is not supported due to insufficient memory.
The performance of vLLM LoRA on L40S TP8 is significantly suboptimal.
Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.
Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to not enough host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.
Performance degradation observed on profile tensorrt_llm-h100-bf16-8-latency compared to the vLLM baseline at concurrency > 100.
Profile vllm-bf16-tp2-pp1-lora is not supported on H100 and A100 GPUs.
Profile vllm-bf16-tp8-pp1-lora is not supported on A10G GPUs.
For profile tensorrt_llm-gh200_480gb-fp8-1-latency, observed ~10% degradation compared to prior release at conc=250 for ISL/OSL 500/2k and 1k/1k.
For profile tensorrt_llm-h100_nvl-fp8-4-latency, observed ~20% degradation compared to prior release at conc=100 for ISL/OSL 20k/200.
For profile tensorrt_llm-h100-fp8-4-latency, observed ~5% degradation compared to prior release at conc=1 for ISL/OSL 1k/1k, 500/2k, and 20k/2k, and at conc=250 for ISL/OSL 500/2k.
For profile tensorrt_llm-h200_nvl-fp8-2-latency, observed 10–20% degradation compared to prior release at conc=250 for 500/2k and 1k/1k.
For profile tensorrt_llm-h200-fp8-2-latency, observed ~6% degradation compared to prior release at conc=1 for ISL/OSL 500/2k and 1k/1k.
-
vLLM profiles are not supported.
-
Creating a chat completion with a non-existent model returns a 500 when it should return a 404.
-
Performance degradation observed for the following profiles: vllm-b200-bf16-1 and vllm-b200-bf16-2.
LoRA is not supported on L40S with TRT-LLM.
H100 and L40s LoRA profiles can hang with high (>2000) ISL values.
For the LoRA enabled profiles, TTFT can be worse with the pre-built engines compared to the vLLM fallback while throughput is better. If TTFT is critical, please consider using the vLLM fallback.
For requests that consume the maximum sequence length generation (for example, requests that use ignore_eos: True), generation time can be very long and the request can consume the available KV cache, causing future requests to stall. You should reduce concurrency under these conditions.
Performance degradation observed for BF16 profiles for ISL=5000 OSL=500 when concurrency > 100.
Performance degradation observed at higher concurrencies >= 50 with TRT-LLM engines compared to previous release.
Performance degradation observed (compared to OS vLLM) for vLLM profiles vllm-h200-bf16-2 and vllm-rtx6000_blackwell_sv-bf16-2.
Llama-3.1-8b-Instruct-DGX-Spark
Tool calling is not supported.
Guided decoding is not supported.
Maximum GPU memory usage is set to 60 GB by default (NIM_GPU_MEM_FRACTION=0.5). To change this limit, update the value of the NIM_GPU_MEM_FRACTION environment variable. For example, set NIM_GPU_MEM_FRACTION=0.6 for 72 GB or NIM_GPU_MEM_FRACTION=0.9 for 108 GB.
A 20% degradation in throughput observed for vLLM at FP8 precision compared to OSS vLLM at concurrency=10 across multiple ISL/OSL combinations (2000/200, 200/2000, and 1000/1000).
This NIM is only intended to be operated with concurrency < 10.
This NIM returns a 500 error if the request uses an invalid value for the logprobs parameter.
The repetition_penalty parameter is validated for its data type but not its value range. Unexpected behavior may result when using a value outside the specified range.
This NIM is built with a different base container and is subject to limitations. Refer to Notes on NIM Container Variants for more information.
Llama 3.1 models
vLLM profiles fail with ValueError: Unknown RoPE scaling type extended.
Llama 3.1 FP8
Requires NVIDIA driver version >= 550.
-
LoRA isn’t supported on 8 x GPU configuration
-
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other GPUs might encounter a “CUDA out of memory” issue.
Mistral NeMo Minitron 8B 8K Instruct
Tool calling is not supported.
LoRA is not supported.
vLLM TP4 or TP8 profiles are not available.
Mistral Small 24b Instruct 2501
Tool calling is not supported.
The suffix parameter is not supported in API calls.
This model requires at least 48GB of VRAM but cannot be launched on a single 48GB GPU such as L40S. Single-GPU deployment is only supported on GPUs with 80GB or more of VRAM (for example, A100 80GB or H100 80GB).
Setting NIM_TOKENIZER_MODE=slow is not supported.
-
Optimized TRT-LLM profiles have lower performance compared to open-source vLLM.
For smaller parameter models (less than or equal to 3 billion parameters), tool calling is highly inaccurate due to the limited parameter count. Do not use these models for tool calling use cases.
Multi-turn tool calling is not supported.
-
Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.
LoRA is not supported with the TRT-LLM backend for MoE models.
vLLM LoRA profiles return an internal server error/500. Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM.
vLLM profiles do not support function calling and structured generation. See vLLM #9433.
If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.
Performance degradation observed for the following vLLM profiles: vllm-b200-bf16-2, vllm-a10g-bf16-8, and vllm-l40s-bf16-4.
Performance degradation observed for profiles vllm-rtx6000_blackwell_sv-bf16-8 and vllm-h200_nvl-bf16-4.
Performance degradation observed with higher concurrencies >= 50 with TRT-LLM engines compared to the LLM NIM version 1.8.4 release.
-
Setting NIM_TOKENIZER_MODE=slow is not supported.
KV cache reuse is not supported.
You can set NIM_MAMBA_SSM_CACHE_DTYPE to float32 (default) or auto only.
To load the model on a single, less performant GPU (like the A10G), set NIM_MAX_MODEL_LEN to the context size of 131072 for this model.
The default context length allocation might result in Out of Memory (OOM) errors. For the A100 40GB GPU, set NIM_MAX_NUM_SEQS to 64. For the A10G GPU, set NIM_MAX_NUM_SEQS to 4. Other GPUs have suitable default values. Decrease the value for NIM_MAX_NUM_SEQS if OOM errors persist (see the launch sketch after this list).
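A minimal, hypothetical A10G launch sketch combining the settings above; the image reference is a placeholder for this model's NIM container and the rest of the command follows the usual single-node pattern:
# Keep the full 131072-token context on an A10G while limiting concurrent
# sequences to avoid OOM errors (use NIM_MAX_NUM_SEQS=64 on A100 40GB instead).
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=131072 \
  -e NIM_MAX_NUM_SEQS=4 \
  -p 8000:8000 \
  nvcr.io/nim/<publisher>/<model-name>:latest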
NVIDIA-Nemotron-Nano-9B-v2-DGX-Spark
This NIM is vLLM-based. SGLang is not supported.
The repetition_penalty parameter is validated for its data type but not its value range. Unexpected behavior may result when using a value outside the specified range.
Tool calling is not supported when the stream parameter is set to true.
logit_bias is not supported.
If you specify an invalid value for role in the request, the endpoint may respond with an error message that is not compliant with the OpenAI API error message for invalid roles.
This NIM is built with a different base container and is subject to limitations. Refer to Notes on NIM Container Variants for more information.
Nemotron4 models
Require use of ‘slow’ tokenizers. ‘Fast’ tokenizers cause accuracy degradation.
-
LoRA is not supported
Tool calling is not supported
-
When reasoning mode is enabled, the thinking results are returned in the response without <think> and </think> markers.
Phind Codellama 34B V2 Instruct
LoRA is not supported
Tool calling is not supported
-
You must set NIM_ENABLE_MTP to 1 to enable the LLM to generate several tokens at once.
Thinking budget is not supported because the origin model does not support this feature.
For tool calling, we recommend that you use "tool_choice": "auto". Setting tool_choice to function causes endless token generation.
This NIM is built with a different base container and is subject to limitations. Refer to Notes on NIM Container Variants for more information.
-
You cannot deploy this model using KServe or Kubernetes.
Maximum GPU memory usage is set to 108 GB by default (NIM_GPU_MEM_FRACTION=0.9). To change this limit, update the value of the NIM_GPU_MEM_FRACTION environment variable. For example, set NIM_GPU_MEM_FRACTION=0.6 for 72 GB.
This NIM is built with a different base container and is subject to limitations. Refer to Notes on NIM Container Variants for more information.
-
Setting NIM_TOKENIZER_MODE=slow is not supported.
-
The alternative option to use vLLM is not supported due to poor performance. For the GPUs that have no optimized version, use the trtllm_buildable feature to build the TRT-LLM engine on the fly.
For all pre-built engines, gather_context_logits is not enabled. If you require logits output, specify it in your own TRT-LLM configuration when you use the trtllm_buildable feature.
The tool_choice parameter is not supported.
Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang. Use WARNING, DEBUG, or INFO instead.
-
A pre-built TRT-LLM engine for L20 is available, but it is not fully optimized for different use cases.
LoRA is not supported.
The tool_choice parameter is not supported.
Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang.
Performance issues may occur in specific use cases when using the vLLM backend on L20.
-
Tool calling is not supported.
The suffix parameter is not supported in API calls.
The stream_options parameter is not supported in API calls.
The logprobs parameter is not supported when stream=true in API calls.
This model requires at least 48GB of VRAM but cannot be launched on a single 48GB GPU such as L40S. Single-GPU deployment is only supported on GPUs with 80GB or more of VRAM (for example, A100 80GB or H100 80GB).
-
This model is optimized for Arabic language contexts. While the model does process input in other languages, you may experience inconsistencies or reduced accuracy in content generated for non-Arabic languages.
The suffix parameter isn’t supported in API calls.
-
Does not support the chat endpoint.
-
Deployment fails on H100 with vLLM (TP1, PP1) at 250 concurrent requests.
Deployment fails for vLLM profiles when NIM_ENABLE_KV_CACHE_REUSE=1.
Using FP32 checkpoints for the NIM_FT_MODEL variable or local build isn’t supported.
You must disable KV cache reuse by setting NIM_ENABLE_KV_CACHE_REUSE=0 when running this NIM with NIM_GUIDED_DECODING_BACKEND set to lm-format-enforcer or a custom backend. An incorrect backend name is treated as a custom backend.
Observed a 30–50% performance degradation compared to OS vLLM at conc=250 for ISL/OSL 500/2k and 1k/1k for the following profiles: tensorrt_llm-h100-bf16-2-latency, tensorrt_llm-h200-bf16-2-latency, tensorrt_llm-buildable-h100-bf16-2-latency, and tensorrt_llm-buildable-h200-2-latency.
Gemma models
Setting NIM_TOKENIZER_MODE=slow is not supported.
The SGLang backend is not supported.
LoRA is not supported.