Release Notes#

All Known Issues#

These are the known issues, by release. If an issue is fixed, the release in which it was fixed is listed in bold.

1.8.0#

  • Llama 3.1 8B Instruct

    • For LoRA-enabled profiles, time to first token (TTFT) can be worse with the pre-built engines than with the vLLM fallback, although throughput is better. If TTFT is critical, consider using the vLLM fallback.

    • For requests that generate up to the maximum sequence length (for example, requests that use ignore_eos: True), generation time can be very long, and the request can consume the available KV cache, causing future requests to stall. Reduce concurrency under these conditions.

  • Llama 3.1 70B Instruct

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • Accuracy was noted to be lower than the expected range with the following profiles: vllm-l40s-bf16-8, vllm-l40s-bf16-4, vllm-h200-bf16-8, vllm-h200-bf16-2, vllm-h100-bf16-8, vllm-h100-bf16-2, vllm-h100_nvl-bf16-8, and vllm-h100_nvl-bf16-4.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

  • Llama 3.1 Nemotron Nano 8B V1

    • Currently, LoRA is not supported for this model.

    • Currently, tool calling is not supported.

    • Accuracy degradation observed for the following profiles: vllm-a100-bf16-1 and vllm-h200-bf16-2.

  • Llama 3.2 3B Instruct

    • When making requests that generate up to the maximum sequence length (such as using ignore_eos: True), generation time might be significantly longer and can exhaust the available KV cache, causing future requests to stall. In this scenario, we recommend that you reduce concurrency.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

  • Llama 3.3 Nemotron Super 49B V1

    • The model might occasionally bypass its typical thinking patterns for certain queries, especially in multi-turn conversations (for example, returning only \n\n in place of its usual thinking output).

    • Tool calling is not currently supported.

    • The A10G GPU is not currently supported.

    • You cannot deploy this model using KServe.

    • Listing the profiles for this model when the local cache is enabled can result in log warnings, which do not impact NIM functionality.

    • Logs for this model can contain spurious warnings. You can safely ignore them.

    • Avoid using the logit_bias parameter with this model because the results are unpredictable.

  • Llama 3.3 70B Instruct

    • At least 400GB of CPU memory is required.

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • Accuracy was noted to be lower than the expected range with profiles vllm-bf16-tp4-pp1-lora and vllm-bf16-tp8-pp1.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

  • The maximum supported context length may decrease based on memory availability.

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • For Llama 3.1 8B Instruct RTX, creating a chat completion with a non-existent model returns a 500 when it should return a 404.

  • StarCoder2 7B model deployment fails on H100 with vLLM (TP1, PP1) at 250 concurrent requests.

  • On GH200, NVIDIA drivers earlier than 560.35.03 can cause a segmentation fault or a hang during deployment. Fixed in GPU driver 560.35.03.

  • DeepSeek-R1-Distill-Qwen-32B

    • BF16 profiles require at least 64GB of GPU memory to launch. For example, the vllm-bf16-tp1-pp1 profile does not launch successfully on a single L20 or on other supported GPUs with less than 80GB of GPU memory.

    • Structured generation can behave unexpectedly because of chain-of-thought (CoT) output. Despite this, the guided_json parameter works normally when used with a JSON schema prompt.

    • When running the vLLM engine on a GPU with less memory, you might encounter a ValueError stating that the model maximum sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 or less when using a vLLM profile (see the example launch command after this list).

    • Using a trtllm_buildable profile with a fine-tuned model can crash on H100.

    • At least 80GB of CPU memory is recommended.

  • Some stop words might not work as expected and might appear in the output.

  • vLLM for A100 and H200 is not supported.

  • Structured generation with regular expressions may produce unexpected responses. We recommend that you provide a strict answer format, such as \boxed{}, to get the correct response.
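
Several of the workarounds above pass NIM_MAX_MODEL_LEN at container launch. The following is a minimal launch sketch, assuming the standard NIM docker run flags; the image path, tag, and cache location are illustrative and should be adjusted to your deployment:

# Cap the context length so the model fits in the available KV cache (vLLM profile workaround).
export NGC_API_KEY=<your-ngc-api-key>
docker run -it --rm --gpus all \
    -e NGC_API_KEY \
    -e NIM_MAX_MODEL_LEN=32768 \
    -v "$HOME/.cache/nim:/opt/nim/.cache" \
    -p 8000:8000 \
    nvcr.io/nim/deepseek-ai/deepseek-r1-distill-qwen-32b:latest   # illustrative image path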

1.7.0#

  • The min_p sampling parameter is not compatible with DeepSeek models and will be set to 0.0.

  • The following are not supported for DeepSeek models:

    • LoRA

    • Guided Decoding

    • FT (fine-tuning)

  • DeepSeek models require setting --trust-remote-code. This is handled automatically in DeepSeek NIMs.

  • Only profiles matching the following hardware topologies are supported for the DeepSeek R1 model:

    • 2 nodes of 8xH100

    • 1 node of 8xH200

  • DeepSeek-R1 profiles disable DP attention by default to avoid crashes at higher concurrency. To turn on DP attention you can set NIM_ENABLE_DP_ATTENTION.

  • The model quantization is fp8, but the logs incorrectly display it as bf16.

1.6.0#

  • StarCoder2 7B might return a KV cache “no new block” error. Set NIM_MAX_MODEL_LEN = 4096 to enable all profiles. Fixed in 1.8.0.

  • DeepSeek-R1-Distill-Qwen-14B

    • When running the vLLM engine with less than 48GB of GPU memory, you might encounter a ValueError stating that the model maximum sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.

  • Llama 3.2 3B Instruct does not support GH200 96GB profile.

1.5.0#

  • The maximum supported context length may decrease based on memory availability.

  • Filenames should not contain spaces if a custom fine-tuned model directory is provided.

  • The "fast_outlines" guided decoding backend will fail with requests that force the model to generate emoji.

  • StarCoderBase 15.5B does not support the chat endpoint.

  • Llama 3.3 70B Instruct requires at least 400GB of CPU memory.

  • DeepSeek-R1-Distill-Llama-70B

    • This model does not include pre-built engines for TP8, A10G, and H100.

    • To deploy, set -e NIM_MAX_MODEL_LEN = 131072.

  • DeepSeek-R1-Distill-Qwen-7B

    • When running the vLLM engine with A10G, you might encounter a ValueError stating that the model maximum sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.

    • kv_cache_reuse is not supported.

    • The suffix parameter is not supported in API calls.

1.4.0#

  • LoRA is not supported for the following models:

  • Gemma-2-2b does not support the System role in a chat or completions API call.

  • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.

  • Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.

  • Qwen2.5-72B-Instruct

    • The alternative option to use vLLM is not supported due to poor performance. For GPUs that have no optimized version, use the trtllm_buildable feature to build the TRT-LLM engine on the fly.

    • For all pre-built engines, gather_context_logits is not enabled. If you require logits output, specify it in your own TRT-LLM configuration when you use the trtllm_buildable feature.

    • The tool_choice parameter is not supported.

    • Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang. Use WARNING, DEBUG or INFO instead.

  • Qwen2.5-7B-Instruct

    • There is a pre-built TRT-LLM engine for L20, but it is not fully optimized for different use cases.

    • LoRA is not supported.

    • The tool_choice parameter is not supported.

    • There might be a performance issue when using vLLM on L20.

1.3.0#

  • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • Llama 3.1 70B Instruct:

    • LoRA on A10G TP8 is not supported for both vLLM and TRT-LLM due to insufficient memory.

    • The performance of vLLM LoRA on L40S TP8 is significantly suboptimal.

    • Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.

    • There’s an incorrect warning regarding checksums when running the 1.3 NIM. Fixed in 1.4.

    • Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to not enough host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.

  • Llama 3.1 405B Instruct TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.

  • Mistral 7B Instruct V0.3 with optimized TRT-LLM profiles has lower performance compared to the open-source vLLM.

  • Mixtral 8x7B Instruct v0.1

    • Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.

    • LoRA is not supported with the TRT-LLM backend for MoE models.

    • vLLM LoRA profiles return an internal server error/500. Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM.

    • If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.

  • Nemotron 4 340B Instruct 128K does not support buildable TRT-LLM profiles.

  • The container may crash when building local TensorRT LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1.

  • Function calling and structured generation are not supported for pipeline parallelism greater than 1.

  • Locally built fine-tuned models are not supported with FP8 profiles.

  • Log probabilities (logprobs) support with echo:

    • The TRT-LLM engine must be built explicitly with --gather_generation_logits.

    • Enabling this may impact model throughput and inter-token latency.

    • NIM_MODEL_NAME must be set to the generated model repository.

  • vGPU related issues:

    • trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed by setting NIM_LOW_MEMORY_MODE=1.

    • When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error. For example, client_loop: send disconnect: Broken pipe.

  • The out-of-the-box (OOB) maximum sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.

  • The GET v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).

1.2.3#

  • Code Llama models:

    • FP8 profiles are not released due to accuracy degradation.

    • LoRA is not supported

  • Llama 3.1 8B Instruct does not support LoRA on L40S with TRT-LLM.

  • Mistral NeMo Minitron 8B 8K Instruct:

    • Tool calling is not supported

    • LoRA is not supported

    • vLLM TP4 or TP8 profiles are not available.

  • Mixtral 8x7b Instruct v0.1 vLLM profiles do not support function calling and structured generation. See vLLM #9433.

  • Phi 3 Mini 4K Instruct models:

    • LoRA is not supported

    • Tool calling is not supported

  • Phind Code Llama 34B v2 Instruct:

    • LoRA is not supported

    • Tool calling is not supported

  • logprobs=2 is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models. Refer to the Supported Models section for details.

  • NIM with the vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (vllm-project/vllm#5060). You must restart the NIM in this case.

1.2.1#

  • vLLM + LoRA profiles for long-context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.

  • LoRA is not supported on Llama 3.1 8B Instruct on L40S with TRT-LLM.

  • logit_bias is not available for any model using the TRT-LLM backend.

1.2.0#

  • NIM does not support Multi-instance GPU mode (MIG).

  • Nemotron4 models require the use of ‘slow’ tokenizers; ‘fast’ tokenizers cause accuracy degradation.

  • LoRA is not supported for Llama 3.1 405B Instruct.

  • vLLM profiles are not supported for Llama 3.1 405B Instruct.

  • Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.

  • When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.

  • Llama 3.1 8B Instruct H100 and L40S LoRA profiles can hang with high (>2000) input sequence length (ISL) values.

1.1.2#

  • LoRA is not supported for Llama 3.1 405B Instruct

  • vLLM profiles are not supported for Llama 3.1 405B Instruct

  • Throughput optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct

  • Cache deployment fails for an air-gapped system or a read-only volume with the multi-GPU vLLM profile. Fixed in 1.2.0.

  • CUDA out of memory issue for Llama2 70B v1.0.3: The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.

  • Llama 3.1 FP8 requires NVIDIA driver version >= 550

1.1.1#

  • vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct

1.1.0#

  • vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.

  • NIM does not support Multi-instance GPU mode (MIG).

1.0#

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • Llama 3 70B v1.0.3: LoRA isn’t supported on an 8-GPU configuration.

  • Llama2 70B vLLM FP16 TP2 profile restriction: NVIDIA has validated Llama2 70B on various configurations of H100, A100, and L40S GPUs. Llama2 70B runs on the tp4 (four GPU) and tp8 (eight GPU) versions of H100, A100, and L40S; however, the tp2 (two GPU) version of L40S does not have enough memory to run Llama2 70B, and any attempt to run it on that platform can encounter a CUDA “out of memory” issue.

  • P-Tuning isn’t supported.

  • Empty metrics values on multi-GPU TensorRT-LLM models: The metrics items gpu_cache_usage_perc, num_request_max, num_requests_running, num_requests_waiting, and prompt_tokens_total won’t be reported for multi-GPU TensorRT-LLM models because TensorRT-LLM currently doesn’t expose iteration statistics in orchestrator mode.

  • “No tokenizer found” error when running PEFT: This warning can be safely ignored.

Release 1.8.0#

New Language Models#

New Features#

  • The introduction of reasoning models, which are post-trained with two unique system prompts. Llama 3.3 Nemotron Super 49B V1 is the first of these models and can toggle between two behaviors by modifying the system prompt, with no additional scaffolding required (see the example request after this list):

    • detailed thinking on: Generates long chain-of-thought style responses with explicit thinking tokens.

    • detailed thinking off: Generates more concise responses without extended chain-of-thought or thinking tokens.

  • Buildable profiles now set the default maximum sequence length from the HF model configuration.

  • Support for RTX 5090, 5080, 4090, and 4080 GPUs.
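
For example, the following request turns the reasoning behavior on through the system prompt. This is a sketch that assumes the OpenAI-compatible endpoint on the default port 8000 and a served model name of nvidia/llama-3.3-nemotron-super-49b-v1; confirm the exact name with GET /v1/models on your deployment:

curl -s http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
      "messages": [
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
      ],
      "max_tokens": 1024
    }'

Changing the system prompt to detailed thinking off produces the concise behavior without extended chain-of-thought.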

Known Issues#

  • Llama 3.1 8B Instruct

    • For LoRA-enabled profiles, time to first token (TTFT) can be worse with the pre-built engines than with the vLLM fallback, although throughput is better. If TTFT is critical, consider using the vLLM fallback.

    • For requests that generate up to the maximum sequence length (for example, requests that use ignore_eos: True), generation time can be very long, and the request can consume the available KV cache, causing future requests to stall. Reduce concurrency under these conditions.

  • Llama 3.1 70B Instruct

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • Accuracy was noted to be lower than the expected range with the following profiles: vllm-l40s-bf16-8, vllm-l40s-bf16-4, vllm-h200-bf16-8, vllm-h200-bf16-2, vllm-h100-bf16-8, vllm-h100-bf16-2, vllm-h100_nvl-bf16-8, and vllm-h100_nvl-bf16-4.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Verify that the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

  • Llama 3.1 Nemotron Nano 8B V1

    • Currently, LoRA is not supported for this model.

    • Currently, tool calling is not supported.

    • Accuracy degradation observed for the following profiles: vllm-a100-bf16-1 and vllm-h200-bf16-2.

  • Llama 3.2 3B Instruct

    • When making requests that generate up to the maximum sequence length (such as using ignore_eos: True), generation time might be significantly longer and can exhaust the available KV cache, causing future requests to stall. In this scenario, we recommend that you reduce concurrency.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

  • Llama 3.3 Nemotron Super 49B V1

    • The model might occasionally bypass its typical thinking patterns for certain queries, especially in multi-turn conversations (for example, returning only \n\n in place of its usual thinking output).

    • Tool calling is not currently supported.

    • The A10G GPU is not currently supported.

    • You cannot deploy this model using KServe.

    • Listing the profiles for this model when the local cache is enabled can result in log warnings, which do not impact NIM functionality.

    • Logs for this model can contain spurious warnings. You can safely ignore them.

    • Avoid using the logit_bias parameter with this model because the results are unpredictable.

  • Llama 3.3 70B Instruct

    • At least 400GB of CPU memory is required.

    • Concurrent requests are blocked when running NIM with the -e NIM_MAX_MODEL_LENGTH option and a large max_tokens value in the request.

    • Accuracy was noted to be lower than the expected range with profiles vllm-bf16-tp4-pp1-lora and vllm-bf16-tp8-pp1.

    • The suffix parameter isn’t supported in API calls.

    • Insufficient memory for KV cache and LoRA cache might result in Out of Memory (OOM) errors. Make sure the hardware is appropriately sized based on the memory requirements for the workload. Long context and LoRA workloads should use larger TP configurations.

    • gather_context_logits is not enabled by default. If you require logits output, specify it in your TRT-LLM configuration when using the trtllm_buildable feature by setting the environment variable NIM_ENABLE_PROMPT_LOGPROBS.

  • The maximum supported context length may decrease based on memory availability.

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • For Llama 3.1 8B Instruct RTX, creating a chat completion with a non-existent model returns a 500 when it should return a 404.

  • StarCoder2 7B model deployment fails on H100 with vLLM (TP1, PP1) at 250 concurrent requests.

  • On GH200, NVIDIA drivers earlier than 560.35.03 can cause a segmentation fault or a hang during deployment. Fixed in GPU driver 560.35.03.

  • DeepSeek-R1-Distill-Qwen-32B

    • BF16 profiles require at least 64GB of GPU memory to launch. For example, the vllm-bf16-tp1-pp1 profile does not launch successfully on a single L20 or on other supported GPUs with less than 80GB of GPU memory.

    • Structured generation can behave unexpectedly because of chain-of-thought (CoT) output. Despite this, the guided_json parameter works normally when used with a JSON schema prompt (see the example request after this list).

    • When running the vLLM engine on a GPU with less memory, you might encounter a ValueError stating that the model maximum sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 or less when using a vLLM profile.

    • Using a trtllm_buildable profile with a fine-tuned model can crash on H100.

    • At least 80GB of CPU memory is recommended.

  • Some stop words might not work as expected and might appear in the output.

  • vLLM for A100 and H200 is not supported.

  • Structured generation with regular expressions may produce unexpected responses. We recommend that you provide a strict answer format, such as \boxed{}, to get the correct response.
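
Regarding the guided_json note for DeepSeek-R1-Distill-Qwen-32B above, the following sketch pairs a JSON schema prompt with the guided_json parameter. It assumes the nvext request extension for guided decoding and a served model name of deepseek-ai/deepseek-r1-distill-qwen-32b; verify both against your deployment:

curl -s http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/deepseek-r1-distill-qwen-32b",
      "messages": [
        {"role": "user", "content": "Return JSON with a single string field \"answer\". What is the capital of France?"}
      ],
      "nvext": {
        "guided_json": {
          "type": "object",
          "properties": {"answer": {"type": "string"}},
          "required": ["answer"]
        }
      },
      "max_tokens": 256
    }'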

Caveats#

  • Certain models need more host/CPU memory than in the previous 1.6.0 release. For example, StarCoder2 7B on LLM NIM 1.8.0 requires 60 GB of host memory, instead of 50 GB on LLM NIM 1.6.0.

Fixed Issues#

  • Fixed the StarCoder2 7B KV cache “no new block” error from release 1.6.0.

Release 1.7.0#

New Language Models#

New Features#

  • Added a new SGLang backend for serving LLMs, in addition to the vLLM and TensorRT-LLM backends.

Known Issues#

  • The min_p sampling parameter is not compatible with DeepSeek models and will be set to 0.0.

  • The following are not supported for DeepSeek models:

    • LoRA

    • Guided Decoding

    • FT (fine-tuning)

  • DeepSeek models require setting --trust-remote-code. This is handled automatically in DeepSeek NIMs.

  • Only profiles matching the following hardware topologies are supported for the DeepSeek R1 model:

    • 2 nodes of 8xH100

    • 1 node of 8xH200

  • DeepSeek-R1 profiles disable DP attention by default to avoid crashes at higher concurrency. To turn on DP attention you can set NIM_ENABLE_DP_ATTENTION.

  • The model quantization is fp8, but the logs incorrectly display it as bf16.

Release 1.6.0#

New Language Models#

New Features#

Known Issues#

  • StarCoder2 7B might run into KV cache “no new block” error. Set NIM_MAX_MODEL_LEN = 4096 to enable all profiles.

  • DeepSeek-R1-Distill-Qwen-14B

    • When running the vLLM engine with less than 48GB of GPU memory, you might encounter a ValueError stating that the model maximum sequence length is larger than the maximum KV cache storage. Set NIM_MAX_MODEL_LEN=32768 to enable the vLLM profile.

  • Llama 3.2 3B Instruct does not support GH200 96GB profile.

Release 1.5.0#

New Language Models#

New Features#

  • Support for A100 SXM 40GB

  • Added an opt-in setting for the guided decoding backend (NIM_GUIDED_DECODING_BACKEND) to reduce TTFT. Note: this requires a GPU driver version compatible with PTX 8.5.

Known Issues#

  • The maximum supported context length may decrease based on memory availability.

  • Filenames should not contain spaces if a custom fine-tuned model directory is provided.

  • The "fast_outlines" guided decoding backend will fail with requests that force the model to generate emoji.

  • StarCoderBase 15.5B does not support the chat endpoint.

  • Llama 3.3 70B Instruct requires at least 400GB of CPU memory.

  • DeepSeek-R1-Distill-Llama-70B

    • This model does not include pre-built engines for TP8, A10G, and H100.

    • To deploy, set -e NIM_MAX_MODEL_LEN = 131072.

Release 1.4.0#

New Models#

New Features#

  • Various performance improvements and bug fixes.

Fixed Issues#

  • Fixed the issue where an incorrect warning regarding checksums appeared when running the 1.3 NIM.

Known Issues#

  • LoRA is not supported for the following models:

  • Gemma-2-2b does not support the System role in a chat or completions API call.

  • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.

  • Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.

  • Qwen2.5-72B-Instruct

    • The alternative option to use vLLM is not supported due to poor performance. For GPUs that have no optimized version, use the trtllm_buildable feature to build the TRT-LLM engine on the fly.

    • For all pre-built engines, gather_context_logits is not enabled. If you require logits output, specify it in your own TRT-LLM configuration when you use the trtllm_buildable feature.

    • The tool_choice parameter is not supported.

    • Deploying NIM with NIM_LOG_LEVEL=CRITICAL causes the start process to hang. Use WARNING, DEBUG or INFO instead.

  • Qwen2.5-7B-Instruct

    • There is a pre-built TRT-LLM engine for L20, but it is not fully optimized for different use cases.

    • LoRA is not supported.

    • The tool_choice parameter is not supported.

    • There might be a performance issue when using vLLM on L20.

Release 1.3.0#

New Language Models#

New Features#

  • Custom fine-tuned model support. See FT support for more details.

  • The introduction of tensorrt_llm-local_build profiles, which enable the use of the TensorRT-LLM runtime on GPUs without pre-built optimized engines. See the Model Profiles page for more details, and see the profile-selection sketch below.

  • Caching of locally built and fine-tuned engines to work seamlessly with the regular LLM NIM workflow.

  • Implemented key-value cache to speed up inference when the initial prompt is identical across multiple requests. Refer to KV Cache for details.

Users with systems that do not have pre-built optimized engines available should see substantial speed ups over previous versions of NIM, but may experience slower start times on first deployment due to the local compilation process.
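
The following sketch shows one way to inspect and select a profile on such a system. It assumes the list-model-profiles utility that ships in the NIM container and the NIM_MODEL_PROFILE variable for explicit profile selection; the image path and profile ID are placeholders:

# List the profiles (pre-built or tensorrt_llm-local_build) compatible with this system.
# Assumes NGC_API_KEY is exported in the environment.
docker run --rm --gpus all -e NGC_API_KEY \
    nvcr.io/nim/meta/llama3-8b-instruct:latest list-model-profiles

# Launch with an explicit profile ID copied from the listing output.
docker run -it --rm --gpus all \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE=<profile-id-from-listing> \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest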

Known Issues#

  • Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.

  • All models return a 500 when setting logprobs=2, echo=true, and stream=false; they should return a 200.

  • Llama 3.1 70B Instruct:

    • LoRA on A10G TP8 is not supported for both vLLM and TRT-LLM due to insufficient memory.

    • The performance of vLLM LoRA on L40S TP8 is significantly suboptimal.

    • Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.

    • There’s an incorrect warning regarding checksums when running the 1.3 NIM. For example: Profile 0462612f0f2de63b2d423bc3863030835c0fbdbc13b531868670cc416e030029 is not fully defined with checksums. It is safe to ignore this warning.

    • Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to not enough host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.

  • Llama 3.1 405B Instruct TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.

  • Mistral 7B Instruct V0.3 with optimized TRT-LLM profiles has lower performance compared to the open-source vLLM.

  • Mixtral 8x7B Instruct v0.1

    • Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.

    • LoRA is not supported with the TRT-LLM backend for MoE models.

    • vLLM LoRA profiles return an internal server error/500. Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM.

    • If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.

  • Nemotron 4 340B Instruct 128K does not support buildable TRT-LLM profiles.

  • The container may crash when building local TensorRT LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1.

  • Function calling and structured generation are not supported for pipeline parallelism greater than 1.

  • Locally built fine-tuned models are not supported with FP8 profiles.

  • Log probabilities (logprobs) support with echo (see the sample request after this list):

    • The TRT-LLM engine must be built explicitly with --gather_generation_logits.

    • Enabling this may impact model throughput and inter-token latency.

    • NIM_MODEL_NAME must be set to the generated model repository.

  • vGPU related issues:

    • trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed by setting NIM_LOW_MEMORY_MODE=1.

    • When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error. For example, client_loop: send disconnect: Broken pipe.

  • The out-of-the-box (OOB) maximum sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.

  • The GET v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).
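
For reference, a logprobs-with-echo completions request has the following shape. This is a sketch that assumes the OpenAI-compatible endpoint on the default port and a placeholder model name; as noted above, the TRT-LLM engine must have been built with --gather_generation_logits for this to work:

curl -s http://0.0.0.0:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "my-locally-built-model",
      "prompt": "The capital of France is",
      "max_tokens": 16,
      "echo": true,
      "logprobs": 1
    }'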

Software requirements updated#

Release 1.3.0 is based on CUDA 12.6.1, which requires NVIDIA driver release 560 or later. However, if you are running on a data center GPU (for example, A100 or any other data center GPU), you can use NVIDIA driver release 470.57 (or later R470), 535.86 (or later R535), or 550.54 (or later R550).

Release 1.2.3#

New Language Models#

Known Issues#

  • Code Llama models:

    • FP8 profiles are not released due to accuracy degradation.

    • LoRA is not supported

  • Llama 3.1 8B Instruct does not support LoRA on L40S with TRT-LLM.

  • Mistral NeMo Minitron 8B 8K Instruct:

    • Tool calling is not supported

    • LoRA is not supported

    • vLLM TP4 or TP8 profiles are not available.

  • Mixtral 8x7b Instruct v0.1 vLLM profiles do not support function calling and structured generation. See vLLM #9433.

  • Phi 3 Mini 4K Instruct models:

    • LoRA is not supported

    • Tool calling is not supported

  • Phind Code Llama 34B v2 Instruct:

    • LoRA is not supported

    • Tool calling is not supported

  • logprobs=2 is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models. Refer to the Supported Models section for details.

  • NIM with the vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (vllm-project/vllm#5060). You must restart the NIM in this case.

Release 1.2.1#

New Models#

Known Issues#

  • vLLM + LoRA profiles for long-context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.

  • LoRA is not supported on Llama 3.1 8B Instruct on L40S with TRT-LLM.

  • logit_bias is not available for any model using the TRT-LLM backend.

Release 1.2.0#

New Language Models#

For a list of all supported models refer to the Supported Models topic.

New Features#

  • Added vGPU support by improving the device selector. Refer to Supported Models for vGPU details.

    • With UVM and an optimized engine available, the model runs on TRT-LLM.

    • Otherwise, the model runs on vLLM.

  • Added OpenTelemetry support for tracing and metrics in the API server. Refer to Configuration for details, including NIM_ENABLE_OTEL, NIM_OTEL_TRACES_EXPORTER, NIM_OTEL_METRICS_EXPORTER, NIM_OTEL_EXPORTER_OTLP_ENDPOINT, and NIM_OTEL_SERVICE_NAME. See the launch sketch after this list.

  • Enabled the echo request parameter in the completions API to align with OpenAI specifications. Refer to the NIM OpenAPI Schema for details.

  • Added logprobs support for echo mode, which returns logprobs for the full context, including both prompt and output tokens.

  • Added FP8 engine support with FP16 LoRA. Refer to PEFT for details about LoRA usage.
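
The following launch sketch enables the new telemetry options. The exporter values, collector endpoint, service name, and image path are placeholders, and NIM_ENABLE_OTEL=1 is an assumed on-value; check the Configuration page for the accepted values:

# Assumes NGC_API_KEY is exported in the environment.
docker run -it --rm --gpus all \
    -e NGC_API_KEY \
    -e NIM_ENABLE_OTEL=1 \
    -e NIM_OTEL_TRACES_EXPORTER=otlp \
    -e NIM_OTEL_METRICS_EXPORTER=otlp \
    -e NIM_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
    -e NIM_OTEL_SERVICE_NAME=llm-nim \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:latest   # illustrative image path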

Fixed Issues#

  • Cache deployment fails for an air-gapped system or a read-only volume with the multi-GPU vLLM profile.

Known Issues#

  • NIM does not support Multi-instance GPU mode (MIG).

  • Nemotron4 models require the use of ‘slow’ tokenizers; ‘fast’ tokenizers cause accuracy degradation.

  • LoRA is not supported for Llama 3.1 405B Instruct.

  • vLLM profiles are not supported for Llama 3.1 405B Instruct.

  • Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.

  • When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.

  • Llama 3.1 8B Instruct H100 and L40S LoRA profiles can hang with high (>2000) input sequence length (ISL) values.

Release 1.1.2#

New Language Models#

  • Llama 3.1 405B Instruct

    • Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. Refer to Supported Models for details.

New Features#

  • Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct

Known Issues#

  • LoRA is not supported for Llama 3.1 405B Instruct.

  • vLLM profiles are not supported for Llama 3.1 405B Instruct.

  • Throughput optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct.

  • Cache deployment fails for an air-gapped system or a read-only volume with the multi-GPU vLLM profile. Users deploying a cache into an air-gapped system or read-only volume and intending to use the multi-GPU vLLM profile must create the following JSON file from the system used to initially download and generate the cache:

    echo '{
        "0->0": false,
        "0->1": true,
        "1->0": true,
        "1->1": false
    }' > $NIM_CACHE_PATH/vllm/cache/gpu_p2p_access_cache_for_0,1.json

  • CUDA out of memory issue for Llama2 70B v1.0.3: The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.

  • Llama 3.1 FP8 requires NVIDIA driver version >= 550.

Release 1.1.1#

Known Issues#

  • vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct

Release 1.1.0#

New Language Models#

  • Llama 3.1 8B Base

  • Llama 3.1 8B Instruct

  • Llama 3.1 70B Instruct

New Features#

Known Issues#

  • vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.

  • NIM does not support Multi-instance GPU mode (MIG).

Release 1.0#

  • Release notes for Release 1.0 are located in the 1.0 documentation.