Release Notes#
Note that issues listed in the Known Issues sections are valid until they are fixed in a subsequent release.
Release 1.4.0#
Summary#
This is the latest version of NIM.
New Models#
New Features#
Various performance improvements and bug fixes.
Fixed Issues#
Fixed the incorrect warning about checksums that appeared when running the 1.3 NIM.
Known Issues#
LoRA is not supported for the following models:
Gemma-2-2b does not support the System role in a chat or completions API call.
Prompts with Unicode characters in the range from 0x0e0020 to 0x0e007f can produce unpredictable responses. NVIDIA recommends that you filter these characters out of prompts before submitting the prompt to an LLM.
Deploying with KServe can require changing permissions for the cache directory. See the Serving models from local assets section for details.
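The Unicode filtering recommended above can be sketched in a few lines of Python. This is only an illustration: the helper name strip_tag_characters is hypothetical and not part of NIM, and the sketch removes exactly the range quoted in the note (0x0e0020 to 0x0e007f) and nothing else.

```python
import re

# Characters U+E0020 through U+E007F, the range named in the known issue.
_TAG_CHARS = re.compile("[\U000E0020-\U000E007F]")


def strip_tag_characters(prompt: str) -> str:
    """Return the prompt with every character in U+E0020..U+E007F removed."""
    return _TAG_CHARS.sub("", prompt)


# Example: one character from the affected range hidden inside a prompt.
cleaned = strip_tag_characters("Hello\U000E0041 world")
```

Apply the filter on the client side, before the prompt is placed in the request body.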
Release 1.3.0#
Summary#
This is an update of NIM.
New Language Models#
New Features#
Custom fine-tuned model support. See FT support for more details.
The introduction of tensorrt_llm-local_build profiles, which enable the use of the TensorRT-LLM runtime on GPUs without pre-built optimized engines. See the Model Profiles page for more details.
Caching of locally-built and fine-tuned engines to work seamlessly with the regular LLM NIM workflow.
Implemented key-value cache to speed up inference when the initial prompt is identical across multiple requests. Refer to KV Cache for details.
Users whose systems do not have pre-built optimized engines available should see substantial speedups over previous versions of NIM, but may experience slower start times on first deployment due to the local compilation process.
Known Issues#
-
Returns a 500 when setting logprobs=2, echo=true, and stream=false; it should return 200.
LoRA A10G TP8 for both vLLM and TRTLLM not supported due to insufficient memory.
The performance of vLLM LoRA on L40s TP8 is significantly suboptimal.
Deploying with KServe fails. As a workaround, try increasing the CPU memory to at least 77GB in the runtime YAML file.
There’s an incorrect warning regarding checksums when running the 1.3 NIM. For example: Profile 0462612f0f2de63b2d423bc3863030835c0fbdbc13b531868670cc416e030029 is not fully defined with checksums. It is safe to ignore this warning.
Buildable TRT-LLM BF16 TP4 LoRA profiles on A100 and H100 can fail due to insufficient host memory. You can work around this problem by setting NIM_LOW_MEMORY_MODE=1.
Llama 3.1 405B Instruct TRT-LLM BF16 TP16 buildable profile cannot be deployed on A100.
Mistral 7B Instruct V0.3 with optimized TRT-LLM profiles has lower performance compared to the OpenSource vLLM.
-
Does not support function calling and structured generation on vLLM profiles. See vLLM #9433 for details.
LoRA is not supported with the TRTLLM backend for MoE models.
vLLM LoRA profiles return an internal server error/500. Set NIM_MAX_LORA_RANK=256 to use LoRA with vLLM.
If you enable NIM_ENABLE_KV_CACHE_REUSE with the L40S FP8 TP4 Throughput profile, deployment fails.
Nemotron 4 340B Instruct 128K does not support buildable TRT-LLM profiles.
The container may crash when building local TensorRT LLM engines if there isn’t enough host memory. If that happens, try setting NIM_LOW_MEMORY_MODE=1.
Function calling and structured generation are not supported for pipeline parallelism greater than 1.
Locally-built fine-tuned models are not supported with FP8 profiles.
Logarithmic Probabilities (logprobs) support with echo:
The TRTLLM engine needs to be built explicitly with --gather_generation_logits.
Enabling this may impact model throughput and inter-token latency.
NIM_MODEL_NAME must be set to the generated model repository.
vGPU related issues:
trtllm_buildable profiles might encounter an Out of Memory (OOM) error on vGPU systems, which can be fixed by setting the NIM_LOW_MEMORY_MODE=1 flag.
When using vGPU systems with trtllm_buildable profiles, you might still encounter a broken connection error. For example, client_loop: send disconnect: Broken pipe.
The out-of-the-box (OOB) sequence length with tensorrt_llm-local_build is 8K. Use the NIM_MAX_MODEL_LEN environment variable to modify the sequence length within the range of values supported by a model.
The GET v1/metrics API is missing from the docs page (http://HOST-IP:8000/docs, where HOST-IP is the IP address of your host).
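Several workarounds in the list above come down to setting an environment variable on the NIM container. As a rough sketch, assuming the container is launched with docker run (whose -e flag sets an environment variable), the settings can be collected in one place and turned into arguments. The helper name docker_env_args is hypothetical; the variable names and values are the ones quoted in the known issues.

```python
def docker_env_args(settings: dict[str, str]) -> list[str]:
    """Turn {"NAME": "value"} into ["-e", "NAME=value", ...] for `docker run`."""
    args: list[str] = []
    for name, value in settings.items():
        args += ["-e", f"{name}={value}"]
    return args


# Workaround values taken from the known issues; enable only the ones you need.
workarounds = {
    "NIM_LOW_MEMORY_MODE": "1",   # host-memory failures when building engines
    "NIM_MAX_LORA_RANK": "256",   # vLLM LoRA profiles returning a 500
}

env_args = docker_env_args(workarounds)
```

The resulting list can be spliced into whatever launch command you already use; the image name and the remaining flags are left out because they depend on your deployment.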
Software requirements updated#
Release 1.3.0 is based on CUDA 12.6.1, which requires NVIDIA Driver release 560 or later. However, if you are running on a data center GPU (for example, A100), you can use NVIDIA driver release 470.57 (or later R470), 535.86 (or later R535), or 550.54 (or later R550).
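The driver rule above can be expressed as a small check. This is a sketch of one reading of that rule, not an official compatibility tool: the function names are hypothetical, and driver branches that the text does not list (for example, 545) are treated as unsupported.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Split a driver version such as '535.104.05' into integer parts."""
    return tuple(int(part) for part in text.split("."))


# Minimum (branch, minor) versions accepted on data center GPUs, per the text.
DATA_CENTER_MINIMUMS = {470: (470, 57), 535: (535, 86), 550: (550, 54)}


def driver_is_supported(version: str, data_center_gpu: bool) -> bool:
    """Return True if the driver version satisfies the Release 1.3.0 rule."""
    parsed = parse_version(version)
    if parsed[0] >= 560:
        return True  # release 560 or later works on any GPU
    if not data_center_gpu:
        return False
    minimum = DATA_CENTER_MINIMUMS.get(parsed[0])
    # Assumption: branches not named in the text are rejected.
    return minimum is not None and parsed[:2] >= minimum
```

For example, driver_is_supported("535.104.05", data_center_gpu=True) passes, while the same version with data_center_gpu=False does not.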
Release 1.2.3#
Summary#
This is an update of NIM.
New Language Models#
Known Issues#
Code Llama models:
FP8 profiles are not released due to accuracy degradations
LoRA is not supported
Llama 3.1 8B Instruct does not support LoRA on L40S with TRT-LLM.
Mistral NeMo Minitron 8B 8K Instruct:
Tool calling is not supported
LoRA is not supported
vLLM TP4 or TP8 profiles are not available.
Mixtral 8x7b Instruct v0.1 vLLM profiles do not support function calling and structured generation. See vLLM #9433.
Phi 3 Mini 4K Instruct models:
LoRA is not supported
Tool calling is not supported
Phind Code Llama 34B v2 Instruct:
LoRA is not supported
Tool calling is not supported
logprobs=2 is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models. Refer to the Support Matrix section for details.
NIM with vLLM backend may intermittently enter a state where the API returns a “Service in unhealthy” message. This is a known issue with vLLM (https://github.com/vllm-project/vllm/issues/5060). You must restart the NIM in this case.
Release 1.2.1#
Summary#
This is an update of NIM.
New Models#
Known Issues#
vLLM + LoRA profiles for long context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.
LoRA is not supported on Llama 3.1 8B Instruct on L40S with TRT-LLM.
logit_bias is not available for any model using the TRT-LLM backend.
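The long-context LoRA workaround above can be written as a small decision helper. The function and constant names are hypothetical; the two numbers are the limit quoted in the error message and the value suggested by the workaround.

```python
LORA_TOKEN_LIMIT = 65528     # limit quoted in the ValueError
WORKAROUND_MAX_LEN = 65525   # value suggested by the workaround


def recommended_max_model_len(model_max_len: int, lora_enabled: bool) -> int:
    """Return a value for NIM_MAX_MODEL_LEN that avoids the LoRA load failure."""
    if lora_enabled and model_max_len > LORA_TOKEN_LIMIT:
        return WORKAROUND_MAX_LEN
    return model_max_len
```

A model with a 131072-token context and LoRA enabled would be capped at 65525, while models at or below the limit, or without LoRA, keep their native length.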
Release 1.2.0#
Summary#
This is an update of NIM.
New Language Models#
Nemotron 4 340B Reward
For a list of all supported models refer to the Models topic.
New Features#
Support for Nemotron 4 340B Reward model. Refer to Reward Model and Model Card for details.
Add vGPU support by improving device selector. Refer to Support Matrix for vGPU details.
With UVM and an optimized engine available, the model runs on TRT-LLM.
Otherwise, the model runs on vLLM.
Add OpenTelemetry support for tracing and metrics in the API server. Refer to Configuration for details including NIM_ENABLE_OTEL, NIM_OTEL_TRACES_EXPORTER, NIM_OTEL_METRICS_EXPORTER, NIM_OTEL_EXPORTER_OTLP_ENDPOINT, and NIM_OTEL_SERVICE_NAME.
Enabled ECHO request in completion API to align with OpenAI specifications. Refer to NIM OpenAPI Schema for details.
Add logprob support for ECHO mode, which returns logprobs for the full context, including both prompt and output tokens.
Add FP8 engine support with FP16 lora. Refer to PEFT for details about lora usage.
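To illustrate the ECHO and logprobs features listed above, here is a sketch of a completion request body. It assumes the field names of the OpenAI completions API (model, prompt, max_tokens, echo, logprobs, stream); the helper name build_echo_request and the model name in the example are placeholders, so check the NIM OpenAPI Schema for the exact fields your deployment accepts.

```python
import json


def build_echo_request(model: str, prompt: str, max_tokens: int = 16) -> str:
    """Build a JSON body asking for the prompt to be echoed with logprobs."""
    body = {
        "model": model,            # placeholder; use the model your NIM serves
        "prompt": prompt,
        "max_tokens": max_tokens,
        "echo": True,              # return the prompt tokens with the output
        "logprobs": 1,             # request log probabilities per token
        "stream": False,
    }
    return json.dumps(body)


payload = build_echo_request("example-model", "The capital of France is")
```

With echo enabled, the returned logprobs cover both the prompt and the generated tokens, as described in the feature note.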
Fixed Issues#
Cache deployment fails for air-gapped system or read-only volume for multi-GPU vLLM profile
Known Issues#
NIM does not support Multi-instance GPU mode (MIG).
Nemotron4 models require the use of ‘slow’ tokenizers. ‘Fast’ tokenizers cause accuracy degradation.
LoRA is not supported for Llama 3.1 405B Instruct.
vLLM profiles are not supported for Llama 3.1 405B Instruct.
Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.
When repetition_penalty=2, the response time for larger models is greater. Use repetition_penalty=1 on larger models.
Llama 3.1 8B Instruct H100 and L40s LoRA profiles can hang with high (>2000) ISL values.
Release 1.1.2#
Summary#
Enable fallback for Llama 3.1 architectures to NIM model support list.
New Language Models#
Llama 3.1 405B Instruct
Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. Refer to Support Matrix for details.
New Features#
Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues#
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Throughput optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct
Cache deployment fails for air-gapped system or read-only volume for multi-GPU vLLM profile
Users deploying a cache into an air-gapped system or read-only volume and intending to use the multi-GPU vLLM profile must create the following JSON file from the system used to initially download and generate the cache:
echo '{
"0->0": false,
"0->1": true,
"1->0": true,
"1->1": false
}' > "$NIM_CACHE_PATH/vllm/cache/gpu_p2p_access_cache_for_0,1.json"
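The same file can also be written from Python, which avoids shell quoting mistakes. This is a sketch under stated assumptions: the function name write_p2p_cache is hypothetical, the mapping is copied from the two-GPU example above, and creating missing parent directories is a convenience added here rather than something the original command does.

```python
import json
from pathlib import Path

# Peer-to-peer access map for GPUs 0 and 1, copied from the example above.
P2P_ACCESS = {"0->0": False, "0->1": True, "1->0": True, "1->1": False}


def write_p2p_cache(nim_cache_path: str) -> Path:
    """Write the two-GPU P2P access cache file under the given cache path."""
    target = (
        Path(nim_cache_path) / "vllm" / "cache" / "gpu_p2p_access_cache_for_0,1.json"
    )
    target.parent.mkdir(parents=True, exist_ok=True)  # added convenience
    target.write_text(json.dumps(P2P_ACCESS, indent=4))
    return target
```

Call it with the value of NIM_CACHE_PATH on the system used to download and generate the cache, for example write_p2p_cache(os.environ["NIM_CACHE_PATH"]) after importing os.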
CUDA out of memory issue for Llama2 70b v1.0.3
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.
Llama 3.1 FP8 requires NVIDIA driver version >= 550
Release 1.1.1#
Summary#
Removed incompatible vllm profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues#
vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Release 1.1.0#
Summary#
This is an update of NIM.
New Language Models#
Llama 3.1 8B Base
Llama 3.1 8B Instruct
Llama 3.1 70B Instruct
New Features#
Chunked pre-fill
Experimental support for Llama Stack API
Changed the format of FP8 from PTQ to Meta-FP8 to improve accuracy
Known Issues#
vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.
NIM does not support Multi-instance GPU mode (MIG).
Release 1.0#
Release notes for Release 1.0 are located in the 1.0 documentation.