Release Notes#
Note that issues listed in the Known Issues sections remain valid until they are fixed in a subsequent release.
Release 1.2.3#
Summary#
This is the latest version of NIM.
New Language Models#
Known Issues#
Code Llama models:
FP8 profiles are not released due to accuracy degradation
LoRA is not supported
Llama 3.1 8B Instruct does not support LoRA on L40S with TRT-LLM.
Mistral NeMo Minitron 8B 8K Instruct:
Tool calling is not supported
LoRA is not supported
vLLM TP4 or TP8 profiles are not available.
Mixtral 8x7b Instruct v0.1 vLLM profiles do not support function calling and structured generation. See vLLM #9433.
Nemotron 4 340B Reward does not run on the vLLM platform and does not support LoRA.
Phi 3 Mini 4K Instruct models:
LoRA is not supported
Tool calling is not supported
Phind Code Llama 34B v2 Instruct:
LoRA is not supported
Tool calling is not supported
`logprobs=2` is only supported for TRT-LLM (optimized) configurations for Reward models; this option is supported for the vLLM (non-optimized) configurations for all models. Refer to the Support Matrix section for details.
vLLM + LoRA profiles for long-context models (`model_max_len > 65528`) will not load and fail with `ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled.` As a workaround, set `NIM_MAX_MODEL_LEN=65525` or lower; see the launch sketch after this list.
NIM with the vLLM backend may intermittently enter a state where the API returns a "Service is unhealthy" message. This is a known issue with vLLM (https://github.com/vllm-project/vllm/issues/5060). You must restart the NIM in this case.
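As a rough sketch of the `NIM_MAX_MODEL_LEN` workaround above, the variable can be passed to the container as an environment variable at startup. The image name, tag, and cache mount path below are illustrative placeholders, not a specific supported profile:

```bash
# Illustrative launch only: substitute your own image, tag, and cache path.
# NIM_MAX_MODEL_LEN caps the model context length so the LoRA CUDA kernel limit is not exceeded.
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN=65525 \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/example-long-context-model:latest   # placeholder image name
```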
Release 1.2.1#
Summary#
This is an update of NIM.
New Models#
Known Issues#
vLLM + LoRA profiles for long-context models (`model_max_len > 65528`) will not load and fail with `ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled.` As a workaround, set `NIM_MAX_MODEL_LEN=65525` or lower.
Nemotron 4 340B Reward does not run on the vLLM platform and does not support LoRA.
LoRA is not supported on Llama 3.1 8B Instruct on L40S with TRT-LLM.
`logit_bias` is not available for any model using the TRT-LLM backend.
Release 1.2.0#
Summary#
This is an update of NIM.
New Language Models#
Nemotron 4 340B Reward
For a list of all supported models refer to the Models topic.
New Features#
Support for Nemotron 4 340B Reward model. Refer to Reward Model and Model Card for details.
Add vGPU support by improving the device selector. Refer to Support Matrix for vGPU details.
With UVM and an optimized engine available, the model runs on TRT-LLM.
Otherwise, the model runs on vLLM.
Add OpenTelemetry support for tracing and metrics in the API server. Refer to Configuration for details, including `NIM_ENABLE_OTEL`, `OTEL_TRACES_EXPORTER`, `OTEL_METRICS_EXPORTER`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `OTEL_SERVICE_NAME`. A configuration sketch follows this list.
Enabled ECHO requests in the completion API to align with OpenAI specifications. Refer to the NIM OpenAPI Schema for details.
Add `logprobs` support for ECHO mode, which returns `logprobs` for the full context, including both prompt and output tokens. A request sketch follows this list.
Add FP8 engine support with FP16 `lora`. Refer to PEFT for details about `lora` usage.
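A hypothetical configuration sketch for the OpenTelemetry variables named above; the exporter values, collector endpoint, and service name are placeholders, and the exact values NIM accepts (for example for `NIM_ENABLE_OTEL`) are defined in the Configuration topic:

```bash
# Placeholder OpenTelemetry settings; adjust to your collector and naming conventions.
export NIM_ENABLE_OTEL=1                                   # enable OTel instrumentation in the API server (assumed value)
export OTEL_TRACES_EXPORTER=otlp                           # export traces over OTLP
export OTEL_METRICS_EXPORTER=otlp                          # export metrics over OTLP
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317   # placeholder collector endpoint
export OTEL_SERVICE_NAME=nim-llm                           # placeholder service name
```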
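The ECHO and `logprobs` features map to the standard OpenAI completions parameters. A minimal request sketch, assuming a NIM instance listening on localhost:8000 and a placeholder model name:

```bash
# echo=true returns the prompt tokens along with the completion;
# logprobs=1 attaches log probabilities to both prompt and output tokens.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "prompt": "The capital of France is",
        "max_tokens": 8,
        "echo": true,
        "logprobs": 1
      }'
```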
Fixed Issues#
Cache deployment fails for air-gapped systems or read-only volumes with the multi-GPU vLLM profile
Known Issues#
NIM does not support Multi-instance GPU mode (MIG).
Nemotron 4 models require the use of ‘slow’ tokenizers; ‘fast’ tokenizers cause accuracy degradation.
LoRA is not supported for Llama 3.1 405B Instruct.
vLLM profiles are not supported for Llama 3.1 405B Instruct.
Optimized engines (TRT-LLM) aren’t supported with NVIDIA vGPU. To use optimized engines, use GPU Passthrough.
When `repetition_penalty=2`, the response time for larger models is greater. Use `repetition_penalty=1` on larger models; see the request sketch after this list.
Llama 3.1 8B Instruct H100 and L40S LoRA profiles can hang with high (>2000) ISL values.
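A minimal request sketch for the `repetition_penalty` recommendation above, assuming the completions endpoint accepts `repetition_penalty` as a top-level sampling parameter (as vLLM-backed profiles do) and using a placeholder large-model name:

```bash
# Keep repetition_penalty at 1 on larger models to avoid the slowdown noted above.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-405b-instruct",
        "prompt": "Summarize the benefits of quantization in one paragraph.",
        "max_tokens": 128,
        "repetition_penalty": 1
      }'
```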
Release 1.1.2#
Summary#
Enable fallback for Llama 3.1 architectures to the NIM model support list.
New Language Models#
Llama 3.1 405B Instruct
Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. Refer to Support Matrix for details.
New Features#
Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues#
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Throughput optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct
Cache deployment fails for air-gapped systems or read-only volumes with the multi-GPU vLLM profile
Users deploying a cache into an air-gapped system or read-only volume and intending to use the multi-GPU vLLM profile must create the following JSON file from the system used to initially download and generate the cache:
```bash
echo '{
  "0->0": false,
  "0->1": true,
  "1->0": true,
  "1->1": false
}' > $NIM_CACHE_PATH/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
```
CUDA out of memory issue for Llama 2 70B v1.0.3
The `vllm-fp16-tp2` profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.
Llama 3.1 FP8 requires NVIDIA driver version >= 550
Release 1.1.1#
Summary#
Removed incompatible vLLM profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues#
vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Release 1.1.0#
Summary#
This is an update of NIM.
New Language Models#
Llama 3.1 8B Base
Llama 3.1 8B Instruct
Llama 3.1 70B Instruct
New Features#
Chunked pre-fill
Experimental support for Llama Stack API
Changed the format of FP8 from PTQ to Meta-FP8 to improve accuracy
Known Issues#
vLLM profiles for Llama 3.1 models will fail with `ValueError: Unknown RoPE scaling type extended`.
NIM does not support Multi-instance GPU mode (MIG).
Release 1.0#
Release notes for Release 1.0 are located in the 1.0 documentation.