Release Notes
Summary
This is the latest version of NIM.
Known Issues
vLLM + LoRA profiles for long-context models (model_max_len > 65528) will not load, resulting in ValueError: Due to limitations of the custom LoRA CUDA kernel, max_num_batched_tokens must be <= 65528 when LoRA is enabled. As a workaround, you can set NIM_MAX_MODEL_LEN=65525 or lower.
Nemotron 4 340B Reward does not run on the vLLM platform and does not support LoRA.
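The LoRA workaround above amounts to capping the context length whenever a vLLM + LoRA profile is used with a long-context model. The following is a minimal sketch of that check, using only the two numbers quoted in the note. The helper name `lora_env_override` is hypothetical and not part of NIM.

```python
from typing import Dict

# Limit quoted in the ValueError text in the known issue.
LORA_MAX_BATCHED_TOKENS = 65528
# Value suggested by the workaround ("65525 or lower").
WORKAROUND_MAX_MODEL_LEN = 65525


def lora_env_override(model_max_len: int) -> Dict[str, str]:
    """Return the environment override needed for a vLLM + LoRA profile.

    Hypothetical helper: an empty dict means the model is already within
    the limit and no override is needed.
    """
    if model_max_len <= LORA_MAX_BATCHED_TOKENS:
        return {}
    return {"NIM_MAX_MODEL_LEN": str(WORKAROUND_MAX_MODEL_LEN)}


if __name__ == "__main__":
    # A 128K-context model exceeds the limit, so the override is returned.
    print(lora_env_override(131072))
```

The returned mapping can be merged into whatever environment you pass to the container. How the variable reaches the container depends on your deployment and is not shown here.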
Summary
This is an update of NIM.
New Language Models
Nemotron 4 340B Reward
For a list of all supported models refer to the Models topic.
New Features
Support for Nemotron 4 340B Reward model. Refer to Reward Model and Model Card for details.
Add vGPU support by improving device selector. Refer to Support Matrix for vGPU details.
With UVM and an optimized engine available, the model runs on TRT-LLM.
Otherwise, the model runs on vLLM.
Add OpenTelemetry support for tracing and metrics in the API server. Refer to Configuration for details including NIM_ENABLE_OTEL, OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_EXPORTER_OTLP_ENDPOINT, and OTEL_SERVICE_NAME.
Enable ECHO requests in the completion API to align with OpenAI specifications. Refer to NIM OpenAPI Schema for details.
Add logprob support for ECHO mode, which returns logprobs for the full context, including both prompt and output tokens.
Add FP8 engine support with FP16 LoRA. Refer to PEFT for details about LoRA usage.
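To illustrate the ECHO and logprobs features listed above, here is a sketch of a completion request body. The `prompt`, `max_tokens`, `echo`, and `logprobs` fields follow the OpenAI completions specification that the note refers to. The model name is a placeholder and the helper `build_echo_request` is hypothetical. Consult the NIM OpenAPI Schema for the fields your deployment accepts.

```python
import json
from typing import Any, Dict


def build_echo_request(model: str, prompt: str, max_tokens: int = 16,
                       logprobs: int = 1) -> Dict[str, Any]:
    """Build a completion request body with ECHO mode and logprobs enabled."""
    return {
        "model": model,          # placeholder; use a model your NIM serves
        "prompt": prompt,
        "max_tokens": max_tokens,
        "echo": True,            # return the prompt along with the completion
        "logprobs": logprobs,    # request log probabilities per token
    }


if __name__ == "__main__":
    body = build_echo_request("example-model", "The capital of France is")
    print(json.dumps(body, indent=2))
```

With both fields set, the response is expected to carry log probabilities for the prompt tokens as well as the generated tokens, as the feature note describes.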
Fixed Issues
Cache deployment fails for air-gapped system or read-only volume for multi-GPU vLLM profile
Known Issues
NIM does not support Multi-instance GPU mode (MIG).
Nemotron 4 models require the use of ‘slow’ tokenizers. ‘Fast’ tokenizers cause accuracy degradation.
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Summary
Enables vLLM fallback for Llama 3.1 architectures in the NIM model support list.
New Language Models
Llama 3.1 405B Instruct
Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. Refer to Support Matrix for details.
New Features
Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Throughput optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct
Cache deployment fails for air-gapped system or read-only volume for multi-GPU vLLM profile
Users deploying a cache into an air-gapped system or read-only volume and intending to use the multi-GPU vLLM profile must create the following JSON file from the system used to initially download and generate the cache:
echo '{
"0->0": false,
"0->1": true,
"1->0": true,
"1->1": false
}' > "$NIM_CACHE_PATH/vllm/vllm/gpu_p2p_access_cache_for_0,1.json"
CUDA out of memory issue for Llama2 70b v1.0.3
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.
Llama 3.1 FP8 requires NVIDIA driver version >= 550
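For the air-gapped cache issue described in this list, the P2P access file can also be written from a script. The sketch below reproduces the two-GPU file from the note. The helper names are hypothetical. Extending the pattern to other device sets rests on two assumptions not stated in the note: that every pair of distinct GPUs has peer-to-peer access, and that the file name follows the same `gpu_p2p_access_cache_for_<ids>.json` pattern.

```python
import json
import os
from pathlib import Path
from typing import Dict, Sequence


def p2p_cache_entries(device_ids: Sequence[int]) -> Dict[str, bool]:
    """Map "src->dst" to True for distinct devices and False for the same device.

    Assumption: every pair of distinct GPUs has peer-to-peer access, as in
    the two-GPU file shown in the release note.
    """
    return {f"{src}->{dst}": src != dst
            for src in device_ids for dst in device_ids}


def write_p2p_cache(cache_root: str, device_ids: Sequence[int] = (0, 1)) -> Path:
    """Write the cache file under <cache_root>/vllm/vllm and return its path."""
    ids = ",".join(str(i) for i in device_ids)
    target = (Path(cache_root) / "vllm" / "vllm"
              / f"gpu_p2p_access_cache_for_{ids}.json")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(p2p_cache_entries(device_ids), indent=4))
    return target


if __name__ == "__main__":
    # NIM_CACHE_PATH is the variable used in the release note's command.
    print(write_p2p_cache(os.environ["NIM_CACHE_PATH"]))
```

As the note says, run this on the system used to initially download and generate the cache, before moving the cache to the air-gapped system or read-only volume.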
Summary
Removed incompatible vLLM profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues
vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Summary
This is an update of NIM.
New Language Models
Llama 3.1 8B Base
Llama 3.1 8B Instruct
Llama 3.1 70B Instruct
New Features
Chunked pre-fill
Experimental support for Llama Stack API
Known Issues
vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.
NIM does not support Multi-instance GPU mode (MIG).
Release notes for Release 1.0 are located in the 1.0 documentation.