Release Notes
Summary
This is the latest version of NIM.
Language Models
Llama 3 Swallow 70B Instruct V0.1
Llama 3 Taiwan 70B Instruct
Note: For the H100 TP8 FP8 latency profile, we have intermittently observed higher time to first token (TTFT) values at low concurrency.
Llama 3.1 405B Instruct
Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. See the Support Matrix for more details.
Mistral-NeMo-12B-Instruct
Nemotron 4 340B Instruct
New Features
Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
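For reference, the profiles available in a container, including any vLLM fallback profiles, can be listed and then pinned at startup. The sketch below is illustrative only: it assumes the standard NIM container image for Llama 3.1 8B Instruct, the list-model-profiles utility, and the NIM_MODEL_PROFILE environment variable, and uses a placeholder for the profile ID.

# List the profiles bundled with this container on the current host
docker run --rm --gpus all -e NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest list-model-profiles

# Pin a specific profile (e.g., a vLLM fallback) at startup; the ID below is a placeholder
docker run --rm --gpus all -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id-from-list-output> \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest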
Known Issues
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Throughput-optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct
Cache deployment fails on air-gapped systems or read-only volumes with multi-GPU vLLM profiles
Users deploying a cache into an air-gapped system or onto a read-only volume who intend to use a multi-GPU vLLM profile must create the following JSON file on the system used to initially download and generate the cache:
echo '{
  "0->0": false,
  "0->1": true,
  "1->0": true,
  "1->1": false
}' > $NIM_CACHE_PATH/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
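Once the cache, including this file, has been transferred to the air-gapped host, it can be mounted into the container. A minimal sketch, assuming the cache is mounted read-only at the container's default cache location (/opt/nim/.cache) and using placeholders for the image and tag:

# Offline deployment: mount the pre-populated cache read-only
docker run --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache:ro" \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>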
CUDA out of memory issue for Llama 2 70B v1.0.3
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.
Llama 3.1 FP8 requires NVIDIA driver version >= 550
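To confirm the installed driver on a host before selecting an FP8 profile, a standard nvidia-smi query can be used:

# Prints the driver version per GPU; it must be 550 or later for Llama 3.1 FP8 profiles
nvidia-smi --query-gpu=driver_version --format=csv,noheader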
Summary
Removed incompatible vLLM profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues
vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Summary
This is an update of NIM.
Language Models
Llama 3.1 8B Base
Llama 3.1 8B Instruct
Llama 3.1 70B Instruct
New Features
Chunked pre-fill
Experimental support for Llama Stack API
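Independent of the experimental Llama Stack API, a running instance can be exercised through the OpenAI-compatible endpoint. A minimal smoke test, assuming the service listens on localhost:8000 and serves meta/llama-3.1-8b-instruct (adjust the host, port, and model name for your deployment):

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'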
Known Issues
vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.
NIM does not support Multi-Instance GPU (MIG) mode.
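To verify that MIG is disabled on a host before deployment, a standard nvidia-smi query can be used; Disabled or N/A in the output indicates MIG is not active:

# Check MIG mode per GPU; NIM requires MIG to be disabled
nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader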
Release notes for Release 1.0 are located in the 1.0 documentation.