Release Notes
Summary
This is the latest version of NIM.
Language Models
Llama 3 Swallow 70B Instruct V0.1
Llama 3 Taiwan 70B Instruct
Note: For the H100 TP8 FP8 latency profile, we have intermittently observed higher time to first token (TTFT) values at low concurrency.
Llama 3.1 405B Instruct
Note: Due to the large size of this model, it is only supported on a subset of GPUs and optimization targets. See the Support Matrix for more details.
Mistral-NeMo-12B-Instruct
Nemotron 4 340B Instruct
New Features
Added support for vLLM fallback profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
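For reference, the profiles available in a container, including any vLLM fallback profiles, can be listed and then pinned at startup. The sketch below is illustrative only: it assumes the standard NIM container image for Llama 3.1 8B Instruct, the list-model-profiles utility, and the NIM_MODEL_PROFILE environment variable, and uses a placeholder for the profile ID.

# List the profiles bundled with this container on the current host
docker run --rm --gpus all -e NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest list-model-profiles

# Pin a specific profile (e.g., a vLLM fallback) at startup; the ID below is a placeholder
docker run --rm --gpus all -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id-from-list-output> \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest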
Known Issues
LoRA is not supported for Llama 3.1 405B Instruct
vLLM profiles are not supported for Llama 3.1 405B Instruct
Throughput-optimized profiles are not supported on A100 FP16 and H100 FP16 for Llama 3.1 405B Instruct
Cache deployment fails on air-gapped systems or read-only volumes with multi-GPU vLLM profiles
Users deploying a cache into an air-gapped system or onto a read-only volume who intend to use a multi-GPU vLLM profile must create the following JSON file on the system used to initially download and generate the cache:
echo '{
  "0->0": false,
  "0->1": true,
  "1->0": true,
  "1->1": false
}' > $NIM_CACHE_PATH/vllm/vllm/gpu_p2p_access_cache_for_0,1.json
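Once the cache, including this file, has been transferred to the air-gapped host, it can be mounted into the container. A minimal sketch, assuming the cache is mounted read-only at the container's default cache location (/opt/nim/.cache) and using placeholders for the image and tag:

# Offline deployment: mount the pre-populated cache read-only
docker run --rm --gpus all \
  -v "$NIM_CACHE_PATH:/opt/nim/.cache:ro" \
  -p 8000:8000 \
  nvcr.io/nim/<org>/<model>:<tag>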
CUDA out of memory issue for Llama 2 70B v1.0.3
The vllm-fp16-tp2 profile has been validated and is known to work on H100 x 2 and A100 x 2 configurations. Other types of GPUs might encounter a “CUDA out of memory” issue.
Llama 3.1 FP8 requires NVIDIA driver version >= 550
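To confirm the installed driver on a host before selecting an FP8 profile, a standard nvidia-smi query can be used:

# Prints the driver version per GPU; it must be 550 or later for Llama 3.1 FP8 profiles
nvidia-smi --query-gpu=driver_version --format=csv,noheader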
Summary
Removed incompatible vLLM profiles for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Known Issues
vLLM profiles are not supported for Llama 3.1 8B Base, Llama 3.1 8B Instruct, and Llama 3.1 70B Instruct
Summary
This is an update of NIM.
Language Models
Llama 3.1 8B Base
Llama 3.1 8B Instruct
Llama 3.1 70B Instruct
New Features
Chunked pre-fill
Experimental support for Llama Stack API
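Independent of the experimental Llama Stack API, a running instance can be exercised through the OpenAI-compatible endpoint. A minimal smoke test, assuming the service listens on localhost:8000 and serves meta/llama-3.1-8b-instruct (adjust the host, port, and model name for your deployment):

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'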
Known Issues
vLLM profiles for Llama 3.1 models will fail with ValueError: Unknown RoPE scaling type extended.
NIM does not support Multi-Instance GPU (MIG) mode.
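To verify that MIG is disabled on a host before deployment, a standard nvidia-smi query can be used; Disabled or N/A in the output indicates MIG is not active:

# Check MIG mode per GPU; NIM requires MIG to be disabled
nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader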
Release notes for Release 1.0 are located in the 1.0 documentation.